Queued Spinlocks in the WRK
A few days ago, we had a discussion about the advantages of queued spinlocks over normal spinlocks. The biggest advantage, in our opinion, is that queued spinlocks guarantee FIFO ordering among competing processors, while normal spinlocks don't. In this article, we show the implementation of queued spinlocks in the WRK. We present the source code of the 64-bit version for two reasons: first, the 64-bit version contains the implementation in plain C rather than in assembly language, and second, in the 32-bit version, queued spinlocks are implemented in the HAL, which is not available as source code. The rationale behind the implementation remains the same, however.
Queued spinlocks come in two flavors in the WRK: numbered queued spinlocks and in-stack queued spinlocks. In this article, we only cover the latter, as they are part of the exposed kernel API and can also be used by driver developers.
The basic idea of a queued spinlock is that all competing CPUs or cores form a queue, and upon release the lock is handed to the first core in that queue. Queued spinlocks use the following fundamental data structures:
typedef struct _KSPIN_LOCK_QUEUE {
    struct _KSPIN_LOCK_QUEUE * volatile Next;
    PKSPIN_LOCK volatile Lock;
} KSPIN_LOCK_QUEUE, *PKSPIN_LOCK_QUEUE;

typedef struct _KLOCK_QUEUE_HANDLE {
    KSPIN_LOCK_QUEUE LockQueue;
    KIRQL OldIrql;
} KLOCK_QUEUE_HANDLE, *PKLOCK_QUEUE_HANDLE;
The KSPIN_LOCK_QUEUE data structure is defined in
      the public\inc\sdk\ntkeapi.h file, so it comes with
      every Windows Driver Development Kit. We will come to the meaning of
      those fields a bit later.
The WRK and later Windows versions expose the following API for in-stack queued spinlocks:
KeAcquireInStackQueuedSpinLock
KeAcquireInStackQueuedSpinLockAtDpcLevel
KeReleaseInStackQueuedSpinLock
KeReleaseInStackQueuedSpinLockFromDpcLevel
In the XxxDpcLevel functions, the WRK does not change the IRQL. So, if your driver or kernel code is already executing at DPC level or higher, you may want to use these APIs. Otherwise, the WRK first raises the IRQL to DPC level on acquisition and lowers it to the previous level upon release. As raising and lowering the IRQL is not of interest here, we only cover the XxxDpcLevel functions.
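Before looking at the internals, here is a minimal usage sketch based on the documented KeAcquireInStackQueuedSpinLock/KeReleaseInStackQueuedSpinLock pair. The global spinlock, the counter, and the function name are made up for illustration; only the API calls themselves come from the WDK:

#include <ntddk.h>

KSPIN_LOCK g_Lock;          // initialized once via KeInitializeSpinLock(&g_Lock)
ULONG      g_SharedCounter; // shared data protected by the queued spinlock

VOID IncrementSharedCounter (VOID)
{
    // Lives on the current thread's stack; no extra allocation is needed.
    KLOCK_QUEUE_HANDLE LockHandle;

    // Raises the IRQL to DPC level and enqueues this processor if the lock
    // is already held by another CPU.
    KeAcquireInStackQueuedSpinLock(&g_Lock, &LockHandle);

    g_SharedCounter += 1;

    // Hands the lock to the next waiter (if any) and restores the IRQL
    // saved in LockHandle.
    KeReleaseInStackQueuedSpinLock(&LockHandle);
}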
Acquiring a Queued Spinlock
The signature for acquiring a queued spinlock is as follows:
VOID
KeAcquireInStackQueuedSpinLockAtDpcLevel (
    __inout PKSPIN_LOCK SpinLock,
    __out PKLOCK_QUEUE_HANDLE LockHandle
    )
The first parameter is a pointer to the spinlock that should be acquired while the second parameter is a pointer to the lock queue entry that will be used to enqueue the CPU if necessary. The second parameter may and should point to a stack location of the current thread.
First, the KLOCK_QUEUE_HANDLE data structure is initialized (base\ntos\ke\amd64\queuelock.c:729):
{
    //
    // Acquire the specified in stack queued spin lock at the current
    // IRQL.
    //

#if !defined(NT_UP)

    LockHandle->LockQueue.Lock = SpinLock;
    LockHandle->LockQueue.Next = NULL;

#else

    UNREFERENCED_PARAMETER(SpinLock);

#endif

    KxAcquireQueuedSpinLock(&LockHandle->LockQueue, SpinLock);
    return;
}
The actual acquisition is done in
      KxAcquireQueuedSpinLock, an internal helper function.
      It should be clear that on uni-processor machines spinlocks are
      unnecessary. If the IRQL is raised to DPC level, no dispatching
      takes place, which means the currently running thread gets the lock
      and runs until completion. That's the reason for the
      #ifdefs in the above and following code fragments. The
      implementation of KxAcquireQueuedSpinLock
      (base\ntos\ke\amd64\queuelock.c:114) is as follows:
__forceinline
VOID
KxAcquireQueuedSpinLock (
    __inout PKSPIN_LOCK_QUEUE LockQueue,
    __inout PKSPIN_LOCK SpinLock
    )
{
    //
    // [...]
    //

#if !defined(NT_UP)

    PKSPIN_LOCK_QUEUE TailQueue;

    TailQueue = InterlockedExchangePointer((PVOID *)SpinLock, LockQueue);
    if (TailQueue != NULL) {
        KxWaitForLockOwnerShip(LockQueue, TailQueue);
    }

#else

    UNREFERENCED_PARAMETER(LockQueue);
    UNREFERENCED_PARAMETER(SpinLock);

#endif

    return;
}
First, the value of SpinLock is atomically exchanged with the address of the lock queue entry (PKSPIN_LOCK_QUEUE), and the result is stored in TailQueue. If TailQueue is NULL, the lock was free and the caller is the first to acquire it. In that case we are done.
If TailQueue is not NULL, it holds the address of our predecessor in the waiter queue. This does not necessarily mean that we are the last in the queue, but we have at least one predecessor. In any case, we must now wait for the lock to be released. This is done in the function KxWaitForLockOwnerShip (base\ntos\ke\amd64\queuelock.c:62):
DECLSPEC_NOINLINE
ULONG64
KxWaitForLockOwnerShip (
    __inout PKSPIN_LOCK_QUEUE LockQueue,
    __inout PKSPIN_LOCK_QUEUE TailQueue
    )
{
    ULONG64 SpinCount;

    //
    // [...]
    //

    *((ULONG64 volatile *)&LockQueue->Lock) |= LOCK_QUEUE_WAIT;
    TailQueue->Next = LockQueue;

    //
    // [...]
    //

    SpinCount = 0;
    do {
        KeYieldProcessor();
    } while ((*((ULONG64 volatile *)&LockQueue->Lock) & LOCK_QUEUE_WAIT) != 0);

    KeMemoryBarrier();
    return SpinCount;
}
What happens here is that we first set the LOCK_QUEUE_WAIT bit in the Lock field of our own KSPIN_LOCK_QUEUE structure. Then, we update the Next field in our predecessor's queue entry. This is necessary to inform the preceding CPU that someone else is waiting. Finally, we spin on our Lock field until the LOCK_QUEUE_WAIT bit is cleared. Once the bit is cleared, the spinlock has been acquired. Please note that the actual spinning is done on the processor's own, stack-allocated queue entry. This is in contrast to normal spinlocks, where all competing CPUs spin on the same global variable, which may decrease performance due to cache coherency traffic.
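For comparison, the following is a simplified sketch of such a conventional spinlock; it is not taken from the WRK, and the names SIMPLE_SPIN_LOCK, SimpleAcquire, and SimpleRelease are made up. Note how every waiter reads and writes the same shared lock word:

typedef volatile LONG SIMPLE_SPIN_LOCK;

VOID SimpleAcquire (SIMPLE_SPIN_LOCK *Lock)
{
    // Try to flip the shared lock word from 0 (free) to 1 (held). On
    // failure, spin-read the same word until it looks free and retry.
    // All waiting CPUs contend on the very same cache line here.
    while (InterlockedCompareExchange((LONG volatile *)Lock, 1, 0) != 0) {
        while (*Lock != 0) {
            KeYieldProcessor();
        }
    }
}

VOID SimpleRelease (SIMPLE_SPIN_LOCK *Lock)
{
    KeMemoryBarrier();  // make protected writes visible before freeing the lock
    *Lock = 0;
}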
Interestingly, the SpinCount variable that should return the number of spin loops is defined, initialized, and returned, but it is never updated. Perhaps this was done for performance reasons, or counting is only enabled in debug builds of the kernel.
Releasing a Queued Spinlock
When KeReleaseInStackQueuedSpinLockFromDpcLevel is
      called, it basically redirects control to
      KxReleaseQueuedSpinLock
      (base\ntos\ke\amd64\queuelock.c:227):
__forceinline
VOID
KxReleaseQueuedSpinLock (
    __inout PKSPIN_LOCK_QUEUE LockQueue
    )
{
    //
    // [...]
    //

#if !defined(NT_UP)

    PKSPIN_LOCK_QUEUE NextQueue;

    NextQueue = ReadForWriteAccess(&LockQueue->Next);
    if (NextQueue == NULL) {
        if (InterlockedCompareExchangePointer((PVOID *)LockQueue->Lock,
                                              NULL,
                                              LockQueue) == LockQueue) {
            return;
        }

        NextQueue = KxWaitForLockChainValid(LockQueue);
    }

    ASSERT(((ULONG64)NextQueue->Lock & LOCK_QUEUE_WAIT) != 0);

    InterlockedXor64((LONG64 volatile *)&NextQueue->Lock, LOCK_QUEUE_WAIT);
    LockQueue->Next = NULL;

#else

    UNREFERENCED_PARAMETER(LockQueue);

#endif

    return;
}
What happens here is that we first determine whether any processors are waiting for the lock. To that end, the Next field is examined. If it is NULL, we are the last (and only) core in the queue and can free the lock. The lock is released by setting it to NULL. However, we must do this with an atomic compare-and-exchange operation to make sure that no other processor has updated the spinlock in the meantime. InterlockedCompareExchangePointer sets the data that LockQueue->Lock points to to NULL if and only if it still contains the address of LockQueue. If it succeeds, it returns the LockQueue address. In that case, the lock has been released.
Otherwise, it returns the current value of the spinlock. Unfortunately, that address does not necessarily belong to the next processor to acquire the lock, as multiple acquisition attempts may have been made in the meantime and the spinlock only contains the address of the last (tail) entry.
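To make the compare-exchange semantics explicit, here is a non-atomic pseudo-equivalent of InterlockedCompareExchangePointer, written purely for illustration; the real intrinsic performs these steps as a single indivisible operation:

PVOID PseudoCompareExchangePointer (PVOID *Destination, PVOID Exchange, PVOID Comparand)
{
    PVOID Old = *Destination;   // current tail stored in the spinlock

    if (Old == Comparand) {
        // Our own queue entry is still the tail: write NULL and thereby
        // release the lock.
        *Destination = Exchange;
    }

    // The caller compares the return value with LockQueue; any other value
    // means another processor has enqueued itself in the meantime.
    return Old;
}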
To retrieve the actual next processor, our CPU must spin-wait for the other processor to finish its initialization. The initialization is complete when the succeeding CPU has set the Next field in our KSPIN_LOCK_QUEUE entry. The busy-wait is performed in the KxWaitForLockChainValid (base\ntos\ke\amd64\queuelock.c:24) function:
DECLSPEC_NOINLINE
PKSPIN_LOCK_QUEUE
KxWaitForLockChainValid (
    __inout PKSPIN_LOCK_QUEUE LockQueue
    )
{
    PKSPIN_LOCK_QUEUE NextQueue;

    // [...]

    do {
        KeYieldProcessor();
    } while ((NextQueue = LockQueue->Next) == NULL);

    return NextQueue;
}
It spins until the LockQueue->Next entry is no longer NULL. At that point, our successor in the queue has finished its initialization and is spinning on the LOCK_QUEUE_WAIT bit.
To inform the next CPU that the spinlock has been released, InterlockedXor64 is called to clear the LOCK_QUEUE_WAIT bit in the NextQueue->Lock field. Once the bit is cleared, ownership of the lock passes to the waiting processor, and after cleaning up the KSPIN_LOCK_QUEUE structure, the function returns.
A nice feature of this implementation is that if the user of the queued spinlock API allocates the storage for the KLOCK_QUEUE_HANDLE structure on the stack, the memory management of the lock queue entries comes for free.