Queued Spinlocks in the WRK
A few days ago, we had a discussion about the advantages of queued spinlocks over normal spinlocks. The biggest advantage, in our opinion, is that queued spinlocks guarantee FIFO ordering among competing processors, while normal spinlocks don't. In this article, we show the implementation of queued spinlocks in the WRK. We present the source code of the 64-bit version for two reasons: first, the 64-bit version contains the implementation in plain C rather than in assembly language, and second, in the 32-bit version, queued spinlocks are implemented in the HAL, which is not available as source code. The rationale behind the implementation, however, remains the same.
Queued spinlocks come in two flavors in the WRK: numbered queued spinlocks and in-stack queued spinlocks. In this article, we will only cover the latter, as those are part of the exposed kernel API and can be used by driver developers as well.
The basic idea of a queued spinlock is that all competing CPUs or cores form a queue and, upon release, the first CPU in the queue gets the lock. Queued spinlocks use the following fundamental data structures:
typedef struct _KSPIN_LOCK_QUEUE {
    struct _KSPIN_LOCK_QUEUE * volatile Next;
    PKSPIN_LOCK volatile Lock;
} KSPIN_LOCK_QUEUE, *PKSPIN_LOCK_QUEUE;

typedef struct _KLOCK_QUEUE_HANDLE {
    KSPIN_LOCK_QUEUE LockQueue;
    KIRQL OldIrql;
} KLOCK_QUEUE_HANDLE, *PKLOCK_QUEUE_HANDLE;
The KSPIN_LOCK_QUEUE data structure is defined in the public\inc\sdk\ntkeapi.h file, so it comes with every Windows Driver Development Kit. We will come to the meaning of those fields a bit later.
The WRK and later Windows versions expose the following API for in-stack queued spinlocks:
KeAcquireInStackQueuedSpinLock
KeAcquireInStackQueuedSpinLockAtDpcLevel
KeReleaseInStackQueuedSpinLock
KeReleaseInStackQueuedSpinLockFromDpcLevel
In the XxxDpcLevel functions, the WRK will not change the IRQL. So, if your driver or kernel code is already executing at an IRQL of DPC level or higher, you may want to use these APIs. Otherwise, the WRK first raises the IRQL to DPC level on acquisition and lowers it to the previous level upon release. As raising and lowering the IRQL is not of interest here, we only cover the XxxDpcLevel functions.
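To illustrate how the API is typically used, here is a minimal, hypothetical sketch of a driver-style critical section. The names MY_DEVICE_CONTEXT, DeviceLock, Counter, and IncrementCounter are made up for this example; only the Ke* routines are the actual kernel API:

#include <wdm.h>

typedef struct _MY_DEVICE_CONTEXT {     // hypothetical device context
    KSPIN_LOCK DeviceLock;              // initialized once with KeInitializeSpinLock
    ULONG Counter;                      // data protected by the lock
} MY_DEVICE_CONTEXT, *PMY_DEVICE_CONTEXT;

VOID
IncrementCounter (
    __inout PMY_DEVICE_CONTEXT Context
    )
{
    KLOCK_QUEUE_HANDLE LockHandle;      // lives on this thread's stack

    //
    // Raises the IRQL to DPC level and enqueues this CPU if the
    // lock is already held.
    //
    KeAcquireInStackQueuedSpinLock(&Context->DeviceLock, &LockHandle);

    Context->Counter += 1;              // critical section

    //
    // Hands the lock to the next queued CPU (if any) and restores
    // the previous IRQL saved in LockHandle.
    //
    KeReleaseInStackQueuedSpinLock(&LockHandle);
}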
Acquiring a Queued Spinlock
The signature for acquiring a queued spinlock is as follows:
VOID
KeAcquireInStackQueuedSpinLockAtDpcLevel (
    __inout PKSPIN_LOCK SpinLock,
    __out PKLOCK_QUEUE_HANDLE LockHandle
    )
The first parameter is a pointer to the spinlock that should be acquired, while the second parameter is a pointer to the lock queue entry that will be used to enqueue the CPU if necessary. The second parameter may, and should, point to a stack location of the current thread.
First, the KLOCK_QUEUE_HANDLE data structure is initialized (base\ntos\ke\amd64\queuelock.c:729):
{
    //
    // Acquire the specified in stack queued spin lock at the current
    // IRQL.
    //

#if !defined(NT_UP)
    LockHandle->LockQueue.Lock = SpinLock;
    LockHandle->LockQueue.Next = NULL;
#else
    UNREFERENCED_PARAMETER(SpinLock);
#endif

    KxAcquireQueuedSpinLock(&LockHandle->LockQueue, SpinLock);
    return;
}
The actual acquisition is done in KxAcquireQueuedSpinLock, an internal helper function.
It should be clear that on uniprocessor machines spinlocks are unnecessary: if the IRQL is raised to DPC level, no dispatching takes place, which means the currently running thread gets the lock and runs until completion. That's the reason for the #ifdefs in the above and following code fragments. The implementation of KxAcquireQueuedSpinLock (base\ntos\ke\amd64\queuelock.c:114) is as follows:
__forceinline
VOID
KxAcquireQueuedSpinLock (
    __inout PKSPIN_LOCK_QUEUE LockQueue,
    __inout PKSPIN_LOCK SpinLock
    )
{
    //
    // [...]
    //

#if !defined(NT_UP)
    PKSPIN_LOCK_QUEUE TailQueue;

    TailQueue = InterlockedExchangePointer((PVOID *)SpinLock, LockQueue);
    if (TailQueue != NULL) {
        KxWaitForLockOwnerShip(LockQueue, TailQueue);
    }
#else
    UNREFERENCED_PARAMETER(LockQueue);
    UNREFERENCED_PARAMETER(SpinLock);
#endif

    return;
}
First, the value of SpinLock is atomically exchanged with the address of the lock queue entry (PKSPIN_LOCK_QUEUE), and the result is stored in TailQueue. If TailQueue is NULL, the lock was free and the caller is the first to attempt the acquisition. In that case, we are done.
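As a worked example (our own illustration, not WRK code), consider three CPUs A, B, and C racing for a free lock, where &A, &B, and &C denote the addresses of their respective KSPIN_LOCK_QUEUE entries:

// Initially:  *SpinLock == NULL                       (lock is free)
// CPU A:      TailQueue = XCHG(SpinLock, &A) == NULL  -> A owns the lock
// CPU B:      TailQueue = XCHG(SpinLock, &B) == &A    -> B queues behind A
// CPU C:      TailQueue = XCHG(SpinLock, &C) == &B    -> C queues behind B
//
// The spinlock word always points to the tail of the queue (here &C),
// while each waiter learns its predecessor from the exchange result.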
If TailQueue is not NULL, it holds the address of our predecessor in the waiter queue. That does not necessarily mean that we are the last in the queue, but we have at least one predecessor. However, we must now wait for the lock to be released. This is done in the function KxWaitForLockOwnerShip (base\ntos\ke\amd64\queuelock.c:62):
DECLSPEC_NOINLINE
ULONG64
KxWaitForLockOwnerShip (
    __inout PKSPIN_LOCK_QUEUE LockQueue,
    __inout PKSPIN_LOCK_QUEUE TailQueue
    )
{
    ULONG64 SpinCount;

    //
    // [...]
    //

    *((ULONG64 volatile *)&LockQueue->Lock) |= LOCK_QUEUE_WAIT;
    TailQueue->Next = LockQueue;

    //
    // [...]
    //

    SpinCount = 0;
    do {
        KeYieldProcessor();
    } while ((*((ULONG64 volatile *)&LockQueue->Lock) & LOCK_QUEUE_WAIT) != 0);

    KeMemoryBarrier();
    return SpinCount;
}
What happens here is that we first set the LOCK_QUEUE_WAIT bit in the Lock field of our KSPIN_LOCK_QUEUE structure. Then, we update the Next field in our predecessor's queue entry. This is necessary to inform the preceding CPU that someone else is waiting. Finally, we spin on our own Lock field until the LOCK_QUEUE_WAIT bit is cleared. Once the bit is cleared, the spinlock has been acquired. Please note that the actual spinning is done on the CPU's own queue entry, which typically resides on the local stack. This is in contrast to normal spinlocks, where all competing CPUs spin on a single global variable, which may decrease performance due to cache coherency issues.
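For contrast, here is a minimal sketch of such a conventional test-and-set spinlock. This is our own illustration, not WRK code, and it ignores IRQL handling; its point is merely that every waiter polls the same shared lock word, so each release bounces the cache line between all competing CPUs:

typedef volatile LONG SIMPLE_SPIN_LOCK;     // 0 = free, 1 = held

VOID
SimpleAcquire (
    __inout SIMPLE_SPIN_LOCK *Lock
    )
{
    // Every CPU atomically writes and then polls the one shared word.
    while (InterlockedExchange((LONG volatile *)Lock, 1) != 0) {
        while (*Lock != 0) {
            KeYieldProcessor();             // be polite while spinning
        }
    }
}

VOID
SimpleRelease (
    __inout SIMPLE_SPIN_LOCK *Lock
    )
{
    // This store invalidates the line in every waiter's cache.
    InterlockedExchange((LONG volatile *)Lock, 0);
}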
Interestingly, the SpinCount variable that should return the number of spin loops is defined, initialized, and returned, but never updated. Perhaps the counting was removed for performance reasons, or it is enabled only in debug versions of the kernel.
Releasing a Queued Spinlock
When KeReleaseInStackQueuedSpinLockFromDpcLevel is called, it basically redirects control to KxReleaseQueuedSpinLock (base\ntos\ke\amd64\queuelock.c:227):
__forceinline
VOID
KxReleaseQueuedSpinLock (
    __inout PKSPIN_LOCK_QUEUE LockQueue
    )
{
    //
    // [...]
    //

#if !defined(NT_UP)
    PKSPIN_LOCK_QUEUE NextQueue;

    NextQueue = ReadForWriteAccess(&LockQueue->Next);
    if (NextQueue == NULL) {
        if (InterlockedCompareExchangePointer((PVOID *)LockQueue->Lock,
                                              NULL,
                                              LockQueue) == LockQueue) {
            return;
        }

        NextQueue = KxWaitForLockChainValid(LockQueue);
    }

    ASSERT(((ULONG64)NextQueue->Lock & LOCK_QUEUE_WAIT) != 0);

    InterlockedXor64((LONG64 volatile *)&NextQueue->Lock, LOCK_QUEUE_WAIT);
    LockQueue->Next = NULL;
#else
    UNREFERENCED_PARAMETER(LockQueue);
#endif

    return;
}
What happens here is that we first determine whether any processors are waiting for the lock. Therefore, the Next field is examined. If it is NULL, we are the last (and only) CPU in the queue and must free the lock. The lock is released by setting it to NULL. However, we must do this with an atomic compare-and-exchange operation to make sure that no other processor has updated the spinlock in the meantime. InterlockedCompareExchangePointer sets the data that LockQueue->Lock points to to NULL if and only if it still contains the address of LockQueue. If it succeeds, it returns the LockQueue address. In that case, the lock has been released.
Otherwise, it returns the actual value of the spinlock. Unfortunately, that address does not necessarily identify the next processor to acquire the lock, as multiple acquisition attempts may have been made in the meantime and the lock only contains the address of the last queue entry.
To retrieve the actual next processor, our CPU must spin-wait for the succeeding processor to finish its initialization. The initialization is done when the succeeding CPU has set the Next field in our KSPIN_LOCK_QUEUE entry.
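The following trace (again our own illustration, not WRK code) shows the window that makes this wait necessary; A and B denote two CPUs and their queue entries:

// CPU A owns the lock:    *SpinLock == &A,  A.Next == NULL
// CPU B starts acquiring: XCHG(SpinLock, &B) returns &A, but B has
//                         not yet stored &B into A.Next
// CPU A releases:         CAS(SpinLock, NULL, &A) fails, because
//                         *SpinLock is now &B, not &A
// CPU A therefore spins in KxWaitForLockChainValid until B's store
// to A.Next becomes visible.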
The busy-wait is performed in the KxWaitForLockChainValid function (base\ntos\ke\amd64\queuelock.c:24):
DECLSPEC_NOINLINE
PKSPIN_LOCK_QUEUE
KxWaitForLockChainValid (
    __inout PKSPIN_LOCK_QUEUE LockQueue
    )
{
    PKSPIN_LOCK_QUEUE NextQueue;

    // [...]

    do {
        KeYieldProcessor();
    } while ((NextQueue = LockQueue->Next) == NULL);

    return NextQueue;
}
It spins until the LockQueue->Next entry is not NULL. At that point, our successor in the queue has finished its initialization and is spinning on the LOCK_QUEUE_WAIT bit.
To inform the next CPU that the spinlock has been released, InterlockedXor64 is called to clear the LOCK_QUEUE_WAIT bit in the NextQueue->Lock field. Once the bit is cleared, the lock is automatically assigned to the waiting processor, and after cleaning up the KSPIN_LOCK_QUEUE structure, the function returns.
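Spelled out for a successor B (our own illustration of the handoff):

// Before:  B.Lock == (PKSPIN_LOCK)((ULONG64)SpinLock | LOCK_QUEUE_WAIT)
// Owner:   InterlockedXor64(&B.Lock, LOCK_QUEUE_WAIT)
// After:   B.Lock == SpinLock  -> B's spin loop in
//          KxWaitForLockOwnerShip observes the cleared bit and
//          returns; B is now the lock owner.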
A nice feature of this implementation is that if the user of the queued spinlock API allocates the storage for the KLOCK_QUEUE_HANDLE structure on the stack, the memory management of the lock queue entries comes for free.