Scalability Issues in CSRSS

By Thomas

One of our current topics is pushing the limits of operating systems. Inspired by Mark Russinovich's article Pushing the Limits of Windows: Processes and Threads, we built a theoretical model for the amount of threads that can be created on a given system and then tried to reach these limits. However, the practical evaluation was kind of tricky and we stumbled upon an issue within CSRSS, Windows' client server runtime subsystem.

Limits on Threads
The maximum number of threads in a system is limited by various resources. For example, on a 32-bit Windows system, the major limitation is the address space of a process. As each thread requires a certain number of memory in kernel space (NtCreateThread - memory allocations in kernel mode) and user space, you can easily calculate the number of threads you can create until the address space of the process is exhausted. In contrast, on 64-bit systems, the major limitation is not the address space, but the physical amount of memory plugged into the machine. For each thread, Windows allocates some memory in kernel space and also the thread stack in user mode. Once the total physical memory on a machine without paging is exhaustet, no further threads can be created.

Limitation by Physical Memory
The HPI recently established a laboratory - the Future SOC Lab - with absolutely stunning hardware. We managed to get access to a machine equipped with 64 cores and 1 TB of main memory to evaluate our hypothesis that once we would have enough memory in the machine, we could reach the theoretical limit.
So, how many threads could be created on such a machine? According to our calculation model a thread on a 64-bit Windows Server 2008 R2 system consumes at least 50KB of memory - using a minimal stack. Therefore we should be able to exceed the number of 20 million threads, spread across multiple processes.

Limitation by Kernel Constants
Having a closer look into the WRK, we found out another limit. The kernel itself holds a list of all process and thread IDs used in the system. And this list is an ordinary handle table. So the maximum size of 2^24 entries for a handle table also holds for the process/thread ID table.
The constant is defined in base\ntos\ex\handle.c, Line 101:

#define MAX_HANDLES (1<<24)

Thus, the maximum number of threads in the whole system is limited to 16,777,216. So let's try to consume all those thread IDs.

First Tests
Here is our test setup: we wrote a little C program, which creates as many threads as possible. Of course we created our threads only with minimal stack sizes and and let Windows only reserve but not commit stack allocations. Also, we started our threads only in a suspended execution mode.

Our program tries to create threads in parallel. Therefore we start some dedicated threads that create as many threads as possible in a loop. The routine for the dedicated threads looks like:

DWORD WINAPI parallelCreateThreadFunc(LPVOID data)
{
        HANDLE hThread;
        // create threads until errors occur
        while(hThread = CreateThread(NULL, 0, sleepThread, NULL, CREATE_SUSPENDED, NULL))
        {
                // close the handle to keep the process-wide handle table clean
                CloseHandle(hThread);
                // count the created threads in a thread-safe manner 
                InterlockedIncrement(&totalThreadCount);
        }
        return 0;
}

On another test system with 4GB of main memory, we reached the maximum of about 90,000 threads in 5 seconds. On the big system we stumbled upon an issue: Our test program created the first million of threads in about some minutes. Continuing to the next million, our program needed almost an hour! That was strange, so we decided to investigate the matter a little bit further. Unfortunately, we could not investigate it on the big Future SOC Lab machine, as our access time was restricted. Nevertheless, we will demonstrate that our findings apply to the big machine as well.

First tests indicated that the time for creating a thread increases with every new thread, instead of being constant. A linear correlation between the number of created threads and the creation time was documented by further measurements. But where is the bottleneck?

CSRSS
While starring at the taskmanager during our tests we noticed the increasing number of open handles in the CSRSS process. The more threads we created, the more handles showed up in Task Manager. During thread creation the number of threads and the open handles in CSRSS increased synchronously. But the reverse process, the termination of these threads, was not synchronized. The CSRSS took much more time for reducing the handle count towards zero. Maybe there's the bottleneck. So what in the world is CSRSS doing?

The Windows Internals book told us that during process creation CSRSS is informed and a function named "CsrCreateProcess" is called via LPC. The Dependency Walker shows for csrss.exe that the corresponding csrsrv.dll also exports a function named "CsrCreateThread". So maybe for thread creation something similar is going on.

Hash Table Full of Threads
In the WRK there are no sources for those DLLs. But the ReactOS project held some sources for CSRSS. Of course the ReactOS code is not the one underlying to our Windows Server 2008 R2, but the principles should be the same. The sources (thread.c) revealed that CSRSS maintains a hash table for threads inside of csrsrv.dll. It is implemented as chained hash table with doubly linked lists and only 256 buckets. Also, for every thread creation, CSRSS makes a lookup for the ID of current thread to receive the corresponding CSR_THREAD object.

Definition of the hash table in thread.c:

LIST_ENTRY CsrThreadHashTable[256];

Extract of the CsrCreateThread function in thread.c:

NTSTATUS
NTAPI
CsrCreateThread(IN PCSR_PROCESS CsrProcess,
                IN HANDLE hThread,
                IN PCLIENT_ID ClientId)
{

    //... more code

    /* Get the current Process and make sure the Thread is valid with this CID */
    CurrentThread = CsrLocateThreadByClientId(&CurrentProcess,
                                              &CurrentCid);

    //... more code
}

Implementation of lookup function CsrLocateThreadByClientId in thread.c:

PCSR_THREAD
NTAPI
CsrLocateThreadByClientId(OUT PCSR_PROCESS *Process OPTIONAL,
                          IN PCLIENT_ID ClientId)
{
    ULONG i;
    PLIST_ENTRY ListHead, NextEntry;
    PCSR_THREAD FoundThread;

    /* Hash the Thread */
    i = CsrHashThread(ClientId->UniqueThread);

    /* Set the list pointers */
    ListHead = &CsrThreadHashTable[i];
    NextEntry = ListHead->Flink;

    /* Star the loop */
    while (NextEntry != ListHead)
    {
        /* Get the thread */
        FoundThread = CONTAINING_RECORD(NextEntry, CSR_THREAD, HashLinks);

        /* Compare the CID */
        if (FoundThread->ClientId.UniqueThread == ClientId->UniqueThread)
        {
            /* Match found, return the process */
            *Process = FoundThread->Process;

            /* Return thread too */
            return FoundThread;
        }

        /* Next */
        NextEntry = NextEntry->Flink;
    }

    /* Nothing found */
    return NULL;
}

With millions of threads created that hash table would degenerate more or less into a simple list with linear search complexity. However, these findings are based on ReactOS, so we needed to verify that something similar is going on in Windows Server 2008 R2.

Drill Into Windows Server 2008 R2
We started WinDbg for inspecting our CSRSS. If you intend to follow our experiment, be careful not to break into CSRSS in the same session as your debugger runs, as otherwise the system will halt. Debugging another session works fine. When using the debugging symbols provided by Microsoft, the inspection was quite simple. We could verify that there is still a hash table called CSRSRV!CsrThreadHashTable with linked lists for its buckets. The size of the hash table changed however in Windows Server 2008 R2. We determined the actual size right from the underlying hash function. In ReactOS this function was a simple macro and in the disassembly the hashing algorithm was inlined in many places.

byte code and global hash table identified in bytecode

The formula is: (ThreadID / 4) % 1024. Thus, the hash table has 1024 buckets - but that's still too small. To prove that this hash table has a significant impact on thread creation time, we patched the hash algorithm into a constant function. So every thread ID was placed into the first bucket of the hash table. The time for creating a thread increased by an order of magnitude, compared to our unpatched version. And indeed, on the small 4GB machine the creation process extremely slowed down.
So we could prove that the performance degradation correlates to CSRSS's hash table, which means that we found the bottleneck. But how can we fix this?

Working Around the Scalability Issues
We had three alternatives: either increasing the number of hash table entries, replacing the hash table with another data structure, or not using it at all. As we lack the sources of CSRSS, we decided for the latter. In this regard, we were not fixing the issue but rather working around to see if our assumptions were correct.

We extended our test program so that it patches the WinAPI CreateThread function in memory in order to avoid any LPC calls to CsrCreateThread. Therefore, we replaced the call instruction in the kernel32 module (more precisely: kernelbase.dll) at the right offset with NOP instructions. But beware of changes in global shared memory without activating copy-on-write for the corresponding pages.

Surprisingly, this quick and dirty hack worked very well and everything went as before except that CSRSS wasn't informed about newly created threads. Though, we must admit that we don't know about all the implications of our modification. Maybe, if we would use "real working" threads instead of "sleeping" threads, something might fail.
However, thread creation was fast, really fast. Within seconds we created 4.7 million threads - maybe more than anyone did before. We will provide detailed results of our performance measurements soon.

screenshot of Windows task manager after modifications

In conclusion, we could verify that CSRSS is having scalability issues upon thread creation, specifically when there are millions of those. Also, we ran into yet another limitation, which we have not been able to resolve yet. We could only create 4.7 millions of threads but not 16 millions. As soon as we figured out that limitation, we will post it here.

Scalability Issues in CSRSS

Comments