Threads need to communicate with each other in two basic situations:
- When multiple threads access a shared resource, and you must ensure that the resource does not become corrupted.
- When one thread needs to notify one or more other threads that a specific task has been completed.
A big part of thread synchronization has to do with atomic access—a thread’s ability to access a resource with the guarantee that no other thread will access that same resource at the same time.
Let’s look at a simple example:
// Define a global variable.
long g_x = 0;

DWORD WINAPI ThreadFunc1(PVOID pvParam) {
   g_x++;
   return(0);
}

DWORD WINAPI ThreadFunc2(PVOID pvParam) {
   g_x++;
   return(0);
}
I’ve declared a global variable, g_x, and initialized it to 0. Now let’s say that I create two threads: one thread executes ThreadFunc1, and the other thread executes ThreadFunc2. The code in these two functions is identical: they both add 1 to the global variable g_x. So when both threads stop running, you might expect to see the value 2 in g_x. But do you? The answer is … maybe. The way the code is written, you can’t tell what g_x will ultimately contain. Here’s why. Let’s say that the compiler generates the following code for the line that increments g_x by 1:
MOV EAX, [g_x] ; Move the value in g_x into a register.
INC EAX ; Increment the value in the register.
MOV [g_x], EAX ; Store the new value back in g_x.
It is unlikely that both threads will execute this code at exactly the same time. So if one thread executes this code and then the other follows, here is what effectively executes:
MOV EAX, [g_x] ; Thread 1: Move 0 into a register.
INC EAX ; Thread 1: Increment the register to 1.
MOV [g_x], EAX ; Thread 1: Store 1 back in g_x.
MOV EAX, [g_x] ; Thread 2: Move 1 into a register.
INC EAX ; Thread 2: Increment the register to 2.
MOV [g_x], EAX ; Thread 2: Store 2 back in g_x.
After both threads are done incrementing g_x, the value in g_x is 2. This is great and is exactly what we expect: take zero (0), increment it by 1 twice, and the answer is 2. Beautiful. But wait: Windows is a preemptive, multithreaded environment. A thread can be switched away from at any time, and another thread might continue executing at any time. So the preceding code might not execute exactly as I've written it. Instead, it might execute as follows:
MOV EAX, [g_x] ; Thread 1: Move 0 into a register.
INC EAX ; Thread 1: Increment the register to 1.
MOV EAX, [g_x] ; Thread 2: Move 0 into a register.
INC EAX ; Thread 2: Increment the register to 1.
MOV [g_x], EAX ; Thread 2: Store 1 back in g_x.
MOV [g_x], EAX ; Thread 1: Store 1 back in g_x.
If the code executes this way, the final value in g_x is 1—not 2 as you expect!
To solve the problem just presented, we need something simple: a way to guarantee that the incrementing of the value is done atomically, that is, without interruption. The interlocked family of functions provides the solution we need. The interlocked functions are underused by most software developers, even though they are incredibly helpful and easy to understand. All the functions manipulate a value atomically. Take a look at InterlockedExchangeAdd and its sibling InterlockedExchangeAdd64, which works on LONGLONG values.
The interlocked functions also execute extremely fast because they run entirely in user mode; switching from user mode to kernel mode requires about 1000 CPU cycles.
One thing to keep in mind is that the variables you pass to the interlocked functions must be properly aligned. The C run-time library offers an _aligned_malloc function that you can use to allocate a block of memory that is properly aligned. Its prototype is as follows:
void * _aligned_malloc(size_t size, size_t alignment);
The size argument identifies the number of bytes you want to allocate, and the alignment argument indicates the byte boundary that you want the block aligned on. The value you pass for the alignment argument must be an integer power of 2.
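For example, the broken g_x increment shown at the beginning of this section can be fixed simply by replacing g_x++ with an interlocked call. Here is a minimal sketch:

// Both threads can now safely increment g_x; the final value is always 2.
long g_x = 0;

DWORD WINAPI ThreadFunc1(PVOID pvParam) {
   InterlockedExchangeAdd(&g_x, 1);   // Atomically add 1 to g_x
   return(0);
}

DWORD WINAPI ThreadFunc2(PVOID pvParam) {
   InterlockedExchangeAdd(&g_x, 1);   // Atomically add 1 to g_x
   return(0);
}

You can also use InterlockedExchange to build a simple spinlock: a loop that atomically swaps TRUE into a flag until it sees that the previous value was FALSE. The variable and function names below are illustrative:

// A spinlock built on InterlockedExchange. FALSE (0) means the resource
// is free; TRUE (1) means some thread is using it.
long g_lResourceInUse = FALSE;

void AccessResource() {
   // Atomically set the flag to TRUE. If the previous value was already
   // TRUE, another thread owns the resource, so keep spinning.
   while (InterlockedExchange(&g_lResourceInUse, TRUE) == TRUE)
      Sleep(0);   // Give up the remainder of our time slice

   // ... access the shared resource here ...

   // Set the flag back to FALSE so another thread can claim the resource.
   InterlockedExchange(&g_lResourceInUse, FALSE);
}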
You must take extreme care when using this technique because a spinlock wastes CPU time. The CPU must constantly compare two values until one “magically” changes because of another thread. Also, this code assumes that all threads using the spinlock run at the same priority level. You might also want to disable thread priority boosting (by calling SetProcessPriorityBoost or SetThreadPriorityBoost) for threads that execute spinlocks.
In addition, you should ensure that the lock variable and the data that the lock protects are maintained in different cache lines. If the lock variable and the data share the same cache line, a CPU using the resource will contend with any CPUs attempting to access the resource. This hurts performance.
Spinlocks assume that the protected resource is always accessed for short periods of time, which makes it more efficient to spin briefly than to immediately transition to kernel mode and wait. Many developers spin some number of times (say 4000), and if access to the resource is still denied, the thread transitions to kernel mode, where it waits (consuming no CPU time) until the resource becomes available. This is how critical sections are implemented.
If you want to build a high-performance application that runs on multiprocessor machines, you must be aware of CPU cache lines. When a CPU reads a byte from memory, it does not just fetch the single byte; it fetches enough bytes to fill a cache line. Cache lines consist of 32 (for older CPUs), 64, or even 128 bytes (depending on the CPU), and they are always aligned on 32-byte, 64-byte, or 128-byte boundaries, respectively. Cache lines exist to improve performance. Usually, an application manipulates a set of adjacent bytes. If these bytes are in the cache, the CPU does not have to access the memory bus, which requires much more time.
You should group your application's data together in cache-line-size chunks, aligned on cache-line boundaries (that is, 32, 64, or 128 bytes). The goal is to make sure that different CPUs access different memory addresses separated by at least a cache-line boundary. Also, you should separate your read-only data (or infrequently read data) from your read-write data, and you should group together pieces of data that are accessed around the same time.
Here is an example of a poorly designed data structure:
struct CUSTINFO {
   DWORD    dwCustomerID;      // Mostly read-only
   int      nBalanceDue;       // Read-write
   wchar_t  szName[100];       // Mostly read-only
   FILETIME ftLastOrderDate;   // Read-write
};
The easiest way to determine the cache line size of a CPU is by calling Win32's GetLogicalProcessorInformation function. This function returns an array of SYSTEM_LOGICAL_PROCESSOR_INFORMATION structures. You can examine a structure's Cache field, a CACHE_DESCRIPTOR structure whose LineSize field indicates the CPU's cache line size. (A sketch of this run-time query appears after the improved structure below.) Once you have this information, you can use the C/C++ compilers' __declspec(align(#)) directive to control field alignment. Here is an improved version of this structure:
#define CACHE_ALIGN 64   // Assumed cache line size; __declspec(align(#))
                         // requires a compile-time constant

// Force each structure to be in a different cache line.
struct __declspec(align(CACHE_ALIGN)) CUSTINFO {
   DWORD    dwCustomerID;      // Mostly read-only
   wchar_t  szName[100];       // Mostly read-only

   // Force the following members to be in a different cache line.
   __declspec(align(CACHE_ALIGN))
   int      nBalanceDue;       // Read-write
   FILETIME ftLastOrderDate;   // Read-write
};
For more information on using __declspec(align(#)), read http://msdn2.microsoft.com/en-us/library/83ythb65.aspx.
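As an aside, here is a minimal sketch of the run-time query mentioned above. The helper name GetCacheLineSize is illustrative; it returns the L1 cache line size reported by the system, or 0 if the query fails:

#include <windows.h>
#include <stdlib.h>

DWORD GetCacheLineSize(void) {
   // The first call fails but tells us how big a buffer we need.
   DWORD cb = 0;
   GetLogicalProcessorInformation(NULL, &cb);

   SYSTEM_LOGICAL_PROCESSOR_INFORMATION *pInfo =
      (SYSTEM_LOGICAL_PROCESSOR_INFORMATION *) malloc(cb);
   DWORD lineSize = 0;
   if (pInfo != NULL && GetLogicalProcessorInformation(pInfo, &cb)) {
      for (DWORD i = 0; i < cb / sizeof(*pInfo); i++) {
         if (pInfo[i].Relationship == RelationCache &&
             pInfo[i].Cache.Level == 1) {
            lineSize = pInfo[i].Cache.LineSize;   // CACHE_DESCRIPTOR field
            break;
         }
      }
   }
   free(pInfo);
   return(lineSize);
}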
It is best for data to be always accessed by a single thread (function parameters and local variables are the easiest way to ensure this) or for the data to be always accessed by a single CPU (using thread affinity). If you do either of these, you avoid cache-line issues entirely.
When a variable is shared with threaded code, declare it with the volatile type qualifier. For the polling loop sketched below to even come close to working, the volatile type qualifier must be there. It tells the compiler that the variable can be modified by something outside of the application itself, such as the operating system, hardware, or a concurrently executing thread. Specifically, the volatile qualifier tells the compiler to exclude the variable from any optimizations and always reload the value from the variable's memory location.
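The code fragment in question is not reproduced here; a minimal reconstruction, in which a worker thread sets a flag that another thread polls (the function names RecalcFunc and WaitForCalculation are illustrative), might look like this:

volatile BOOL g_fFinishedCalculation = FALSE;

DWORD WINAPI RecalcFunc(PVOID pvParam) {
   // ... perform a long-running calculation ...
   g_fFinishedCalculation = TRUE;   // Tell the polling thread we are done
   return(0);
}

void WaitForCalculation() {
   // Poll until the worker thread sets the flag.
   while (!g_fFinishedCalculation)
      ;
}

Now let's say that the compiler generates the following pseudocode for the while statement: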
MOV Reg0, [g_fFinishedCalculation] ; Copy the value into a register
Label: TEST Reg0, 0 ; Is the value 0?
JMP Reg0 == 0, Label ; The register is 0, try again
… ; The register is not 0 (end of loop)
Without making the Boolean variable volatile, it’s possible that the compiler might optimize your C++ code as shown here. For this optimization, the compiler loads the value of the BOOL variable into a CPU register just once. Then it repeatedly performs tests against the CPU register. This certainly yields better performance than constantly rereading the value in a memory address and retesting it; therefore, an optimizing compiler might write code like that just shown. However, if the compiler does this, the thread enters an infinite loop and never wakes up. By the way, making a structure volatile ensures that all of its members are volatile and are always read from memory when referenced.
A critical section is a small section of code that requires exclusive access to some shared resource before the code can execute. This is a way to have several lines of code “atomically” manipulate a resource. By atomic, I mean that the code knows that no other thread will access the resource. Of course, the system can still preempt your thread and schedule other threads. However, it will not schedule any other threads that want to access the same resource until your thread leaves the critical section.
Here is some problematic code that demonstrates what happens without the use of a critical section:
const int COUNT = 1000;
int g_nSum = 0;

DWORD WINAPI FirstThread(PVOID pvParam) {
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   return(g_nSum);
}

DWORD WINAPI SecondThread(PVOID pvParam) {
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   return(g_nSum);
}
Here is the same code fixed with a critical section:
const int COUNT = 1000;
int g_nSum = 0;
CRITICAL_SECTION g_cs;   // Must be initialized via InitializeCriticalSection
                         // (discussed below) before either thread runs

DWORD WINAPI FirstThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   LeaveCriticalSection(&g_cs);
   return(g_nSum);
}

DWORD WINAPI SecondThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   LeaveCriticalSection(&g_cs);
   return(g_nSum);
}
What are the key points to remember? When you have a resource that is accessed by multiple threads, you should create a CRITICAL_SECTION structure. Since I'm writing this on an airplane flight, let me draw the following analogy. A CRITICAL_SECTION structure is like an airplane's lavatory, and the toilet is the data that you want protected. Because the lavatory is small, only one person (thread) at a time can be inside the lavatory (critical section) using the toilet (protected resource).
If you have multiple resources that are always used together, you can place them all in a single lavatory: create just one CRITICAL_SECTION structure to guard them all.
If you have multiple resources that are not always used together (for example, threads 1 and 2 access one resource and threads 1 and 3 access another resource), you should create a separate lavatory, or CRITICAL_SECTION structure, for each resource.
When you can’t solve your synchronization problem with interlocked functions, you should try using critical sections. The great thing about critical sections is that they are easy to use and they use the interlocked functions internally, so they execute quickly. The major disadvantage of critical sections is that you cannot use them to synchronize threads in multiple processes.
Normally, CRITICAL_SECTION structures are allocated as global variables to allow all threads in the process an easy way to reference the structure—by variable name. However, CRITICAL_SECTION structures can be allocated as local variables or dynamically allocated from a heap; and it is common to allocate them as private fields of a class definition. There are just two requirements. The first is that all threads that want to access the resource must know the address of the CRITICAL_SECTION structure that protects the resource. You can get this address to these threads using any mechanism you like. The second requirement is that the members within the CRITICAL_SECTION structure be initialized before any threads attempt to access the protected resource. The structure is initialized via a call to
VOID InitializeCriticalSection(PCRITICAL_SECTION pcs);
This function initializes the members of a CRITICAL_SECTION structure (pointed to by pcs). Because this function simply sets some member variables, it cannot fail and is therefore prototyped with a return value of VOID. This function must be called before any thread calls EnterCriticalSection. The Platform SDK documentation clearly states that the results are undefined if a thread attempts to enter an uninitialized CRITICAL_SECTION.
If two threads call EnterCriticalSection at exactly the same time on a multiprocessor machine, the function still behaves correctly: one thread is granted access to the resource, and the other thread is placed in a wait state.
If EnterCriticalSection places a thread in a wait state, the thread might not be scheduled again for a long time. In fact, in a poorly written application, the thread might never be scheduled CPU time again. If this happens, the thread is said to be starved.
In reality, threads waiting for a critical section never starve. Calls to EnterCriticalSection eventually time out, causing an exception to be raised. You can then attach a debugger to your application to determine what went wrong. The amount of time that must expire is determined by the CriticalSectionTimeout data value contained in the following registry subkey:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager
This value is in seconds and defaults to 2,592,000 seconds, or about 30 days. Do not set this value too low (below 3 seconds, for example) or you will adversely affect threads in the system and other applications that normally wait more than 3 seconds for a critical section.
When a thread attempts to enter a critical section owned by another thread, the calling thread is placed immediately into a wait state. This means that the thread must transition from user mode to kernel mode (about 1000 CPU cycles). This transition is very expensive. On a multiprocessor machine, the thread that currently owns the resource might execute on a different processor and might relinquish control of the resource shortly. In fact, the thread that owns the resource might release it before the other thread has completed executing its transition into kernel mode. If this happens, a lot of CPU time is wasted.
To improve the performance of critical sections, Microsoft has incorporated spinlocks into them. So when EnterCriticalSection is called, it loops using a spinlock to try to acquire the resource some number of times. Only if all the attempts fail does the thread transition to kernel mode to enter a wait state.
To use a spinlock with a critical section, you should initialize the critical section by calling this function:
BOOL InitializeCriticalSectionAndSpinCount(PCRITICAL_SECTION pcs, DWORD dwSpinCount);
If you call this function while running on a single-processor machine, the dwSpinCount parameter is ignored and the count is always set to 0. This is good because setting a spin count on a single-processor machine is useless: the thread owning the resource can’t relinquish it if another thread is spinning.
You can change a critical section’s spin count by calling this function:
DWORD SetCriticalSectionSpinCount(PCRITICAL_SECTION pcs, DWORD dwSpinCount);
Again, the dwSpinCount value is ignored if the host machine has just one processor.
In my opinion, you should always use spinlocks with critical sections because you have nothing to lose. The hard part is determining what value to pass for the dwSpinCount parameter. For the best performance, you simply have to play with numbers until you're happy with the performance results. As a guide, the critical section that guards access to your process's heap uses a spin count of roughly 4000.
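A minimal sketch of initializing a critical section with a spin count might look like this (the Init and Done helper names are illustrative):

CRITICAL_SECTION g_cs;

void Init() {
   // 4000 is roughly the spin count used for the critical section that
   // guards the process heap; it is a reasonable starting point for tuning.
   InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
}

void Done() {
   DeleteCriticalSection(&g_cs);   // Release the critical section's resources
}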
There is a small chance that the InitializeCriticalSection function can fail. Microsoft didn't really think about this when it originally designed the function, which is why the function is prototyped as returning VOID. The function might fail because it allocates a block of memory so that the system can have some internal debugging information. If this memory allocation fails, a STATUS_NO_MEMORY exception is raised. You can trap this in your code using structured exception handling.
Another problem can arise when you use critical sections. Internally, critical sections use an event kernel object if two or more threads contend for the critical section at the same time. Because contention is rare, the system does not create the event kernel object until the first time it is required. This saves a lot of system resources because most critical sections never have contention. By the way, this event kernel object is only released when you call DeleteCriticalSection; so you should never forget to call this function when you’re done with the critical section.
Before Windows XP, in a low-memory situation, a critical section might have contention, and the system might be unable to create the required event kernel object. The EnterCriticalSection function will then raise an EXCEPTION_INVALID_HANDLE exception. Most developers simply ignore this potential error and have no special handling in their code because this error is extremely rare. However, if you want to be prepared for this situation, you do have two options.
You can use structured exception handling and trap the error. When the error occurs, you can either not access the resource protected with the critical section or wait for some memory to become available and then call EnterCriticalSection again.
Your other option is to create the critical section using InitializeCriticalSectionAndSpinCount, making sure that you set the high bit of the dwSpinCount parameter. When this function sees that the high bit is set, it creates the event kernel object and associates it with the critical section at initialization time. If the event cannot be created, the function returns FALSE and you can handle this more gracefully in your code. If the event is created successfully, you know that EnterCriticalSection will always work and never raise an exception. (Always preallocating the event kernel objects can waste system resources. You should do this only if your code cannot tolerate EnterCriticalSection failing, if you are sure that contention will occur, or if you expect the process to be run in very low-memory environments.)
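A minimal sketch of this second option, assuming an illustrative InitLock helper, might look like this:

CRITICAL_SECTION g_cs;

BOOL InitLock() {
   // Setting the high bit of dwSpinCount forces the event kernel object to
   // be created at initialization time. On failure this returns FALSE
   // instead of letting EnterCriticalSection raise an exception later.
   return(InitializeCriticalSectionAndSpinCount(&g_cs, 0x80000000 | 4000));
}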
Since Windows XP, a new type of kernel object, the keyed event, has been introduced to help solve this event creation problem under low-resource conditions.
An SRWLock has the same purpose as a simple critical section: to protect a single resource against access made by different threads. However, unlike a critical section, an SRWLock allows you to distinguish between threads that simply want to read the value of the resource (the readers) and other threads that are trying to update this value (the writers). It should be possible for all readers to access the shared resource at the same time because there is no risk of data corruption if you only read the value of a resource. The need for synchronization begins when a writer thread wants to update the resource. In that case, the access should be exclusive: no other thread, neither a reader nor a writer, should be allowed to access the resource. This is exactly what an SRWLock allows you to do in your code and in a very explicit way.
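Here is a minimal sketch of the SRWLock API in use; the function names and the g_value resource are illustrative:

SRWLOCK g_srwLock;   // Can also be statically initialized with SRWLOCK_INIT
LONG g_value = 0;    // The shared resource (illustrative)

void Init() {
   InitializeSRWLock(&g_srwLock);
}

void WriterThread() {
   AcquireSRWLockExclusive(&g_srwLock);   // No readers or writers allowed in
   g_value++;                             // Safe to update the resource
   ReleaseSRWLockExclusive(&g_srwLock);
}

LONG ReaderThread() {
   AcquireSRWLockShared(&g_srwLock);      // Other readers may enter, too
   LONG lValue = g_value;                 // Safe to read the resource
   ReleaseSRWLockShared(&g_srwLock);
   return(lValue);
}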
Compared to a critical section, an SRWLock is missing some features:
- There is no TryEnter(Shared/Exclusive)SRWLock: your calls to the AcquireSRWLock(Shared/Exclusive) functions block the calling thread if the lock is already owned.
- It is not possible to recursively acquire an SRWLOCK; that is, a single thread cannot acquire the lock for writing multiple times and then release it with a corresponding number of ReleaseSRWLock* calls.
To summarize, if you want the best performance in an application, first try to use nonshared data; then, if you must share data, use (in order of preference) volatile reads, volatile writes, the interlocked APIs, SRWLocks, and critical sections. And if none of these works for your situation, then and only then, use kernel objects.
Condition variables are designed to simplify your life when you implement synchronization scenarios in which a thread must atomically release a lock on a resource and block until a condition is met. You do this through the SleepConditionVariableCS or SleepConditionVariableSRW functions.
A thread blocked inside these Sleep* functions is awakened when WakeConditionVariable or WakeAllConditionVariable is called by another thread that detects that the right condition is satisfied, such as the presence of an element to consume for a reader thread or enough room to insert a produced element for a writer thread.
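Here is a minimal producer/consumer sketch built on a critical section and a condition variable; the names g_nItemCount, Producer, and Consumer are illustrative:

CRITICAL_SECTION   g_cs;          // Protects the shared state below
CONDITION_VARIABLE g_cvNotEmpty;  // Signaled when an item is available
int g_nItemCount = 0;             // Illustrative shared state

void Init() {
   InitializeCriticalSection(&g_cs);
   InitializeConditionVariable(&g_cvNotEmpty);
}

void Consumer() {
   EnterCriticalSection(&g_cs);
   // SleepConditionVariableCS atomically releases the lock while we sleep
   // and reacquires it before returning, so the check-and-wait is safe.
   while (g_nItemCount == 0)
      SleepConditionVariableCS(&g_cvNotEmpty, &g_cs, INFINITE);
   g_nItemCount--;   // Consume an item
   LeaveCriticalSection(&g_cs);
}

void Producer() {
   EnterCriticalSection(&g_cs);
   g_nItemCount++;   // Produce an item
   LeaveCriticalSection(&g_cs);
   WakeConditionVariable(&g_cvNotEmpty);   // Wake one sleeping consumer
}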