Thread Synchronization in User Mode

Threads need to communicate with each other in two basic situations:

  • When you have multiple threads accessing a shared resource, you need to ensure that the resource does not become corrupted.
  • When one thread needs to notify one or more other threads that a specific task has been completed.

A big part of thread synchronization has to do with atomic access—a thread’s ability to access a resource with the guarantee that no other thread will access that same resource at the same time.

Let’s look at a simple example:

// Define a global variable.
long g_x = 0;

DWORD WINAPI ThreadFunc1(PVOID pvParam) {
   g_x++;
   return(0);
}

DWORD WINAPI ThreadFunc2(PVOID pvParam) {
   g_x++;
   return(0);
}

I’ve declared a global variable, g_x, and initialized it to 0. Now let’s say that I create two threads: one thread executes ThreadFunc1, and the other thread executes ThreadFunc2. The code in these two functions is identical: they both add 1 to the global variable g_x. So when both threads stop running, you might expect to see the value 2 in g_x. But do you? The answer is … maybe. The way the code is written, you can’t tell what g_x will ultimately contain. Here’s why. Let’s say that the compiler generates the following code for the line that increments g_x by 1:

MOV EAX, [g_x]   ; Move the value in g_x into a register.
INC EAX          ; Increment the value in the register.
MOV [g_x], EAX   ; Store the new value back in g_x.

It is unlikely that both threads will execute this code at exactly the same time. So if one thread executes this code and then the other thread executes it, here is what effectively runs:

MOV EAX, [g_x]   ; Thread 1: Move 0 into a register.
INC EAX          ; Thread 1: Increment the register to 1.
MOV [g_x], EAX   ; Thread 1: Store 1 back in g_x.

MOV EAX, [g_x]   ; Thread 2: Move 1 into a register.
INC EAX          ; Thread 2: Increment the register to 2.
MOV [g_x], EAX   ; Thread 2: Store 2 back in g_x.

After both threads are done incrementing g_x, the value in g_x is 2. This is great and is exactly what we expect: take zero (0), increment it by 1 twice, and the answer is 2. Beautiful. But wait—Windows is a preemptive, multithreaded environment. A thread can be switched away from at any time, and another thread might be scheduled at any time. So the preceding code might not execute exactly as I’ve written it. Instead, it might execute as follows:

MOV EAX, [g_x]   ; Thread 1: Move 0 into a register.
INC EAX          ; Thread 1: Increment the register to 1.

MOV EAX, [g_x]   ; Thread 2: Move 0 into a register.
INC EAX          ; Thread 2: Increment the register to 1.
MOV [g_x], EAX   ; Thread 2: Store 1 back in g_x.

MOV [g_x], EAX   ; Thread 1: Store 1 back in g_x.

If the code executes this way, the final value in g_x is 1—not 2 as you expect!

To solve the problem just presented, we need something simple. We need a way to guarantee that the incrementing of the value is done atomically—that is, without interruption. The interlocked family of functions provides the solution we need. The interlocked functions are awesome and underused by most software developers, even though they are incredibly helpful and easy to understand.

The C run-time library offers an _aligned_malloc function that you can use to allocate a block of memory that is properly aligned. Its prototype is as follows:

void * _aligned_malloc(size_t size, size_t alignment);

The size argument identifies the number of bytes you want to allocate, and the alignment argument indicates the byte boundary that you want the block aligned on. The value you pass for the alignment argument must be an integer power of 2.
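A minimal sketch of how it might be used (the size and alignment values here are just examples; a block from _aligned_malloc must be released with the matching _aligned_free function):

#include <malloc.h>

void Example() {
   void *p = _aligned_malloc(1024, 64);   // 1024 bytes, aligned on a 64-byte boundary
   if (p != NULL) {
      // ... use the block ...
      _aligned_free(p);                   // free with the matching function
   }
}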

Switching from user mode to kernel mode requires about 1000 CPU cycles to execute.

All the interlocked functions manipulate a value atomically. Take a look at InterlockedExchangeAdd and its sibling InterlockedExchangeAdd64, which works on LONGLONG values.
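For instance, the broken increment from the start of this section becomes safe when rewritten with an interlocked function. A minimal sketch, reusing the earlier g_x example:

long g_x = 0;

DWORD WINAPI ThreadFunc1(PVOID pvParam) {
   // Atomically add 1 to g_x; no other thread can interrupt the operation.
   InterlockedExchangeAdd(&g_x, 1);
   return(0);
}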

You must take extreme care when using the spinlock technique because a spinlock wastes CPU time: the CPU must constantly compare two values until one “magically” changes because of another thread. Spinlock code also assumes that all threads using the spinlock run at the same priority level. You might also want to disable thread priority boosting (by calling SetProcessPriorityBoost or SetThreadPriorityBoost) for threads that execute spinlocks.
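The spinlock code itself is not reproduced in this excerpt; a minimal sketch, assuming a LONG flag toggled with InterlockedExchange, might look like this:

// Global flag indicating whether the shared resource is in use (FALSE = free).
LONG volatile g_fResourceInUse = FALSE;

void AccessResource() {
   // Atomically set the flag to TRUE; if it was already TRUE, another thread owns it.
   while (InterlockedExchange(&g_fResourceInUse, TRUE) == TRUE)
      Sleep(0);   // relinquish the rest of our time slice and try again

   // ... access the shared resource here ...

   // Set the flag back to FALSE so that a spinning thread can claim the lock.
   InterlockedExchange(&g_fResourceInUse, FALSE);
}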

In addition, you should ensure that the lock variable and the data that the lock protects are maintained in different cache lines. If the lock variable and the data share the same cache line, a CPU using the resource will contend with any CPUs attempting to access it. This hurts performance.

Spinlocks assume that the protected resource is always accessed for short periods of time, which makes it more efficient to spin than to transition to kernel mode and wait. Many developers spin some number of times (say, 4000), and if access to the resource is still denied, the thread transitions to kernel mode, where it waits (consuming no CPU time) until the resource becomes available. This is how critical sections are implemented.

If you want to build a high-performance application that runs on multiprocessor machines, you must be aware of CPU cache lines. When a CPU reads a byte from memory, it does not just fetch the single byte; it fetches enough bytes to fill a cache line. Cache lines consist of 32 (for older CPUs), 64, or even 128 bytes (depending on the CPU), and they are always aligned on 32-byte, 64-byte, or 128-byte boundaries, respectively. Cache lines exist to improve performance. Usually, an application manipulates a set of adjacent bytes. If these bytes are in the cache, the CPU does not have to access the memory bus, which requires much more time.

You should group your application’s data together in cache-line-size chunks and on cache-line boundaries (that is, 32, 64, or 128 bytes). The goal is to make sure that different CPUs access different memory addresses separated by at least a cache-line boundary. Also, you should separate your read-only data (or infrequently read data) from read-write data, and you should group together pieces of data that are accessed around the same time.

Here is an example of a poorly designed data structure:

struct CUSTINFO {
   DWORD dwCustomerID;        // Mostly read-only
   int nBalanceDue;           // Read-write
   wchar_t szName[100];       // Mostly read-only
   FILETIME ftLastOrderDate;  // Read-write
};

The easiest way to determine the cache line size of a CPU is by calling Win32’s GetLogicalProcessorInformation function. This function returns an array of SYSTEM_LOGICAL_PROCESSOR_INFORMATION structures. You can examine a structure’s Cache field, which refers to a CACHE_DESCRIPTOR structure that contains a LineSize field indicating the CPU’s cache line size. Once you have this information, you can use the C/C++ compilers’ __declspec(align(#)) directive to control field alignment. Here is an improved version of this structure:

#define CACHE_ALIGN 64

// Force each structure to be in a different cache line.
struct __declspec(align(CACHE_ALIGN)) CUSTINFO {
   DWORD dwCustomerID;        // Mostly read-only
   wchar_t szName[100];       // Mostly read-only

   // Force the following members to be in a different cache line.
   __declspec(align(CACHE_ALIGN))
   int nBalanceDue;           // Read-write
   FILETIME ftLastOrderDate;  // Read-write
};
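If you need to discover the line size at run time, here is a sketch of the GetLogicalProcessorInformation query mentioned above (error handling is mostly omitted, and filtering on the level-1 cache is just one reasonable choice):

DWORD cb = 0;
GetLogicalProcessorInformation(NULL, &cb);   // first call: ask for the required buffer size

PSYSTEM_LOGICAL_PROCESSOR_INFORMATION pInfo =
   (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION)malloc(cb);
GetLogicalProcessorInformation(pInfo, &cb);

WORD wLineSize = 0;
for (DWORD i = 0; i < cb / sizeof(*pInfo); i++) {
   if (pInfo[i].Relationship == RelationCache && pInfo[i].Cache.Level == 1) {
      wLineSize = pInfo[i].Cache.LineSize;   // the CPU's cache line size, in bytes
      break;
   }
}
free(pInfo);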

For more information on using __declspec(align(#)), read http://msdn2.microsoft.com/en-us/library/83ythb65.aspx.

It is best for data to be always accessed by a single thread (function parameters and local variables are the easiest way to ensure this) or for the data to be always accessed by a single CPU (using thread affinity). If you do either of these, you avoid cache-line issues entirely.

When passing such shared flags between threads, you must use the volatile keyword. For code of this kind to even come close to working, the volatile type qualifier must be there. This tells the compiler that the variable can be modified by something outside of the application itself, such as the operating system, hardware, or a concurrently executing thread. Specifically, the volatile qualifier tells the compiler to exclude the variable from any optimizations and to always reload the value from the variable’s memory location.
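The fragment this discussion refers to is not reproduced in this excerpt; a minimal reconstruction, assuming a Boolean flag named g_fFinishedCalculation that one thread sets and another spins on, might look like this:

volatile BOOL g_fFinishedCalculation = FALSE;

DWORD WINAPI RecalcFunc(PVOID pvParam) {
   // Perform the lengthy recalculation here...
   g_fFinishedCalculation = TRUE;   // signal the waiting thread
   return(0);
}

void WaitForRecalc() {
   // Spin until the recalculation thread sets the flag.
   while (!g_fFinishedCalculation)
      ;
}

Let’s say that the compiler generates the following pseudocode for that while statement: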

MOV Reg0, [g_fFinishedCalculation]   ; Copy the value into a register
Label: TEST Reg0, 0                  ; Is the value 0?
JMP Reg0 == 0, Label                 ; The register is 0, try again
…                                    ; The register is not 0 (end of loop)

Without making the Boolean variable volatile, it’s possible that the compiler might optimize your C++ code as shown here. For this optimization, the compiler loads the value of the BOOL variable into a CPU register just once. Then it repeatedly performs tests against the CPU register. This certainly yields better performance than constantly rereading the value in a memory address and retesting it; therefore, an optimizing compiler might write code like that just shown. However, if the compiler does this, the thread enters an infinite loop and never wakes up. By the way, making a structure volatile ensures that all of its members are volatile and are always read from memory when referenced.

A critical section is a small section of code that requires exclusive access to some shared resource before the code can execute. This is a way to have several lines of code “atomically” manipulate a resource. By atomic, I mean that the code knows that no other thread will access the resource. Of course, the system can still preempt your thread and schedule other threads. However, it will not schedule any other threads that want to access the same resource until your thread leaves the critical section.

Here is some problematic code that demonstrates what happens without the use of a critical section:

const int COUNT = 1000;
int g_nSum = 0;

DWORD WINAPI FirstThread(PVOID pvParam) {
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   return(g_nSum);
}

DWORD WINAPI SecondThread(PVOID pvParam) {
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   return(g_nSum);
}

Here is the same code fixed by using a critical section:

const int COUNT = 1000;
int g_nSum = 0;
CRITICAL_SECTION g_cs;

DWORD WINAPI FirstThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   LeaveCriticalSection(&g_cs);
   return(g_nSum);
}

DWORD WINAPI SecondThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   g_nSum = 0;
   for (int n = 1; n <= COUNT; n++) {
      g_nSum += n;
   }
   LeaveCriticalSection(&g_cs);
   return(g_nSum);
}

What are the key points to remember? When you have a resource that is accessed by multiple threads, you should create a CRITICAL_SECTION structure. Since I’m writing this on an airplane flight, let me draw the following analogy. A CRITICAL_SECTION structure is like an airplane’s lavatory, and the toilet is the data that you want protected. Because the lavatory is small, only one person (thread) at a time can be inside the lavatory (critical section) using the toilet (protected resource).

If you have multiple resources that are always used together, you can place them all in a single lavatory: create just one CRITICAL_SECTION structure to guard them all.

If you have multiple resources that are not always used together—for example, threads 1 and 2 access one resource and threads 1 and 3 access another resource—you should create a separate lavatory, or CRITICAL_SECTION structure, for each resource.

When you can’t solve your synchronization problem with interlocked functions, you should try using critical sections. The great thing about critical sections is that they are easy to use and they use the interlocked functions internally, so they execute quickly. The major disadvantage of critical sections is that you cannot use them to synchronize threads in multiple processes.

Normally, CRITICAL_SECTION structures are allocated as global variables to allow all threads in the process an easy way to reference the structure—by variable name. However, CRITICAL_SECTION structures can be allocated as local variables or dynamically allocated from a heap; and it is common to allocate them as private fields of a class definition. There are just two requirements. The first is that all threads that want to access the resource must know the address of the CRITICAL_SECTION structure that protects the resource. You can get this address to these threads using any mechanism you like. The second requirement is that the members within the CRITICAL_SECTION structure be initialized before any threads attempt to access the protected resource. The structure is initialized via a call to

VOID InitializeCriticalSection(PCRITICAL_SECTION pcs);

This function initializes the members of a CRITICAL_SECTION structure (pointed to by pcs). Because this function simply sets some member variables, it cannot fail and is therefore prototyped with a return value of VOID. This function must be called before any thread calls EnterCriticalSection. The Platform SDK documentation clearly states that the results are undefined if a thread attempts to enter an uninitialized CRITICAL_SECTION.

If two threads call EnterCriticalSection at exactly the same time on a multiprocessor machine, the function still behaves correctly: one thread is granted access to the resource, and the other thread is placed in a wait state.

If EnterCriticalSection places a thread in a wait state, the thread might not be scheduled again for a long time. In fact, in a poorly written application, the thread might never be scheduled CPU time again. If this happens, the thread is said to be starved.

In reality, threads waiting for a critical section never starve. Calls to EnterCriticalSection eventually time out, causing an exception to be raised. You can then attach a debugger to your application to determine what went wrong. The amount of time that must expire is determined by the CriticalSectionTimeout data value contained in the following registry subkey:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager

This value is in seconds and defaults to 2,592,000 seconds, or about 30 days. Do not set this value too low (below 3 seconds, for example) or you will adversely affect threads in the system and other applications that normally wait more than 3 seconds for a critical section.

When a thread attempts to enter a critical section owned by another thread, the calling thread is placed immediately into a wait state. This means that the thread must transition from user mode to kernel mode (about 1000 CPU cycles). This transition is very expensive. On a multiprocessor machine, the thread that currently owns the resource might execute on a different processor and might relinquish control of the resource shortly. In fact, the thread that owns the resource might release it before the other thread has completed executing its transition into kernel mode. If this happens, a lot of CPU time is wasted.

To improve the performance of critical sections, Microsoft has incorporated spinlocks into them. So when EnterCriticalSection is called, it loops using a spinlock to try to acquire the resource some number of times. Only if all the attempts fail does the thread transition to kernel mode to enter a wait state.

To use a spinlock with a critical section, you should initialize the critical section by calling this function:

BOOL InitializeCriticalSectionAndSpinCount(PCRITICAL_SECTION pcs, DWORD dwSpinCount);

If you call this function while running on a single-processor machine, the dwSpinCount parameter is ignored and the count is always set to 0. This is good because setting a spin count on a single-processor machine is useless: the thread owning the resource can’t relinquish it if another thread is spinning.

You can change a critical section’s spin count by calling this function:

DWORD SetCriticalSectionSpinCount(PCRITICAL_SECTION pcs, DWORD dwSpinCount);

Again, the dwSpinCount value is ignored if the host machine has just one processor.

In my opinion, you should always use spinlocks with critical sections because you have nothing to lose. The hard part is determining what value to pass for the dwSpinCount parameter. For the best performance, you simply have to play with numbers until you’re happy with the performance results. As a guide, the critical section that guards access to your process’ heap uses a spin count of roughly 4000.
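A minimal usage sketch (the spin count of 4000 mirrors the heap’s value mentioned above; g_cs is the structure from the earlier example):

CRITICAL_SECTION g_cs;

// At startup, before any thread touches the protected resource:
InitializeCriticalSectionAndSpinCount(&g_cs, 4000);

// In each thread that accesses the resource:
EnterCriticalSection(&g_cs);
// ... touch the protected resource here ...
LeaveCriticalSection(&g_cs);

// At shutdown, after all threads are done with the resource:
DeleteCriticalSection(&g_cs);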

There is a small chance that the InitializeCriticalSection function can fail. Microsoft didn’t really think about this when it originally designed the function, which is why the function is prototyped as returning VOID. The function might fail because it allocates a block of memory so that the system can have some internal debugging information. If this memory allocation fails, a STATUS_NO_MEMORY exception is raised. You can trap this in your code using structured exception handling.

Another problem can arise when you use critical sections. Internally, critical sections use an event kernel object if two or more threads contend for the critical section at the same time. Because contention is rare, the system does not create the event kernel object until the first time it is required. This saves a lot of system resources because most critical sections never have contention. By the way, this event kernel object is only released when you call DeleteCriticalSection; so you should never forget to call this function when you’re done with the critical section.

Before Windows XP, in a low-memory situation, a critical section might have contention, and the system might be unable to create the required event kernel object. The EnterCriticalSection function will then raise an EXCEPTION_INVALID_HANDLE exception. Most developers simply ignore this potential error and have no special handling in their code because this error is extremely rare. However, if you want to be prepared for this situation, you do have two options.

You can use structured exception handling and trap the error. When the error occurs, you can either not access the resource protected with the critical section or wait for some memory to become available and then call EnterCriticalSection again.

Your other option is to create the critical section using InitializeCriticalSectionAndSpinCount, making sure that you set the high bit of the dwSpinCount parameter. When this function sees that the high bit is set, it creates the event kernel object and associates it with the critical section at initialization time. If the event cannot be created, the function returns FALSE and you can handle this more gracefully in your code. If the event is created successfully, you know that EnterCriticalSection will always work and never raise an exception. (Always preallocating the event kernel objects can waste system resources. You should do this only if your code cannot tolerate EnterCriticalSection failing, if you are sure that contention will occur, or if you expect the process to be run in very low-memory environments.)
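If you choose the second option, a sketch might look like this (0x80000000 is the high bit; 4000 is just an example spin count):

CRITICAL_SECTION g_cs;

// Setting the high bit of the spin count preallocates the event kernel object.
if (!InitializeCriticalSectionAndSpinCount(&g_cs, 0x80000000 | 4000)) {
   // The event could not be created; handle the failure gracefully here.
}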

Since Windows XP, a new type of kernel object, the keyed event, has been introduced to help solve this event-creation issue under low-resource conditions.

An SRWLock has the same purpose as a simple critical section: to protect a single resource against access made by different threads. However, unlike a critical section, an SRWLock allows you to distinguish between threads that simply want to read the value of the resource (the readers) and other threads that are trying to update this value (the writers). It should be possible for all readers to access the shared resource at the same time because there is no risk of data corruption if you only read the value of a resource. The need for synchronization begins when a writer thread wants to update the resource. In that case, the access should be exclusive: no other thread, neither a reader nor a writer, should be allowed to access the resource. This is exactly what an SRWLock allows you to do in your code and in a very explicit way.
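A minimal sketch of the SRWLock API (the reader and writer function names are hypothetical; the SRWLOCK functions themselves are the real API):

SRWLOCK g_srwLock;
LONG g_value = 0;   // the shared resource

void Init() {
   InitializeSRWLock(&g_srwLock);
}

void Reader() {
   AcquireSRWLockShared(&g_srwLock);     // many readers can hold the lock at once
   LONG value = g_value;                 // safely read the shared resource
   ReleaseSRWLockShared(&g_srwLock);
}

void Writer() {
   AcquireSRWLockExclusive(&g_srwLock);  // a writer gets exclusive access
   g_value++;                            // safely update the shared resource
   ReleaseSRWLockExclusive(&g_srwLock);
}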

Compared to a critical section, an SRWLock is missing some features:

  • There is no TryEnter(Shared/Exclusive)SRWLock: your calls to the AcquireSRWLock(Shared/Exclusive) functions block the calling thread if the lock is already owned.
  • It is not possible to recursively acquire an SRWLOCK; that is, a single thread cannot acquire the lock for writing multiple times and then release it with a corresponding number of ReleaseSRWLock* calls.

To summarize, if you want the best performance in an application, you should try to use nonshared data first and then, in order of increasing cost: volatile reads, volatile writes, interlocked APIs, SRWLocks, and critical sections. And if none of these works for your situation, then and only then, use kernel objects.

Condition variables are designed to simplify your life when implementing synchronization scenarios in which a thread has to atomically release a lock on a resource and block until a condition is met; the SleepConditionVariableCS and SleepConditionVariableSRW functions do exactly that.

A thread blocked inside these Sleep* functions is awakened when WakeConditionVariable or WakeAllConditionVariable is called by another thread that detects that the right condition is satisfied, such as the presence of an element to consume for a reader thread or enough room to insert a produced element for a writer thread.
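A minimal producer/consumer sketch of this pattern (the queue, its size, and the function names are hypothetical; the condition-variable and critical-section calls are the real API):

CRITICAL_SECTION g_cs;                  // initialized with InitializeCriticalSection
CONDITION_VARIABLE g_cvItemAvailable;   // initialized with InitializeConditionVariable
int g_queue[10];
int g_count = 0;

DWORD WINAPI ConsumerThread(PVOID pvParam) {
   EnterCriticalSection(&g_cs);
   while (g_count == 0) {
      // Atomically releases g_cs and blocks; reacquires g_cs before returning.
      SleepConditionVariableCS(&g_cvItemAvailable, &g_cs, INFINITE);
   }
   int item = g_queue[--g_count];       // consume an element
   LeaveCriticalSection(&g_cs);
   return(0);
}

void Produce(int item) {
   EnterCriticalSection(&g_cs);
   g_queue[g_count++] = item;           // produce an element
   LeaveCriticalSection(&g_cs);
   WakeConditionVariable(&g_cvItemAvailable);   // wake one sleeping consumer
}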

Thread Scheduling, Priorities, and Affinities

Every 20 milliseconds or so (as returned by the second parameter of the GetSystemTimeAdjustment function), Windows looks at all the thread kernel objects currently in existence. Of these objects, only some are considered schedulable. Windows selects one of the schedulable thread kernel objects and loads the CPU’s registers with the values that were last saved in the thread’s context. This action is called a context switch.

Windows is called a preemptive multithreaded operating system because a thread can be stopped at any time and another thread can be scheduled.

Developers frequently ask Jeff how they can guarantee that their thread will start running within some time period of some event—for example, how can you ensure that a particular thread will start running within 1 millisecond of data coming from the serial port? The easy answer: you can’t.

Real-time operating systems can make these promises, but Windows is not a real-time operating system. A real-time operating system requires intimate knowledge of the hardware it is running on so that it knows the latency associated with its hard disk controllers, keyboards, and so on.

In real life, an application must be careful when it calls SuspendThread because you have no idea what the thread might be doing when you attempt to suspend it. If the thread is attempting to allocate memory from a heap, for example, the thread will have a lock on the heap. As other threads attempt to access the heap, their execution will be halted until the first thread is resumed. SuspendThread is safe only if you know exactly what the target thread is doing (or might be doing) and you take extreme measures to avoid problems or deadlocks caused by suspending the thread.

Windows doesn’t offer any other way to suspend all threads in a process because of race conditions. For example, while the threads are suspended, a new thread might be created. Somehow the system must suspend any new threads during this window of time. Microsoft has integrated this functionality into the debugging mechanism of the system.

A race condition occurs when multiple processes access and manipulate the same data concurrently, and the outcome of the execution depends on the particular order in which the access takes place. A race condition is of interest to a hacker when the race condition can be utilized to gain privileged system access. Consider the following code snippet which illustrates a race condition:

if (access("/tmp/datafile", R_OK) == 0) {
   // Open for reading (the original fragment omitted the flags argument).
   fd = open("/tmp/datafile", O_RDONLY);
   process(fd);
   close(fd);
}

This code checks that the temporary file /tmp/datafile is readable and then opens it. The potential race condition occurs between the call to access() and the call to open(). If an attacker can replace the contents of /tmp/datafile between the access() and open() calls, he can manipulate the actions of the program that uses that data file. This is the race.

You probably understand why suspending a process does not work 100 percent of the time: while enumerating the set of threads, new threads can be created and destroyed. So after you take a snapshot of the current threads in the system, a new thread might appear in the target process, which the function will not suspend. Later, when you call SuspendProcess to resume the threads, it will resume a thread that it never suspended. Even worse, while it is enumerating the thread IDs, an existing thread might be destroyed and a new thread might be created, and both of these threads might have the same ID. This would cause the function to suspend some arbitrary thread (probably in a process other than the target process).

There are a few important things to notice about the Sleep function:

  • Calling Sleep allows the thread to voluntarily give up the remainder of its time slice.
  • The system makes the thread not schedulable for approximately the number of milliseconds specified. That’s right—if you tell the system you want to sleep for 100 milliseconds, you will sleep approximately that long, but possibly several seconds or minutes more. Remember that Windows is not a real-time operating system. Your thread will probably wake up at the right time, but whether it does depends on what else is going on in the system.
  • You can call Sleep and pass INFINITE for the dwMilliseconds parameter. This tells the system to never schedule the thread. This is not a useful thing to do. It is much better to have the thread exit and to recover its stack and kernel object.
  • You can pass 0 to Sleep. This tells the system that the calling thread relinquishes the remainder of its time slice, and it forces the system to schedule another thread. However, the system can reschedule the thread that just called Sleep. This will happen if there are no more schedulable threads at the same priority or higher.

Calling SwitchToThread is similar to calling Sleep and passing it a timeout of 0 milliseconds. The difference is that SwitchToThread allows lower-priority threads to execute. Sleep reschedules the calling thread immediately even if lower-priority threads are being starved.

Hyper-threading is a technology available on some Xeon, Pentium 4, and later CPUs. A hyper-threaded processor chip has multiple “logical” CPUs, and each can run a thread. Each thread has its own architectural state (set of registers), but all threads share main execution resources such as the CPU cache. When one thread is paused, the CPU automatically executes another thread; this happens without operating system intervention. A pause is a cache miss, branch misprediction, waiting for results of a previous instruction, and so on. Intel reports that hyper-threaded CPUs improve throughput somewhere between 10 percent to 30 percent, depending on the application and how it is using memory.

Sometimes you want to time how long it takes a thread to perform a particular task. What many people do is write code similar to the following, taking advantage of the new GetTickCount64 function:

// Get the current time (start time).
ULONGLONG qwStartTime = GetTickCount64();

// Perform complex algorithm here.

// Subtract start time from current time to get duration.
ULONGLONG qwElapsedTime = GetTickCount64() - qwStartTime;

This code makes a simple assumption: it won’t be interrupted. However, in a preemptive operating system, you never know when your thread will be scheduled CPU time. When CPU time is taken away from your thread, it becomes more difficult to time how long it takes your thread to perform various tasks. What we need is a function that returns the amount of CPU time that the thread has received.
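Windows offers such a function: GetThreadTimes. A sketch of how it might be used to measure only the CPU time actually charged to the thread (the FileTimeToQuad helper is hypothetical, just wrapping the usual ULARGE_INTEGER conversion):

ULONGLONG FileTimeToQuad(FILETIME ft) {   // hypothetical helper
   ULARGE_INTEGER li;
   li.LowPart = ft.dwLowDateTime;
   li.HighPart = ft.dwHighDateTime;
   return li.QuadPart;
}

FILETIME ftKernelStart, ftUserStart, ftKernelEnd, ftUserEnd, ftDummy;

GetThreadTimes(GetCurrentThread(), &ftDummy, &ftDummy, &ftKernelStart, &ftUserStart);

// Perform complex algorithm here.

GetThreadTimes(GetCurrentThread(), &ftDummy, &ftDummy, &ftKernelEnd, &ftUserEnd);

// Elapsed CPU time, in 100-nanosecond units.
ULONGLONG qwKernelElapsed = FileTimeToQuad(ftKernelEnd) - FileTimeToQuad(ftKernelStart);
ULONGLONG qwUserElapsed = FileTimeToQuad(ftUserEnd) - FileTimeToQuad(ftUserStart);
ULONGLONG qwTotalElapsed = qwKernelElapsed + qwUserElapsed;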

A CONTEXT structure contains processor-specific register data. The system uses CONTEXT structures to perform various internal operations. Refer to the header file WinNT.h for definitions of these structures.

Starvation occurs when higher-priority threads use so much CPU time that they prevent lower-priority threads from executing.

By the way, when the system boots, it creates a special thread called the zero page thread. This thread is assigned priority 0 and is the only thread in the entire system that runs at priority 0. The zero page thread is responsible for zeroing any free pages of RAM in the system when there are no other threads that need to perform work.

Windows supports six priority classes: idle, below normal, normal, above normal, high, and real-time. Of course, normal is the most common priority class and is used by 99 percent of the applications out there.

A process cannot run in the real-time priority class unless the user has the Increase Scheduling Priority privilege. Any user designated as an administrator or a power user has this privilege by default.

Windows supports seven relative thread priorities: idle, lowest, below normal, normal, above normal, highest, and time-critical.

The concept of a process priority class confuses some people. They think that this somehow means that processes are scheduled. Processes are never scheduled; only threads are scheduled. The process priority class is an abstract concept that Microsoft created to help isolate you from the internal workings of the scheduler; it serves no other purpose.

In general, a thread with a high priority level should not be schedulable most of the time. When the thread has something to do, it quickly gets CPU time. At this point, the thread should execute as few CPU instructions as possible and go back to sleep, waiting to be schedulable again. In contrast, a thread with a low priority level can remain schedulable and execute a lot of CPU instructions to do its work. If you follow these rules, the entire operating system will be responsive to its users.

It might seem odd that the process that creates a child process chooses the priority class at which the child process runs. Let’s consider Windows Explorer as an example. When you use Windows Explorer to run an application, the new process runs at normal priority. Windows Explorer has no idea what the process does or how often its threads need to be scheduled. However, once the child process is running, it can change its own priority class by calling SetPriorityClass.

The system determines the thread’s priority level by combining a thread’s relative priority with the priority class of the thread’s process. This is sometimes referred to as the thread’s base priority level. Occasionally, the system boosts the priority level of a thread—usually in response to some I/O event such as a window message or a disk read.

For example, a thread with a normal thread priority in a high-priority class process has a base priority level of 13. If the user presses a key, the system places a WM_KEYDOWN message in the thread’s queue. Because a message has appeared in the thread’s queue, the thread is schedulable. In addition, the keyboard device driver can tell the system to temporarily boost the thread’s level. So the thread might be boosted by 2 and have a current priority level of 15.

The system boosts only threads that have a base priority level between 1 and 15. In fact, this is why this range is referred to as the dynamic priority range. In addition, the system never boosts a thread into the real-time range (above 15). Because threads in the real-time range perform most operating system functions, enforcing a cap on the boost prevents an application from interfering with the operating system. Also, the system never dynamically boosts threads that are already in the real-time range (16 through 31).

Another situation causes the system to dynamically boost a thread’s priority level. Imagine a priority 4 thread that is ready to run but cannot because a priority 8 thread is constantly schedulable. In this scenario, the priority 4 thread is being starved of CPU time. When the system detects that a thread has been starved of CPU time for about three to four seconds, it dynamically boosts the starving thread’s priority to 15 and allows that thread to run for twice its time quantum. When the double time quantum expires, the thread’s priority immediately returns to its base priority.

When the user works with windows of a process, that process is said to be the foreground process and all other processes are background processes. Certainly, a user would prefer the process that he or she is using to behave more responsively than the background processes. To improve the responsiveness of the foreground process, Windows tweaks the scheduling algorithm for threads in the foreground process. The system gives foreground process threads a larger time quantum than they would usually receive. This tweak is performed only if the foreground process is of the normal priority class. If it is of any other priority class, no tweaking is performed.

By default, Windows Vista uses soft affinity when assigning threads to processors. This means that if all other factors are equal, it tries to run the thread on the processor it ran on last. Having a thread stay on a single processor helps reuse data that is still in the processor’s memory cache.

You can also control which CPUs run certain threads. This is called hard affinity.

In most environments, altering thread affinities interferes with the scheduler’s ability to migrate threads across CPUs in the way that makes the most efficient use of CPU time. The following table shows an example:

Thread   Priority   Allowed CPUs (affinity)
A        4          CPU 0
B        8          CPU 0, CPU 1
C        6          CPU 1

When Thread A wakes, the scheduler sees that the thread can run on CPU 0 and it is assigned to CPU 0. Thread B then wakes, and the scheduler sees that the thread can be assigned to CPU 0 or 1, but because CPU 0 is in use, the scheduler assigns it to CPU 1. So far, so good.

Now Thread C wakes, and the scheduler sees that it can run only on CPU 1. But CPU 1 is in use by Thread B, a priority 8 thread. Because Thread C is a priority 6 thread, it can’t preempt Thread B. Thread C can preempt Thread A, a priority 4 thread, but the scheduler will not preempt Thread A because Thread C can’t run on CPU 0.

This demonstrates how setting hard affinities for threads can interfere with the scheduler’s priority scheme.
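A sketch of the affinity APIs involved (the mask values are examples; bit 0 represents CPU 0):

// Restrict the whole process to CPUs 0 and 1.
SetProcessAffinityMask(GetCurrentProcess(), 0x00000003);

// Restrict the current thread to CPU 1 only.
SetThreadAffinityMask(GetCurrentThread(), 0x00000002);

// Soft affinity: just tell the scheduler which CPU we would prefer.
SetThreadIdealProcessor(GetCurrentThread(), 1);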

You can also set processor affinity in the header of an executable file. Oddly, there doesn’t seem to be a linker switch for this, but you can use code similar to this that takes advantage of functions declared in ImageHlp.h:

// Load the EXE into memory.
PLOADED_IMAGE pLoadedImage = ImageLoad(szExeName, NULL);

// Get the current load configuration information for the EXE.
IMAGE_LOAD_CONFIG_DIRECTORY ilcd;
GetImageConfigInformation(pLoadedImage, &ilcd);

// Change the processor affinity mask.
ilcd.ProcessAffinityMask = 0x00000003;   // I desire CPUs 0 and 1

// Save the new load configuration information.
SetImageConfigInformation(pLoadedImage, &ilcd);

// Unload the EXE from memory.
ImageUnload(pLoadedImage);

Thread Basics

As we know, any process consists of two components:

  • A process kernel object.
  • A process address space.

Similarly, a thread consists of:

  • A kernel object that the operating system uses to manage the thread. The kernel object is also where the system keeps statistical information about the thread.
  • A thread stack that maintains all the function parameters and local variables required as the thread executes code.

A process never executes anything; it is simply a container for threads. Threads are always created in the context of some process and live their entire life within that process. What this really means is that the thread executes code and manipulates data within its process’ address space. So if you have two or more threads running in the context of a single process, the threads share a single address space. The threads can execute the same code and manipulate the same data. Threads can also share kernel object handles because the handle table exists for each process, not each thread.

As you can see, processes use a lot more system resources than threads do. The reason for this is the address space. Creating a virtual address space for a process requires a lot of system resources. A lot of record keeping takes place in the system, and this requires a lot of memory. Also, because .exe and .dll files get loaded into an address space, file resources are required as well. A thread, on the other hand, uses significantly fewer system resources. In fact, a thread has just a kernel object and a stack; little record keeping is involved, and little memory is required.

Because threads require less overhead than processes, you should always try to solve your programming problems using additional threads and avoid creating new processes. However, don’t take this recommendation as law. Many designs are better implemented using multiple processes. You should be aware of the tradeoffs, and experience will guide you.

Every process has at least one thread in it. So if you do nothing special in your application, you already get a lot of benefit just from running on a multithreaded operating system. For example, you can build an application and use the word processor at the same time (something I do a lot). If the computer has two CPUs, the build executes on one processor while the other processor handles a document. In other words, the user notices no degradation in performance and there’s no glitch in the user interface as he types. Also, if the compiler has a bug that causes its thread to enter an infinite loop, you can still use other processes. (This is not true of 16-bit Windows and MS-DOS applications.)

Threads are incredibly useful and have a place, but when you use threads you can create new problems while trying to solve the old ones.

In almost all applications, all the user interface components (windows) should share the same thread. A single thread should definitely create all of a window’s child windows. Sometimes creating different windows on different threads is useful, but these occasions are rare indeed.

Although it is unusual for a single process to have multiple user interface threads, there are some valid uses for this. Windows Explorer creates a separate thread for each folder’s window. This allows you to copy files from one folder to another and still explore other folders on your system. Also, if Windows Explorer has a bug in it, the thread handling one folder might crash, but you can still manipulate other folders—at least until you do the thing that causes the other folder to crash too.

Every thread must have an entry-point function where it begins execution. We already discussed this entry-point function for your primary thread: _tmain or _tWinMain. If you want to create a secondary thread in your process, it must also have an entry-point function, which should look something like this:

DWORD WINAPI ThreadFunc(PVOID pvParam) {
   DWORD dwResult = 0;
   return(dwResult);
}

Your thread function can perform any task you want it to. Ultimately, your thread function will come to an end and return. At this point, your thread stops running, the memory for its stack is freed, and the usage count of your thread’s kernel object is decremented. If the usage count becomes 0, the thread kernel object is destroyed. Like process kernel objects, thread kernel objects always live at least as long as the thread they are associated with, but the object might live well beyond the lifetime of the thread itself.

Some notes about threads:

  • A thread entry-point function can have any name.
  • You should not worry about ANSI/Unicode issues (especially in passed parameters).
  • A thread function must return a value.
  • It’s good practice to use the thread’s local stack rather than global memory.

The system allocates memory out of the process’ address space for use by the thread’s stack. The new thread runs in the same process context as the creating thread. The new thread therefore has access to all of the process’ kernel object handles, all of the memory in the process, and the stacks of all other threads that are in this same process. This makes it really easy for multiple threads in a single process to communicate with each other.

The CreateThread function is the Windows function that creates a thread. However, if you are writing C/C++ code, you should never call CreateThread. Instead, you should use the Microsoft C++ run-time library function _beginthreadex. If you do not use Microsoft’s C++ compiler, your compiler vendor will have its own alternative to CreateThread. Whatever this alternative is, you must use it.

The cbStackSize parameter specifies how much address space the thread can use for its own stack. Every thread owns its own stack. When CreateProcess starts a process, it internally calls CreateThread to initialize the process’ primary thread. For the cbStackSize parameter, CreateProcess uses a value stored inside the executable file. You can control this value using the linker’s /STACK switch:

/STACK:[reserve][,commit]

The reserve argument sets the amount of address space the system should reserve for the thread’s stack. The default is 1 MB. The commit argument specifies the amount of physical storage that should be initially committed to the stack’s reserved region. The default is one page. As the code in your thread executes, you might require more than one page of storage. When your thread overflows its stack, an exception is generated. The system catches the exception and commits another page (or whatever you specified for the commit argument) to the reserved space, which allows a thread’s stack to grow dynamically as needed.

When you call CreateThread, passing a value other than 0 for cbStackSize causes the function to reserve and commit all storage for the thread’s stack. Because all the storage is committed up front, the thread is guaranteed to have the specified amount of stack storage available. The amount of reserved space is either the amount specified by the /STACK linker switch or the value of cbStackSize, whichever is larger. The amount of storage committed matches the value you passed for cbStackSize. If you pass 0 for the cbStackSize parameter, CreateThread reserves a region and commits the amount of storage indicated by the /STACK linker switch information embedded in the .exe file by the linker.

The reserve amount sets an upper limit for the stack so that you can catch endless recursion bugs in your code.

Remember that Windows is a preemptive multithreading system, which means that the new thread and the thread that called CreateThread can execute simultaneously. Because the threads run simultaneously, problems can occur. Watch out for code like this:

DWORD WINAPI FirstThread(PVOID pvParam) {
   // Initialize a stack-based variable.
   int x = 0;
   DWORD dwThreadID;

   // Create a new thread.
   HANDLE hThread = CreateThread(NULL, 0, SecondThread, (PVOID) &x, 0, &dwThreadID);

   // We don't reference the new thread anymore,
   // so close our handle to it.
   CloseHandle(hThread);

   // Our thread is done.
   // BUG: our stack will be destroyed, but
   // SecondThread might try to access it.
   return(0);
}

DWORD WINAPI SecondThread(PVOID pvParam) {
   // Do some lengthy processing here...

   // Attempt to access the variable on FirstThread's stack.
   // NOTE: This may cause an access violation - it depends on timing!
   *((int *) pvParam) = 5;

   return(0);
}

In the preceding code, FirstThread might finish its work before SecondThread assigns 5 to FirstThread’s x. If this happens, SecondThread won’t know that FirstThread no longer exists and will attempt to change the contents of what is now an invalid address. This causes SecondThread to raise an access violation because FirstThread’s stack is destroyed when FirstThread terminates. One way to solve this problem is to declare x as a static variable so that the compiler will create a storage area for x in the application’s data section rather than on the stack. You can also solve this problem by using user-mode or kernel-mode thread synchronization techniques.

A thread can be terminated in four ways:

  • The thread function returns. (This is highly recommended.)
  • The thread kills itself by calling the ExitThread function. (Avoid this method.)
  • A thread in the same process or in another one calls the TerminateThread function. (Avoid this method.)
  • The process containing the thread terminates. (Avoid this method.)

The following actions occur when a thread terminates:

  • All User object handles owned by the thread are freed. In Windows, most objects are owned by the process containing the thread that creates the objects. However, a thread owns two User objects: windows and hooks. When a thread dies, the system automatically destroys any windows and uninstalls any hooks that were created or installed by the thread. Other objects are destroyed only when the owning process terminates.
  • The thread’s exit code changes from STILL_ACTIVE to the code passed to ExitThread or TerminateThread.
  • The state of the thread kernel object becomes signaled.
  • If the thread is the last active thread in the process, the system considers the process terminated as well.
  • The thread kernel object’s usage count is decremented by 1.

Let’s look closely at exactly what’s going on when a thread is created.

  • A call to CreateThread causes the system to create a thread kernel object. This object has an initial usage count of 2. (The thread kernel object is not destroyed until the thread stops running and the handle returned from CreateThread is closed)
  • Other properties of the thread’s kernel object are also initialized: the suspension count is set to 1, the exit code is set to STILL_ACTIVE (0x103), and the object is set to the nonsignaled state.
  • Once the kernel object has been created, the system allocates memory, which is used for the thread’s stack. This memory is allocated from the process’ address space because threads don’t have an address space of their own.
  • The system then writes two values to the upper end of the new thread’s stack. (Thread stacks always build from high memory addresses to low memory addresses.) The first value written to the stack is the value of the pvParam parameter that you passed to CreateThread. Immediately below it is the pfnStartAddr value that you also passed to CreateThread.
  • Each thread has its own set of CPU registers, called the thread’s context. The context reflects the state of the thread’s CPU registers when the thread last executed. The set of CPU registers for the thread is saved in a CONTEXT structure (defined in the WinNT.h header file). The CONTEXT structure is itself contained in the thread’s kernel object.
  • The instruction pointer and stack pointer registers are the two most important registers in the thread’s context. Remember that threads always run in the context of a process. So both these addresses identify memory in the owning process’ address space. When the thread’s kernel object is initialized, the CONTEXT structure’s stack pointer register is set to the address of where pfnStartAddr was placed on the thread’s stack. The instruction pointer register is set to the address of an undocumented function called RtlUserThreadStart, which is exported by the NTDLL.dll module in the code below:

VOID RtlUserThreadStart(PTHREAD_START_ROUTINE pfnStartAddr, PVOID pvParam) {
   __try {
      ExitThread((pfnStartAddr)(pvParam));
   }
   __except(UnhandledExceptionFilter(GetExceptionInformation())) {
      ExitProcess(GetExceptionCode());
   }

   // NOTE: We never get here.
}

  • After the thread has completely initialized, the system checks to see whether the CREATE_SUSPENDED flag was passed to CreateThread. If this flag was not passed, the system decrements the thread’s suspend count to 0 and the thread can be scheduled to a processor. The system then loads the actual CPU registers with the values that were last saved in the thread’s context. The thread can now execute code and manipulate data in its process’ address space.
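A sketch of the CREATE_SUSPENDED flow just described (ThreadFunc is assumed to be defined elsewhere; production code would normally use _beginthreadex, as noted earlier, but CreateThread matches this discussion):

DWORD dwThreadID;

// Create the thread with its suspend count at 1 so that it cannot run yet.
HANDLE hThread = CreateThread(NULL, 0, ThreadFunc, NULL,
   CREATE_SUSPENDED, &dwThreadID);

// Adjust the thread before it executes any code (an example adjustment).
SetThreadPriority(hThread, THREAD_PRIORITY_ABOVE_NORMAL);

// Decrement the suspend count to 0; the thread is now schedulable.
ResumeThread(hThread);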

When the new thread executes the RtlUserThreadStart function, the following things happen:

  • A structured exception handling (SEH) frame is set up around your thread function so that any exceptions raised while your thread executes get some default handling by the system.
  • The system calls your thread function, passing it the pvParam parameter that you passed to the CreateThread function.
  • When your thread function returns, RtlUserThreadStart calls ExitThread, passing it your thread function’s return value. The thread kernel object’s usage count is decremented and the thread stops executing.
  • If your thread raises an exception that is not handled, the SEH frame set up by the RtlUserThreadStart function handles the exception. Usually, this means that a message box is presented to the user and that when the user dismisses the message box, RtlUserThreadStart calls ExitProcess to terminate the entire process, not just the offending thread.

Jobs

You often need to treat a group of processes as a single entity. For example, when you tell Microsoft Visual Studio to build a C++ project, it spawns Cl.exe, which might have to spawn additional processes (such as the individual passes of the compiler). But if the user wants to prematurely stop the build, Visual Studio must somehow be able to terminate Cl.exe and all its child processes. Solving this simple (and common) problem in Microsoft Windows has been notoriously difficult because Windows doesn’t maintain a parent/child relationship between processes. In particular, child processes continue to execute even after their parent process has been terminated.

Microsoft Windows offers a job kernel object that lets you group processes together and create a “sandbox” that restricts what the processes can do. It is best to think of a job object as a container of processes. However, it is also useful to create jobs that contain a single process because you can place restrictions on that process that you normally cannot.

Be aware that closing a job object does not force all the processes in the job to be terminated. The job object is actually marked for deletion and is destroyed automatically only after all the processes within the job have been terminated.

Note that closing the job’s handle causes the job to be inaccessible to all processes even though the job still exists, as shown in the following code:

// Create a named job object.
HANDLE hJob = CreateJobObject(NULL, TEXT("Jeff"));

// Put our own process in the job.
AssignProcessToJobObject(hJob, GetCurrentProcess());

// Closing the job does not kill our process or the job.
// But the name ("Jeff") is immediately disassociated with the job.
CloseHandle(hJob);

// Try to open the existing job.
hJob = OpenJobObject(JOB_OBJECT_ALL_ACCESS, FALSE, TEXT("Jeff"));
// OpenJobObject fails and returns NULL here because the name ("Jeff")
// was disassociated from the job when CloseHandle was called.
// There is no way to get a handle to this job now.

There are several types of restrictions you can place on a job:

  • The basic limit and extended basic limit prevent processes within a job from monopolizing the system’s resources.
  • Basic UI restrictions prevent processes within a job from altering the user interface.
  • Security limits prevent processes within a job from accessing secure resources (files, registry subkeys, and so on).

Well, certainly one of the most popular things that you will want to do with a job is kill all the processes within it. Visual Studio doesn’t have an easy way to stop a build that is in progress because it would have to know which processes were spawned from the first process that it spawned. (This is very tricky. Jeff explains how Developer Studio accomplished this in his Win32 Q & A column in the June 1998 issue of Microsoft Systems Journal, readable at http://www.microsoft.com/msj/0698/win320698.aspx.) Maybe future versions of Visual Studio will use jobs instead because the code is a lot easier to write and you can do much more with it.
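With a job, killing every contained process is a single call (hJob is a handle to the job; the exit code is whatever value you choose):

// Terminate all processes currently in the job, as if each had
// called ExitProcess with the given exit code.
TerminateJobObject(hJob, 1);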

Notice how you could be tempted to take advantage of functions from psapi.h, such as GetModuleFileNameEx and GetProcessImageFileName, to obtain the full pathname of a process given its process ID. However, the former fails when the job is notified that a new process is created under its constraints because the address space is not fully initialized: the modules are not yet mapped into it. The case of GetProcessImageFileName is interesting because it is able to retrieve the full pathname in that extreme condition, but the obtained syntax is closer to what you see in kernel mode than in user mode—for example, \Device\HarddiskVolume1\Windows\System32\notepad.exe instead of C:\Windows\System32\notepad.exe. This is why you should rely on the new QueryFullProcessImageName function, which returns the expected full pathname in all situations.

Processes

A process is usually defined as an instance of a running program and consists of two components:

  • A kernel object that the operating system uses to manage the process. The kernel object is also where the system keeps statistical information about the process.
  • An address space that contains all the executable or dynamic-link library (DLL) module’s code and data. It also contains dynamic memory allocations such as thread stacks and heap allocations.

Processes are inert. For a process to accomplish anything it must have a thread that runs in its context; this thread is responsible for executing the code contained in the process’ address space. In fact, a single process might contain several threads, all of them executing code “simultaneously” in the process’ address space. To do this, each thread has its own set of CPU registers and its own stack. Each process has at least one thread that executes code in the process’ address space.

If there were no threads executing code in the process’ address space, there would be no reason for the process to continue to exist, and the system would automatically destroy the process and its address space.

When you use Microsoft Visual Studio to create an application project, the integrated environment sets up various linker switches so that the linker embeds the proper type of subsystem in the resulting executable. This linker switch is /SUBSYSTEM:CONSOLE for CUI applications and /SUBSYSTEM:WINDOWS for GUI applications. When the user runs an application, the operating system’s loader looks inside the executable image’s header and grabs this subsystem value.

When your entry-point function returns, the startup function calls the C run-time exit function, passing it your return value (nMainRetVal). The exit function does the following:

  • It calls any functions registered by calls to the _onexit function.
  • It calls destructors for all global and static C++ class objects.
  • In DEBUG builds, leaks in the C/C++ run-time memory management are listed by a call to the _CrtDumpMemoryLeaks function if the _CRTDBG_LEAK_CHECK_DF flag has been set.
  • It calls the operating system’s ExitProcess function, passing it nMainRetVal. This causes the operating system to kill your process and set its exit code.

As it turns out, HMODULEs and HINSTANCEs are exactly the same thing. If the documentation for a function indicates that an HMODULE is required, you can pass an HINSTANCE, and vice versa. There are two data types because in 16-bit Windows HMODULEs and HINSTANCEs identified different things.

The base address where an executable file’s image loads is determined by the linker. Different linkers can use different default base addresses. The Visual Studio linker uses a default base address of 0x00400000 for a historical reason: this is the lowest address an executable file image can load to when you run Windows 98. You can change the base address that your application loads to by using the /BASE:address linker switch for Microsoft’s linker.

Keep in mind two important characteristics of the GetModuleHandle function.

  • First, it examines only the calling process’ address space. If the calling process does not use any common dialog functions, calling GetModuleHandle and passing it ComDlg32 causes NULL to be returned even though ComDlg32.dll is probably loaded into the address spaces of other processes.
  • Second, calling GetModuleHandle and passing a value of NULL returns the base address of the executable file in the process’ address space. So even if you call GetModuleHandle(NULL) from code that is contained inside a DLL, the value returned is the executable file’s base address—not the DLL file’s base address.
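
The following fragment illustrates both points (whether ComDlg32.dll is actually loaded depends, of course, on your process):

// Base address of the process' executable file, even when this
// code is contained inside a DLL.
HMODULE hExe = GetModuleHandle(NULL);

// NULL unless ComDlg32.dll is mapped into the calling process'
// address space, regardless of what other processes have loaded.
HMODULE hComDlg32 = GetModuleHandle(TEXT("ComDlg32"));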

Below is example code that retrieves the value of a specified environment variable:

PCTSTR pszVariableName = TEXT("TEMP");
PTSTR pszValue = NULL;

// Get the size, in characters, of the buffer required to store the value
DWORD dwResult = GetEnvironmentVariable(pszVariableName, pszValue, 0);

if (dwResult != 0) {
   // Allocate the buffer to store the environment variable value
   pszValue = (PTSTR)malloc(dwResult * sizeof(TCHAR));
   GetEnvironmentVariable(pszVariableName, pszValue, dwResult);
   _tprintf(TEXT("%s=%s\n"), pszVariableName, pszValue);
   free(pszValue);
} else {
   _tprintf(TEXT("'%s'=<unknown value>\n"), pszVariableName);
}

Many strings contain replaceable portions within them. For example, I found this string somewhere in the registry:

%USERPROFILE%\Documents

The portion that appears in between percent signs (%) indicates a replaceable string. In this case, the value of the environment variable, USERPROFILE, should be placed in the string. On my machine, the value of my USERPROFILE environment variable is

C:\Users\jrichter

So, after performing the string replacement, the resulting string becomes

C:\Users\jrichter\Documents

Because this type of string replacement is common, Windows offers the ExpandEnvironmentStrings function:

DWORD ExpandEnvironmentStrings(PCTSTR pszSrc, PTSTR pszDst, DWORD chSize);

When you call this function, the pszSrc parameter is the address of the string that contains replaceable environment variable strings. The pszDst parameter is the address of the buffer that will receive the expanded string, and the chSize parameter is the maximum size of this buffer, in characters. The returned value is the size, in characters, of the buffer needed to store the expanded string. If the chSize parameter is less than this value, the %variable% placeholders are not expanded but are replaced by empty strings. So you usually call ExpandEnvironmentStrings twice, as shown in the following code snippet:

DWORD chValue = ExpandEnvironmentStrings(TEXT("PATH='%PATH%'"), NULL, 0);
PTSTR pszBuffer = new TCHAR[chValue];
chValue = ExpandEnvironmentStrings(TEXT("PATH='%PATH%'"), pszBuffer, chValue);
_tprintf(TEXT("%s\r\n"), pszBuffer);
delete[] pszBuffer;

Finally, you can use the SetEnvironmentVariable function to add a variable, delete a variable, or modify a variable’s value:

BOOL SetEnvironmentVariable(PCTSTR pszName, PCTSTR pszValue);

This function sets the variable identified by the pszName parameter to the value identified by the pszValue parameter. If a variable with the specified name already exists, SetEnvironmentVariable modifies its value; if the variable doesn’t exist, it is added. If pszValue is NULL, the variable is deleted from the environment block.
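
For example (MYVAR is a hypothetical variable name):

// Create the variable, or modify its value if it already exists.
SetEnvironmentVariable(TEXT("MYVAR"), TEXT("SomeValue"));

// Passing NULL for the value deletes the variable.
SetEnvironmentVariable(TEXT("MYVAR"), NULL);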

Normally, threads within a process can execute on any of the CPUs in the host machine. However, a process’ threads can be forced to run on a subset of the available CPUs. This is called processor affinity. Child processes inherit the affinity of their parent processes.
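
For instance, the following sketch restricts the current process’ threads to CPUs 0 and 1 by calling SetProcessAffinityMask:

// Bits 0 and 1 of the affinity mask select CPUs 0 and 1.
SetProcessAffinityMask(GetCurrentProcess(), 0x3);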

Associated with each process is a set of flags that tells the system how the process should respond to serious errors, which include disk media failures, unhandled exceptions, file-find failures, and data misalignment. A process can tell the system how to handle each of these errors by calling the SetErrorMode function:

UINT SetErrorMode(UINT fuErrorMode);

The fuErrorMode parameter is a combination of any of the following flags bitwise ORed together: SEM_FAILCRITICALERRORS, SEM_NOGPFAULTERRORBOX, SEM_NOOPENFILEERRORBOX, and SEM_NOALIGNMENTFAULTEXCEPT.
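
For example, to keep the system from displaying the critical-error and general-protection-fault message boxes, you might write something like this (SetErrorMode returns the previous mode so that you can restore it later):

// Let the process handle critical errors and unhandled exceptions
// itself instead of having the system display message boxes.
UINT uOldMode = SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);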

When full pathnames are not supplied, the various Windows functions look for files and directories in the current directory of the current drive. For example, if a thread in a process calls CreateFile to open a file (without specifying a full pathname), the system looks for the file in the current drive and directory.

If you call a function, passing a drive-qualified name indicating a drive that is not the current drive, the system looks in the process’ environment block for the variable associated with the specified drive letter. If the variable for the drive exists, the system uses the variable’s value as the current directory. If the variable does not exist, the system assumes that the current directory for the specified drive is its root directory.
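
As an illustration, the hypothetical drive-qualified relative name D:readme.txt below resolves against whatever current directory the system has recorded for drive D:

TCHAR szFullPath[MAX_PATH];

// Resolves to <current directory of drive D>\readme.txt if a current
// directory is recorded for drive D, and to D:\readme.txt otherwise.
GetFullPathName(TEXT("D:readme.txt"), MAX_PATH, szFullPath, NULL);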

You create a process with the CreateProcess function:

BOOL CreateProcess(
   PCTSTR pszApplicationName,
   PTSTR pszCommandLine,
   PSECURITY_ATTRIBUTES psaProcess,
   PSECURITY_ATTRIBUTES psaThread,
   BOOL bInheritHandles,
   DWORD fdwCreate,
   PVOID pvEnvironment,
   PCTSTR pszCurDir,
   PSTARTUPINFO psiStartInfo,
   PPROCESS_INFORMATION ppiProcInfo);

Here are the steps the system performs to run an application:

  • When a thread calls CreateProcess, the system creates a process kernel object with an initial usage count of 1.
  • The system then creates a virtual address space for the new process and loads the code and data for the executable file and any required DLLs into the process’ address space.
  • The system then creates a thread kernel object (with a usage count of 1) for the new process’ primary thread.
  • This primary thread begins by executing the C/C++ run-time startup code, which the linker set as the application’s entry point. The startup code eventually calls your WinMain, wWinMain, main, or wmain function.
  • If the system successfully creates the new process and primary thread, CreateProcess returns TRUE.

By the way, if you are calling the ANSI version of CreateProcess on Windows Vista, you will not get an access violation because a temporary copy of the command-line string is made. (For more information, read “Working with Characters and Strings.”)

Process priority classes affect how the threads contained within the process are scheduled with respect to other processes’ threads.

If you need to pass both attributes at the same time, don’t forget that the handles listed in PROC_THREAD_ATTRIBUTE_HANDLE_LIST must be valid in the new parent process identified by PROC_THREAD_ATTRIBUTE_PARENT_PROCESS, because they will be inherited from that process, not from the process calling CreateProcess.
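
Here is a sketch of the mechanics for the parent-process attribute alone (hParent is an assumed handle to the process that should become the new parent, and error checking is omitted):

// Ask the system how much memory a one-attribute list requires.
SIZE_T cbAttributeList = 0;
InitializeProcThreadAttributeList(NULL, 1, 0, &cbAttributeList);

LPPROC_THREAD_ATTRIBUTE_LIST pAttributeList =
   (LPPROC_THREAD_ATTRIBUTE_LIST)malloc(cbAttributeList);
InitializeProcThreadAttributeList(pAttributeList, 1, 0, &cbAttributeList);

// hParent: assumed handle to the desired parent process.
UpdateProcThreadAttribute(pAttributeList, 0,
   PROC_THREAD_ATTRIBUTE_PARENT_PROCESS,
   &hParent, sizeof(HANDLE), NULL, NULL);

STARTUPINFOEX siex = { 0 };
siex.StartupInfo.cb = sizeof(siex);
siex.lpAttributeList = pAttributeList;

// Pass EXTENDED_STARTUPINFO_PRESENT in fdwCreate and &siex.StartupInfo
// as the STARTUPINFO pointer when calling CreateProcess; afterward:
DeleteProcThreadAttributeList(pAttributeList);
free(pAttributeList);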

For some reason, many developers believe that closing the handle to a process or thread forces the system to kill that process or thread. This is absolutely not true. Closing the handle simply tells the system that you are not interested in the process or thread’s statistical data. The process or thread will continue to execute until it terminates on its own.

Task Manager creates this “System Idle Process” as a placeholder for the Idle thread that runs when nothing else is running. The number of threads in the System Idle Process is always equal to the number of CPUs in the machine. As such, it always represents the percentage of CPU time that is not being used by real processes.

If your application uses IDs to track processes and threads, you must be aware that the system reuses process and thread IDs immediately.

You can discover the ID of the current process by using GetCurrentProcessId and the ID of the running thread by calling GetCurrentThreadId. You can also get the ID of a process given its handle by using GetProcessId and the ID of a thread given its handle by using GetThreadId. Last but not least, from a thread handle, you can determine the ID of its owning process by calling GetProcessIdOfThread.
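
A quick illustration (hProcess and hThread are assumed handles obtained elsewhere):

DWORD dwProcessId = GetCurrentProcessId();
DWORD dwThreadId  = GetCurrentThreadId();

// From handles obtained elsewhere:
DWORD dwPidFromHandle   = GetProcessId(hProcess);
DWORD dwTidFromHandle   = GetThreadId(hThread);
DWORD dwOwningProcessId = GetProcessIdOfThread(hThread);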

Occasionally, you’ll work on an application that wants to determine its parent process. The first thing you should know is that a parent-child relationship exists between processes only at the time when the child is spawned. Once the child process begins executing code, Windows no longer considers a parent-child relationship to exist.

A process can be terminated in four ways:

  • The primary thread’s entry-point function returns. (This is highly recommended.)
  • One thread in the process calls the ExitProcess function. (Avoid this method.)
  • A thread in another process calls the TerminateProcess function. (Avoid this method.)
  • All the threads in the process just die on their own. (This hardly ever happens.)

Having your primary thread’s entry-point function return ensures the following:

  • Any C++ objects created by this thread will be destroyed properly using their destructors.
  • The operating system will properly free the memory used by the thread’s stack.
  • The system will set the process’ exit code (maintained in the process kernel object) to your entry-point function’s return value.
  • The system will decrement the process kernel object’s usage count.

When your primary thread’s entry-point function (WinMain, wWinMain, main, or wmain) returns, it returns to the C/C++ run-time startup code, which properly cleans up all the C run-time resources used by the process. After the C run-time resources have been freed, the C run-time startup code explicitly calls ExitProcess, passing it the value returned from your entry-point function. This explains why simply returning from your primary thread’s entry-point function terminates the entire process. Note that any other threads running in the process terminate along with the process.

Note that calling ExitProcess or ExitThread causes a process or thread to die while inside a function. As far as the operating system is concerned, this is fine, and all of the process’ or thread’s operating system resources will be cleaned up perfectly. However, a C/C++ application should avoid calling these functions because the C/C++ run time might not be able to clean up properly. Examine the following code:

 

#include <windows.h>
#include <stdio.h>

class CSomeObj {
public:
   CSomeObj()  { printf("Constructor\r\n"); }
   ~CSomeObj() { printf("Destructor\r\n"); }
};

CSomeObj g_GlobalObj;

void main () {
   CSomeObj LocalObj;
   ExitProcess(0);    // This shouldn't be here

   // At the end of this function, the compiler automatically added
   // the code necessary to call LocalObj's destructor.
   // ExitProcess prevents it from executing.
}

 

When the preceding code executes, you’ll see the following:

 

Constructor
Constructor

Two objects are being constructed: a global object and a local object. However, you’ll never see the word Destructor appear. The C++ objects are not destroyed properly because ExitProcess forces the process to die on the spot: the C/C++ run time is not given a chance to clean up.

As I’ve said, you should never call ExitProcess explicitly. If I remove the call to ExitProcess in the preceding code, running the program yields this:

 

Constructor
Constructor
Destructor
Destructor

 

When a process terminates, the following actions are set in motion:

  • Any remaining threads in the process are terminated.
  • All the User and GDI objects allocated by the process are freed, and all the kernel objects are closed. (These kernel objects are destroyed if no other process has open handles to them. However, the kernel objects are not destroyed if other processes do have open handles to them.)
  • The process’ exit code changes from STILL_ACTIVE to the code passed to ExitProcess or TerminateProcess.
  • The process kernel object’s status becomes signaled. This is how other threads in the system can suspend themselves until the process is terminated.
  • The process kernel object’s usage count is decremented by 1.

When you design an application, you might encounter situations in which you want another block of code to perform work. You assign work like this all the time by calling functions or subroutines. When you call a function, your code cannot continue processing until the function has returned. And in many situations, this single-tasking synchronization is needed. An alternative way to have another block of code perform work is to create a new thread within your process and have it help with the processing. This lets your code continue processing while the other thread performs the work you requested. This technique is useful, but it creates synchronization problems when your thread needs to see the results of the new thread.

Another approach is to spawn off a new process—a child process—to help with the work. Suppose that, to process the work, you simply create a new thread within your own process: you write some code, test it, and get some incorrect results. You might have an error in your algorithm, or maybe you dereferenced something incorrectly and accidentally overwrote something important in your address space. One way to protect your address space while having the work processed is to have a new process perform the work. You can then wait for the new process to terminate before continuing with your own work, or you can continue working while the new process works.

Unfortunately, the new process probably needs to perform operations on data contained in your address space. In this case, it might be a good idea to have the process run in its own address space and simply give it access to the relevant data contained in the parent process’ address space, thus protecting all the data not relevant to the task at hand. Windows offers several methods for transferring data between different processes: Dynamic Data Exchange (DDE), OLE, pipes, mailslots, and so on. One of the most convenient ways to share the data is to use memory-mapped files.

If you want to create a new process, have it do some work, and wait for the result, you can use code similar to the following:

 

PROCESS_INFORMATION pi;
DWORD dwExitCode;

// Spawn the child process.
BOOL fSuccess = CreateProcess(..., &pi);
if (fSuccess) {
   // Close the thread handle as soon as it is no longer needed!
   CloseHandle(pi.hThread);

   // Suspend our execution until the child has terminated.
   WaitForSingleObject(pi.hProcess, INFINITE);

   // The child process terminated; get its exit code.
   GetExitCodeProcess(pi.hProcess, &dwExitCode);

   // Close the process handle as soon as it is no longer needed.
   CloseHandle(pi.hProcess);
}

 

You’ll notice that in the code fragment, we close the handle to the child process’ primary thread kernel object immediately after CreateProcess returns. This does not cause the child’s primary thread to terminate—it simply decrements the usage count of the child’s primary thread object. Here’s why this practice is a good idea: Suppose that the child process’ primary thread spawns off another thread and then the primary thread terminates. At this point, the system can free the child’s primary thread object from its memory if the parent process doesn’t have an outstanding handle to this thread object. But if the parent process does have a handle to the child’s thread object, the system can’t free the object until the parent process closes the handle.

Most of the time, an application starts another process as a detached process. This means that after the process is created and executing, the parent process doesn’t need to communicate with the new process or doesn’t require it to complete its work before the parent process continues. This is how Windows Explorer works. After Windows Explorer creates a new process for the user, it doesn’t care whether that process continues to live or whether the user terminates it.

To give up all ties to the child process, Windows Explorer must close its handles to the new process and its primary thread by calling CloseHandle. The following code example shows how to create a new process and how to let it run detached:

 

PROCESS_INFORMATION pi;

// Spawn the child process.
BOOL fSuccess = CreateProcess(..., &pi);
if (fSuccess) {
   // Allow the system to destroy the process & thread kernel
   // objects as soon as the child process terminates.
   CloseHandle(pi.hThread);
   CloseHandle(pi.hProcess);
}

 

Many people ask why Windows Vista doesn’t just ask once and then let the user’s desire to run a specific application as Administrator be stored in the system so that Windows Vista never asks the user again. Windows Vista doesn’t offer this because it would have to store the data somewhere (like in the registry or in some other file), and if this store ever got compromised, an application could modify the store so that its malware always ran elevated without the user being prompted.

If your application always requires Administrator privileges, such as during an installation step, the operating system can automatically prompt the user for privileges elevation each time your application is invoked. How do the UAC components of Windows decide what to do when a new process is spawned?

If a specific kind of resource (RT_MANIFEST) is found embedded within the application executable, the system looks for the <trustInfo> section and parses its contents. Here is an example of this section in the manifest file:

 

...
<trustInfo xmlns="urn:schemas-microsoft-com:asm.v2">
   <security>
      <requestedPrivileges>
         <requestedExecutionLevel
            level="requireAdministrator"
         />
      </requestedPrivileges>
   </security>
</trustInfo>
...
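
By the way, you usually don’t have to author this section by hand: assuming Visual Studio 2008 or later, the linker can embed an equivalent manifest for you via its /MANIFESTUAC switch, for example:

/MANIFESTUAC:"level='requireAdministrator' uiAccess='false'"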