Discovering a Bug in Windows System Programming Edition 4

I’ve discovered a bug in the book Windows System Programming. The buggy code is shown below:

/* SkipArg.c
   Skip one command line argument – skip tabs and spaces. */

#include "Everything.h"

LPTSTR SkipArg (LPCTSTR targv)
{
   LPTSTR p = (LPTSTR)targv;
   /* Skip up to the next tab or space. */
   while (*p != _T('\0') && *p != _T(' ') && *p != _T('\t')) p++;
   /* Skip over tabs and spaces to the next arg. */
   while (*p != _T('\0') && (*p == _T(' ') || *p == _T('\t'))) p++;
   return p;
}
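The bug being reported is that SkipArg does not account for quoted arguments that contain spaces. The author’s corrected version was not posted in the thread, but a quote-aware variant might look like the sketch below. It uses plain char strings instead of the book’s TCHAR/_T macros so it stands alone, and the quoting rule (a double quote toggles whether a space ends the argument) is my assumption about the intended fix, not the author’s actual solution.

```c
#include <stddef.h>

/* Sketch of a quote-aware SkipArg: a double quote toggles an
   "inside quotes" flag, and a space or tab only ends the argument
   when we are outside quotes. */
const char *SkipArgQuoted(const char *targv)
{
    const char *p = targv;
    int inQuotes = 0;

    /* Skip up to the next unquoted tab or space. */
    while (*p != '\0' && (inQuotes || (*p != ' ' && *p != '\t'))) {
        if (*p == '"') inQuotes = !inQuotes;
        p++;
    }
    /* Skip over tabs and spaces to the next argument. */
    while (*p == ' ' || *p == '\t') p++;
    return p;
}
```

With this rule, SkipArgQuoted("\"a b\" c") correctly skips the whole quoted argument and returns a pointer to "c", where the original SkipArg would have stopped inside the quotes.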

This was the last reply from John (the author of the book):

My apologies. I see what you’re talking about now. The correct solution, I think, is to fix SkipArg so that it accounts for quotes in the command line. I’ll take a look at it and send you the solution. SkipArg is used in other projects, so it’s worth fixing.
Thanks for pointing this out. Interestingly, in 17 years since Edition 1, no one seems to have noticed this problem, or, at least, no one has reported it.

Further, you can find more details by searching for the word “‘Ogail” on the book’s web page.

DLL Injection and API Hooking

Situations that require breaking through process boundary walls to access another process’ address space include the following:

  • When you want to subclass a window created by another process
  • When you need debugging aids—for example, when you need to determine which dynamic-link libraries (DLLs) another process is using
  • When you want to hook other processes

In computer programming, DLL injection is a technique used to run code within the address space of another process by forcing it to load a dynamic-link library. DLL injection is often used by third-party developers to influence the behavior of a program in a way its authors did not anticipate or intend. For example, the injected code could trap system function calls, or read the contents of password textboxes, which cannot be done in the usual way.

Common DLL injection approaches:

  • Using the registry
  • Using Windows hooks
  • Using remote threads
  • Using Trojans
  • As a debugger
  • Injecting code with CreateProcess

Of all the methods for injecting a DLL, using the registry is by far the easiest. All you do is add two values to an already existing registry key. But this technique also has some disadvantages:

  • Your DLL is mapped only into processes that use User32.dll. All GUI-based applications use User32.dll, but most CUI-based applications do not. So if you need to inject your DLL into a compiler or linker, this method won’t work.
  • Your DLL is mapped into every GUI-based application, but you probably need to inject your library into only one or a few processes. The more processes your DLL is mapped into, the greater the chance of crashing the “container” processes. After all, threads running in these processes are executing your code. If your code enters an infinite loop or accesses memory incorrectly, you affect the behavior and robustness of the processes in which your code runs. Therefore, it is best to inject your library into as few processes as possible.
  • Your DLL is mapped into every GUI-based application for its entire lifetime. This is similar to the previous problem. Ideally, your DLL should be mapped into just the processes you need, and it should be mapped into those processes for the minimum amount of time. Suppose that when the user invokes your application, you want to subclass WordPad’s main window. Your DLL doesn’t have to be mapped into WordPad’s address space until the user invokes your application. If the user later decides to terminate your application, you’ll want to unsubclass WordPad’s main window. In this case, your DLL no longer needs to be injected into WordPad’s address space. It’s best to keep your DLL injected only when necessary.

Following are steps to inject a DLL using Remote Threads:

  1. Use the VirtualAllocEx function to allocate memory in the remote process’ address space.
  2. Use the WriteProcessMemory function to copy the DLL’s pathname to the memory allocated in step 1.
  3. Use the GetProcAddress function to get the real address (inside Kernel32.dll) of the LoadLibraryW or LoadLibraryA function.
  4. Use the CreateRemoteThread function to create a thread in the remote process that calls the proper LoadLibrary function, passing it the address of the memory allocated in step 1. At this point, the DLL has been injected into the remote process’ address space, and the DLL’s DllMain function receives a DLL_PROCESS_ATTACH notification and can execute the desired code. When DllMain returns, the remote thread returns from its call to LoadLibraryW/A back to the BaseThreadStart function. BaseThreadStart then calls ExitThread, causing the remote thread to die.

    Now the remote process has the block of storage allocated in step 1, and the DLL is still stuck in its address space. To clean this stuff up, we need to execute the following steps after the remote thread exits:

  5. Use the VirtualFreeEx function to free the memory allocated in step 1.
  6. Use the GetProcAddress function to get the real address (inside Kernel32.dll) of the FreeLibrary function.
  7. Use the CreateRemoteThread function to create a thread in the remote process that calls the FreeLibrary function, passing the remote DLL’s HMODULE.
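The seven steps above can be sketched as a single routine. This is a minimal sketch, not production code: InjectDll is an illustrative name, error handling is collapsed into one cleanup path, and it assumes a Unicode DLL path and that Kernel32.dll is mapped at the same address in both processes (true for ordinary Windows processes of the same bitness).

```c
#include <windows.h>

// Sketch: inject a DLL into the process identified by dwProcessId by
// creating a remote thread that calls LoadLibraryW on the DLL's path.
BOOL InjectDll(DWORD dwProcessId, PCWSTR pszDllPath) {
    BOOL fOk = FALSE;
    HANDLE hProcess = NULL, hThread = NULL;
    PWSTR pszRemotePath = NULL;
    SIZE_T cb = (wcslen(pszDllPath) + 1) * sizeof(WCHAR);

    hProcess = OpenProcess(PROCESS_CREATE_THREAD | PROCESS_VM_OPERATION |
        PROCESS_VM_WRITE | PROCESS_QUERY_INFORMATION, FALSE, dwProcessId);
    if (hProcess == NULL) goto cleanup;

    // Step 1: allocate memory in the remote process' address space.
    pszRemotePath = (PWSTR)VirtualAllocEx(hProcess, NULL, cb,
        MEM_COMMIT, PAGE_READWRITE);
    if (pszRemotePath == NULL) goto cleanup;

    // Step 2: copy the DLL's pathname into that memory.
    if (!WriteProcessMemory(hProcess, pszRemotePath,
        (PVOID)pszDllPath, cb, NULL)) goto cleanup;

    // Step 3: get the real address of LoadLibraryW inside Kernel32.dll.
    PTHREAD_START_ROUTINE pfn = (PTHREAD_START_ROUTINE)
        GetProcAddress(GetModuleHandleW(L"Kernel32.dll"), "LoadLibraryW");
    if (pfn == NULL) goto cleanup;

    // Step 4: create a remote thread that calls LoadLibraryW(pszRemotePath).
    hThread = CreateRemoteThread(hProcess, NULL, 0, pfn, pszRemotePath, 0, NULL);
    if (hThread == NULL) goto cleanup;

    WaitForSingleObject(hThread, INFINITE);
    fOk = TRUE;

cleanup:
    // Step 5: free the memory allocated in step 1.
    if (pszRemotePath != NULL)
        VirtualFreeEx(hProcess, pszRemotePath, 0, MEM_RELEASE);
    if (hThread != NULL) CloseHandle(hThread);
    if (hProcess != NULL) CloseHandle(hProcess);
    return fOk;
}
```

Steps 6 and 7 (ejecting the DLL with FreeLibrary) follow the same CreateRemoteThread pattern, passing the remote DLL’s HMODULE as the thread parameter.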

Another way to inject a DLL is to replace a DLL that you know a process will load. For example, if you know that a process will load Xyz.dll, you can create your own DLL and give it the same filename. Of course, you must rename the original Xyz.dll to something else.

Inside your Xyz.dll, you must export all the same symbols that the original Xyz.dll exported. You can do this easily using function forwarders, which make it trivially simple to hook certain functions, but you should avoid using this technique because it is not version-resilient. If you replace a system DLL, for example, and Microsoft adds new functions in the future, your DLL will not have function forwarders for them. Applications that reference these new functions will be unable to load and execute.

If you have just a single application in which you want to use this technique, you can give your DLL a unique name and change the import section of the application’s .exe module. More specifically, the import section contains the names of the DLLs required by a module. You can rummage through this import section in the file and alter it so that the loader loads your own DLL. This technique is not too bad, but you have to be pretty familiar with the .exe and DLL file formats.

API Hooking

Jeff mentioned an example that is very well suited to API hooking:

“For example, I know of a company that produced a DLL that was loaded by a database product. The DLL’s job was to enhance and extend the capabilities of the database product. When the database product was terminated, the DLL received a DLL_PROCESS_DETACH notification and only then executed all of its cleanup code. The DLL would call functions in other DLLs to close socket connections, files, and other resources, but by the time it received the DLL_PROCESS_DETACH notification, other DLLs in the process’ address space had already gotten their DLL_PROCESS_DETACH notifications. So when the company’s DLL tried to clean up, many of the functions it called would fail because the other DLLs had already uninitialized.

The company hired me to help them solve this problem, and I suggested that we hook the ExitProcess function. As you know, calling ExitProcess causes the system to notify the DLLs with DLL_PROCESS_DETACH notifications. By hooking the ExitProcess function, we ensured that the company’s DLL was notified when ExitProcess was called. This notification would come in before any DLLs got a DLL_PROCESS_DETACH notification; therefore, all the DLLs in the process were still initialized and functioning properly. At this point, the company’s DLL would know that the process was about to terminate and could perform all of its cleanup successfully. Then the operating system’s ExitProcess function would be called, causing all the DLLs to receive their DLL_PROCESS_DETACH notifications and clean up. The company’s DLL would have no special cleanup to perform when it received this notification because it had already done what it needed to do”

A hook function is the function that replaces the original function in the target process.

API Hooking Methodologies:

  • API Hooking by overwriting code.
  • API Hooking by Manipulating a Module’s Import Section

Following are the steps for API hooking by overwriting code:

  1. Save the first few bytes of this function in some memory of your own.
  2. You overwrite the first few bytes of this function with a JUMP CPU instruction that jumps to the memory address of your replacement function. Of course, your replacement function must have exactly the same signature as the function you’re hooking: all the parameters must be the same, the return value must be the same, and the calling convention must be the same.
  3. Now, when a thread calls the hooked function, the JUMP instruction will actually jump to your replacement function. At this point, you can execute whatever code you’d like.
  4. You unhook the function by taking the saved bytes (from step 1) and placing them back at the beginning of the hooked function.
  5. You call the hooked function (which is no longer hooked), and the function performs its normal processing.
  6. When the original function returns, you execute steps 2 and 3 again so that your replacement function will be called in the future.

This method was heavily used by 16-bit Windows programmers and worked just fine in that environment. Today, this method has several serious shortcomings, and I strongly discourage its use. First, it is CPU-dependent: JUMP instructions on x86, x64, IA-64, and other CPUs are different, and you must use hand-coded machine instructions to get this technique to work. Second, this method doesn’t work at all in a preemptive, multithreaded environment. It takes time for a thread to overwrite the code at the beginning of a function. While the code is being overwritten, another thread might attempt to call the same function. The results are disastrous! So this method works only if you know that no more than one thread will attempt to call a particular function at any given time.

API Hooking by Manipulating a Module’s Import Section

As it turns out, another API hooking technique solves both of the problems I’ve mentioned. This technique is easy to implement and is quite robust. But to understand it, you must understand how dynamic linking works. In particular, you must understand what’s contained in a module’s import section.

As you know, a module’s import section contains the set of DLLs that the module requires in order to run. In addition, it contains the list of symbols that the module imports from each of the DLLs. When the module places a call to an imported function, the thread actually grabs the address of the desired imported function from the module’s import section and then jumps to that address.

So, to hook a particular function, all you do is change the address in the module’s import section. That’s it. No CPU-dependent stuff. And because you’re not modifying the function’s code in any way, you don’t need to worry about any thread synchronization issues.
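The technique can be sketched as follows, modeled loosely on the ReplaceIATEntryInOneMod routine Jeff presents. The ImageDirectoryEntryToData helper (from DbgHelp) locates the import section; the VirtualProtect calls are my assumption, to cover the case where the import section’s pages are mapped read-only.

```c
#include <windows.h>
#include <dbghelp.h>   // ImageDirectoryEntryToData
#pragma comment(lib, "dbghelp.lib")

// Sketch: in hmodCaller's import section, replace the entry for
// pfnCurrent (imported from pszCalleeModName) with pfnNew.
void ReplaceIATEntry(PCSTR pszCalleeModName, PROC pfnCurrent,
                     PROC pfnNew, HMODULE hmodCaller) {
    ULONG ulSize;
    PIMAGE_IMPORT_DESCRIPTOR pImportDesc = (PIMAGE_IMPORT_DESCRIPTOR)
        ImageDirectoryEntryToData(hmodCaller, TRUE,
            IMAGE_DIRECTORY_ENTRY_IMPORT, &ulSize);
    if (pImportDesc == NULL) return;   // module has no import section

    // Find the import descriptor for the DLL we're interested in.
    for (; pImportDesc->Name != 0; pImportDesc++) {
        PSTR pszModName = (PSTR)((PBYTE)hmodCaller + pImportDesc->Name);
        if (lstrcmpiA(pszModName, pszCalleeModName) != 0) continue;

        // Walk the import address table looking for the current address.
        PIMAGE_THUNK_DATA pThunk = (PIMAGE_THUNK_DATA)
            ((PBYTE)hmodCaller + pImportDesc->FirstThunk);
        for (; pThunk->u1.Function != 0; pThunk++) {
            PROC *ppfn = (PROC *)&pThunk->u1.Function;
            if (*ppfn == pfnCurrent) {
                // The IAT page may be read-only; make it writable first.
                DWORD dwOld;
                VirtualProtect(ppfn, sizeof(PROC), PAGE_READWRITE, &dwOld);
                *ppfn = pfnNew;
                VirtualProtect(ppfn, sizeof(PROC), dwOld, &dwOld);
                return;
            }
        }
    }
}
```

Note that only the address stored in the import table changes; the hooked function’s code is untouched, which is why no thread synchronization is needed.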

You should also take delay-loaded DLLs into consideration; in particular, you can hook the LoadLibrary* functions so that once such a DLL is actually loaded, you can then hook its desired functions.

Thread-Local Storage

All threads of a process share its virtual address space. The local variables of a function are unique to each thread that runs the function. However, the static and global variables are shared by all threads in the process. With thread local storage (TLS), you can provide unique data for each thread that the process can access using a global index. One thread allocates the index, which can be used by the other threads to retrieve the unique data associated with the index.

There are two types of Thread-Local Storage (TLS):

  • Dynamic TLS.
  • Static TLS.
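Dynamic TLS is exposed through four functions: TlsAlloc, TlsSetValue, TlsGetValue, and TlsFree. A minimal sketch of the usual pattern (the wrapper function names are illustrative, not Windows APIs):

```c
#include <windows.h>

// Sketch of dynamic TLS: one thread allocates a global index;
// every thread then stores and retrieves its own value at that index.
DWORD g_dwTlsIndex;   // the index itself is shared by all threads

void InitTlsOnce(void) {
    g_dwTlsIndex = TlsAlloc();           // allocate the index once
}

void StorePerThreadValue(PVOID pvData) {
    TlsSetValue(g_dwTlsIndex, pvData);   // unique to the calling thread
}

PVOID GetPerThreadValue(void) {
    return TlsGetValue(g_dwTlsIndex);    // this thread's value only
}

void CleanupTls(void) {
    TlsFree(g_dwTlsIndex);               // release the index
}
```

Two threads calling StorePerThreadValue with different pointers do not interfere with each other, even though they pass the same index.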

When the compiler compiles your program, it puts all the TLS variables into their own section, which is named, unsurprisingly enough, .tls. The linker combines all the .tls sections from all the object modules to produce one big .tls section in the resulting executable or DLL file.

For static TLS to work, the operating system must get involved. When your application is loaded into memory, the system looks for the .tls section in your executable file and dynamically allocates a block of memory large enough to hold all the static TLS variables. Every time the code in your application refers to one of these variables, the reference resolves to a memory location contained in the allocated block of memory. As a result, the compiler must generate additional code to reference the static TLS variables, which makes your application both larger in size and slower to execute. On an x86 CPU, three additional machine instructions are generated for every reference to a static TLS variable.

If another thread is created in your process, the system traps it and automatically allocates another block of memory to contain the new thread’s static TLS variables. The new thread has access only to its own static TLS variables and cannot access the TLS variables belonging to any other thread.

That’s basically how static TLS works. Now let’s add DLLs to the story. It’s likely that your application will use static TLS variables and that you’ll link to a DLL that also wants to use static TLS variables. When the system loads your application, it first determines the size of your application’s .tls section and adds the value to the size of any .tls sections in any DLLs to which your application links. When threads are created in your process, the system automatically allocates a block of memory large enough to hold all the TLS variables required by your application and all the implicitly linked DLLs. This is pretty cool.
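In Visual C++, a static TLS variable is declared with __declspec(thread); standard C11 spells the same idea _Thread_local. The following portable sketch (using POSIX threads rather than CreateThread, purely for illustration) shows that each thread really does get its own copy of the variable:

```c
#include <pthread.h>

// Each thread gets its own copy of g_counter; in Visual C++ this would
// be declared as __declspec(thread) int g_counter = 0;
static _Thread_local int g_counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) g_counter++;  // touches only this thread's copy
    return (void *)(long)g_counter;
}

// Runs two workers concurrently and reports each thread's final count
// plus the calling thread's own (untouched) copy.
void static_tls_demo(long *r1, long *r2, int *main_copy) {
    pthread_t t1, t2;
    void *a, *b;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, &a);
    pthread_join(t2, &b);
    *r1 = (long)a;
    *r2 = (long)b;
    *main_copy = g_counter;
}
```

Both workers see their counter reach 1000 independently, while the calling thread’s copy stays 0; there is no race even though both loops run concurrently with no locking.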

DLL Advanced Techniques

For a thread to call a function in a DLL module, the DLL’s file image must be mapped into the address space of the calling thread’s process. You can accomplish this in two ways. The first way is to have your application’s source code simply reference symbols contained in the DLL. This causes the loader to implicitly load (and link) the required DLL when the application is invoked.

The second way is for the application to explicitly load the required DLL and explicitly link to the desired exported symbol while the application is running. In other words, while the application is running, a thread within it can decide that it wants to call a function within a DLL. That thread can explicitly load the DLL into the process’ address space, get the virtual memory address of a function contained within the DLL, and then call the function using this memory address. The beauty of this technique is that everything is done while the application is running.

The figure below shows how an application explicitly loads a DLL and links to a symbol within it.

If you pass NULL to GetModuleHandle, the handle of the application executable is returned.

  • You can explicitly load a DLL through the WinAPI LoadLibrary(Ex).
  • You can explicitly unload a DLL through the WinAPI FreeLibrary or FreeLibraryAndExitThread.
  • You can explicitly link to an exported symbol from a DLL using the WinAPI GetProcAddress.
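Putting these three calls together, a minimal sketch of explicit loading and linking might look like this; SomeDll.dll, SomeFunc, and the function signature are illustrative assumptions, not real names:

```c
#include <windows.h>

// Sketch: explicitly load a DLL, link to an exported function,
// call it through the obtained address, and unload the DLL.
typedef int (WINAPI *PFNSOMEFUNC)(int);

void ExplicitLinkDemo(void) {
    HMODULE hMod = LoadLibraryW(L"SomeDll.dll");   // map the DLL in
    if (hMod == NULL) return;                      // not found or failed to init

    PFNSOMEFUNC pfn = (PFNSOMEFUNC)GetProcAddress(hMod, "SomeFunc");
    if (pfn != NULL) {
        int result = pfn(42);   // call through the virtual memory address
        (void)result;
    }

    FreeLibrary(hMod);          // unmap the DLL when done
}
```

The beauty of this pattern is that the decision to load, the lookup, and the unload all happen at run time, so a missing DLL or symbol is a recoverable condition rather than a load-time failure.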

The Platform SDK documentation states that your DllMain function should perform only simple initialization, such as setting up thread-local storage, creating kernel objects, and opening files. You must also avoid calls to User, Shell, ODBC, COM, RPC, and socket functions (or functions that call these functions) because their DLLs might not have initialized yet or the functions might call LoadLibrary(Ex) internally, again creating a dependency loop.

If a process terminates because some thread in the system calls TerminateProcess, the system does not call the DLL’s DllMain function with a value of DLL_PROCESS_DETACH. This means that any DLLs mapped into the process’ address space do not have a chance to perform any cleanup before the process terminates. This can result in the loss of data. You should use the TerminateProcess function only as a last resort.

The figure below explains what happens when a thread calls LoadLibrary.

And this figure shows what happens when FreeLibrary is called.

Normally, you don’t even think about this DllMain serialization. The reason I’m making a big deal out of it is that you can have a bug in your code caused by DllMain serialization. The buggy code looked something like the following:

BOOL WINAPI DllMain(HINSTANCE hInstDll, DWORD fdwReason, PVOID fImpLoad) {
   HANDLE hThread;
   DWORD dwThreadId;

   switch (fdwReason) {

      case DLL_PROCESS_ATTACH:
         // The DLL is being mapped into the process' address space.

         // Create a thread to do some stuff.
         hThread = CreateThread(NULL, 0, SomeFunction, NULL, 0, &dwThreadId);

         // Suspend our thread until the new thread terminates.
         WaitForSingleObject(hThread, INFINITE);

         // We no longer need access to the new thread.
         CloseHandle(hThread);
         break;

      case DLL_THREAD_ATTACH:
         // A thread is being created.
         break;

      case DLL_THREAD_DETACH:
         // A thread is exiting cleanly.
         break;

      case DLL_PROCESS_DETACH:
         // The DLL is being unmapped from the process' address space.
         break;
   }

   return(TRUE);
}
It may take several hours to discover the problem with this code. Can you see it? When DllMain receives a DLL_PROCESS_ATTACH notification, a new thread is created. The system must call DllMain again with a value of DLL_THREAD_ATTACH. However, the new thread is suspended because the thread that caused the DLL_PROCESS_ATTACH notification to be sent to DllMain has not finished processing. The problem is the call to WaitForSingleObject. This function suspends the currently executing thread until the new thread terminates. However, the new thread never gets a chance to run, let alone terminate, because it is suspended—waiting for the current thread to exit the DllMain function. What we have here is a deadlock situation. Both threads are suspended forever!

Delay Loading a DLL

Microsoft Visual C++ offers a fantastic feature to make working with DLLs easier: delay-load DLLs. A delay-load DLL is a DLL that is implicitly linked but not actually loaded until your code attempts to reference a symbol contained within the DLL. Delay-load DLLs are helpful in these situations:

  • If your application uses several DLLs, its initialization time might be slow because the loader maps all the required DLLs into the process’ address space. One way to alleviate this problem is to spread out the loading of the DLLs as the process executes. Delay-load DLLs let you accomplish this easily.
  • If you call a new function in your code and then try to run your application on an older version of the system in which the function does not exist, the loader reports an error and does not allow the application to run. You need a way to allow your application to run and then, if you detect (at run time) that the application is running on an older system, you don’t call the missing function. For example, let’s say that an application wants to use the new Thread Pool functions when running on Windows Vista and the old functions when running on older versions of Windows. When the application initializes, it calls GetVersionEx to determine the host operating system and properly calls the appropriate functions. Attempting to run this application on versions of Windows older than Windows Vista causes the loader to display an error message because the new Thread Pool functions don’t exist on these operating systems. Again, delay-load DLLs let you solve this problem easily.

However, a couple of limitations are worth mentioning:

  • It is not possible to delay-load a DLL that exports fields.
  • The Kernel32.dll module cannot be delay-loaded because it must be loaded for LoadLibrary and GetProcAddress to be called.
  • You should not call a delay-load function in a DllMain entry point because the process might crash.

Let’s start with the easy stuff: getting delay-load DLLs to work. First, you create a DLL just as you normally would. You also create an executable as you normally would, but you do have to change a couple of linker switches and relink the executable. Here are the two linker switches you need to add:

  • /Lib:DelayImp.lib
  • /DelayLoad:MyDll.dll

To unload a delay-loaded DLL, you must do two things. First, you must specify an additional linker switch (/Delay:unload) when you build your executable file. Second, you must modify your source code and place a call to the __FUnloadDelayLoadedDLL2 function at the point where you want the DLL to be unloaded:

BOOL __FUnloadDelayLoadedDLL2(PCSTR szDll);

You can take advantage of function forwarders in your DLL module as well. The easiest way to do this is by using a pragma directive, as shown here:

// Function forwarders to functions in DllWork

#pragma comment(linker, "/export:SomeFunc=DllWork.SomeOtherFunc")

This pragma tells the linker that the DLL being compiled should export a function called SomeFunc. But the actual implementation of SomeFunc is in another function called SomeOtherFunc, which is contained in a module called DllWork.dll. You must create separate pragma lines for each function you want to forward.

Applications can depend on a specific version of a shared DLL and start to fail if another application is installed with a newer or older version of the same DLL. There are two ways to ensure that your application uses the correct DLL: DLL redirection and side-by-side components. Developers and administrators should use DLL redirection for existing applications, because it does not require any changes to the application. If you are creating a new application or updating an application and want to isolate your application from potential problems, create a side-by-side component.

Rebasing Modules


When this executable module is invoked, the operating system loader creates a virtual address space for the new process. Then the loader maps the executable module at memory address 0x00400000 and the DLL module at 0x10000000. Why is this preferred base address so important? Let's look at this code:
int g_x;

void Func() {
   g_x = 5; // This is the important line.
}
When the compiler processes the Func function, the compiler and linker produce machine code that looks something like this:

MOV   [0x00414540], 5

In other words, the compiler and linker have created machine code that is actually hard-coded in the address of the g_x variable: 0x00414540. This address is in the machine code and absolutely identifies the location of the g_x variable in the process’ address space. But, of course, this memory address is correct if and only if the executable module loads at its preferred base address: 0x00400000.

What if we had the exact same code in a DLL module? In that case, the compiler and linker would generate machine code that looks something like this:

MOV   [0x10014540], 5

Again, notice that the virtual memory address for the DLL's g_x variable is hard-coded in the DLL file's image on the disk drive. And again, this memory address is absolutely correct as long as the DLL does in fact load at its preferred base address.

Relocating an executable (or DLL) module is an absolutely horrible process, and you should take measures to avoid it. Let's see why. 
Suppose that the loader relocates the second DLL to address 0x20000000. In that case, the code that changes the g_x variable to 5 should be

MOV   [0x20014540], 5

But the code in the file’s image looks like this:


MOV   [0x10014540], 5


If the code from the file’s image is allowed to execute, some 4-byte value in the first DLL module will be overwritten with the value 5. This can’t possibly be allowed. The loader must somehow fix this code. When the linker builds your module, it embeds a relocation section in the resulting file. This section contains a list of byte offsets. Each byte offset identifies a memory address used by a machine code instruction. If the loader can map a module at its preferred base address, the module’s relocation section is never accessed by the system. This is certainly what we want—you never want the relocation section to be used.

If, on the other hand, the module cannot be mapped at its preferred base address, the loader opens the module’s relocation section and iterates through all the entries. For each entry found, the loader goes to the page of storage that contains the machine code instruction to be modified. It then grabs the memory address that the machine instruction is currently using and adds to the address the difference between the module’s preferred base address and the address where the module actually got mapped.

So, in the preceding example, the second DLL was mapped at 0x20000000 but its preferred base address is 0x10000000. This yields a difference of 0x10000000, which is then added to the address in the machine code instruction, giving us this:

MOV   [0x20014540], 5

Now this code in the second DLL will reference its g_x variable correctly.
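The fix-up the loader performs is just this addition of a delta. A tiny standalone sketch of the arithmetic (RelocateAddress is an illustrative name, not a Windows API):

```c
#include <stdint.h>

// The loader's relocation fix-up: add the difference between where the
// module actually got mapped and its preferred base address to each
// hard-coded address listed in the relocation section.
uint32_t RelocateAddress(uint32_t hardCodedAddr,
                         uint32_t preferredBase, uint32_t actualBase) {
    uint32_t delta = actualBase - preferredBase;  // 0x10000000 in the example
    return hardCodedAddr + delta;                 // 0x10014540 -> 0x20014540
}
```

When a module loads at its preferred base address, the delta is zero and no instruction needs patching, which is exactly why the relocation section is never touched in that case.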

There are two major drawbacks when a module cannot load at its preferred base address:

  • The loader has to iterate through the relocation section and modify a lot of the module’s code. This produces a major performance hit and can really hurt an application’s initialization time.
  • As the loader writes to the module’s code pages, the system’s copy-on-write mechanism forces these pages to be backed by the system’s paging file.

By the way, you can create an executable or DLL module that doesn’t have a relocation section in it. You do this by passing the /FIXED switch to the linker when you build the module. Using this switch makes the module smaller in bytes, but it means that the module cannot be relocated. If the module cannot load at its preferred base address, it cannot load at all. If the loader must relocate a module but no relocation section exists for the module, the loader kills the entire process and displays an "Abnormal Process Termination" message to the user.

Preferred base addresses must always start on an allocation-granularity boundary.

Visual Studio ships a utility called Rebase, built around the ReBaseImage function. When you execute Rebase, passing it a set of image filenames, it does the following:

  1. It simulates creating a process’ address space.
  2. It opens all the modules that would normally be loaded into this address space. It thus gets the preferred base address and size of each module.
  3. It simulates relocating the modules in the simulated address space so that none of the modules overlap.
  4. For each relocated module, it parses the module’s relocation section and modifies the code in the module file on disk.
  5. It updates the header of each relocated module to reflect the new preferred base address.

Binding Modules

Rebasing is very important and greatly improves the performance of the entire system. However, you can do even more to improve performance. Let’s say that you have properly rebased all your application’s modules. When the loader loads a module, it resolves each imported symbol and writes the symbol’s virtual address into the executable module’s import section. This allows references to the imported symbols to actually get to the correct memory location.

Let’s think about this for a second. If the loader is writing the virtual addresses of the imported symbols into the .exe module’s import section, the pages that back the import section are written to. Because these pages are copy-on-write, the pages are backed by the paging file. So we have a problem that is similar to the rebasing problem: portions of the image file are swapped to and from the system’s paging file instead of being discarded and reread from the file’s disk image when necessary. Also, the loader has to resolve the addresses of all the imported symbols (for all modules), which can be time-consuming.

You can use the technique of binding a module so that your application can initialize faster and use less storage. Binding a module prepares that module’s import section with the virtual addresses of all the imported symbols. To improve initialization time and to use less storage, you must do this before loading the module, of course.

Visual Studio ships a Bind.exe utility that does the following when you execute Bind, passing it an image name:

  1. It opens the specified image file’s import section.
  2. For every DLL listed in the import section, it opens the DLL file and looks in its header to determine its preferred base address.
  3. It looks up each imported symbol in the DLL’s export section.
  4. It takes the RVA of the symbol and adds to it the module’s preferred base address. It writes the resulting expected virtual address of the imported symbol to the image file’s import section.
  5. It adds some additional information to the image file’s import section. This information includes the name of all DLL modules that the image is bound to and the time stamp of those modules.

OK, so now you know that you should bind all the modules that you ship with your application. But when should you perform the bind? If you bind your modules at your company, you would bind them to the system DLLs that you’ve installed, which are unlikely to be what the user has installed. Because you don’t know if your user is running Windows XP, Windows 2003, or Windows Vista, or whether these have service packs installed, you should perform binding as part of your application’s setup.

DLL Basics

Dynamic-link libraries (DLLs) have been the cornerstone of Microsoft Windows since the first version of the operating system. All the functions in the Windows application programming interface (API) are contained in DLLs. The three most important DLLs are Kernel32.dll, which contains functions for managing memory, processes, and threads; User32.dll, which contains functions for performing user-interface tasks such as window creation and message sending; and GDI32.dll, which contains functions for drawing graphical images and displaying text.

Windows also comes with several other DLLs that offer functions for performing more specialized tasks. For example, AdvAPI32.dll contains functions for object security, registry manipulation, and event logging; ComDlg32.dll contains the common dialog boxes (such as File Open and File Save); and ComCtl32.DLL supports all the common window controls.

Advantages of using DLLs:

  • They extend the features of an application.
  • They simplify the project management.
  • They help conserve memory by being shared among several processes (unlike a static library, whose code is duplicated into every executable that links it).
  • Facilitate resource sharing.
  • Facilitate localization.
  • They help resolve platform differences.

A DLL can be loaded implicitly or explicitly.

Any memory allocated by functions in a DLL resides in the calling process’ address space.

It is important to realize that a single address space consists of one executable module and several DLL modules. Some of these modules can link to a static version of the C/ C++ run-time library, some of these modules might link to a DLL version of the C/C++ run-time library, and some of these modules (if not written in C/C++) might not require the C/ C++ run-time library at all.

Many developers make a common mistake because they forget that several C/C++ run-time libraries can be present in a single address space. Examine the following code:

VOID EXEFunc() {
   PVOID pv = DLLFunc();
   // Access the storage pointed to by pv...

   // Assumes that pv is in EXE's C/C++ run-time heap
   free(pv);
}

PVOID DLLFunc() {
   // Allocate block from DLL's C/C++ run-time heap
   return(malloc(100));
}

So, what do you think? Does the preceding code work correctly? Is the block allocated by the DLL’s function freed by the EXE’s function? The answer is: maybe. The code shown does not give you enough information. If both the EXE and the DLL link to the DLL C/C++ run-time library, the code works just fine. However, if one or both of the modules link to the static C/C++ run-time library, the call to free fails. I have seen developers write code similar to this too many times, and it has burned them all.

There is an easy fix for this problem. When a module offers a function that allocates memory, the module must also offer a function that frees memory. Let me rewrite the code just shown:

VOID EXEFunc() {
   PVOID pv = DLLFunc();
   // Access the storage pointed to by pv...
   // Makes no assumptions about C/C++ run-time heap
   DLLFreeFunc(pv);
}

PVOID DLLFunc() {
   // Allocate block from DLL's C/C++ run-time heap
   PVOID pv = malloc(100);
   return(pv);
}

BOOL DLLFreeFunc(PVOID pv) {
   // Free block from DLL's C/C++ run-time heap
   free(pv);
   return(TRUE);
}

This code is correct and will always work. When you write a module, don’t forget that functions in other modules might not even be written in C/C++ and therefore might not use malloc and free for memory allocations. Be careful not to make these assumptions in your code. By the way, this same argument holds true for the C++ new and delete operators, which call malloc and free internally.

Building a DLL requires the following steps:

  1. You must first create a header file, which contains the function prototypes, structures, and symbols that you want to export from the DLL. This header file is included by all of your DLL’s source code modules to help build the DLL. As you’ll see later, this same header file is required when you build an executable module (or modules) that uses the functions and variables contained in your DLL.
  2. You create the C/C++ source code module (or modules) that implements the functions and variables that you want in the DLL module. Because these source code modules are not required to build an executable module, the DLL's source code can remain a company secret.
  3. Building the DLL module causes the compiler to process each source code module, producing an .obj module (one .obj module per source code module).
  4. After all the .obj modules are created, the linker combines the contents of all the .obj modules and produces a single DLL image file. This image file (or module) contains all the binary code and global/static data variables for the DLL. This file is required in order to execute the executable module.
  5. If the linker detects that the DLL’s source code module exports at least one function or variable, the linker also produces a single .lib file. This .lib file is small because it contains no functions or variables. It simply lists all the exported function and variable symbol names. This file is required in order to build the executable module.
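The header file from step 1 is commonly written so that a single macro flips between exporting and importing, letting the same file serve both the DLL build and its clients. A sketch of this pattern follows; the names MYLIBAPI and Add are illustrative, not from the text:

```c
/* MyLib.h -- one header shared by the DLL and its clients (hypothetical names) */
#ifdef MYLIBAPI
// The DLL's own source files #define MYLIBAPI as __declspec(dllexport)
// before including this header, so the prototypes below are exported.
#else
// Client source files fall through to here and get the import declaration.
#define MYLIBAPI __declspec(dllimport)
#endif

MYLIBAPI int Add(int a, int b);   // example exported function prototype
```

This way the exact same prototype is seen by both sides, and only the DLL's own compilation defines the export form.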

    Once you build the DLL module, you can build the executable module by following these steps:

  6. In all the source modules that reference functions, variables, data structures, or symbols, you must include the header file created by the DLL developer.
  7. You create the C/C++ source code module (or modules) that implements the functions and variables that you want in the executable module. The code can, of course, reference functions and variables defined in the DLL’s header file.
  8. Building the executable module causes the compiler to process each source code module, producing an .obj module (one .obj module per source code module).
  9. After all the .obj modules are created, the linker combines the contents of all the .obj modules and produces a single executable image file. This image file (or module) contains all the binary code and global/static data variables for the executable. The executable module also contains an import section that lists all the DLL module names required by this executable. In addition, for each DLL name listed, the section indicates which function and variable symbols are referenced by the executable’s binary code. The operating system loader parses the import section, as you’ll see in a moment.

    Once the DLL and the executable modules are built, a process can execute. When you attempt to run the executable module, the operating system’s loader performs the following step.

  10. The loader creates a virtual address space for the new process. The executable module is mapped into the new process’ address space. The loader parses the executable module’s import section. For every DLL name listed in the section, the loader locates the DLL module on the user’s system and maps that DLL into the process’ address space. Note that because a DLL module can import functions and variables from another DLL module, a DLL module might have its own import section. To fully initialize a process, the loader parses every module’s import section and maps all required DLL modules into the process’ address space. As you can see, initializing a process can be time consuming.

Once the executable module and all the DLL modules have been mapped into the process’ address space, the process’ primary thread can start executing and the application can run.

Building the DLL module

When you create a DLL, you create a set of functions that an executable module (or other DLLs) can call. A DLL can export variables, functions, or C++ classes to other modules. In real life, you should avoid exporting variables because this removes a level of abstraction in your code and makes it more difficult to maintain your DLL’s code. In addition, C++ classes can be exported only if the modules importing the C++ class are compiled using a compiler from the same vendor. For this reason, you should also avoid exporting C++ classes unless you know that the executable module developers use the same tools as the DLL module developers.

When an executable file is invoked, the operating system loader creates the virtual address space for the process. Then the loader maps the executable module into the process’ address space. The loader examines the executable’s import section and attempts to locate and map any required DLLs into the process’ address space.

Because the import section contains just a DLL name without its pathname, the loader must search the user’s disk drives for the DLL. Here is the loader’s search order:

  1. The directory containing the executable image file
  2. The Windows system directory returned by GetSystemDirectory
  3. The 16-bit system directory—that is, the System subfolder under the Windows directory
  4. The Windows directory returned by GetWindowsDirectory
  5. The process’ current directory
  6. The directories listed in the PATH environment variable


The third and last mechanism for manipulating memory is the use of heaps. Heaps are great for allocating lots of small blocks of data. For example, linked lists and trees are best managed using heaps rather than the virtual memory techniques or the memory-mapped file techniques. The advantage of heaps is that they allow you to ignore all the allocation granularity and page boundary stuff and concentrate on the task at hand. The disadvantage of heaps is that allocating and freeing memory blocks is slower than the other mechanisms and you lose the direct control over the committing and decommitting of physical storage.

Internally, a heap is a region of reserved address space. Initially, most of the pages within the reserved region are not committed with physical storage. As you make more allocations from the heap, the heap manager commits more physical storage to the heap. This physical storage is always allocated from the system’s paging file. As you free blocks within a heap, the heap manager decommits the physical storage.

When a process initializes, the system creates a heap in the process’ address space. This heap is called the process’ default heap. By default, this heap’s region of address space is 1 MB in size. However, the system can grow a process’ default heap so that it becomes larger than this. You can change the default region size of 1 MB using the /HEAP linker switch when you create an application. Because a dynamic-link library (DLL) does not have a heap associated with it, you should not use the /HEAP switch when you are linking a DLL. The /HEAP switch has the following syntax:

/HEAP:reserve[,commit]
Many Windows functions require the process’ default heap. For example, the core functions in Windows perform all of their operations using Unicode characters and strings. If you call an ANSI version of a Windows function, this ANSI version must convert the ANSI strings to Unicode strings and then call the Unicode version of the same function. To convert the strings, the ANSI function needs to allocate a block of memory to hold the Unicode version of the string. This block of memory is allocated from your process’ default heap. Many other Windows functions require the use of temporary memory blocks; these blocks are allocated from the process’ default heap. Also, the old 16-bit Windows functions LocalAlloc and GlobalAlloc make their memory allocations from the process’ default heap

Because the process’ default heap is used by many of the Windows functions, and because your application has many threads simultaneously calling the various Windows functions, access to the default heap is serialized. In other words, the system guarantees that only one thread at a time can allocate or free blocks of memory in the default heap at any given time. If two threads attempt to simultaneously allocate a block of memory in the default heap, only one thread will be able to allocate a block; the other thread will be forced to wait until the first thread’s block is allocated. Once the first thread’s block is allocated, the heap functions will allow the second thread to allocate a block. This serialized access causes a small performance hit. If your application has only one thread and you want to have the fastest possible access to a heap, you should create your own separate heap and not use the process’ default heap. Unfortunately, you cannot tell the Windows functions not to use the default heap, so their accesses to the heap are always serialized.

Reasons for creating an additional heap in your process:

  • Component protection.
  • More efficient memory management (less fragmentation).
  • Local access.
  • Avoiding thread synchronization overhead.
  • Quick freeing (destroying the heap frees all of its blocks at once).

It is recommended that you use VirtualAlloc when allocating large blocks (around 1 MB or more). Avoid using the heap functions for such large allocations.

Memory Mapped Files

Working with files is something almost every application must do, and it’s always a hassle. Should your application open the file, read it, and close the file, or should it open the file and use a buffering algorithm to read from and write to different portions of the file? Microsoft Windows offers the best of both worlds: memory-mapped files.

Like virtual memory, memory-mapped files allow you to reserve a region of address space and commit physical storage to the region. The difference is that the physical storage comes from a file that is already on the disk instead of the system’s paging file. Once the file has been mapped, you can access it as if the whole file were loaded in memory.

Memory-mapped files are used for three different purposes:

  • The system uses memory-mapped files to load and execute .exe and dynamic-link library (DLL) files. This greatly conserves both paging file space and the time required for an application to begin executing.
  • You can use memory-mapped files to access a data file on disk. This shelters you from performing file I/O operations on the file and from buffering the file’s contents.
  • You can use memory-mapped files to allow multiple processes running on the same machine to share data with each other. Windows does offer other methods for communicating data among processes—but these other methods are implemented using memory-mapped files, making memory-mapped files the most efficient way for multiple processes on a single machine to communicate with one another.

When a thread calls CreateProcess, the system performs the following steps:

  1. The system locates the .exe file specified in the call to CreateProcess. If the .exe file cannot be found, the process is not created and CreateProcess returns FALSE.
  2. The system creates a new process kernel object.
  3. The system creates a private address space for this new process.
  4. The system reserves a region of address space large enough to contain the .exe file. The desired location of this region is specified inside the .exe file itself. By default, an .exe file’s base address is 0x00400000. (This address might be different for a 64-bit application running on 64-bit Windows.) However, you can override this when you create your application’s .exe file by using the linker’s /BASE option when you link your application.
  5. The system notes that the physical storage backing the reserved region is in the .exe file on disk instead of the system’s paging file

After the .exe file has been mapped into the process’ address space, the system accesses a section of the .exe file that lists the DLLs containing functions that the code in the .exe calls. The system then calls LoadLibrary for each of these DLLs, and if any of the DLLs require additional DLLs, the system calls LoadLibrary to load those DLLs as well. Every time LoadLibrary is called to load a DLL, the system performs steps similar to steps 4 and 5 just listed:

  1. The system reserves a region of address space large enough to contain the DLL file. The desired location of this region is specified inside the DLL file itself. By default, Microsoft’s linker sets the DLL’s base address to 0x10000000 for an x86 DLL and 0x0000000180000000 for an x64 DLL. However, you can override this when you build your DLL by using the linker’s /BASE option. All the standard system DLLs that ship with Windows have different base addresses so that they don’t overlap if loaded into a single address space.
  2. If the system is unable to reserve a region at the DLL’s preferred base address, either because the region is occupied by another DLL or .exe or because the region just isn’t big enough, the system will then try to find another region of address space to reserve for the DLL. It is unfortunate when a DLL cannot load at its preferred base address for two reasons. First, the system might not be able to load the DLL if it does not have relocation information. (You can remove relocation information from a DLL when it is created by using the linker’s /FIXED switch. This makes the DLL file smaller, but it also means that the DLL must load at its preferred address or it can’t load at all.) Second, the system must perform some relocations within the DLL. These relocations require additional storage from the system’s paging file; they also increase the amount of time needed to load the DLL.
  3. The system notes that the physical storage backing the reserved region is in the DLL file on disk instead of in the system’s paging file. If Windows has to perform relocations because the DLL could not load at its preferred base address, the system also notes that some of the physical storage for the DLL is mapped to the paging file.

If for some reason the system is unable to map the .exe and all the required DLLs, the system displays a message box to the user and frees the process’ address space and the process object. CreateProcess will return FALSE to its caller; the caller can call GetLastError to get a better idea of why the process could not be created.

After all the .exe and DLL files have been mapped into the process’ address space, the system can begin executing the .exe file’s startup code. After the .exe file has been mapped, the system takes care of all the paging, buffering, and caching. For example, if code in the .exe causes it to jump to the address of an instruction that isn’t loaded into memory, a fault will occur. The system detects the fault and automatically loads the page of code from the file’s image into a page of RAM. Then the system maps the page of RAM to the proper location in the process’ address space and allows the thread to continue executing as though the page of code were loaded all along. Of course, all this is invisible to the application. This process is repeated each time any thread in the process attempts to access code or data that is not loaded into RAM

Note: “Initially, static data is not shared by multiple instances of an executable or DLL”.

When you create a new process for an application that is already running, the system simply opens another memory-mapped view of the file-mapping object that identifies the executable file’s image and creates a new process object and a new thread object (for the primary thread). The system also assigns new process and thread IDs to these objects. By using memory-mapped files, multiple running instances of the same application can share the same code and data in RAM.

Note one small problem here. Processes use a flat address space. When you compile and link your program, all the code and data are thrown together as one large entity. The data is separated from the code but only to the extent that it follows the code in the .exe file. (See the following note for more detail.) The following illustration shows a simplified view of how the code and data for an application are loaded into virtual memory and then mapped into an application’s address space

As an example, let’s say that a second instance of an application is run. The system simply maps the pages of virtual memory containing the file’s code and data into the second application’s address space, as shown next

Suppose a thread in the first instance now writes to a global variable in one of the shared data pages. Copy-on-write protection kicks in: the system allocates a new page of virtual memory and copies the contents of the affected data page into it. The first instance’s address space is changed so that the new data page is mapped in at the same virtual address as the original page. Now the system can let the process alter the global variable without fear of altering the data of another instance of the same application.

A similar sequence of events occurs when an application is being debugged. Let’s say that you’re running multiple instances of an application and want to debug only one instance. You access your debugger and set a breakpoint in a line of source code. The debugger modifies your code by changing one of your assembly language instructions to an instruction that causes the debugger to activate itself. So you have the same problem again. When the debugger modifies the code, it causes all instances of the application to activate the debugger when the changed assembly instruction is executed. To fix this situation, the system again uses copy-on-write memory. When the system senses that the debugger is attempting to change the code, it allocates a new block of memory, copies the page containing the instruction into the new page, and allows the debugger to modify the code in the page copy.

Sharing Static Data across Multiple Instances of an Executable or DLL

The fact that global and static data is not shared by multiple mappings of the same .exe or DLL is a safe default. However, on some occasions it is useful and convenient for multiple mappings of an .exe to share a single instance of a variable. For example, Windows offers no easy way to determine whether the user is running multiple instances of an application. But if you could get all the instances to share a single global variable, this global variable could reflect the number of instances running. When the user invoked an instance of the application, the new instance’s thread could simply check the value of the global variable (which had been updated by another instance), and if the count were greater than 1, the second instance could notify the user that only one instance of the application is allowed to run and the second instance would terminate.

Every .exe or DLL file image is composed of a collection of sections. By convention, each standard section name begins with a period. For example, when you compile your program, the compiler places all the code in a section called .text. The compiler also places all the uninitialized data in a .bss section and all the initialized data in a .data section.

Each section has attributes, such as READ, WRITE, EXECUTE, and SHARED. Common sections include .text (code), .data (initialized data), .bss (uninitialized data), .rdata (read-only data such as string literals), .rsrc (resources), and .reloc (base relocation information).

In addition to using the standard sections created by the compiler and the linker, you can create your own sections when you compile using the following directive:


#pragma data_seg("sectionname")


So, for example, I can create a section called “Shared” that contains a single LONG value, as follows:


#pragma data_seg("Shared")
LONG g_lInstanceCount = 0;
#pragma data_seg()


When the compiler compiles this code, it creates a new section called Shared and places all the initialized data variables that it sees after the pragma in this new section. In the preceding example, the variable is placed in the Shared section. Following the variable, the #pragma data_seg() line tells the compiler to stop putting initialized variables in the Shared section and to start putting them back in the default data section. It is extremely important to remember that the compiler will store only initialized variables in the new section

The Microsoft Visual C++ compiler offers an allocate declaration specifier, however, that does allow you to place uninitialized data in any section you desire. Take a look at the following code:


// Create Shared section & have compiler place initialized data in it.
#pragma data_seg("Shared")

int a = 0;   // Initialized, in Shared section
int b;       // Uninitialized, not in Shared section

// Have compiler stop placing initialized data in Shared section.
#pragma data_seg()

__declspec(allocate("Shared")) int c = 0;   // Initialized, in Shared section
__declspec(allocate("Shared")) int d;       // Uninitialized, in Shared section

int e = 0;   // Initialized, not in Shared section
int f;       // Uninitialized, not in Shared section



Simply telling the compiler to place certain variables in their own section is not enough to share those variables. You must also tell the linker that the variables in a particular section are to be shared. You can do this by using the /SECTION switch on the linker's command line:

/SECTION:name,attributes
Following the colon, type the name of the section for which you want to alter attributes. In our example, we want to change the attributes of the Shared section. So we’d construct our linker switch as follows:

/SECTION:Shared,RWS

After the comma, we specify the desired attributes: use R for READ, W for WRITE, E for EXECUTE, and S for SHARED. The switch shown indicates that the data in the Shared section is readable, writable, and shared. If you want to change the attributes of more than one section, you must specify the /SECTION switch multiple times—once for each section for which you want to change attributes.

You can also embed linker switches right inside your source code using this syntax:


#pragma comment(linker, "/SECTION:Shared,RWS")


This line tells the compiler to embed the preceding string inside a special section of the generated .obj file named “.drectve”. When the linker combines all the .obj modules together, the linker examines each .obj module’s “.drectve” section and pretends that all the strings were passed to the linker as command-line arguments. This technique is convenient enough to use all the time: if you move a source code file into a new project, you don’t have to remember to set linker switches in the Visual C++ Project Properties dialog box.

Although you can create shared sections, Microsoft discourages the use of shared sections for two reasons. First, sharing memory in this way can potentially violate security. Second, sharing variables means that an error in one application can affect the operation of another application because there is no way to protect a block of data from being randomly written to by an application.

Memory-Mapped Data Files

The operating system makes it possible to memory map a data file into your process’ address space, which makes it very convenient to manipulate large streams of data. To understand the power of using memory-mapped files this way, let’s look at four possible methods of implementing a program to reverse the order of all the bytes in a file. They are:

Method 1: One file, one buffer.

The first and theoretically simplest method involves allocating a block of memory large enough to hold the entire file. The file is opened, its contents are read into the memory block, and the file is closed. With the contents in memory, we can now reverse all the bytes by swapping the first byte with the last, the second byte with the second-to-last, and so on. This swapping continues until you reach the middle of the file. After all the bytes have been swapped, you reopen the file and overwrite its contents with the contents of the memory block.

This method is pretty easy to implement but has two major drawbacks. First, a memory block the size of the file must be allocated. This might not be so bad if the file is small, but what if the file is huge—say, 2 GB? A 32-bit system will not allow the application to commit a block of physical storage that large. Large files require a different method.

Second, if the process is interrupted in the middle—while the reversed bytes are being written back out to the file—the contents of the file will be corrupted. The simplest way to guard against this is to make a copy of the original file before reversing its contents. If the whole process succeeds, you can delete the copy of the file. Unfortunately, this safeguard requires additional disk space

Method 2: Two files, one buffer.

In the second method, you open the existing file and create a new file of 0 length on the disk. Then you allocate a small internal buffer—say, 8 KB. You seek to the end of the original file minus 8 KB, read the last 8 KB into the buffer, reverse the bytes, and write the buffer’s contents to the newly created file. The process of seeking, reading, reversing, and writing repeats until you reach the beginning of the original file. Some special—but not extensive—handling is required if the file’s length is not an exact multiple of 8 KB. After the original file is fully processed, both files are closed and the original file is deleted.

This method is a bit more complicated to implement than the first one. It uses memory much more efficiently because only an 8-KB chunk is ever allocated, but it has two big problems. First, the processing is slower than in the first method because on each iteration you must perform a seek on the original file before performing a read. Second, this method can potentially use a large amount of hard disk space. If the original file is 1 GB, the new file will grow to be 1 GB as the process continues. Just before the original file is deleted, the two files will occupy 2 GB of disk space. This is 1 GB more than should be required—a disadvantage that leads us to the next method

Method 3: One file, two buffers.

For this method, let’s say the program initializes by allocating two separate 8-KB buffers. The program reads the first 8 KB of the file into one buffer and the last 8 KB of the file into the other buffer. The process then reverses the contents of both buffers and writes the contents of the first buffer back to the end of the file and the contents of the second buffer back to the beginning of the same file. Each iteration continues by moving blocks from the front and back of the file in 8-KB chunks. Some special handling is required if the file’s length is not an exact multiple of 16 KB and the two 8-KB chunks overlap. This special handling is more complex than the special handling in the previous method, but it’s nothing that should scare off a seasoned programmer.

Compared with the previous two methods, this method is better at conserving hard disk space. Because everything is read from and written to the same file, no additional disk space is required. As for memory use, this method is also not too bad, using only 16 KB. Of course, this method is probably the most difficult to implement. Like the first method, this method can result in corruption of the data file if the process is somehow interrupted.

Now let’s take a look at how this process might be accomplished using memory-mapped files

Method 4: One file, zero buffers.

When using memory-mapped files to reverse the contents of a file, you open the file and then tell the system to reserve a region of virtual address space. You tell the system to map the first byte of the file to the first byte of this reserved region. You can then access the region of virtual memory as though it actually contained the file. In fact, if there were a single 0 byte at the end of a text file, you could simply call the C run-time function _tcsrev to reverse the data in the file because it is usable simply as an in-memory text string.

This method’s great advantage is that the system manages all the file caching for you. You don’t have to allocate any memory, load file data into memory, write data back to the file, or free any memory blocks. Unfortunately, the possibility that an interruption such as a power failure could corrupt data still exists with memory-mapped files.

A portion of a file that is mapped into your process’ address space is called a view, which explains how MapViewOfFile got its name.

Using Memory Mapped-File to Share Data among Processes

Windows has always excelled at offering mechanisms that allow applications to share data and information quickly and easily. These mechanisms include RPC, COM, OLE, DDE, window messages (especially WM_COPYDATA), the Clipboard, mailslots, pipes, sockets, and so on. In Windows, the lowest-level mechanism for sharing data on a single machine is the memory-mapped file. That’s right: all of the mechanisms just mentioned ultimately use memory-mapped files to do their dirty work when all the communicating processes are on the same machine. If you require high performance with low overhead, the memory-mapped file is the hands-down best mechanism to use.

Here is an interesting problem that has caught unsuspecting programmers by surprise. Can you guess what is wrong with the following code fragment?


HANDLE hFile = CreateFile(...);
HANDLE hMap = CreateFileMapping(hFile, ...);
if (hMap == NULL)
   return(GetLastError());
If the call shown to CreateFile fails, it returns INVALID_HANDLE_VALUE. However, the unsuspecting programmer who wrote this code didn’t test to check whether the file was created successfully. When CreateFileMapping is called, INVALID_HANDLE_VALUE is passed in the hFile parameter, which causes the system to create a file mapping using storage from the paging file instead of the intended disk file. Any additional code that uses the memory-mapped file will work correctly. However, when the file-mapping object is destroyed, all the data that was written to the file-mapping storage (the paging file) will be destroyed by the system. At this point, the developer sits and scratches his or her head, wondering what went wrong! You must always check CreateFile‘s return value to see if an error occurred because CreateFile can fail for so many reasons!

Sparsely Committed Memory-Mapped Files