Kernel Objects

Each kernel object is simply a memory block allocated by the kernel and is accessible only by the kernel. This memory block is a data structure whose members maintain information about the object.

Because the kernel object data structures are accessible only by the kernel, it is impossible for an application to locate these data structures in memory and directly alter their contents.

If we cannot alter these structures directly, how do our applications manipulate these kernel objects? The answer is that Windows offers a set of functions that manipulate these structures in well-defined ways. These kernel objects are always accessible via these functions. When you call a function that creates a kernel object, the function returns a handle that identifies the object. Think of this handle as an opaque value that can be used by any thread in your process. A handle is a 32-bit value in a 32-bit Windows process and a 64-bit value in a 64-bit Windows process.

To make the operating system robust, these handle values are process-relative. If you were to pass a handle value to a thread in another process (using some form of interprocess communication), the calls that the other process made using your process’ handle value might fail or, even worse, might manipulate a totally different kernel object that happens to occupy the same index in that process’ handle table.

Kernel objects are owned by the kernel, not by a process. In other words, if your process calls a function that creates a kernel object and then your process terminates, the kernel object is not necessarily destroyed. Under most circumstances, the object will be destroyed; but if another process is using the kernel object your process created, the kernel knows not to destroy the object until the other process has stopped using it.

The kernel knows how many processes are using a particular kernel object because each object contains a usage count. The usage count is one of the data members common to all kernel object types.

Kernel objects can be protected with a security descriptor. A security descriptor describes who owns the object (usually its creator), which group and users can gain access to or use the object, and which group and users are denied access to the object. Security descriptors are usually used when writing server applications.

Neglecting proper security access flags is one of the biggest mistakes that developers make. Using the correct flags makes it much easier to port an application between Windows versions. However, you also need to realize that each new version of Windows brings a new set of constraints that did not exist in previous versions. For example, in Windows Vista, you need to take the User Account Control (UAC) feature into account. By default, UAC forces applications to run in a restricted security context even when the current user is a member of the Administrators group.

When you first start programming for Windows, you might be confused when you try to differentiate a User object or a GDI object from a kernel object. For example, is an icon a User object or a kernel object? The easiest way to determine whether an object is a kernel object is to examine the function that creates the object. Almost all functions that create kernel objects have a parameter that allows you to specify security attribute information.

None of the functions that create User or GDI objects have a PSECURITY_ATTRIBUTES parameter. For example, take a look at the CreateIcon function:

HICON CreateIcon(
   HINSTANCE hinst,
   int nWidth,
   int nHeight,
   BYTE cPlanes,
   BYTE cBitsPixel,
   CONST BYTE *pbANDbits,
   CONST BYTE *pbXORbits);

When a process is initialized, the system allocates a handle table for it. This handle table is used only for kernel objects, not for User objects or GDI objects.

When a process first initializes, its handle table is empty. When a thread in the process calls a function that creates a kernel object, such as CreateFileMapping, the kernel allocates a block of memory for the object and initializes it. The kernel then scans the process’ handle table for an empty entry.

All functions that create kernel objects return process-relative handles that can be used successfully by any and all threads that are running in the same process. This handle value should actually be divided by 4 (or shifted right two bits to ignore the last two bits that are used internally by Windows) to obtain the real index into the process’ handle table that identifies where the kernel object’s information is stored.
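The divide-by-4 relationship between a handle value and its table index can be sketched with a couple of helpers. These functions are purely illustrative arithmetic, not part of the Windows API:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the arithmetic described above: a handle value is the table
// index multiplied by 4, because the low two bits are used internally
// by Windows. Illustrative only; Windows does not expose these helpers.
inline std::uint32_t HandleToIndex(std::uint32_t handleValue) {
    return handleValue >> 2;   // same as dividing by 4
}

inline std::uint32_t IndexToHandle(std::uint32_t index) {
    return index << 2;         // low two bits left clear
}
```

For example, the first valid handle value, 4, corresponds to index 1 in the table.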

If you call a function to create a kernel object and the call fails, the handle value returned is usually 0 (NULL); this is why the first valid handle value is 4. For such a failure to happen, the system would have to be very low on memory or encountering a security problem. Unfortunately, a few functions return a handle value of -1 (INVALID_HANDLE_VALUE, defined in WinBase.h) when they fail. For example, if CreateFile fails to open the specified file, it returns INVALID_HANDLE_VALUE instead of NULL.

Regardless of how you create a kernel object, you indicate to the system that you are done manipulating the object by calling CloseHandle:

BOOL CloseHandle(HANDLE hobject);

Usually, when you create a kernel object, you store the corresponding handle in a variable. After you call CloseHandle with this variable as a parameter, you should also reset the variable to NULL. If, by mistake, you reuse the variable in a call to a Win32 function, two unexpected situations might occur. Because the handle table slot referenced by the variable has been cleared, Windows receives an invalid parameter and you get an error. But another situation, one that is much harder to debug, is also possible. When you create a new kernel object, Windows looks for a free slot in the handle table. So if new kernel objects have been created elsewhere in your application, the handle table slot referenced by the stale variable may now contain one of these new kernel objects. The call might then target a kernel object of the wrong type or, even worse, a kernel object of the same type as the closed one. Your application’s state then becomes corrupted with no chance of recovery.

Let’s say that you forget to call CloseHandle—will there be an object leak? Well, yes and no. It is possible for a process to leak resources (such as kernel objects) while the process runs. However, when the process terminates, the operating system ensures that all resources used by the process are freed—this is guaranteed. For kernel objects, the system performs the following actions:

  1. When your process terminates, the system automatically scans the process’ handle table.
  2. If the table has any valid entries (objects that you didn’t close before terminating), the system closes these object handles for you.
  3. If the usage count of any of these objects goes to zero, the kernel destroys the object.

Because kernel object handles are process-relative, sharing kernel objects between processes is difficult. However, Microsoft had several good reasons for designing the handles to be process-relative.

  • The most important reason was robustness. If kernel object handles were system wide values, one process could easily obtain the handle to an object that another process was using and wreak havoc on that process.
  • Another reason for process-relative handles is security. Kernel objects are protected with security, and a process must request permission to manipulate an object before attempting to manipulate it. The creator of the object can prevent an unauthorized user from touching the object simply by denying access to it.

There are three different ways to allow processes to share kernel objects:

  1. Using object handle inheritance.
  2. Naming objects.
  3. Duplicating object handles.

Using Object Handle Inheritance

Object handle inheritance can be used only when processes have a parent-child relationship. In this scenario, one or more kernel object handles are available to the parent process, and the parent decides to spawn a child process, giving the child access to the parent’s kernel objects. For this type of inheritance to work, the parent process must perform several steps.

First, when the parent process creates a kernel object, the parent must indicate to the system that it wants the object’s handle to be inheritable. Sometimes I hear people use the term object inheritance. However, there is no such thing as object inheritance; Windows supports object handle inheritance. In other words, it is the handles that are inheritable, not the objects themselves.

To create an inheritable handle, the parent process must allocate and initialize a SECURITY_ATTRIBUTES structure and pass the structure’s address to the specific Create function. The following code creates a Mutex object and returns an inheritable handle to it:

SECURITY_ATTRIBUTES sa;
sa.nLength = sizeof(sa);
sa.lpSecurityDescriptor = NULL;
sa.bInheritHandle = TRUE; // Make the returned handle inheritable.

HANDLE hMutex = CreateMutex(&sa, FALSE, NULL);

The next step to perform when using object handle inheritance is for the parent process to spawn the child process. This is done using the CreateProcess function:

BOOL CreateProcess(
   PCTSTR pszApplicationName,
   PTSTR pszCommandLine,
   PSECURITY_ATTRIBUTES psaProcess,
   PSECURITY_ATTRIBUTES psaThread,
   BOOL bInheritHandles,
   DWORD dwCreationFlags,
   PVOID pvEnvironment,
   PCTSTR pszCurrentDirectory,
   LPSTARTUPINFO pStartupInfo,
   PPROCESS_INFORMATION pProcessInformation);

The bInheritHandles parameter controls inheritance. Usually, when you spawn a process, you pass FALSE for this parameter, telling the system that you do not want the child process to inherit the inheritable handles in the parent process’ handle table. If you pass TRUE, however, the child inherits the parent’s inheritable handle values.

The content of kernel objects is stored in the kernel address space that is shared by all processes running on the system. For 32-bit systems, this is in memory between the following memory addresses: 0x80000000 and 0xFFFFFFFF. For 64-bit systems, this is in memory between the following memory addresses: 0x00000400'00000000 and 0xFFFFFFFF'FFFFFFFF.

Be aware that object handle inheritance applies only at the time the child process is spawned. If the parent process were to create any new kernel objects with inheritable handles, an already-running child process would not inherit these new handles.

Object handle inheritance has one very strange characteristic: when you use it, the child has no idea that it has inherited any handles. Kernel object handle inheritance is useful only when the child process documents the fact that it expects to be given access to a kernel object when spawned from another process. Usually, the parent and child applications are written by the same company; however, a different company can write the child application if that company documents what the child application expects.

By far, the most common way for a child process to determine the handle value of the kernel object that it’s expecting is to have the handle value passed as a command-line argument to the child process. The child process’ initialization code parses the command line (usually by calling _stscanf_s) and extracts the handle value. Once the child has the handle value, it has the same access to the object as its parent. Note that the only reason handle inheritance works is because the handle value of the shared kernel object is identical in both the parent process and the child process. This is why the parent process is able to pass the handle value as a command-line argument.
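The round trip of a handle value through the command line can be sketched portably. The names below (and the use of uintptr_t as a stand-in for HANDLE, and "Child.exe" as the child's name) are assumptions for illustration; on Windows the parent would pass the built string to CreateProcess, and the child would parse its command line with _stscanf_s:

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical sketch of the command-line technique: the parent prints
// the handle value (identical in both processes) into the child's
// command line, and the child parses it back out.
void BuildChildCommandLine(char* buf, std::size_t cb,
                           std::uintptr_t handleValue) {
    std::snprintf(buf, cb, "Child.exe %" PRIuPTR, handleValue);
}

std::uintptr_t ParseInheritedHandle(const char* cmdLine) {
    std::uintptr_t handleValue = 0;
    // Skip the executable name, then read the handle value.
    std::sscanf(cmdLine, "%*s %" SCNuPTR, &handleValue);
    return handleValue;
}
```

The technique works only because the inherited handle has the same numeric value in parent and child, as the text notes.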

Another technique is for the parent process to add an environment variable to its environment block. The variable’s name would be something that the child process knows to look for, and the variable’s value would be the handle value of the kernel object to be inherited. Then when the parent spawns the child process, the child process inherits the parent’s environment variables and can easily call GetEnvironmentVariable to obtain the inherited object’s handle value. This approach is excellent if the child process is going to spawn another child process, because the environment variables can be inherited again.
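The environment-variable technique can be sketched in portable C++. The variable name "SHARED_OBJECT_HANDLE" is made up for illustration, and setenv/getenv stand in for the Windows calls the text mentions (SetEnvironmentVariable in the parent, GetEnvironmentVariable in the child):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Portable sketch of the environment-variable technique. On Windows,
// the parent would call SetEnvironmentVariable before CreateProcess;
// the child would call GetEnvironmentVariable after startup.
void PublishHandleForChild(std::uintptr_t handleValue) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%llu",
                  static_cast<unsigned long long>(handleValue));
    setenv("SHARED_OBJECT_HANDLE", buf, 1 /* overwrite */);
}

std::uintptr_t ReadInheritedHandle() {
    const char* value = std::getenv("SHARED_OBJECT_HANDLE");
    return value
        ? static_cast<std::uintptr_t>(std::strtoull(value, nullptr, 10))
        : 0;
}
```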

For the sake of completeness, I’ll also mention the GetHandleInformation function:

BOOL GetHandleInformation(HANDLE hObject, PDWORD pdwFlags);

This function returns the current flag settings for the specified handle in the DWORD pointed to by pdwFlags. To see if a handle is inheritable, do the following:

DWORD dwFlags;
GetHandleInformation(hObj, &dwFlags);
BOOL fHandleIsInheritable = (0 != (dwFlags & HANDLE_FLAG_INHERIT));

Naming Objects

Most kernel object creation functions have a common last parameter, pszName. When you pass NULL for this parameter, you are indicating to the system that you want to create an unnamed (anonymous) kernel object. An unnamed object can be shared across processes by using either inheritance or DuplicateHandle; to share an object by name, you must give it a name when you create it.

An alternative method exists for sharing objects by name: instead of calling a Create* function, a process can call one of the Open* functions, such as OpenMutex:

HANDLE OpenMutex(
   DWORD dwDesiredAccess,
   BOOL bInheritHandle,
   PCTSTR pszName);

The last parameter, pszName, indicates the name of a kernel object. You cannot pass NULL for this parameter; you must pass the address of a zero-terminated string. These functions search the single namespace of kernel objects attempting to find a match. If no kernel object with the specified name exists, the functions return NULL and GetLastError returns 2 (ERROR_FILE_NOT_FOUND). However, if a kernel object with the specified name does exist, but it has a different type, the functions return NULL and GetLastError returns 6 (ERROR_INVALID_HANDLE). And if it is the same type of object, the system then checks to see whether the requested access (via the dwDesiredAccess parameter) is allowed. If it is, the calling process’ handle table is updated and the object’s usage count is incremented. The returned handle will be inheritable if you pass TRUE for the bInheritHandle parameter.

The main difference between calling a Create* function versus calling an Open* function is that if the object doesn’t already exist, the Create* function will create it, whereas the Open* function will simply fail.

Named objects are commonly used to prevent multiple instances of an application from running. To do this, simply call a Create* function in your _tmain or _tWinMain function to create a named object. (It doesn’t matter what type of object you create.) When the Create* function returns, call GetLastError. If GetLastError returns ERROR_ALREADY_EXISTS, another instance of your application is running and the new instance can exit. Here’s some code that illustrates this:

int WINAPI _tWinMain(HINSTANCE hInstExe, HINSTANCE, PTSTR pszCmdLine,
   int nCmdShow) {

   HANDLE h = CreateMutex(NULL, FALSE,
      TEXT("{FA531CC1-0497-11d3-A180-00105A276C3E}"));
   if (GetLastError() == ERROR_ALREADY_EXISTS) {
      // There is already an instance of this application running.
      // Close the object and immediately return.
      CloseHandle(h);
      return(0);
   }

   // This is the first instance of this application running.
   ...
   // Before exiting, close the object.
   CloseHandle(h);
   return(0);
}

 

A service’s named kernel objects always go in the global namespace. By default, in Terminal Services, an application’s named kernel object goes in the session’s namespace. However, it is possible to force the named object to go into the global namespace by prefixing the name with “Global\”, as in the following example:

HANDLE h = CreateEvent(NULL, FALSE, FALSE, TEXT("Global\\MyName"));

You can also explicitly state that you want a kernel object to go in the current session’s namespace by prefixing the name with “Local\”, as in the following example:

HANDLE h = CreateEvent(NULL, FALSE, FALSE, TEXT("Local\\MyName"));

 

When you create a kernel object, you can protect access to it by passing a pointer to a SECURITY_ATTRIBUTES structure. However, prior to the release of Windows Vista, it was not possible to protect the name of a shared object against hijacking. Any process, even one with the lowest privileges, is able to create an object with a given name. Take the previous example, where an application uses a named mutex to detect whether it is already started: you could very easily write another application that creates a kernel object with the same name. If it gets started before the singleton application, the singleton becomes a “none-gleton” because it will start and then always immediately exit, thinking that another instance of itself is already running. This is the basic mechanism behind a class of attacks known as denial-of-service (DoS) attacks. Notice that unnamed kernel objects are not subject to this kind of attack, and it is quite common for an application to use unnamed objects even though they cannot be shared by name between processes.

Duplicating Object Handles

The last technique for sharing kernel objects across process boundaries requires the use of the DuplicateHandle function:

BOOL DuplicateHandle(
   HANDLE hSourceProcessHandle,
   HANDLE hSourceHandle,
   HANDLE hTargetProcessHandle,
   PHANDLE phTargetHandle,
   DWORD dwDesiredAccess,
   BOOL bInheritHandle,
   DWORD dwOptions);

Simply stated, this function takes an entry in one process’ handle table and makes a copy of the entry into another process’ handle table. DuplicateHandle takes several parameters but is actually quite straightforward. The most general usage of the DuplicateHandle function could involve three different processes that are running in the system.

Working with Characters and Strings

The problem is that some languages and writing systems (Japanese kanji being a classic example) have so many symbols in their character sets that a single byte, which offers no more than 256 different symbols at best, is just not enough. So double-byte character sets (DBCSs) were created to support these languages and writing systems. In a double-byte character set, each character in a string consists of either 1 or 2 bytes. With kanji, for example, if the first byte is between 0x81 and 0x9F or between 0xE0 and 0xFC, you must look at the next byte to determine the full character in the string. Working with double-byte character sets is a programmer’s nightmare because some characters are 1 byte wide and some are 2 bytes wide. Fortunately, you can forget about DBCS and instead take advantage of the Unicode support provided by the Windows functions and the C run-time library.
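The lead-byte test described above can be sketched as follows. The helper names are made up for illustration (Windows itself provides an IsDBCSLeadByte function), and the ranges are the Shift-JIS kanji ranges quoted in the text:

```cpp
#include <cassert>

// Sketch of the DBCS lead-byte test: a byte in 0x81-0x9F or 0xE0-0xFC
// announces that the NEXT byte belongs to the same character.
inline bool IsDbcsLeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count characters (not bytes) in a DBCS string by skipping trail bytes.
// This is exactly the extra bookkeeping that makes DBCS painful.
int DbcsCharCount(const unsigned char* s) {
    int count = 0;
    while (*s != 0) {
        s += IsDbcsLeadByte(*s) ? 2 : 1;
        ++count;
    }
    return count;
}
```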

Unicode encodings:

  • UTF-16 encodes each character as 2 bytes (16 bits). It is the most popular and widely used encoding on Windows.
  • UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some characters as 3 bytes, and some characters as 4 bytes. Characters with a value below 0x0080 are compressed to 1 byte, which works very well for characters used in the United States. Characters between 0x0080 and 0x07FF are converted to 2 bytes, which works well for European and Middle Eastern languages. Characters of 0x0800 and above are converted to 3 bytes, which works well for East Asian languages. Finally, surrogate pairs are written out as 4 bytes. UTF-8 is an extremely popular encoding format, but it’s less efficient than UTF-16 if you encode many characters with values of 0x0800 or above.
  • UTF-32 encodes every character as 4 bytes. This encoding is useful when you want to write a simple algorithm to traverse characters (used in any language) and you don’t want to have to deal with characters taking a variable number of bytes. For example, with UTF-32, you do not need to think about surrogates because every character is 4 bytes. Obviously, UTF-32 is not an efficient encoding format in terms of memory usage. Therefore, it’s rarely used for saving or transmitting strings to a file or network. This encoding format is typically used inside the program itself.
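The UTF-8 size rules listed above can be condensed into a small function. This is a sketch of the byte-count thresholds only, not a full encoder:

```cpp
#include <cassert>
#include <cstdint>

// How many bytes UTF-8 needs for a single Unicode code point,
// following the thresholds quoted in the text.
int Utf8EncodedLength(std::uint32_t codePoint) {
    if (codePoint < 0x0080)  return 1;  // ASCII range, "compressed" to 1 byte
    if (codePoint < 0x0800)  return 2;  // European / Middle Eastern scripts
    if (codePoint < 0x10000) return 3;  // East Asian scripts (rest of the BMP)
    return 4;                           // supplementary planes (UTF-16 surrogate pairs)
}
```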

Microsoft’s C/C++ compiler defines a built-in data type, wchar_t, which represents a 16-bit Unicode (UTF-16) character.

This is how to define a string of wchar_t:

wchar_t szBuffer[100] = L"A String";

An uppercase L before a literal string informs the compiler that the string should be compiled as a Unicode string. When the compiler places the string in the program’s data section, it encodes each character using UTF-16; in this simple ASCII-only case, that means interspersing zero bytes between the ASCII characters.
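What "interspersing zero bytes" means can be checked directly. The sketch below uses char16_t as a portable stand-in for the 16-bit wchar_t of Microsoft's compiler:

```cpp
#include <cassert>

// An ASCII-only UTF-16 string: each character occupies one 16-bit code
// unit whose value equals its ASCII code (high byte zero).
const char16_t kBuffer[] = u"A String";

static_assert(sizeof(char16_t) == 2, "one UTF-16 code unit is 2 bytes");
```

For example, kBuffer[0] holds 0x0041 ('A' plus a zero byte), and the 8-character string plus its terminator occupies 18 bytes.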

Header annotations give the compiler the ability to analyze your code to see whether it is used properly. You can read more about header annotations (SAL) in the Windows SDK documentation.

 

Under Windows Vista, Microsoft’s source code for CreateWindowExA is simply a translation layer that allocates memory to convert ANSI strings to Unicode strings; the code then calls CreateWindowExW, passing the converted strings. When CreateWindowExW returns, CreateWindowExA frees its memory buffers and returns the window handle to you. So, for functions that fill buffers with strings, the system must convert from Unicode to non-Unicode equivalents before your application can process the string. Because the system must perform all these conversions, your application requires more memory and runs slower. You can make your application perform more efficiently by developing your application using Unicode from the start. Also, Windows has been known to have some bugs in these translation functions, so avoiding them also eliminates some potential bugs.

Certain functions in the Windows API, such as WinExec and OpenFile, exist solely for backward compatibility with 16-bit Windows programs that supported only ANSI strings. These methods should be avoided by today’s programs. You should replace any calls to WinExec and OpenFile with calls to the CreateProcess and CreateFile functions. Internally, the old functions call the new functions anyway. The big problem with the old functions is that they don’t accept Unicode strings and they typically offer fewer features. When you call these functions, you must pass ANSI strings. On Windows Vista, most non-obsolete functions have both Unicode and ANSI versions. However, Microsoft has started to get into the habit of producing some functions offering only Unicode versions—for example, ReadDirectoryChangesW and CreateProcessWithLogonW.

It is possible to write your source code so that it can be compiled using ANSI or Unicode characters and strings. In the WinNT.h header file, the following types and macros are defined:

#ifdef UNICODE
typedef WCHAR TCHAR, *PTCHAR, PTSTR;
typedef CONST WCHAR *PCTSTR;
#define __TEXT(quote) L##quote
#else
typedef CHAR TCHAR, *PTCHAR, PTSTR;
typedef CONST CHAR *PCTSTR;
#define __TEXT(quote) quote
#endif
#define TEXT(quote) __TEXT(quote)


When Microsoft was porting COM from 16-bit Windows to Win32, an executive decision was made that all COM interface methods requiring a string would accept only Unicode strings. This was a great decision because COM is typically used to allow different components to talk to each other and Unicode is the richest way to pass strings around. Using Unicode throughout your application makes interacting with COM easier too.

Finally, when the resource compiler compiles all your resources, the output file is a binary representation of the resources. String values in your resources (string tables, dialog box templates, menus, and so on) are always written as Unicode strings. Under Windows Vista, the system performs internal conversions if your application doesn’t define the UNICODE macro. For example, if UNICODE is not defined when you compile your source module, a call to LoadString will actually call the LoadStringA function. LoadStringA will then read the Unicode string from your resources and convert the string to ANSI. The ANSI representation of the string will be returned from the function to your application.

Any function that modifies a string exposes a potential danger: if the destination string buffer is not large enough to contain the resulting string, memory corruption occurs. Here is an example:

// The following puts 4 characters in a
// 3-character buffer, resulting in memory corruption.
WCHAR szBuffer[3] = L"";
wcscpy(szBuffer, L"abc"); // The terminating 0 is a character too!

The problem with the strcpy and wcscpy functions (and most other string manipulation functions) is that they do not accept an argument specifying the maximum size of the buffer, and therefore, the function doesn’t know that it is corrupting memory.

To secure your code, use the _s-suffixed string functions (and the StringCch*/StringCb* functions declared in StrSafe.h) rather than the classic string functions.
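The essential idea behind the secure functions can be sketched portably: the caller passes the destination capacity, so the copy can refuse to overflow instead of corrupting memory. SafeCopy below is an illustrative stand-in, not the actual wcscpy_s or StringCchCopy implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Bounded copy: fails (and leaves dst empty) rather than overflowing.
// The terminating 0 is counted as a character, just as in the example above.
bool SafeCopy(char* dst, std::size_t dstChars, const char* src) {
    std::size_t needed = std::strlen(src) + 1; // includes the terminating 0
    if (needed > dstChars) {
        if (dstChars != 0) dst[0] = '\0';      // fail safely
        return false;
    }
    std::memcpy(dst, src, needed);
    return true;
}
```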

The C run time actually allows you to provide a function of your own, which it will call when it detects an invalid parameter. Then, in this function, you can log the failure, attach a debugger, or do whatever you like. To enable this, you must first define a function that matches the following prototype:

void InvalidParameterHandler(
   PCTSTR expression,
   PCTSTR function,
   PCTSTR file,
   unsigned int line,
   uintptr_t /*pReserved*/);

You then register this function by calling _set_invalid_parameter_handler.

Why you should use Unicode:

  1. Unicode makes it easy for you to localize your application to world markets.
  2. Unicode allows you to distribute a single binary (.exe or DLL) file that supports all languages.
  3. Unicode improves the efficiency of your application because the code performs faster and uses less memory. Windows internally does everything with Unicode characters and strings, so when you pass an ANSI character or string, Windows must allocate memory and convert the ANSI character or string to its Unicode equivalent.
  4. Using Unicode ensures that your application can easily call all nondeprecated Windows functions, as some Windows functions offer versions that operate only on Unicode characters and strings.
  5. Using Unicode ensures that your code easily integrates with COM (which requires the use of Unicode characters and strings).
  6. Using Unicode ensures that your code easily integrates with the .NET Framework (which also requires the use of Unicode characters and strings).
  7. Using Unicode ensures that your code easily manipulates your own resources (where strings are always persisted as Unicode).

Tips to keep in mind while coding:

String manipulation tips:

You use the Windows function MultiByteToWideChar to convert multibyte-character strings to wide-character strings.

For many applications that open text files and process them, such as compilers, it would be convenient if, after opening a file, the application could determine whether the text file contained ANSI characters or Unicode characters. The IsTextUnicode function exported by AdvApi32.dll and declared in WinBase.h can help make this distinction:

BOOL IsTextUnicode(CONST PVOID pvBuffer, int cb, PINT pResult);

Introduction and Error Handling

This book focuses on the 64-bit architecture. Here is a quick look at what you need to know about 64-bit Windows:

  • The 64-bit Windows kernel is a port of the 32-bit Windows kernel. This means that all the details and intricacies that you’ve learned about 32-bit Windows still apply in the 64-bit world. In fact, Microsoft has modified the 32-bit Windows source code so that it can be compiled to produce a 32-bit or a 64-bit system. They have just one source-code base, so new features and bug fixes are simultaneously applied to both systems.
  • Because the kernels use the same code and underlying concepts, the Windows API is identical on both platforms. This means that you do not have to redesign or reimplement your application to work on 64-bit Windows. You can simply make slight modifications to your source code and then rebuild.
  • For backward compatibility, 64-bit Windows can execute 32-bit applications. However, your application’s performance will improve if the application is built as a true 64-bit application.
  • Because it is so easy to port 32-bit code, there are already device drivers, tools, and applications available for 64-bit Windows. Unfortunately, Visual Studio is a native 32-bit application and Microsoft seems to be in no hurry to port it to be a native 64-bit application. However, the good news is that 32-bit Visual Studio does run quite well on 64-bit Windows; it just has a limited address space for its own data structures. And Visual Studio does allow you to debug a 64-bit application.

Always distinguish between an error’s message and its code. To test for a specific failure, check the error code rather than the message text.

While debugging, it’s extremely useful to monitor the thread’s last error code. In Microsoft Visual Studio, Microsoft’s debugger supports a useful feature—you can configure the Watch window to always show you the thread’s last error code number and the text description of the error. This is done by selecting a row in the Watch window and typing $err,hr.

The debugger then shows the last error code alongside its description. Visual Studio also ships with a small utility called Error Lookup, which you can use to convert an error code number into its textual description.

If you detect an error in an application you’ve written, you might want to show the text description to the user. Windows offers a function that converts an error code into its text description. This function is called FormatMessage.

To indicate failure, simply set the thread’s last error code and then have your function return FALSE, INVALID_HANDLE_VALUE, NULL, or whatever is appropriate. To set the thread’s last error code, you simply call

VOID SetLastError(DWORD dwErrCode);

Passing into the function whatever 32-bit number you think is appropriate. I try to use codes that already exist in WinError.h, as long as the code maps well to the error I’m trying to report. If you don’t think that any of the codes in WinError.h accurately reflect the error, you can create your own code. The error code is a 32-bit number that is divided into the following fields:

  • Bits 31–30: severity (0 = success, 1 = informational, 2 = warning, 3 = error).
  • Bit 29: Microsoft/customer flag (0 = Microsoft-defined code, 1 = customer-defined code).
  • Bit 28: reserved (must be 0).
  • Bits 27–16: facility code (which system component reported the error).
  • Bits 15–0: the error code itself.
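Extracting those fields is simple bit arithmetic. This is a sketch based on the layout above; the test value 0x80070005 is the well-known "access denied" HRESULT (severity 2, Win32 facility 7, code 5):

```cpp
#include <cassert>
#include <cstdint>

// Pull the fields out of a 32-bit Windows error code.
inline unsigned Severity(std::uint32_t err)   { return (err >> 30) & 0x3;    }
inline unsigned IsCustomer(std::uint32_t err) { return (err >> 29) & 0x1;    }
inline unsigned Facility(std::uint32_t err)   { return (err >> 16) & 0x0FFF; }
inline unsigned Code(std::uint32_t err)       { return err & 0xFFFF;         }
```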


Temporaries

The likelihood of writing efficient code is very small unless you understand the origins of temporary objects, their cost, and how to eliminate them when you can.

There are three ways to initialize an object: direct initialization, such as X x(value); copy initialization from an explicit temporary, such as X x = X(value); and copy initialization via implicit conversion, such as X x = value. Only the first form of initialization is guaranteed, across compiler implementations, not to generate a temporary object. If you use forms 2 or 3, you may end up with a temporary, depending on the compiler implementation.
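The three forms can be sketched with a small Rational-like class (the class name and value are made up for illustration). Note that this discussion predates C++17, whose guaranteed copy elision now removes the temporary in forms 2 and 3 as well:

```cpp
#include <cassert>

// Minimal class with a converting constructor, to show the three forms.
struct Rational {
    int num;
    Rational(int n) : num(n) {}  // converting constructor from int
};

Rational r1(100);             // form 1: direct init, never a temporary
Rational r2 = Rational(100);  // form 2: copy init from an explicit temporary
Rational r3 = 100;            // form 3: implicit conversion, may create a temporary
```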

For example, take the form X x = value. A compiler is allowed to translate it into the equivalent of constructing a temporary from value, copy-constructing x from that temporary, and then destroying the temporary. The overall cost here is two constructors and one destructor!

If you have a function whose parameter is a const string reference, such as void g(const string &s), then an invocation of g("message") will trigger the creation of a temporary string object unless you overload g() to accept a char * as an argument: void g(const char *s).
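The overloading fix can be sketched as follows. The counters are added here only to show which overload the compiler selects; the function name g comes from the text:

```cpp
#include <cassert>
#include <string>

int g_charPtrCalls = 0;
int g_stringCalls  = 0;

// Without the char* overload, g("message") would construct a temporary
// std::string. With it, the literal is passed straight through, because
// an exact pointer match beats a user-defined conversion.
void g(const std::string&) { ++g_stringCalls; }
void g(const char*)        { ++g_charPtrCalls; }  // avoids the temporary
```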

Consider a loop that repeatedly executes a = b + 1.0;, where operator+() expects two Complex objects as arguments. A temporary Complex object gets generated to represent the constant 1.0.

The problem is that this temporary is generated over and over, every iteration through the loop. Lifting constant expressions out of a loop is a trivial and well-known optimization. The temporary generated for a = b + 1.0; is a computation whose value is constant from one iteration to the next, so why should we redo it over and over? Let’s do it once, before the loop, by initializing a named Complex object from 1.0 and using that object inside the loop.
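The hoisting can be sketched with a minimal Complex stand-in. The counter records how many times the double-to-Complex converting constructor runs: once per iteration in the naive version, once in total after hoisting:

```cpp
#include <cassert>

int conversions = 0;  // counts double -> Complex conversions

struct Complex {
    double re, im;
    Complex() : re(0.0), im(0.0) {}
    Complex(double r, double i) : re(r), im(i) {}           // not counted
    Complex(double r) : re(r), im(0.0) { ++conversions; }   // the conversion
};

Complex operator+(const Complex& x, const Complex& y) {
    return Complex(x.re + y.re, x.im + y.im);
}

// Naive: the temporary Complex(1.0) is rebuilt every iteration.
void naive(Complex& a, const Complex& b, int n) {
    for (int i = 0; i < n; ++i) a = b + 1.0;
}

// Hoisted: build the constant once, before the loop.
void hoisted(Complex& a, const Complex& b, int n) {
    Complex one(1.0);
    for (int i = 0; i < n; ++i) a = b + one;
}
```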

Passing objects by value constructs temporaries, so passing them by reference improves performance. Passing by value incurs the following overhead:

  1. Call for the class constructor and calling for all class data members constructors.
  2. Calling the copy constructor of the class for the created temp variable.
  3. After the return, calling of the class destructor and destructors for all class data members.

The code above results in six function calls and the creation of temporary objects. Passing by reference, as below, avoids them:
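The before/after listings are not shown in the notes; a hedged sketch of the pattern, using std::string as the example class:

```cpp
#include <string>

// Before: the argument is copied on every call (copy constructor on
// entry, destructor on return, plus constructor/destructor calls for
// every data member of the class).
std::size_t countByValue(std::string s) { return s.size(); }

// After: a const reference passes only an address; no object is
// constructed or destroyed for the call.
std::size_t countByRef(const std::string& s) { return s.size(); }
```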

As pointed out, the elegant form is easier to read. But on a performance-critical path you may need to forgo elegance in favor of raw performance: the second, “ugly” form is much more efficient because it creates zero temporaries.
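The two forms are not shown in the notes; a sketch consistent with the <op>= key point below, using string concatenation:

```cpp
#include <string>

// Elegant form: each operator+ returns a temporary string that is
// constructed, copied from, and destroyed.
std::string elegant(const std::string& s1, const std::string& s2,
                    const std::string& s3) {
    return s1 + s2 + s3;
}

// "Ugly" form: build the result in place with operator+=.
// Zero temporaries are created.
std::string ugly(const std::string& s1, const std::string& s2,
                 const std::string& s3) {
    std::string s = s1;
    s += s2;
    s += s3;
    return s;
}
```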

Key Points:

  • A temporary object could penalize performance twice in the form of constructor and destructor computations.
  • Declaring a constructor explicit will prevent the compiler from using it for type conversion behind your back.
  • A temporary object is often created by the compiler to fix a type mismatch. You can avoid it by function overloading.
  • Avoid object copy if you can. Pass and return objects by reference.

You can eliminate temporaries by using <op>= operators where <op> may be +, -, *, or /.

The Return Value Optimization

When returning an object by value, the compiler may create a temporary object and then copy it into the target of the assignment (e.g., in c = a + b, the variable c is the target that receives the returned value).

If we develop two versions of the same function, as follows:

And
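The two versions are not shown in the notes. A hypothetical reconstruction consistent with the description (Version 1 returns a named retVal, Version 2 an unnamed object):

```cpp
// Minimal Complex class for illustration (the original is not shown).
struct Complex {
    double re, im;
    Complex(double r = 0.0, double i = 0.0) : re(r), im(i) {}
};

// Version 1: a named local object (retVal) is returned.  The compiler in
// the measurement did not apply the RVO here.
Complex multiply1(const Complex& a, const Complex& b) {
    Complex retVal;
    retVal.re = a.re * b.re - a.im * b.im;
    retVal.im = a.re * b.im + a.im * b.re;
    return retVal;
}

// Version 2: an unnamed object is constructed in the return statement,
// allowing the compiler to build it directly in the caller's storage.
Complex multiply2(const Complex& a, const Complex& b) {
    return Complex(a.re * b.re - a.im * b.im,
                   a.re * b.im + a.im * b.re);
}
```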

The first version, to which the RVO was not applied, executed in 1.89 seconds. The second version, with the RVO applied, was much faster: 1.30 seconds.

We speculated that the difference may lie in the fact that Version 1 used a named variable (retVal) as a return value whereas Version 2 used an unnamed variable. Version 2 used a constructor call in the return statement but never named it. It may be the case that this particular compiler implementation chose to avoid optimizing away named variables.

In addition, you must also define a copy constructor to “turn on” the Return Value Optimization. If the class involved does not have a copy constructor defined, the RVO is quietly turned off.

If the compiler cannot apply the RVO on its own, you can make it more likely by providing a computational constructor:

You can now (after declaring the preceding computational constructor) use the following operator overloading, which gives the compiler an easy opportunity to apply the RVO:
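The listings are missing from the notes; a sketch of the computational-constructor technique for addition (class layout assumed):

```cpp
class Complex {
public:
    Complex(double r = 0.0, double i = 0.0) : re(r), im(i) {}
    // Computational constructor: constructs *this directly as the sum
    // x + y, so no named intermediate object is ever needed.
    Complex(const Complex& x, const Complex& y)
        : re(x.re + y.re), im(x.im + y.im) {}
    double real() const { return re; }
    double imag() const { return im; }
private:
    double re, im;
};

// The unnamed object in the return statement makes the RVO easy to apply.
inline Complex operator+(const Complex& a, const Complex& b) {
    return Complex(a, b);
}
```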


If you wanted to apply the same idea to the other arithmetic operators, you would have to add a third argument to distinguish the signatures of the computational constructors for addition, subtraction, multiplication, and division. This is the criticism against the computational constructor: It bends over backwards for the sake of efficiency and introduces “unnatural” constructors.

Key Points:

  • If you must return an object by value, the Return Value Optimization will help performance by eliminating the need for creation and destruction of a local object.
  • The application of the RVO is up to the discretion of the compiler implementation. You need to consult your compiler documentation or experiment to find if and when RVO is applied.

You will have a better shot at RVO by deploying the computational constructor.

Virtual Functions

Virtual functions seem to inflict a performance cost in several ways:

  • The vptr (virtual table pointer) must be initialized in the constructor.
  • A virtual function is invoked via pointer indirection: we must fetch the pointer to the virtual function table and then access the function at the correct offset.
  • Inlining is a compile-time decision. The compiler cannot inline virtual functions whose resolution takes place at run time.

The inability to inline a virtual function is its biggest performance penalty.

Key Points:

  • The cost of a virtual function stems from the inability to inline calls that are dynamically bound at run time. The only potential efficiency issue is the speed gained from inlining, if there is any; inlining efficiency is not an issue for functions whose cost is not dominated by call-and-return overhead.

Templates are more performance-friendly than inheritance hierarchies. They push type resolution to compile-time, which we consider to be free.
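A hedged sketch of both points (class names hypothetical): the dynamically bound call below cannot be inlined, while the template version resolves the type at compile time and can be:

```cpp
struct Shape {
    virtual ~Shape() {}
    virtual double area() const = 0;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const { return side * side; }
};

// Dynamically bound: the call goes through the vptr and cannot be inlined.
double areaVirtual(const Shape& s) { return s.area(); }

// Template alternative: the type is known at compile time, so the call is
// statically bound and eligible for inlining.
template <typename S>
double areaTemplate(const S& s) { return s.area(); }
```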

Constructors and Destructors

Always account for the overhead that results from invoking the constructors and destructors of an object.

If the constructor (or destructor) is called frequently, it is worth inlining it.

This is not to say that inheritance is fundamentally a performance obstacle. We must make a distinction between the overall computational cost, required cost, and computational penalty. The overall computational cost is the set of all instructions executed in a computation. The required cost is that subset of instructions whose results are necessary. This part of the computation is mandatory; computational penalty is the rest. This is the part of the computation that could have been eliminated by an alternative design or implementation.

Declaring data members as pointers to objects lets you initialize them whenever you want; it even lets you partially instantiate the containing object. On the other hand, allocating those objects at run time costs performance, whereas an embedded member object has its storage laid out within the enclosing object (for example, on the stack) with no separate allocation. It is a trade-off; pick whichever is more suitable for your case.

The habit of automatically defining all objects up front can be wasteful: you may construct objects that you end up not using. Instead, define objects only where you are sure you will need them. As an example, observe the packet variable in the two code fragments below:

After optimization:

This is called lazy construction.
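The two fragments referred to above are missing from the notes; a hypothetical sketch of the pattern (function and variable names invented):

```cpp
#include <string>

// Before: 'packet' is constructed up front, even when the early-exit
// path never uses it.
bool sendBefore(bool ready, const std::string& payload) {
    std::string packet = "HDR:" + payload;  // constructed unconditionally
    if (!ready)
        return false;                       // packet was wasted work
    return !packet.empty();
}

// After: construct 'packet' only on the path that actually needs it.
bool sendAfter(bool ready, const std::string& payload) {
    if (!ready)
        return false;                       // no object constructed here
    std::string packet = "HDR:" + payload;  // lazy construction
    return !packet.empty();
}
```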

When initializing a data member in a class, use the initialization list rather than assigning a value to the data member in the constructor body. In the string example, this reclaimed about 50 ms.
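A minimal sketch of the difference (class names hypothetical):

```cpp
#include <string>

// Assignment in the body: 'name' is default-constructed before the body
// runs, then assigned.  That is one extra operation per construction.
class Before {
    std::string name;
public:
    Before(const std::string& n) { name = n; }
    const std::string& get() const { return name; }
};

// Initialization list: 'name' is constructed once, directly from n.
class After {
    std::string name;
public:
    After(const std::string& n) : name(n) {}
    const std::string& get() const { return name; }
};
```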

Key Points:

  • Constructors and destructors may be as efficient as hand-crafted C code. In practice, however, they often contain overhead in the form of superfluous computations.
  • The construction (destruction) of an object triggers recursive construction (destruction) of parent and member objects. Watch out for the combinatorial explosion of objects in complex hierarchies. They make construction and destruction more expensive.
  • Make sure that your code actually uses all the objects that it creates and the computations that they perform.
  • Don’t create an object unless you are going to use it.

Compilers must initialize contained member objects prior to entering the constructor body. You ought to use the initialization phase to complete the member object creation. This will save the overhead of calling the assignment operator later in the constructor body. In some cases, it will also avoid the generation of temporary objects.

The Tracing War Story

Programmers may have different views on C++ performance depending on their respective experiences. But there are a few basic principles that we all agree on:

  • I/O is expensive.
  • Function call overhead is a factor so we should inline short, frequently called functions.
  • Copying objects is expensive. Prefer pass-by-reference over pass-by-value.

Let us see the following code sample:

As you can tell, addOne() doesn’t do much, which is exactly the point of a baseline. We are trying to isolate the performance factors one at a time. Our main() function invoked addOne() a million times and measured execution time:
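The original listings are not shown; a reconstruction of the baseline consistent with the description:

```cpp
// Reconstructed baseline: addOne() does almost nothing, so any overhead
// added later will dominate its cost.
inline int addOne(int x) { return x + 1; }

// The driver invoked it many times while measuring elapsed time.
long long drive(int iterations) {
    long long sum = 0;
    for (int i = 0; i < iterations; ++i)
        sum += addOne(i);
    return sum;  // using the result keeps the loop from being optimized away
}
```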

Next, we added a Trace object to addOne and measured again to evaluate the performance delta. This is Version 1
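The Version 1 listing is missing; this is a hedged sketch reconstructed from the overhead steps enumerated below (a local string name in addOne, and a Trace object holding a string member):

```cpp
#include <iostream>
#include <string>

// Sketch of the Trace class as Version 1 presumably used it.
class Trace {
public:
    static bool traceIsActive;
    Trace(const std::string& name) : theFunctionName(name) {
        if (traceIsActive)
            std::cout << "Enter " << theFunctionName << '\n';
    }
    ~Trace() {
        if (traceIsActive)
            std::cout << "Exit " << theFunctionName << '\n';
    }
private:
    std::string theFunctionName;  // member string: one more ctor/dtor pair
};
bool Trace::traceIsActive = false;

int addOne(int x) {
    std::string name = "addOne";  // constructed even when tracing is off
    Trace t(name);
    return x + 1;
}
```

Note that every construction and destruction above happens whether or not tracing is active.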

The cost of the for loop has skyrocketed from 55 ms to 3,500 ms. In other words, the speed of addOne has plummeted by a factor of more than 60! See graph below:

This is due to the following overhead operations performed in Version 1:

  1. Create the string name local to addOne (at the start).
  2. Invoke the Trace constructor (at the start).
  3. The Trace constructor invokes the string constructor to create the member string (at the start).
  4. Destroy the string name (at the end).
  5. Invoke the Trace destructor (at the end).
  6. The Trace destructor invokes the string destructor for the member string (at the end).

Recovery Plan:

The performance recovery plan was to eliminate objects and computations whose values get dropped when tracing is off. We started with the string argument created by addOne and handed to the Trace constructor: we changed the function-name argument from a string object to a plain char pointer.
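A sketch of the same code after this first recovery step (reconstructed; only the argument type changes, which is why overhead drops but does not disappear):

```cpp
#include <cstdio>
#include <string>

class Trace {
public:
    static bool traceIsActive;
    // The function name now travels as a plain char pointer, so addOne no
    // longer constructs a local string just to make this call.
    Trace(const char* name) : theFunctionName(name) {
        if (traceIsActive)
            std::printf("Enter %s\n", name);
    }
    ~Trace() {
        if (traceIsActive)
            std::printf("Exit %s\n", theFunctionName.c_str());
    }
private:
    std::string theFunctionName;  // still a string member, still a cost
};
bool Trace::traceIsActive = false;

int addOne(int x) {
    Trace t("addOne");  // no local string object anymore
    return x + 1;
}
```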

Now the execution dropped from 3,500 ms to 2,500 ms. See figure below:

Key Points:

  • Object definitions trigger silent execution in the form of object constructors and destructors. We call it “silent execution” as opposed to “silent overhead” because object construction and destruction are not usually overhead.
  • Just because we pass an object by reference does not guarantee good performance. Avoiding object copy helps, but it would be helpful if we didn’t have to construct and destroy the object in the first place.
  • Don’t waste effort on computations whose results are not likely to be used.
  • Don’t aim for the world record in design flexibility. All you need is a design that’s sufficiently flexible for the problem domain. A char pointer can sometimes do the simple jobs just as well, and more efficiently, than a string.

Inline. Eliminate the function call overhead that comes with small, frequently invoked function calls.

Introduction to Efficient C++ Performance Programming Techniques

In the days of assembler language programming, experienced programmers estimated the execution speed of their source code by counting the number of assembly language instructions. On some architectures, such as RISC, most assembler instructions executed in one clock cycle each. Other architectures featured wide variations in instruction to instruction execution speed, but experienced programmers were able to develop a good feel for average instruction latency. If you knew how many instructions your code fragment contained, you could estimate with accuracy the number of clock cycles their execution would consume. The mapping from source code to assembler was trivially one-to-one. The assembler code was the source code.

Some C++ statements generate three assembly instructions; others generate 300. These are considerations that C and assembler language programmers never had to worry about.

At the highest level, software efficiency is determined by the efficiency of two main ingredients:

  1. Design efficiency.
    This involves the program’s high-level design. To fix performance problems at that level you must understand the program’s big picture. To a large extent, this item is language independent. No amount of coding efficiency can provide shelter for a bad design.
  2. Coding efficiency.
    Small- to medium-scale implementation issues fall into this category. Fixing performance in this category generally involves local modifications. For example, you do not need to look very far into a code fragment in order to lift a constant expression out of a loop and prevent redundant computations. The code fragment you need to understand is limited in scope to the loop body.

Design efficiency is broken down further into two items:

  1. Algorithms and data structures.
    Technically speaking, every program is an algorithm in itself. Referring to “algorithms and data structures” actually refers to the well-known subset of algorithms for accessing, searching, sorting, compressing, and otherwise manipulating large collections of data.
    The efficiency of algorithms and data structures is necessary but not sufficient: by itself, it does not guarantee good overall program efficiency.
  2. Program decomposition.
    This involves decomposition of the overall task into communicating subtasks, object hierarchies, functions, data, and function flow. It is the program’s high-level design and includes component design as well as intercomponent communication. Few programs consist of a single component. A typical Web application interacts (via API) with a Web server, TCP sockets, and a database, at the very least. There are efficiency tricks and pitfalls with respect to crossing the API layer with each of those components.

We split coding efficiency into four items:

  1. Language constructs.
    C++ adds power and flexibility to its C ancestor. These added benefits do not come for free—some C++ language constructs may produce overhead in exchange.
  2. System architecture.
    System designers invest considerable effort to present the programmer with an idealized view of the system: infinite memory, a dedicated CPU, parallel thread execution, and uniform-cost memory access. None of these is true, of course; it just feels that way, and it is convenient to develop software as if it were. To achieve high performance, however, these architectural issues cannot be ignored, since they can impact performance drastically. When it comes to performance we must bear in mind that:
    * Memory is not infinite. It is the virtual memory system that makes it appear that way.
    * The cost of memory access is nonuniform. There are orders-of-magnitude differences among cache, main memory, and disk access.
    * Our program does not have a dedicated CPU. It gets a time slice only once in a while.
    * On a uniprocessor machine, parallel threads do not truly execute in parallel; they take turns.
  3. Libraries.
    The choice of libraries used by an implementation can also affect performance. For starters, some libraries may perform a task faster than others. Because you typically don’t have access to the library’s source code, it is hard to tell how library calls implement their services. For example, to convert an integer to a character string, you can choose between
    sprintf(string, "%d", i);
    or an integer-to-ASCII function call
    itoa(i, string);
    Which one is more efficient? Is the difference significant?
  4. Compiler optimizations.
    Simply a more descriptive name than “miscellaneous,” this category includes all those small coding tricks that don’t fit in the other coding categories, such as loop unrolling, lifting constant expressions out of loops, and similar techniques for elimination of computational redundancies. Most compilers will perform many of those optimizations for you. But you cannot count on any specific compiler to perform a specific optimization. One compiler may unroll a loop twice, another will unroll it four times, and yet another compiler will not unroll it at all. For ultimate control, you have to take coding matters into your own hands.

Software Crisis is our current inability to develop code that is simple enough to be understood, maintained, and extended by a mere mortal, yet powerful enough to provide solutions to complex problems.

To make a long story short, software performance is important and always will be. This one is not going away. As processor and communication technology march on, they redefine what “fast” means. They give rise to a new breed of bandwidth- and cycle-hungry applications that push the boundaries of technology. You never have enough horsepower. Software efficiency now becomes even more crucial than before. Whether the growth of processor speed is coming to an end or not, it will definitely trail communication speed. This puts the efficiency burden on the software. Further advances in execution speed will depend heavily on the efficiency of the software, not just the processor.

“Performance” can stand for various metrics, the most common ones being space efficiency and time efficiency. Space efficiency seeks to minimize the use of memory in a software solution. Likewise, time efficiency seeks to minimize the use of processor cycles. Time efficiency is often represented in terms of response time and throughput. Other metrics include compile time and executable size.

In discussing time efficiency, we will often mention the terms “pathlength” and “instruction count” interchangeably. Both stand for the number of assembler language instructions generated by a fragment of code. In a RISC architecture, if a code fragment exhibits a reasonable “locality of reference” (i.e., cache hits), the ratio between instruction counts and clock cycles will approximate one. On CISC architectures it may average two or more, but in any event, poor instruction counts always indicate poor execution time, regardless of processor architecture. A good instruction count is necessary but not sufficient for high performance. Consequently, it is a crude performance indicator, but still useful. It will be used in conjunction with time measurements to evaluate efficiency.
