The Importance of Data Alignment

Data alignment is not so much a part of the operating system’s memory architecture as it is a part of the CPU’s architecture.

CPUs operate most efficiently when they access properly aligned data. Data is aligned when the data’s memory address modulo the data’s size is 0. For example, a WORD value should always start on an address that is evenly divisible by 2, a DWORD value on an address that is evenly divisible by 4, and so on. When the CPU attempts to read a data value that is not properly aligned, it does one of two things: it either raises an exception or performs multiple, aligned memory accesses to read the full misaligned data value.
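
As a small illustration of that rule, here is a sketch of an alignment check (IsAligned is a hypothetical helper, not a Windows API):

#include <stddef.h>
#include <stdint.h>

// An address is aligned for a given data size when the address modulo
// that size is 0 (sizes are assumed to be powers of two: 2, 4, 8, ...).
static int IsAligned(const void *p, size_t size)
{
    return ((uintptr_t)p % size) == 0;
}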

Here is some code that accesses misaligned data:

VOID SomeFunc(PVOID pvDataBuffer) {

   // The first byte in the buffer is some byte of information
   char c = * (PBYTE) pvDataBuffer;

   // Increment past the first byte in the buffer
   pvDataBuffer = (PVOID)((PBYTE) pvDataBuffer + 1);

   // Bytes 2-5 contain a double-word value
   DWORD dw = * (DWORD *) pvDataBuffer;

   // The line above raises a data misalignment exception on some CPUs
   ...
}

Obviously, if the CPU performs multiple memory accesses, the performance of your application is hampered. At best, it will take the system twice as long to access a misaligned value as it will to access an aligned value—but the access time could be even worse! To get the best performance for your application, you’ll want to write your code so that the data is properly aligned.

Let’s take a closer look at how the x86 CPU handles data alignment. The x86 CPU contains a special bit flag in its EFLAGS register called the AC (alignment check) flag. By default, this flag is set to zero when the CPU first receives power. When this flag is zero, the CPU automatically does whatever it has to in order to successfully access misaligned data values. However, if this flag is set to 1, the CPU raises an alignment-check exception (interrupt 17) whenever there is an attempt to access misaligned data. The x86 version of Windows never alters this CPU flag bit. Therefore, you will never see a data misalignment exception occur in an application when it is running on an x86 processor. The same behavior occurs on an AMD x86-64 CPU, where, by default, the hardware takes care of misalignment fault fixup.

Now let’s turn our attention to the IA-64 CPU. The IA-64 CPU cannot automatically fix up misaligned data accesses. Instead, when a misaligned data access occurs, the CPU notifies the operating system. Windows then decides whether to raise a data misalignment exception or to execute additional instructions that silently correct the problem and allow your code to continue executing. By default, when you install Windows on an IA-64 machine, the operating system transforms a misalignment fault into an EXCEPTION_DATATYPE_MISALIGNMENT exception. However, you can alter this behavior. You can tell the system to silently correct misaligned data accesses for all threads in your process by having one of your process’ threads call the SetErrorMode function:

UINT SetErrorMode(UINT fuErrorMode);

For our discussion, the flag in question is the SEM_NOALIGNMENTFAULTEXCEPT flag. When this flag is set, the system automatically corrects misaligned data accesses. When this flag is clear, the system does not correct misaligned data accesses but instead raises data misalignment exceptions. Once you set this flag, you cannot change it back during the process’ lifetime.
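
As a sketch of what the call might look like (EnableAlignmentFixups is a hypothetical helper, not part of the Windows API):

#include <windows.h>

// Ask the system to silently fix up misaligned accesses for every thread in
// this process. As noted above, this cannot be undone once set.
void EnableAlignmentFixups()
{
    UINT oldMode = SetErrorMode(0);                      // read the current mode
    SetErrorMode(oldMode | SEM_NOALIGNMENTFAULTEXCEPT);  // restore it, plus our flag
}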

Note that changing this flag affects all threads contained within the process that owns the thread that makes the call. In other words, changing this flag will not affect any threads contained in any other processes. You should also note that a process’ error mode flags are inherited by any child processes. Therefore, you might want to temporarily reset this flag before calling the CreateProcess function (although you usually don’t do this for the SEM_NOALIGNMENTFAULTEXCEPT flag because it can’t be reset once set).

Of course, you can call SetErrorMode, passing the SEM_NOALIGNMENTFAULTEXCEPT flag, regardless of which CPU platform you are running on. However, the results are not always the same. For x86 and x64 systems, this flag is always on and cannot be turned off. You can use the Windows Reliability and Performance Monitor to see how many alignment fixups per second the system is performing. The following figure shows what the Add Counters dialog box looks like just before you add this counter to the chart:

[Figure: the Add Counters dialog box with the Alignment Fixups/sec counter selected]

What this counter really shows is the number of times per second the CPU notifies the operating system of misaligned data accesses. If you monitor this counter on an x86 machine, you’ll see that it always reports zero fixups per second. This is because the x86 CPU itself is performing the fixups and doesn’t notify the operating system. Because the x86 CPU performs the fixup instead of the operating system, accessing misaligned data on an x86 machine is not nearly as bad a performance hit as that of CPUs that require software (the Windows operating system code) to do the fixup. As you can see, simply calling SetErrorMode is enough to make your application work correctly. But this solution is definitely not the most efficient.

Microsoft’s C/C++ compiler for the IA-64 supports a special keyword called __unaligned. You use the __unaligned modifier just as you would use the const or volatile modifiers, except that the __unaligned modifier is meaningful only when applied to pointer variables. When you access data via an unaligned pointer, the compiler generates code that assumes that the data is not aligned properly and adds the additional CPU instructions necessary to access the data. The code shown here is a modified version of the code shown earlier. This new version takes advantage of the __unaligned keyword:

VOID SomeFunc(PVOID pvDataBuffer) {

   // The first byte in the buffer is some byte of information
   char c = * (PBYTE) pvDataBuffer;

   // Increment past the first byte in the buffer
   pvDataBuffer = (PVOID)((PBYTE) pvDataBuffer + 1);

   // Bytes 2-5 contain a double-word value
   DWORD dw = * (__unaligned DWORD *) pvDataBuffer;

   // The line above causes the compiler to generate additional
   // instructions so that several aligned data accesses are performed
   // to read the DWORD.
   // Note that a data misalignment exception is not raised.
   ...
}

The instructions added by the compiler are still much more efficient than letting the CPU trap the misaligned data access and having the operating system correct the problem. In fact, if you monitor the Alignment Fixups/sec counter, you’ll see that accesses via unaligned pointers have no effect on the chart. Notice, though, that the compiler generates the additional instructions even when the data happens to be properly aligned, making the code less efficient in that case.

Finally, the __unaligned keyword is not supported by the x86 version of the Microsoft Visual C/C++ compiler. I assume that Microsoft felt that this wasn’t necessary because of the speed at which the CPU itself can perform the fixups. However, this also means that the x86 compiler will generate errors when it encounters the __unaligned keyword. So if you are trying to create a single source code base for your application, you’ll want to use the UNALIGNED and UNALIGNED64 macros instead of the __unaligned keyword. The UNALIGNED* macros are defined in WinNT.h as follows:

#if defined(_M_MRX000) || defined(_M_ALPHA) || defined(_M_PPC) || \
    defined(_M_IA64) || defined(_M_AMD64)
    #define ALIGNMENT_MACHINE
    #define UNALIGNED __unaligned
    #if defined(_WIN64)
        #define UNALIGNED64 __unaligned
    #else
        #define UNALIGNED64
    #endif
#else
    #undef ALIGNMENT_MACHINE
    #define UNALIGNED
    #define UNALIGNED64
#endif

What’s Performance and Footprint Optimization?

 

Performance is an expression of the amount of work that is done during a certain period of time. The more work a program does per unit of time, the better its performance. Put differently, the performance of a program is measured by the number of input (data) units it manages to transform into output (data) units in a given time. This translates directly into the number of algorithmic steps that need to be taken to complete this transformation. For example, an algorithm that executes 10 program statements to store a name in a database performs poorly compared to one that stores the same name in five statements.

Performance can be affected by the following:

  • Performance of physical devices (e.g., a printer)
  • Performance of system resources (e.g., RAM, ROM, EPROM)
  • Performance of subsystems (e.g., third-party software)
  • Performance of communication
    • Callback functions are one of the common tools used in communication optimization.
  • Application look and feel (the GUI)
    • Common GUI problems:
      • Unexplained waiting times (e.g., no progress bar)
      • Illogical setup of the user interface
      • Problematic interface access (e.g., delays in rendering the controls the user needs first)
      • Not sufficiently aiding the learning curve (e.g., no help pop-ups)

 

When Do Performance Problems Arise

Often a program is closely tailored to fit its current intended use, causing it to run into performance problems almost as soon as the slightest alteration is made to the way it is used; more often than not this is because developers work under strict time constraints. Typical triggers include:

  • Extending program functionality
  • Code reuse
  • Test cases and target systems
  • Side effects of long-term use
    • Disk fragmentation
    • Spawned processes that never terminate
    • Memory leaks
    • Memory fragmentation
    • Files that are opened but never closed
    • Interrupts that are never cleared
    • Log files that grow too large
    • Semaphores that are claimed but never freed (locking problems)
    • Queues and arrays that exceed their maximum size
    • Buffers that wrap around when full
    • Counters that wrap to negative numbers
    • Tasks that are not handled often enough because their priority is set too low

Keep in mind the trade-off between software flexibility and performance.

Software footprint is a measure of a program’s size. Several aspects are considered:

  • Storage requirement: the space the program occupies when it is inactive and stored on the hard disk.
  • Runtime memory requirement: the amount of memory needed while the program is being executed. Relevant considerations include:
    • Compression
    • Data Structures
    • Overlay Techniques
    • Working Memory
    • Cache
    • Memory Fragmentation

The main considerations for keeping footprint sizes in check are the impact on available resources and the impact on performance.

Also, a program’s usability is affected by the size of the runtime footprint. If a program uses a lot of internal memory, it might force the operating system to start swapping memory to the hard disk and back. Remember also that programs virtually never have the sole use of the system. When there is little free internal memory left, an increasingly large part of the memory that is temporarily not needed is swapped out onto the hard disk. The chance that a certain operation will require data that is not found in memory increases. The result is a hard disk that makes a lot of noise and programs that halt and stutter, causing overall slowdown and annoyance.

The ideal program, as seen by the user, has the following characteristics:

  • It needs little user interaction.
  • It has an intuitive user interface.
  • It has a short learning curve.
  • It is highly flexible.
  • It contains, and is accompanied by, extensive but easily readable user documentation.
  • It has no waiting times during user interaction; any slow actions are performed offline, and so on.
  • It has readily available information for users at any given time.

 

The ideal program, as seen by the developer, contains the following attributes:

  • It is geared toward future developments—added functionality, handling larger bases of data.
  • It is easily maintainable.
  • It has a good and intuitive design.
  • It is accompanied by well-written technical documentation.
  • It can be passed on to any developer.

Coding Optimizations

Avoid redundant computations:

for (i = 0; i < 100; i++) {
    a[i] = m*n;
}

Should be

int k = m*n;

for (i = 0; i < 100; i++) {
    a[i] = k;
}

Caching

Caching is often associated with data and instruction caching on modern processors. But, caching opportunities appear in much wider contexts, including application-level coding and design. Caching is about remembering the results of frequent and costly computations, so that you will not have to perform those computations over and over again.

This piece of code doesn’t cache:

for (… ; !done; …) {
    done = patternMatch(pat1, pat2, isCaseSensitive());
}

This is caching:

int isSensitive = isCaseSensitive();

for (… ; !done; …) {
    done = patternMatch(pat1, pat2, isSensitive);
}

Take the class X for example:

class X {
public:
    X() : a(1), c(2) {}
    …
private:
    int  a;
    char b[4096]; // buffer
    int  c;
};

The X::X() constructor initializes members a and c. A standard-conforming compiler will lay out object X in the declaration order: members a and c will be separated by 4096 bytes of member b, and will not reside on the same cache line. The X::X() constructor is likely to suffer a cache miss when it accesses c.

Since a and c are accessed by neighboring instructions, we could be more cache-friendly by placing them next to one another:

class X {
    …
private:
    int  a;
    int  c;
    char b[4096]; // buffer
};

a and c are now more likely to reside on the same cache line. Since a is accessed prior to c, c is almost guaranteed to reside in the data cache when we need it.
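
As a quick sanity check, you could inspect the offsets at compile time. This is a C++11 sketch: XLayout is a hypothetical mirror of the reordered class, and 64 bytes is an assumed cache-line size.

#include <cstddef>

// Hypothetical mirror of the reordered X above, used only to inspect offsets.
struct XLayout {
    int  a;
    int  c;
    char b[4096];
};

// Assuming a 64-byte cache line, a and c are close enough to share one.
static_assert(offsetof(XLayout, c) - offsetof(XLayout, a) < 64,
              "a and c should fall on the same cache line");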

Precompute

Precomputing is a close relative of caching. When you cache a result, you pay the price of computing it once in the performance-critical path. When you precompute, you don’t even pay the first time. You perform the precomputation outside the performance-critical path (say, initialization-time) and never pay the price of performing a costly computation inside the performance-critical path.

In a Web server application, the server compares incoming headers against a fixed spelling, so "Accept" is not the same as "AcCepT". One way to resolve this is to uppercase the header before comparing it:

for (char *p = header; *p ; p++) { // Uppercase the header
    *p = toupper(*p);
}

if (0 == memcmp(header, "ACCEPT:", 7)) {
    …
}

This is costly. Below is a smarter solution:

char uppercaseTable[256]; // filled once at startup, reused for every request

void initLookupTable()
{
    for (int i = 0; i < 256; i++) {
        uppercaseTable[i] = toupper(i);
    }
}

for (char *p = header; *p ; p++) {
    *p = uppercaseTable[(unsigned char)*p];
}

if (0 == memcmp(header, "ACCEPT:", 7)) {
    …
}

Call initLookupTable once at the start of processing and then use the table directly throughout processing.

The 80-20 rule has many applications: 80% of the execution scenarios will traverse only 20% of your source code, and 80% of the elapsed time will be spent in 20% of the functions encountered on the execution path. The 80-20 rule is the dominating force driving the argument that premature tuning is a sin. If you randomly tune everything you can think of, not only do you waste 80% of the effort, you will also hack the design beyond repair.

An efficient use of programmer resources is to tune those 20% of the request types that appear 80% of the time.

Remember to use lazy evaluation whenever you can.
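
As a minimal C++11 sketch of lazy evaluation (the class and member names are hypothetical), the expensive value is computed only if and when it is first requested, and the result is cached for later calls:

#include <string>

class Report {
public:
    const std::string& summary() {
        if (!computed_) {                    // compute on first access only
            summary_  = buildExpensiveSummary();
            computed_ = true;
        }
        return summary_;                     // later calls reuse the cached value
    }
private:
    std::string buildExpensiveSummary() {
        return std::string(1000, 'x');       // stand-in for a costly computation
    }
    std::string summary_;
    bool computed_ = false;
};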

Prefer reference local variables over pointer local variables.

Some common compiler optimizations are keeping variables in registers and inlining. Also remember that compiling the program in Release mode is usually more efficient than Debug mode.

The speed differential among cache, RAM, and disk access is significant. Write cache-friendly code.

Reference Counting

One of the major difficulties with chaotic software is memory corruption. Allocated memory flows through the system by way of pointer passing. Pointers are being passed among modules and threads. In a chaotic software system this will result in two major difficulties.

  • Memory leaks. This happens when memory is never freed and will, over time, bring the application down when its consumption of memory gets out of hand.
  • Premature deletion. When ownership of a memory pointer is not clear, it may result in memory being accessed after it was already deleted, resulting in immediate catastrophic failure.

Fortunately, C++ offers a solution to both problems. C++ allows you to control all points of object creation, destruction, copy, and assignment. You can leverage that control to develop a form of garbage collection called reference counting. The basic idea is to transfer the object destruction responsibility from the client code to the object itself. The object keeps track of the current number of references to it and destroys itself when the reference count reaches zero. In other words, the object destroys itself when nobody is using it any longer. With the vast majority of software defects being traced back to memory corruption, reference counting is a very important technique in the C++ arsenal.
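
A minimal sketch of how a reference-counted object and its smart pointer might look (the names RCObject and RCPtr are hypothetical, and a production version would also have to consider thread safety):

// The object carries its own count and deletes itself when the count hits zero.
class RCObject {
public:
    void addReference()    { ++refCount_; }
    void removeReference() { if (--refCount_ == 0) delete this; }
protected:
    RCObject() : refCount_(0) {}
    virtual ~RCObject() {}
private:
    int refCount_;
};

// A smart pointer that maintains the count on behalf of client code.
template <class T>
class RCPtr {
public:
    explicit RCPtr(T* p = 0) : p_(p) { if (p_) p_->addReference(); }
    RCPtr(const RCPtr& rhs) : p_(rhs.p_) { if (p_) p_->addReference(); }
    ~RCPtr() { if (p_) p_->removeReference(); }
    RCPtr& operator=(const RCPtr& rhs) {
        if (p_ != rhs.p_) {
            if (rhs.p_) rhs.p_->addReference();   // add first, in case of aliasing
            if (p_)     p_->removeReference();
            p_ = rhs.p_;
        }
        return *this;
    }
    T* operator->() const { return p_; }
    T& operator*()  const { return *p_; }
private:
    T* p_;
};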

The relationship between reference counting and execution speed is context-sensitive. It depends on a few factors:

  • What is the magnitude of resource consumption by the target object? If the target object uses gigantic amounts of memory, for example, a failure to conserve memory will push the limits on available memory and lead to dramatic performance loss in the form of cache misses and page faults.
  • How expensive is it to allocate (and deallocate) the resource used by the target object? This is different from the previous factor. An object such as BigInt will typically consume a small amount of storage and is unlikely to deplete available memory. However, allocating even one byte from the heap will cost you hundreds of instructions. The same goes for deallocating a single byte.
  • How many objects are likely to share a single instance of the target object? Sharing is increased by the use of the assignment operator and copy constructor.
  • How often do we create (destroy) first (last) references to the target object? Creating a brand-new reference-counted object using a constructor other than the copy constructor creates a first reference to the target object. This is expensive; it has more overhead than simply creating the target object without reference counting. The same argument applies to the removal of a last reference.

In summary, these are the reasons to use reference-counted pointers:

  • The target object is a large resource consumer.
  • The resource in question is expensive to allocate and free.
  • A high degree of sharing; the reference count is likely to be high due to the use of the assignment operator and copy constructor.
  • The creation or destruction of a reference is relatively cheap.

If you reverse these items, you start leaning toward skipping reference counting in favor of the plain uncounted object.

STL

The Standard Template Library (STL) is a powerful combination of containers and generic algorithms. From a performance perspective, a few questions immediately come to mind:

  • The STL comes bundled with performance guarantees of the asymptotic complexity of the various containers and algorithms. What does it really mean?
  • The STL consists of many containers. Faced with a given computational task, what containers should I use? Are some better than others for a given scenario?
  • How good is the performance of the STL? Can I do better by rolling my own home-grown containers and algorithms?

Insertion Performance:

Steps taken in vector expansion:

  • Allocate a larger memory block to make room for additional elements.
  • Copy the existing collection elements to the newly allocated memory. The copy constructor is invoked for each element in the old collection.
  • Destroy the old collection and free its memory. The destructor is invoked for each element of the old collection.

To see how the copy constructor affects performance, consider the BigInt class’s copy constructor:

BigInt::BigInt(const BigInt& copyFrom) // Copy constructor
{
    size = ndigits = copyFrom.ndigits;
    digits = new char[size];
    for (unsigned i = 0; i < ndigits; ++i) {
        digits[i] = copyFrom.digits[i];
    }
}

The dramatic difference in performance is a reflection of the difference between the cost of copying and destroying plain integers and that of BigInt objects. If you find yourself in a situation where the object’s copy constructor and destructor are fairly expensive, and vector capacity growth is very likely, you could still circumvent the cost by storing pointers instead of objects. Pointers to objects do not have associated constructors and destructors; the cost of copying a pointer is essentially identical to an integer copy. The vector insertion time of a million BigInt pointers was, in fact, practically the same as the integer case.
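
A minimal sketch of the pointer-storing approach (assuming the BigInt class above provides a suitable constructor; the function names are hypothetical):

#include <cstddef>
#include <vector>

// Only pointers are copied when the vector reallocates, not BigInt objects.
void fillWithPointers(std::vector<BigInt*>& v, int count)
{
    for (int i = 0; i < count; ++i) {
        v.push_back(new BigInt(i));
    }
}

// The caller now owns the objects and must free them explicitly.
void releaseAll(std::vector<BigInt*>& v)
{
    for (std::size_t i = 0; i < v.size(); ++i) {
        delete v[i];
    }
    v.clear();
}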

The list container does not hold its elements in contiguous memory and consequently does not have to contend with capacity issues and their associated performance overhead. As a result, the list container does better than a vector on the BigInt insertion test.

This comparison brings up another important point: each container has its strengths and weaknesses. Although the vector is superior to the list on insertion of a million integers, the list outperforms the vector on a similar test involving BigInt objects.

Using reserve() will improve vector insertion dramatically. The same measurements were also made for inserting at the front, deleting at the back (list versus vector), and deleting elements at the front.
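
A minimal sketch of using reserve() to avoid repeated capacity expansion (the function name is hypothetical):

#include <vector>

// Reserving capacity up front avoids the allocate-copy-destroy cycle
// that otherwise happens each time the vector outgrows its capacity.
void fillReserved(std::vector<int>& v, int count)
{
    v.reserve(count);            // one allocation instead of many expansions
    for (int i = 0; i < count; ++i) {
        v.push_back(i);          // no reallocation until 'count' is exceeded
    }
}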

Container traversal speed:

With regard to container traversal, the performance of the vector and array was the same. Both outperformed the list container by a wide margin. The dominant factor in container traversal seemed to be the interaction between the container memory layout and the system’s cache. Both vector and array hold their collections in contiguous memory. Logically adjacent elements of the collection reside physically next to one another in memory. When a particular element is loaded into the cache, it pulls with it a few more adjacent elements that will be accessed next (the exact number depends on the element size as well as the size of a cache line). This is ideal behavior from a caching perspective. This is not the case with a list container. Logically adjacent list elements are not necessarily adjacent in memory. Moreover, the list elements must store forward and backward pointers in addition to element value, which makes them larger than the corresponding vector elements. Even if some adjacent list elements were placed next to one another in memory, fewer of them would fit on a cache line due to their size. Consequently, traversal of a list container produces far more cache misses than an array or vector traversal.

The STL generic find() function can be used to search for a value inside a container. Compare the generic find() with the multiset member function find():

The member find() utilizes the fact that the multiset is a sorted container. The generic find() is oblivious to that fact because it must handle containers that are not sorted.
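
A minimal sketch contrasting the two lookups on a sorted container (the helper name is hypothetical):

#include <algorithm>
#include <set>

bool contains(const std::multiset<int>& s, int value)
{
    // Generic algorithm: linear scan, O(n); it ignores the ordering.
    bool viaGeneric = std::find(s.begin(), s.end(), value) != s.end();

    // Member function: exploits the container's ordering, O(log n).
    bool viaMember = (s.find(value) != s.end());

    return viaGeneric && viaMember;   // both report the same answer
}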

Comparing a function object to a function pointer, the result is in line with expectations. Function pointers cannot be resolved until run time, which prevents them from being inlined. Function objects, however, are determined at compile time, which gives the compiler the freedom to inline the calls to operator() and significantly increase efficiency.
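
A minimal sketch of the same comparison expressed both ways (the names are hypothetical):

#include <algorithm>
#include <vector>

bool lessPtr(int a, int b) { return a < b; }                 // called through a pointer

struct LessFunctor {
    bool operator()(int a, int b) const { return a < b; }    // can be inlined
};

void sortBothWays(std::vector<int>& v, std::vector<int>& w)
{
    std::sort(v.begin(), v.end(), lessPtr);        // indirect call per comparison
    std::sort(w.begin(), w.end(), LessFunctor());  // comparisons may be inlined
}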

Key Points:

  • Unless you know something about the problem domain that the STL doesn’t, it is unlikely that you will beat the performance of an STL implementation by a wide enough margin to justify the effort.

  • It is possible, however, to exceed the performance of an STL implementation in some specific scenarios.

Inlining—Performance Considerations

As we have already discussed, the performance advantage of avoiding expensive method invocations is only half of the inlining performance story. The other half is cross-call optimizations. Cross-call optimizations allow the compiler to perform source- and machine-level optimizations on a method based on a more expansive contextual view of its invocation. These optimizations generally take the form of doing things at compile time to avoid the necessity of doing them at run time; for example, simple things like converting

float x = 90.0;

// nothing that changes x's value

float y = sin(x);

to

float x = 90.0;

float y = 1.0; // sin(90) = 1

Why Not Inlining?

If inlining is so good, why don’t we just inline everything? This simple question has a lot of complicated answers. Let’s start the answer with an inlining situation. Suppose we inline a method that, when compiled, contains 550 bytes of code. Further suppose that 50 bytes of the called method are associated with call prologue and epilogue (method invocation overhead). If our hypothetical method is statically invoked a dozen times (called from a dozen different locations within a program), we have just increased our program size by 5,450 instructions ((550 instructions per inlining - 50 instructions of invocation overhead) * 12 - 550 for the otherwise called version), and we have improved the execution performance of each inlined execution of the method by, say, only 10%. (Assume that the large method has 50 cycles of call overhead and that the method requires 500 cycles to execute. This is pure conjecture; some methods with 500 machine code instructions may have an average execution time of 10 cycles and others may require millions of cycles.) Thus we have roughly a 10x increase in the code size of this one method with only marginal per-invocation performance improvement. Code expansions of this magnitude, when extrapolated across all the inlineable methods within a program, will have huge negative secondary performance effects, such as cache misses and page faults, that will dwarf any supposed primary gains. Put differently, an overaggressively inlined program will execute fewer instructions, but take longer doing so. Thus, one reason why all methods are not inlined is that the code expansion that inlining creates may not be tolerable.

Another reason for not inlining everything is that some methods cannot be inlined; for example, recursive methods cannot be inlined. Suppose some method A calls itself. Any attempt to inline A would result in an infinite loop as the compiler continually attempted to insert A into A. (There are actually some cases in which a very clever compiler could inline such a function, particularly if the variable that controls the recursion is passed in as a literal.) Thus, recursive methods generally cannot be inlined (though later we will discuss some mechanisms to achieve a similar net effect). Methods that are indirectly recursive can sometimes be inlined. For example, if A calls B and B calls A, then B, although indirectly recursive, can be inlined.

Reasons for not inlining:

  • Code expansion.
  • Some versions of inline functions cause problems for the compiler.
  • Recursive inlining may cause problems.
  • Recompilation of callers is required whenever the code of an inlined function changes.

Singleton methods are good candidates to be inlined.

Trivials are small methods, generally with fewer than four simple source-level statements, that compile into ten or fewer assembly-language instructions. These methods are so small that they do not provide any possibility for significant code expansion. It is strongly recommended to inline them.
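
As a hypothetical illustration of a trivial method, a one-line accessor compiles to a handful of instructions, so inlining it removes the call overhead with essentially no code growth:

class Point {
public:
    Point(int x, int y) : x_(x), y_(y) {}
    inline int x() const { return x_; }   // trivial: ideal inlining candidate
    inline int y() const { return y_; }
private:
    int x_, y_;
};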

Key Points:

Inlining may backfire, and overly aggressive inlining will almost certainly do so. Inlining can increase code size. Large code size suffers a higher rate of cache misses and page faults than smaller code.