Sharing Variables Between Several Instances of the Same .exe or DLL

When you create a new process for an application that is already running, the system simply opens another memory-mapped view of the file-mapping object that identifies the executable file’s image and creates a new process object and a new thread object (for the primary thread). The system also assigns new process and thread IDs to these objects. By using memory-mapped files, multiple running instances of the same application can share the same code and data in RAM.

Note one small problem here. Processes use a flat address space. When you compile and link your program, all the code and data are thrown together as one large entity. The data is separated from the code but only to the extent that it follows the code in the .exe file. (See the following note for more detail.) The following illustration shows a simplified view of how the code and data for an application are loaded into virtual memory and then mapped into an application’s address space.

As an example, let’s say that a second instance of an application is run. The system simply maps the pages of virtual memory containing the file’s code and data into the second application’s address space, as shown next.

When a thread in one instance writes to a global variable, the system’s copy-on-write mechanism kicks in: the system allocates a new page of virtual memory (labeled as "New page" in the image above) and copies the contents of data page 2 into it. The first instance’s address space is then changed so that the new data page is mapped into the address space at the same location as the original data page. Now the system can let the process alter the global variable without fear of altering the data for another instance of the same application.

A similar sequence of events occurs when an application is being debugged. Let’s say that you’re running multiple instances of an application and want to debug only one instance. You access your debugger and set a breakpoint in a line of source code. The debugger modifies your code by changing one of your assembly language instructions to an instruction that causes the debugger to activate itself. So you have the same problem again. When the debugger modifies the code, it causes all instances of the application to activate the debugger when the changed assembly instruction is executed. To fix this situation, the system again uses copy-on-write memory. When the system senses that the debugger is attempting to change the code, it allocates a new block of memory, copies the page containing the instruction into the new page, and allows the debugger to modify the code in the page copy.

Sharing Static Data across Multiple Instances of an Executable or DLL

The fact that global and static data is not shared by multiple mappings of the same .exe or DLL is a safe default. However, on some occasions it is useful and convenient for multiple mappings of an .exe to share a single instance of a variable. For example, Windows offers no easy way to determine whether the user is running multiple instances of an application. But if you could get all the instances to share a single global variable, this global variable could reflect the number of instances running. When the user launches a new instance of the application, the new instance’s thread simply checks the value of the global variable (which has already been incremented by the other instances); if the count is greater than 1, the new instance notifies the user that only one instance of the application is allowed to run and then terminates.

Every .exe or DLL file image is composed of a collection of sections. By convention, each standard section name begins with a period. For example, when you compile your program, the compiler places all the code in a section called .text. The compiler also places all the uninitialized data in a .bss section and all the initialized data in a .data section.

Section Attributes

Executable Common Sections

In addition to using the standard sections created by the compiler and the linker, you can create your own sections when you compile using the following directive:

#pragma data_seg("sectionname")

So, for example, I can create a section called "Shared" that contains a single LONG value, as follows:

#pragma data_seg("Shared")
LONG g_lInstanceCount = 0;
#pragma data_seg()

When the compiler compiles this code, it creates a new section called Shared and places all the initialized data variables that it sees after the pragma in this new section. In the preceding example, the variable is placed in the Shared section. Following the variable, the #pragma data_seg() line tells the compiler to stop putting initialized variables in the Shared section and to start putting them back in the default data section. It is extremely important to remember that the compiler will store only initialized variables in the new section.

The Microsoft Visual C++ compiler offers an allocate declaration specifier, however, that does allow you to place uninitialized data in any section you desire. Take a look at the following code:

// Create Shared section & have compiler place initialized data in it.
#pragma data_seg("Shared")
 
// Initialized, in Shared section
int a = 0;
 
// Uninitialized, not in Shared section
int b;
 
// Have compiler stop placing initialized data in Shared section.
#pragma data_seg()
 
// Initialized, in Shared section
__declspec(allocate("Shared")) int c = 0;
 
// Uninitialized, in Shared section
__declspec(allocate("Shared")) int d;
 
// Initialized, not in Shared section
int e = 0;
 
// Uninitialized, not in Shared section
int f;
 
Simply telling the compiler to place certain variables in their own section is not enough to share those variables. You must also tell the linker that the variables in a particular section are to be shared. You can do this by using the /SECTION switch on the linker's command line:

/SECTION:name,attributes

Following the colon, type the name of the section for which you want to alter attributes. In our example, we want to change the attributes of the Shared section. So we’d construct our linker switch as follows:

/SECTION:Shared,RWS

After the comma, we specify the desired attributes: use R for READ, W for WRITE, E for EXECUTE, and S for SHARED. The switch shown indicates that the data in the Shared section is readable, writable, and shared. If you want to change the attributes of more than one section, you must specify the /SECTION switch multiple times—once for each section for which you want to change attributes.

You can also embed linker switches right inside your source code using this syntax:

#pragma comment(linker, "/SECTION:Shared,RWS")

This line tells the compiler to embed the preceding string inside a special section of the generated .obj file named ".drectve". When the linker combines all the .obj modules together, the linker examines each .obj module’s ".drectve" section and pretends that all the strings were passed to the linker as command-line arguments. This technique should be used all the time because it is so convenient—if you move a source code file into a new project, you don’t have to remember to set linker switches in the Visual C++ Project Properties dialog box.
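To pull the pieces together, here is a minimal sketch (the message text and structure are hypothetical) of how a shared section, the embedded /SECTION:Shared,RWS switch, and the instance counter described earlier might combine to allow only one running instance:

#include <windows.h>

// The counter lives in a shared section, so every instance sees the same value.
#pragma data_seg("Shared")
volatile LONG g_lInstanceCount = 0;   // must be initialized to land in the Shared section
#pragma data_seg()

// Mark the Shared section readable, writable, and shared.
#pragma comment(linker, "/SECTION:Shared,RWS")

int WINAPI WinMain(HINSTANCE hInstExe, HINSTANCE, PTSTR pszCmdLine, int nCmdShow) {
   LONG lCount = InterlockedIncrement(&g_lInstanceCount);
   if (lCount > 1) {
      MessageBox(NULL, TEXT("Only one instance of this application may run."),
                 TEXT("Already running"), MB_OK);
      InterlockedDecrement(&g_lInstanceCount);
      return 0;
   }

   // ... normal application work goes here ...

   InterlockedDecrement(&g_lInstanceCount);
   return 0;
}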

Although you can create shared sections, Microsoft discourages the use of shared sections for two reasons. First, sharing memory in this way can potentially violate security. Second, sharing variables means that an error in one application can affect the operation of another application because there is no way to protect a block of data from being randomly written to by an application.

A Thread’s Stack

It’s well known that each thread has its own stack. The default thread stack size is 1 MB, but you can easily override this value. The system reserves a region for the stack in the process’ virtual address space, commits only a few of the reserved pages (one or two), and marks the next reserved page as a guard page. As the stack grows into the guard page, the system commits that page, removes its guard attribute, and places the guard attribute on the following reserved page. This continues until the stack reaches the next-to-last reserved page; when that page is touched, the system raises an EXCEPTION_STACK_OVERFLOW exception and lets the thread use the page. If the program handles this exception, that page can continue to be used. If the stack then grows into the last reserved page, the system raises an EXCEPTION_ACCESS_VIOLATION (reported through the Windows Error Reporting service) and terminates the process.

The system raises an EXCEPTION_STACK_OVERFLOW exception when a thread’s last guard page is touched. If this exception is caught and the thread’s execution continues, the system will not raise the exception for this thread again because there are no more guard pages. To receive future EXCEPTION_STACK_OVERFLOW exceptions for this thread, your application must reset the guard page. This is easily accomplished by calling the C run-time library’s _resetstkoflw function (defined in malloc.h).
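Here is a minimal sketch of that recovery pattern using a structured exception handler; the recursive worker function is hypothetical and exists only to force an overflow:

#include <windows.h>
#include <malloc.h>    // _resetstkoflw

// Hypothetical worker that recurses until the stack overflows.
static void DoRiskyRecursiveWork(void) {
   BYTE buffer[4096];
   buffer[0] = 0;
   DoRiskyRecursiveWork();
}

static int FilterStackOverflow(DWORD dwCode) {
   return (dwCode == EXCEPTION_STACK_OVERFLOW)
      ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH;
}

void RunWithOverflowProtection(void) {
   __try {
      DoRiskyRecursiveWork();
   }
   __except (FilterStackOverflow(GetExceptionCode())) {
      // The guard page is gone; restore it so a future overflow raises
      // EXCEPTION_STACK_OVERFLOW again instead of killing the process.
      if (!_resetstkoflw()) {
         ExitProcess(1);   // could not recover enough stack space
      }
   }
}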

The bottommost page of a stack’s region is always reserved and never committed. This protects against accidental overwriting of other data being used by the process. Note that the stack grows downward, from higher memory addresses toward lower ones.

Another difficult bug to catch is stack underflow. To see what a stack underflow is, examine the following code:

int WINAPI WinMain (HINSTANCE hInstExe, HINSTANCE,
   PTSTR pszCmdLine, int nCmdShow) {
   BYTE aBytes[100];
   aBytes[10000] = 0; // Stack underflow
   return(0);
}

When this function’s assignment statement is executed, an attempt is made to access memory beyond the end of the thread’s stack. Of course, the compiler and the linker will not catch the bug in the code just shown, and an access violation will not necessarily be raised when the statement executes because it is possible to have another region immediately after your thread’s stack. If this happens and you attempt to access memory beyond your stack, you might corrupt memory related to another part of your process—and the system will not detect this corruption. Here is a code snippet that shows a case where the stack underflow will always trigger a corruption because a memory block is allocated just after the stack of a thread:

DWORD WINAPI ThreadFunc(PVOID pvParam) {
   BYTE aBytes[0x10];
   // Figure out where the stack is in the virtual address space
   MEMORY_BASIC_INFORMATION mbi;
   SIZE_T size = VirtualQuery(aBytes, &mbi, sizeof(mbi));
   // Allocate a block of memory just after the 1 MB stack
   SIZE_T s = (SIZE_T)mbi.AllocationBase + 1024*1024;
   PBYTE pAddress = (PBYTE)s;
   BYTE* pBytes = (BYTE*)VirtualAlloc(pAddress, 0x10000, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
   // Trigger an unnoticeable stack underflow
   aBytes[0x10000] = 1; // Write in the allocated block, past the stack
   ...
   return(0);
}

Windows Memory Architecture

The memory architecture used by an operating system is the most important key to understanding how the operating system does what it does. When you start working with a new operating system, many questions come to mind. “How do I share data between two applications?”, “Where does the system store the information I’m looking for?”, and “How can I make my program run more efficiently?” are just a few.

Every process is given its very own virtual address space. For 32-bit processes, this address space is 4 GB because a 32-bit pointer can have any value from 0x00000000 through 0xFFFFFFFF. This range allows a pointer to have one of 4,294,967,296 values, which covers a process’ 4-GB range. For 64-bit processes, this address space is 16 EB (exabytes) because a 64-bit pointer can have any value from 0x00000000’00000000 through 0xFFFFFFFF’FFFFFFFF. This range allows a pointer to have one of 18,446,744,073,709,551,616 values, which covers a process’ 16-EB range. This is quite a range!

Because every process receives its own private address space, when a thread in a process is running, that thread can access memory that belongs only to its process. The memory that belongs to all other processes is hidden and inaccessible to the running thread.

In Windows, the memory belonging to the operating system itself is also hidden from the running thread, which means that the thread cannot accidentally access the operating system’s data.

Before you get all excited about having so much address space for your application, keep in mind that this is virtual address space—not physical storage. This address space is simply a range of memory addresses. Physical storage needs to be assigned or mapped to portions of the address space before you can successfully access data without raising access violations.

Each process’ virtual address space is split into partitions. The address space is partitioned based on the underlying implementation of the operating system. Partitions vary slightly among the different Microsoft Windows kernels. The table below shows how each platform partitions a process’ address space.

Null-Pointer Assignment Partition

The partition of the process’ address space from 0x00000000 to 0x0000FFFF inclusive is set aside to help programmers catch NULL-pointer assignments. If a thread in your process attempts to read from or write to a memory address in this partition, an access violation is raised.

Error checking is often not performed religiously in C/C++ programs. For example, the following code performs no error checking:

int* pnSomeInteger = (int*) malloc(sizeof(int));

*pnSomeInteger = 5;

If malloc cannot find enough memory to satisfy the request, it returns NULL. However, this code doesn’t check for that possibility—it assumes that the allocation was successful and proceeds to access memory at address 0x00000000. Because this partition of the address space is off-limits, a memory access violation occurs and the process is terminated. This feature helps developers find bugs in their applications. Notice that you can’t even reserve virtual memory in this address range with functions of the Win32 application programming interface (API).
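A minimal sketch of the checked version might look like this:

int* pnSomeInteger = (int*) malloc(sizeof(int));
if (pnSomeInteger != NULL) {
   *pnSomeInteger = 5;   // safe: the allocation succeeded
   free(pnSomeInteger);
} else {
   // Handle the out-of-memory condition instead of touching address 0x00000000.
}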

User-Mode Partition

This partition is where the process’ address space resides. The usable address range and approximate size of the user-mode partition depend on the CPU architecture, as shown in the next table.

A process cannot use pointers to read from, write to, or in any way access another process’ data residing in this partition. For all applications, this partition is where the bulk of the process’ data is maintained. Because each process gets its own partition for data, applications are far less likely to be corrupted by other applications, making the whole system more robust.

In Windows, all .exe and dynamic-link library (DLL) modules load in this area. Each process might load these DLLs at a different address within this partition (although this is very unlikely). The system also maps all memory-mapped files accessible to this process within this partition.

When I first looked at my 32-bit process’ address space, I was surprised to see that the amount of usable address space was less than half of my process’ overall address space. After all, does the kernel-mode partition really need the top half of the address space? Actually, the answer is yes. The system needs this space for the kernel code, device driver code, device I/O cache buffers, nonpaged pool allocations, process page tables, and so on. In fact, Microsoft is squeezing the kernel into this 2-GB space. In 64-bit Windows, the kernel finally gets the room it truly needs.

Getting a Larger User-Mode Partition in x86 Windows

Some applications, such as Microsoft SQL Server, would benefit from a user-mode address space larger than 2 GB in order to improve performance and scalability by having more application data addressable. So the x86 version of Windows offers a mode to increase the user-mode partition up to a maximum of 3 GB. To have all processes use a larger-than-2-GB user-mode partition and a smaller-than-1-GB kernel-mode partition, you need to configure the boot configuration data (BCD) in Windows and then reboot the machine. (Read the white paper available at http://www.microsoft.com/whdc/system/platform/firmware/bcd.mspx for more details about the BCD.)

To configure the BCD, you need to execute BCDEdit.exe with the /set switch with the IncreaseUserVA parameter. For example, bcdedit /set IncreaseUserVa 3072 tells Windows to reserve, for all processes, a 3-GB user-mode address space region and a 1-GB kernel-mode address space region. The “x86 w/3 GB” row in Table 13-2 shows how the address space looks when the IncreaseUserVa value is set to 3072. The minimum value accepted for IncreaseUserVa is 2048, corresponding to the 2-GB default. If you want to explicitly reset this parameter, execute the following command: bcdedit /deletevalue IncreaseUserVa.

When you need to figure out the current value of the parameters of the BCD, simply type bcdedit /enum on the command line. (Go to http://msdn2.microsoft.com/en-us/library/aa906211.aspx for more information about BCDEdit parameters.)

Some applications assume that a user-mode address never has its high bit set, so Microsoft had to create a solution that allows such applications to work in a large user-mode address space environment. When the system is about to run an application, it checks to see if the application was linked with the /LARGEADDRESSAWARE linker switch. If so, the application is claiming that it does not do anything funny with memory addresses and is fully prepared to take advantage of a large user-mode address space. On the other hand, if the application was not linked with the /LARGEADDRESSAWARE switch, the operating system reserves any user-mode space between 2 GB and the start of kernel mode. This prevents any memory allocations from being created at a memory address whose high bit is set.

Note that all the code and data required by the kernel is squeezed tightly into a 2-GB partition. So reducing the kernel address space to less than 2 GB restricts the number of threads, stacks, and other resources that the system can create. In addition, the system can use a maximum of only 64 GB of RAM, unlike the 128-GB maximum available when the default of 2 GB is used.

An executable’s LARGEADDRESSAWARE flag is checked when the operating system creates the process’ address space. The system ignores this flag for DLLs. DLLs must be written to behave correctly in a large 2+ GB user-mode partition or their behavior is undefined.

In 64-bit Windows, the 8-TB user-mode partition looks greatly out of proportion to the 16,777,208-TB kernel-mode partition. It’s not that the kernel-mode partition requires all of this virtual address space. It’s just that a 64-bit address space is enormous and most of that address space is unused. The system allows our applications to use 8 TB and allows the kernel to use what it needs; the majority of the kernel-mode partition is just not used. Fortunately, the system does not require any internal data structures to maintain the unused portions of the kernel-mode partition.

Regions in an Address Space

When a process is created and given its address space, the bulk of this usable address space is free, or unallocated. To use portions of this address space, you must allocate regions within it by calling VirtualAlloc. The act of allocating a region is called reserving.

Whenever you reserve a region of address space, the system ensures that the region begins on an allocation granularity boundary. The allocation granularity can vary from one CPU platform to another. However, as of this writing, all the CPU platforms use the same allocation granularity of 64 KB—that is, allocation requests are rounded to a 64-KB boundary.

When you reserve a region of address space, the system ensures that the size of the region is a multiple of the system’s page size. A page is a unit of memory that the system uses in managing memory. Like the allocation granularity, the page size can vary from one CPU to another. The x86 and x64 systems use a 4-KB page size, but the IA-64 uses an 8-KB page size.
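If you want to see these values for the machine you are running on, a minimal sketch using GetSystemInfo looks like this:

#include <windows.h>
#include <stdio.h>

int main(void) {
   SYSTEM_INFO si;
   GetSystemInfo(&si);   // fills in page size, allocation granularity, address limits
   printf("Page size:              %lu bytes\n", si.dwPageSize);
   printf("Allocation granularity: %lu bytes\n", si.dwAllocationGranularity);
   printf("Lowest user address:    %p\n", si.lpMinimumApplicationAddress);
   printf("Highest user address:   %p\n", si.lpMaximumApplicationAddress);
   return 0;
}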

Sometimes the system reserves regions of address space on behalf of your process. For example, the system allocates a region of address space to store a process environment block (PEB). A PEB is a small data structure created, manipulated, and destroyed entirely by the system. When a process is created, the system allocates a region of address space for the PEB.

The system also needs to create thread environment blocks (TEBs) to help manage all the threads that currently exist in the process. The regions for these TEBs will be reserved and released as threads in the process are created and destroyed.

Although the system demands that any of your requests to reserve address space regions begin on an allocation granularity boundary (64 KB on all platforms), the system itself is not subjected to the same limitation. It is extremely likely that the region reserved for your process’ PEB and TEBs will not start on a 64-KB boundary. However, these reserved regions will still have to be a multiple of the CPU’s page size.

If you attempt to reserve a 10-KB region of address space, the system will automatically round up your request and reserve a region whose size is a multiple of the page size. This means that on x86 and x64 systems, the system will reserve a region that is 12 KB; on an IA-64 system, the system will reserve a 16-KB region.

When your program’s algorithms no longer need to access a reserved region of address space, the region should be freed. This process is called releasing the region of address space and is accomplished by calling the VirtualFree function.

Committing Physical Storage within a Region

To use a reserved region of address space, you must allocate physical storage and then map this storage to the reserved region. This process is called committing physical storage. Physical storage is always committed in pages. To commit physical storage to a reserved region, you again call the VirtualAlloc function.

When you commit physical storage to regions, you do not have to commit physical storage to the entire region. For example, you can reserve a region that is 64 KB and then commit physical storage to the second and fourth pages within the region. The figure below shows what a process’ address space might look like. Notice that the address space is different depending on which CPU platform you’re running on. The address space on the left shows what happens on x86/x64 machines (which have 4-KB pages), and the address space on the right shows what happens on an IA-64 machine (which has 8-KB pages).

When your program’s algorithms no longer need to access committed physical storage in the reserved region, the physical storage should be freed. This process is called decommitting the physical storage and is accomplished by calling the VirtualFree function.
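The reserve/commit/decommit/release cycle can be sketched as follows (a minimal example, not production code):

#include <windows.h>

void ReserveCommitAndRelease(void) {
   SYSTEM_INFO si;
   GetSystemInfo(&si);

   // Reserve a 64-KB region of address space (no physical storage yet).
   PBYTE pRegion = (PBYTE) VirtualAlloc(NULL, 64 * 1024, MEM_RESERVE, PAGE_READWRITE);
   if (pRegion == NULL) return;

   // Commit physical storage to the second page of the region only.
   PBYTE pPage = pRegion + si.dwPageSize;
   if (VirtualAlloc(pPage, si.dwPageSize, MEM_COMMIT, PAGE_READWRITE) != NULL) {
      pPage[0] = 42;                                     // safe: this page is committed
      VirtualFree(pPage, si.dwPageSize, MEM_DECOMMIT);   // give the storage back
   }

   VirtualFree(pRegion, 0, MEM_RELEASE);                 // release the reserved region
}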

Physical Storage and the Paging File

In older operating systems, physical storage was considered to be the amount of RAM that you had in your machine. In other words, if you had 16 MB of RAM in your machine, you could load and run applications that used up to 16 MB of RAM. Today’s operating systems have the ability to make disk space look like memory. The file on the disk is typically called a paging file, and it contains the virtual memory that is available to all processes.

Of course, for virtual memory to work, a great deal of assistance is required from the CPU itself. When a thread attempts to access a byte of storage, the CPU must know whether that byte is in RAM or on the disk.

From an application’s perspective, a paging file transparently increases the amount of RAM (or storage) that the application can use. If you have 1 GB of RAM in your machine and also have a 1-GB paging file on your hard disk, the applications you’re running believe that your machine has a grand total of 2 GB of RAM.

Of course, you don’t actually have 2 GB of RAM. Instead, the operating system, in coordination with the CPU, saves portions of RAM to the paging file and loads portions of the paging file back into RAM as the running applications need them. Because a paging file increases the apparent amount of RAM available for applications, the use of a paging file is optional. If you don’t have a paging file, the system just thinks that there is less RAM available for applications to use. However, users are strongly encouraged to use paging files so that they can run more applications and those applications can work on larger data sets. It is best to think of physical storage as data stored in a paging file on a disk drive (usually a hard disk drive). So when an application commits physical storage to a region of address space by calling the VirtualAlloc function, space is actually allocated from a file on the hard disk. The size of the system’s paging file is the most important factor in determining how much physical storage is available to applications; the amount of RAM you have has very little effect.
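A minimal sketch of how to query these numbers at run time uses GlobalMemoryStatusEx:

#include <windows.h>
#include <stdio.h>

int main(void) {
   MEMORYSTATUSEX msx = { sizeof(msx) };
   if (GlobalMemoryStatusEx(&msx)) {
      printf("Physical RAM:   %llu MB\n", msx.ullTotalPhys / (1024 * 1024));
      // RAM plus the paging file(s): the total physical storage available to commit.
      printf("Commit limit:   %llu MB\n", msx.ullTotalPageFile / (1024 * 1024));
      printf("Virtual (user): %llu MB\n", msx.ullTotalVirtual / (1024 * 1024));
   }
   return 0;
}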

Now when a thread in your process attempts to access a block of data in the process’ address space, one of two things can happen, as shown in the simplified flowchart in Figure below.

The more often the system needs to copy pages of memory to the paging file and vice versa, the more your hard disk thrashes and the slower the system runs. (Thrashing means that the operating system spends all its time swapping pages in and out of memory instead of running programs.) Thus by adding more RAM to your computer, you reduce the amount of thrashing necessary to run your applications, which will, of course, greatly improve the system’s performance. So here is a general rule of thumb: to make your machine run faster, add more RAM. In fact, for most situations, you’ll get a better performance boost from adding RAM than you will by getting a faster CPU.

Physical Storage not Maintained in the Paging File

After reading the previous section, you must be thinking that the paging file can get pretty large if many programs are all running at once—especially if you’re thinking that every time you run a program the system must reserve regions of address space for the process’ code and data, commit physical storage to these regions, and then copy the code and data from the program’s file on the hard disk to the committed physical storage in the paging file.

The system does not do what I just described; if it did, it would take a very long time to load a program and start it running. Instead, when you invoke an application, the system opens the application’s .exe file and determines the size of the application’s code and data. Then the system reserves a region of address space and notes that the physical storage associated with this region is the .exe file itself. That’s right—instead of allocating space from the paging file, the system uses the actual contents, or image, of the .exe file as the program’s reserved region of address space. This, of course, makes loading an application very fast and allows the size of the paging file to remain small.

When a program’s file image (that is, an .exe or a DLL file) on the hard disk is used as the physical storage for a region of address space, it is called a memory-mapped file. When an .exe or a DLL is loaded, the system automatically reserves a region of address space and maps the file’s image to this region. However, the system also offers a set of functions that allow you to map data files to a region of address space.
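For data files, a minimal sketch of that function set (the file name here is hypothetical) looks like this:

#include <windows.h>

void ReadFirstByteOfMappedFile(void) {
   HANDLE hFile = CreateFile(TEXT("C:\\SomeData.bin"), GENERIC_READ, FILE_SHARE_READ,
                             NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
   if (hFile == INVALID_HANDLE_VALUE) return;

   HANDLE hMapping = CreateFileMapping(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
   if (hMapping != NULL) {
      // The file itself backs this region; no paging-file space is consumed.
      PBYTE pView = (PBYTE) MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0);
      if (pView != NULL) {
         BYTE firstByte = pView[0];   // reading the view reads the file's contents
         (void) firstByte;
         UnmapViewOfFile(pView);
      }
      CloseHandle(hMapping);
   }
   CloseHandle(hFile);
}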

Image files executed from floppy disks are an exception: they are copied entirely into RAM (and backed by the paging file). Microsoft was forced to make image files executed from floppies work this way so that setup applications would work correctly. Often a setup program begins with one floppy, which the user removes from the drive in order to insert another floppy. If the system needs to go back to the first floppy to load some of the .exe’s or the DLL’s code, it is, of course, no longer in the floppy drive. However, because the system copied the file into RAM, it has no trouble accessing the setup program.

The system does not copy image files on other removable media, such as CD-ROMs or network drives, into RAM unless the image is linked using the /SWAPRUN:CD or /SWAPRUN:NET switch.

Protection Attributes

Individual pages of committed physical storage can be assigned different protection attributes. The protection attributes are shown in the table below.

Some malware applications write code into areas of memory intended for data (such as a thread’s stack) and then the application executes the malicious code. Windows’ Data Execution Prevention (DEP) feature provides protection against this type of malware attack. With DEP enabled, the operating system uses the PAGE_EXECUTE_* protections only on regions of memory that are intended to have code execute; other protections (typically PAGE_READWRITE) are used for regions of memory intended to have data in them (such as thread stacks and the application’s heaps.)

Region Physical Storage Types:

The Importance of Data Alignment

Data alignment is not so much a part of the operating system’s memory architecture as it is a part of the CPU’s architecture.

CPUs operate most efficiently when they access properly aligned data. Data is aligned when the data’s memory address modulo the data’s size is 0. For example, a WORD value should always start on an address that is evenly divisible by 2, a DWORD value should always start on an address that is evenly divisible by 4, and so on. When the CPU attempts to read a data value that is not properly aligned, the CPU will do one of two things: it will either raise an exception, or it will perform multiple, aligned memory accesses to read the full misaligned data value.

Here is some code that accesses misaligned data:

VOID SomeFunc(PVOID pvDataBuffer) {

   // The first byte in the buffer is some byte of information
   char c = * (PBYTE) pvDataBuffer;

   // Increment past the first byte in the buffer
   pvDataBuffer = (PVOID)((PBYTE) pvDataBuffer + 1);

   // Bytes 2-5 contain a double-word value
   DWORD dw = * (DWORD *) pvDataBuffer;

   // The line above raises a data misalignment exception on some CPUs
...

Obviously, if the CPU performs multiple memory accesses, the performance of your application is hampered. At best, it will take the system twice as long to access a misaligned value as it will to access an aligned value—but the access time could be even worse! To get the best performance for your application, you’ll want to write your code so that the data is properly aligned.

Let’s take a closer look at how the x86 CPU handles data alignment. The x86 CPU contains a special bit flag in its EFLAGS register called the AC (alignment check) flag. By default, this flag is set to zero when the CPU first receives power. When this flag is zero, the CPU automatically does whatever it has to in order to successfully access misaligned data values. However, if this flag is set to 1, the CPU issues an INT 17H interrupt whenever there is an attempt to access misaligned data. The x86 version of Windows never alters this CPU flag bit. Therefore, you will never see a data misalignment exception occur in an application when it is running on an x86 processor. The same behavior happens when running on an AMD x86-64 CPU, where, by default, the hardware takes care of misalignment fault fixup.

Now let’s turn our attention to the IA-64 CPU. The IA-64 CPU cannot automatically fix up misaligned data accesses. Instead, when a misaligned data access occurs, the CPU notifies the operating system. Windows now decides if it should raise a data misalignment exception—or it can execute additional instructions that silently correct the problem and allow your code to continue executing. By default, when you install Windows on an IA-64 machine, the operating system automatically transforms a misalignment fault into an EXCEPTION_DATATYPE_MISALIGNMENT exception. However, you can alter this behavior. You can tell the system to silently correct misaligned data accesses for all threads in your process by having one of your process’ threads call the SetErrorMode function:

UINT SetErrorMode(UINT fuErrorMode);

For our discussion, the flag in question is the SEM_NOALIGNMENTFAULTEXCEPT flag. When this flag is set, the system automatically corrects for misaligned data accesses. When this flag is reset, the system does not correct for misaligned data accesses but instead raises data misalignment exceptions. Once you change this flag, you can’t update it again during the process’ lifetime.
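A minimal sketch of turning the flag on without disturbing any error-mode flags that are already set (GetErrorMode requires Windows Vista or later) might look like this:

// Enable automatic fixups of misaligned accesses for the whole process,
// preserving whatever error-mode flags were already in effect.
UINT uCurrentMode = GetErrorMode();
SetErrorMode(uCurrentMode | SEM_NOALIGNMENTFAULTEXCEPT);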

Note that changing this flag affects all threads contained within the process that owns the thread that makes the call. In other words, changing this flag will not affect any threads contained in any other processes. You should also note that a process’ error mode flags are inherited by any child processes. Therefore, you might want to temporarily reset this flag before calling the CreateProcess function (although you usually don’t do this for the SEM_NOALIGNMENTFAULTEXCEPT flag because it can’t be reset once set).

Of course, you can call SetErrorMode, passing the SEM_NOALIGNMENTFAULTEXCEPT flag, regardless of which CPU platform you are running on. However, the results are not always the same. For x86 and x64 systems, this flag is always on and cannot be turned off. You can use the Windows Reliability and Performance Monitor to see how many alignment fixups per second the system is performing. The following figure shows what the Add Counters dialog box looks like just before you add this counter to the chart:


What this counter really shows is the number of times per second the CPU notifies the operating system of misaligned data accesses. If you monitor this counter on an x86 machine, you’ll see that it always reports zero fixups per second. This is because the x86 CPU itself is performing the fixups and doesn’t notify the operating system. Because the x86 CPU performs the fixup instead of the operating system, accessing misaligned data on an x86 machine is not nearly as bad a performance hit as that of CPUs that require software (the Windows operating system code) to do the fixup. As you can see, simply calling SetErrorMode is enough to make your application work correctly. But this solution is definitely not the most efficient.

Microsoft’s C/C++ compiler for the IA-64 supports a special keyword called __unaligned. You use the __unaligned modifier just as you would use the const or volatile modifiers, except that the __unaligned modifier is meaningful only when applied to pointer variables. When you access data via an unaligned pointer, the compiler generates code that assumes that the data is not aligned properly and adds the additional CPU instructions necessary to access the data. The code shown here is a modified version of the code shown earlier. This new version takes advantage of the __unaligned keyword:

VOID SomeFunc(PVOID pvDataBuffer) {

   // The first byte in the buffer is some byte of information
   char c = * (PBYTE) pvDataBuffer;

   // Increment past the first byte in the buffer
   pvDataBuffer = (PVOID)((PBYTE) pvDataBuffer + 1);

   // Bytes 2-5 contain a double-word value
   DWORD dw = * (__unaligned DWORD *) pvDataBuffer;

   // The line above causes the compiler to generate additional
   // instructions so that several aligned data accesses are performed
   // to read the DWORD.
   // Note that a data misalignment exception is not raised.
...

The instructions added by the compiler are still much more efficient than letting the CPU trap the misaligned data access and having the operating system correct the problem. In fact, if you monitor the Alignment Fixups/sec counter, you’ll see that accesses via unaligned pointers have no effect on the chart. Notice that the compiler generates the additional instructions even when the data happens to be aligned, which makes the code less efficient in that case.

Finally, the __unaligned keyword is not supported by the x86 version of the Microsoft Visual C/C++ compiler. I assume that Microsoft felt that this wasn’t necessary because of the speed at which the CPU itself can perform the fixups. However, this also means that the x86 compiler will generate errors when it encounters the __unaligned keyword. So if you are trying to create a single source code base for your application, you’ll want to use the UNALIGNED and UNALIGNED64 macros instead of the __unaligned keyword. The UNALIGNED* macros are defined in WinNT.h as follows:

#if defined(_M_MRX000) || defined(_M_ALPHA) || defined(_M_PPC) || \
    defined(_M_IA64) || defined(_M_AMD64)
   #define ALIGNMENT_MACHINE
   #define UNALIGNED __unaligned
   #if defined(_WIN64)
      #define UNALIGNED64 __unaligned
   #else
      #define UNALIGNED64
   #endif
#else
   #undef ALIGNMENT_MACHINE
   #define UNALIGNED
   #define UNALIGNED64
#endif

Fibers

      Microsoft added fibers to Windows to make it easy to port existing UNIX server applications to Windows. UNIX server applications are single-threaded (by the Windows definition) but can serve multiple clients. In other words, the developers of UNIX applications have created their own threading architecture library, which they use to simulate pure threads. This threading package creates multiple stacks, saves certain CPU registers, and switches among them to service the client requests.

      Obviously, to get the best performance, these UNIX applications must be redesigned; the simulated threading library should be replaced with the pure threads offered by Windows. However, this redesign can take several months or longer to complete, so companies are first porting their existing UNIX code to Windows so that they can ship something to the Windows market.

      To help companies port their code more quickly and correctly to Windows, Microsoft added fibers to the operating system.

      In this post, we’ll examine the concept of a fiber, the functions that manipulate fibers, and how to take advantage of fibers. Keep in mind, of course, that you should avoid fibers in favor of more properly designed applications that use Windows native threads.

      The first thing to note is that the Windows kernel implements threads. The operating system has intimate knowledge of threads, and it schedules them according to the algorithm defined by Microsoft. A fiber is implemented in user-mode code; the kernel does not have knowledge of fibers, and they are scheduled according to the algorithm you define. Because you define the fiber scheduling algorithm, fibers are non-preemptively scheduled as far as the kernel is concerned.

      The next thing to be aware of is that a single thread can contain one or more fibers. As far as the kernel is concerned, a thread is preemptively scheduled and is executing code. However, the thread executes one fiber’s code at a time—you decide which fiber.

      The first step you must perform when you use fibers is to turn your existing thread into a fiber. You do this by calling ConvertThreadToFiber:

      PVOID ConvertThreadToFiber(PVOID pvParam);

      This function allocates memory (about 200 bytes) for the fiber’s execution context. This execution context consists of the following elements:

      1) A user-defined value that is initialized to the value passed to ConvertThreadToFiber‘s pvParam argument

      2) The head of a structured exception-handling chain

      3) The top and bottom memory addresses of the fiber’s stack (When you convert a thread to a fiber, this is also the thread’s stack.)

      4) Various CPU registers, including a stack pointer, an instruction pointer, and others

      By default, on an x86 system, the CPU’s floating-point state information is not part of the CPU registers that are maintained on a per-fiber basis, which can cause data corruption to occur if your fiber performs floating-point operations. To override the default, you should call the new ConvertThreadToFiberEx function, which allows you to pass FIBER_FLAG_FLOAT_SWITCH for the dwFlags parameter:

      PVOID ConvertThreadToFiberEx(PVOID pvParam, DWORD dwFlags);

      After you allocate and initialize the fiber execution context, you associate the address of the execution context with the thread. The thread has been converted to a fiber, and the fiber is running on this thread. ConvertThreadToFiber actually returns the memory address of the fiber’s execution context. You need to use this address later, but you should never read from or write to the execution context data yourself—the fiber functions manipulate the contents of the structure for you when necessary. Now if your fiber (thread) returns or calls ExitThread, the fiber and thread both die.

      There is no reason to convert a thread to a fiber unless you plan to create additional fibers to run on the same thread. To create another fiber, the thread (the currently running fiber) calls CreateFiber:

      PVOID CreateFiber(DWORD dwStackSize, PFIBER_START_ROUTINE pfnStartAddress, PVOID pvParam);

      CreateFiber first attempts to create a new stack whose size is indicated by the dwStackSize parameter. Usually 0 is passed, which, by default, creates a stack that can grow to 1 MB in size but initially has two pages of storage committed to it. If you specify a nonzero size, a stack is reserved and committed using the specified size. If you are using a lot of fibers, you might want to consume less memory for their respective stacks. In that case, instead of calling CreateFiber, you can use the following function:

      PVOID CreateFiberEx(SIZE_T dwStackCommitSize, SIZE_T dwStackReserveSize, DWORD dwFlags, PFIBER_START_ROUTINE pStartAddress, PVOID pvParam);

      The dwStackCommitSize parameter sets the part of the stack that is initially committed. The dwStackReserveSize parameter allows you to reserve an amount of virtual memory. The dwFlags parameter accepts the same FIBER_FLAG_FLOAT_SWITCH value as ConvertThreadToFiberEx does to add the floating-point state to the fiber context. The other parameters are the same as for CreateFiber.

      Next, CreateFiber(Ex) allocates a new fiber execution context structure and initializes it. The user-defined value is set to the value passed to the pvParam parameter, the top and bottom memory addresses of the new stack are saved, and the memory address of the fiber function (passed as the pfnStartAddress argument) is saved.

      The pfnStartAddress argument specifies the address of a fiber routine that you must implement and that must have the following prototype:

      VOID WINAPI FiberFunc(PVOID pvParam);

      When the fiber is scheduled for the first time, this function executes and is passed the pvParam value that was originally passed to CreateFiber. You can do whatever you like in this fiber function. However, the function is prototyped as returning VOID—not because the return value has no meaning, but because this function should never return at all! If a fiber function does return, the thread and all the fibers created on it are destroyed immediately.

      Like ConvertThreadToFiber(Ex), CreateFiber(Ex) returns the memory address of the fiber’s execution context. However, unlike ConvertThreadToFiber(Ex), this new fiber does not execute because the currently running fiber is still executing. Only one fiber at a time can execute on a single thread. To make the new fiber execute, you call SwitchToFiber:

      VOID SwitchToFiber(PVOID pvFiberExecutionContext);

      SwitchToFiber takes a single parameter, pvFiberExecutionContext, which is the memory address of a fiber’s execution context as returned by a previous call to ConvertThreadToFiber(Ex) or CreateFiber(Ex). This memory address tells the function which fiber to schedule. Internally, SwitchToFiber performs the following steps:

      1) It saves some of the current CPU registers, including the instruction pointer register and the stack pointer register, in the currently running fiber’s execution context.

      2) It loads the registers previously saved in the soon-to-be-running fiber’s execution context into the CPU registers. These registers include the stack pointer register so that this fiber’s stack is used when the thread continues execution.

      3) It associates the fiber’s execution context with the thread; the thread runs the specified fiber.

      4) It sets the thread’s instruction pointer to the saved instruction pointer. The thread (fiber) continues execution where this fiber last executed.

      SwitchToFiber is the only way for a fiber to get any CPU time. Because your code must explicitly call SwitchToFiber at the appropriate times, you are in complete control of the fiber scheduling. Keep in mind that fiber scheduling has nothing to do with thread scheduling. The thread that the fibers run on can always be preempted by the operating system. When the thread is scheduled, the currently selected fiber runs—no other fiber runs unless SwitchToFiber is explicitly called.

      To destroy a fiber, you call DeleteFiber:

      VOID DeleteFiber(PVOID pvFiberExecutionContext);

      This function deletes the fiber indicated by the pvFiberExecutionContext parameter, which is, of course, the address of a fiber’s execution context. This function frees the memory used by the fiber’s stack and then destroys the fiber’s execution context. But if you pass the address of the fiber that is currently associated with the thread, the function calls ExitThread internally, which causes the thread and all the fibers created on the thread to die.

      DeleteFiber is usually called by one fiber to delete another. The deleted fiber’s stack is destroyed, and the fiber’s execution context is freed. Notice the difference here between fibers and threads: threads usually kill themselves by calling ExitThread. In fact, it is considered bad form for one thread to terminate another thread using TerminateThread. If you do call TerminateThread, the system does not destroy the terminated thread’s stack. We can take advantage of this ability of a fiber to cleanly delete another fiber—I’ll discuss how when I explain the sample application later in this chapter. When all fibers are deleted, it is also possible to remove the fiber state from the original thread that called ConvertThreadToFiber(Ex) by using ConvertFiberToThread, releasing the last pieces of memory that made the thread a fiber.
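      Here is a minimal sketch of the whole fiber workflow—convert the thread, create a second fiber, ping-pong between the two with SwitchToFiber, then clean up (the fiber and thread functions are hypothetical):

      #include <windows.h>

      PVOID g_pPrimaryFiber = NULL;
      PVOID g_pWorkerFiber  = NULL;

      VOID WINAPI WorkerFiberFunc(PVOID pvParam) {
         // pvParam is the value passed to CreateFiber.
         // Do a slice of work, then hand control back to the primary fiber.
         SwitchToFiber(g_pPrimaryFiber);
         // ... more work after being rescheduled ...
         SwitchToFiber(g_pPrimaryFiber);
         // A fiber function must never return; returning would kill the thread
         // and every fiber created on it.
      }

      DWORD WINAPI ThreadFunc(PVOID pvParam) {
         g_pPrimaryFiber = ConvertThreadToFiber(NULL);            // this thread is now a fiber
         g_pWorkerFiber  = CreateFiber(0, WorkerFiberFunc, NULL); // default 1-MB stack

         SwitchToFiber(g_pWorkerFiber);   // run the worker until it switches back
         // ... do some work on the primary fiber ...
         SwitchToFiber(g_pWorkerFiber);   // give the worker another turn

         DeleteFiber(g_pWorkerFiber);     // destroy the worker fiber and its stack
         ConvertFiberToThread();          // remove the fiber state from this thread
         return 0;
      }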

      If you need to store information on a per-fiber basis, you can use the Fiber Local Storage, or FLS, functions. These functions do for fibers what the TLS functions do for threads. You first call FlsAlloc to allocate an FLS slot that can be used by all fibers running in the current process. This function takes a single parameter: a callback function that is called either when a fiber gets destroyed or when the FLS slot is deleted by a call to FlsFree. You store per-fiber data in an FLS slot by calling FlsSetValue, and you retrieve it with FlsGetValue. If you need to know whether or not you are running in a fiber execution context, simply check the Boolean return value of IsThreadAFiber.
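      A minimal sketch of the FLS functions (the data stored here is a hypothetical per-fiber counter):

      #include <windows.h>

      DWORD g_dwFlsIndex = FLS_OUT_OF_INDEXES;

      VOID WINAPI OnFiberDataCleanup(PVOID pvFlsValue) {
         // Called when a fiber is deleted or when the slot is freed with FlsFree.
         HeapFree(GetProcessHeap(), 0, pvFlsValue);
      }

      void InitPerFiberData(void) {
         g_dwFlsIndex = FlsAlloc(OnFiberDataCleanup);
      }

      void SetPerFiberCounter(int value) {
         int* pCounter = (int*) HeapAlloc(GetProcessHeap(), 0, sizeof(int));
         if (pCounter != NULL) {
            *pCounter = value;
            FlsSetValue(g_dwFlsIndex, pCounter);   // visible only to the current fiber
         }
      }

      int GetPerFiberCounter(void) {
         int* pCounter = (int*) FlsGetValue(g_dwFlsIndex);
         return (pCounter != NULL) ? *pCounter : 0;
      }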

      Several additional fiber functions are provided for your convenience. A thread can execute a single fiber at a time, and the operating system always knows which fiber is currently associated with the thread. If you want to get the address of the currently running fiber’s execution context, you can call GetCurrentFiber:

      PVOID GetCurrentFiber();

      Another handy function is GetFiberData, which returns the currently running fiber’s user-defined value:

      PVOID GetFiberData();

      As I’ve mentioned, each fiber’s execution context contains a user-defined value. This value is initialized with the value that is passed as the pvParam argument to ConvertThreadToFiber(Ex) or CreateFiber(Ex). This value is also passed as an argument to a fiber function. GetFiberData simply looks in the currently executing fiber’s execution context and returns the saved value.

      Both GetCurrentFiber and GetFiberData are fast and are usually implemented as intrinsic functions, which means that the compiler generates the code for these functions inline.

    Synchronous and Asynchronous Device I/O

    Common Windows I/O devices:

    This article discusses how an application’s threads communicate with these devices without waiting for the devices to respond.

    Below is how to get a handle to an I/O device.

    If you have a handle to a device, you can find out what type of device it is by calling GetFileType:

    DWORD GetFileType(HANDLE hDevice);

    All you do is pass to the GetFileType function the handle to a device, and the function returns one of the values listed in the next table:

    To manage a file, the cache manager must maintain some internal data structures for the file—the larger the file, the more data structures required. When working with extremely large files, the cache manager might not be able to allocate the internal data structures it requires and will fail to open the file. To access extremely large files, you must open the file using the FILE_FLAG_NO_BUFFERING flag.

    Because device I/O is slow when compared with most other operations, you might want to consider communicating with some devices asynchronously. Here’s how it works: Basically, you call a function to tell the operating system to read or write data, but instead of waiting for the I/O to complete, your call returns immediately, and the operating system completes the I/O on your behalf using its own threads. When the operating system has finished performing your requested I/O, you can be notified. Asynchronous I/O is the key to creating high-performance, scalable, responsive, and robust applications.

    Most Windows functions that return a handle return NULL when the function fails. However, CreateFile returns INVALID_HANDLE_VALUE (defined as –1) instead. You may see code like this, which is incorrect:

    HANDLE hFile = CreateFile(…);
    if (hFile == NULL) {
       // We’ll never get in here
    } else {
       // File might or might not be created OK
    }

    Here’s the correct way to check for an invalid file handle:

    HANDLE hFile = CreateFile(…);
    if (hFile == INVALID_HANDLE_VALUE) {
       // File not created
    } else {
       // File created OK
    }

    The first issue you must be aware of is that Windows was designed to work with extremely large files. Instead of representing a file’s size using 32-bit values, the original Microsoft designers chose to use 64-bit values. This means that theoretically a file can reach a size of 16 EB (exabytes).

    Dealing with 64-bit values in a 32-bit operating system makes working with files a little unpleasant because a lot of Windows functions require you to pass a 64-bit value as two separate 32-bit values. But as you’ll see, working with the values is not too difficult and, in normal day-to-day operations, you probably won’t need to work with a file greater than 4 GB. This means that the high 32 bits of the file’s 64-bit size will frequently be 0 anyway.

    The way Windows represents 64-bit file sizes and offsets—even to 32-bit applications—is pretty easy to picture. Imagine a union like this:

    typedef union _ULARGE_INTEGER {
       struct {
          DWORD LowPart;    // Low 32-bit unsigned value
          DWORD HighPart;   // High 32-bit unsigned value
       };
       ULONGLONG QuadPart;  // Full 64-bit unsigned value
    } ULARGE_INTEGER, *PULARGE_INTEGER;

    So when you are working with files smaller than 4 GB, HighPart is simply 0 and you can treat LowPart as the file’s size.
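    A minimal sketch of reading a file’s 64-bit size (hFile is assumed to be a handle returned by CreateFile):

    LARGE_INTEGER liSize;
    if (GetFileSizeEx(hFile, &liSize)) {
       if (liSize.HighPart == 0) {
          DWORD dwSize = liSize.LowPart;       // the file is smaller than 4 GB
       }
       ULONGLONG ullSize = liSize.QuadPart;    // the full 64-bit size, always valid
    }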

    If you open the same file twice, each open handle has its own file pointer.

    Windows does not offer a GetFilePointerEx function, but you can use SetFilePointerEx to move the pointer by 0 bytes to get the desired effect, as shown in the following code snippet:

    LARGE_INTEGER liCurrentPosition = { 0 };
    SetFilePointerEx(hFile, liCurrentPosition, &liCurrentPosition, FILE_CURRENT);

    Functions that do synchronous I/O are easy to use, but they block any other operations from occurring on the thread that issued the I/O until the request is completed. A great example of this is a CreateFile operation. When a user performs mouse and keyboard input, window messages are inserted into a queue that is associated with the thread that created the window that the input is destined for. If that thread is stuck inside a call to CreateFile, waiting for CreateFile to return, the window messages are not getting processed and all the windows created by the thread are frozen. The most common reason why applications hang is because their threads are stuck waiting for synchronous I/O operations to complete!

    To build a responsive application, you should try to perform asynchronous I/O operations as much as possible. This typically also allows you to use very few threads in your application, thereby saving resources (such as thread kernel objects and stacks). Also, it is usually easy to offer your users the ability to cancel an operation when you initiate it asynchronously. For example, Internet Explorer allows the user to cancel (via a red X button or the Esc key) a Web request if it is taking too long and the user is impatient.

    Basics of Asynchronous Device I/O

    Compared to most other operations carried out by a computer, device I/O is one of the slowest and most unpredictable. The CPU performs arithmetic operations and even paints the screen much faster than it reads data from or writes data to a file or across a network. However, using asynchronous device I/O enables you to better use resources and thus create more efficient applications.
    Consider a thread that issues an asynchronous I/O request to a device. This I/O request is passed to a device driver, which assumes the responsibility of actually performing the I/O. While the device driver waits for the device to respond, the application’s thread is not suspended as it waits for the I/O request to complete. Instead, this thread continues executing and performs other useful tasks.

    You should be aware of a couple of issues when performing asynchronous I/O. First, the device driver doesn’t have to process queued I/O requests in a first-in first-out (FIFO) fashion. For example, if a thread executes the following code, the device driver will quite possibly write to the file and then read from the file:

    OVERLAPPED o1 = { 0 };
    OVERLAPPED o2 = { 0 };
    BYTE bBuffer[100];
    ReadFile (hFile, bBuffer, 100, NULL, &o1);
    WriteFile(hFile, bBuffer, 100, NULL, &o2);

    A device driver typically executes I/O requests out of order if doing so helps performance. For example, to reduce head movement and seek times, a file system driver might scan the queued I/O request list looking for requests that are near the same physical location on the hard drive.

    The second issue you should be aware of is the proper way to perform error checking. Most Windows functions return FALSE to indicate failure or nonzero to indicate success. However, the ReadFile and WriteFile functions behave a little differently. An example might help to explain.

    When attempting to queue an asynchronous I/O request, the device driver might choose to process the request synchronously. This can occur if you’re reading from a file and the system checks whether the data you want is already in the system’s cache. If the data is available, your I/O request is not queued to the device driver; instead, the system copies the data from the cache to your buffer, and the I/O operation is complete. Some operations are always performed synchronously by the driver, such as NTFS file compression, extending the length of a file, or appending information to a file.

    Here is one of the most common bugs developers introduce when implementing an asynchronous device I/O architecture—an example of what not to do:

    VOID ReadData(HANDLE hFile) {
       OVERLAPPED o = { 0 };
       BYTE b[100];
       ReadFile(hFile, b, 100, NULL, &o);
    }

    This code looks fairly harmless, and the call to ReadFile is perfect. The only problem is that the function returns after queuing the asynchronous I/O request. Returning from the function essentially frees the buffer and the OVERLAPPED structure from the thread’s stack, but the device driver is not aware that ReadData returned. The device driver still has two memory addresses that point to the thread’s stack. When the I/O completes, the device driver is going to modify memory on the thread’s stack, corrupting whatever happens to be occupying that spot in memory at the time. This bug is particularly difficult to find because the memory modification occurs asynchronously. Sometimes the device driver might perform I/O synchronously, in which case you won’t see the bug. Sometimes the I/O might complete right after the function returns, or it might complete over an hour later, and who knows what the stack is being used for then.
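    One minimal sketch of a correct pattern (a hypothetical helper; hFile is assumed to be opened with FILE_FLAG_OVERLAPPED) keeps the buffer and OVERLAPPED structure alive until the I/O has actually completed and checks for ERROR_IO_PENDING:

    BOOL ReadDataCorrectly(HANDLE hFile) {
       BYTE bBuffer[100];
       OVERLAPPED o = { 0 };
       o.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);   // signaled when the I/O completes

       BOOL bOk = ReadFile(hFile, bBuffer, sizeof(bBuffer), NULL, &o);
       if (!bOk && GetLastError() != ERROR_IO_PENDING) {
          CloseHandle(o.hEvent);
          return FALSE;                                   // the request really failed
       }

       // ... do other useful work here ...

       // Do NOT return before this point: the driver still owns bBuffer and o.
       DWORD dwBytesRead;
       bOk = GetOverlappedResult(hFile, &o, &dwBytesRead, TRUE);   // wait for completion
       CloseHandle(o.hEvent);
       return bOk;
    }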

    Receiving Completed I/O Request Notifications

    Windows offers four different methods (briefly described in Table 10-9) for receiving I/O completion notifications, and this chapter covers all of them. The methods are shown in order of complexity, from the easiest to understand and implement (signaling a device kernel object) to the hardest to understand and implement (I/O completion ports).

    Whenever a thread is created, the system also creates a queue that is associated with the thread. This queue is called the asynchronous procedure call (APC) queue. When issuing an I/O request, you can tell the device driver to append an entry to the calling thread’s APC queue. To have completed I/O notifications queued to your thread’s APC queue, you issue the request with the ReadFileEx or WriteFileEx function, passing the address of a completion routine (a callback function).

    The APC queue is maintained internally by the system. Note also that the system can execute your queued I/O requests in any order: the requests that you issue last might be completed first, and vice versa. Each entry in your thread’s APC queue contains the address of a callback function and a value that is passed to the function.
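    As a rough sketch of the mechanism (the names IoCompletionRoutine and IssueAlertableRead are illustrative choices, not taken from the text): the request is issued with ReadFileEx, which takes the address of a completion routine, and the thread later puts itself into an alertable state, for example with SleepEx, so that the queued APC entries can execute.

    VOID CALLBACK IoCompletionRoutine(DWORD dwError, DWORD dwNumBytes,
       LPOVERLAPPED po) {
       // Runs on the issuing thread, but only while that thread is alertable.
    }

    VOID IssueAlertableRead(HANDLE hFile) {
       OVERLAPPED o = { 0 };
       BYTE bBuffer[100];

       // Queue the request; when the I/O finishes, an entry referencing
       // IoCompletionRoutine is appended to this thread's APC queue.
       ReadFileEx(hFile, bBuffer, 100, &o, IoCompletionRoutine);

       // Do other useful work, then wait alertably so that queued APC
       // entries (completed I/O notifications) get a chance to execute.
       SleepEx(INFINITE, TRUE);
    }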

    I/O Completion Ports

    Windows is designed to be a secure, robust operating system running applications that service literally thousands of users. Historically, you’ve been able to architect a service application by following one of two models:

    • Serial model A single thread waits for a client to make a request (usually over the network). When the request comes in, the thread wakes and handles the client’s request.
    • Concurrent model A single thread waits for a client request and then creates a new thread to handle the request. While the new thread is handling the client’s request, the original thread loops back around and waits for another client request. When the client’s request has been completely processed, the handling thread dies.

    The problem with the serial model is that it does not handle multiple, simultaneous requests well. If two clients make requests at the same time, only one can be processed at a time; the second request must wait for the first request to finish processing. A service that is designed using the serial approach cannot take advantage of multiprocessor machines. Obviously, the serial model is good only for the simplest of server applications, in which few client requests are made and requests can be handled very quickly. A Ping server is a good example of a serial server.

    Early service applications on Windows were implemented using the concurrent model, but the Windows team noticed that their performance was not as high as desired. In particular, handling many simultaneous client requests meant that many threads were running in the system concurrently. Because all these threads were runnable (not suspended and waiting for something to happen), the Windows kernel spent too much time context switching between them, and the threads were not getting as much CPU time to do their work. To make Windows an awesome server environment, Microsoft needed to address this problem. The result is the I/O completion port kernel object.

    As you would expect, entries are removed from the I/O completion queue in a first-in first-out fashion. However, as you might not expect, threads that call GetQueuedCompletionStatus are awakened in a last-in first-out (LIFO) fashion. The reason for this is again to improve performance. For example, say that four threads are waiting in the waiting thread queue. If a single completed I/O entry appears, the last thread to call GetQueuedCompletionStatus wakes up to process the entry. When this last thread is finished processing the entry, the thread again calls GetQueuedCompletionStatus to enter the waiting thread queue. Now if another I/O completion entry appears, the same thread that processed the first entry is awakened to process the new entry.

    As long as I/O requests complete slowly enough that a single thread can handle them, the system just keeps waking the one thread, and the other three threads continue to sleep. By using this LIFO algorithm, threads that don’t get scheduled can have their memory resources (such as stack space) swapped out to the disk and flushed from a processor’s cache. This means that having many threads waiting on a completion port isn’t bad: if you do have several threads waiting but few I/O requests completing, the extra threads have most of their resources swapped out of the system anyway.

    Now it’s time to discuss why I/O completion ports are so useful. First, when you create the I/O completion port, you specify the number of threads that can run concurrently. As I said, you usually set this value to the number of CPUs on the host machine. As completed I/O entries are queued, the I/O completion port wants to wake up waiting threads, but it wakes up only as many threads as you have specified. So if the concurrency value is 2, four I/O requests complete, and four threads are waiting in a call to GetQueuedCompletionStatus, the I/O completion port allows only two threads to wake up; the other two threads continue to sleep. As each thread processes a completed I/O entry, the thread again calls GetQueuedCompletionStatus. The system sees that more entries are queued and wakes the same threads to process the remaining entries.

    If you’re thinking about this carefully, you should notice that something just doesn’t make a lot of sense: if the completion port only ever allows the specified number of threads to wake up concurrently, why have more threads waiting in the thread pool? For example, suppose I’m running on a machine with two CPUs and I create the I/O completion port, telling it to allow no more than two threads to process entries concurrently. But I create four threads (twice the number of CPUs) in the thread pool. It seems as though I am creating two additional threads that will never be awakened to process anything.

    But I/O completion ports are very smart. When a completion port wakes a thread, the completion port places the thread’s ID in the fourth data structure associated with the completion port, a released thread list. This allows the completion port to remember which threads it awakened and to monitor the execution of these threads. If a released thread calls any function that places the thread in a wait state, the completion port detects this and updates its internal data structures by moving the thread’s ID from the released thread list to the paused thread list (the fifth and final data structure that is part of an I/O completion port).

    Let’s tie all of this together now. Say that we are again running on a machine with two CPUs. We create a completion port that allows no more than two threads to wake concurrently, and we create four threads that are waiting for completed I/O requests. If three completed I/O requests get queued to the port, only two threads are awakened to process the requests, reducing the number of runnable threads and saving context-switching time. Now if one of the running threads calls Sleep, WaitForSingleObject, WaitForMultipleObjects, SignalObjectAndWait, a synchronous I/O call, or any function that would cause the thread not to be runnable, the I/O completion port would detect this and wake a third thread immediately. The goal of the completion port is to keep the CPUs saturated with work.

    Eventually, the first thread will become runnable again. When this happens, the number of runnable threads will be higher than the number of CPUs in the system. However, the completion port again is aware of this and will not allow any additional threads to wake up until the number of threads drops below the number of CPUs. The I/O completion port architecture presumes that the number of runnable threads will stay above the maximum for only a short time and will die down quickly as the threads loop around and again call GetQueuedCompletionStatus. This explains why the thread pool should contain more threads than the concurrent thread count set in the completion port.
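    To make the pattern concrete, here is a minimal sketch under the assumptions discussed above (the names WorkerThread and StartServer, the completion key value, and the use of CreateThread are illustrative choices, not prescribed by the text): the port is created with a concurrency value equal to the number of CPUs, an overlapped device handle is associated with it, and twice as many worker threads loop around GetQueuedCompletionStatus.

    // Worker thread: loops forever, pulling completed I/O entries off the port.
    DWORD WINAPI WorkerThread(PVOID pvIocp) {
       HANDLE hIocp = (HANDLE) pvIocp;
       DWORD dwNumBytes;
       ULONG_PTR completionKey;
       OVERLAPPED* po;

       while (GetQueuedCompletionStatus(hIocp, &dwNumBytes, &completionKey,
          &po, INFINITE)) {
          // Process the completed request identified by completionKey and po,
          // then loop back to wait for the next entry.
       }
       return 0;
    }

    VOID StartServer(HANDLE hDevice, DWORD nCpus) {
       // Concurrency value = number of CPUs; thread pool = twice that.
       HANDLE hIocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, nCpus);

       // Associate the overlapped device handle with the port; the completion
       // key (1 here) is handed back with each completed request on the handle.
       CreateIoCompletionPort(hDevice, hIocp, 1, 0);

       for (DWORD i = 0; i < 2 * nCpus; i++)
          CreateThread(NULL, 0, WorkerThread, hIocp, 0, NULL);
    }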

    What Are Performance and Footprint Optimization?

     

    Performance is an expression of the amount of work that is done during a certain period of time. The more work a program does per unit of time, the better its performance. Put differently, the performance of a program is measured by the number of input (data) units it manages to transform into output (data) units in a given time. This translates directly into the number of algorithmic steps that need to be taken to complete this transformation. For example, an algorithm that executes 10 program statements to store a name in a database performs poorly compared to one that stores the same name in five statements.

    Performance can be affected by the following:

    • Performance of Physical Devices (e.g., a printer)
    • Performance of System Resources (e.g., RAM, ROM, EPROM)
    • Performance of Subsystems (e.g., third-party software)
    • Performance of Communication
      • Callback functions are one of the common tools used in communication optimization.
    • Application Look and Feel (i.e., the GUI)
      • Common GUI Problems:
        • Unexplained waiting time (e.g., mitigated by adding a progress bar)
        • Illogical setup of the user interface
        • Problematic interface access (e.g., a delay in rendering the controls the user needs right away)
        • Not sufficiently aiding the learning curve (e.g., mitigated by adding help pop-ups)

     

    When Do Performance Problems Arise?

    Often a program is closely tailored to its current intended use, so it runs into performance problems almost as soon as the slightest alteration is made to the way it is used; more often than not this is because developers work under strict time constraints. Performance problems typically arise in the following situations:

    • Extending program functionality
    • Code reuse
    • Test cases and target systems
    • Side effects of long-term use
      • Disk fragmentation
      • Spawned processes that never terminate
      • Memory leaks
      • Memory fragmentation
      • Files that are opened but never closed
      • Interrupts that are never cleared
      • Log files that grow too large
      • Semaphores that are claimed but never freed (locking problems)
      • Queues and arrays that exceed their maximum size
      • Buffers that wrap around when full
      • Counters that wrap to negative numbers
      • Tasks that are not handled often enough because their priority is set too low

    Keep in mind the trade-off between software flexibility and performance.

    Software footprint is a measure of a program’s size. Several aspects are considered:

    • Storage Requirement: the amount of disk space the program occupies when it is inactive and stored on the hard disk.
    • Runtime Memory Requirement: the amount of memory needed while the program is being executed. The following aspects play a role here:
      • Compression
      • Data Structures
      • Overlay Techniques
      • Working Memory
      • Cache
      • Memory Fragmentation

    The main considerations for keeping footprint sizes in check are the impact on available resources and the impact on performance.

    Also, a program’s usability is affected by the size of the runtime footprint. If a program uses a lot of internal memory, it might force the operating system to start swapping memory to the hard disk and back. Remember also that programs virtually never have the sole use of the system. When there is little free internal memory left, an increasingly large part of the memory that is temporarily not needed is swapped out onto the hard disk. The chance that a certain operation will require data that is not found in memory increases. The result is a hard disk that makes a lot of noise and programs that halt and stutter, causing overall slowdown and annoyance.

    The ideal program, as seen by the user, has the following characteristics:

    • It needs little user interaction.
    • It has an intuitive user interface.
    • It has a short learning curve.
    • It is highly flexible.
    • It contains, and is accompanied by, extensive but easily readable user documentation.
    • It has no waiting times during user interaction; any slow actions are performed offline, and so on.
    • It has readily available information for users at any given time.

     

    The ideal program, as seen by the developer, contains the following attributes:

    • It is geared toward future developments, such as added functionality and handling larger volumes of data.
    • It is easily maintainable.
    • It has a good and intuitive design.
    • It is accompanied by well-written technical documentation.
    • It can be passed on to any developer.