Application Design and Multiprocessor Performance
The design of applications and services has a significant influence on how well those programs perform in an SMP environment. This section briefly summarizes techniques that application designers can use to maximize the efficiency of their programs on multiprocessor systems. More detailed discussions of writing applications that scale well to multiple processors appear on the Microsoft® Developer Network (MSDN).
Application developers can optimize application performance for SMP systems in the following ways:
Keep the number of threads to a minimum. Generally, two to four application or server threads per processor work well.
Limit processor queue depth. Keep the processor queue length (the number of ready threads waiting to run) in the range of two to three per processor. Adjust the number of threads according to the characteristics of the application, such as how long a thread spends blocked or waiting for an I/O operation.
Use a thread pool rather than one thread per client. A thread pool served by an I/O completion port, with the pool partitioned across the processors, is more efficient than a dedicated thread for each client.
Use I/O completion ports. I/O completion ports control the number of active threads to yield optimal throughput. Per-processor I/O completion ports can be implemented in an application or server to ensure completion of a work item from start to finish on the same processor.
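The thread-count and thread-pool guidance above can be sketched in a few lines. This is a minimal illustration only, not the Win32 mechanism: Python's standard ThreadPoolExecutor stands in for a pool backed by an I/O completion port, and the names THREADS_PER_PROCESSOR, pool_size, handle_client, and serve are hypothetical.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# Assumption: the low end of the "two to four threads per processor" guideline.
THREADS_PER_PROCESSOR = 2


def pool_size(threads_per_processor=THREADS_PER_PROCESSOR):
    """Return a worker count proportional to the number of processors."""
    processors = os.cpu_count() or 1  # cpu_count() can return None
    return processors * threads_per_processor


def handle_client(client_id):
    """Hypothetical request handler; reports which pool thread did the work."""
    return client_id, threading.current_thread().name


def serve(client_ids):
    """Serve every client from one fixed-size pool, not one thread per client."""
    with ThreadPoolExecutor(max_workers=pool_size(),
                            thread_name_prefix="worker") as pool:
        # map() hands each client to whichever pool thread is free
        # and returns the results in submission order.
        return list(pool.map(handle_client, client_ids))


results = serve(range(100))
workers_used = {name for _, name in results}
print(f"{len(results)} clients served by {len(workers_used)} "
      f"of at most {pool_size()} pool threads")
```

The number of threads stays bounded by the processor count no matter how many clients connect, which is the property the guidelines above aim for. The multiplier would be tuned to how much time the threads spend blocked.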
Minimize synchronization costs. Keep critical sections small and avoid shared data whenever possible. Synchronize access to shared data rather than trying to synchronize code paths. Critical sections, the synchronization objects defined in the Win32 API, are a very fast method for mutual exclusion within a single multithreaded process, but when contention arises they cause context switches. Large numbers of critical section or spinlock acquisitions cause heavy data-access sharing and should be avoided.
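As a rough user-mode illustration of keeping the protected region small and avoiding shared data, the sketch below has each thread accumulate into local variables and take the lock once, only to merge its result. Python's threading.Lock stands in for a Win32 critical section here, and expensive_parse, worker, and the totals dictionary are hypothetical names.

```python
import threading

totals_lock = threading.Lock()          # protects the shared `totals` data
totals = {"requests": 0, "bytes": 0}


def expensive_parse(payload):
    """Hypothetical per-request work that touches no shared state."""
    return len(payload.encode("utf-8"))


def worker(payloads):
    # Accumulate privately: no lock is needed for thread-local variables.
    local_requests = 0
    local_bytes = 0
    for payload in payloads:
        local_bytes += expensive_parse(payload)
        local_requests += 1
    # One short critical section per thread instead of one per request.
    with totals_lock:
        totals["requests"] += local_requests
        totals["bytes"] += local_bytes


threads = [threading.Thread(target=worker, args=(["abc"] * 1000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(totals)
```

The lock guards the shared data, not the parsing code path, and each thread acquires it once instead of a thousand times.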
Spinlocks are an extension of IRQL on SMP systems. They are used to synchronize kernel and driver data structures among interrupts, DPCs, or threads of execution running concurrently on an SMP computer. A thread acquires a spinlock before accessing protected resources. The spinlock keeps other processors from accessing the critical section (shared data) until the spinlock is released. A processor that is waiting for the spinlock loops until the spinlock is released.
Another characteristic of spinlocks is the associated IRQL. An attempt to acquire a spinlock temporarily raises the IRQL of the requesting processor to the IRQL associated with the spinlock. This prevents all lower-IRQL activity on the same processor from running until the IRQL is lowered. Interrupts at a higher IRQL can still preempt the executing thread. In a driver, if the IRQL is already at the desired level, use the spinlock acquire-and-release APIs that do not change IRQL. For more information about spinlocks and associated IRQLs, see the Driver Development Kit link on the Web Resources page at http://windows.microsoft.com/windows2000/reskit/webresources.
Avoid nested locks. Nesting locks can cause both performance and reliability problems, such as deadlock. Always try to avoid nesting critical sections and spinlocks.
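One way to avoid nesting, sketched here under stated assumptions, is to restructure an operation so that it releases the first lock before taking the second. The Bin class and the move_one and shuttle helpers are hypothetical, and threading.Lock again stands in for a critical section.

```python
import threading


class Bin:
    """Hypothetical container guarded by its own lock."""

    def __init__(self, items=()):
        self.lock = threading.Lock()
        self.items = list(items)


def move_one(src, dst):
    """Move one item from src to dst without ever holding both locks."""
    with src.lock:
        if not src.items:
            return False
        item = src.items.pop()
    # src.lock has been released before dst.lock is acquired, so two
    # threads moving items in opposite directions cannot deadlock.
    with dst.lock:
        dst.items.append(item)
    return True


def shuttle(src, dst, count):
    for _ in range(count):
        move_one(src, dst)


a = Bin(range(1000))
b = Bin(range(1000, 2000))
threads = [threading.Thread(target=shuttle, args=(a, b, 500)),
           threading.Thread(target=shuttle, args=(b, a, 500))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(a.items), len(b.items))
```

The trade-off is that an item is briefly in neither container while it is in flight. If an operation truly must appear atomic across both structures, this restructuring does not apply as written.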
Partition the workload, including interrupts. Whenever possible, partition the workload a server or application handles. Partitioning allows very effective use of system resources.
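A minimal sketch of workload partitioning, assuming integer client IDs and a fixed partition count: each worker thread owns one partition's state outright, so updates to that state need no lock. NUM_PARTITIONS, partition_for, and worker are hypothetical names, and interrupt partitioning is outside what this user-mode example can show.

```python
import queue
import threading

NUM_PARTITIONS = 4   # assumption: for example, one partition per processor
_STOP = object()     # sentinel that tells a worker to exit


def partition_for(client_id, partitions=NUM_PARTITIONS):
    """Map a client to the single partition that handles all of its work."""
    return client_id % partitions


def worker(inbox, state):
    """Drain one partition's queue; `state` belongs to this thread only."""
    while True:
        item = inbox.get()
        if item is _STOP:
            break
        client_id, amount = item
        state[client_id] = state.get(client_id, 0) + amount  # no lock needed


inboxes = [queue.Queue() for _ in range(NUM_PARTITIONS)]
states = [{} for _ in range(NUM_PARTITIONS)]
threads = [threading.Thread(target=worker, args=(inboxes[i], states[i]))
           for i in range(NUM_PARTITIONS)]
for t in threads:
    t.start()

# Route 5 requests for each of 20 clients to the owning partition.
for client_id in range(20):
    for _ in range(5):
        inboxes[partition_for(client_id)].put((client_id, 1))

for inbox in inboxes:
    inbox.put(_STOP)
for t in threads:
    t.join()
print([len(s) for s in states])
```

The only shared objects are the queues used to hand work to each partition. All per-client state is touched by exactly one thread.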
In addition, intensive memory access due to copying, zeroing (for C2 security), and checksum operations reduces the ability to scale effectively across multiple processors. You can identify some of these problems through profiling. For more information about profiling tools, see the Platform Software Development Kit (SDK) link on the Web Resources page at http://windows.microsoft.com/windows2000/reskit/webresources.
Keep concurrently used data in separate cache blocks. Change applications so that data that might be used concurrently by threads on different processors does not reside in the same cache block, or use processor affinity so that an updated cache block is used on only a single processor.
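The layout arithmetic behind separating data into different cache blocks can be sketched with ctypes structures. This shows the data layout only, not a performance measurement, and it rests on two assumptions: a 64-byte cache block (check the target processor) and an array that starts on a block boundary. PackedCounter, PaddedCounter, and blocks_touched are hypothetical names.

```python
import ctypes

CACHE_LINE_BYTES = 64  # assumption: block size of the target processor


class PackedCounter(ctypes.Structure):
    """Per-thread counter with no padding: 8 bytes per element."""
    _fields_ = [("value", ctypes.c_int64)]


class PaddedCounter(ctypes.Structure):
    """Per-thread counter padded out to fill a whole cache block."""
    _fields_ = [
        ("value", ctypes.c_int64),
        ("_pad", ctypes.c_char * (CACHE_LINE_BYTES
                                  - ctypes.sizeof(ctypes.c_int64))),
    ]


def blocks_touched(element_type, count):
    """Cache-block indexes occupied by `count` consecutive array elements,
    relative to an array start assumed to be block-aligned."""
    stride = ctypes.sizeof(element_type)
    return {(i * stride) // CACHE_LINE_BYTES for i in range(count)}


packed_blocks = blocks_touched(PackedCounter, 4)
padded_blocks = blocks_touched(PaddedCounter, 4)
print("packed counters share blocks:", sorted(packed_blocks))
print("padded counters use blocks:  ", sorted(padded_blocks))
```

With four unpadded counters, every thread's counter falls in the same block, so a write by one processor invalidates the block for the others. With padding, each counter gets a block of its own at the cost of extra memory.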
Use asynchronous, overlapped I/O. In overlapped mode, a server application can initiate multiple I/O requests without waiting for previous requests to complete, thereby enabling it to service multiple clients asynchronously using a single thread.
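The idea of starting several I/O requests before any of them completes can be illustrated with Python's asyncio. This is an analogy under stated assumptions, not Win32 overlapped I/O: fake_io is a hypothetical stand-in that sleeps instead of performing a real read or write, and serve_all is likewise a hypothetical name.

```python
import asyncio
import threading


async def fake_io(client_id, delay=0.05):
    """Hypothetical stand-in for one I/O request; a real server would
    await a socket or file operation here instead of sleeping."""
    await asyncio.sleep(delay)
    return client_id, threading.get_ident()


async def serve_all(client_ids):
    # Initiate every request first, then wait for all of them together,
    # rather than waiting for each request before starting the next.
    return await asyncio.gather(*(fake_io(c) for c in client_ids))


replies = asyncio.run(serve_all(range(10)))
thread_ids = {thread_id for _, thread_id in replies}
print(f"{len(replies)} requests completed on {len(thread_ids)} thread(s)")
```

All ten requests are in flight at the same time, yet every completion is handled by the same thread, which mirrors the single-threaded servicing of multiple clients described above.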