Chapter 16 - Monitoring Multiple Processor Computers

Archived content. No warranty is made as to technical accuracy. Content may contain URLs that were valid when originally published, but now link to sites or pages that no longer exist.

In an ideal world, five processors would do five times the work of one processor. But we live in a world of contention for shared resources, of disk and memory bottlenecks, single-threaded applications, multithreaded applications that require synchronous processing, and poorly coordinated processors. In our world, five processors can be five times as much work for the systems administrator!

Fortunately, Windows NT 4.0 is designed to make the most of multiprocessor configurations. Multiple processors enable multiple threads to execute simultaneously, with different threads of the same process running on each processor. The Windows NT 4.0 microkernel implements symmetric multiprocessing (SMP), wherein any process, including those of the operating system, can run on any available processor, and the threads of a single process can run on different processors at the same time.

The most common bottlenecks on multiprocessor systems arise when all processors contend for the same operating system or hardware resource. If this resource is in short supply, the system can't benefit from the additional processors.

Shared memory is the Achilles' heel of multiprocessor systems: Although it enables the threads of a single process to be executed on different processors, it makes multiprocessor systems highly vulnerable to memory shortages, to the design of the cache controller, and to differences in cache management strategies.

Understanding the Multiple Processor Counters

Some Performance Monitor counters were designed for single processor systems and might not be entirely accurate for multiprocessor systems.

For example, on a multiprocessor computer, a process can (and often does) use more than the equivalent of 100% processor time on one processor. Although it is limited to 100% of any single processor, its threads can use several processors, totaling more than 100%. However, the Process: % Processor Time counter never displays more than 100%. To determine how much total processor time a process is getting, chart the Thread: % Processor Time counter for each of the process's threads.
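The distinction can be sketched with simple arithmetic. The counter names below come from Performance Monitor, but the functions and sample values are hypothetical, modeling only the capping behavior described above:

```python
def process_percent_processor_time(thread_times):
    """Model how Process: % Processor Time behaves on NT 4.0: the displayed
    value never exceeds 100%, even when the process's threads together use
    more than one processor's worth of time."""
    return min(sum(thread_times), 100.0)

def total_thread_time(thread_times):
    """Summing Thread: % Processor Time for each thread gives actual usage."""
    return sum(thread_times)

# Two threads of one process, each busy on its own processor:
threads = [80.0, 60.0]
print(process_percent_processor_time(threads))  # 100.0 (capped)
print(total_thread_time(threads))               # 140.0 (actual usage)
```

This is why charting the per-thread counter, rather than the per-process counter, is the reliable way to total a process's time on a multiprocessor computer.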

Use the following counters to monitor multiple processor computers.



System: % Total Processor Time

A measure of processor activity for all processors in the computer.
This counter sums average non-idle time of all processors during the sample interval and divides it by the number of processors.
For example, if all processors are busy for half of the sample interval, on average, it displays 50%. It also displays 50% if half of the processors are busy for the entire interval, and the others are idle.
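The calculation described above can be sketched directly; the function name and sample values are illustrative, not part of Performance Monitor:

```python
def total_processor_time(busy_percentages):
    """Model System: % Total Processor Time: sum each processor's busy
    percentage over the sample interval, then divide by the processor count."""
    return sum(busy_percentages) / len(busy_percentages)

# All four processors busy for half the interval:
print(total_processor_time([50, 50, 50, 50]))   # 50.0
# Half the processors busy the entire interval, the rest idle:
print(total_processor_time([100, 100, 0, 0]))   # 50.0
```

Both workloads display the same 50%, which is why this counter alone cannot distinguish evenly spread load from load concentrated on a few processors.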

System: Processor Queue Length

The number of ready threads waiting for processor time. There is a single queue, even when there are many processors.

Processor: % Processor Time

The percentage of time an individual processor is busy executing non-idle threads.

Process: % Processor Time

The sum of processor time used by all threads of the process, across all processors.

Thread: % Processor Time

The processor time used by an individual thread, expressed as a percentage of the sample interval.

Charting Multiple Processor Activity

Logging and charting are similar for multiple-processor and single-processor systems. Because the graphs can get crowded and complex, it's best to log the System, Processor, Process, and Thread objects, and then chart them one at a time. If you need to compare charts, start several copies of Performance Monitor and have them all chart or report on data from the same log file.

When monitoring a complex occurrence, a comparison of graphs can be more useful than a single graph.

  • A chart of System: % Total Processor Time shows the overall performance of the system. This curve flattens into a horizontal line in a multiprocessor bottleneck.

  • A chart of Processor: % Processor Time for each processor shows patterns of processor use. You can determine each processor's start time, as well as its utilization. Use bookmarks when you start processes to see the effect they have on the processors. The average of Processor: % Processor Time across all processors equals System: % Total Processor Time.

  • A chart or histogram of Process: % Processor Time for all active processes reveals processor use by operating system services and network support, not to mention Performance Monitor.

  • A Thread: % Processor Time chart is essential for diagnosing processor problems. Although the operating system runs processes, it is their threads that execute instructions on the processors. Also, Thread: % Processor Time is a better indicator of processor use than Process: % Processor Time, because the latter has a maximum value of 100%.

    Compare a Thread: % Processor Time chart with a Processor: % Processor Time chart. You can match the threads to their processors because their curves have similar shapes and values. You can also determine which threads are doing background work and which are contending for the foreground.

    A chart of Thread: % Processor Time for all threads on a busy system is likely to be confusing, so chart the threads of each process separately, either one at a time, or with different copies of Performance Monitor reading the same log file.

You can also test each of your processors independently or in different combinations with single and multithreaded applications. Add Process: Thread Count to a Performance Monitor report to see how many threads are in each active process. Edit the Boot.ini file in your root directory to change the number and combination of active processors.
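One commonly documented Boot.ini switch for this purpose is /NUMPROC, which limits the number of processors Windows NT starts. The ARC path and description below are illustrative; your own entry will differ:

```
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Workstation 4.00" /NUMPROC=2
```

After rebooting with this entry, the system uses only two processors, letting you compare performance across configurations.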

Task Manager, a new administrative tool, lets you determine which processes run on which processors of a multiprocessor computer. On the Task Manager Processes tab, click a process with the right mouse button, then select Set Affinity. The process you selected will run only on the processors selected on the panel. This is a great testing tool.
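Under the covers, an affinity setting is a bitmask with one bit per processor; the Win32 SetProcessAffinityMask function takes such a mask. A minimal sketch of how the mask is built and decoded (the helper functions and processor numbers are hypothetical):

```python
def affinity_mask(processors):
    """Build a processor-affinity bitmask: bit n set means the process
    may run on processor n."""
    mask = 0
    for n in processors:
        mask |= 1 << n
    return mask

def allowed_processors(mask):
    """Decode an affinity mask back into a set of processor numbers."""
    return {n for n in range(mask.bit_length()) if mask & (1 << n)}

# Restrict a process to processors 0 and 2 of a four-processor system:
print(bin(affinity_mask([0, 2])))  # 0b101
print(allowed_processors(0b101))   # {0, 2}
```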

Resource Contention

The following figures use histograms of the Process: % Processor Time counter to compare two active processes running on one processor to the same processes running on two processors.

The first graph shows the processes running on a single-processor computer. Each process is getting about half of the processing time. All other processes are nearly idle.


The following figure shows the same processes running on a computer with two processors.


On the multiprocessor computer, each process is using 100% of a processor, and the system is doing twice the work. The processor time is the same as for a single process with a single processor all to itself.

However, to achieve this performance, the processes had to be entirely independent; the only thing they shared was their code. Each processor had a copy of the code in its primary and secondary memory caches, so the processes didn't even have to share physical memory or any common system resources. This is the ideal, simulated by CpuStress, a test tool designed for the purpose.

Cache Coherency

In the single-processor example above, several processes competed for the same processor. But resource contention occurs even among the threads of a single process. Threads within a process share and contend for the same address space, and frequently write to the same memory locations. Although this is a minor problem for single-processor configurations, it can become a bottleneck in multiprocessor systems.

Unfortunately, you can't see cache and memory contention directly with Performance Monitor because these conflicts occur at the hardware level, where no counters exist. You can, however, get indirect evidence based on response time and total throughput: The processors simply appear to be busy.

In multiprocessor systems, shared memory must be kept consistent: that is, the values of memory cells in the caches of each processor must be kept the same. This is known as cache coherency. The responsibility for maintaining cache coherency in multiprocessor systems falls to the cache controller hardware. When a memory cell is written, if the cache controller finds that the memory cell is in use in the cache of any other processors, it invalidates or overwrites those cells with the new data and then updates main memory.

Two frequently used update strategies are known as write-through caching and write-back caching:

  • In write-through caching, the cache controller updates main memory immediately so that other caches can get the updated data from memory.

  • In write-back caching, the cache controller doesn't update main memory until it needs to reuse the memory cell. If another cache needs the data before it is written to main memory (which is more likely with more threads), the cache controller must obtain the data from the cache of the other processor. That processor's cache must listen in on bus requests and respond before main memory recognizes the call.

Write-back caching usually causes fewer writes to main memory and reduces contention on the memory bus, but as the number of threads grows and the likelihood that they will need shared data increases, it actually causes more traffic and resource contention.
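The trade-off can be made concrete with a toy model. This is not NT's cache hardware; it simply counts bus transfers for one cached cell under each strategy, with hypothetical event sequences:

```python
def bus_traffic(events, strategy):
    """Count memory-bus transfers for one cached cell.
    events: a sequence of 'write' (the owning processor writes the cell)
            or 'remote_read' (another processor needs the current value).
    Write-through: every write goes to main memory; remote reads are then
    satisfied from memory at no extra cost in this model.
    Write-back: writes stay in the cache; each remote read forces the owning
    cache to supply the data over the bus, plus one final write-back."""
    if strategy == "write-through":
        return sum(1 for e in events if e == "write")
    if strategy == "write-back":
        remote_reads = sum(1 for e in events if e == "remote_read")
        dirty = any(e == "write" for e in events)
        return remote_reads + (1 if dirty else 0)
    raise ValueError(strategy)

# Few sharers: write-back wins decisively.
quiet = ["write"] * 10
print(bus_traffic(quiet, "write-through"))  # 10
print(bus_traffic(quiet, "write-back"))     # 1

# Many threads sharing the data: write-back traffic overtakes write-through.
busy = ["write", "remote_read"] * 10
print(bus_traffic(busy, "write-through"))   # 10
print(bus_traffic(busy, "write-back"))      # 11
```

The crossover in the second case mirrors the point above: as more threads need shared data, write-back caching can generate more bus traffic than it saves.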

Resource sharing and contention are much more common than isolated processing. Even when ample processors exist for the workload, they must share the single pool of virtual memory and contend for disk access. There is no easy solution to this problem. However, it demonstrates the limits of even the most sophisticated hardware. In this situation, the traditional solutions to a bottleneck, such as adding more processors, disk space, or memory, cannot overcome the limitations imposed by an application's dependence on a single subsystem.