Mainstream NUMA & the TCP/IP stack: Part 2: Programming ccNUMA machines

This is a continuation of Part I of this article posted here.

In Part 1 of this article, we looked at the capacity issues that are driving architectural changes in the TCP/IP networking stack. While network interfaces are increasing in throughput capacity, processor speeds in the multi-core era are not keeping pace. Meanwhile, the TCP/IP protocol has grown in complexity so that host processing requirements are increasing, too. The only way for networked computers to scale in the multi-core era is to begin distributing networking I/O operations across multiple processors. Since bigger server machines rely on NUMA architectures for scalability, high speed networking is also evolving to exploit machines with NUMA architectures in an optimal fashion.

Machines with NUMA (non-uniform memory access speeds) architectures are usually large scale multiprocessors that are assembled using building blocks, or nodes, that each contain some number of CPUs, some amount of RAM, and various other peripheral connections. Nodes are often configured on separate boards, for example, or specific segments of a board. Multiple nodes are then interconnected with high speed links of some sort that permit all the memory that is configured to be available to executing programs. There are many schools of thought on what the best interconnection technology is. Some manufacturers favor tree structures, some favor directory schemes, some favor network-like routing. A key feature of the architecture is that the latency of a memory fetch depends on the physical location of the RAM being accessed. Accessing RAM attached to the local node is faster than a memory fetch to a remote location that is physically located on another node.

Within one of the new Intel Nehalem many-core microprocessor, for example, all the processor cores and their logical processors can access local memory at a uniform speed. Figure 3 is a schematic diagram depicting a 4-way Nehalem multiprocessor chip that is connected to a bank of RAM. The configuration of processors and RAM shown in Figure 3 is a building block that is used in creating a larger scale machine by connecting two or more of such nodes together.

Quad-core node

Figure 3. A schematic diagram depicting a NUMA node showing locally-attached RAM and a multi-core socket.

A two-node NUMA server is illustrated in Figure 4, which shows a direct connection between the memory controller on node A and the memory controller on node B. This is the relatively simple case. A thread executing on node A can access any RAM location on either node, but an access to a local memory address is considerably faster. The latency to access to a remote memory location is several times slower. (Definitive timings are not available as of this writing because early versions of the hardware are just starting to become available.)

Two-node NUMA server based on Nehalem 

Figure 4. A two-NUMA server showing a cross-node link that is used when a thread on one node needs to access a remote memory location.

As the number of nodes increases, it is no longer feasible for every node to be directly connected to every other node, nor can each bank of RAM that is installed be accessed in a single hop. The specific technology used to link nodes may introduce additional variation in the cost of accessing remote memory. From any one node, it could take longer to access memory on some nodes than others. For instance, some nodes may be accessed in a single hop across a direct link, while other accesses may require multiple hops. Some manufacturers favor routing through a shared directory service, for example. Your mileage may vary.

Specifically, in the Intel architecture, manufacturers are supplying a cache coherent flavor of NUMA servers (ccNUMA). Cache coherence is implemented using a snooping protocol to ensure that threads executing on each NUMA node have access to the most current copy of the contents of the distributed memory. Details of the snooping protocol used in Intel ccNUMA machines are discussed here.

AMD has taken a somewhat different tack in building its multi-core processors. For communication on chip between processors, AMD uses a technology known as HyperTransport, which is a dedicated, per-processor 2-way high speed link. Multiple processors cores are then linked on the chip in a ring topology as depicted in Figure 5. The ring topology has the effect of scaling the bus bandwidth that is used as an interconnect linearly with the number of the processors. But the architecture leads to NUMA characteristics. A thread executing on CPU 0 can access a local memory location, a remote memory location that is local to CPU 1 at the cost of one hop across the HT link, or a remote memory location that is local to CPU 2 at the cost of two hops across HT links.

AMD multi-core socket

Figure 5. The AMD approach to multi-core processors has NUMA characteristics. A program executing on CPU 0 that accesses RAM that is local to CPU 2 requires two hops across the HyperTransport links that connect the processors in a ring.

Historically, application development for NUMA machines meant understanding the performance costs associated with accessing remote memory on a specific hardware platform. Since manufacturers employ different proprietary interconnection schemes in their multi-tiered NUMA machines, application developer are challenged to find the right balance in exploiting a specific proprietary architecture that may then limit the ability to port the application to a different platform in the future. It may be possible to connect nodes in a NUMA machine in an asymmetric configuration, for example, where the performance cost function associated with accessing different memory locations is decidedly irregular.

To scale well, a multi-threaded program running on a NUMA machine needs to be aware of the machine environment and understand which memory references are local to the node and which are remote. A thread that was running on one NUMA node that migrates to another node pays a heavy price every time it has to fetch results from remote memory locations. The difficulty programmers face when trying to develop a scalable, multi-threaded application for a NUMA architecture machine is understanding their memory usage pattern and how it maps to the NUMA topography. When NUMA considerations were confined to expensive, high-end supercomputers, the inherent complexities developers faced in programming them were considered relatively esoteric concerns. However, in the era of many-core processors, NUMA is poised to become a mainstream architecture.

In theory, it is possible to craft an optimal solution when threads and the memory they access are balanced across NUMA processing nodes. In order to achieve an optimal balancing of the machines resources without overloading any of them, programs need to understand the CPU and memory resources that individual tasks executing in parallel require and understand how to best map those resources to the topography of the machine. Then they require a suitable scheduling mechanism to achieve the desired result. Achieving an optimal balance, as a practical matter, is not easy, in the face of variability in the resources required by any of execution threads, a complication that may then require dynamic adjustments to the scheduling policy in effect.

The Windows OS is already NUMA-aware to a degree and, thus, supports a NUMA programming model. For example, once dispatched, threads have node affinity and tend to stay dispatched on an available processor within a node. Windows OS memory management is also NUMA-aware, maintaining per node allocation pools. The OS not only resists migrating threads to another node, it also tries to ensure that most memory allocated are satisfied locally using per node memory management data structures.

Windows also provides a number of NUMA-oriented APIs that applications can use to keep their threads from migrating off-node and also enable them to direct memory allocations to a specific physical processing node. For more information on the NUMA support in Windows, see the MSDN Help topic “NUMA Support.”

To help application developers deal better with the complexities of NUMA architectures in the future, the Windows NUMA support needs to evolve. One potential approach would be for the OS to attempt to calculate a performance cost function at start-up that it would then expose to driver and application programs when they start up and run. Conceivably, the OS might also need to adjust this performance cost function to response to configuration changes that occur dynamically, such as any power management event that affects memory latency. These changes would then have to be communicated to NUMA-aware drivers and applications somehow so they could adapt to changing conditions.

Continue to Part III of this article.