Mainstream NUMA and the TCP/IP stack: Final Thoughts

This is a continuation of Part IV of this article posted here

Note that a final version of a white paper tying this series of five blog entries together (and a Powerpoint presentation on the subject) are attached.

For many years, the effort to improve network performance on Windows and other platforms focused on reducing the host processing requirements associated with the need to service frequent interrupts from the NIC. In the many-core era where the clock speeds of processors are constrained by power considerations, this strategy is inadequate to the growing host processing requirements that accompany high-speed networking. It is necessary to augment technologies like interrupt moderation and TCP Offload Engine that improve the efficiency of network I/O with an approach that allows TCP/IP Receive packets to be processed in parallel across multiple CPUs. Together, MSI-X and RSS are technologies that enable host processing of TCP/IP packets to scale in the many-core world, albeit not without some compromises with the prevailing model of networking using isolated, layered components.

Using MSI-X and RSS, for example, the Intel 82598 10 Gigabit Ethernet Controller mentioned earlier can be mapped to a maximum of 16 processor cores that could then be devoted to networking I/O interrupt handling. Capacity-wise, this is still not sufficient processing capacity to handle the theoretical maximum load equation 3 predicts for a 10 Gb Ethernet card, but it does represent a substantial scalability improvement.

 With this understanding of what MSI-X and RSS accomplishes, let’s return for a moment to our NUMA server machine shown in Figure 6 below.

NUMA server with multiple RSS queues

With MSI-X and Receive-Side Scaling, CPU 0 on node A and CPU 1 on node B are both enabled for processing network interrupts. Since RSS schedules the NDIS DPC to run on the same processor as the ISR, even at moderate networking loads, CPU 0 and 1 for all practical purposes become dedicated to the processing of high priority networking interrupts.

Numerous economies of scale accrue using this approach. The same RSS process that sends all Receive packets from a single TCP connection to a specific CPU for processing improves the efficiency of that processing. The instruction execution rate of the TCP/IP protocol stack is enhanced significantly through this scheduling mechanism that enforces localization. Ultimately, TCP/IP application data buffers need to be allocated from local node memory and processed by threads confined to that node. Recently used data and instructions that networking ISRs and DPCs issue tend to reside in the dedicated cache (or caches) associated with the processor devoted to network I/O. Or, at the very least, they migrate to the last level cache that is shared by all the processors on the same NUMA node.

Ultimately, of course, the TCP layer hands data from the network I/O to an application layer that is ready to receive and process it. The implications of RSS for the application threads that process TCP receive packets and build responses for TCP/IP to send back to network clients ought to be obvious, but I will spell them out anyway. For optimal performance, these application processing threads also need to be directed to run on the same NUMA node where the TCP Receive packet was processed. This localization of the application’s threads should, of course, be subject to other load balancing considerations to prevent the ideal node from becoming severely over-committed while other CPUs on other nodes are idling or under-utilized. The performance penalty for an application thread that must run on a different node than the one that processed the original TCP/IP Receive packet is considerable because it must access the data payload of the request remotely. Networked applications need to understand these performance and capacity considerations and schedule their threads accordingly to balance the work across NUMA nodes optimally.

Consider the ASP.NET application threads that process incoming HTTP Requests and generate HTTP Response messages. If the HTTP Request packet is processed by CPU 0 on node A in a NUMA machine, the Request packet payload is allocated in node A local memory. The ASP.NET application thread running in User mode that processes that incoming HTTP Request will run much more efficiently if it is scheduled to run on one of the other processors on node A, where it can access the payload and build the Response message using local node memory.

There is currently no mechanism in Windows today for kernel mode drivers like ndis.sys and http.sys to communicate to the application layers above them and specify the NUMA node on which that packet was originally processed. Communicating that information to the application layer is another grievous violation of the principle of isolation in the network protocol stack, but it is a necessary step to improve the performance of networking applications in the many-core era where even moderately sized server machines have NUMA characteristics.

Herb Sutter, “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.” Dr. Dobb’s Journal, March 1, 2005. 

NTttcp performance testing tool:

Windows Performance Toolkit (WPT, aka xperf):

David Kanter, “The Common System Interface: Intel's Future Interconnect,”  

Windows NUMA support:

Intel white paper: Accelerating High-Speed Networking with Intel® I/O Acceleration Technology

Mark B. Friedman, “An Introduction to SAN Capacity Planning,” Proceedings, Computer Measurement Group, Dec. 2001.

Jeffrey Mogul’s “TCP offload is a dumb idea whose time has come,” Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9, 2003.

Dell Computer Corporation, “Boosting Data Transfer with TCP Offload Engine Technology.”

Microsoft Corporation, KB 951037,

Microsoft Corporation, Windows Driver Development Kit (DDK) documentation,

Microsoft Corporation, KB 927168,

Microsoft Corporation, NDIS 6.0 Receive-Side Scaling documentation,|5CMG%20paper%208220%20draft|6.docx