Mainstream NUMA and the TCP/IP stack, Part III: A look back at older strategies to scale high-speed networking

This is a continuation of Part II of this article posted here.

By necessity, both the hardware and the software devoted to processing network traffic need to evolve in the many-core era to become multiprocessor-oriented. On servers that have NUMA architectures, that multiprocessing support needs to acquire a NUMA flavoring. The technology that allows network interrupts to be processed concurrently across multiple processors includes support for

· multiple Descriptor Queues in the networking hardware,

· Extended Message Signaled Interrupts (MSI-X) to allow hardware interrupts to be serviced concurrently on more than one processor, and

· the software support in Windows known as Receive-Side Scaling (or RSS).

With ccNUMA architecture machines becoming more mainstream, it is clear that multi-processor support should also include being NUMA-aware.

For an idea of how fast the NICs are getting, a typical 1 Gb Ethernet card supports 4 transmit and 4 receive interrupt queues and can spread interrupts across as many as 4 host processors for load balancing under RSS. A 10 Gb Ethernet card necessarily supports even high levels of parallelism. For example, the dual-ported Intel 82598 10 Gigabit Ethernet Controller provides 32 transmit queues and 64 receive queues per port, which can be mapped to a maximum of 16 processor cores. Note that this increase in parallel processing capacity is only a 4x improvement over recent 1 Gb Ethernet cards, which is probably inadequate to exploit the increased bandwidth fully.

Let’s consider briefly some of the ideas to improve TCP/IP performance that have been implemented in the recent past. The strategies discussed here either increase the efficiency of host computer processing of TCP/IP packets or attempt to off load some of these software functions onto networking hardware. These strategies have proven effective, but they do not offer enough capacity relief to keep pace with the steady advance of networking speeds. The way out of the current mismatch between high speed networks and the host processing requirements they generate is a parallel processing approach where TCP/IP interrupts are distributed across multiple CPUs.

Interestingly, some of the changes outlined here to streamline TCP/IP host processing fly in the face of a major precedent. The layered architecture of the networking stack is widely regarded as one of the storied accomplishments of software engineering. Several of the efforts to improve the performance of host processing of TCP/IP interrupts involve shattering the strict isolation of components that the layered networking model advocates.

In theory, at least, the layered approach simplifies complex software that is designed to function smoothly in diverse environments. Layering also supports development of components that can proceed independently and in parallel. In principle, each layer defines and adheres to a standard set of services, or interfaces, that it provides to the component in the layer immediately above it. An upper layer communicates with a level below it using only this predefined set of abstract interfaces. (The set of services provided and consumed by two adjacent layers, in effect, defines a contract.) Furthermore, in the design of the networking protocol, components are isolated. It is an article of faith among software engineers that layered architectures, when properly defined and implemented, greatly contribute to the robustness and reliability of the software built using those design principles.

TCP/IP Layers

Data unit





5. Application




4. Transport

End-to-end connections (sessions) and reliable delivery



3. Network

Logical addressing and routing; segmentation & re-assembly



2. Data link

Physical addressing (MAC)

Ethernet, ATM


1. Physical

Media, signal and binary transmission

Optical fiber, coax, twisted pair

Table 4. The TCP/IP layered networking model.

Table 4 is a standard representation of the layered networking model used in TCP/IP, which has gained almost universal acceptance in computer-computer communications. Take the ubiquitous IP layer, for instance. IP implements a Best Effort service model to deliver packets from one station to another using routing. By design, it is connectionless, session-less and unreliable. Delivery of packets to the correct destination is not guaranteed, but IP does take a “best effort” approach to accomplish this. For the applications that require these services, the higher-level TCP Host protocol guarantees that packets are delivered reliably and in order to the designated application layer above it. It does this using a session-oriented protocol that preserves the state of the messaging-passing session between packets. TCP has also evolved complicated, performance-oriented flow and congestion control mechanisms that are beyond the scope of the current discussion.

The layering approach to networking introduces one additional and crucial design constraint. When one station is transferring a message to another, only components at the same level in the protocol stack can exchange data and communicate with each other. For example, only the TCP component in the receiver is supposed to be able to understand and process information placed into the TCP packet header by the sender. However, both the TCP Offload Engine and Receive-Side Scaling utilize knowledge of what is going on the upper layer TCP protocol down in the Data Link (or MAC) layer in the receiver, a serious violation of the principle that the layers in the protocol stack remain totally isolated from each other. Apparently, this is a case where the serious performance issues trump the pure design principles, and the evolution of TCP/IP has always been sensitive to practical issues of scaling. It is not that you absolutely cannot violate the contract that governs the ways layers communicate, but it is something that should be done very thoughtfully so that your once clean interfaces don’t start to look like swiss cheese.

A crucial factor that works to encourage breaking with precedent is that the protocols from the TCP layer down to the hardware all adhere very strictly to the standards in order to promote interoperability. This strict compliance has the effect of hardening the services and interfaces between these layers in cement. This rigidity actually reduces the risk of side effects when a lower level component presumptuously usurps a service that architecturally is defined as the responsibility of some higher level.

Another factor is also at work. Within a layer, components that conform to the same contract layer can, in theory, be freely substituted for each other. This principle of the layered approach is supposed to promote development of a profusion of components that implement different sets of services, but still adhere to the strict requirements of the standard. In fact, the need for interoperability severely limits the proliferation of components that can be freely substituted for each other. Ethernet is almost always the hardware used at the bottom of the stack due to its superior cost/performance. Ethernet is always followed by IP, which is then usually followed by TCP. UDP can be freely substituted for TCP at the host processing layer, but only when the TCP services that ensure reliable delivery of packets and flow control can be dispensed with. In practice, there is very little variety among the components you will see operating in every networking protocol stack. TCP/IP over Ethernet, being ubiquitous, achieves the highest possible degree of interconnectivity and interoperability.

With the TCP/IP stack so pervasive and so stable and dominant, it then becomes possible to think the unthinkable. It becomes difficult to resist the temptation to violate the principle of isolation if you can demonstrate a big enough performance win. Having the Ethernet layer peek into the TCP packet headers and optimize their processing is acceptable when the violation of this sacrosanct principle of layering yields sufficient performance or scalability improvements.

Recent performance improvements to TCP/IP host processing.

Before we drill into the current set of architectural changes to the networking stack, let’s explore briefly some of the more successful strategies for reducing the host computer processing requirements associated with TCP/IP interrupt processing that have been explored in the past. Stateless processing associated with the IP layer were some the earliest functions identified that could be performed on the NIC and eliminate some amount of host processing. These offloaded functions include Checksum and segmentation for large Sends, both of which were supported in the Windows 2000 timeframe. Because the IP protocol is stateless and connectionless, there are virtually no side effects to performing these functions on the NIC, even if it does violate the principle of strict isolation between layers in the protocol stack.

Another set of performance improvements that have been implemented recently do potentially generate serious side effects that must be handled rather delicately. These include:

· interrupt moderation

· jumbo frames

· TCP Offload engine (TOE)

We will drill into these three approaches next, discussing some of the potential side effects, performance trade-offs, and other issues they raise. It is probably also worthwhile to mention netDMA, which is the Windows support for Intel’s I/O Acceleration Technology (I/OAT), in this context. I/OAT makes targeted improvements in the processor memory architecture to improve the efficiency of NIC-to-memory transfers.

Each of these approaches has worked to a degree, but none has produced enough of a breakthrough in performance to address the underlying condition, the growing mismatch between host processing requirements and network bandwidth.

As noted earlier, the CPU load associated with the processing Ethernet packets with TCP/IP at a server is a long-standing and persistent performance problem that has escaped a satisfactory solution in the past. For many years, the thrust of conventional solutions to the problem was straightforward – namely, any means possible for reducing the number of interrupts that the host computer needs to process. Two of the more effective approaches to reducing the number of interrupts are to use some form of interrupt moderation or so-called jumbo frames, basically larger packets than the Ethernet standard supports. Both approaches are effective to a degree, but also have serious built-in limitations and drawbacks.

Interrupt moderation on the NIC is widely used today to reduce the host interrupt processing rate. It is successful, but only to the extent of addressing the processing load associated with each interrupt, which, as indicated in the protocol overhead measurements discussed earlier, is relatively minor. A NIC that supports interrupt moderation can delay the host interrupt for up to a specified period of time with the hope that the NIC will receive additional networks packets to process during the delay. Then, instead of each packet causing an interrupt, the host processor can process multiple packets in a single interrupt. In the measurements reported in Part 1, interrupt moderation was used to cut the host processor interrupt rate in half. When you consider as we have earlier, the potential rate of networks interrupts that a 10Gb Ethernet card can drive, some form of interrupt moderation on the NIC becomes essential for the smooth operation of the host processor.

Interrupt moderation helps, but not enough to relieve the bottleneck at the host CPU. The host processing associated with the TCP/IP protocol appears to scale as a function of both the number of interrupts and the amount of data being transferred between the NIC and the host computer. As the average size of data payloads increases, the processing bottleneck shifts to memory latency. See, for example, the bottleneck analysis presented in the Intel white paper “Accelerating High-Speed Networking with Intel® I/O Acceleration Technology.”

Interrupt moderation should be used cautiously in situations where the fastest possible network latency is required, such as two communicating infrastructure servers connected to the same high-speed networking backbone. It also has to be implemented carefully to ensure it does not interfere with the TCP congestion control functions that try to measure round trip time (RTT).

Jumbo frames. Sending data across the wire in so-called jumbo frames also significantly reduces the number of host interrupts. And there is little question that the size of the Ethernet MTU is sub-optimal for many networking transmission workloads. Consider the relatively large data payloads that routinely need to be transferred between a back-end database machine and the clusters of front-end and middle tier machines in a typical clustered, multi-tier web service application today. Using jumbo frames of, say, 9K payloads on the high speed network backbone linking these servers leads to a 6:1 reduction in the number of host processor interrupts required to transfer sizable blocks of data. When servers are connected to a Storage Area Network (SAN) using iSCSI, even larger frames are desirable.

In fact, jumbo frames appears to be such a simple, effective solution within the confines of the data center that it naturally leads to consideration of what other aspects of the TCP/IP protocol that are sub-optimal in that environment could also be modified. For example, when there is frequent high speed communication between very reliable components, the TCP/IP requirement to acknowledge positively the receipt of every packet is overkill, and it very tempting to break with the standard and relax that requirement. The superior cost/performance of high speed Ethernet-based networking makes it very tempting to consider as an alternative interconnect technology to use with both SANs and High Performance Computing (HPC) clusters. In both these cases there are alternatives linkage technologies that outperform TCP/IP that are also considerably more expensive. For a further discussion of this issue in the context of SAN performance, see “An Introduction to SAN Capacity Planning.” And for the HPC flavor of this same discussion, see Jeffrey Mogul’s “TCP offload is a dumb idea whose time has come.”

Unfortunately, using non-standard jumbo frames introduces a significant compatibility problem that severely limits the effectiveness of the solution. The great majority of network clients will reject frames larger than the standard Ethernet MTU of 1500 bytes. In effect, you can send jumbo frames between specific host computers that are equipped to handle them on a dedicated backbone segment readily enough, but you cannot reliably send them to just any machine connected using the IP internetworking layer. So implementing jumbo frames requires more complicated routing schemes. TCP/IP RFC 2923 section 2.1, which is supported in Windows XP SP3, Vista, and Windows Server 2008, allows two TCP peers to negotiate the largest size MTU that can be transmitted between them. But the connectionless and stateless IP routing mechanism means that no single packet transmitted between station A and B need follow the same route twice. Given that the precise route to the destination station is dynamically constructed for each packet, any intermediate router that did not support jumbo frames would reject any non-standard packets it received and prevent successful transmission to the receiver.

TCP Offload Engine. A TCP Offload Engine (or TOE) is another solution that has been implemented to reduce the host processing required for TCP/IP interrupts. As the name suggests, in this approach, certain TCP/IP protocol functions are performed directly on the NIC, either reducing the amount of processing that must be performed in the host machine, or eliminating host interrupts associated with certain TCP/IP housekeeping operations entirely. Areas where significant performance gains are experienced with TOE include the elimination of expensive memory copy operations, offloading segmentation and reassembly (a function of the IP layer), and offloading some of the TCP housekeeping functions that ensure reliable connections (mainly, ACK processing and TCP retransmission timers). Moving these functions onto the NIC results in a reduction of the total number of interrupts that need to be processed by the host machine. Potential performance benefits associated with TOE are quantified here: “Boosting Data Transfer with TCP Offload Engine Technology.” You can see that the potential CPU savings are considerable.

TCP Offload Engine, however, is a grievous violation of the layered architecture of the networking protocol stack. The TCP Chimney Offload feature that provides TOE support in Windows, for example, required an extensive re-architecture of the TCP/IP stack. TOE introduces many breaking changes. See the KnowledgeBase article entitled ”Information about the TCP Chimney Offload feature in Windows Server 2008” detailing the many limitations, reflecting what networking functions can & can’t safely be offloaded in which computing environments. For instance, any networked machine that enforces an IPsec-based security policy where it is necessary to inspect each individual packet cannot use TOE. Neither is TOE currently compatible with either server virtualization technology or common forms of clustering based on virtual IP addresses.

The modest benefits in many environments and the complexities introduced due to explicit violations of the layered model of the network protocol argue against a general TOE solution. Another strong criticism of the TOE approach is that it merely moves the bottleneck from the host processor to the NIC. As RSS penetrates the market for high-speed networking, I believe that interest in the TOE approach will wane. If you do have a processing bottleneck on the host machine as a result of high-speed networking, with an RSS solution, at least, the bottleneck is visible, and there are inexpensive mechanisms to help deal with it. A processing bottleneck on the NIC is opaque and resists any capacity solution other than to swap in a more expensive card, assuming one exists, and hope that the new one is significantly faster and more powerful than the old one.

Intel I/OAT. Intel’s I/OAT introduces memory architecture improvements that give the NIC access to a dedicated DMA (direct memory access) engine for copying data between host memory and the NIC. These architectural changes are known as the Intel QuickData Technology DMA subsystem. With both interrupt moderation and the TCP Offload of IP segmentation and re-assembly, the processor tends to receive fewer interrupts to process, but each interrupt results in larger amounts of data that needs to be processed by the host. The networking protocol stack services the initial interrupt from the NIC and examines the Receive data block while it is running in kernel mode. Most data blocks associated with networking I/O subsequently need to be copied into the networking application’s private address space. A performance analysis showed that, especially with larger blocks of data, this second memory-to-memory copy operation was responsible for a very large portion of the host processor load. An Intel white paper here describes this analysis in some detail.

Ultimately, the result of this performance analysis was the set of I/OAT architectural improvements that permit this second memory-to-memory operation to be performed by a DMA provider engine (located on the Northbridge chip set currently) that requires no additional host processor bandwidth. The memory copy operation occupies the memory controller, but does not consume Front Side Bus bandwidth, which also frees up the host processor to perform other CPU tasks. Interestingly, Windows support for this technology, described in the Driver Development Kit (DDK) documentation, is actually very general, but to date the only netDMA client available is the tcpip.sys kernel mode driver that processes networking interrupts. It ought to be possible for disk I/O controllers to also exploit I/OAT architectural improvements sometime in the future. However, data blocks associated with disk I/O, which are cached by default in the system address space in Windows, are not necessarily subject to multiple copy operations, depending on the cache interface used.

Continue to Part IV.