Mainstream NUMA and the TCP/IP stack, Part IV: Parallelizing TCP/IP

This is a continuation of Part III of this article posted here.

In the many-core era, the host processor overhead associated with processing TCP/IP interrupts is not a capacity problem, since CPU cycles on the host computer are plentiful and becoming more plentiful all the time. The problem is that the individual processors themselves are not fast enough, nor are they growing much faster. To craft a solution that works in the many-core era, there is a clear need to enhance the hardware and software in the TCP/IP protocol stack to run in parallel across multiple processors and take advantage of the available capacity. There are two hardware and software technologies that are associated with that capability today:

· Extended Message-Signaled Interrupts (MSI-X): a hardware technology that allows the NIC to support multiple interrupt vectors, enabling multiple processor cores to handle interrupts from the NIC simultaneously.

· Receive-Side Scaling (RSS): the protocol used in the NDIS driver software to manage multiple interrupt vectors and communicate to the hardware to ensure that session-oriented TCP packets are delivered in sequence to a processor-specific interrupt queue.

MSI-X and RSS work together to allow the processing of TCP/IP Receive packets to scale in parallel across multiple processor cores

Message Signaled Interrupts (MSI-X) . MSI-X is an architectural change that allows a device to send interrupts to be processed on multiple CPUs. Historically, on the Intel architecture, devices were limited to sending interrupts to a single target. Concentrating all hardware interrupts on a single processor boosts the instruction execution rate of the Interrupt Service Routine (ISR) by increasing the chances of a processor cache warm start. In the many-core era, limiting the device to one processor that it can interrupt is a severe capacity constraint. MSI-X capabilities allow the NICs to scale on many-core processors.

One key feature of Windows’ support for MSI-X devices is the ability to specify a policy that automatically assigns MSI-X interrupts to CPUs based on the OS’s understanding of the underlying NUMA topology of the machine. An NDIS-driver that supports MSI-X devices can specify an IrqPolicySpreadMessagesAcrossAllProcessors policy that automatically distributes interrupts across an optimal set of eligible processors. On some NUMA machines, the performance of the device connection is affected by the underlying topology of the multi-node connections. For instance, certain device-to-processor node connections may be low latency local ones, while others are higher latency remote connections. For performance reasons, you want NIC interrupts to be processed on nodes that are connected locally and access local memory on that node exclusively. For optimal scalability, you then want to balance device interrupts across all the NUMA nodes that are interconnected. The IrqPolicySpreadMessagesAcrossAllProcessors policy understands these performance considerations, and distributes the device interrupts to the right set of processors automatically.

Figure 6 illustrates one way the IrqPolicySpreadMessagesAcrossAllProcessors policy could be used to distribute interrupts from the NIC across nodes in a simple NUMA machine. A server with two quad-core sockets is shown, with each socket connected to a block of local RAM. Memory accesses from a processor core to local RAM are considerably faster than an access to remote memory attached via a bridge to the other multi-core socket. An optimal configuration is to process TCP/IP interrupts on CPU 0 on the first node and on CPU 1 on the second node, as depicted, balancing the networking I/O load across nodes.

NUMA machine with two RSS queues

Figure 6. Two NUMA nodes in a Windows Server machine configured to use MSI-X and RSS to process TCP/IP Receive packets across multiple processors.

While Receive-Side Scaling (RSS) does not require MSI-X, the two technologies normally go hand-in-hand. We restrict the RSS discussion here to the manner in which MSI-X devices are supported, which is both the simplest and most common case.

Receive-Side Scaling (RSS) . RSS complements the Windows support for MSI-X. It allows the workload associated with processing network interrupts to be spread across multiple CPUs. With RSS, the DPC routine that we have seen is responsible for performing the bulk of the host processing is also scheduled to run on the same CPU where the interrupt service routine (ISR) just ran. Concentrating all the work associated with network interrupt processing on the same CPU improves instruction execution rates because data associated with the packet is likely to remain in the processor caches. It also dramatically reduces delays spent in unproductive spin lock code associated with serialization. Optimistic, non-blocking per processor locking strategies are effective under these circumstances. By default under RSS, even the Send processing associated with an ACK message is also processed on the same CPU where the Receive was processed to take advantage of the same performance considerations.

There is one complication, however, that arises when network interrupts are distributed across multiple CPUs that RSS is forced to address. If packets are distributed randomly across multiple CPUs, this can conflict with the important function of the TCP protocol that guarantees delivery of data in sequence to the application. Suppose packets for a group of TCP connections are processed across two CPUs and one CPU in the bunch is lightly loaded while the other is heavily loaded. Older packets received on the lightly loaded CPU could easily be processed first. Receiving packets out of order in TCP triggers Fast Retransmits, for example, that could degrade both the network and delay the application, not to mention serialization delays before TCP can safely notify the application layer that Request data is available for processing.

Given this complication, RSS distributes connections, not individual packets. RSS has a mechanism that sends all the packets associated with any one TCP connection to the same processor. This preserves the order of delivery of received data packets, which avoids needless requests for TCP retransmits. Crucially, the processor associated with the specific connection must be communicated to the NIC, which must arrange Received packets into the correct message queues accordingly, prior to signaling the host processor by raising an interrupt. This coordination, of course, is another violation of the isolation principle of the layered networking stack. It is worth noting that nasty side effects can arise as a result of this willful violation of the layered networking architecture; see, for example, KB927168 documenting a conflict between RSS and Internet Connection Sharing on Vista that was later fixed in WS2008 and Vista SP1.

To achieve good performance, however, it is absolutely necessary for the NIC to schedule all the packets for the same TCP session to same host processor. It can only do that by peeking into the TCP header and finding the port indicator, which it then uses to calculate the right CPU to deliver the packet to. This calculation is based on a hash table that is passed to the NIC by the NDIS driver software. RSS even includes a capability to adjust the load across the CPUs that are enabled for processing NIC interrupts dynamically. The protocol stack in Windows can re-balance the interrupt load by modifying the hashing table passed to the NIC that is used in determining the proper CPU. This mechanism can be used in case some CPUs remain overloaded for an extended period of time because, for example, some TCP connections are more chatty and persistent than others.

Speaking of maintaining a balanced system, long-running tasks such as large file copies associated with a single ftp, SMB or media server session present inherent difficulties under RSS. The general problem is that the throughput of any one session is ultimately limited by host processor speed. With many-core processors, it is important to figure out how to use parallel data divide-and-conquer techniques to break long serial operations into smaller sub-tasks that can be executed concurrently. Providing the capability to spread long, data-intensive operations across multiple TCP sessions is one possible approach.

For further technical details on RSS, see One interesting aspect of the RSS specification is that the DPC, not the ISR, is responsible for re-enabling the processor for more interrupts from the NIC. This prevents the NIC from sending any more Receive packets to the processor until the previous set has been completely processed. This effectively acts as both a serialization mechanism and a form of interrupt moderation that adaptively adjusts the delay between interrupts based on the specific processing load at the CPU.

This blog entry is continued here.