Core Protocol Stack Components and the TDI Interface

10/08/2009

Applies To: Windows Server 2003, Windows Server 2003 R2, Windows Server 2003 with SP1, Windows Server 2003 with SP2

The core protocol stack components are those shown between the NDIS and TDI interfaces in Figure 1. They are implemented in the Windows Server 2003 Tcpip.sys driver. The TCP/IP stack is accessible through the TDI interface and the NDIS interface. The Winsock2 interface also provides some support for direct access to the protocol stack.

Address Resolution Protocol (ARP)

ARP performs IP address-to-MAC address resolution for outgoing packets. As each outgoing IP datagram is encapsulated in a frame, source and destination MAC addresses must be added. Determining the destination MAC address for each frame is the responsibility of ARP.

ARP compares the next-hop IP address on every outbound IP datagram to the ARP cache for the network interface card over which the frame will be sent. If there is a matching entry, the MAC address is retrieved from the cache. If not, ARP broadcasts an ARP Request frame on the local subnet, requesting that the owner of the IP address in question reply with its MAC address. If the packet is going through a router, the next-hop address is that of a neighboring router and ARP resolves the MAC address for that next-hop router, rather than the final destination host. When an ARP reply is received, the ARP cache is updated with the new information, and it is used to address the packet at the link layer.
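
The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the Tcpip.sys implementation: the function names, the dictionary used as the per-interface cache, and the send_arp_request callback (which stands in for broadcasting an ARP Request and waiting for the reply) are all assumptions made for the example.

```python
import ipaddress

def next_hop_for(dest_ip, local_network, gateway_ip):
    """Return the IP address that ARP must resolve for dest_ip."""
    if ipaddress.ip_address(dest_ip) in ipaddress.ip_network(local_network):
        return dest_ip       # on-link: resolve the destination itself
    return gateway_ip        # off-link: resolve the next-hop router instead

def resolve_mac(next_hop_ip, arp_cache, send_arp_request):
    """Look up next_hop_ip in the interface's cache; ask the network on a miss."""
    mac = arp_cache.get(next_hop_ip)
    if mac is None:
        # Hypothetical callback standing in for the broadcast ARP Request.
        mac = send_arp_request(next_hop_ip)
        if mac is not None:
            arp_cache[next_hop_ip] = mac   # remember the ARP reply
    return mac
```

With a cache that already holds the router entry 157.60.136.1, a datagram for an off-subnet destination resolves from the cache without any broadcast, while an on-link destination that is not yet cached triggers the callback and is then stored.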

ARP Cache

You can use the ARP utility to view, add, or delete entries in the ARP cache. Examples are shown below. Entries added manually are static and are not automatically removed from the cache, whereas dynamic entries are aged out of the cache automatically (see the “ARP Cache Aging” section for more information).

The arp command can be used to view the ARP cache, as shown here:

C:\>arp -a
 
Interface: 157.60.137.88 --- 0x10003
  Internet Address      Physical Address      Type
  157.60.136.1          00-0a-42-b0-54-0a     dynamic
  157.60.137.0          00-b0-d0-e9-41-43     dynamic
 
Interface: 10.0.0.3 --- 0x10004
  Internet Address      Physical Address      Type
  10.0.0.1              08-00-2b-c4-25-b6     dynamic

The computer in this example is multihomed (it has more than one network interface card), so there is a separate ARP cache for each interface.

In the following example, the command arp -s is used to add a static entry to the ARP cache used by the second interface for the host whose IP address is 10.0.0.32 and whose network interface card address is 00-60-8C-0E-6C-6A:

C:\>arp -s 10.0.0.32 00-60-8c-0e-6c-6a 10.0.0.3
 
C:\>arp -a
 
Interface: 157.60.137.88 --- 0x10003
  Internet Address      Physical Address      Type
  157.60.136.1          00-0a-42-b0-54-0a     dynamic
  157.60.137.0          00-b0-d0-e9-41-43     dynamic
 
Interface: 10.0.0.3 --- 0x10004
  Internet Address      Physical Address      Type
  10.0.0.1              08-00-2b-c4-25-b6     dynamic
  10.0.0.32             00-60-8c-0e-6c-6a     static

ARP Cache Aging

Windows Server 2003 adjusts the size of the ARP cache automatically to meet the needs of the system. If an entry is not used by any outgoing datagram for two minutes, the entry is removed from the ARP cache. Entries that are being referenced are removed from the ARP cache after ten minutes. Entries added manually are not removed from the cache automatically. The registry parameter ArpCacheLife, described in Appendix A, allows more administrative control over aging.
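
The aging rules above can be modeled as follows. This is a sketch under stated assumptions: the class name ArpCache and its methods are hypothetical, timestamps are passed in explicitly (in seconds) so the behavior is easy to follow, and the ArpCacheLife registry parameter is not modeled.

```python
UNUSED_LIFETIME = 2 * 60    # seconds an entry survives without being used
MAX_LIFETIME = 10 * 60      # seconds an entry survives even while referenced

class ArpCache:
    """Toy model of one interface's ARP cache (hypothetical, for illustration)."""

    def __init__(self):
        self._entries = {}  # ip -> {"mac", "created", "last_used", "static"}

    def add(self, ip, mac, now, static=False):
        self._entries[ip] = {"mac": mac, "created": now,
                             "last_used": now, "static": static}

    def expire(self, now):
        """Drop dynamic entries that are idle too long or simply too old."""
        for ip, entry in list(self._entries.items()):
            if entry["static"]:
                continue    # manually added entries are never aged out
            idle = now - entry["last_used"]
            age = now - entry["created"]
            if idle >= UNUSED_LIFETIME or age >= MAX_LIFETIME:
                del self._entries[ip]

    def lookup(self, ip, now):
        """Return the cached MAC address, or None if ARP must resolve it again."""
        self.expire(now)
        entry = self._entries.get(ip)
        if entry is None:
            return None
        entry["last_used"] = now    # an outgoing datagram referenced the entry
        return entry["mac"]
```

In this model, an entry that is looked up every minute or so still disappears ten minutes after it was created, an entry that is never used disappears after two minutes, and a static entry stays until it is deleted explicitly.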

Use the command arp -d to delete entries from the cache, as shown below:

C:\>arp -d 10.0.0.32
 
C:\>arp -a
 
Interface: 157.60.137.88 --- 0x10003
  Internet Address      Physical Address      Type
  157.60.136.1          00-0a-42-b0-54-0a     dynamic
  157.60.137.0          00-b0-d0-e9-41-43     dynamic
 
Interface: 10.0.0.3 --- 0x10004
  Internet Address      Physical Address      Type
  10.0.0.1              08-00-2b-c4-25-b6     dynamic

ARP queues only one outbound IP datagram for a specified destination address while that IP address is being resolved to a MAC address. If a UDP-based application sends multiple IP datagrams to a single destination address without any pauses between them, some of the datagrams may be dropped if there is no ARP cache entry already present. An application can compensate for this by calling the Iphlpapi.dll routine SendARP() to establish an ARP cache entry before sending the stream of packets. See article 193059 in the Microsoft Knowledge Base (https://go.microsoft.com/fwlink/?linkid=67951) or MSDN (https://go.microsoft.com/fwlink/?linkid=67904) for IP Helper API details.
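
As an illustration of this workaround, the sketch below calls the IP Helper SendARP routine from Python through ctypes before a burst of UDP datagrams would be sent. It runs only on Windows, and the helper names (ipv4_to_ipaddr, format_mac, prime_arp_cache) are invented for the example; treat it as a starting point rather than a reference implementation.

```python
import ctypes
import socket
import struct

def ipv4_to_ipaddr(ip):
    """Convert dotted-quad text to the IPAddr value SendARP expects:
    the four address bytes in network order, read as a native 32-bit integer."""
    return struct.unpack("=I", socket.inet_aton(ip))[0]

def format_mac(raw, length):
    """Render the first `length` bytes as text such as 00-60-8c-0e-6c-6a."""
    return "-".join("%02x" % b for b in raw[:length])

def prime_arp_cache(dest_ip, src_ip="0.0.0.0"):
    """Resolve dest_ip with SendARP so a cache entry exists before a UDP burst.
    Windows only. Returns the MAC address as text or raises OSError."""
    iphlpapi = ctypes.windll.iphlpapi            # not available on other systems
    mac = (ctypes.c_ubyte * 6)()
    mac_len = ctypes.c_ulong(ctypes.sizeof(mac))
    rc = iphlpapi.SendARP(ctypes.c_ulong(ipv4_to_ipaddr(dest_ip)),
                          ctypes.c_ulong(ipv4_to_ipaddr(src_ip)),
                          ctypes.byref(mac), ctypes.byref(mac_len))
    if rc != 0:                                  # 0 is NO_ERROR
        raise OSError(rc, "SendARP failed for %s" % dest_ip)
    return format_mac(bytes(mac), mac_len.value)
```

An application would call prime_arp_cache("10.0.0.32") once, and only then start its back-to-back sendto() calls to that address.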

Internet Protocol (IP)

IP is the mailroom of the TCP/IP stack, where packet sorting and delivery take place. At this layer, each incoming or outgoing packet is referred to as a datagram. Each IP datagram bears the source IP address of the sender and the destination IP address of the intended recipient. Unlike MAC addresses, the IP addresses in a datagram remain the same throughout a packet’s journey across an internetwork, unless you are using source routing. IP layer functions are described below.

Routing

Routing is a primary function of IP. Datagrams are handed to IP from UDP and TCP above, and from the network interface card(s) below. Each datagram is labeled with a source and destination IP address. IP examines the destination address on each datagram, compares it to a locally maintained route table, and decides what action to take. There are three possibilities for each datagram:

It can be passed up to a protocol layer above IP on the local host.
It can be forwarded using one of the locally attached network interface cards.
It can be discarded.

The route table maintains four different types of routes:

Host route (a route to a single, specific destination IP address)
Subnet route (a route to a subnet)
Network route (a route to an entire network)
Default route (used when there is no other match)

To determine the single route to use to forward an IP datagram, IP uses the following process:

For each route in the route table, IP performs a bit-wise logical AND between the destination IP address and the netmask. IP compares the result with the network destination for a match. If they match, IP marks the route as one that matches the destination IP address.
From the list of matching routes, IP determines the route that has the most bits in the netmask. This is the route that matches the most bits to the destination IP address and is therefore the most specific route for the IP datagram. This is known as finding the longest or closest matching route.
If multiple closest matching routes are found, IP uses the route with the lowest metric. If multiple closest matching routes with the lowest metric are found, IP can choose to use any of those routes. For Windows Server 2003, IP uses the route corresponding to the adapter that is the highest in the binding order. You can view and modify the binding order from the Adapters and Bindings tab in the Advanced Settings dialog box for the Network Connections folder.

You can use the route print command to view the route table from the command prompt, as shown here:

C:\>route print
IPv4 Route Table
===========================================================================
Interface List
0x1 ........................... MS TCP Loopback interface
0x10002 ...00 53 45 00 00 00 ...... WAN (PPP/SLIP) Interface
0x10003 ...00 04 5a 56 10 06 ...... Linksys LNE100TX Fast Ethernet 
                                    Adapter(LNE100TX v4)
===========================================================================
===========================================================================
Active Routes:
Network Destination        Netmask          Gateway       Interface  Metric
          0.0.0.0          0.0.0.0     157.60.136.1   157.60.137.88     20
     157.60.136.0    255.255.252.0    157.60.137.88   157.60.137.88     20
    157.60.137.88  255.255.255.255        127.0.0.1       127.0.0.1     20
   157.60.255.255  255.255.255.255    157.60.137.88   157.60.137.88     20
        127.0.0.0        255.0.0.0        127.0.0.1       127.0.0.1      1
        224.0.0.0        240.0.0.0    157.60.137.88   157.60.137.88     20
  255.255.255.255  255.255.255.255    157.60.137.88   157.60.137.88      1
Default Gateway:      157.60.136.1
===========================================================================
Persistent Routes:
None

If the IPv6 protocol is installed, the display of the route print command also lists IPv6 routes.

The route table above is for a computer with the IP address of 157.60.137.88, the subnet mask of 255.255.252.0, and the default gateway of 157.60.136.1. It contains the following entries:

The first entry, to destination 0.0.0.0, is the default route.
The second entry is for the subnet 157.60.136.0, on which this computer resides.
The third entry, to destination 157.60.137.88, is a host route for the local host. It specifies the loopback address, which makes sense because a datagram bound for the local host should be looped back internally.
The fourth entry is for the all-subnets-directed broadcast address corresponding to the original Class B network ID 157.60.0.0.
The fifth entry is for the loopback network, 127.0.0.0.
The sixth entry is for IP multicasting, which is discussed later in this article.
The final entry is for the limited broadcast (all ones) address.

The Default Gateway is the currently active default gateway. This is useful to know when multiple default gateways are configured.

On this host, if a packet is sent to 157.60.138.49, the closest matching route is the local subnet route (157.60.136.0 with the mask of 255.255.252.0). The packet is sent via the local interface that is assigned the IP address 157.60.137.88. If a packet is sent to 10.200.1.1, the closest matching route is the default route. In this case, the packet is forwarded to the default gateway at 157.60.136.1.
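
The route selection process described earlier in this section can be checked against the sample table with a short Python sketch. The tuple layout and the function names are assumptions made for the example, and ties that remain after comparing netmask length and metric simply fall back to list order here, which only stands in for the binding-order rule.

```python
import ipaddress

# (network destination, netmask, gateway, interface, metric) from the sample table
ROUTES = [
    ("0.0.0.0", "0.0.0.0", "157.60.136.1", "157.60.137.88", 20),
    ("157.60.136.0", "255.255.252.0", "157.60.137.88", "157.60.137.88", 20),
    ("157.60.137.88", "255.255.255.255", "127.0.0.1", "127.0.0.1", 20),
    ("157.60.255.255", "255.255.255.255", "157.60.137.88", "157.60.137.88", 20),
    ("127.0.0.0", "255.0.0.0", "127.0.0.1", "127.0.0.1", 1),
    ("224.0.0.0", "240.0.0.0", "157.60.137.88", "157.60.137.88", 20),
    ("255.255.255.255", "255.255.255.255", "157.60.137.88", "157.60.137.88", 1),
]

def to_int(addr):
    return int(ipaddress.IPv4Address(addr))

def mask_bits(netmask):
    return bin(to_int(netmask)).count("1")

def select_route(dest, routes=ROUTES):
    """Return the route tuple chosen for dest, or None if nothing matches."""
    d = to_int(dest)
    # Step 1: (destination AND netmask) must equal the network destination.
    matches = [r for r in routes if (d & to_int(r[1])) == to_int(r[0])]
    if not matches:
        return None
    # Step 2: prefer the longest netmask. Step 3: then the lowest metric.
    return min(matches, key=lambda r: (-mask_bits(r[1]), r[4]))

print(select_route("157.60.138.49"))  # the 157.60.136.0 subnet route
print(select_route("10.200.1.1"))     # the default route via 157.60.136.1
```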

The route table is maintained automatically in most cases. When a host initializes, entries for the local network(s), loopback, multicast, and configured default gateway are added. More routes may appear in the table as the IP layer learns of them. For instance, the default gateway for a host may advise it of a better route to a specific address using ICMP redirect, which is explained later in this article. Routes also may be added manually using the route command, or by a routing protocol. The -p (persistent) switch can be used with the route command to specify permanent routes. Persistent routes are stored in the registry under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\PersistentRoutes.

Windows Server 2003 TCP/IP introduces a new Automatic metric configuration option for interface-based and default gateway routing metrics. If selected for the interface, automatic metric configuration determines the metric for the routes associated with the interface configuration, such as subnet routes and host routes, based on the speed (bit rate) of the interface. The higher the speed, the lower the metric. For example, routes associated with 10 Mbps Ethernet interfaces have a metric of 30 and routes associated with 100 Mbps Ethernet interfaces have a metric of 20. If selected for the default gateway, automatic metric configuration determines the metric for the default route assigned to the interface, which is also based on the speed of the interface. Automatic metric configuration for both interface metrics and default routes is enabled by default and can be modified from the advanced configuration properties of the TCP/IP protocol for a connection in Network Connections. For more information, see article 299540 in the Microsoft Knowledge Base (https://go.microsoft.com/fwlink/?linkid=67955).

DHCP servers can also provide a base metric and a list of default gateways. If a DHCP server provides a base of 100, and a list of three default gateways, the gateways will be configured with metrics of 100, 101, and 102 respectively. A DHCP-provided base metric does not apply to statically configured default gateways.

By default, Windows-based systems do not behave as routers and do not forward IP datagrams between interfaces. However, the Routing and Remote Access service is included in Windows Server 2003 and can be enabled and configured to provide dynamic IP routing services using RIP and OSPF. Windows XP Professional includes support for silent RIP.

When running multiple logical subnets on the same physical network, the following command can be used to have IP treat all subnets as local (all destinations are on the local link):

route add 0.0.0.0 MASK 0.0.0.0 <my local ip address>

Thus, packets destined for non-local subnets are transmitted directly onto the local media instead of being sent to a router. In essence, the local interface card can be designated as the default gateway. This can be useful where several class C networks are used on one physical network with no router to the outside world, or in a proxy-ARP environment.

Duplicate IP Address Detection

Duplicate address detection is an important feature. When the stack is first initialized or when a new IP address is added, gratuitous ARP requests are broadcast for the IP addresses of the local host. The number of ARP requests to send is controlled by the ArpRetryCount registry parameter described in Appendix A, which defaults to 3. If another host replies to any of these ARP requests, the IP address is already in use. When this happens for an interface with a single manually-configured address, the Windows-based computer still boots; however, the interface containing the offending address is disabled, a system log entry is generated, and an error message is displayed. If the host that is defending the address is also a Windows-based computer, a system log entry is generated, and an error message is displayed on that computer. In order to update the ARP caches on other computers, the offending computer re-broadcasts another ARP request, spoofing the MAC address of the defending computer, to restore the proper values in the ARP caches of the other computers.

A computer using a duplicate IP address can be started when it is not attached to the network, in which case no conflict would be detected. However, if it is then plugged into the network, the first time that it sends an ARP request for another IP address, any computer running a version of Windows with the Windows NT codebase with a conflicting address detects the conflict. The computer detecting the conflict displays an error message and logs a detailed event in the system log. A sample event log entry is shown below:

The system detected an address conflict for IP address 10.199.40.123 with the system having network hardware address 00:DD:01:0F:7A:B5. Network operations on this system may be disrupted as a result.

DHCP-enabled clients inform the DHCP server with a DHCPDecline message when an IP address conflict is detected and, instead of disabling the TCP/IP protocol, they request a new address from the DHCP server and request that the server mark the conflicting address as bad.

Multihoming

When a computer is configured with more than one IP address, it is referred to as a multihomed system. Multihoming is supported in three different ways:

Multiple IP addresses per network interface card
- To add addresses for an interface, open the properties of the Internet Protocol (TCP/IP) protocol in Network Connections, and then click Advanced. In the Advanced Settings dialog box, click Add on the IP Settings tab to add IP addresses.
- NetBIOS over TCP/IP (NetBT) binds to only one IP address per interface card. When a NetBIOS name registration is sent out, only one IP address is registered per interface. This registration occurs over the IP address that is listed first on the IP Settings tab.
Multiple network interface cards per physical network. There are no restrictions, other than hardware limitations on the number of network interface cards.
Multiple networks and media types. There are no restrictions, other than hardware and media support. See the “The NDIS Interface and Below” section for supported media types.

When an IP datagram is sent from a multihomed host, it is passed to the interface with the best apparent route to the destination. Accordingly, the datagram may contain the source IP address of one interface in the multihomed host, yet be placed on the media by a different interface. The source MAC address on the frame is that of the interface that actually transmitted the frame to the media, and the source IP address is the one that the sending application sourced it from, not necessarily one of the IP addresses assigned to the sending interface.

When a computer is multihomed with network interface cards attached to disjoint networks (networks that are separate from and unaware of each other, such as a private network using private addressing and the Internet), routing problems may arise. It is often necessary to set up static routes to the private networks in this situation.

When configuring a computer to be multihomed on two disjoint networks, the best practice is to configure the default gateway on the interface connected to the largest and least-known network, in which the default route summarizes the most destinations. Then, either add static routes or use a routing protocol to provide connectivity to the hosts on the smaller or better-known network. Avoid configuring a different default gateway on each side; this can result in unpredictable behavior and loss of connectivity. For more information, see Default Gateway Behavior for Windows TCP/IP (https://go.microsoft.com/fwlink/?linkid=47705).

Note

There can only be one active default gateway for a computer at any moment in time.

More details on name registration, resolution, and choice of network interface card on outbound datagrams with multihomed computers are provided in the “Transmission Control Protocol (TCP),” “NetBIOS over TCP/IP,” and “Windows Sockets” sections of this article.

Classless Inter-Domain Routing (CIDR)

CIDR, described in RFCs 1518 and 1519, removes the concept of address classes from the IP address assignment and management process. In place of predefined, well-known boundaries, CIDR allocates addresses defined by a network prefix, which makes more efficient use of available space. The network prefix defines the portion of the address that is fixed. For example, an assignment from an ISP to a corporate client might be expressed as 157.60.1.128/25. In this prefix, the first 25 bits are fixed and the last 7 bits can be used for address assignment. This would result in a 128-address block for local use, with the upper 25 bits being the network identifier part of the address. A legacy, classful prefix would be expressed as w.0.0.0/8, w.x.0.0/16, or w.x.y.0/24. As these blocks are reclaimed, they will be reallocated using classless CIDR techniques.
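
The arithmetic for the 157.60.1.128/25 example can be confirmed with Python's standard ipaddress module. This is only a quick check of the numbers quoted above; the variable name is arbitrary.

```python
import ipaddress

block = ipaddress.ip_network("157.60.1.128/25")

print(block.netmask)          # 255.255.255.128 (25 one-bits)
print(block.num_addresses)    # 128, that is 2 ** (32 - 25)
print(block[0], block[-1])    # 157.60.1.128 157.60.1.255
```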

Given the installed base of classful systems, the initial implementation of CIDR was to summarize portions of the Class C address space. This process was called supernetting. Supernetting can be used to consolidate several network IDs into one prefix. For example, the network IDs 131.107.4.0, 131.107.5.0, 131.107.6.0, and 131.107.7.0 can be summarized with the network ID 131.107.4.0 with a subnet mask of 255.255.252.0 (131.107.4.0/22). In binary:

NET     131.107.4.0   (1000 0011.0110 1011.0000 0100.0000 0000)
NET     131.107.5.0   (1000 0011.0110 1011.0000 0101.0000 0000)
NET     131.107.6.0   (1000 0011.0110 1011.0000 0110.0000 0000)
NET     131.107.7.0   (1000 0011.0110 1011.0000 0111.0000 0000)
MASK  255.255.252.0   (1111 1111.1111 1111.1111 1100.0000 0000)

All four of the network IDs share the same high-order 22 bits: the first two octets and the first six bits of the third octet.

When routing decisions are made, only the bits covered by the subnet mask are used, thus making all these addresses appear to be part of the same network for routing purposes. Any routers in use must also support CIDR and may require special configuration.
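
The supernetting example can be reproduced with the standard ipaddress module. The sketch assumes that each of the four network IDs is a 24-bit prefix, which the text implies but does not state.

```python
import ipaddress

network_ids = ["131.107.4.0", "131.107.5.0", "131.107.6.0", "131.107.7.0"]
networks = [ipaddress.ip_network(n + "/24") for n in network_ids]  # /24 is assumed

# collapse_addresses merges adjacent prefixes into the shortest covering set.
summary = list(ipaddress.collapse_addresses(networks))
print(summary)     # [IPv4Network('131.107.4.0/22')]

# The leading 22 bits are identical for all four network IDs.
prefixes = {format(int(n.network_address), "032b")[:22] for n in networks}
print(prefixes)    # {'1000001101101011000001'}
```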

Windows Server 2003 TCP/IP includes support for the all-0's and all-1's subnets as described in RFC 1878.

IP Multicasting

IP multicasting is used to provide efficient multicast services to clients that may not be located on the same network segment. Windows Sockets applications can join a multicast group to participate in a wide-area conference, for instance.

Windows Server 2003 TCP/IP is Level 2 (send and receive) compliant with RFC 1112. IGMP, which is described later in this article, is the protocol used to track multicast membership on a subnet.
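
For illustration, a sockets application typically joins a multicast group by setting the IP_ADD_MEMBERSHIP option, as in the Python sketch below. The group address 239.1.2.3 and port 5007 mentioned afterward are hypothetical values chosen for the example, and the helper names are not part of any Windows API.

```python
import socket

def build_membership_request(group, interface_ip="0.0.0.0"):
    """Return the 8-byte ip_mreq structure: group address, then local interface.
    An interface of 0.0.0.0 lets the stack choose the default interface."""
    return socket.inet_aton(group) + socket.inet_aton(interface_ip)

def join_group(group, port, interface_ip="0.0.0.0"):
    """Open a UDP socket bound to port and join the multicast group on it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_membership_request(group, interface_ip))
    return sock
```

A receiver would call join_group("239.1.2.3", 5007) and then read datagrams with recvfrom(). Joining the group is what prompts the host to report its membership on the subnet.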

IP over ATM

Windows Server 2003 includes support for IP over ATM. RFC 1577 (and successors) define the basic operation of an IP over ATM network known as Classical IP over ATM (CLIP), which defines a Logical IP Subnet (LIS). A LIS is a set of IP hosts that can communicate directly with each other. Two hosts belonging to different LISs can communicate only through an IP router that is a member of both subnets. Windows Server 2003 also includes support for ATM LAN Emulation (LANE), which supports broadcasting.

ATM Address Resolution

Because an ATM network is non-broadcast, ARP broadcasts (as used by Ethernet or Token Ring) are not a suitable solution. Instead, a dedicated ARP server is used to provide IP-to-ATM address resolution.

One of the stations in a LIS is designated as an ARP server, and the ARP server software is loaded on it. Stations that use the services of the ARP server are referred to as ARP clients. All IP stations within a LIS are ARP clients. Each ARP client is configured with the ATM address of the ARP server. When an ARP client starts up, it makes an ATM connection to the ARP server, and sends a packet to the server that contains the client’s IP and ATM addresses. The ARP server builds a table of IP-address-to-ATM-address mappings. When a client has an IP packet to be sent to another client (whose IP address is known but whose ATM address is unknown), it first queries the ARP server for the ATM address of the desired client. When it receives a reply that contains the desired ATM address, the client establishes a direct ATM connection to the target client and sends IP packets for that client on this connection.

The clients close any ATM connection, including the connection to the server, if the connections are inactive. All clients refresh their IP and ATM address information with the server periodically (the default is 15 minutes). The server purges an entry that is not refreshed after 20 minutes (by default). The ATM ARP client and ARP server both support a number of adjustable registry parameters, which are listed in Appendix A.

Internet Control Message Protocol (ICMP)

ICMP is a maintenance protocol specified in RFC 792 and is normally considered part of the Internet layer. ICMP messages are encapsulated within IP datagrams, so that they can be routed throughout an internetwork. Windows Server 2003 TCP/IP uses ICMP to:

Report delivery problems encountered by routers or destination hosts.
Build and maintain route tables.
Perform router discovery.
Assist in Path Maximum Transmission Unit (PMTU) discovery.
Diagnose reachability problems (the ping, tracert, and pathping tools).
Adjust flow control to prevent link or router saturation.

ICMP Router Discovery

Windows Server 2003 TCP/IP can perform router discovery as specified in RFC 1256. Router discovery provides an improved method of configuring and detecting default gateways. Instead of using manually- or DHCP-configured default gateways, hosts can dynamically discover routers on their subnet. If the primary router fails or the network administrators change router preferences, hosts can automatically switch to a backup router.

When a host that supports router discovery initializes, it joins the all-systems IP multicast group (224.0.0.1), and then listens for the router advertisements that routers send to that group. Hosts can also send router-solicitation messages to the all-routers IP multicast address (224.0.0.2) when an interface initializes to avoid any delay in being configured. Windows Server 2003 TCP/IP sends a maximum of three solicitations at intervals of approximately 600 milliseconds.

The use of router discovery is controlled by the PerformRouterDiscovery and SolicitationAddressBCast registry parameters, and it defaults to DHCP controlled in Windows Server 2003. Setting SolicitationAddressBCast to 1 causes router solicitations to be broadcast, instead of multicast, as described in RFC 1256.

Maintaining Route Tables

When a Windows-based computer is initialized, the route table normally contains only a few entries. One of those entries specifies a default gateway. Datagrams that have a destination IP address with no better match in the route table are sent to the default gateway. However, because routers share information about network topology, the default gateway may know a better route to a given address. When this is the case, then upon receiving a datagram that could take the better path, the router forwards the datagram normally. It then advises the sender of the better route, using an ICMP Redirect message. These messages typically specify redirection for a specific destination address. When a Windows-based computer receives an ICMP redirect, a validity check is performed to be sure that it came from the first-hop gateway in the current route, and that the gateway is on a directly connected network. If so, a host route with a 10-minute lifetime is added to the route table for that destination IP address. If the ICMP redirect did not come from the first-hop gateway in the current route, or if that gateway is not on a directly connected network, the ICMP redirect is ignored.
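
The redirect handling described above can be sketched as follows. This is a simplified model with invented names: host_routes is a plain dictionary, times are in seconds, and the two checks follow the wording of this section rather than the actual driver code.

```python
import ipaddress

REDIRECT_ROUTE_LIFETIME = 10 * 60    # seconds a redirect-learned host route lives

def handle_redirect(host_routes, current_gateway, connected_networks,
                    redirect_from, dest, new_gateway, now):
    """Validate an ICMP Redirect and, if it passes, add a temporary host route.
    host_routes maps a destination IP to (gateway, expiry time).
    Returns True if the redirect was accepted, False if it was ignored."""
    # Check 1: the redirect came from the first-hop gateway currently in use.
    from_first_hop = redirect_from == current_gateway
    # Check 2: that gateway is on a directly connected network.
    sender = ipaddress.ip_address(redirect_from)
    on_link = any(sender in ipaddress.ip_network(n) for n in connected_networks)
    if not (from_first_hop and on_link):
        return False
    host_routes[dest] = (new_gateway, now + REDIRECT_ROUTE_LIFETIME)
    return True
```

For a host on 157.60.136.0/22 whose default gateway is 157.60.136.1, a redirect from that gateway naming another on-link router is accepted and yields a host route that expires ten minutes later. A redirect from any other sender is ignored.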

In Windows Server 2003 Service Pack 1, the new MaxICMPHostRoutes registry value defines the maximum number of host routes that can be added through the receipt of ICMP Redirect messages. For more information, see Appendix A.

Path Maximum Transmission Unit (PMTU) Discovery

TCP employs Path Maximum Transmission Unit (PMTU) discovery, as described in the “Transmission Control Protocol (TCP)” section of this article. The mechanism relies on ICMP Destination Unreachable messages.

Use of ICMP to Diagnose Problems

The ping command-line utility is used to send ICMP echo requests to an IP address and wait for ICMP echo responses. Ping reports on the number of responses received and the time interval between sending the request and receiving the response. There are many different options that can be used with the ping utility.
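
To show what an echo request contains, the sketch below builds the ICMP message that a ping-style tool places inside an IP datagram (type 8, code 0, followed by a checksum, an identifier, and a sequence number). The function names are chosen for the example, and actually sending the message would require a raw socket, which is not shown.

```python
import struct

def internet_checksum(data):
    """16-bit one's-complement sum used by ICMP (and by the IP header)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(identifier, sequence, payload=b""):
    """Return the bytes of an ICMP Echo Request message."""
    # The checksum is computed with the checksum field set to zero.
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload
```

A receiver validates the message by running the same checksum over the whole message; a result of zero means the message arrived intact.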

Tracert is a route-tracing utility. It works by sending ICMP echo requests to an IP address while incrementing the Time to Live (TTL) field in the IP header, starting at 1, and analyzing the ICMP errors that are returned. Each succeeding echo request should get one hop further into the network before the TTL field reaches 0 and the router attempting to forward it returns an ICMP Time Exceeded-TTL Exceeded in Transit error message. Tracert prints out an ordered list of the routers in the path that returned these error messages, including the name and the IP address of the near-side interface of each router. If the -d (do not do a DNS reverse query on each IP address) switch is used, only the IP address is reported. The example below illustrates using tracert to find the route from a computer dialed in over Point-to-Point Protocol (PPP) to an Internet service provider in Seattle to www.whitehouse.gov.

C:\>tracert www.whitehouse.gov
Tracing route to www.whitehouse.gov [128.102.252.1]
over a maximum of 30 hops:
1  300 ms  281 ms  280 ms roto.seanet.com [199.181.164.100]
2  300 ms  301 ms  310 ms sl-stk-1-S12-T1.sprintlink.net [144.228.192.65]
3  300 ms  311 ms  320 ms sl-stk-5-F0/0.sprintlink.net [144.228.40.5]
4  380 ms  311 ms  340 ms icm-fix-w-H2/0-T3.icp.net [144.228.10.22]
5  310 ms  301 ms  320 ms arc-nas-gw.arc.nasa.gov [192.203.230.3]
6  300 ms  321 ms  320 ms n254-ed-cisco7010.arc.nasa.gov [128.102.64.254]
7  360 ms  361 ms  371 ms www.whitehouse.gov [128.102.252.1]
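
The TTL mechanism behind this output can be modeled without touching the network. In the sketch below, the path is simply the list of addresses from the sample trace, and probe() simulates what comes back for an echo request sent with a given TTL. Both functions are illustrative stand-ins, not part of tracert.

```python
# Near-side router addresses from the sample trace, ending with the destination.
PATH = ["199.181.164.100", "144.228.192.65", "144.228.40.5", "144.228.10.22",
        "192.203.230.3", "128.102.64.254", "128.102.252.1"]

def probe(path, ttl):
    """Simulate one echo request sent with the given TTL along path."""
    if ttl < len(path):
        # The TTL reaches 0 at router number `ttl`, which reports Time Exceeded.
        return ("time-exceeded", path[ttl - 1])
    return ("echo-reply", path[-1])          # the probe reached the destination

def trace(path, max_hops=30):
    """Raise the TTL one hop at a time until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        kind, responder = probe(path, ttl)
        hops.append((ttl, responder))
        if kind == "echo-reply":
            break
    return hops

for hop, address in trace(PATH):
    print(hop, address)
```

The loop stops at hop 7, when the probe finally carries enough TTL to reach 128.102.252.1 and an echo reply comes back instead of an error.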

Pathping is a command-line utility that combines the functionality of ping and tracert as well as introducing some new features. Along with the tracing functionality of tracert, pathping will ping each hop along the route multiple times and display delay and packet loss information per hop, which can help you determine if there is a high-loss link or router in the path.

Internet Protocol Security (IPsec)

Windows Server 2003 includes support for Internet Protocol security (IPsec). IPsec features and implementation details are very complex and are described in detail in a series of RFCs, Internet drafts, and in other Microsoft technical papers. For more information, see the Windows Server 2003 IPsec Web page (https://go.microsoft.com/fwlink/?linkid=63067).

IPsec uses cryptography-based security to provide data integrity, data origin authentication, replay protection, data confidentiality (encryption), and limited traffic-flow confidentiality. Because IPsec is provided at the IP layer, its services are available to the upper-layer protocols in the stack and, transparently, to existing applications.

IPsec enables a system to select security protocols, decide which algorithm(s) to use for the service(s), and establish and maintain cryptographic keys for each security relationship. IPsec can protect paths between hosts, between security gateways, or between hosts and security gateways. The services available and required for traffic are configured using IPsec policy. IPsec policy may be configured locally on a computer or can be assigned through Group Policy mechanisms using the Active Directory directory service. When using Active Directory, hosts detect policy assignment at startup, retrieve the policy, and then periodically check for policy updates. The IPsec policy specifies how computers trust each other. IPsec can use certificates, Kerberos, or preshared keys as authentication methods. The easiest trust to use is the Windows Server 2003 domain trust based on Kerberos.

Each IP datagram processed at the IP layer is compared to a set of filters that are provided by the security policy, which is maintained by an administrator for a computer that belongs to a domain. IP can do one of three things with any datagram:

Apply IPsec protections to it.
Allow it to pass unmodified.
Discard it.
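
The three outcomes listed above can be pictured as a filter lookup. The sketch below is deliberately simplified and makes several assumptions that are not taken from this article: the Filter tuple layout, the action names, taking the first matching filter in list order, and permitting traffic that matches no filter.

```python
import ipaddress
from collections import namedtuple

# protocol is an IP protocol number, or None to match any protocol.
Filter = namedtuple("Filter", "src dst protocol action")

def action_for(filters, src, dst, protocol, default="permit"):
    """Return "secure", "permit", or "block" for one datagram."""
    for f in filters:
        if (ipaddress.ip_address(src) in ipaddress.ip_network(f.src)
                and ipaddress.ip_address(dst) in ipaddress.ip_network(f.dst)
                and f.protocol in (None, protocol)):
            return f.action              # first match wins in this sketch
    return default                       # assumed behavior when nothing matches

filters = [
    Filter("192.168.10.47/32", "10.197.14.91/32", None, "secure"),
    Filter("0.0.0.0/0", "10.197.14.0/24", 6, "block"),   # hypothetical rule
]

print(action_for(filters, "192.168.10.47", "10.197.14.91", 6))   # secure
print(action_for(filters, "192.168.10.50", "10.197.14.20", 6))   # block
print(action_for(filters, "192.168.10.50", "10.197.14.20", 17))  # permit
```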

An IPsec policy contains one or more rules, each of which contains a filter, a filter action, authentication methods, a tunnel setting, and a connection type. For example, two stand-alone computers can be configured to use IPsec between them and activate the secure server policy. If the two computers are not members of the same or a trusted domain, trust must be configured using a certificate or preshared key in a secure server mode by:

Setting up a filter that specifies all traffic between the two hosts
Choosing an authentication method
Selecting a negotiation policy (secure server in this case, indicating that all traffic matching the filter(s) must use IPsec)
Specifying a connection type (LAN, dial-up, or all)

Once the policy has been put in place, traffic that matches the filters uses the services provided by IPsec. When a host directs IP traffic to another host (including something as simple as ping traffic), a Security Association (SA) is established through an Internet Key Exchange (IKE) negotiation using UDP port 500, and then the traffic begins to flow. The following network trace illustrates setting up a TCP connection between two such IPsec-enabled hosts. The only parts of the IP datagram that are unencrypted and visible to Network Monitor after the SA is established are the MAC and IP headers:

Source IP      Dest IP       Prot  Description
192.168.10.47  10.197.14.91  UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 216 (0xD8)
10.197.14.91   192.168.10.47 UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 216 (0xD8)
192.168.10.47  10.197.14.91  UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 128 (0x80)
10.197.14.91   192.168.10.47 UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 128 (0x80)
192.168.10.47  10.197.14.91  UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 76 (0x4C)
10.197.14.91   192.168.10.47 UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 76 (0x4C)
192.168.10.47  10.197.14.91  UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 212 (0xD4)
10.197.14.91   192.168.10.47 UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 172 (0xAC)
192.168.10.47  10.197.14.91  UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 84 (0x54)
10.197.14.91   192.168.10.47 UDP   Src Port: ISAKMP, (500); Dst Port: ISAKMP (500); Length = 92 (0x5C)
192.168.10.47  10.197.14.91  IP    ID = 0xC906; Proto = 0x32; Len: 96
10.197.14.91   192.168.10.47 IP    ID = 0xA202; Proto = 0x32; Len: 96
192.168.10.47  10.197.14.91  IP    ID = 0xCA06; Proto = 0x32; Len: 88

Opening one of the IP datagrams sent after the SA is established reveals very little of what is actually in the datagram (a TCP SYN, or connection request). The only clear parts of the packet are the Ethernet and IP headers. Even the TCP header is encrypted and cannot be parsed by Network Monitor if ESP is used.

Src IP         Dest IP        Protoc  Description
===================================================
192.168.10.47  10.197.14.91   IP      ID = 0xC906; Proto = 0x32; Len: 96
+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP:  DOD Internet Protocol
IP: ID = 0xC906; Proto = 0x32; Len: 96
IP: Version = 4 (0x4)
IP: Header Length = 20 (0x14)
IP: Precedence = Routine
IP: Type of Service = Normal Service
IP: Total Length = 96 (0x60)
IP: Identification = 51462 (0xC906)
+ IP: Flags Summary = 2 (0x2)
IP: Fragment Offset = 0 (0x0) bytes
IP: Time to Live = 128 (0x80)
IP: Protocol = 0x32
IP: Checksum = 0xD55A
IP: Source Address = 192.168.10.47
IP: Destination Address = 10.197.14.91
IP: Data: Number of data bytes remaining = 76 (0x004C)
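
The fields that remain readable in the capture above can be decoded with a few lines of code. The following is a minimal sketch, not part of the original article: it rebuilds the 20-byte IP header from the values shown and reports the protocol and the size of the opaque payload. The function name parse_ipv4_header and the dictionary keys are illustrative. IP protocol number 50 (0x32) is the value assigned to ESP, and 51 is the value assigned to AH.

```python
import socket
import struct

# IP protocol numbers assigned to the two IPsec protocols.
IPSEC_PROTOCOLS = {50: "ESP", 51: "AH"}


def parse_ipv4_header(header: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header."""
    (ver_ihl, _tos, total_len, ident, flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", header[:20])
    header_len = (ver_ihl & 0x0F) * 4
    return {
        "version": ver_ihl >> 4,
        "header_len": header_len,
        "total_len": total_len,
        "ident": ident,
        "dont_fragment": bool(flags_frag & 0x4000),
        "ttl": ttl,
        "protocol": proto,
        "protocol_name": IPSEC_PROTOCOLS.get(proto, "other"),
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "payload_len": total_len - header_len,
    }


# Header rebuilt from the field values listed in the capture.
sample = struct.pack(
    "!BBHHHBBH4s4s",
    0x45, 0x00, 96, 0xC906, 0x4000, 128, 0x32, 0xD55A,
    socket.inet_aton("192.168.10.47"), socket.inet_aton("10.197.14.91"),
)
info = parse_ipv4_header(sample)
print(info["protocol_name"], info["payload_len"])  # ESP 76
```

Everything after the 20-byte header (76 bytes here) belongs to the IPsec payload, which is why a parser cannot go further when ESP encryption is in use.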

Using a secure server policy also restricts all other types of traffic from reaching destinations that do not understand IPsec or are not part of the same trusted group. A secure initiator policy provides settings that apply best to servers; traffic security is attempted, but if the client does not understand IPsec, the negotiation falls back to sending clear text packets.

When IPsec is used to encrypt data, network performance generally drops, due to the processing overhead of encryption. One possible method for reducing the impact of this overhead is to offload the processing to a hardware device. Because NDIS 5.1 supports task offloading, it is feasible to include encryption hardware on network adapters. Network adapters supporting IPsec hardware offload are available from several vendors.

IPsec promises to be popular for protecting both public network traffic and internal corporate or government traffic that requires confidentiality. One common deployment is to apply secure server IPsec policies only to specific servers that are used to store and/or serve confidential information.

Internet Group Management Protocol (IGMP)

Windows Server 2003 TCP/IP provides level 2 (full) support for IP multicasting and IGMP versions 1 through 3, as described in RFCs 1112, 2236, and 3376. The introduction to RFC 1112 provides a good overall summary of IP multicasting. The text reads:

“IP multicasting is the transmission of an IP datagram to a host group—a set of zero or more hosts identified by a single IP destination address. A multicast datagram is delivered to all members of its destination host group with the same ‘best-effort’ reliability as regular unicast IP datagrams; that is, the datagram is not guaranteed to arrive intact to all members of the destination group or in the same order relative to other datagrams.

“The membership of a host group is dynamic; that is, hosts may join and leave groups at any time. There is no restriction on the location or number of members in a host group. A host may be a member of more than one group at a time. A host need not be a member of a group to send datagrams to it.

“A host group may be permanent or transient. A permanent group has a well-known, administratively assigned IP address. It is the address—not the membership of the group—that is permanent; at any time a permanent group may have any number of members, even zero. Those IP multicast addresses that are not reserved for permanent groups are available for dynamic assignment to transient groups that exist only as long as they have members.

“Internetwork forwarding of IP multicast datagrams is handled by multicast routers that may be co-resident with, or separate from, Internet gateways. A host transmits an IP multicast datagram as a local network multicast that reaches all immediately-neighboring members of the destination host group. If the datagram has an IP time-to-live greater than 1, the multicast router(s) attached to the local network take responsibility for forwarding it towards all other networks that have members of the destination group. On those other member networks that are reachable within the IP time-to-live, an attached multicast router completes delivery by transmitting the datagram as a local multicast.”

IP/ARP Extensions for IP Multicasting

To support IP multicasting, an additional route is defined on the host. The route (added by default) specifies that if a datagram is being sent to a multicast host group, it should be sent to the IP address of the host group through the local network interface card, and not forwarded to the default gateway. The following route (which you can discover using the route print command) illustrates this:

Network Address  Netmask    Gateway Address  Interface     Metric
224.0.0.0        240.0.0.0  157.60.137.88    157.60.137.88     20

IP multicast addresses are easily identified, as they are from the class D range, 224.0.0.0 to 239.255.255.255 (224.0.0.0/4). These IP addresses all have 1110 as their high-order bits.

To send a packet to a host group, using the local interface, the IP address must be resolved to a media access control address. As stated in RFC 1112:

“An IP host group address is mapped to an Ethernet multicast address by placing the low-order 23 bits of the IP address into the low-order 23 bits of the Ethernet multicast address 01-00-5E-00-00-00 (hexadecimal). Because there are 28 significant bits in an IP host group address, more than one host group address may map to the same Ethernet multicast address.”

For example, a datagram addressed to the multicast address 225.0.0.5 would be sent to the (Ethernet) MAC address 01-00-5E-00-00-05. This MAC address is formed by the junction of 01-00-5E and the 23 low-order bits of 225.0.0.5 (00-00-05).
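
The mapping described in RFC 1112 can be expressed as a short calculation. The sketch below is illustrative only (the function name multicast_mac is not taken from any Windows API): it keeps the low-order 23 bits of the group address and places them in the fixed 01-00-5E prefix.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 host group address to its Ethernet multicast address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not in the class D range 224.0.0.0/4")
    low23 = int(addr) & 0x7FFFFF          # low-order 23 bits of the IP address
    mac = 0x01005E000000 | low23          # placed into 01-00-5E-00-00-00
    return "-".join(f"{(mac >> shift) & 0xFF:02X}" for shift in range(40, -8, -8))


print(multicast_mac("225.0.0.5"))    # 01-00-5E-00-00-05
print(multicast_mac("224.128.0.5"))  # 01-00-5E-00-00-05 (a different group, same MAC)
```

The second call shows the overlap that the RFC mentions: because only 23 of the 28 significant bits are carried into the MAC address, 32 different host group addresses share each Ethernet multicast address.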

Because more than one host group address can map to the same Ethernet multicast address, the interface may hand up multicasts for a host group in which no local application has registered an interest. These extra multicasts are discarded by the TCP/IP protocol.

Multicast Extensions to Windows Sockets

Internet Protocol multicasting is currently supported only on AF_INET sockets of type SOCK_DGRAM and SOCK_RAW. By default, IP multicast datagrams are sent with a Time to Live (TTL) of 1. Applications can use the setsockopt() function to specify a TTL. By convention, multicast routers use TTL thresholds to determine how far to forward datagrams. These TTL thresholds are defined as follows:

Multicast datagrams with initial TTL 0 are restricted to the same host.
Multicast datagrams with initial TTL 1 are restricted to the same subnet.
Multicast datagrams with initial TTL 32 are restricted to the same site.
Multicast datagrams with initial TTL 64 are restricted to the same region.
Multicast datagrams with initial TTL 128 are restricted to the same continent.
Multicast datagrams with initial TTL 255 are unrestricted in scope.
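
As an illustration of setting the TTL from an application, the following sketch uses Python's socket module, which wraps the platform sockets API (Windows Sockets on Windows). The helper name make_multicast_sender is hypothetical, and the value 32 is simply the "same site" threshold from the list above.

```python
import socket


def make_multicast_sender(ttl: int) -> socket.socket:
    """Create a UDP socket whose outgoing multicast datagrams use the given TTL."""
    if not 0 <= ttl <= 255:
        raise ValueError("TTL must be between 0 and 255")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # IP_MULTICAST_TTL applies only to multicast destinations; unicast TTL is unaffected.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


sender = make_multicast_sender(32)
print(sender.getsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL))
sender.close()
```

An application that never sets the option keeps the default TTL of 1, so its datagrams stay on the local subnet.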

Use of Multicast and IGMP by Windows Components

Some Windows Server 2003 components use IP multicast traffic. For example, router discovery uses multicasts, by default. WINS servers use multicasting when attempting to locate replication partners. The Windows Server 2003 Routing and Remote Access service supports multicast forwarding and the configuration of interfaces to operate in IGMP router or IGMP proxy mode.

Transmission Control Protocol (TCP)

TCP provides a connection-based, reliable byte-stream service to applications. Microsoft networking relies upon TCP for logon, file and print sharing, replication of information between domain controllers, transfer of browse lists, and other common functions. TCP can be used only for one-to-one communications.

TCP uses a checksum on both the TCP header and payload of each segment to reduce the chance that network corruption will go undetected. NDIS 5.1 provides support for task offloading, and Windows Server 2003 TCP/IP takes advantage of this by allowing the network interface card to perform the TCP checksum calculations if the network interface card driver offers support for this function. Offloading the checksum calculations to hardware can result in performance improvements in very high-throughput environments.

Windows Server 2003 TCP/IP has also been strengthened against a variety of attacks that were published over the past couple of years and has been subject to an internal security review intended to reduce susceptibility to future attacks. For instance, the initial sequence number (ISN) algorithm has been modified so that ISNs increase in random increments, using an RC4-based random number generator initialized with a 2048-bit random key upon system startup.

TCP Receive Window Size Calculation and Window Scaling (RFC 1323)

The TCP receive window size is the amount of receive data (in bytes) that can be buffered at one time on a connection. The sending host can send only that amount of data before waiting for an acknowledgment and window update from the receiving host. The Windows Server 2003 TCP/IP stack was designed to tune itself in most environments and uses larger default window sizes than earlier versions. Instead of using a hard-coded default receive window size, TCP adjusts to even increments of the maximum segment size (MSS) negotiated during connection setup. Matching the receive window to even increments of the MSS increases the percentage of full-sized TCP segments used during bulk data transmission.

The default advertised TCP receive-window size in Windows Server 2003 depends on the following, in order of precedence:

The SO_RCVBUF WinSock option for the connection.
The per-interface TcpWindowSize registry parameter.
The global TcpWindowSize registry parameter.
Autodetermination based on the NDIS-reported bit rate of the media. The stack tunes itself based on the media speed:
- Below 1 Mbps: 8 KB
- 1 Mbps – 100 Mbps: 17 KB
- Greater than 100 Mbps: 64 KB

For 10 and 100 Mbps Ethernet, the window is normally set to 17,520 bytes (17 KB rounded up to twelve 1460-byte segments). There are two methods for setting the receive window size to specific values:

The TcpWindowSize registry parameter (see Appendix A)
The setsockopt() Windows Sockets function (on a per-socket basis)
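
The rounding to whole segments that produces the 17,520-byte figure can be sketched as follows. This is an illustration, not the stack's actual code: the function name is hypothetical, and the rule for values that would exceed the 16-bit window field (round down to the largest whole number of segments that fits) is an assumption made for the example.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value the 16-bit TCP window field can hold


def default_receive_window(base_bytes: int, mss: int) -> int:
    """Round a base window size up to a whole number of MSS-sized segments."""
    segments = -(-base_bytes // mss)      # ceiling division
    window = segments * mss
    if window > MAX_UNSCALED_WINDOW:
        # Assumption: without window scaling, fall back to the largest
        # multiple of the MSS that still fits in 16 bits.
        window = (MAX_UNSCALED_WINDOW // mss) * mss
    return window


print(default_receive_window(17 * 1024, 1460))  # 17520 (twelve segments)
print(default_receive_window(8 * 1024, 1460))   # 8760 (six segments)
```

With an Ethernet MSS of 1460 bytes, the 17 KB default becomes twelve full segments and the 8 KB default becomes six.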

To improve performance on high-bandwidth, high-delay networks, Windows Server 2003 TCP/IP supports scalable windows as described in RFC 1323. This RFC details a method for supporting scalable windows by allowing TCP to negotiate a scaling factor for the window size at connection establishment. This allows for an actual receive window of up to 1 gigabyte (GB). Section 2.2 of RFC 1323 provides a good description:

“The three-byte Window Scale option may be sent in a SYN segment by a TCP. It has two purposes: 1. indicate that the TCP is prepared to do both send and receive window scaling, and 2. communicate a scale factor to be applied to its receive window. Thus, a TCP that is prepared to scale windows should send the option, even if its own scale factor is 1. The scale factor is limited to a power of two and encoded logarithmically, so it may be implemented by binary shift operations.

“This option is an offer, not a promise; both sides must send Window Scale options in their SYN segments to enable window scaling in either direction. If window scaling is enabled, then the TCP that sent this option will right-shift its true receive-window values by 'shift.cnt' bits for transmission in SEG.WND. The value shift.cnt may be zero (offering to scale, while applying a scale factor of 1 to the receive window).

“This option may be sent in an initial <SYN> segment (in other words, a segment with the SYN bit on and the ACK bit off). It may also be sent in a <SYN,ACK> segment, but only if a Window Scale option was received in the initial <SYN> segment. A Window Scale option in a segment without a SYN bit should be ignored.

“The Window field in a SYN (in other words, a <SYN> or <SYN,ACK>) segment itself is never scaled.”

When you read network traces of a connection that was established by two computers that support scalable windows, keep in mind that the window sizes advertised in the TCP headers must be scaled by the negotiated scale factor. The scale factor can be observed in the connection establishment (three-way handshake) packets, as illustrated in the following Network Monitor capture:

***************************************************************
Src Addr    Dst Addr    Protocol  Description
10.0.0.1    10.0.0.9    TCP       ....S., len:0, seq:725163-725163, ack:0, win:65535, src:1217 dst:139
+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
+ IP: ID = 0xB908; Proto = TCP; Len: 64
TCP: ....S., len:0, seq:725163-725163, ack:0, win:65535, src:1217 dst:139 (NBT Session)
TCP: Source Port = 0x04C1
TCP: Destination Port = NETBIOS Session Service
TCP: Sequence Number = 725163 (0xB10AB)
TCP: Acknowledgement Number = 0 (0x0)
TCP: Data Offset = 44 (0x2C)
TCP: Reserved = 0 (0x0000)
+ TCP: Flags = 0x02 : ....S.
TCP: Window = 65535 (0xFFFF)
TCP: Checksum = 0x8565
TCP: Urgent Pointer = 0 (0x0)
TCP: Options
+ TCP: Maximum Segment Size Option
TCP: Option Nop = 1 (0x1)
TCP: Window Scale Option
TCP: Option Type = Window Scale
TCP: Option Length = 3 (0x3)
TCP: Window Scale = 5 (0x5)
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
+ TCP: Timestamps Option
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
+ TCP: SACK Permitted Option
***************************************************************

The computer sending the packet above is offering the Window Scale option, with a scaling factor of 5. If the target computer responds, accepting the Window Scale option in the SYN-ACK, then it is understood that any TCP window advertised by this computer needs to be left-shifted 5 bits from this point onward (the SYN itself is not scaled). For example, if the computer advertised a 32 KB window in its first send of data, this value would need to be left-shifted (shifting in 0's from the right) 5 bits as shown below:

32 KB = 0x7fff =    111 1111 1111 1111
Left-shift 5 bits = 1111 1111 1111 1110 0000 = 0xfffe0 (1,048,544 bytes)

As a check, left-shifting a number 5 bits is equivalent to multiplying it by 2^5, or 32. Performing the calculation in decimal, we get 32767 * 32 = 1,048,544.
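
The same arithmetic can be reproduced when reading traces by hand. The sketch below is illustrative (the function name scaled_window is hypothetical); the upper limit of 14 on the shift count comes from RFC 1323.

```python
MAX_SHIFT = 14  # largest shift count permitted by RFC 1323


def scaled_window(advertised: int, shift_count: int) -> int:
    """Return the true receive window for a 16-bit advertised value."""
    if not 0 <= advertised <= 0xFFFF:
        raise ValueError("advertised window must fit in 16 bits")
    if not 0 <= shift_count <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return advertised << shift_count


print(scaled_window(0x7FFF, 5))    # 1048544
print(scaled_window(0xFFFF, 14))   # 1073725440, just under 1 GB
```

The second call shows where the 1 GB ceiling comes from: the largest advertised value combined with the largest permitted shift.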

The scale factor is not necessarily symmetrical, so it may be different for each direction of data flow.

TCP window scaling is negotiated on-demand in Windows Server 2003, based on the value set for the SO_RCVBUF WinSock option when a connection is initiated. Additionally, the Window Scale option is used by default on a connection if the received SYN segment for a connection initiated by a TCP peer contains the Window Scale option. You can modify this default behavior with the Tcp1323Opts registry parameter (see Appendix A).

Delayed Acknowledgments

As specified in RFC 1122, TCP uses delayed acknowledgments (ACKs) to reduce the number of packets sent on the media. The Windows Server 2003 TCP/IP stack takes a common approach to implementing delayed ACKs. As data is received by TCP on a connection, it only sends an acknowledgment back if one of the following conditions is met:

No ACK was sent for the previous segment received.
A segment is received, but no other segment arrives within 200 milliseconds for that connection.

In summary, normally an ACK is sent for every other TCP segment received on a connection, unless the delayed ACK timer (200 milliseconds) expires. The delayed ACK timer can be adjusted through the TcpDelAckTicks registry parameter.
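
The two conditions can be modeled with a small simulation. This is a simplified sketch rather than the stack's implementation: it takes the arrival times of segments on one connection, in milliseconds and in ascending order, and returns the times at which ACKs would be sent. The function name ack_times is hypothetical.

```python
DELAYED_ACK_MS = 200  # delayed ACK timer described above


def ack_times(arrivals_ms):
    """Return the times (ms) at which ACKs are sent for the given arrivals."""
    acks = []
    pending = None  # arrival time of a segment that has not been acknowledged yet
    for t in arrivals_ms:
        if pending is not None and t - pending >= DELAYED_ACK_MS:
            # The timer expired before this segment arrived.
            acks.append(pending + DELAYED_ACK_MS)
            pending = None
        if pending is None:
            pending = t          # first segment of a pair: start the timer
        else:
            acks.append(t)       # second segment of a pair: ACK immediately
            pending = None
    if pending is not None:
        acks.append(pending + DELAYED_ACK_MS)
    return acks


print(ack_times([0, 50, 100, 500]))  # [50, 300, 700]
```

In the example, the second segment triggers an immediate ACK, while the third and fourth segments each wait out the full 200 milliseconds because no partner segment arrives in time.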

TCP Selective Acknowledgment (RFC 2018)

Windows Server 2003 TCP/IP includes support for an important performance feature known as Selective Acknowledgement (SACK). SACK is especially important for connections using large TCP window sizes. Prior to SACK, a receiver could only acknowledge the latest sequence number of contiguous data that had been received, or the left edge of the receive window. When SACK is enabled, the receiver continues to use the ACK number to acknowledge the left edge of the receive window, but it can also acknowledge other non-contiguous blocks of received data individually. SACK uses two different TCP header options:

The Sack-Permitted option is used to indicate that the sender can receive and interpret the SACK option. This option is sent in TCP segments in which the SYN flag is set.
The SACK option is used to convey extended acknowledgment information from the receiver to the sender over an established TCP connection. Within the SACK option are up to four pairs of TCP sequence numbers acknowledging up to four non-contiguous blocks. Each TCP sequence pair contains the left edge (the sequence number of the first byte in the block) and the right edge (the sequence number of the byte immediately following the last byte in the block) of the block being acknowledged.

When SACK is enabled (the default), a packet or series of packets can be dropped, and the receiver can inform the sender of exactly which data has been received, and where the holes in the data are. The sender can then selectively retransmit the missing data without needing to retransmit blocks of data that have already been received successfully. SACK is controlled by the SackOpts registry parameter and is enabled by default. The Network Monitor capture below illustrates a host acknowledging all data up to sequence number 54857340, plus the data from sequence number 54858789-54861684.

+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP:  DOD Internet Protocol
+ IP: ID = 0x1A0D; Proto = TCP; Len: 64
TCP: .A...., len:0, seq:925104-925104, ack:54857341, win:32722, src:1242  dst:139
TCP: Source Port = 0x04DA
TCP: Destination Port = NETBIOS Session Service
TCP: Sequence Number = 925104 (0xE1DB0)
TCP: Acknowledgement Number = 54857341 (0x3450E7D)
TCP: Data Offset = 44 (0x2C)
TCP: Reserved = 0 (0x0000)
+ TCP: Flags = 0x10 : .A....
TCP: Window = 32722 (0x7FD2)
TCP: Checksum = 0x4A72
TCP: Urgent Pointer = 0 (0x0)
TCP: Options
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
+ TCP: Timestamps Option
TCP: Option Nop = 1 (0x1)
TCP: Option Nop = 1 (0x1)
TCP: SACK Option
TCP: Option Type = 0x05
TCP: Option Length = 10 (0xA)
TCP: Left Edge of Block  = 54858789 (0x3451425)
TCP: Right Edge of Block = 54861685 (0x3451F75)
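
To show how a sender might interpret this option, the following sketch decodes a SACK option and lists the ranges that are still missing. It is a minimal illustration with hypothetical function names. It uses option kind 5, as shown in the capture, and ignores sequence-number wraparound for simplicity.

```python
import struct

SACK_KIND = 5


def parse_sack_option(option: bytes):
    """Return the (left_edge, right_edge) pairs carried in a SACK option."""
    kind, length = struct.unpack("!BB", option[:2])
    if kind != SACK_KIND or length != len(option) or (length - 2) % 8 != 0:
        raise ValueError("not a well-formed SACK option")
    blocks = []
    for offset in range(2, length, 8):
        left, right = struct.unpack("!II", option[offset:offset + 8])
        blocks.append((left, right))
    return blocks


def holes(ack: int, blocks):
    """Return (first_byte, last_byte) ranges the receiver has not reported."""
    missing = []
    edge = ack                      # everything below the ACK number has arrived
    for left, right in sorted(blocks):
        if left > edge:
            missing.append((edge, left - 1))
        edge = max(edge, right)
    return missing


# Option rebuilt from the values in the capture above.
option = struct.pack("!BBII", 5, 10, 54858789, 54861685)
blocks = parse_sack_option(option)
print(blocks)                    # [(54858789, 54861685)]
print(holes(54857341, blocks))   # [(54857341, 54858788)]
```

For this capture the sender needs to retransmit only the 1,448 bytes between the ACK number and the left edge of the reported block.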

TCP Timestamps (RFC 1323)

Windows Server 2003 TCP/IP supports TCP time stamps, as described in RFC 1323. Like SACK, time stamps are important for connections using large window sizes. Time stamps were conceived to assist TCP in accurately measuring round-trip time (RTT) to adjust retransmission time-outs.

The Timestamps option carries two four-byte time stamp fields. The time-stamp value field (TSval) contains the current value of the time-stamp clock of the TCP sending the option. The Timestamp Echo Reply field (TSecr) is only valid if the ACK bit is set in the TCP header; if it is valid, it echoes a timestamp value that was sent by the remote TCP peer in the TSval field of a Timestamps option. When TSecr is not valid, its value must be zero. The TSecr value will generally be from the most recent Timestamp option that was received; however, there are exceptions that are explained below.

A TCP peer may send the Timestamps option (TSopt) in an initial SYN segment (i.e., segment containing a SYN bit and no ACK bit), and may send a TSopt in other segments only if it received a TSopt in the initial SYN segment for the connection.

The Timestamps option field can be viewed in a Network Monitor capture by expanding the TCP options field, as shown below:

TCP: Timestamps Option
TCP: Option Type = Timestamps
TCP: Option Length = 10 (0xA)
TCP: Timestamp = 2525186 (0x268802)
TCP: Reply Timestamp = 1823192 (0x1BD1D8)
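
The two fields can be decoded, and a round-trip estimate derived from the echoed value, with a sketch such as the following. The function names are hypothetical, option kind 8 is the Timestamps kind defined in RFC 1323, and the local clock value used in the example is invented for illustration.

```python
import struct

TIMESTAMPS_KIND = 8


def parse_timestamps_option(option: bytes):
    """Return (TSval, TSecr) from a 10-byte Timestamps option."""
    kind, length, tsval, tsecr = struct.unpack("!BBII", option)
    if kind != TIMESTAMPS_KIND or length != 10:
        raise ValueError("not a Timestamps option")
    return tsval, tsecr


def rtt_ticks(local_clock: int, tsecr: int) -> int:
    """Round-trip time in timestamp-clock ticks, allowing for 32-bit wraparound."""
    return (local_clock - tsecr) & 0xFFFFFFFF


# Option rebuilt from the values in the capture above.
option = struct.pack("!BBII", 8, 10, 2525186, 1823192)
tsval, tsecr = parse_timestamps_option(option)
print(tsval, tsecr)                 # 2525186 1823192
print(rtt_ticks(1823195, tsecr))    # 3, for a hypothetical local clock of 1823195
```

Because the echoed value comes from the host's own clock, the host that receives the option can measure RTT without the two clocks being synchronized.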

By default, the Timestamp option is only used on a connection if the received SYN segment for a connection initiated by a TCP peer contains the Timestamp option. This default behavior can be modified with the Tcp1323Opts registry parameter (see Appendix A).

Path Maximum Transmission Unit (PMTU) Discovery

PMTU discovery is described in RFC 1191. When a connection is established, the two hosts involved exchange their TCP maximum segment size (MSS) values. The smaller of the two MSS values is used for the connection. Historically, the MSS for a host has been the MTU at the link layer minus 40 bytes for the IP and TCP headers. However, support for additional TCP options, such as time stamps, has increased the typical TCP+IP header to 52 or more bytes. Figure 2 shows the difference between MTU and MSS.

Figure 2. MTU versus MSS

When TCP segments are destined to a non-local network, the Don’t Fragment (DF) flag is set in the IP header. Any link along the path can have an MTU that differs from that of the two hosts. If a link has an MTU that is too small for the IP datagram being routed, the router attempts to fragment the datagram. However, the Don’t Fragment bit is set in the IP header. At this point, the router should inform the sending host that the datagram couldn't be forwarded further without fragmentation. This is done with an ICMP Destination Unreachable-Fragmentation Needed and DF Set message. Most routers also specify the MTU for the next hop by putting the value for it in the low-order 16 bits of the ICMP header field that is unused in RFC 792. See RFC 1191, section 4, for the format of this message. Upon receiving this ICMP error message, TCP adjusts its MSS for the connection to the specified MTU minus the TCP and IP header size so that any further packets sent on the connection are no larger than the maximum size that can traverse the path without fragmentation.
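
The relationship between MTU and MSS, and the adjustment made after a Fragmentation Needed message, can be sketched as simple arithmetic. The names below are illustrative. The 12-byte figure for the Timestamps option assumes the usual layout of a 10-byte option preceded by two NOP bytes, which matches the 52-byte header size mentioned earlier.

```python
IP_HEADER = 20
TCP_HEADER = 20
TIMESTAMPS_OPTION = 12  # 10-byte option plus two NOP padding bytes


def mss_for_mtu(mtu: int) -> int:
    """MSS is the MTU minus the basic IP and TCP headers."""
    return mtu - IP_HEADER - TCP_HEADER


def payload_per_segment(mtu: int, timestamps: bool = False) -> int:
    """Data bytes that fit in one segment once TCP options are accounted for."""
    options = TIMESTAMPS_OPTION if timestamps else 0
    return mss_for_mtu(mtu) - options


def adjust_mss(current_mss: int, next_hop_mtu: int) -> int:
    """New MSS after an ICMP Fragmentation Needed message reports next_hop_mtu."""
    return min(current_mss, mss_for_mtu(next_hop_mtu))


print(mss_for_mtu(1500))                 # 1460
print(payload_per_segment(1500, True))   # 1448
print(adjust_mss(1460, 576))             # 536
```

The last line corresponds to a path that contains a 576-byte link: the connection drops from 1460-byte to 536-byte segments.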

Note

The minimum MTU permitted is 88 bytes, and Windows Server 2003 TCP/IP enforces this limit.

Some noncompliant routers may silently drop IP datagrams that cannot be fragmented or may not correctly report their next-hop MTU. If this occurs, it may be necessary to make a configuration change to the PMTU detection algorithm. There are two registry changes that can be made to Windows Server 2003 TCP/IP stack to work around these problematic devices:

EnablePMTUBHDetect—Adjusts the PMTU discovery algorithm to attempt to detect PMTU black hole routers. PMTU black hole detection is disabled by default.
EnablePMTUDiscovery—Completely enables or disables the PMTU discovery mechanism. When PMTU discovery is disabled, an MSS of 536 bytes is used for all non-local destination addresses. PMTU discovery is enabled by default (this is the recommended setting).

These registry entries are described in more detail in Appendix A.

The PMTU between two hosts can be discovered manually using the ping tool with the -f (don’t fragment) switch, as follows:

ping -f -n <number of pings> -l <size> <destination ip address>

As shown in the example below, the size parameter can be varied until the MTU is found. The size parameter used by ping is the size of the data buffer to send, not including headers. The ICMP header consumes 8 bytes, and the IP header is normally 20 bytes. In the case below (Ethernet), the link layer MTU is the maximum-sized ping buffer plus 28, or 1500 bytes:

C:\>ping -f -n 1 -l 1472 10.99.99.10
Pinging 10.99.99.10 with 1472 bytes of data:
Reply from 10.99.99.10: bytes=1472 time<10ms TTL=128
Ping statistics for 10.99.99.10:
Packets: Sent = 1, Received = 1, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 0ms, Maximum =  0ms, Average =  0ms

C:\>ping -f -n 1 -l 1473 10.99.99.10
Pinging 10.99.99.10 with 1473 bytes of data:
Packet needs to be fragmented but DF set.
Ping statistics for 10.99.99.10:
Packets: Sent = 1, Received = 0, Lost = 1 (100% loss),
Approximate round trip times in milliseconds:
Minimum = 0ms, Maximum =  0ms, Average =  0ms

In the example shown above, the IP layer returned an ICMP error message that ping interpreted. If the router had been a PMTU black hole router, ping would simply not be answered once its size exceeded the MTU that the router could handle. Ping can be used in this manner to detect such a router.
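
Varying the size by hand amounts to a binary search, which can be sketched as follows. This is an illustration only: the probe argument is a hypothetical callback that returns True when a ping with the Don't Fragment flag set and the given buffer size is answered. The sketch assumes that the smallest size always succeeds.

```python
from typing import Callable

ICMP_HEADER = 8
IP_HEADER = 20


def find_path_mtu(probe: Callable[[int], bool], low: int = 0, high: int = 1472) -> int:
    """Return the path MTU implied by the largest ping buffer that probe accepts."""
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid          # mid fits, so search the larger sizes
        else:
            high = mid - 1     # mid is too large, so search the smaller sizes
    return low + ICMP_HEADER + IP_HEADER


def simulated_path(mtu: int) -> Callable[[int], bool]:
    """Stand-in for a real ping: succeeds when buffer plus headers fits the MTU."""
    return lambda size: size + ICMP_HEADER + IP_HEADER <= mtu


print(find_path_mtu(simulated_path(1500)))  # 1500
print(find_path_mtu(simulated_path(576)))   # 576
```

Against a real network, probe would run the ping command shown above and check for a reply. A black hole router looks the same to the search as a router that returns the ICMP error, because in both cases the probe goes unanswered.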

A sample ICMP Destination Unreachable-Fragmentation Needed and DF Set message is shown here:

***************************************************************
Src Addr     Dst Addr    Protocol  Description
10.99.99.10  10.99.99.9  ICMP      Destination Unreachable: 10.99.99.10 See frame 3
+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
+ IP: ID = 0x4401; Proto = ICMP; Len: 56
ICMP: Destination Unreachable: 10.99.99.10 See frame 3
ICMP: Packet Type = Destination Unreachable
ICMP: Unreachable Code = Fragmentation Needed, DF Flag Set
ICMP: Checksum = 0xA05B
ICMP: Next Hop MTU = 576 (0x240)
ICMP: Data: Number of data bytes remaining = 28 (0x001C)
ICMP: Description of original IP frame

This error was generated by using ping -f -n 1 -l 1000 on an Ethernet-based host to send a large datagram across a router interface that only supports an MTU of 576 bytes. When the router tried to place the large frame onto the network with the smaller MTU, it found that fragmentation was not allowed. Therefore, it returned the error message indicating that the largest datagram that could be forwarded is 576 bytes.
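
The Next Hop MTU value sits in the low-order 16 bits of the second 32-bit word of the ICMP header, as laid out in RFC 1191. A minimal sketch of extracting it follows. The function name is hypothetical, and the sample bytes are rebuilt from the fields in the capture rather than taken from a real packet.

```python
import struct

DEST_UNREACHABLE = 3
FRAGMENTATION_NEEDED = 4


def next_hop_mtu(icmp: bytes):
    """Return the next-hop MTU from a Fragmentation Needed message, else None."""
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type == DEST_UNREACHABLE and code == FRAGMENTATION_NEEDED:
        return mtu
    return None


message = struct.pack("!BBHHH", 3, 4, 0xA05B, 0, 576)
print(next_hop_mtu(message))  # 576
```

Routers that predate RFC 1191 leave this field set to zero, in which case the sending host has to estimate a smaller MTU on its own.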

Dead Gateway Detection

Dead gateway detection is used to allow TCP to detect failure of the default gateway and to adjust the IP route table to use another default gateway. Windows Server 2003 TCP/IP uses the triggered reselection method described in RFC 816, with slight modifications based upon customer experience and feedback.

When a TCP connection routed through the default gateway attempts to send a TCP packet to the destination a number of times (equal to one-half of the registry value TcpMaxDataRetransmissions) without receiving a response, the algorithm changes the Route Cache Entry (RCE) for that remote IP address to use the next default gateway in the list as its next-hop address. When 25 percent of the TCP connections have moved to the next default gateway, the algorithm advises IP to change the computer’s default gateway to the one that the connections are now using.

For example, assume that there are currently TCP connections to 11 different IP addresses that are being routed through the default gateway. Now assume that the default gateway fails, that there is a second default gateway configured, and that the value for TcpMaxDataRetransmissions is at the default of 5.

When the first TCP connection tries to send data, it does not receive any acknowledgments. After the third retransmission, the RCE for that remote IP address is switched to the next default gateway in the list. At this point, any TCP connections to that one remote IP address have switched over, but the remaining connections still try to use the original default gateway.

When the second TCP connection tries to send data, the same thing happens. Now, two of the 11 RCEs point to the new gateway.

When the third TCP connection tries to send data, after the third retransmission, three of 11 RCEs have been switched to the second default gateway. Because, at this point, over 25 percent of the RCEs have been moved, the default gateway for the whole computer is moved to the new one.

That default gateway remains the primary one for the computer until it experiences problems (causing the dead gateway algorithm to try the next one in the list again) or until the computer is restarted.

When the search reaches the last default gateway, it returns to the beginning of the list.
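
The walk-through above can be modeled in a few lines. The sketch below is a simplified model, not the actual algorithm in Tcpip.sys: the class and method names are hypothetical, half of TcpMaxDataRetransmissions is rounded up (which matches the "third retransmission" in the example), and all destinations are assumed to follow the new default gateway once it changes.

```python
import math


class DeadGatewayDetector:
    """Tracks which destinations have been moved off a failing default gateway."""

    def __init__(self, gateways, destinations, max_data_retransmissions=5):
        self.gateways = list(gateways)
        self.destinations = set(destinations)
        self.default_index = 0
        self.moved = set()  # destinations whose RCE points at the next gateway
        self.threshold = math.ceil(max_data_retransmissions / 2)

    @property
    def default_gateway(self):
        return self.gateways[self.default_index]

    def gateway_for(self, dest):
        if dest in self.moved:
            return self.gateways[(self.default_index + 1) % len(self.gateways)]
        return self.default_gateway

    def on_retransmission(self, dest, count):
        """Record that `count` retransmissions to dest have gone unanswered."""
        if count < self.threshold or dest in self.moved:
            return
        self.moved.add(dest)
        if len(self.moved) / len(self.destinations) >= 0.25:
            # Enough connections have moved: switch the computer's default gateway.
            self.default_index = (self.default_index + 1) % len(self.gateways)
            self.moved.clear()


detector = DeadGatewayDetector(["gw1", "gw2"], [f"host{i}" for i in range(11)])
for host in ("host0", "host1", "host2"):
    detector.on_retransmission(host, 3)
    print(host, detector.default_gateway)
```

With 11 destinations, the first two switches leave the default gateway unchanged (2 of 11 is about 18 percent), and the third switch (3 of 11, about 27 percent) moves the whole computer to the second gateway.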

TCP Retransmission Behavior

TCP starts a retransmission timer when each outbound segment is handed down to IP. If no acknowledgment has been received for the data in a given segment before the timer expires, the segment is retransmitted. For new connection requests, the retransmission timer is initialized to 3 seconds (controllable using the TcpInitialRtt per-adapter registry parameter), and the request (SYN) is resent up to the value specified in TcpMaxConnectRetransmissions (the default for Windows Server 2003 is 2 times). On existing connections, the number of retransmissions is controlled by the TcpMaxDataRetransmissions registry parameter (5 by default). The retransmission time-out is adjusted on the fly to match the characteristics of the connection, using Smoothed Round Trip Time (SRTT) calculations as described in Van Jacobson’s article named "Congestion Avoidance and Control." The timer for a given segment is doubled after each retransmission of that segment. Using this algorithm, TCP tunes itself to the normal delay of a connection. TCP connections over high-delay links take much longer to time-out than those over low-delay links. [3]

The following Network Monitor capture shows the retransmission algorithm for two hosts that are connected over Ethernet on the same subnet. An FTP file transfer was in progress when the receiving host was disconnected from the network. Because the SRTT for this connection was very small, the first retransmission was sent after about one-half second. The timer was then doubled for each of the retransmissions that followed. After the fifth retransmission, the timer was once again doubled. If no acknowledgment was received before it expired, the connection was aborted.

Delta  Source IP    Dest IP      Pro Flags Description
----------------------------------------------------------
0.000  10.57.10.32  10.57.9.138  TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760
0.521  10.57.10.32  10.57.9.138  TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760
1.001  10.57.10.32  10.57.9.138  TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760
2.003  10.57.10.32  10.57.9.138  TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760
4.007  10.57.10.32  10.57.9.138  TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760
8.130  10.57.10.32  10.57.9.138  TCP .A.., len: 1460, seq: 8043781, ack: 8153124, win: 8760
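
The doubling pattern in the capture can be reproduced with a short calculation. The sketch below is illustrative, with a hypothetical function name: it returns the delay before each retransmission and the total time until the connection is abandoned, including the final doubled wait described above.

```python
def retransmission_schedule(initial_rto: float, max_retransmissions: int):
    """Return (delays before each retransmission, total seconds until abort)."""
    delays = [initial_rto * 2 ** i for i in range(max_retransmissions)]
    final_wait = initial_rto * 2 ** max_retransmissions  # timer doubled once more
    return delays, sum(delays) + final_wait


delays, total = retransmission_schedule(0.5, 5)
print(delays)  # [0.5, 1.0, 2.0, 4.0, 8.0]
print(total)   # 31.5
```

With an initial time-out of about half a second and the default of five retransmissions, the delays match the deltas in the capture and the connection is aborted after roughly 31.5 seconds. Applying the same doubling to a connection request (3 seconds, 2 retransmissions) gives delays of 3 and 6 seconds. Whether the stack waits out a final doubled interval in that case as well is an assumption of this model.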

There are some circumstances under which TCP retransmits data prior to the time that the retransmission timer expires. The most common of these occurs due to a feature known as fast retransmit. When a receiver that supports fast retransmit receives data with a sequence number beyond the current expected one, it assumes that some data was dropped. To help make the sender aware of this event, the receiver immediately sends an ACK, with the ACK number set to the sequence number that it was expecting. It continues to do this for each additional TCP segment that arrives containing data subsequent to the missing data in the incoming stream. When the sender starts to receive a stream of ACKs that are acknowledging the same sequence number and that sequence number is earlier than the current sequence number being sent, it can infer that a segment (or more) must have been dropped. Senders that support the fast retransmit algorithm immediately resend the segment that the receiver is expecting to fill in the gap in the data, without waiting for the retransmission timer to expire for that segment. This optimization greatly improves performance in a busy network environment.

With fast retransmit, the sender retransmits the TCP segments to fill in the gap in the data before their retransmission timers expire. Because the retransmission timers did not expire for the missing TCP segments, the sender can more quickly send additional segments to the receiver. This is known as fast recovery. Fast retransmit and fast recovery are described in RFC 2581.

By default, Windows Server 2003 resends a segment if it receives three ACKs for the same sequence number and that sequence number lags the current one. This is controllable with the TcpMaxDupAcks registry parameter. See also the “TCP Selective Acknowledgment (RFC 2018)” section in this article.
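
The sender-side rule can be sketched as a small counter. The class and parameter names below are hypothetical, the threshold of three ACKs is taken from the paragraph above, and sequence-number wraparound is ignored for brevity.

```python
class DuplicateAckDetector:
    """Sender-side sketch: count ACKs that repeat the same ACK number."""

    def __init__(self, acks_needed=3):
        self.acks_needed = acks_needed  # assumption: three ACKs, per the text
        self.last_ack = None
        self.count = 0

    def on_ack(self, ack_number, next_seq_to_send):
        """Return the sequence number to resend now, or None."""
        if ack_number == self.last_ack:
            self.count += 1
        else:
            self.last_ack = ack_number
            self.count = 1
        # Resend only when the repeated ACK lags what has already been sent.
        if self.count == self.acks_needed and ack_number < next_seq_to_send:
            return ack_number
        return None


detector = DuplicateAckDetector()
# The receiver keeps asking for byte 1000 while the sender is already at 5380.
results = [detector.on_ack(1000, 5380) for _ in range(4)]
print(results)  # [None, None, 1000, None]
```

The segment is resent exactly once, when the third matching ACK arrives, without waiting for the retransmission timer.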

TCP Keep-Alive Messages

A TCP keep-alive packet is simply an ACK with the sequence number set to one less than the current sequence number for the connection. A host receiving one of these ACKs responds with an ACK for the current sequence number. Keep-alives can be used to verify that the computer at the remote end of a connection is still available. TCP keep-alives can be sent once every KeepAliveTime (defaults to 7,200,000 milliseconds, or two hours) if no other data or higher-level keep-alives have been carried over the TCP connection. If there is no response to a keep-alive, it is repeated once every KeepAliveInterval (defaults to 1,000 milliseconds, or one second). NetBT connections, such as those used by other Microsoft networking components, send NetBIOS keep-alives more frequently, so normally no TCP keep-alives are sent on a NetBIOS connection. TCP keep-alives are disabled by default, but Windows Sockets applications can use the setsockopt() function to enable them.
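
As an illustration of enabling keep-alives from an application, the following sketch uses Python's socket module, which passes the request to the platform's setsockopt() function. The variable names are hypothetical. The timing values still come from the system-wide settings described above.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Keep-alives stay off until the application asks for them on this socket.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Read the option back to confirm that it took effect.
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print(keepalive_enabled)
sock.close()
```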

Slow Start Algorithm and Congestion Avoidance

When a connection is established, TCP starts slowly at first to assess the bandwidth of the connection, and to avoid overflowing the receiving host or any other devices or links in the path. The send window is set to two TCP segments, and if that is acknowledged, it is incremented to three segments.4 If those are acknowledged, it is incremented again, and so on until the amount of data being sent per burst reaches the size of the receive window on the remote host. At that point, the slow start algorithm is no longer in use, and flow control is governed by the receive window. However, congestion could still occur on a connection at any time during transmission. If this happens (evidenced by the need to retransmit), a congestion-avoidance algorithm is used to reduce the send window size temporarily and to grow it back towards the receive window size. Slow start and congestion avoidance are described in RFC 2581.
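
The growth of the send window during slow start can be sketched as follows. This is a simplified reading of the paragraph above, in which each acknowledged burst opens the window by one segment. The function name is hypothetical, and the example assumes an 8760-byte receive window with 1460-byte segments, as in the retransmission capture earlier in this article.

```python
def slow_start_windows(receive_window_segments, initial_window=2):
    """Return each send-window size, in segments, used on the way up to
    the receive window."""
    windows = [initial_window]
    while windows[-1] < receive_window_segments:
        # An acknowledged burst lets the sender try one more segment.
        windows.append(windows[-1] + 1)
    return windows


print(slow_start_windows(8760 // 1460))  # [2, 3, 4, 5, 6]
```

Once the last value is reached, the receive window, not slow start, limits how much data is outstanding.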

Silly Window Syndrome (SWS)

Silly Window Syndrome is described in RFC 1122 as follows:

“In brief, SWS is caused by the receiver advancing the right window edge whenever it has any new buffer space available to receive data and by the sender using any incremental window, no matter how small, to send more data [TCP:5]. The result can be a stable pattern of sending tiny data segments, even though both sender and receiver have a large total buffer space for the connection.”

Windows Server 2003 TCP/IP implements SWS avoidance, as specified in RFC 1122, by not sending more data until there is a sufficient window size advertised by the receiving end to send a full TCP segment. It also implements SWS avoidance on the receive end of a connection by not opening the receive window in increments of less than a TCP segment.

Nagle Algorithm

Windows Server 2003 TCP/IP implements the Nagle algorithm described in RFC 896. The purpose of this algorithm is to reduce the number of very small segments sent, especially on high-delay (remote) links. The Nagle algorithm allows only one small segment to be outstanding at a time without acknowledgment. If more small segments are generated while awaiting the ACK for the first one, these segments are coalesced into one larger segment. Any full-sized segment is always transmitted immediately, on the assumption that there is a sufficient receive window available. The Nagle algorithm is effective in reducing the number of packets sent by interactive applications, such as Telnet, especially over slow links.
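
The coalescing behavior can be sketched with a small model of the sending side. This is not the stack's implementation. The class name and the 1460-byte segment size are assumptions, and the receive window is ignored.

```python
class NagleSender:
    """Toy model: at most one small segment may be unacknowledged."""

    def __init__(self, mss=1460):
        self.mss = mss
        self.buffer = b""           # small data held back by the stack
        self.unacked_small = False  # True while a small segment is in flight

    def write(self, data):
        """Application write; returns the segments transmitted right away."""
        self.buffer += data
        return self._flush()

    def on_ack(self):
        """ACK for the outstanding small segment; returns segments released."""
        self.unacked_small = False
        return self._flush()

    def _flush(self):
        sent = []
        # Full-sized segments always go out immediately.
        while len(self.buffer) >= self.mss:
            sent.append(self.buffer[:self.mss])
            self.buffer = self.buffer[self.mss:]
        # A small segment goes out only if none is awaiting an ACK.
        if self.buffer and not self.unacked_small:
            sent.append(self.buffer)
            self.buffer = b""
            self.unacked_small = True
        return sent


sender = NagleSender()
first = sender.write(b"y")                     # nothing in flight, sent at once
held = [sender.write(b"y") for _ in range(3)]  # buffered while awaiting the ACK
released = sender.on_ack()                     # held characters leave together
print(first, held, released)  # [b'y'] [[], [], []] [b'yyy']
```

Four keystrokes produce two segments instead of four, which is the saving the algorithm is designed to achieve.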

The Nagle algorithm can be observed in the following Network Monitor capture. The trace was captured by using PPP to dial up an Internet service provider. A Telnet (character-mode) session was established, and then the Y key was held down on the computer. At any given time, only one segment was outstanding, and further Y characters were held by the stack until an acknowledgment was received for the previous segment. In this example, three to four Y characters were buffered each time and sent together in one segment. As a result, the Nagle algorithm reduced the number of packets sent by a factor of about three.

Delta Source IP     Dest IP       Prot   Description
----------------------------------------------------------
0.644 204.182.66.83 199.181.164.4 TELNET To Server Port = 1901
0.144 199.181.164.4 204.182.66.83 TELNET To Client Port = 1901
0.000 204.182.66.83 199.181.164.4 TELNET To Server Port = 1901
0.145 199.181.164.4 204.182.66.83 TELNET To Client Port = 1901
0.000 204.182.66.83 199.181.164.4 TELNET To Server Port = 1901
0.144 199.181.164.4 204.182.66.83 TELNET To Client Port = 1901
. . .

Each segment contained several of the "y" characters. The first segment is shown more fully parsed below, and the data portion is pointed out in the hexadecimal display at the bottom.

***************************************************************
Time  Source IP     Dest IP       Prot   Description
0.644 204.182.66.83 199.181.164.4 TELNET To Server Port = 1901
+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
+ IP: ID = 0xEA83; Proto = TCP; Len: 43
+ TCP: .AP..., len: 3, seq:1032660278, ack: 353339017, win: 7766, src: 1901 dst: 23 (TELNET)
  TELNET: To Server From Port = 1901
      TELNET: Telnet Data
D2 41 53 48 00 00 52 41 53 48 00 00 08 00 45 00  .ASH..RASH....E.
00 2B EA 83 40 00 20 06 F5 85 CC B6 42 53 C7 B5  .+..@. .....BS..
A4 04 07 6D 00 17 3D 8D 25 36 15 0F 86 89 50 18  ...m..=.%6....P.
1E 56 1E 56 00 00 79 79 79                       .V.V..yyy
                                                       ^^^
                                                       data

Windows Sockets applications can disable the Nagle algorithm for their connections by setting the TCP_NODELAY socket option. However, this practice should be avoided unless it is absolutely necessary because it increases network utilization. Some network applications may not perform well if their design does not take into account the effects of transmitting large numbers of small packets and the Nagle algorithm. The Nagle algorithm is not applied to loopback TCP connections for performance reasons. Windows Server 2003 NetBT disables Nagling for NetBIOS over TCP connections as well as direct-hosted redirector/server connections, which can improve performance for applications issuing numerous small file manipulation commands. An example is an application that uses file locking/unlocking frequently.
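
For completeness, the following sketch shows how an application might set the TCP_NODELAY option, here through Python's socket module, which forwards the request to the platform's setsockopt() function. The variable names are hypothetical, and the caution above about using this option sparingly still applies.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Ask the stack to send small segments immediately instead of coalescing them.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# Read the option back to confirm that it took effect.
nagle_disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nagle_disabled)
sock.close()
```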

TCP TIME-WAIT Delay

When a TCP connection is closed, the socket-pair is placed into a state known as TIME-WAIT. This is done so that a new connection does not use the same protocol, source IP address, destination IP address, source port, and destination port until enough time has passed to ensure that any segments that may have been misrouted or delayed are not delivered unexpectedly. RFC 793 specifies the length of time that the socket-pair should not be reused as two maximum segment lifetimes (2 MSL), or four minutes. This is the default setting for Windows Server 2003 TCP/IP. However, with this default setting, some network applications that perform many outbound connections in a short time may use up all available ports before the ports can be recycled.

Windows Server 2003 TCP/IP offers two methods of controlling this behavior. First, the TcpTimedWaitDelay registry parameter can be used to alter this value. Windows Server 2003 TCP/IP allows it to be set as low as 30 seconds, which should not cause problems in most environments. Second, the number of user-accessible ephemeral ports that can be used to source outbound connections is configurable using the MaxUserPort registry parameter. By default, when an application requests any socket from the system to use for an outbound call, a port between the values of 1024 and 5000 is supplied. The MaxUserPort parameter can be used to set the value of the uppermost port that the administrator chooses to allow for outbound connections. For instance, setting this value to 10,000 (decimal) would make approximately 9000 user ports available for outbound connections. For more details on this concept, see RFC 793. See also the MaxFreeTcbs and MaxHashTableSize registry parameters in Appendix A.
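
The interaction between the two parameters can be estimated with simple arithmetic. The sketch below makes two assumptions: the port range is inclusive at both ends, and in the worst case every closed connection holds its local port for the full TIME-WAIT delay. The function names are hypothetical.

```python
def usable_ports(max_user_port, first_port=1024):
    """Number of ports available for outbound connections (inclusive range)."""
    return max_user_port - first_port + 1


def sustained_connections_per_second(max_user_port, time_wait_seconds):
    """Worst-case steady rate of new outbound connections before ports run out."""
    return usable_ports(max_user_port) / time_wait_seconds


print(usable_ports(5000))   # 3977
print(usable_ports(10000))  # 8977
print(round(sustained_connections_per_second(5000, 240), 1))  # 16.6
print(round(sustained_connections_per_second(10000, 30), 1))  # 299.2
```

Under these assumptions, the defaults sustain only about 16 new outbound connections per second, which is why both parameters matter for applications that open many short-lived connections.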

TCP/IP for Windows Server 2003 Service Pack 1 implements a smart TCP port allocation algorithm. When an application requests any available TCP port, TCP/IP first attempts to find an available port that does not correspond to a connection in the TIME-WAIT state. If no such port can be found, it picks any available port. For more information, see "Smart TCP Port Allocation" in this article.

TCP Connections to and from Multihomed Computers

When TCP connections are made to a multihomed host, both the WINS client and the Domain Name Resolver (DNR) attempt to determine whether any of the destination IP addresses provided by the name server are on the same subnet as any of the interfaces in the local computer. If so, these addresses are sorted to the top of the list so that the application can try them prior to trying addresses that are not on the same subnet. If none of the addresses is on a common subnet with the local computer, behavior is different depending upon the name space. The PrioritizeRecordData TCP/IP registry parameter can be used to prevent the DNR component from sorting local subnet addresses to the top of the list.

In the WINS name space, the client is responsible for randomizing or load balancing between the provided addresses. The WINS server always returns the list of addresses in the same order, and the WINS client randomly picks one of them for each connection.

In the DNS name space, the DNS server is usually configured to provide the addresses in a round robin fashion. The DNR does not attempt to further randomize the addresses. In some situations, it is desirable to connect to a specific interface on a multihomed computer. The best way to accomplish this is to provide the interface with its own DNS entry. For example, a computer named raincity could have one DNS entry listing both IP addresses (actually two separate records in the DNS with the same name), and also records in the DNS for raincity1 and raincity2, each associated with just one of the IP addresses assigned to the computer.
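
The local-subnet sorting described at the start of this section can be sketched as a stable sort. This is an illustration only. The function name, the addresses, and the /21 prefix length are hypothetical.

```python
import ipaddress


def sort_local_subnet_first(addresses, local_interfaces):
    """Move addresses on a directly attached subnet to the top of the list."""
    networks = [ipaddress.ip_interface(i).network for i in local_interfaces]

    def is_remote(address):
        ip = ipaddress.ip_address(address)
        return not any(ip in network for network in networks)

    # False sorts before True, and sorted() keeps the name server's
    # original order within each group.
    return sorted(addresses, key=is_remote)


answers = ["192.168.5.20", "10.57.9.138", "172.16.0.9"]
print(sort_local_subnet_first(answers, ["10.57.10.32/21"]))
# ['10.57.9.138', '192.168.5.20', '172.16.0.9']
```

Because the sort is stable, any round-robin ordering supplied by the name server is preserved among the addresses that are not on a local subnet.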

When TCP connections are made from a multihomed host, things get a bit more complicated. If the connection is a Winsock connection using the DNS name space, once the target IP address for the connection is known, TCP attempts to connect from the best source IP address available. Again, the route table is used to make this determination. If there is an interface in the local computer that is on the same subnet as the target IP address, its IP address is used as the source in the connection request. If there is no best source IP address to use, the system chooses one randomly.

If the connection is a NetBIOS-based connection using the redirector, little routing information is available at the application level. The NetBIOS interface supports connections over various protocols and has no knowledge of IP. Instead, the redirector places calls on all of the transports that are bound to it. If there are two interfaces in the computer and one protocol installed, there are two transports available to the redirector. Calls are placed on both, and NetBT submits connection requests to the stack, using an IP address from each interface. It is possible that both calls succeed. If so, the redirector cancels one of them. The choice of which one to cancel depends upon the redirector ObeyBindingOrder registry value5. If this is set to 0 (the default value), the primary transport (determined by binding order) is the preferred one, and the redirector waits for the primary transport to time out before accepting the connection on the secondary transport. If this value is set to 1, the binding order is ignored, and the redirector accepts the first connection that succeeds and cancels the other(s).

Throughput Considerations

TCP was designed to provide optimum performance over varying link conditions, and Windows Server 2003 TCP/IP contains improvements such as those supporting RFC 1323. Actual throughput for a link depends on a number of variables, but the most important factors are:

Link speed (bits-per-second that can be transmitted)
Propagation delay
Window size (amount of unacknowledged data that may be outstanding on a TCP connection)
Link reliability
Network and intermediate device congestion
Path MTU

TCP throughput calculation is discussed in detail in Chapters 20–24 of TCP/IP Illustrated, by W. Richard Stevens6. Some key considerations are listed below:

The capacity of a pipe is bandwidth multiplied by round-trip time. This is known as the bandwidth-delay product. If the link is reliable, for best performance the window size should be greater than or equal to the capacity of the pipe so that the sending stack can fill it. The largest window size that can be specified, due to its 16-bit field in the TCP header, is 65,535 bytes, but larger windows can be negotiated by using window scaling as described earlier in this article. See TcpWindowSize in Appendix A.
Throughput can never exceed window size divided by round-trip time.
If the link is unreliable or badly congested and packets are being dropped, using a larger window size may not improve throughput. Along with scaling windows support, Windows Server 2003 TCP/IP supports Selective Acknowledgments (SACK; described in RFC 2018) to improve performance in environments that are experiencing packet loss. It also includes support for timestamps (described in RFC 1323) for improved RTT estimation.
Propagation delay is dependent upon the speed of light, latencies in transmission equipment, and so on.
Transmission delay depends on the speed of the media.
For a specified path, propagation delay is fixed, but transmission delay depends upon the packet size.
At low speeds, transmission delay is the limiting factor. At high speeds, propagation delay may become the limiting factor.
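
The relationships in the list above can be worked through numerically. The sketch below uses hypothetical link speeds and round-trip times, and the function names are illustrative only.

```python
def bandwidth_delay_product_bytes(bits_per_second, rtt_ms):
    """Capacity of the pipe: bandwidth multiplied by round-trip time."""
    return bits_per_second * rtt_ms / 8000


def max_throughput_bps(window_bytes, rtt_ms):
    """Upper bound on throughput: window size divided by round-trip time."""
    return window_bytes * 8000 / rtt_ms


def transmission_delay_ms(packet_bytes, bits_per_second):
    """Time needed to clock one packet onto the link."""
    return packet_bytes * 8000 / bits_per_second


# Hypothetical path: 45 Mbps link with a 100 ms round-trip time.
print(bandwidth_delay_product_bytes(45_000_000, 100))  # 562500.0
print(max_throughput_bps(65535, 100))                  # 5242800.0

# A 1500-byte packet on a 56 Kbps link and on a 100 Mbps link.
print(round(transmission_delay_ms(1500, 56_000), 1))   # 214.3
print(transmission_delay_ms(1500, 100_000_000))        # 0.12
```

In this example, an unscaled 65,535-byte window limits the connection to about 5.2 Mbps on a 45 Mbps path, because the pipe holds more than eight times that much data. The last two results show why transmission delay dominates on slow links and becomes negligible on fast ones.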

To summarize, Windows Server 2003 TCP/IP can adapt to most network conditions and can dynamically provide the best throughput and reliability possible on a per-connection basis. Attempts at manual tuning are often counter-productive unless a qualified network engineer first performs a careful study of data flow.

User Datagram Protocol (UDP)

UDP provides a connectionless, unreliable transport service. It is often used for communications that use broadcast or multicast IP datagrams. Since delivery of UDP datagrams is not guaranteed, applications using UDP must supply their own mechanisms for reliability, if needed. Microsoft networking components use UDP for logon, browsing, and name resolution. UDP can also be used to carry IP multicast streams.

UDP and Name Resolution

UDP is used for NetBIOS name resolution by unicast to a NetBIOS name server or subnet broadcasts, and for DNS host name to IP address resolution. NetBIOS name resolution is accomplished over UDP port 137. DNS queries use UDP port 53. Because UDP itself does not guarantee delivery of datagrams, both of these services use their own retransmission schemes if they receive no answer to queries. Broadcast UDP datagrams are not usually forwarded over IP routers, so NetBIOS name resolution in a routed environment requires a name server of some type, or the use of static database files.
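
Because UDP does not acknowledge datagrams, a client that needs an answer has to resend its query itself. The following sketch shows the general pattern. It is not the retransmission scheme used by the NetBIOS or DNS clients, and the function name, attempt count, and timeout are hypothetical.

```python
import socket


def query_with_retries(payload, server, attempts=3, timeout=1.0):
    """Send a UDP datagram and resend it if no reply arrives in time.

    Returns the reply bytes, or None if every attempt times out.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(attempts):
            sock.sendto(payload, server)
            try:
                reply, _sender = sock.recvfrom(512)
                return reply
            except socket.timeout:
                continue  # no delivery guarantee from UDP, so ask again
    return None
```

A caller would pass the query bytes and a (host, port) tuple, for example port 53 for a DNS server, and treat a None result as a failed lookup.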

Mailslots over UDP

Many NetBIOS applications use mailslot messaging. A second-class mailslot is a simple mechanism for sending a message from one NetBIOS name to another over UDP. Mailslot messages can be broadcast on a subnet or directed to the remote host. To direct a mailslot message to another host, there must be some method of NetBIOS name resolution available. Microsoft provides WINS for this purpose.

NetBIOS over TCP/IP (NetBT)

The Windows Server 2003 implementation of NetBIOS over TCP/IP is referred to as NetBT. NetBT uses the following TCP and UDP ports:

UDP port 137 (name services)
UDP port 138 (datagram services)
TCP port 139 (session services)

NetBIOS over TCP/IP is specified by RFC 1001 and RFC 1002. The Netbt.sys driver is a kernel-mode component that supports the Transport Driver Interface (TDI). Services such as Workstation (redirector) and Server (file server) use TDI directly, but traditional NetBIOS applications have their calls mapped to TDI calls by the Netbios.sys driver. Using TDI to make calls to NetBT is a more difficult programming task, but can provide higher performance and freedom from historical NetBIOS limitations. NetBIOS concepts are discussed further in the “Network Application Interfaces” section of this article.

Transport Driver Interface (TDI)

Microsoft developed the Transport Driver Interface (TDI) to provide greater flexibility and functionality than is provided by existing interfaces, such as NetBIOS and Windows Sockets. All Windows transport providers expose TDI. The TDI specification describes the set of primitive functions by which transport drivers and TDI clients communicate and the call mechanisms used for accessing them. Currently, TDI is kernel-mode only.

The Windows Server 2003 redirector and server both use TDI directly, rather than going through the NetBIOS mapping layer. By doing so, they are not subject to many of the restrictions imposed by NetBIOS, such as the legacy 254-session limit.

TDI Features

TDI may be the most difficult to use of all Windows network APIs. It is a simple conduit, so the programmer must determine the format and meaning of messages.

TDI includes the following features:

Most Windows Server 2003 transports support TDI
An open naming and addressing scheme
Message and stream-mode data transfer
Asynchronous operation
Support for unsolicited indication of events
Extensibility—clients can submit private requests to a transport driver that understands them.
Support for limited use of standard kernel-mode I/O functions to send and receive data
32-bit addressing and values
Support for Access Control Lists (ACLs, used for security) on TDI address objects

More information about TDI is available from the Windows DDK.

Security Considerations

Network security is a serious consideration for administrators with computers exposed to public networks. The Microsoft TCP/IP stack has been strengthened against many attacks and in its default state handles most of the common attacks. Some additional protection against popular Denial of Service attacks can be added by setting the value of the SynAttackProtect parameter in the registry. This parameter lets the administrator choose among several levels of protection against SYN attacks.

Here are general guidelines that can lower your exposure to attack:

Disable unnecessary or optional services (for instance, the Client for Microsoft Networks and the File and Printer Sharing for Microsoft Networks components on the network connections of an IIS server).
Enable TCP/IP filtering and restrict access to only the ports that are necessary for the server to function. See article 150543 in the Microsoft Knowledge Base (https://go.microsoft.com/fwlink/?linkid=67962) for a list of ports that Windows services use.
Disable NetBIOS over TCP/IP on network connections where it is not needed.
Configure static IP addresses and parameters for network adapters connected to the Internet.
Configure registry parameters for maximum protection (see Appendix D).

Consult the Microsoft Security Web site (https://go.microsoft.com/fwlink/?linkID=7420) regularly for security bulletins.
