Using Network Monitor 3 to Troubleshoot a Domain Join Failure Caused by a Black Hole Router
This is Randy again with an interesting case that I had recently. We were having problems trying to join certain workstations to the domain. We would see that every workstation in one site would join successfully and all the workstations in another site would fail with an error indicating that we could not locate a domain controller for that domain. My first hunch was either the domain controllers in the one site were broken, or there were networking issues in that problem site. The first step in troubleshooting is “Check the Event Logs!” We did not see any alarming events on any of the domain controllers in the problem site. So my next step was to take a network trace of the issue. With the help of a networking engineer here at Microsoft, Tim Quinn, we analyzed the traffic of a successful domain join and a failure. We took a simultaneous trace from the workstation and authenticating domain controller to ensure that we could see both sides of the conversation and uncover any failures across the wire.
We used Network Monitor 3.2 to take the traces. You can find some very helpful webcasts on working with Network Monitor on the Netmon blog. Here is a snapshot of some traffic between the domain controller and the workstation that is attempting to join. This is from the Frame Summary pane and is a general overview of each frame sent on the wire.
You can see the protocol and description of each frame. We have a lot of traffic on RPC and SMB and from the frame description we see that this is communication on named pipes: Netlogon, Samr, and LSARPC. These are the connection points involved in a domain join between a workstation and a domain controller. By highlighting one of these frames in the Frame Summary pane, we can see each network layer of the frame in the Frame Details pane.
It is important to understand that data transfer uses encapsulation to move information from a process on one computer to a process on another computer. In the above example we see the LDAP client on a workstation talking to the LDAP server on a domain controller using the defined specifications of the LDAP protocol. This data is packaged in a TCP segment, which is built using the specifications of the TCP protocol. The TCP segment is packaged in an IP datagram, which is in turn packaged in an Ethernet frame. Netmon separates each of these layers so that you can analyze the behavior of each protocol individually.
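To make the layering concrete, here is a minimal Python sketch that builds a synthetic frame and then peels the Ethernet, IPv4, and TCP headers back off. It is only an illustration of encapsulation, not how Netmon parses traffic. The helper names (`build_frame`, `parse_frame`), the addresses, and the source port are hypothetical. Checksums are left at zero, and VLAN tags, IP options, and IPv6 are not handled.

```python
import struct

def build_frame(payload, ip_id, seq, ack, src_port=49152, dst_port=389):
    """Wrap payload in TCP, then IPv4, then Ethernet II (checksums left as 0)."""
    # TCP header: ports, seq, ack, data offset (5 words), flags (PSH+ACK),
    # window, checksum, urgent pointer.
    tcp = struct.pack("!HHIIBBHHH", src_port, dst_port, seq, ack,
                      5 << 4, 0x18, 8192, 0, 0) + payload
    # IPv4 header: version/IHL, TOS, total length, Identification,
    # flags (Don't Fragment), TTL, protocol 6 (TCP), checksum, src, dst.
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(tcp), ip_id,
                     0x4000, 128, 6, 0,
                     bytes([10, 0, 0, 5]), bytes([10, 0, 1, 10]))
    # Ethernet II header: destination MAC, source MAC, EtherType 0x0800 (IPv4).
    eth = bytes(6) + bytes(6) + struct.pack("!H", 0x0800)
    return eth + ip + tcp

def parse_frame(frame):
    """Peel each layer off in turn and return the fields used in this post."""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:
        raise ValueError("not an IPv4 frame")
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4                      # IP header length in bytes
    total_length, identification = struct.unpack("!HH", ip[2:6])
    if ip[9] != 6:
        raise ValueError("not a TCP segment")
    tcp = ip[ihl:total_length]
    src_port, dst_port, seq, ack = struct.unpack("!HHII", tcp[:12])
    data_offset = (tcp[12] >> 4) * 4              # TCP header length in bytes
    return {"ip_id": identification, "src_port": src_port,
            "dst_port": dst_port, "seq": seq, "ack": ack,
            "payload": tcp[data_offset:]}

frame = build_frame(b"ldap-request-bytes", ip_id=3201,
                    seq=4167329214, ack=1946363494)
parsed = parse_frame(frame)
print(parsed["ip_id"], parsed["dst_port"], parsed["payload"])
```

The Identification, Sequence, and Acknowledgement values in the usage lines are the ones discussed later in this post; everything else is filler.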
So now we have two traces that show every frame on the wire: one from the perspective of the workstation and one from the perspective of the domain controller. So how do we match a frame in one trace to the corresponding frame in the other trace? We have a couple of frame and packet attributes that can help; the first one we will discuss is the identification number.
The identification number is an attribute of the IP datagram that is sent across the wire. This attribute is as simple as it sounds; it is a 16-bit value that the sender assigns to identify a specific IP datagram. If we expand the IPv4 header information from the above example, we see an attribute named Identification with a value of 3201.
If we look at either the network trace taken on the domain controller or the trace taken on the workstation, there will be a frame with an Identification number of 3201. You can filter both traces for this frame by using the filter IPv4.Identification == 3201.
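If you export both traces, the same matching can be done in a few lines of Python. This is only a sketch: the frame dictionaries and their keys (`number`, `src`, `ip_id`) are a hypothetical export format rather than something Netmon produces as-is, and the addresses, frame numbers, and every Identification value other than 3201 are made up for illustration. Because the Identification field is only 16 bits wide, values repeat in long traces, so the match is keyed on the sender as well.

```python
WORKSTATION = "10.0.0.5"   # hypothetical addresses
DC = "10.0.1.10"

def find_lost_datagrams(sender_trace, receiver_trace, sender_ip):
    """Return frames the sender put on the wire that the receiver never saw."""
    seen = {f["ip_id"] for f in receiver_trace if f["src"] == sender_ip}
    return [f for f in sender_trace
            if f["src"] == sender_ip and f["ip_id"] not in seen]

# Illustrative data only: the request arrives, but one response datagram does not.
dc_trace = [
    {"number": 123, "src": WORKSTATION, "ip_id": 3201},   # request received
    {"number": 124, "src": DC, "ip_id": 20511},           # first part of response
    {"number": 125, "src": DC, "ip_id": 20512},           # second part of response
]
workstation_trace = [
    {"number": 88, "src": WORKSTATION, "ip_id": 3201},    # request sent
    {"number": 89, "src": DC, "ip_id": 20512},            # only the second part arrives
]

lost = find_lost_datagrams(dc_trace, workstation_trace, DC)
print([f["number"] for f in lost])
```

Running the same function in the other direction (workstation as sender) tells you whether the requests themselves are reaching the domain controller.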
Another way to line up the conversation of two traces taken simultaneously is to compare the Sequence and Acknowledgement numbers. These attributes are at the TCP layer instead of IPv4. To view these attributes, expand the TCP header. This is from the same frame as above:
We see that the sequence number sent in this frame is 4167329214, and the acknowledgement number is 1946363494, which is the next byte we expect to receive from our partner in this communication. These numbers can be misleading, because a router can strip and resend at the network layer (IP layer), and the numbering can then be misleading from the IP layer up (in this case, TCP). To align two simultaneous traces, I use the Identification attribute from above, and I use the sequence and acknowledgement numbers to verify dropped and received packets. To learn more about Sequence and Acknowledgement numbers and how TCP works, check out the following KB article:
Explanation of the Three-Way Handshake via TCP/IP
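As a rough illustration of how sequence numbers expose a dropped segment, the Python sketch below walks the segments a receiver saw from one sender and reports any range of bytes that was skipped. The function name `find_gaps` and the segment sizes are assumptions for the example; only the starting value 1946363494 comes from the frame above. The sketch ignores 32-bit sequence number wraparound and does not track retransmissions that later fill a gap.

```python
def find_gaps(segments, expected_seq):
    """segments: (seq, payload_length) pairs in arrival order from one sender.

    Returns a list of (first_missing_seq, missing_byte_count) tuples.
    """
    gaps = []
    for seq, length in segments:
        if seq > expected_seq:
            # The sender skipped ahead: bytes expected_seq..seq-1 never arrived.
            gaps.append((expected_seq, seq - expected_seq))
        # The next byte we expect is the highest one seen so far plus one.
        expected_seq = max(expected_seq, seq + length)
    return gaps

# The workstation acknowledged up to 1946363494, so the response should start there.
# Assumed sizes: a full 1460-byte segment that is lost, then a 312-byte segment.
next_expected = 1946363494
arrived = [(next_expected + 1460, 312)]   # only the second segment shows up

print(find_gaps(arrived, next_expected))
```

A gap at the very start of the response, followed by data that does arrive, is the out-of-sequence pattern to look for in the receiver's trace.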
In comparing these traces, we see a breakdown in the communication. From the frame summary, it would appear to be an LDAP problem. After further analysis, we see that the issue is at the TCP layer. The next two snapshots are from the simultaneous traces, with an explanation under each.
This is from the trace taken on the workstation. At the top we see the frame with identification number 3201, which is our LDAP request. After this we get a strange Kerberos packet. This is actually an out-of-sequence packet that was the last part of the LDAP response from the domain controller. Because it is an incomplete portion of the response, Netmon did not parse the frame correctly and it shows as a Kerberos packet. Beyond that, we see the workstation eventually abandon the LDAP request (frame 133).
This is from the trace taken on the domain controller. At the top we see the frame with identification number 3201, which is our LDAP request. We see that the DC does respond, but it sends two frames of data: the first never makes it to the workstation, and the second (frame 125) does make it to the workstation, as an out-of-sequence packet. After this, the DC never receives an acknowledgement from the workstation, and we see the domain controller resend the missing packet (frames 127 and 128).
So whatever happened to the mysterious disappearing packet? It was dropped by a router because it was larger than the Maximum Transmission Unit (MTU) the router could forward. A router that silently drops such packets, without sending back the ICMP message that tells the sender to use a smaller size, is known as a black hole router. We were able to change the MTU size sent and this resolved our issue.
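A common way to confirm a black hole router is to send pings with the Don't Fragment flag set (on Windows, `ping -f -l <size> <host>`) and shrink the size until replies come back. The Python sketch below shows only the search logic, as a binary search over payload sizes. The `probe` callback is a hypothetical stand-in for whatever actually sends the ping, and `simulated_probe` fakes a path with a given MTU so the example runs without a network. The 28 bytes of overhead are the 20-byte IPv4 header plus the 8-byte ICMP header.

```python
IP_AND_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP echo header

def largest_passing_payload(probe, low=0, high=1472):
    """Binary search for the largest payload size for which probe(size) is True.

    Returns None if even the smallest payload does not get through.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid          # mid got through, so try larger sizes
        else:
            high = mid - 1     # mid was dropped, so try smaller sizes
    return low

def simulated_probe(path_mtu):
    """Pretend network: a payload gets a reply only if the packet fits the path MTU."""
    return lambda payload: payload + IP_AND_ICMP_OVERHEAD <= path_mtu

payload = largest_passing_payload(simulated_probe(1400))
print(payload, payload + IP_AND_ICMP_OVERHEAD)
```

With a simulated path MTU of 1400 the search settles on a 1372-byte payload, which is the largest packet that fits once the headers are added back.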
Even though this blog is AskDS, it is important to understand the networking components used by Directory Services. By using Network Monitor, you can avoid time spent troubleshooting the wrong component.
- Randy Turner