DNS Clients and Timeouts (Part 1)
Some time ago I was helping a customer with a case related to the Windows DNS client. This customer was getting inconsistent name resolution results and he wanted to know why this was happening. The issue that this customer faced was related to the configuration of the DNS servers list in the clients and an incorrect assumption he made about the way that this list is used when a name has to be resolved.
I wanted to start this blog about DNS with a post that tries to clarify some of the concepts related to the use of the DNS servers list and timeouts by the Windows DNS client. In this first part I will describe a sample scenario, a "solution" to a requirement using an incorrect, but common, assumption and the problem with this solution. In the second part I will explain the behavior of the Windows DNS Client when dealing with timeouts.
The sample scenario
The sample environment consists of two Active Directory single-domain forests named contoso.local and fabrikam.local, both in the same location/LAN. These domains are managed by different people, with no trust relationship between them.
- contoso.local has two domain controllers: contosodc1.contoso.local with IP 10.24.8.11 and contosodc2.contoso.local with IP 10.24.8.12. There are also two workstations joined to the domain: client1.contoso.local with IP 10.24.9.101 and client2.contoso.local with IP 10.24.9.105.
- fabrikam.local has two domain controllers: fabrikamdc1.fabrikam.local with IP 10.24.8.31 and fabrikamdc2.fabrikam.local with IP 10.24.8.32. There is a member server in this domain named server1.fabrikam.local with IP 10.24.8.35.
All the DCs are running the DNS Server service with forwarders configured to the ISP. No name resolution is configured between the two domains. None of the internal domains are published on the Internet DNS servers. WINS is not used.
Members in each domain have their local domain's DCs listed as their DNS servers.
The requirement and the "solution"
Users of CLIENT1 and CLIENT2 need to access a share on SERVER1. The administrators decide to create user accounts for these users in fabrikam.local and have them provide these credentials every time they access this share instead of creating a trust relationship. Also, due to political issues between the administrators, neither group wants to forward DNS queries to the other domain or make any change to its DNS servers that would involve contacting the other domain's DNS servers or listing the other domain's servers locally.
Now when the users in CLIENT1 and CLIENT2 need to connect to server1.fabrikam.local they first need to resolve its name to an IP address. This name resolution is not available in the current environment. The "solution" that the administrators in contoso.local decide to use is to list FABRIKAMDC1 and FABRIKAMDC2 as alternate DNS servers in CLIENT1 and CLIENT2 intermixed with their current DNS servers. The configuration of the DNS servers list for these clients is then going to be: CONTOSODC1, FABRIKAMDC1, CONTOSODC2 and FABRIKAMDC2. The desired configuration would look like this:
This solution is based on the (wrong) assumption that if these clients need to resolve server1.fabrikam.local then the secondary DNS server 10.24.8.31 (FABRIKAMDC1) is going to be used to resolve it because the primary DNS server 10.24.8.11 (CONTOSODC1) will not be able to do so.
Does this solution work?
Luckily (I will explain in a little bit why I am using this word) this solution works sometimes, but most of the time it fails. When it fails, the clients get a message that the name of SERVER1 cannot be found.
What is wrong with this solution?
The problem with this solution is the wrong assumption that all the DNS servers in the list are going to be queried until the name is found (positive answer) and that the query will fail only when all of the listed DNS servers answer that the name does not exist (negative answer). In other words: the client will try all the possible DNS servers it has configured until it gets a positive answer before it gives up the query and accepts any negative answer.
So what is the actual behavior of the DNS client?
The actual behavior of the DNS client is that it queries its DNS servers in the order that they are listed until an answer, either positive or negative, is received. Once an answer is received, the DNS client stops the query process and gives that answer back to the calling application. The client retries the query with the next DNS server in the list only when a query to a DNS server times out (or reports a server error). In other words: negative answers do not trigger retries with alternate DNS servers, only timeouts (and other errors) do.
Note #1: in the Windows DNS Client this behavior of querying servers in order changes slightly after 3 attempts to resolve a query time out. I will explain this behavior in part 2 of this post.
Note #2: the calling application might have some logic to retry queries that receive a negative response, but this would be outside the knowledge of the DNS Client process.
Think of this concept as: the DNS client is looking for an answer, not for a positive answer. Once the client gets an answer it "trusts" that the DNS server that sent it did its best to get the correct answer and then it stops querying other servers.
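The rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the code of the Windows DNS Client: the names Answer, resolve and fake_lookup are made up for this example, a negative answer is modeled as an address of None, and the order change mentioned in Note #1 is ignored.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Answer:
    server: str             # server that produced the answer
    address: Optional[str]  # None models a negative answer (name error)


# A lookup returns an Answer, or raises TimeoutError when the server is too slow.
Lookup = Callable[[str, str], Answer]


def resolve(name: str, servers: Sequence[str], lookup: Lookup) -> Optional[Answer]:
    """Query servers in list order and stop at the first answer of any kind."""
    for server in servers:
        try:
            # Positive or negative, an answer ends the search.
            return lookup(server, name)
        except TimeoutError:
            # Only a timeout moves the client to the next server.
            continue
    return None  # every server timed out


def fake_lookup(server: str, name: str) -> Answer:
    """Hypothetical stand-in for the network: CONTOSODC1 answers quickly."""
    if server == "CONTOSODC1":
        return Answer(server, None)          # fast negative answer
    if server == "FABRIKAMDC1":
        return Answer(server, "10.24.8.35")  # would have been positive
    raise TimeoutError(server)


servers = ["CONTOSODC1", "FABRIKAMDC1", "CONTOSODC2", "FABRIKAMDC2"]
result = resolve("server1.fabrikam.local", servers, fake_lookup)
print(result)
```

In this run the client never reaches FABRIKAMDC1, because CONTOSODC1 already gave it an answer, even though that answer is negative.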
In a DNS infrastructure any server should be able to resolve a name as long as the name exists in the DNS namespace. This means that it does not matter which DNS server you query because you will eventually get the right answer (positive or negative). Multiple DNS servers are configured for fault tolerance, not because they have access to disjoint DNS namespaces.
Note #3: a similar misconception exists about the use of multiple forwarders, but this will be the topic of another blog post.
So why does this solution work sometimes "by luck"?
This solution works when the query asking for server1.fabrikam.local that is sent to CONTOSODC1 times out and is retried by the client against the secondary server FABRIKAMDC1. In this case FABRIKAMDC1 will be able to immediately answer with the IP of SERVER1.
What could cause CONTOSODC1 to not answer before the client times out? CONTOSODC1 has to rely on its forwarders to resolve names in fabrikam.local (remember that CONTOSODC1 does not have a copy of the zone fabrikam.local or any resolution to that domain). The forwarders could be slow to respond, the link to the ISP could be congested, the forwarders could be wrongly configured, the forwarder timeouts in CONTOSODC1 could be set to high values, and so on.
This solution breaks when another client sends the same query to CONTOSODC1 shortly after the first query was resolved. This is because CONTOSODC1 would have then already cached a negative response for the name and can immediately answer this query from its cache without contacting the forwarders. The solution also breaks when the name is not in the cache of CONTOSODC1 but the process of querying the forwarders allows it to answer (with a negative response) before the client times out and switches to FABRIKAMDC1.
In the end, for this "solution" to work the client has to be "lucky enough" to be querying CONTOSODC1 when the name is not already in the negative cache of this DNS server and the time to get a response back from the forwarders is higher than the client timeout.
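The two conditions for getting "lucky" can be written as a small predicate. This is a sketch under assumptions: the function and parameter names are invented for illustration, and the 1 second default is the client timeout observed in the traces later in this post.

```python
def solution_works(negative_cached: bool,
                   forwarder_delay_s: float,
                   client_timeout_s: float = 1.0) -> bool:
    """Return True when the client falls through to FABRIKAMDC1."""
    if negative_cached:
        # CONTOSODC1 answers at once from its negative cache: no timeout, no retry.
        return False
    # Otherwise the client moves on only if the forwarders are slower than its timeout.
    return forwarder_delay_s > client_timeout_s


print(solution_works(negative_cached=False, forwarder_delay_s=3.0))  # slow forwarders
print(solution_works(negative_cached=True, forwarder_delay_s=3.0))   # cached negative
print(solution_works(negative_cached=False, forwarder_delay_s=0.2))  # fast forwarders
```

Only the first call returns True; in the other two cases CONTOSODC1 delivers its negative answer in time and the client stops there.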
If you are still wondering why CONTOSODC1 will always answer with a negative response for queries in fabrikam.local then you have to remember that neither CONTOSODC1 nor any of the Internet DNS servers have any knowledge of domain fabrikam.local, so none of them have any means to positively resolve names in this domain.
Can you show me some traces?
Here are some network traces that show the behavior that I just described:
In the first trace we have the original configuration of CLIENT1 listing only the DNS servers of its own domain and trying to resolve the name server1.fabrikam.local:
As expected, both DNS servers answered with a negative response (name error) in packets #3 and #4 because neither of them can resolve the name. Notice that CLIENT1 resent the query to CONTOSODC2 in packet #2 because the query to CONTOSODC1 in packet #1 timed out after 1 second (I will explain this timeout value in Part 2). This timeout was probably due to CONTOSODC1 waiting to get an answer from its forwarders, but we can see that it eventually sent back an answer in packet #3 (the same logic applies to the delayed response from CONTOSODC2 in packet #4). We will use the fact that CONTOSODC1 takes longer than the client timeout to answer to analyze the next traces.
The next trace shows what happens when we configure CLIENT1 to use the DNS servers of fabrikam.local as alternate DNS servers using the "solution" we previously discussed (assume that the DNS Server service cache in CONTOSODC1 is empty at this point):
Notice how CLIENT1 first tried to resolve the name using CONTOSODC1 in packet #1; after 1 second it timed out and retried the query using FABRIKAMDC1 in packet #2, which answered almost instantaneously in packet #3. CLIENT1 then got a positive answer indicating the IP address 10.24.8.35 for server1.fabrikam.local (notice the success answer containing this IP in packet #3). Also notice that some seconds later we have packet #4, which is the (delayed) response from CONTOSODC1 indicating that the name cannot be found, but at this point CLIENT1 has already resolved the name and does not need to act on this answer.
Now what happens if CLIENT2 (who has the DNS servers list configured exactly as CLIENT1) tries to resolve the same name just after CLIENT1 resolved the name as in the previous trace (packets #1 to #4 from the previous trace are included to reinforce the sequence of events):
Notice how in packet #5 CLIENT2 sends the query to its primary DNS server CONTOSODC1. CONTOSODC1 has already cached that the name does not exist (due to the query that CLIENT1 previously sent) and can almost instantaneously answer back to CLIENT2 with a negative answer in packet #6. Notice how CLIENT2 does not retry the query using its secondary DNS server FABRIKAMDC1 because it already got an answer from the primary server.
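The sequence in the last two traces can be reproduced with a toy simulation. Everything here is a hypothetical model written for this post: the CachingServer class, the client_resolve function and the 3 second forwarder delay are assumptions, and the negative answer is cached at query time to keep the code short.

```python
from typing import Dict, List, Optional, Tuple

CLIENT_TIMEOUT_S = 1.0  # per-server client timeout seen in the traces


class CachingServer:
    """Toy DNS server: slow on a cache miss, instant on a cache hit."""

    def __init__(self, name: str, zone: Dict[str, str], miss_delay_s: float):
        self.name = name
        self.zone = zone                  # names this server resolves positively
        self.miss_delay_s = miss_delay_s  # assumed time spent waiting on forwarders
        self.cache: Dict[str, Optional[str]] = {}

    def query(self, qname: str) -> Tuple[float, Optional[str]]:
        """Return (delay in seconds, address or None for a negative answer)."""
        if qname in self.zone:
            return 0.0, self.zone[qname]
        if qname in self.cache:
            return 0.0, self.cache[qname]  # answered from the negative cache
        self.cache[qname] = None           # remember the negative answer
        return self.miss_delay_s, None


def client_resolve(qname: str,
                   servers: List[CachingServer]) -> Tuple[Optional[str], Optional[str]]:
    """Return (answering server, address); a late answer counts as a timeout."""
    for server in servers:
        delay, address = server.query(qname)
        if delay > CLIENT_TIMEOUT_S:
            continue  # timed out: try the next server in the list
        return server.name, address
    return None, None


contosodc1 = CachingServer("CONTOSODC1", {}, miss_delay_s=3.0)
fabrikamdc1 = CachingServer(
    "FABRIKAMDC1", {"server1.fabrikam.local": "10.24.8.35"}, miss_delay_s=3.0
)
dns_list = [contosodc1, fabrikamdc1]

client1 = client_resolve("server1.fabrikam.local", dns_list)  # cache empty
client2 = client_resolve("server1.fabrikam.local", dns_list)  # negative answer cached
print(client1)
print(client2)
```

The first call ends at FABRIKAMDC1 with the address of SERVER1, while the second call stops at CONTOSODC1 with a negative answer, matching what CLIENT1 and CLIENT2 experienced.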
Note #4: in a future post I will blog about the use of the positive and negative caches in Microsoft DNS Server.
In the next part of this post I will explain how the Windows DNS client deals with timeouts and its retry logic.
I hope that you find this information useful. Please provide your feedback about the content in the comments section. I have several ideas for the next posts and I would like to hear your comments about topics that might be interesting for you.