DNS Clients and Timeouts (part 2)

In the first part of this blog post I described the behavior of the DNS client when there are multiple entries in the DNS servers list. In this second part I will try to explain how the Windows DNS Client works when dealing with timeouts and retries.

Note: From now on, when I refer to the DNS Client I mean the Windows implementation of this service. I will also highlight the differences between Windows versions where applicable.

How long is the timeout and how many times do we retry a query?

The DNS client does not use a single fixed timeout for every attempt to resolve a query. Instead, it uses an array in which each element defines the timeout for one attempt, so the timeout is not necessarily the same from one attempt to the next. The query is tried once per element of the array, and the process stops when there are no more elements in the array.

For example, suppose that the Timeout array has these values: [1, 1, 2, 4, 4]. The first attempt to resolve a query will time out after 1 second (first element of the array), the second attempt will time out after 1 second (second element of the array), the third after 2 seconds, the fourth after 4 seconds, and so on. As there are 5 elements in the array, we will stop retrying after 5 attempts.
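
The walk over the array can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the DNS client's actual code; the function name attempt_schedule is made up for this example.

```python
def attempt_schedule(timeouts):
    """Return (attempt number, start offset, timeout) tuples, all in seconds.

    One attempt is made per element of the array, so the number of
    attempts equals len(timeouts).
    """
    schedule = []
    elapsed = 0
    for attempt, timeout in enumerate(timeouts, start=1):
        schedule.append((attempt, elapsed, timeout))
        elapsed += timeout  # the next attempt starts when this one times out
    return schedule


for attempt, start, timeout in attempt_schedule([1, 1, 2, 4, 4]):
    print(f"attempt {attempt}: starts at {start}s, waits up to {timeout}s")
```

The start offsets assume that every attempt runs until its timeout expires, which is the worst case.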

The pre-defined timeout arrays for the currently supported versions of Windows are (values shown are in seconds):

                                       Timeout
OS                                     [0]  [1]  [2]  [3]  [4]
Windows XP                             1    1    2    4    7
Windows Server 2003                    1    1    2    4    4
Windows Vista & Windows Server 2008    1    1    2    4    4
Windows 7 & Windows Server 2008 R2     1    1    2    4    4

These timeouts can be customized using the registry value HKLM\System\CurrentControlSet\Services\dnscache\Parameters\DNSQueryTimeouts. This value does not exist by default, in which case the pre-defined array just mentioned is used. If the value is defined, it should have a type of REG_MULTI_SZ (multi-line string), with each line containing one value of the array and the last line containing a 0 to indicate the end of the list.

If the value in any line is higher than 30, then a value of 30 is used for that line instead. If the sum of all the values is higher than 120 (2 minutes), the list is shortened from the bottom up, removing values until the total timeout is less than 120.
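
As a rough sketch of these rules, the following Python function takes the lines of such a multi-line value and returns the array that would end up being used. The function and constant names are hypothetical, and the code makes one assumption that the description leaves open: it stops trimming as soon as the total is no longer above 120, so a total of exactly 120 is kept. It also does no validation of malformed lines.

```python
MAX_SINGLE_TIMEOUT = 30   # seconds, cap applied to each line
MAX_TOTAL_TIMEOUT = 120   # seconds, cap applied to the sum of all lines


def normalize_timeouts(lines):
    """Turn the strings of a REG_MULTI_SZ value into a usable timeout array."""
    timeouts = []
    for line in lines:
        value = int(line)
        if value == 0:      # a line containing 0 marks the end of the list
            break
        timeouts.append(min(value, MAX_SINGLE_TIMEOUT))
    # Drop entries from the bottom until the total fits under the cap
    # (assumption: a total of exactly 120 is acceptable).
    while timeouts and sum(timeouts) > MAX_TOTAL_TIMEOUT:
        timeouts.pop()
    return timeouts


print(normalize_timeouts(["1", "1", "2", "4", "4", "0"]))
print(normalize_timeouts(["45", "30", "30", "30", "30", "0"]))
```

In the second call the 45 is first capped to 30, and then the last entry is dropped because five entries of 30 add up to 150.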

Which DNS Servers do we query in each attempt?

The DNS Client queries the following DNS servers in each attempt:

  1. In the first attempt we query the preferred DNS server of the preferred network adapter only. The preferred DNS server is the first server listed for that adapter. The adapters are sorted based on their binding order and the preferred adapter is the one at the top of the list. You can change this binding order by opening ncpa.cpl and going to the menu Advanced/Advanced Settings…

    Adapters that are disabled, disconnected, do not have TCP/IP enabled or have no DNS servers listed are ignored.
    In this attempt we query one DNS server only.

  2. If the previous attempt times out, retry the query with the next best DNS server for all the adapters. The next best DNS server for an adapter is the next one on its list that has not already been queried and timed out.
    Note that the lists are managed as circular lists: once we reach the end of the list for an adapter, the next best server for that adapter is the first on its list.
    In this attempt we query one DNS server per adapter.

  3. If the previous attempt times out (because none of the DNS servers queried in the previous step answered in the expected time), retry the query with the next best DNS server for all the adapters.
    In this attempt we query one DNS server per adapter.

  4. If the previous attempt times out, retry the query with all the possible DNS servers in all the adapters. This includes even servers that have already timed out in the previous steps.
    In this attempt we query all the DNS servers in all the adapters.

  5. Repeat step (4) until we have run out of attempts. If there are no more attempts, return an error to the caller.
    In each of these attempts we query all the DNS servers in all the adapters.

It is important to clarify something for steps 2 to 5, as multiple servers are queried in those steps: as soon as one of the servers queried in an attempt responds, with either a positive or a negative answer, the query is considered resolved. It does not matter if the other servers queried do not respond, because we already have the answer we wanted.
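
The "first answer wins" rule can be illustrated with a small asyncio sketch. Nothing here talks to a real DNS server: fake_query is a made-up stand-in that simply replies after a delay, and first_answer is a hypothetical helper name. The point is only to show that one reply, positive or negative, ends the attempt, and that the attempt times out when nobody replies in time.

```python
import asyncio


async def fake_query(server, delay, answer):
    """Stand-in for a real DNS query: replies with `answer` after `delay` seconds."""
    await asyncio.sleep(delay)
    return server, answer


async def first_answer(queries, timeout):
    """Return the first (server, answer) to arrive, or None if the attempt times out."""
    tasks = [asyncio.create_task(q) for q in queries]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:        # the remaining servers no longer matter
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    if not done:
        return None             # nobody answered in time: the attempt timed out
    return next(iter(done)).result()


async def demo():
    queries = [
        fake_query("10.110.1.2", 1.0, "192.0.2.10"),
        fake_query("10.120.1.1", 0.01, "NXDOMAIN"),   # a negative answer also counts
    ]
    return await first_answer(queries, timeout=0.2)


print(asyncio.run(demo()))
```

In the demo the second server answers first with a negative response, so the attempt ends without waiting for the first server.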

Using the information about the default Timeout array and the retry logic that was just described, we can see that each attempt will time out, by default, after:

  • First attempt (step 1): Preferred DNS Server on the preferred adapter: times out after 1 second
  • Second attempt (step 2): Next best server for all the adapters: times out after 1 second
  • Third attempt (step 3): Next best server for all the adapters: times out after 2 seconds
  • Fourth attempt (step 4): All DNS servers in all the adapters: times out after 4 seconds
  • Fifth attempt (step 5): Repeat of step (4): times out after 4 seconds (7 seconds in Windows XP)

After the fifth attempt times out there are no more elements in the array to use, so we stop the query and return an error to the caller. As our default Timeout array has 5 elements, we try to resolve the query in at most 5 attempts. The total waiting time before we give up on the query is 12 seconds (15 seconds in XP), which is the sum of all the values in the array.
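
The totals are easy to double-check from the default arrays listed earlier. The dictionary below is just a convenient way to hold those values for this check; its name is made up.

```python
DEFAULT_TIMEOUTS = {
    "Windows XP": [1, 1, 2, 4, 7],
    "Windows Server 2003": [1, 1, 2, 4, 4],
    "Windows Vista & Windows Server 2008": [1, 1, 2, 4, 4],
    "Windows 7 & Windows Server 2008 R2": [1, 1, 2, 4, 4],
}

for os_name, timeouts in DEFAULT_TIMEOUTS.items():
    # Worst case: every attempt runs until its timeout expires.
    print(f"{os_name}: {len(timeouts)} attempts, gives up after {sum(timeouts)}s")
```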

Important note: the DNS servers list is kept in memory by the dnscache service. The next best server is determined based on a priority. All the servers start with the same priority, and for each adapter they are sorted in the order in which they were configured. Each time a server times out its priority is reduced, and when a server answers its priority is boosted (error conditions also modify the priority of a server). The next best server for an adapter is the one with the highest priority; if more than one server has the same priority, the next best is the one that is higher in the configured order.
It is important to note that this prioritized list is kept across different queries: the priorities are not reset after each query, but are reused. The idea is that if a server timed out on a recent query, the next query will go first to another server with a higher priority. The effect of this is that the preferred DNS server might not be the first to get the next query if it recently timed out.
These priorities are reset to the initial default values after an interval named ServerPriorityTimeLimit, which is defined in the registry. See http://support.microsoft.com/kb/320760 for more information about this value.
An example of this behavior is a client pointing to two DNS servers: DNS1 and DNS2. The client tries to resolve a name, and DNS1 times out but DNS2 answers. The next query that this client tries to resolve goes to DNS2 first, before being retried against DNS1, because DNS2 now has a higher priority than DNS1.
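
A toy model of this bookkeeping for a single adapter might look like the following. The class and method names are invented for the example, and the size of the penalty and the boost (one point each) is purely an assumption; the real values used by dnscache are not described here.

```python
class ServerList:
    """Toy model of one adapter's DNS servers, ranked by priority."""

    def __init__(self, servers):
        self.order = list(servers)                   # order in which they were configured
        self.priority = {s: 0 for s in servers}      # everyone starts with the same priority

    def record_timeout(self, server):
        self.priority[server] -= 1                   # assumed penalty size

    def record_answer(self, server):
        self.priority[server] += 1                   # assumed boost size

    def best(self):
        # Highest priority wins; ties go to the server configured first.
        return max(self.order, key=lambda s: (self.priority[s], -self.order.index(s)))

    def reset(self):
        """Model of what happens once the ServerPriorityTimeLimit interval expires."""
        for server in self.priority:
            self.priority[server] = 0


servers = ServerList(["DNS1", "DNS2"])
print(servers.best())            # DNS1: equal priorities, first in the configured order
servers.record_timeout("DNS1")
servers.record_answer("DNS2")
print(servers.best())            # DNS2: it now has the higher priority
servers.reset()
print(servers.best())            # DNS1 again after the reset
```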

Making sense of all this information with an example

Suppose we have a computer named CLIENT1 running Windows Server 2003 that has 4 NICs with the binding order NIC1, NIC2, NIC3 and NIC4. The DNS servers list in CLIENT1 is:

  NIC1            NIC2            NIC3            NIC4
→ 10.110.1.1    → 10.120.1.1    → 10.130.1.1    → 10.140.1.1
  10.110.1.2                      10.130.1.2      10.140.1.2
  10.110.1.3                      10.130.1.3
  10.110.1.4

The next best server for each adapter is indicated by the “→” symbol. As we are just starting, the next best DNS server for each adapter is the first in its list. Instead of showing how the priorities are modified after each timeout, we are going to use a circular list to select the next best server (the effect is the same, and the example is easier to understand).

We also have a Timeout array in CLIENT1 that looks like this:

Timeout
→ 1
  1
  2
  4
  4

The “→” symbol indicates the value to use as the timeout for our next attempt. As we are just starting to resolve a query, we are at the first element of the array.

A process in CLIENT1 needs to resolve a name. Assume that none of the configured DNS servers are reachable, so all the attempts time out.

  1. First attempt: Send query to 10.110.1.1 (best for NIC1, which is the preferred adapter). Wait at most 1s (the current value in the Timeout array) for an answer.
    After this attempt times out, our DNS table and Timeout array will look like this:

      NIC1            NIC2            NIC3            NIC4
      10.110.1.1    → 10.120.1.1    → 10.130.1.1    → 10.140.1.1
    → 10.110.1.2                      10.130.1.2      10.140.1.2
      10.110.1.3                      10.130.1.3
      10.110.1.4

    Timeout
      1
    → 1
      2
      4
      4

  2. Second attempt (or first retry): Send query to 10.110.1.2 (next best for NIC1), 10.120.1.1 (next best for NIC2), 10.130.1.1 (next best for NIC3) and 10.140.1.1 (next best for NIC4). Wait at most 1s (the current value in the Timeout array) for an answer from any of the servers queried.
    After this attempt times out, the DNS table and Timeout array will look like this:

      NIC1            NIC2            NIC3            NIC4
      10.110.1.1    → 10.120.1.1      10.130.1.1      10.140.1.1
      10.110.1.2                    → 10.130.1.2    → 10.140.1.2
    → 10.110.1.3                      10.130.1.3
      10.110.1.4

    Timeout
      1
      1
    → 2
      4
      4

  3. Third attempt: Send query to 10.110.1.3, 10.120.1.1 (NIC2 has just one DNS server listed, so this server is always the best server for it), 10.130.1.2 and 10.140.1.2. Wait at most 2s for an answer from any of the servers queried.
    After this attempt times out, the DNS table and Timeout array will look like this:

      NIC1            NIC2            NIC3            NIC4
      10.110.1.1    → 10.120.1.1      10.130.1.1    → 10.140.1.1
      10.110.1.2                      10.130.1.2      10.140.1.2
      10.110.1.3                    → 10.130.1.3
    → 10.110.1.4

    Timeout
      1
      1
      2
    → 4
      4

  4. Fourth attempt: Send query to all DNS servers in all the adapters (including those that timed out in previous attempts). Wait at most 4s for an answer from any of the servers queried.
    At the end of the waiting time for this attempt, the Timeout array will look like this (the DNS list table is not included as we will not use it again in this example):

    Timeout
      1
      1
      2
      4
    → 4

  5. Fifth attempt: Send query to all DNS servers in all the adapters. Wait at most 4s for an answer from any of the servers queried.
    After this attempt times out we have run out of values in the Timeout array, so we give up and return an error to the caller.
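
The whole walkthrough can be reproduced with a short simulation. Like the example itself, this sketch uses a circular cursor per adapter instead of real priorities and assumes that no server ever answers. The function name simulate and the data layout are my own choices for illustration, not anything taken from the DNS client.

```python
ADAPTERS = {  # binding order is the insertion order of the dict
    "NIC1": ["10.110.1.1", "10.110.1.2", "10.110.1.3", "10.110.1.4"],
    "NIC2": ["10.120.1.1"],
    "NIC3": ["10.130.1.1", "10.130.1.2", "10.130.1.3"],
    "NIC4": ["10.140.1.1", "10.140.1.2"],
}
TIMEOUTS = [1, 1, 2, 4, 4]


def simulate(adapters, timeouts):
    """Return (timeout, servers queried) per attempt, assuming nothing ever answers."""
    cursor = {nic: 0 for nic in adapters}   # index of each adapter's next best server
    preferred = next(iter(adapters))        # first adapter in the binding order
    plan = []
    for attempt, timeout in enumerate(timeouts, start=1):
        if attempt >= 4:
            # Attempts 4 and later: every server on every adapter.
            queried = [s for servers in adapters.values() for s in servers]
        else:
            # Attempt 1: preferred adapter only. Attempts 2 and 3: one server per adapter.
            nics = [preferred] if attempt == 1 else list(adapters)
            queried = []
            for nic in nics:
                servers = adapters[nic]
                queried.append(servers[cursor[nic]])
                cursor[nic] = (cursor[nic] + 1) % len(servers)   # circular list
        plan.append((timeout, queried))
    return plan


for number, (timeout, queried) in enumerate(simulate(ADAPTERS, TIMEOUTS), start=1):
    print(f"attempt {number}: wait {timeout}s, {len(queried)} server(s): {', '.join(queried)}")
```

The number of servers per attempt comes out as 1, 4, 4, 10 and 10, which is also the number of query frames you would expect to see for each attempt in a network trace.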

Where is the Network Trace?

You can see the behavior of the previous example in a network trace:

[Image: network trace of the five attempts]

  • Frame #3 shows the first attempt: preferred DNS server of the preferred adapter. We use one frame only for this attempt.
  • Frames #9 to #15 show the second attempt: next best DNS server in all the adapters. Notice how these frames have a time delta of 1s after the first attempt. We have 4 frames because we are querying 1 DNS server for each adapter and we have 4 NICs.
  • Frames #17 to #23 show the third attempt: next best DNS server in all the adapters. Notice how these frames have a time delta of 1s after the second attempt. We have 4 frames again because we are querying 1 DNS server for each adapter.
  • Frames #25 to #43 show the fourth attempt: all the DNS servers in all the adapters. Notice how these frames have a time delta of 2s after the third attempt. We have 10 frames because we are querying all the DNS servers in all the adapters, and we have a total of 10 servers to query: 4 for NIC1 + 1 for NIC2 + 3 for NIC3 + 2 for NIC4.
  • Frames #45 to #63 show the fifth attempt: all the DNS servers in all the adapters. Notice how these frames have a time delta of 4s after the fourth attempt. We have 10 frames for this attempt too.
  • There are no more frames because we make only 5 attempts (remember that the Timeout array has 5 elements by default). Once the fifth attempt times out, which takes 4s (the last value in the Timeout array), we return an error to the caller.

Conclusion

Hopefully after reading this two-part blog post you have a better understanding of how the Windows DNS client works and the logic it follows when it deals with timeouts.

Based on this information, keep in mind these best practices:

  1. Configure the clients to point to more than one DNS server for fault tolerance. Do not list multiple servers as a way to work around disjoint DNS namespaces; if you are going to do so anyway, understand the risks and consequences.
  2. Try to have the DNS list in the clients ordered based on the “closeness” (in network terms) to the DNS servers to avoid retries due to timeouts.
  3. Try to have clients use DNS servers that have the information that they are going to query more often; in the case of domain members these would be the DNS servers that have the client domain’s zone.
  4. Maintain an internal DNS infrastructure and hierarchy where names can be resolved regardless of which internal DNS server is queried. For DNS implementations that support multi-domain AD environments, make sure that any DNS server can resolve any name, no matter the domain where the name is registered.
    Note: saying that any DNS server can resolve any name does not mean that all of them hold a copy of all the zones in the environment. It means that all of them have a way to find the name in the DNS hierarchy because the forwarders, stub zones, delegations and secondary zones are properly configured.