The Case of the Mysterious Exchange Server Hang
Recently we had a case in which an Exchange 2003 server would hang and no longer accept any new RPC connections to the Information Store. The rest of the server seemed to be operating just fine, but it was the Store that was ultimately having the problems.
I took a look at the perfmon data that was provided and didn’t see anything out of the ordinary other than a small number of RPC operations taking place on the server. The server did appear to be processing data, though, which made this quite intriguing. I did notice that one DC had a number of outstanding LDAP requests for an extended period of time.
We ended up taking some dumps of the Store, the IIS processes, and LSASS to see what might be going on. The Store and IIS dumps were not that interesting, but looking at the LSASS dumps was an eye opener. We saw that over 150 threads were hung up in secur32!CallSPM, calling into the Security Provider Manager (SPM). The base of the stacks showed secur32!LsaAcceptSecurityContext calls, which were mostly client authentication calls to the server. More info on the AcceptSecurityContext calls can be found here
There were almost 200 other threads calling netlogon!NlpUserValidateHigher, which essentially means that we were trying to send a user validation request to a higher authority over the secure channel. Once we accept this validation request, we attempt to connect to the DC over RPC to handle it. Debug analysis can be found on Dave Goldman’s blog here.
By default, Netlogon only allows 2 concurrent API calls for these authentication requests, a limit enforced by a semaphore and controlled by the MaxConcurrentApi setting. If both semaphore slots are tied up waiting for a response from the DC, all other requests start queuing, producing this mysterious hang effect on the Exchange server. This was our problem: the debug analysis showed that we had hit our maximum of 2 concurrent requests, most likely against an overloaded DC, leaving a backlog of authentication requests. Each request has a default timeout of 45 seconds, so requests that time out are surely going to cause delays for other users. On healthy servers with good network connectivity, these authentication requests are extremely fast.
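To make the queuing behavior concrete, here is a small, purely illustrative Python sketch (not Netlogon’s actual implementation) that models the MaxConcurrentApi semaphore with scaled-down timings. The "DC" takes longer to answer than callers are willing to wait for a slot, so most of the simulated logons fail:

```python
import threading
import time

# Hypothetical model of the MaxConcurrentApi semaphore: each "logon" must
# acquire one of 2 slots before calling the DC. If no slot frees up within
# the timeout, the logon fails, mirroring STATUS_NO_LOGON_SERVERS.
MAX_CONCURRENT_API = 2     # Netlogon default
SLOT_TIMEOUT = 0.05        # scaled stand-in for the 45-second timeout
DC_RESPONSE_TIME = 0.5     # scaled stand-in for a slow, overloaded DC

slots = threading.BoundedSemaphore(MAX_CONCURRENT_API)
results = []
lock = threading.Lock()

def logon(user):
    # Wait for a free slot; give up after SLOT_TIMEOUT, like NlAllocateClientApi.
    if not slots.acquire(timeout=SLOT_TIMEOUT):
        with lock:
            results.append((user, "0xC000005E"))  # can't allocate a Client API slot
        return
    try:
        time.sleep(DC_RESPONSE_TIME)              # the round trip to the slow DC
        with lock:
            results.append((user, "0x0"))
    finally:
        slots.release()

threads = [threading.Thread(target=logon, args=(f"User{i}",)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

failures = sum(1 for _, status in results if status != "0x0")
print(f"{failures} of {len(results)} logons failed waiting for a slot")
```

Only the first two callers get slots; the other four time out waiting, which is exactly the backlog pattern the debug analysis showed.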
At this point, we knew that we were tied up in authentication calls to DCs, but we couldn’t tell which users’ logons were taking the longest; the debug information only shows a point in time. It was possible that a user was trying to authenticate to a down-level domain across a slow WAN link, but we couldn’t be sure yet.
To move forward, we enabled Netlogon debug logging (nltest /dbflag:0x2080ffff, as described in http://support.microsoft.com/kb/109626) and let the problem occur again.
We opened the netlogon.log file and started reviewing the information. Prior to the problem, we can see that responses are returning in a timely manner; notice that the Entered and Returns entries happen within the same second:
07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Returns 0x0
07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0
07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0
As traffic increases, the response times start to get slower, with Entered-to-Returns gaps of around 20 seconds:
07/31 11:53:56 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:54:14 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0
07/31 11:53:57 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:54:17 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0
Now we see a response time right at the 45-second timeout below:
07/31 11:57:02 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:57:47 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0
Here is where our first Netlogon timeout hit:
07/31 11:57:03 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:57:48 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 11:57:48 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
07/31 11:57:48 [SESSION] I_NetLogonGetAuthData called: (null) EXDOMAIN(Flags 0x1)
07/31 11:57:48 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0xC000005E
0xC000005E = STATUS_NO_LOGON_SERVERS
Now we are seeing that we cannot allocate a Client API slot because all of the slots in the request queue are busy servicing other requests:
07/31 11:58:55 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 11:58:55 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
07/31 12:38:08 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 12:38:08 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
Now we get to an actual DC timeout error, as shown below.
08/01 17:21:24 [CRITICAL] NlPrintRpcDebug: Couldn't get EEInfo for I_NetLogonSamLogonEx: 1761 (may be legitimate for 0xc0000064)
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: timeout call to \\DC1.domain.com. Count: 2
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: dropping the session to \\DC1.domain.com
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlSetStatusClientSession: Set connection status to c000005e
We can now see clearly that DC1 was having problems servicing authentication requests from this Exchange server. This does not always mean that the DC itself is overloaded; it could be a very slow down-level trust over the network causing the problem, so additional investigation needs to be performed at this point. All we know so far is that Exchange is the victim and the root of the problem lies elsewhere.
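The Entered/Returns pairs above can also be mined automatically. Below is a rough Python sketch of a netlogon.log latency parser; the line format is inferred solely from the excerpts above, and real logs contain many other event types, so treat it as a starting point rather than a complete tool:

```python
import re
from datetime import datetime

# Match the [LOGON] SamLogon lines seen in the excerpts above. Assumption:
# every Entered line is eventually followed by a Returns line for the same
# account/computer pair; other log lines are ignored.
LINE = re.compile(
    r"(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) \[LOGON\] SamLogon: "
    r"Network logon of (\S+) from (\S+) (Entered|Returns)(?: (\S+))?"
)

def logon_latencies(lines):
    """Yield (account, computer, seconds, status) for each Entered/Returns pair."""
    pending = {}
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        date, clock, account, computer, event, status = m.groups()
        ts = datetime.strptime(f"{date} {clock}", "%m/%d %H:%M:%S")
        key = (account, computer)
        if event == "Entered":
            pending[key] = ts
        elif key in pending:
            yield account, computer, (ts - pending.pop(key)).total_seconds(), status

sample = [
    "07/31 11:57:02 [LOGON] SamLogon: Network logon of EUROPE\\User2 from COMPUTER2 Entered",
    "07/31 11:57:47 [LOGON] SamLogon: Network logon of EUROPE\\User2 from COMPUTER2 Returns 0x0",
]
for account, computer, seconds, status in logon_latencies(sample):
    print(f"{account} from {computer}: {seconds:.0f}s ({status})")
```

Sorting the output by the latency column quickly surfaces the users and domains whose logons approach the 45-second timeout.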
So what can we do at this point?
We can test secure channels for different domains to see which ones might be failing. First, obtain the DC with which the Exchange server currently has its secure channel formed by running nltest /sc_query:&lt;domain&gt;, replacing &lt;domain&gt; with the name of the domain where the Exchange server resides.
Once that DC is found, you will then run a command similar to the following for each of the domains:
nltest /server:DC1 /sc_query:ASIA
nltest /server:DC1 /sc_query:EUROPE
nltest /server:DC1 /sc_query:US
This will help fish out any down-level domains that could be causing authentication delays.
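If there are many domains to check, the nltest runs above can be scripted. The sketch below is a hypothetical Python wrapper; the NERR_Success marker it looks for is an assumption about typical nltest /sc_query output, so verify it against your own output first:

```python
import subprocess

def sc_query_command(dc, domain):
    # Build the same command line shown above, e.g.
    # nltest /server:DC1 /sc_query:ASIA
    return ["nltest", f"/server:{dc}", f"/sc_query:{domain}"]

def channel_ok(nltest_output):
    # Assumption: a healthy secure channel reports NERR_Success and
    # "The command completed successfully" in its output.
    return ("The command completed successfully" in nltest_output
            and "Status = 0 0x0 NERR_Success" in nltest_output)

def check_domains(dc, domains, run=lambda cmd: subprocess.run(
        cmd, capture_output=True, text=True).stdout):
    """Return {domain: True/False} for each domain's secure channel health."""
    return {d: channel_ok(run(sc_query_command(dc, d))) for d in domains}
```

Running check_domains("DC1", ["ASIA", "EUROPE", "US"]) on the box holding the secure channel would flag any trust whose channel query fails.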
You can also enable netlogon debug logging on the DC’s to help understand the traffic patterns there.
Installing the Server Performance Advisor on Windows 2003 DCs, or using the Active Directory Diagnostics data collector set in the Windows 2008 Reliability and Performance Monitor, will help uncover any potential bottlenecks.
Take netmon captures and search for NetrLogonSamLogonEx entries to find the Netlogon requests.
For Windows 2003 servers, you can install the following hotfix to help track these types of issues faster. The hotfix adds new performance counters that make it easier to monitor access to these semaphores; Windows 2008 servers already have this built into the OS.
New performance counters for Windows Server 2003 let you monitor the performance of Netlogon authentication
The main ones you want to look at are the following:
Semaphore Holders: The average number of threads holding the client semaphore.
These are the threads that currently hold a slot while making a Netlogon call to a DC. A holder can be blocked because the DC is slow to respond, the network is down, and so on; when Semaphore Waiters is nonzero at the same time, some local process is waiting on LSASS for a response and the LSASS thread itself is blocked. This count correlates directly to the MaxConcurrentApi setting.
By default this value should be less than 2 at any given time. If it is sustained at the maximum of 2, then either the Exchange server or the DCs are overloaded.
Average Semaphore Hold Time: The average time a thread holds the semaphore, i.e., how long each authentication call to the DC takes.
These values should normally be very low. Longer hold times mean that a potential bottleneck is developing.
Semaphore Waiters: The average number of threads waiting to acquire the semaphore.
This value should normally be 0. Short, bursty spikes are OK to see; they simply mean that a large number of requests arrived and were handled in a short period of time.
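As a rough illustration of how you might consume these counters once collected (for example, exported from perfmon to CSV), the Python sketch below flags sample ranges where Semaphore Holders is pinned at the MaxConcurrentApi limit while Semaphore Waiters is nonzero. The thresholds are assumptions based on the guidance above, not official values:

```python
# Hedged sketch: given parallel lists of sampled "Semaphore Holders" and
# "Semaphore Waiters" values, return the (start, end) sample-index ranges
# where the channel looks saturated for at least min_run consecutive samples.
def saturated_intervals(holders, waiters, max_concurrent_api=2, min_run=3):
    runs, start = [], None
    for i, (h, w) in enumerate(zip(holders, waiters)):
        if h >= max_concurrent_api and w > 0:
            # All slots held and someone is queued behind them.
            start = i if start is None else start
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(holders) - start >= min_run:
        runs.append((start, len(holders) - 1))
    return runs

# A short burst at the end is ignored; only the sustained run is reported.
holders = [1, 2, 2, 2, 2, 1, 2]
waiters = [0, 3, 5, 4, 2, 0, 1]
print(saturated_intervals(holders, waiters))
```

Sustained intervals reported here are the ones worth lining up against the netlogon.log timeouts and DC-side load data.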
In some instances on heavily loaded servers, you may want to raise MaxConcurrentApi (a REG_DWORD under HKLM\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters) to a value of 5 on both the Exchange servers and the DCs to widen the pipe, that is, to increase the number of authentication requests that can be in flight at any given time. Bumping this setting up may alleviate the symptom altogether, but it can also mask and thereby prolong some other underlying issue. It’s always best to understand where the problem is coming from before making a major change such as this, which may also increase overall processor utilization on the Exchange server and your domain controllers.
In this instance, we set MaxConcurrentApi to 5 on the DCs and the Exchange servers, and this appears to have reduced the number of occurrences of the problem.
This particular problem not only affects Exchange servers, but also other applications such as ISA Server. More information on this can be found here
I hope this provides some insight into some of the underlying dependency problems that you may see with Exchange.
That is all for now.