DC fails logons or experiences LDAP timeouts
DC fails logons or experiences LDAP timeouts
This was an interesting one which rolled by recently, and it’s a looong post so I apologize ahead of time.
Let’s start with the end user experience and move on from there: User(s) cannot send mail or retrieve mail from Exchange 2010 server.
Well that’s pretty simple J.
So when we looked at the Exchange servers, we see a whole slew of events from various Exchange roles and servers – I have changed the target DC name, but the DC failing to return in a timely manner also varied.
Event id 2389’s
Process STORE.EXE (PID=7908). A search request to Directory Server DC1.FOO.COM did not return a result within 30 seconds and is being abandoned.
Process edgetransport.exe () (PID=6928). A request to Directory Server DC1.FOO.COM did not return a result within 180 secon...
Process Microsoft.Exchange.RpcClientAccess.Service.exe (PID=5256). A request to Directory Server DC1.FOO.COM did not return...
Process w3wp.exe () (PID=30496). A request to Directory Server DC1.FOO.COM did not return a result within 180 seconds and i...
At first we wondered if the queries were in fact “bad” ( expensive or inefficient ) queries but after looking at them they were all very targeted queries.
“CN=ExchSrv-01 MBX Store 013,CN=Databases,CN=Exchange Administrative Group(BAR),CN=Administrative Groups,CN=XCH,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=DC1,DC=FOO,DC=COM"
This path only had a few objects under it and the DCs should have been able to handle massive amounts of these calls without even breathing hard, and sure enough, when the Perf data was reviewed the CPU for LSASS was not very high ( IIRC something around 20% ) during the timeframe in which the Exchange servers were falling over.
Which brings us to performance data. There was no baseline for anything in this environment , so in order to determine what “normal” was, we gathered a day or so’s worth of data
The list below may not be the exact names as I am typing them from memory, but you get the picture (I hope).
· Overall System CPU
· LSASS CPU
· Process LSASS Private bytes
· NTDS (all)
· Logical Disk
· Physical Disk
However, none of this data was captured for a baseline, or from a DC during an outage.
So, I wrote a rudimentary LDAP stress tool which you can spin up X amount of threads in order to hit the DCs with various LDAP searches. We wanted to use this during off hours in order to determine the impact on the DCs, because at the time we did not have any perf data from an actual outage and wanted to get some more data before the next day.
I use Perfmon to trend and analyze high level data – think of it like day to day trending and analysis, and use something like SPA ( In 2k8 it’s in Perfmon under “Data Collector Sets”) see here for more info on it. I think of SPA more like a microscope, when you want to zoom in on a period of time where you know, or suspect, something is going on. You would never (IMO) want to take a day’s worth of SPA data to plow through.
So, we ran some LDAP stress against the DCs, trying to make them cry uncle and give us a timeout but it did not. Perhaps it just wasn’t generating enough load (even at 800 threads from multiple instances on multiple clients).
What we did see, was interesting. We can see that the stress tool was indeed giving us some perf data increase, and one of the more interesting counters is the ATQ (asynchronous thread queue) counters :
· \NTDS\ATQ Outstanding Queued Requests
· \NTDS\ATQ Request Latency
· \NTDS\ATQ Estimated Queue Dela
NOTE: ‘normal’ data is on the left ( may not be entirely normal as this is still a Saturday afternoon and lower than expected numbers)
Looking at ATQ Outstanding Queued Requests to see how many requests are queued at the domain controller.
A high level of queuing indicates that requests are arriving at the domain controller faster than they can be processed. This can also lead to a high latency in responding to requests. Delay is the estimated time (in milliseconds) the next request will spend in the queue prior to being serviced by the directory service, 1.265 seconds.
So as the stress load increased we see that the delays in processing goes up – perhaps to the point of failure?
By default there are 4 threads per processor allocated to the LDAP thread pool, we can change that via LDAP policies, specifically MaxPoolThreads:
· MaxPoolThreads = Maximum number of threads created by the domain controller for query execution (4 per processor)
Theory at this point was that we were clogging up the thread pool and more throughput here would improve performance as I did not believe we were not constrained by Disk\CPU\Memory on the systems. Processor, Disk IO counters, and Database (\Database(lsass)\Database Cache % Hit) give us some insight to my reasoning here.
We increased the threads to 8 per proc on a test DC and re-ran the ldap stress.
CPU usage is still within normal ranges ~20%.
We have lowered queue delay and increased ( nearly doubled ) DS search operations ( previous stress is on left , post 8 threads is on the right , note we also had the VM at one proc previously and this is now two)
At this point we wait for the next morning to hit, with Perfmon ready to go, SPA ready to go, and ready to get some LSASS.EXE dumps via procdump.exe if the issue occurs again.
Sure enough, Exchange goes down in spectacular flames…
We got the SPA data, LSA dumps etc. and in hindsight, some of the DCs had not been rebooted after making the LDAP policy change and so were still at the default 4 threads.
We examine the actual outage perf data and sure enough, we see that the queue delay is quite high, up to ~4 seconds as seen in the perf graph below.
But why? Why can this DC not handle the throughput here? Typically I see CPU increase when a DC in under serious load, and these DCs weren’t breathing hard.
The SPA data also showed a large number of LDAPPing requests which took ~2.2 seconds:
In addition, the raw SPA data showed more information ( client addresses )
Now this was a bit interesting…Let’s look at the LSASS process dump.
I saw all 8 ATQ threads and all of them were in this state:
Child-SP RetAddr Call Site
00000000`051fc2a8 000007fe`fcd0527b mswsock!Nbt_ResolveName
00000000`051fc2b0 000007fe`fccefa51 mswsock!Rnr_NbtResolveName+0x1b
00000000`051fc2e0 000007fe`fcce483a mswsock!Rnr_DoDnsLookup+0xb171
00000000`051fc350 000007fe`ff473fc9 mswsock!Dns_NSPLookupServiceNext+0x1ba
00000000`051fc6b0 000007fe`ff473f1b WS2_32!NSQUERY::LookupServiceNext+0x79
00000000`051fc710 000007fe`ff474562 WS2_32!WSALookupServiceNextW+0xce
00000000`051fc760 000007fe`ff474342 WS2_32!QueryDnsForFamily+0x1c2
00000000`051fd0c0 000007fe`ff474c05 WS2_32!QueryDns+0x110
00000000`051fd140 000007fe`ff473332 WS2_32!LookupAddressForName+0x1a5
00000000`051fd240 000007fe`fcaef2b9 WS2_32!GetAddrInfoW+0x232
00000000`051fd3d0 000007fe`fcaf03a9 netlogon!BuildSamLogonResponseEx+0x139
00000000`051fd6f0 000007fe`fcaf0c80 netlogon!LogonRequestHandler+0x211
00000000`051fd7c0 000007fe`fcb052fa netlogon!NlGetLocalPingResponse+0x410
00000000`051fdad0 000007fe`fc5b8a91 netlogon!I_NetLogonLdapLookupEx+0x616
00000000`051fdcc0 000007fe`fc5651dd ntdsai!LDAP_GetDSEAtts+0xe1f
00000000`051fe280 000007fe`fc56328e ntdsai!LDAP_CONN::SearchRequest+0x5d7
00000000`051fee50 000007fe`fc5bb37c ntdsai!LDAP_CONN::ProcessRequestEx+0xeab
00000000`051ff100 000007fe`fc5b64fe ntdsai!LDAP_CONN::IoCompletion+0x1ad
00000000`051ff570 000007fe`fc7106fa ntdsai!ProcessNewClient+0x140
00000000`051ff5d0 000007fe`fc041774 ntdsai!UDPIoCompletion+0x1a6
00000000`051ff920 000007fe`fc041921 NTDSATQ!AtqpProcessContext+0x221
00000000`051ff960 00000000`7737f56d NTDSATQ!AtqPoolThread+0x1e7
00000000`051ff9e0 00000000`774b3281 kernel32!BaseThreadInitThunk+0xd
00000000`051ffa10 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
Ahh very interesting.
Netlogon of course, processes DC locator and DSget* API’s. So, let’s recap what we have here.
· Exchange servers fall over from LDAP delays
· DCs have a default of four threads processing data
· DCs are not CPU constrained and LDAP queries are not expensive or inefficient
· DCs have some clients doing a DC discovery and sending the standard “LDAPPING” as documented in the [MS-ADTS]: Active Directory Technical Specification, you can see this in a network trace as well
· DCs threads are all gummed up in waiting to resolve name?
Upon further investigation we see the following in the DCs netlogon logs:
05/08 15:12:50 [MAILSLOT] Received ping from machine1.FOO.COM. (null) on UDP LDAP
05/08 15:12:50 Domain1: NO_CLIENT_SITE: machine1 10.10.10.87
05/08 15:12:50 [MAILSLOT] Domain1: Ping response 'Sam Logon Response Ex' (null) to \\machin1 Site: (null) on UDP LDAP
At the end of the day, after some additional code review and proving it out in a lab, the DC is trying to resolve the site to client mapping and if it does not find a subnet for the client, and can not resolve the name in DNS, it will try NetBios which takes ~2.5 seconds to timeout and a the number of calls from clients which fail in this manner, combined with the number of LDAP requests Exchange needed just to handle the morning logons – plugs the few default ATQ threads with client LDAP pings working through netbios name resolution.
The workaround in this case was to disable Netbios on the DC but this may not be an option for everyone, so the site subnet mapping or DNS name resolution should also fix this kind of an issue.
KEYWORDS: LDAP search hang delay netbios ATQ LDAP timeout