Is this horse dead yet: NTLM Bottlenecks and the RPC runtime
Hello again, this is guest author Herbert from Germany.
It’s harder to let go of old components and protocols than dropping old habits. But, I’m falling back to an old habit myself…there goes the New Year resolution.
Quite recently we were faced with a new aspect of an old story. We hoped this problem would cease to exist as customers move forward with Kerberos-based solutions and other methods that facilitate Kerberos, such as smartcard PKINIT.
Yes, there are still some areas where we have to use NTLM for the sake of compatibility or absence of a domain controller. One of the most popular scenarios is disconnected clients using RPC over HTTP to connect to an Exchange mailbox. Another one is web proxy servers - which still often use NTLM although they and most browsers support Kerberos also.
With RPC over HTTP you have two discrete NTLM authentications: the outer HTTP session is authenticated on the frontend server and the inner RPC authentication is done on the mailbox server. The NTLM load from proxy servers can be even worse - as each TCP session has to be authenticated - and some browsers frequently recycle their sessions.
One way or the other, you end up with a high rate of NTLM authentication requests. And you may have already found the “MaxConcurrentAPI“ parameter, which is the number of concurrent NTLM authentications processed by the server. Historically there has been constant talk about a default of 2. However, the defaults are quite different:
- Member-Workstation: 1
- Member-Server: 2
- Domain Controller: 1
The limit applies per Secure Channel. Members can only have one secure channel to a DC in the domain of which they are a member. Domain Controllers have one Secure Channel per trusted domain. However, as many customers follow a functional domain model of “user domains” and “resource domains”, the list of domains actually used for authentication is low and thus DCs are limited to 1 concurrent authentication for a certain “user domain”. Check out this diagram:
In this diagram, you see authentication requests started against servers in the left-hand forest as colored boxes by users in the right-hand forest. We are using the default values of MaxConcurrentAPI. The requests are forwarded along the trust paths to the right-hand forest. The trust paths used are shown by the arrows.
Now you see that on each resource forest DC up to 2 requests from member resource servers are queued. On the downstream DC, you get a maximum of 1 request from the grand-child domain. The same applies to the forest root DC. In this case, the only active authentication call for forest 1 is for the forest 2 grand-child domain, shown with brown API slots and arrows. Now that’s a real convoy…
The hottest link is between the forest root domains as every NTLM request needs to travel through the secure channels of forest1 root DCs with forest2 root DCs.
From the articles you may know “MaxConcurrentAPI” can be increased to 10 with a registry change. Well, Windows Server 2008 and Windows Server 2008 R2 have an update which pushes the limit to 150:
975363 A time-out error occurs when many NTLM authentication requests are sent from a computer that is running Windows Server 2008 R2, Windows 7, Windows Server 2008, or Windows Vista in a high latency network
This should be of some help… In addition, Windows Server 2008 and later include a performance object called ”Netlogon” which allows you monitoring the throughput, load and duration of NTLM authentication requests. You can add that to Windows Server 2003 using an update:
928576 New performance counters for Windows Server 2003 let you monitor the performance of Netlogon authentication
The article also offers a description of the counters. When you track the performance object you notice each secure channel is visible as a separate instance. This allows you to track activity per domain, what DCs are used and whether there are frequent fail-overs.
Beyond the article, these are our recommendations regarding performance baselines and alerts:
All Semaphores are busy, we have threads and thus logons waiting in the queue. This counter is a candidate for a warning.
This is the number of currently active callers. This is a candidate for a baseline to monitor. If this is approaching your maximum setting in baselines, you need to act.
This counts the total # of requests over this secure channel. When the secure channel fails and is reestablished, the count restarts from 0. Check the _Total instance for a counter for the whole server. Good to monitor the trend in baselines.
An authentication thread has hit the time-out for the waiting and the logon was denied. So the logon was slow, and then it failed. This is a very bad user experience and the secure channel is overloaded, hung or broken. Also check the _Total instance.
This is ALERT material.
Average Semaphore Hold Time
This should provide the average response time quite nicely. This is also a candidate for baseline monitoring for trends.
When it comes to discussing secure channels and maximum concurrency and queue depth, you also have to talk about how the requests are routed. Within a forest, you notice that the requests are sent directly to the target user domain.
When Netlogon finds that the user account is from another forest, it however has to follow the trust path, similar to what a Kerberos client would do (just the opposite direction). So the requests are forwarded to the parent domain and eventually arrive at the forest root DCs and from there across the forest boundary. You can easily imagine the Netlogon Service queues and context items look like rush hour at the Frankfurt airport.
So who cares?
You might say that besides the domains becoming bigger nowadays, there’s not a lot of news for folks running Exchange or big proxy server farms. Well, recently we became aware of a new source of NTLM authentication requests that was in the system for quite some time, but that now has reared its head. Recently customers have decided to turn this on, perhaps due to recommendations in a few of our best practices guides. We’re currently working on having these updated.
RPC Interface Restriction was introduced in Windows XP Service Pack 2 and Windows Server 2003 Service Pack 1 and offers the options to force authentication for all RPC Endpoint Mapper requests. The goal was to prevent anonymous attacks on the service. The goal may also have been avoiding denial of service attacks, but that one did not pan out very well. The details are described here:
In this description, the facility is hard-coded to use NTLM authentication. Starting with Windows 7, the feature can also use Kerberos for authentication. So this is yet another reason to update.
The server will only require authentication (reject anonymous clients) if “RestrictRemoteClients” is set to 1 or higher. When you have the combinations of applications with dynamic endpoints, many clients and frequent reconnects in the deployments, you get a sustainable number of authentications.
Some of the customers affected were quite surprised about the NTLM authentication volume, as they had everything configured to use Kerberos on their proxy servers and Exchange running without RPC over HTTP clients.
Exchange with MAPI clients is an application architecture that uses many different RPC interfaces, all using Endpoint Mapper. The list includes Store, NSPI, Referrer plus a few operating system interfaces like LSA RPC, each one of them triggering NTLM authentications. The bottleneck is then caused by the queuing of requests, done in each hop along the trust path.
Similar problems may happen with custom applications using RPC or DCOM to communicate. It all comes down to the rate of NTLM authentications induced on the AD infrastructure.
In our testing we found that not all RPC interfaces are happy with secure endpoint mapper, see the blog of Ned.
What are customers doing about it?
Most customers are then going to increase “MaxConcurrentAPI” which provides relief. Many customers also add monitoring of Netlogon performance counters to their baseline. We also have customers who start to use secure channel monitoring, and when they see that a DC is heaping incoming secure channels, they use “nltest /SC_RESET” to balance resource domain controllers or member servers evenly across the downstream domain controllers.
And yes, one way out of this is also setting the RPC registry entries or group policy to the defaults, so clients don’t attempt NTLM authentication. Since this setting was often required by the security department, it is probably not being changed in all cases. Some arguments that the secure Endpoint Mapper may not provide significant value are as follows:
1. The call is only done to get the server TCP port. The communication to the server typically is authenticated separately.
2. If the firewall does not permit incoming RPC endpoint mapper request from the Internet, the callers are all from the internal network. Thus no information is disclosed to outside entities if the network is secure.
3. There are no known vulnerabilities in the endpoint mapper. It was once justified when there were vulnerabilities, but not today.
4. If you can’t get the security policy changed, ask the IT team to expedite Windows 7 deployment as it does not cause NTLM authentication in this scenario.
Ah, those old habits, they always come back on you. The hope you now have tools and countermeasures to make all this more bearable.
There is an update available that adds NetLogon events 5816-5819 when you experience Semaphore Waiters and Semaphore Timeouts. This will allow you to find bottlenecks in your trust graph quickly. Check out:
New event log entries that track NTLM authentication delays and failures in Windows Server 2008 R2 are available - http://support.microsoft.com/kb/2654097
Herbert “glue factory” Mauerer