Deep Dive: Active Directory ESE Version Store Changes in Server 2019
Hey everybody. Ryan Ries here to help you fellow AD ninjas celebrate the launch of Server 2019.
Warning: As is my wont, this is a deep dive post. Make sure you've had your coffee before proceeding.
I want to talk about an enhancement to on-premises Active Directory in Server 2019 that you won't read or hear about anywhere else. This specific topic is near and dear to my heart personally.
The intent of the first section of this article is to discuss how Active Directory’s sizing of the ESE version store has changed in Server 2019 going forward. The second section of this article will discuss some basic debugging techniques related to the ESE version store.
Active Directory, also known as NT Directory Services (NTDS), uses Extensible Storage Engine (ESE) technology as its underlying database.
One component of all ESE database instances is known as the version store. The version store is an in-memory temporary storage location where ESE stores snapshots of the database during open transactions. This allows the database to roll back transactions and return to a previous state in case the transactions cannot be committed. When the version store is full, no more database transactions can be committed, which effectively brings NTDS to a halt.
In 2016, the CSS Directory Services support team blog (also known as AskDS) published some previously undocumented (and some lightly documented) internals regarding the ESE version store. Those new to the concept of the ESE version store should read that blog post first.
In the blog post linked to previously, it was demonstrated how Active Directory had calculated the size of the ESE version store since AD’s introduction in Windows 2000. When the NTDS service first started, a complex algorithm was used to calculate version store size. This algorithm included the machine’s native pointer size, number of CPUs, version store page size (based on an assumption which was incorrect on 64-bit operating systems), maximum number of simultaneous RPC calls allowed, maximum number of ESE sessions allowed per thread, and more.
Since the version store is a memory resource, it follows that the most important factor in determining the optimal ESE version store size is the amount of physical memory in the machine, and that - ironically - seems to have been the only variable not considered in the equation!
The way that Active Directory calculated the version store size did not age well. The original algorithm was written during a time when all machines running Windows were 32-bit, and even high-end server machines had maybe one or two gigabytes of RAM.
As a result, many customers have contacted Microsoft Support over the years for issues arising on their domain controllers that could be attributed to, or at least exacerbated by, an undersized ESE version store. Furthermore, even though the default ESE version store size can be augmented by the "EDB max ver pages (increment over the minimum)" registry setting, customers are often hesitant to use the setting because it is a complex topic that warrants more thorough documentation than has traditionally been available.
The algorithm is now greatly simplified in Server 2019:
- When NTDS first starts, the ESE version store size is now calculated as 10% of physical RAM, with a minimum of 400MB and a maximum of 4GB.
- The same calculation applies to physical machines and virtual machines. In the case of virtual machines with dynamic memory, the calculation will be based off of the amount of "starting RAM" assigned to the VM.
- The "EDB max ver pages (increment over the minimum)" registry setting can still be used, as before, to add additional buckets over the default calculation. (Even beyond 4GB if desired.)
- The registry setting is in terms of "buckets," not bytes. Version store buckets are 32KB each on 64-bit systems. (They are 16KB on 32-bit systems, but Microsoft no longer supports any 32-bit server OSes.) Therefore, if one adds 5000 "buckets" by setting the registry entry to 5000 (decimal), then 156MB will be added to the default version store size.
- A minimum of 400MB was chosen for backwards compatibility because when using the old algorithm, the default version store size for a DC with a single 64-bit CPU was ~410MB, regardless of how much memory it had. (There is no way to configure less than the minimum of 400MB, similar to previous Windows versions.)
- The advantage of the new algorithm is that now the version store size scales linearly with the amount of memory the domain controller has, when previously it did not.
|Physical Memory in the Domain Controller|Default ESE Version Store Size|
This new calculation will result in larger default ESE version store sizes for domain controllers with greater than 4GB of physical memory when compared to the old algorithm. This means more version store space to process database transactions, and fewer cases of version store exhaustion. (Which means fewer customers needing to call us!)
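The sizing rule described above (10% of physical RAM, with a 400MB floor and a 4GB ceiling) can be sketched in a few lines of Python. This is only an illustration of the arithmetic as stated in this article, not the actual NTDS code. The function name is made up for the example, and details such as rounding to whole buckets are not covered here.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

MIN_VERSION_STORE = 400 * MIB  # floor stated in the article
MAX_VERSION_STORE = 4 * GIB    # default ceiling stated in the article


def default_version_store_bytes(physical_ram_bytes: int) -> int:
    """Approximate the Server 2019 default: 10% of RAM, clamped to 400MB..4GB.

    Hypothetical helper for illustration only; the real calculation lives
    inside NTDS and may differ in rounding details.
    """
    ten_percent = physical_ram_bytes // 10
    return max(MIN_VERSION_STORE, min(ten_percent, MAX_VERSION_STORE))


if __name__ == "__main__":
    # Print the approximate default for a few sample memory sizes.
    for ram_gib in (2, 4, 8, 16, 32, 64):
        size_mib = default_version_store_bytes(ram_gib * GIB) / MIB
        print(f"{ram_gib:>3} GB RAM -> ~{size_mib:.0f} MB version store")
```

Any value added through the "EDB max ver pages (increment over the minimum)" registry setting would be on top of what this function returns.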
Note: This enhancement currently only exists in Server 2019 and there are not yet any plans to backport it to older Windows versions.
Note: This enhancement applies only to Active Directory and not to any other application that uses an ESE database such as Exchange, etc.
ESE Version Store Advanced Debugging and Troubleshooting
This section will cover some basic ESE version store triage, debugging and troubleshooting techniques.
As covered in the AskDS blog post linked to previously, the performance counter used to see how many ESE version store buckets are currently in use is:
\\.\Database ==> Instances(lsass/NTDSA)\Version buckets allocated
Once that counter has reached its limit (~12,000 buckets or ~400MB by default), events will be logged to the Directory Services event log, indicating the exhaustion:
Figure 1: NTDS version store exhaustion.
The event can also be viewed graphically in Performance Monitor:
Figure 2: The plateau at 12,314 means that the performance counter "Version Buckets Allocated" cannot go any higher. The flat line represents a dead patient.
As long as the domain controller still has available RAM, try increasing the version store size using the previously mentioned registry setting. Increase it in gradual increments until the domain controller is no longer exhausting the ESE version store, or the server has no more free RAM, whichever comes first. Keep in mind that the more memory that is used for version store, the less memory will be available for other resources such as the database cache, so a sensible balance must be struck to maintain optimal performance for your workload. (That is, there is no one-size-fits-all value.)
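When picking increments for the registry setting, it helps to translate between buckets and megabytes. The sketch below assumes 32KB buckets, as stated for 64-bit systems. The helper names are invented for this example.

```python
BUCKET_KIB = 32  # version store bucket size on 64-bit systems, per the article


def buckets_to_mib(buckets: int) -> float:
    """Memory (in MB) represented by a number of version store buckets."""
    return buckets * BUCKET_KIB / 1024


def mib_to_buckets(mib: int) -> int:
    """Number of whole buckets needed to cover a given amount of memory (in MB)."""
    return mib * 1024 // BUCKET_KIB


if __name__ == "__main__":
    print(buckets_to_mib(5000))  # the article's example: roughly 156MB
    print(mib_to_buckets(400))   # the 400MB minimum expressed in buckets
```

For example, to add roughly 256MB on top of the default, the registry value would be `mib_to_buckets(256)`, entered as a decimal number.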
If the "Version Buckets Allocated" performance counter is still pegged at the maximum amount, then there is some further investigation that can be done using the debugger.
The eventual goal will be to determine the nature of the activity within NTDS that is primarily responsible for exhausting the domain controller of all its version store, but first, some setup is required.
First, generate a process memory dump of lsass on the domain controller while the machine is "in state" – that is, while the domain controller is at or near version store exhaustion. To do this, the "Create dump file" option can be used in Task Manager by right-clicking on the lsass process on the Details tab. Optionally, another tool such as Sysinternals’ procdump.exe can be used (with the -ma switch).
In case the issue is transient and only occurs when no one is watching, data collection can be configured on a trigger, using procdump with the -p switch.
Note: Do not share lsass memory dump files with unauthorized persons, as these memory dumps can contain passwords and other sensitive data.
It is a good idea to generate the dump after the Version Buckets Allocated performance counter has risen to an abnormally elevated level but before the version store has plateaued completely. The reason is that the database transaction responsible may be terminated once the exhaustion occurs, in which case the thread would no longer be present in the memory dump. If the guilty thread is no longer alive when the memory dump is taken, troubleshooting will be much more difficult.
Next, gather a copy of %windir%\System32\esent.dll from the same Server 2019 domain controller. The esent.dll file contains a debugger extension, but it is highly dependent upon the correct Windows version, or else it could output incorrect results. It should match the same version of Windows as the memory dump file.
Next, download WinDbg from the Microsoft Store, or from this link.
Once WinDbg is installed, configure the symbol path for Microsoft’s public symbol server:
Figure 3: srv*c:\symbols*http://msdl.microsoft.com/download/symbols
Now load the lsass.dmp memory dump file, and load the esent.dll module that you had previously collected from the same domain controller:
Figure 4: .load esent.dll
Now the ESE database instances present in this memory dump can be viewed with the command !ese dumpinsts:
Figure 5: !ese dumpinsts - The only ESE instance present in an lsass dump on a DC should be NTDSA.
Notice that the current version bucket usage is 11,189 out of 12,802 buckets total. The version store in this memory dump is very nearly exhausted. The database is not in a particularly healthy state at this moment.
The command !ese param <instance> can also be used, specifying the same database instance obtained from the previous command, to see global configuration parameters for that ESE database instance. Notice that JET_paramMaxVerPages is set to 12800 buckets, which is 400MB worth of 32KB buckets:
Figure 6: !ese param
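To put the numbers from the debugger output into perspective, the bucket counts can be converted to megabytes and a percentage. The following is a small sketch using the figures quoted above, with a made-up helper name and the 32KB bucket size assumed:

```python
def version_store_usage(used_buckets: int, max_buckets: int, bucket_kib: int = 32) -> dict:
    """Summarize version store usage from bucket counts (hypothetical helper)."""
    return {
        "used_mib": used_buckets * bucket_kib / 1024,
        "max_mib": max_buckets * bucket_kib / 1024,
        "percent_used": 100.0 * used_buckets / max_buckets,
    }


if __name__ == "__main__":
    # Figures quoted from the !ese dumpinsts output in this article.
    usage = version_store_usage(used_buckets=11189, max_buckets=12802)
    print(f"{usage['used_mib']:.1f} MB of {usage['max_mib']:.1f} MB "
          f"({usage['percent_used']:.1f}% used)")
```

With these inputs the version store is a little over 87% consumed, which is why the dump is described as very nearly exhausted.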
To see much more detail regarding the ESE version store, use the !ese verstore <instance> command, specifying the same database instance:
Figure 7: !ese verstore
The output of the command above shows us that there is an open, long-running database transaction, how long it’s been running, and which thread started it. This matches the information displayed in the Directory Services event log event pictured previously.
Neither the event log event nor the esent debugger extension was always quite so helpful; they have both been enhanced in recent versions of Windows.
In older versions of the esent debugger extension, the thread ID could be found in the dwTrxContext field of the PIB (command: !ese dump PIB 0x000001AD71621320), and the start time of the transaction could be found in m_trxidstack as a 64-bit file time. But now the debugger extension extracts that data automatically for convenience.
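For anyone stuck with an older debugger extension, a raw 64-bit file time can be decoded by hand. A Windows FILETIME counts 100-nanosecond intervals since January 1, 1601 (UTC), so a minimal Python conversion looks like the sketch below. The function name is illustrative, and the sample value is simply the well-known FILETIME of the Unix epoch, not a value taken from the dump.

```python
from datetime import datetime, timedelta, timezone

# FILETIME values count 100-nanosecond ticks since this instant (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime: int) -> datetime:
    """Convert a 64-bit Windows FILETIME value to a timezone-aware UTC datetime."""
    # Ten ticks per microsecond; sub-microsecond precision is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


if __name__ == "__main__":
    # 116444736000000000 is the FILETIME corresponding to 1970-01-01 00:00:00 UTC.
    print(filetime_to_datetime(116444736000000000).isoformat())
```

Subtracting the decoded start time from the time the dump was taken gives the age of the transaction.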
Switch to the thread that was identified earlier and look at its call stack:
Figure 8: The guilty-looking thread responsible for the long-running database transaction.
The four functions that are highlighted by a red rectangle in the picture above are interesting, and here’s why:
When an object is deleted on a domain controller, and that object has links to other objects, those links must also be deleted/cleaned by the domain controller. For example, when an Active Directory user becomes a member of a security group, a database link between the user and the group is created that represents that relationship. The same principle applies to all linked attributes in Active Directory. If the Active Directory Recycle Bin is enabled, then the link-cleaning process will be deferred until the deleted object surpasses its Deleted Object Lifetime – typically 60 or 180 days after being deleted. This is why, when the AD Recycle Bin is enabled, a deleted user can be easily restored with all of its group memberships still intact – because the user account object’s links are not cleaned until after its time in the Recycle Bin has expired.
The trouble begins when an object with many backlinks is deleted. Some security groups, distribution lists, RODC password replication policies, etc., may contain hundreds of thousands or even millions of members. Deleting such an object will give the domain controller a lot of work to do. As you can see in the thread call stack shown above, the domain controller had been busily processing links on a deleted object for 47 seconds and still wasn’t done. All the while, more and more ESE version store space was being consumed.
When the AD Recycle Bin is enabled, this can cause even more confusion, because no one remembers that they deleted that gigantic security group 6 months ago. A time bomb has been sitting in the AD Recycle Bin for months. But suddenly, AD replication grinds to a standstill throughout the domain and the admins are scrambling to figure out why.
The performance counter "\\.\DirectoryServices ==> Instances(NTDS)\Link Values Cleaned/sec" would also show increased activity during this time.
There are two main ways to fight this: increase the version store size with the "EDB max ver pages (increment over the minimum)" registry setting, decrease the batch size with the "Links process batch size" registry setting, or do both. Domain controllers process the deletion of these links in batches. The smaller the batch size, the shorter the individual database transactions will be, thus relieving pressure on the ESE version store.
Though the default values are properly sized for almost all Active Directory deployments and most administrators should never have to worry about them, the two previously mentioned registry settings are supported and well-informed enterprise administrators are encouraged to tweak the values – within reason – to avoid ESE version store depletion. Contact Microsoft customer support before making any modifications if there is any uncertainty.
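The effect of the batch size can be pictured with a toy model in which every batch of links is cleaned inside its own transaction. This is purely illustrative and is not how NTDS is implemented internally. The function name and the numbers are invented for the example.

```python
from typing import List


def split_into_batches(total_links: int, batch_size: int) -> List[int]:
    """Return the number of links handled by each transaction in a toy model."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full_batches, remainder = divmod(total_links, batch_size)
    return [batch_size] * full_batches + ([remainder] if remainder else [])


if __name__ == "__main__":
    # Hypothetical group with one million links, cleaned with two batch sizes.
    for batch_size in (10_000, 1_000):
        batches = split_into_batches(1_000_000, batch_size)
        print(f"batch size {batch_size}: {len(batches)} transactions, "
              f"at most {max(batches)} links per transaction")
```

A smaller batch size means more transactions overall, but each one holds version store space for less work before it commits.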
At this point, one could continue diving deeper, using various approaches (e.g. consider not only debugging process memory, but also consulting DS Object Access audit logs, object metadata from repadmin.exe, etc.) to find out which exact object with many thousands of links was just deleted, but in the end that’s a moot point. There’s nothing else that can be done with that information. The domain controller simply must complete the work of link processing.
In other situations, however, the same techniques shown previously will make it apparent that inefficient incoming LDAP queries from some network client are leading to version store exhaustion. Other times it will be DirSync clients. Other times it may be something else. In those instances, there may be more you can do besides just tweaking the version store variables, such as tracking down and silencing the offending network client(s), optimizing LDAP queries, creating database indices, etc.
Thanks for reading,
- Ryan "Where's My Ship-It Award" Ries