Excessive paging on Exchange 2007 servers when working sets are trimmed

For a list of current recommendations to help alleviate these issues, click here

Recently, there has been a rash of performance issues on Exchange 2007 Mailbox servers where the servers become unresponsive due to excessive paging. Previously, this was tracked down to .NET garbage collection not occurring properly, which caused managed services to consume excessive amounts of memory. Applying http://support.microsoft.com/kb/942027 was the fix for this. If you have installed SP1 for Exchange 2007, we currently recommend applying .NET 2.0 SP1 to the server, which contains this fix. Setup does not block on this; the pre-req checks only raise a warning recommending that you apply it.

Even with this hotfix installed, excessive paging was still occurring. After looking into this further, we noticed that the working set for store.exe was getting trimmed, causing that trimmed memory to be paged out to the paging file. Here is what you will see in Performance Monitor if you experience this. The red line shows Memory\Pages/sec when the trim operations occur.

[Image: Performance Monitor graph of Memory\Pages/sec during trim operations]

Here is a screenshot of all processes working sets.

[Image: working sets of all processes]

If you look at all of the processes, they are all getting trimmed at the same time. With Exchange 2007, we have lifted the ceiling on the store cache, and it can use upwards of 80-90% of the memory on a server. So if you have 32GB of RAM on the server, the store process will take longer to warm up the cache than it will with only 16GB of RAM. During this time, clients may experience slower than normal responses while the cache is being populated.

Looking at the above data, store.exe is using over 20GB of RAM, and when we reach a certain peak, the working set gets trimmed, causing this cache to get trimmed as well. Performance dies at this point because the working set is being paged to disk. Once the paging has occurred, we then have to load that data back into the working set whenever it is needed to perform a current operation. Once this roller coaster ride begins, the server is going to fall over; as you can see at midnight, even Performance Monitor could not connect to the server. Once the paging storm died down 30 minutes later, we were back working again, albeit slowly.
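If you have exported working set samples from a performance log, the "everything trimmed at the same time" signature can be checked for mechanically. The following is only a minimal sketch: the function name, the 25% threshold, the process names other than store, and the sample values are all hypothetical and are not taken from the data shown above.

```python
def find_simultaneous_trims(samples, drop_fraction=0.25):
    """Return the indices of samples in which every process's working set
    fell by at least drop_fraction compared with the previous sample.

    samples is a list of dicts mapping process name -> working set size.
    The 0.25 default is an illustrative threshold, not a documented value.
    """
    trims = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        # Only compare processes present in both samples.
        common = prev.keys() & cur.keys()
        if common and all(
            prev[name] > 0
            and (prev[name] - cur[name]) / prev[name] >= drop_fraction
            for name in common
        ):
            trims.append(i)
    return trims


# Hypothetical working set samples in MB (made up for illustration).
samples = [
    {"store": 20000, "process_a": 400, "process_b": 300},
    {"store": 20400, "process_a": 410, "process_b": 305},
    {"store": 5100, "process_a": 100, "process_b": 80},   # all drop together
    {"store": 6000, "process_a": 150, "process_b": 120},  # refilling
]

print(find_simultaneous_trims(samples))  # [2]
```

A drop in a single process is ordinary behavior, so the sketch only flags intervals in which every sampled process shrinks together, which is the pattern described here.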

Memory Planning considerations can be found in the help file at http://technet.microsoft.com/en-us/library/bb738124.aspx 

So you may ask, what is causing this? Well, in Windows 2003, if a driver makes a call to allocate a considerable amount of memory, specifically MmAllocateContiguousMemory, the working sets of processes need to get trimmed to give this call the memory that it needs. This can be any driver on the server that needs to make these memory allocations. Detecting which driver it is turns out to be much harder than you would think, as you have to perform some type of kernel debugging to get to the bottom of it.

Currently in Windows 2003, we will trim about 1/4 of the working set of each process to satisfy these requests, although as you can see, it appears that much more than 1/4 of the working set is being trimmed, more like 3/4. Luckily, the hotfix below helps greatly when these trimming operations occur: with the hotfix applied, we only trim 8,192 pages of each working set instead of a percentage.

A Windows Server 2003-based computer becomes unresponsive because of a memory manager trimming operation that is caused by an indeterminate module that requests lots of memory
http://support.microsoft.com/kb/938486 
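To put the difference between trimming a percentage and trimming a fixed 8,192 pages in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes a 4 KB page size (the standard page size on x64 Windows), uses the 20GB store.exe working set mentioned earlier, and the function names are made up for illustration.

```python
PAGE_SIZE_BYTES = 4 * 1024  # assumption: 4 KB pages, as on x64 Windows
GB = 1024 ** 3
MB = 1024 ** 2


def trimmed_by_fraction(working_set_bytes, fraction=0.25):
    """Bytes removed when a fraction of the working set is trimmed."""
    return int(working_set_bytes * fraction)


def trimmed_by_page_count(working_set_bytes, pages=8192):
    """Bytes removed when a fixed number of pages is trimmed
    (capped at the size of the working set itself)."""
    return min(working_set_bytes, pages * PAGE_SIZE_BYTES)


store_working_set = 20 * GB  # store.exe working set from the example above

print(trimmed_by_fraction(store_working_set) / GB)        # 5.0  (1/4 trimmed)
print(trimmed_by_fraction(store_working_set, 0.75) / GB)  # 15.0 (the ~3/4 observed)
print(trimmed_by_page_count(store_working_set) / MB)      # 32.0 (8,192 pages)
```

Under these assumptions, a single trim of a 20GB working set moves from gigabytes of store cache pushed toward the paging file to roughly 32MB, which is why the working set stays so much flatter with the hotfix applied.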

As you can see below, after the hotfix is applied, the store working set remained steady throughout its lifetime, which is a huge improvement over the previous screenshots.

[Image: store.exe working set after the hotfix is applied]

Shown below are the working sets from all processes. When these trim operations occur, we now trim memory in a stair-step pattern, which protects your server from all of this excessive paging. This more or less prevents faulty drivers from hanging or crashing your server.

[Image: working sets of all processes after the hotfix is applied]

ExBPA will be updated over time to detect drivers that may aggravate this problem, but with this hotfix applied, at least there is now some protection built into the operating system to help with this situation.

Currently, you can request this hotfix using the link in the aforementioned article.

Mike