ASP.NET Case Study: High CPU in GC - Large objects and high allocation rates

Rather than spending a lot of time explaining the details of the garbage collector, I'll refer you to Maoni's blog for some very interesting reading, but in this case study I want to show you how to get to the bottom of a problem with high CPU in the garbage collector.

Problem description:

At times the application will be very slow to respond and the CPU usage is very high (70-80%).

Troubleshooting steps:

I have already tipped you off in the title that the issue will be High CPU in GC, but how do we even get to that conclusion? The answer is performance monitor...

A high CPU issue is generally one of three things:

· An infinite loop

· Too much load (i.e. too many requests doing lots of small things, so no single thing is the culprit)

· Too much churning in the Garbage Collector.

The first one is pretty easy to spot in a memory dump if you get it at the right time, and is usually caused by some kind of logic problem. The second type is really tough and often requires adding hardware, along with some serious profiling to optimize the code running on each request. For the purposes of this post, I'll skip both of those.

To find out if your performance issue might be GC related, your friends are the .NET CLR Memory performance counters.

Let's first take a look at one of the most important performance counters used for these types of issues, .net CLR Memory / % Time in GC.

Paraphrasing from Maoni's blog and my favorite white paper on ASP.NET performance counters, the % Time in GC counter shows how much of the CPU time spent since the end of the last GC was spent in the most recent GC. In other words, if 1 million cycles have elapsed since the last GC, and we spent 0.3 million cycles in the current GC, this counter will show 30%.

That means that for short periods of time you can see huge spikes, but it is the average that matters for this counter.

In different articles people will declare the optimal average value to be 5% or 30% or 20%, so what is it? Short answer: there is really no optimal value, or rather, the optimal value would be close to 0, but reducing this value is very hard, so I wouldn't even bother unless your average is above 30% or so. There is simply too much pain and too little gain.
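
As a side note, you can also sample these counters programmatically instead of through perfmon. Here is a minimal sketch using System.Diagnostics.PerformanceCounter; the instance name ("w3wp") is just an assumption, use whatever instance your worker process shows up as in perfmon.

using System;
using System.Diagnostics;
using System.Threading;

class GcCounterSample
{
    static void Main()
    {
        // Assumption: the worker process instance is called "w3wp" in perfmon.
        string instance = "w3wp";

        PerformanceCounter timeInGc = new PerformanceCounter(".NET CLR Memory", "% Time in GC", instance);
        PerformanceCounter allocRate = new PerformanceCounter(".NET CLR Memory", "Allocated Bytes/sec", instance);

        // The first NextValue() call just establishes a baseline for rate counters,
        // so sample twice with a delay in between.
        timeInGc.NextValue();
        allocRate.NextValue();
        Thread.Sleep(1000);

        Console.WriteLine("% Time in GC:        " + timeInGc.NextValue());
        Console.WriteLine("Allocated Bytes/sec: " + allocRate.NextValue());
    }
}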

The common cause of most high CPU in GC issues is a very high allocation rate ( .net CLR Memory / allocated bytes/sec), but really, if you spend all your GC time doing Gen 0 collections that is not going to cause high CPU in GC, since Gen 0 collections are very inexpensive. What hurts you is a lot of Gen 2 collections, i.e. a lot of .net objects making it into Generation 2 and then getting released. The other big thing that causes a high number of Gen 2 collections is excessive use of the Large Object Heap (LOH). The reason is that every time we fill up a LOH segment and need to create a new one to satisfy a request, we have to garbage collect, and a large object heap collection triggers a Gen 2 collection, which in turn triggers a Gen 1 and a Gen 0 collection.
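
To make the LOH part concrete, here is a minimal hypothetical sketch (not taken from the actual application) of the kind of allocation pattern that hurts: anything of roughly 85,000 bytes or more is allocated directly on the large object heap, so handing out big buffers over and over keeps filling LOH segments and forcing Gen 2 collections.

using System;

class LohStress
{
    static void Main()
    {
        for (int i = 0; i < 10000; i++)
        {
            // 320,000 bytes is well above the ~85,000 byte LOH threshold,
            // so every one of these arrays goes straight to the large object heap.
            byte[] buffer = new byte[320000];
            buffer[buffer.Length - 1] = 1; // touch it so the allocation isn't optimized away
        }
        Console.WriteLine("Gen 2 collections so far: " + GC.CollectionCount(2));
    }
}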

This brings us to the third group of counters that I usually look at ( .net CLR Memory / # Gen 0 Collections, # Gen 1 Collections and # Gen 2 Collections). Again, there is no hard and fast rule about what the ratio between these should be. Some recommend a 100/10/1 ratio, i.e. for each Gen 2 collection you should have about 10 Gen 1 collections and 100 Gen 0 collections if you have a good allocation pattern. I think that is a good rule of thumb, and since deviations from it are common, a fairly small deviation wouldn't concern me too much. However, if you have the same value for Gen 0, Gen 1 and Gen 2 collections, you can start worrying a little bit :)
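
If you can add a little instrumentation to a repro, GC.CollectionCount (available since .net 2.0) gives you the same three numbers from inside the process. This is just a sketch; the interesting part is comparing the three values after some load.

using System;

class GcRatioCheck
{
    static void Main()
    {
        // ... run or simulate the workload you want to observe here ...

        // The counts are cumulative for the process, so with a healthy allocation
        // pattern Gen 0 should be roughly 10x Gen 1 and 100x Gen 2.
        Console.WriteLine("# Gen 0 Collections: " + GC.CollectionCount(0));
        Console.WriteLine("# Gen 1 Collections: " + GC.CollectionCount(1));
        Console.WriteLine("# Gen 2 Collections: " + GC.CollectionCount(2));
    }
}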

Oh, while I remember, another thing (apart from the blocked Finalizer that I posted about earlier) that can cause a large number of Gen 2 collections is of course the human element, in other words calling GC.Collect(3) or GC.Collect(2) manually in your application on a regular and frequent basis, or calling GC.GetTotalMemory(true). To see if this might be an issue, watch the .net CLR Memory / # Induced GC counter. Ideally it should be 0, and it is in most cases, but if you are closing in on the memory limit you might see it go up, since ASP.NET itself will induce a GC at that point, but only one or so...
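
For reference, these are the kinds of calls that show up in # Induced GC. A small hypothetical example of what you do not want sprinkled through your request handling:

using System;

class InducedGcExample
{
    static void Main()
    {
        // Both of these force a full collection and bump the
        // .net CLR Memory / # Induced GC counter.
        GC.Collect(2);
        long bytes = GC.GetTotalMemory(true); // true = wait for a collection before measuring

        Console.WriteLine("Total managed bytes: " + bytes);
    }
}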

For this specific issue I have the following values:

% Time in GC            ~40% average
allocated bytes/sec     ~400 MB average
# Induced GC            0
# Gen 0 Collections     28,379
# Gen 1 Collections     28,378
# Gen 2 Collections     28,378

WOW! Now that is a massive amount of allocated bytes / sec :) but I definitely know that I am not allocating anything that huge in my application, especially not every second.

The other interesting point here is that our Gen 0, Gen 1 and Gen 2 are close to identical, which is a sure clue that we are either stressing the LOH pretty hard or we have a lot of objects that make it into the 2nd generation and then die immediately. My personal vote goes to the LOH.

In a perfect world we know exactly what page or small set of pages is responsible for the high allocation rate and we can use the CLR Profiler (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dndotnet/html/highperfmanagedapps.asp) to catalogue the allocations and figure out a way to reduce them. Oh, and did I mention, of course in a perfect world we can also repro the problem in a test environment. But… need I say it? We don't live in a perfect world…

However if you can repro the issue in test, I would highly recommend using the CLR profiler to look at your allocation pattern.

In this particular case, we have no clue what is causing the problem so we will resort to my favourite tool, windbg :)

Debugging the problem:

GC problems are pretty hard to debug for a couple of reasons.

1. If you catch a -hang dump and it happens to be in the middle of a GC, that is no absolute indication that you have a high CPU in GC problem. In a process with fairly high load and a lot of allocations, GCs happen frequently, so it is not unusual to catch the process in the middle of one.
2. Even if you know you have a GC problem (based on performance counters), and you catch the process in a GC, it is hard to tell what actually caused it.

The best way to catch the offender with its hand in the cookie jar is to take multiple dumps on different occasions when the CPU is high; if they all look pretty much the same, there is a good chance that you have found your CPU thief.

Step 1 - Determining if we are in a GC:

As I've mentioned before, we have two different versions of the GC (the server version and the workstation version). In the server version there is a dedicated GC thread per logical processor (so two per processor if it is hyper-threaded); in the workstation version the GC runs on the thread that initiated the collection. The dump output I am going to show you is from a dual-core machine running .net 2.0.

If we are not in a GC, we will have 2 threads (one per core) looking something like this

  18 Id: 134c.918 Suspend: 1 Teb: 7ffa8000 Unfrozen
ChildEBP RetAddr Args to Child
01e1fe68 7c822124 77e6baa8 00000398 00000000 ntdll!KiFastSystemCallRet
01e1fe6c 77e6baa8 00000398 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
01e1fedc 79e77fd1 00000398 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
01e1ff20 79e77f9a 00000398 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199
01e1ff70 79e77f50 ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
01e1ff80 79f3549b ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17
01e1ffa8 79f6ece3 00000000 b28c067c 01e1ffec mscorwks!SVR::gc_heap::gc_thread_function+0x2e
01e1ffb8 77e66063 000e7660 00000000 00000000 mscorwks!SVR::gc_heap::gc_thread_stub+0x9b
01e1ffec 00000000 79f6ec79 000e7660 00000000 kernel32!BaseThreadStart+0x34

So this is a GC thread waiting for work...

In our dumps from the problem our GC threads look like this…

  16 Id: f28.1150 Suspend: 1 Teb: fff82000 Unfrozen
ChildEBP RetAddr Args to Child
0248fd28 7d4d8c46 000002dc 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15
0248fd98 79e77fd1 000002dc ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
0248fddc 79e77f9a 000002dc ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199
0248fe2c 79e77f50 ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
0248fe3c 79f35e1a ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17
0248fe54 7a0d5a3b 001afbf0 0000000d 00000002 mscorwks!SVR::t_join::join+0x61
0248ff54 79f391bf 00000002 001afbf0 00000000 mscorwks!SVR::gc_heap::plan_phase+0xd78
0248ff70 79f39954 00000002 ffffffff 001afbf0 mscorwks!SVR::gc_heap::gc1+0x57
0248ff88 79f35422 00000000 00000000 001afbf0 mscorwks!SVR::gc_heap::garbage_collect+0x37c
0248ffa8 79f6ece3 00000000 914b1904 0248ffec mscorwks!SVR::gc_heap::gc_thread_function+0x68
0248ffb8 7d4e0729 001afbf0 00000000 00000000 mscorwks!SVR::gc_heap::gc_thread_stub+0x9b
0248ffec 00000000 79f6ec79 001afbf0 00000000 kernel32!BaseThreadStart+0x34

…which indicates that we are in the middle of a garbage collection.

So what triggered this particular garbage collection? To find that out we have to look for a thread that has GarbageCollectGeneration on the stack...

  21 Id: f28.1244 Suspend: 1 Teb: fff0f000 Unfrozen
ChildEBP RetAddr Args to Child
029eeeb4 7d4d8c46 000002f0 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15
029eef24 79e77fd1 000002f0 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac
029eef68 79e77f9a 000002f0 ffffffff 00000000 mscorwks!PEImage::LoadImage+0x199
029eefb8 79e77f50 ffffffff 00000000 00000000 mscorwks!CLREvent::WaitEx+0x117
029eefc8 79f3a3f3 ffffffff 00000000 00000000 mscorwks!CLREvent::Wait+0x17
029eefec 79f3a2f6 ffffffff 001afcf8 029ef0e4 mscorwks!SVR::gc_heap::wait_for_gc_done+0x99
029ef010 79f3a0c7 00000000 00000000 0004e210 mscorwks!SVR::GCHeap::GarbageCollectGeneration+0x267
029ef0a0 79f2d0aa 029ef0e4 0004e210 00000003 mscorwks!SVR::gc_heap::try_allocate_more_space+0x1bc
029ef0c0 79f32b14 029ef0e4 0004e210 00000003 mscorwks!SVR::gc_heap::allocate_more_space+0x2f
029ef100 7a0d5c9a 0004e20c 00000000 00000000 mscorwks!SVR::gc_heap::allocate_large_object+0x8c
029ef11c 79e74c1c 00211be8 0004e20c 00000000 mscorwks!SVR::GCHeap::Alloc+0xbc
029ef138 79e7e940 0004e20c 00000000 00000000 mscorwks!Alloc+0x60
029ef18c 79e7e9a0 791240f0 00013880 00000000 mscorwks!FastAllocatePrimitiveArray+0xbd
029ef24c 02920df3 00009c40 00002247 d768bc75 mscorwks!JIT_NewArr1+0x148
WARNING: Frame IP not in any known module. Following frames may be wrong.
029ef2a8 6881bdc6 02a60824 06a77b84 0aa3a218 0x2920df3
029ef2c0 6881bfbc 06a76df0 0aa3a218 06a7cb04 System_Web_ni+0x25bdc6
00000000 00000000 00000000 00000000 00000000 System_Web_ni+0x25bfbc

In SOS for 1.1 the threads output for this thread would have a (GC) in the exception column like this...

0:029> !threads
ThreadCount: 7
UnstartedThread: 0
BackgroundThread: 7
PendingThread: 0
DeadThread: 0
Hosted Runtime: no

                                      PreEmptive GC Alloc Lock
ID OSID ThreadOBJ State GC Context Domain Count APT Exception
14 1 11b8 001aba50 1808220 Enabled 00000000:00000000 001a8868 0 Ukn (Threadpool Worker)
18 2 11f8 001b1ef0 b220 Enabled 00000000:00000000 001a8868 0 MTA (Finalizer)
19 3 1174 001c75e8 80a220 Enabled 00000000:00000000 001a8868 0 MTA (Threadpool Completion Port)
20 4 133c 001ca250 1220 Enabled 00000000:00000000 001a8868 0 Ukn
12 6 1178 001dcae8 880a220 Enabled 00000000:00000000 001a8868 0 MTA (Threadpool Completion Port)
21 7 1244 00211ba8 180b220 Enabled 00000000:00000000 001ca990 1 MTA (GC)
22 5 6f8 001dc450 880b220 Enabled 00000000:00000000 001a8868 0 MTA (Threadpool Completion Port)

…but that is not always true for sos.dll for 2.0, so then we have to resort to finding the thread that calls GarbageCollectGeneration...

Step 2 - Why are we garbage collecting:

In this case we can see that the GC happened because we tried to allocate a large object (allocate_large_object) and we needed more space on the LOH to do it. That is a pretty good indication that we have hit the spot, particularly since we saw that the numbers of Gen 2, Gen 1 and Gen 0 collections were pretty much the same, indicating a lot of Gen 2, or LOH, collections.

The other thing we can see from this stack before moving on to the managed stack is that we are trying to allocate an object of size 0x0004e20c (the first parameter to allocate_large_object) which equates to 320,012 bytes.
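
As a quick sanity check (my own arithmetic, not something from the dump): 0x0004e20c is 320,012 in decimal, which is well above the ~85,000 byte LOH threshold, and since the stack goes through FastAllocatePrimitiveArray and JIT_NewArr1 this is most likely a large primitive array, for example a byte[] of roughly 320,000 elements.

using System;

class SizeCheck
{
    static void Main()
    {
        int size = 0x0004e20c;            // the first parameter to allocate_large_object
        Console.WriteLine(size);          // prints 320012
        Console.WriteLine(size >= 85000); // True - anything this big lands on the LOH
    }
}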

We can also see, if we run !runaway (which gives us the accumulated user-mode CPU time for each thread), that the threads that have used up the most CPU since the start of this process are the following:

0:029> !runaway
User Mode Time
Thread Time
14:11b8 0 days 0:00:55.687
16:1150 0 days 0:00:45.500
17:7a8 0 days 0:00:43.875
21:1244 0 days 0:00:23.140
0:ea0 0 days 0:00:00.046
29:fc8 0 days 0:00:00.000
28:11f0 0 days 0:00:00.000

The total up-time for the process is around 17 minutes, and with only about 2 minutes 48 seconds of user time and 45 seconds of kernel time accumulated (out of roughly 34 minutes of available CPU on two cores), the process has been idle for most of its lifetime.

0:021> .time
Debug session time: Tue Jun 20 09:46:04.000 2006 (GMT+2)
System Uptime: 4 days 6:36:51.031
Process Uptime: 0 days 0:16:53.000
Kernel time: 0 days 0:00:45.000
User time: 0 days 0:02:48.000

Threads 16 and 17 are our GC threads, and they will pretty much always be at the top of the list since most of the other threads (worker threads) spend a lot of their time sitting idle waiting for requests.

Then we have threads 14 and 21, which are worker threads (based on the !threads output above). If we start with thread 14, we can see from the stack output that this thread is currently idle…

  14 Id: f28.11b8 Suspend: 1 Teb: fff88000 Unfrozen
ChildEBP RetAddr Args to Child
020afc8c 7d4d8c46 00000224 00000000 020afcd0 ntdll!ZwWaitForSingleObject+0x15
020afcfc 79e77fd1 00000224 00009c40 00000000 kernel32!WaitForSingleObjectEx+0xac
020afd40 79e77f9a 00000224 00009