.NET Hang Case Study: The GC-Loader Lock Deadlock (a story of mixed mode dlls)
I'm sorry I haven't had a chance to write much lately because of a lot of different reasons. I moved houses and at the same time I started writing some managed debugging scripts for debugdiag 1.1 to allow it to automatically catch a lot of the issues that I've been writing about before. But... enough with the excuses already...
I had an interesting case recently, well actually it is still on-going but during the course of the case we discovered a hang in their application caused by using mixed-mode dlls, which is what I'm going to talk about now...
Intermittently, some ASP.NET requests doing web service calls would hang and the application needs to be restarted for the application to function correctly. The application frequently receives a lot of load and dumps were taken during the hang situation using adplus -hang -pn w3wp.exe.
Typically if I get this kind of problem description, the answer in 99% of the cases is either that the remote server (web service server) is hanging or that the application is not properly configured in terms of maxconnections and threads. Take a peak at this post for the normal troubleshooting steps in this type of case.
This case was slightly different though, their maxConnections setting was set extremely high to avoid running out of connections and the remote server was running just fine...
Debugging the issue:
In this dump we had 4 threads waiting for a response from a webservice call (sitting in System.Net.HttpWebRequest.GetResponse). There can only be a couple of reasons for this.
1. We have sent the request off and it is now in the hands of the remote server
2. We have sent the request off and it has responded but we are out of threads for it to respond on
3. There is some kind of network issue
I have mentioned earlier that we don't spin up new threads when CPU usage is higher than 80%, and in this case, the problem is that CPU usage is reported as being 81% on the box. If it wasn't for this fact you can see below that there would have been plenty of room for us to spin up new threads since we are only using 99 out of 800 threads, which they have configured as their max limit.
0:093> !threadpool CPU utilization 81% Worker Thread: Total: 99 Running: 99 Idle: 0 MaxLimit: 800 MinLimit: 400 Work Request in Queue: 771 -------------------------------------- Number of Timers: 0 -------------------------------------- Completion Port Thread: Total: 8 Free: 2 MaxFree: 16 CurrentLimit: 8 MaxLimit: 800 MinLimit: 8
The number 81% is a bit special... as I have mentioned in some of my posts about high CPU in GC. When we are in the middle of a GC, the .net framework will temporarily set the variable that holds the CPU usage for the machine to 81% to prevent new threads for spinning up, so if you are in the middle of a GC the CPU usage will often show as 81% even if the actual CPU usage on the box is 5% or 70% or whatever it might be.
Looking closer at the memory dump with !threads we can see that we are in the middle of a garbage collection. We can determine that this is the fact because of the (GC) on thread 83, which indicates in turn that thread 83 triggered the GC.
0:093> !threads ThreadCount: 133 UnstartedThread: 0 BackgroundThread: 127 PendingThread: 0 DeadThread: 6 PreEmptive GC Alloc Lock ID ThreadOBJ State GC Context Domain Count APT Exception 15 0x1e8c 0x000edff8 0x180a220 Enabled 0x14547554:0x1454855c 0x00153848 1 MTA (Threadpool Worker) 27 0x1558 0x000f2398 0xb220 Enabled 0x18637148:0x186371a4 0x00153848 0 MTA (Finalizer) 17 0x1768 0x00140858 0x1800222 Enabled 0x187b23c8:0x187b2f24 0x00153848 1 MTA (Threadpool Worker) ... 84 0x1988 0x270a65f0 0x1800220 Enabled 0x00000000:0x00000000 0x000e99f0 0 Ukn (Threadpool Worker) 85 0x14f0 0x2b1aca28 0x1800220 Enabled 0x10515464:0x105159d8 0x00153848 1 MTA (Threadpool Worker) 86 0xc40 0x342dbe20 0x1800220 Enabled 0x1c58d974:0x1c58db08 0x00153848 1 MTA (Threadpool Worker) 83 0x1c48 0x26a3d2e0 0x1800220 Disabled 0x187be918:0x187be924 0x00153848 2 MTA (GC) (Threadpool Worker) 87 0x1010 0x27261528 0x1800220 Enabled 0x1c53df30:0x1c53e1f8 0x00153848 1 MTA (Threadpool Worker) 89 0x1890 0x3441da40 0x1800220 Enabled 0x00000000:0x00000000 0x000e99f0 0 Ukn (Threadpool Worker) 88 0x1878 0x2725a3b0 0x1800220 Enabled 0x1a6009e4:0x1a601344 0x000e99f0 0 MTA (Threadpool Worker) ... 138 0x16e0 0x2714d8b0 0x1800220 Enabled 0x12564098:0x125640a0 0x00153848 1 MTA (Threadpool Worker) 137 0x1f38 0x346b40a8 0x1800220 Enabled 0x1c585544:0x1c5858b8 0x00153848 1 MTA (Threadpool Worker) 136 0xaa8 0x2724d9c0 0x1800222 Disabled 0x1e5673b4:0x1e567eec 0x025deef0 0 MTA (Threadpool Worker) 135 0x1e74 0x271fc300 0x1800220 Enabled 0x12630fa8:0x12631f54 0x00153848 1 MTA (Threadpool Worker) 134 0x1d34 0x34569f88 0x1800220 Enabled 0x1053cae8:0x1053d0d8 0x00153848 1 MTA (Threadpool Worker) 133 0x1f8 0x2931a9b8 0x1800220 Enabled 0x166eb118:0x166ec814 0x00153848 1 MTA (Threadpool Worker)...
If we look at the stack of thread 83 we can see that it has in fact triggered a garbage collection (since it is in GarbageCollectGeneration) and is currently initializing a garbage collection by suspending all .net threads (SysSuspendForGC)
0:083> kb ChildEBP RetAddr Args to Child 0c8feef8 7c822114 77e67143 00000001 0c8fef48 ntdll!KiFastSystemCallRet 0c8feefc 77e67143 00000001 0c8fef48 00000001 ntdll!NtWaitForMultipleObjects+0xc 0c8fefa4 77e6109d 00000001 0c8ff0e4 00000000 kernel32!WaitForMultipleObjectsEx+0x11a 0c8fefc0 791f2ff8 00000001 0c8ff0e4 00000000 kernel32!WaitForMultipleObjects+0x18 0c8ff1f8 791f311c 00000001 00000000 000f0008 mscorsvr!Thread::SysSuspendForGC+0x248 0c8ff210 791f337d 00000001 00000000 000f0008 mscorsvr!GCHeap::SuspendEE+0xcf 0c8ff22c 791f5775 00000000 00000000 00000024 mscorsvr!GCHeap::GarbageCollectGeneration+0x13f 0c8ff25c 791bd4ae 26a3d318 00000024 00000000 mscorsvr!gc_heap::allocate_more_space+0x181 0c8ff484 791b5411 26a3d318 00000024 00000000 mscorsvr!GCHeap::Alloc+0x7b 0c8ff498 791bd5d0 00000024 00000000 00000000 mscorsvr!Alloc+0x3a 0c8ff4b4 791b2a6b 0000000a 1a2eec78 791b2a9c mscorsvr!SlowAllocateString+0x26 0c8ff4c0 791b2a9c 1a2ef2ac 1a2eec78 18404c38 mscorsvr!JIT_TrialAlloc::GenAllocArray+0x29c 0c8ff528 799a012d 0c8ff61c 00000000 1a2ef2ac mscorsvr!JIT_TrialAlloc::GenAllocArray+0x2cd 0c8ff52c 0c8ff61c 00000000 1a2ef2ac 1a2eec78 mscorlib_79990000+0x1012d 0c8ff530 00000000 1a2ef2ac 1a2eec78 02ea6c44 0xc8ff61c
To complete this step, all threads that are executing managed code needs to be suspended before any real garbage collection can occur, and in order to suspend a .NET thread, the thread has to have preemptive gc enabled. Enabling preemptive GC basically means that the code is allowing for someone to go in and stop it while doing a garbage collection, and in most cases preemptive GC is enabled for all threads (except for the GC trigger).
If you rest your detective eyes on the !threads output again, you will probably spot thread 136 which has preemptive GC disabled. (marking the line in bold might have helped as well:))
This means that this thread can not be suspended for some reason, and until it is suspended, garbage collection can not occur so we will just have to sit tight until this thread decides to enable preemptive GC.
But... why does it have preemptive GC disabled? Let's take a look at it...
0:136> kb 2000 ChildEBP RetAddr Args to Child 09a6e9fc 7c822124 7c83970f 0000019c 00000000 ntdll!KiFastSystemCallRet 09a6ea00 7c83970f 0000019c 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 09a6ea3c 7c839620 00000000 00000004 00000001 ntdll!RtlpWaitOnCriticalSection+0x19c 09a6ea5c 7c832ad0 7c889d94 00000000 00000000 ntdll!RtlEnterCriticalSection+0xa8<br>09a6ea90 7c833ea9 00000001 00000000 09a6eacc ntdll!LdrLockLoaderLock+0xe4 09a6ed00 77e41bcc 27026bc0 09a6ed4c 09a6ed2c ntdll!LdrLoadDll+0xc9 09a6ed68 791e535b 79248bc0 00000000 00000000 kernel32!LoadLibraryExW+0x1b2 09a6ed84 79338170 79248bc0 00000000 00000000 mscorsvr!WszLoadLibraryEx+0x5f 09a6edac 793381e4 1e566fec 79c208f8 09a6ee9c mscorsvr!COMX509Certificate::LoadCertificateContext+0x2d4 09a6edbc 0259df1d 09a6edc8 1e566db0 34660170 mscorsvr!COMX509Certificate::BuildFromContext+0xf WARNING: Frame IP not in any known module. Following frames may be wrong. 09a6eed8 791b36be 00000000 00000012 09a6eef4 0x259df1d 09a6eee8 791b33cc 00000000 09a6f008 791bddd7 mscorsvr!ArgIterator::ArgIterator+0x55 09a6ef28 791bdac5 00000000 00000000 09a6f008 mscorsvr!CallDescrWorker+0x30 09a6ef38 791bdcc9 00000000 79c0e02b 79b7c000 mscorsvr!MetaSig::SizeOfActualFixedArgStack+0x14 09a6f008 791cccc1 00c0e02b 79b7c000 00000001 mscorsvr!MethodDesc::CallDescr+0x79 09a6f0c4 791ccd75 79c0e02b 79b7c000 79c0e070 mscorsvr!MethodDesc::CallDescr+0x4f 09a6f0ec 79312dbf 09a6f104 03775518 2724d9c0 mscorsvr!MethodDesc::Call+0x97 09a6f11c 791ccba0 09a6f184 791b3704 ffffffff mscorsvr!RegisterWaitForSingleObjectCallback_Worker+0x9b 09a6f164 79312e71 03775518 79312d24 09a6f184 mscorsvr!Thread::DoADCallBack+0x5c 09a6f1d8 79234695 0863b688 00000000 34449408 mscorsvr!RegisterWaitForSingleObjectCallback+0xab 09a6f1f0 791cce67 34449408 27297bd0 793eced0 mscorsvr!ThreadpoolMgr::AsyncCallbackCompletion+0x16 09a6f204 791cce26 27297bd0 00000000 791ccdb6 mscorsvr!ThreadpoolMgr::ExecuteWorkRequest+0x19 09a6f224 791f03e4 00000000 00000000 00000000 mscorsvr!ThreadpoolMgr::WorkerThreadStart+0x129 09a6ffb8 77e6608b 2723c428 00000000 00000000 mscorsvr!ThreadpoolMgr::intermediateThreadProc+0x44 09a6ffec 00000000 791f03a3 2723c428 00000000 kernel32!BaseThreadStart+0x34 0:136> !clrstack Thread 136 ESP EIP 0x09a6ede8 0x7c82ed54 [FRAME: ECallMethodFrame] [DEFAULT] [hasThis] I4 System.Security.Cryptography.X509Certificates.X509Certificate.BuildFromContext(I) 0x09a6edf8 0x79a712fa [DEFAULT] [hasThis] Void System.Security.Cryptography.X509Certificates.X509Certificate..ctor(I) 0x09a6ee04 0x08468bf9 [DEFAULT] [hasThis] Void System.Net.SecureChannel.ExtractServerCertificate() 0x09a6ee2c 0x0259dcb4 [FRAME: InterceptorFrame] [DEFAULT] [hasThis] Void System.Net.SecureChannel.ExtractServerCertificate() 0x09a6ee3c 0x08468ba7 [DEFAULT] [hasThis] Class System.Security.Cryptography.X509Certificates.X509Certificate System.Net.TlsStream.get_Certificate() 0x09a6ee44 0x08468af2 [DEFAULT] [hasThis] Boolean System.Net.ServicePoint.AcceptCertificate(Class System.Net.TlsStream,Class System.Net.WebRequest) 0x09a6ee5c 0x08465ecd [DEFAULT] [hasThis] ValueClass System.Net.WebExceptionStatus System.Net.Connection.ConstructTlsChannel(String,Class System.Net.HttpWebRequest,ByRef Class System.Net.Sockets.NetworkStream,Class System.Net.Sockets.Socket) 0x09a6ee78 0x07408863 [DEFAULT] [hasThis] ValueClass System.Net.WebExceptionStatus System.Net.Connection.ConstructTransport(Class System.Net.Sockets.Socket,ByRef Class System.Net.Sockets.NetworkStream,Class System.Net.HttpWebRequest) 0x09a6eeac 0x07402e6d [DEFAULT] [hasThis] Void System.Net.Connection.StartConnectionCallback(Object,Boolean) 0x09a6f128 0x791b33cc [FRAME: ContextTransitionFrame]
Ok, so thread 136 is in the middle of getting us a connection (probably for another web service call) and in the midst of doing so, it needs to get some certificate and to do that it needs to load up some dll. Somewhere along this line it decided to disable preemptive GC to avoid being interrupted by the GC, and loading a dll is usually a pretty common reason for disabling preemptive GC, and typically it will be enabled as soon as we have the dll, so there is nothing strange here...
But wait... so so far we know that we have a lot of threads waiting for a GC, and the GC is waiting for this thread to enable preemptive GC, so what is this thread waiting on? Well, judging by the title of the post and by the callstack, I would be surprised if you didn't say "the loader lock".
The loader lock is taken when you load a dll, and it uses a critical section to synchronize so the next step on our path is to find the owner of the loader lock to see why it isn't letting go of it...
I'm going to use a command from sieextpub.dll called !critlist to see who the owner of the critical section is...
0:136> !critlist CritSec at 7c889d94. Owned by thread 129. Waiting Threads: 66 79 81 84 89 108 111 115 123 136 172 174 178
In here we can see that my thread (Thread 136) is waiting for critical section 7c889d94 (which nicely matches the 1st param of RtlEnterCriticalSection in my stack), and that the owner is... tata... thread 129.
You guessed it, the next step is to take a look at the stack for thread 129.
0:129> kb ChildEBP RetAddr Args to Child 0955f9b0 7c822124 77e6bad8 00000360 00000000 ntdll!KiFastSystemCallRet 0955f9b4 77e6bad8 00000360 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 0955fa24 77e6ba42 00000360 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac 0955fa38 791f2dde 00000360 ffffffff 791f2d8e kernel32!WaitForSingleObject+0x12 0955fa44 791f2d8e 2720c868 00000000 00000000 mscorsvr!GCHeap::WaitUntilGCComplete+0x4f 0955fa54 792ae7f9 04db0000 00000000 00000000 mscorsvr!Thread::RareDisablePreemptiveGC+0xb5 0955fb00 7925420a 04db0000 00000003 00000000 mscorsvr!SystemDomain::RunDllMain+0x7d 0955fd88 7917daa0 00000001 00000003 00000000 mscorsvr!ExecuteDLL+0x3c0 0955fda0 7c82257a 04db0000 00000003 00000000 mscoree!_CorDllMain+0x40 0955fdc0 7c83c195 04e5bd94 04db0000 00000003 ntdll!LdrpCallInitRoutine+0x14 0955fe74 77e661d6 793eced0 00000000 0955fea4 ntdll!LdrShutdownThread+0xd2 0955fe84 79278926 00000000 00000000 791ccdb6 kernel32!ExitThread+0x2f 0955fea4 791f03e4 00000000 fc71ec80 00000000 mscorsvr!ThreadpoolMgr::WorkerThreadStart+0x154 0955ffb8 77e6608b 270bf160 00000000 00000000 mscorsvr!ThreadpoolMgr::intermediateThreadProc+0x44 0955ffec 00000000 791f03a3 270bf160 00000000 kernel32!BaseThreadStart+0x34
Ok, this is plain bad... we are in mscoree!_CorDllMain calling mscorsvr!ExecuteDLL etc. and in the end we are waiting for the GC to complete. In other words we can not finish here and unlock the loader lock until the GC has completed.
So to summarize we have thread 83 triggering a GC and all managed threads waiting for this GC to complete. Thread 83 is waiting to suspend thread 136 before it can perform the GC. Thread 136 needs the loader lock to enable preemptive GC so it needs to wait for 129 to finish and unlock the loader lock, but thread 129 can't release the loader lock until the GC finishes, ergo, we have a deadlock.
How do we get into the deadlock situation?
Ok, fine, so we know we have a deadlock and what threads and locks are involved and all that... how does that help me?
This type of deadlock between the GC and the loader lock is commonly referred to as the mixed-mode dll issue because it is typically caused by some managed C++ dll which calls native code from DllMain allowing it to own the loader lock and wait for a GC at the same time.
The issue is described in more detail in http://msdn2.microsoft.com/en-us/library/aa290048(vs.71).aspx
So if that was the case we would see MyMixedModeDll!DllMain somewhere on the stack and we could point to MyMixedModeDll.dll as being the dll we needed to fix up to avoid this problem. Calling .net code from DllMain is a non-supported scenario because of this particular deadlock problem.
Sooo... in our case... is it mscorsvr.dll who is the bad guy? mscorsvr.dll is the core of the .net framework, surely it wouldn't do this... so what is happening here?
If our dll was a native dll, we would call its DllMain method at different times during the process execution. One of these places is on a thread shutdown, which is exactly what is happening here... We would call DllMain with a parameter indicating that a thread that had this dll loaded is shutting down so it can do whatever it needs to do in that case. However our dll is not a native dll so it doesn't have a DllMain, instead we call into a "fake" DllMain called _CorDllMain to notify the dll and we pass in the address of the dll. So in this case we can identify the dll by seeing what dll is loaded at 04db0000 (1st parameter to _CorDllMain).
0:129> lm start end module name ... 04db0000 050fd000 MyMixedModeDll (no symbols) ...
Ok, so if we have no DllMain, why are we holding the loader lock and allowing for a GC to happen? The reason is because this dll has exported public static valuetypes, which is another no-no for mixed mode dlls as you can see in this kb article http://support.microsoft.com/?id=814472.
Compilation of this dll probably generated a linker warning like the ones you see in the kb, but you know how things go, if it compiles it ships:) jk, hopefully this should serve as a reminder to check out those warnings...
To resolve the deadlock we need to fix up these statics, or in my customers case, go back to the 3rd party and make them fix them, because as you probably guessed, the real dll wasn't called MyMixedModeDll, but I'm choosing to let the actual component remain unnamed:)
Until next time,