Biztalk Server 2004 - Mixed Dll loading problem

Sometime ago I had to troubleshoot a problem in Biztalk Server 2004. I was told that Biztalk Server 2004 apparently stopped working and the only way to get it back on track would be to restart the instances and we didn´t had errors reported. After that everything would run smoothly. The worst part was that this only happened from time to time.

Since this appeared to be a hang we collected three memory dumps with 2-3 minutes apart when the issue showed up again.

The first thing we looked at was threadpool.

0:136> !threadpool

CPU utilization 81%

Worker Thread: Total: 101 Running: 101 Idle: 1 MaxLimit: 250 MinLimit: 100

Work Request in Queue: 1148640

--------------------------------------

Number of Timers: 0

--------------------------------------

Completion Port Thread: Total: 0 Free: 0 MaxFree: 4 CurrentLimit: 0 MaxLimit: 250 MinLimit: 100

 

I see CPU utilization equals 81%. This usually means we are in middle of a GC. Looking at my threads I can see that thread 102 triggered the GC.

108 0x2ba8 0x116b8228 0x3800220 Enabled 0x00000000:0x00000000 0x000e6dc0 0 MTA (Threadpool Worker)

102 0x1bb8 0x116b8780 0x1800220 Disabled 0x05a77d84:0x05a79d58 0x000e6dc0 1 MTA (GC) (Threadpool Worker)

115 0x22ac 0x116b8948 0x1800220 Enabled 0x00000000:0x00000000 0x000e6dc0 1 MTA (Threadpool Worker)

Looking closer at this thread stack we can see a call to SysSuspendForGC. This means that we are going to suspend all threads to allow the GC to proceed.

0:102> kb

ChildEBP RetAddr Args to Child

120ef0dc 7c822114 77e67143 00000001 120ef12c ntdll!KiFastSystemCallRet [d:\srvrtm\base\ntos\rtl\i386\userdisp.asm @ 545]

120ef0e0 77e67143 00000001 120ef12c 00000001 ntdll!ZwWaitForMultipleObjects+0xc [d:\srvrtm\base\ntdll\daytona\obj\i386\usrstubs.asm @ 2363]

120ef188 77e6109d 00000001 120ef2c8 00000000 kernel32!WaitForMultipleObjectsEx+0x11a [d:\nt\base\win32\client\synch.c @ 1521]

120ef1a4 791feb58 00000001 120ef2c8 00000000 kernel32!WaitForMultipleObjects+0x18 [d:\nt\base\win32\client\synch.c @ 1386]

120ef3dc 791fec7c 00000001 00000001 116b8780 mscorsvr!Thread::SysSuspendForGC+0x248

120ef3f4 791fee14 00000001 00000000 ffffffff mscorsvr!GCHeap::SuspendEE+0xcf

120ef410 7924f4be 00000002 00000000 ffffffff mscorsvr!GCHeap::GarbageCollectGeneration+0x13f

120ef42c 792ef053 ffffffff 00000000 120ef498 mscorsvr!GCHeap::GarbageCollect+0x3c

120ef43c 0b961f3a 120ef448 ffffffff 00000000 mscorsvr!GCInterface::CollectGeneration+0xf

Suspending all threads is allowed if preemptive is true. Otherwise it won´t be possible to suspend the thread. Now looking at my threads there is actually one that has preemptive set to Disable. This is thread 76.

74 0x2fa8 0x0bd2c2e0 0x1800220 Enabled 0x05a9b104:0x05a9d0d8 0x000e6dc0 0 MTA (Threadpool Worker)

 76 0x1958 0x0bd2c4a8 0x1800222 Disabled 0x05a95144:0x05a970d8 0x000e9a08 1 MTA (Threadpool Worker)

 60 0x18f0 0x0bd2c670 0x3800220 Enabled 0x00000000:0x00000000 0x000e6dc0 0 MTA (Threadpool Worker)

Lets switch to thread 76 and see what it is doing.

0:076> kb

ChildEBP RetAddr Args to Child

0de6e8d4 7c822124 7c83970f 00000174 00000000 ntdll!KiFastSystemCallRet [d:\srvrtm\base\ntos\rtl\i386\userdisp.asm @ 545]

0de6e8d8 7c83970f 00000174 00000000 00000000 ntdll!NtWaitForSingleObject+0xc [d:\srvrtm\base\ntdll\daytona\obj\i386\usrstubs.asm @ 2371]

0de6e914 7c839620 00000000 00000004 00000001 ntdll!RtlpWaitOnCriticalSection+0x19c [d:\srvrtm\base\ntdll\resource.c @ 619]

0de6e934 7c837a15 7c889d94 00000000 0de6ea30 ntdll!RtlEnterCriticalSection+0xa8 [d:\srvrtm\base\ntdll\resource.c @ 870]

0de6e968 7c836c23 00000000 00000000 0de6e9b0 ntdll!LdrLockLoaderLock+0x133 [d:\srvrtm\base\ntdll\ldrapi.c @ 2746]

0de6e9e4 7c836cf9 00000001 00000001 00000000 ntdll!LdrGetDllHandleEx+0x94 [d:\srvrtm\base\ntdll\ldrapi.c @ 719]

0de6ea00 77e665e3 00000001 00000000 0de6ea80 ntdll!LdrGetDllHandle+0x18 [d:\srvrtm\base\ntdll\ldrapi.c @ 614]

0de6ea4c 77e92330 0de6ea80 00000000 0de6f160 kernel32!GetModuleHandleForUnicodeString+0x20 [d:\nt\base\win32\client\module.c @ 1278]

0de6eec4 77e663f4 00000001 00000002 792323a4 kernel32!BasepGetModuleHandleExW+0x14f [d:\nt\base\win32\client\module.c @ 1556]

0de6eedc 792322cd 792323a4 00000000 0de6f160 kernel32!GetModuleHandleW+0x29 [d:\nt\base\win32\client\module.c @ 1368]

0de6f130 793cd536 00000000 00000000 119c5740 mscorsvr!LoadImageHlp+0x1e

0de6f148 7932bf1e 00000000 0bd2c4a8 0de6f840 mscorsvr!AssemblyFileHash::GenerateDigest+0xb3

It´s waiting on a critical section. In ntdll!RtlEnterCriticalSection the first parameter is in fact the critical section we are interested on.

0:076> !critsec 0x7c889d94

CritSec ntdll!LdrpLoaderLock+0 at 7c889d94

WaiterWoken No

LockCount 13

RecursionCount 1

OwningThread 9b0

EntryCount 0

ContentionCount 47

*** Locked

The owning thread of this section is thread 9b0 that corresponds to thread 136. Now the funny part is that thread 136 is waiting for GC to complete. Now this looks like a deadlock. No wonder that Biztalk stopped working.

0:136> kb

ChildEBP RetAddr Args to Child

12b6faa4 7c822124 77e6bad8 00000288 00000000 ntdll!KiFastSystemCallRet [d:\srvrtm\base\ntos\rtl\i386\userdisp.asm @ 545]

12b6faa8 77e6bad8 00000288 00000000 00000000 ntdll!NtWaitForSingleObject+0xc [d:\srvrtm\base\ntdll\daytona\obj\i386\usrstubs.asm @ 2371]

12b6fb18 77e6ba42 00000288 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac [d:\nt\base\win32\client\synch.c @ 1246]

12b6fb2c 791fed5c 00000288 ffffffff 791fe901 kernel32!WaitForSingleObject+0x12 [d:\nt\base\win32\client\synch.c @ 1147]

12b6fb38 791fe901 0e3c5b08 00000000 00000000 mscorsvr!GCHeap::WaitUntilGCComplete+0x4f

12b6fb48 792b2ad1 0ce40000 00000000 00000000 mscorsvr!Thread::RareDisablePreemptiveGC+0xb5

12b6fbf4 79258ec2 0ce40000 00000003 00000000 mscorsvr!SystemDomain::RunDllMain+0x7d

12b6fe7c 7917daa0 00000001 00000003 00000000 mscorsvr!ExecuteDLL+0x3c0

12b6fe94 7c82257a 0ce40000 00000003 00000000 mscoree!_CorDllMain+0x40

The badboy can be seen has the dll in the first argument into mscoree!_CorDllMain. In this case it is 0Ce40000. This address correspond to my badboy dll. In our case since this was a custom dll the solution was the development team to fix it.

0:136> lmvm badboydll

start end module name

0ce40000 0ce48000 badboydll (deferred)

This kind of deadlock between the loader lock and GC is a known issue and you can see more details in here:Mixed DLL Loading Problem.

Also Tess (http://blogs.msdn.com/tess) has wrote a very good article on this (and how valuable it was to me to troubleshoot this issue. You can see this article here: .NET Hang Case Study: The GC-Loader Lock Deadlock (a story of mixed mode dlls).