Red alert! My Server is hung - what do I do?

So you have a dump from a hung server and you’re the first person on the scene. Your IT Manager is jumping up and down, the phone is ringing off the hook and people are hovering outside your cube. It’s game time and the pressure is on!!! Now what do you do?

Well, take a deep breath, get a cup of coffee, and relax, because I’m here to help you out! Let me share what we typically do on our first pass through a hung server kernel debug. This works for both live debugs and dumps. These are steps you can take, and they will find problems!

Here’s something else to consider. If the server is mission critical you will probably want to get a dump vs. a live debug so you can get the server back up and running. This will take the pressure off because you can then do the debug offline, and if need be, send the dump to other people for review.

Before we get started, let me state that the following data is completely fabricated, and many of the process names and addresses in this output have been made up. Do not question odd offsets or alignments.

I’m also assuming that you know how to:

1. Collect a kernel dump: http://support.microsoft.com/kb/244139

2. Set up the debugger: http://www.microsoft.com/whdc/devtools/debugging/installx86.mspx

3. Use the symbol server: http://support.microsoft.com/kb/311503

0) Before I start these types of debugs I like to open a log file.

1: kd> .logopen H:\repro\hungserver.log

Opened log file 'H:\repro\hungserver.log'

1) !vm - Look for memory usage. Generally speaking you want to look at what the current pool or memory usage values are and compare them to the max available.

1: kd> !vm

 

 

*** Virtual Memory Usage ***

      Physical Memory: 982890 ( 3931560 Kb)

      Page File: \??\P:\pagefile.sys

        Current: 3931560 Kb Free Space: 3742548 Kb

        Minimum: 3931560 Kb Maximum: 4193280 Kb

      Available Pages: 631300 ( 2525200 Kb)

      ResAvail Pages: 888171 ( 3552684 Kb)

      Locked IO Pages: 195 ( 780 Kb)

      Free System PTEs: 202830 ( 811324 Kb) < THIS IS OK

      Free NP PTEs: 32765 ( 131060 Kb) < THIS IS OK

      Free Special NP: 0 ( 0 Kb)

      Modified Pages: 241 ( 964 Kb)

      Modified PF Pages: 241 ( 964 Kb)

      NonPagedPool Usage: 11377 ( 45508 Kb) < THIS IS OK

      NonPagedPool Max: 65536 ( 262144 Kb)

      PagedPool 0 Usage: 6398 ( 25592 Kb)

      PagedPool 1 Usage: 2201 ( 8804 Kb)

      PagedPool 2 Usage: 2216 ( 8864 Kb)

      PagedPool 3 Usage: 2179 ( 8716 Kb)

      PagedPool 4 Usage: 2199 ( 8796 Kb)

      PagedPool Usage: 15193 ( 60772 Kb) < THIS IS OK

      PagedPool Maximum: 67584 ( 270336 Kb)

      Shared Commit: 24569 ( 98276 Kb)

      Special Pool: 0 ( 0 Kb)

      Shared Process: 12519 ( 50076 Kb)

      PagedPool Commit: 15252 ( 61008 Kb)

      Driver Commit: 2083 ( 8332 Kb)

      Committed pages: 313611 ( 1254444 Kb) < THIS IS OK

      Commit limit: 1925815 ( 7703260 Kb)

Check to see if any apps are using tons of memory. In this case I don’t see a problem.

      Total Private: 239673 ( 958692 Kb)

         36b0 EXCEL.EXE 10775 ( 43100 Kb) < THIS IS OK, etc

         2ee8 myapploc.exe 10288 ( 41152 Kb)

         097c MySSrv.exe 7497 ( 29988 Kb)

         0418 MyFun32.exe 6277 ( 25108 Kb)

         0474 svchost.exe 6164 ( 24656 Kb)

         1be8 ABCDEFGH.EXE 4984 ( 19936 Kb)

         0480 IEXPLORE.EXE 4924 ( 19696 Kb)

         09c4 ANOTHER.exe 4768 ( 19072 Kb)

         19a4 HMMINTER.exe 4207 ( 16828 Kb)

         1b30 ohboya.EXE 4146 ( 16584 Kb)

         4558 aprocess.EXE 4138 ( 16552 Kb)

         30e8 another.exe 3691 ( 14764 Kb)

         0924 aservicec.exe 3508 ( 14032 Kb)

         0854 RRXXc.exe 3400 ( 13600 Kb)

         3458 MYWIN.EXE 3389 ( 13556 Kb)

         0d90 FunService.exe 3298 ( 13192 Kb)

         1180 CustomAp.exe 3221 ( 12884 Kb)

         06ac XYZvrver.exe 2769 ( 11076 Kb)

         2cdc ABCDEFGH.exe 2591 ( 10364 Kb)

         02f4 lsass.exe 2567 ( 10268 Kb)

         21b4 IEXPLORE.EXE 2516 ( 10064 Kb)

         3420 Process.exe 2450 ( 9800 Kb)

         4cd4 XYZXY.EXE 2305 ( 9220 Kb)

         4a30 lookup.EXE 2244 ( 8976 Kb)

         4360 Process.exe 2201 ( 8804 Kb)

         0564 spoolsv.exe 2166 ( 8664 Kb)

         2e5c XYZXYZEXE 2076 ( 8304 Kb)

         02bc winlogon.exe 1964 ( 7856 Kb)

         4e48 winlogon.exe 1958 ( 7832 Kb)

         42bc ABCDEFGH.exe 1943 ( 7772 Kb)

         0eb8 svchost.exe 1922 ( 7688 Kb)

         3b98 Process.exe 1919 ( 7676 Kb)

         4c1c IEXPLORE.EXE 1864 ( 7456 Kb)

         17b8 winlogon.exe 1852 ( 7408 Kb)

         3124 winlogon.exe 1849 ( 7396 Kb)

         14b8 winlogon.exe 1847 ( 7388 Kb)

         32cc winlogon.exe 1843 ( 7372 Kb)

         1f84 winlogon.exe 1843 ( 7372 Kb)

         2ebc winlogon.exe 1842 ( 7368 Kb)

         1548 winlogon.exe 1840 ( 7360 Kb)

         21c4 PROCESS213.EXE 1833 ( 7332 Kb)

         3b58 MYWIN.EXE 1817 ( 7268 Kb)

         4b3c winlogon.exe 1816 ( 7264 Kb)

 

NOTE: If you see high pool values, issue a !poolused 2 and a !poolused 4 to dump the pool usage sorted by nonpaged and paged pool consumption respectively, so you can see which pool tags are consuming pool. (We will write a dedicated blog on this topic later.)
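If you saved the session with .logopen, the "compare usage to the max" check from step 1 can be roughed out in a few lines of script. This is only a minimal sketch: it assumes the !vm text looks like the fabricated output above, and the names parse_vm, percent_used and PAIRS are made up for illustration.

```python
import re

# Matches counter lines such as "NonPagedPool Usage: 11377 ( 45508 Kb)".
VM_LINE = re.compile(r"^\s*([A-Za-z0-9 ]+?):\s+\d+\s+\(\s*(\d+)\s+Kb\)")

# (usage counter, limit counter) pairs, named as they appear in the output above.
PAIRS = [
    ("NonPagedPool Usage", "NonPagedPool Max"),
    ("PagedPool Usage", "PagedPool Maximum"),
    ("Committed pages", "Commit limit"),
]

def parse_vm(text):
    """Return {counter name: size in Kb} for every 'name: pages ( N Kb)' line."""
    values = {}
    for line in text.splitlines():
        match = VM_LINE.match(line)
        if match:
            values[match.group(1).strip()] = int(match.group(2))
    return values

def percent_used(values):
    """Return {usage counter: percent of its limit} for each pair in PAIRS."""
    return {usage: 100.0 * values[usage] / values[limit] for usage, limit in PAIRS}

# A few lines copied from the fabricated !vm output in this post.
SAMPLE = """\
      NonPagedPool Usage: 11377 ( 45508 Kb) < THIS IS OK
      NonPagedPool Max: 65536 ( 262144 Kb)
      PagedPool Usage: 15193 ( 60772 Kb) < THIS IS OK
      PagedPool Maximum: 67584 ( 270336 Kb)
      Committed pages: 313611 ( 1254444 Kb) < THIS IS OK
      Commit limit: 1925815 ( 7703260 Kb)
"""

for name, pct in percent_used(parse_vm(SAMPLE)).items():
    print(f"{name}: {pct:.1f}% of limit")
```

All three counters in the sample sit well below a quarter of their limits, which matches the "THIS IS OK" annotations.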

2) !sysptes - See if any of the free lists is low (fewer than 10 free)

 

1: kd> !sysptes

All of these are ok

System PTE Information

  Total System Ptes 224223

     SysPtes list of size 1 has 225 free

     SysPtes list of size 2 has 57 free

     SysPtes list of size 4 has 136 free

     SysPtes list of size 8 has 59 free

     SysPtes list of size 16 has 95 free

 

    starting PTE: c022b000

    ending PTE: c03dff78

  free blocks: 652 total free: 202831 largest free block: 191973
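The "fewer than 10" rule of thumb from step 2 is easy to check mechanically against a saved log. A minimal sketch, assuming the line format shown above; low_syspte_lists is a hypothetical helper name, and the threshold of 10 is just the value from this post.

```python
import re

# Matches lines such as "SysPtes list of size 8 has 59 free".
SYSPTE_LINE = re.compile(r"SysPtes list of size (\d+) has (\d+) free")

def low_syspte_lists(text, threshold=10):
    """Return [(list size, free count)] for lists with fewer than `threshold` free."""
    return [
        (int(size), int(free))
        for size, free in SYSPTE_LINE.findall(text)
        if int(free) < threshold
    ]

# The five lists from the fabricated !sysptes output in this post.
SAMPLE = """\
     SysPtes list of size 1 has 225 free
     SysPtes list of size 2 has 57 free
     SysPtes list of size 4 has 136 free
     SysPtes list of size 8 has 59 free
     SysPtes list of size 16 has 95 free
"""

print(low_syspte_lists(SAMPLE) or "no low SysPtes lists")
```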

 

3) !defwrites - Check whether cache write throttling is engaged. If it is, the server is doing little other than writing to the disk.

1: kd> !defwrites

*** Cache Write Throttle Analysis ***

      CcTotalDirtyPages: 187 ( 748 Kb)

      CcDirtyPageThreshold: 130560 ( 522240 Kb)

      MmAvailablePages: 631300 ( 2525200 Kb)

      MmThrottleTop: 450 ( 1800 Kb)

      MmThrottleBottom: 80 ( 320 Kb)

      MmModifiedPageListHead.Total: 241 ( 964 Kb)

Write throttles not engaged < THIS IS OK. Good = NOT engaged.

 

 

4) !ready - See if threads are ready to run but are being held up waiting for a processor

 

 

1: kd> !ready

Processor 0: No threads in READY state < THIS IS OK

Processor 1: No threads in READY state < THIS IS OK

If we had threads in a ready state you would want to investigate what those threads were and what is running on the processor.

 

 

5) !pcr x, then kv, on each processor (x is the processor number) - If a processor isn't idle, it could be running DPCs

 

 

1: kd> !pcr 0 < Dump the processor control region for CPU 0

KPCR for Processor 0 at ffdff000:

    Major 1 Minor 1

      NtTib.ExceptionList: ffffffff

          NtTib.StackBase: 00000000

         NtTib.StackLimit: 00000000

       NtTib.SubSystemTib: 80042000

            NtTib.Version: 012e7ace

        NtTib.UserPointer: 00000001

            NtTib.SelfTib: 00000000

                  SelfPcr: ffdff000

                     Prcb: ffdff120

                     Irql: 00000000

                      IRR: 00000000

                      IDR: ffffffff

            InterruptMode: 00000000

                      IDT: 8003f400

                      GDT: 8003f000

                      TSS: 80042000

            CurrentThread: 8056cd00

               NextThread: 00000000

               IdleThread: 8056cd00

                DpcQueue: < NO DPCs: Not much to look at then

    

1: kd> !pcr 1 < Dump the processor control region for CPU 1

KPCR for Processor 1 at f773f000:

    Major 1 Minor 1

      NtTib.ExceptionList: f5ba1d30

          NtTib.StackBase: 00000000

         NtTib.StackLimit: 00000000

       NtTib.SubSystemTib: f773fef0

            NtTib.Version: 0121925d

        NtTib.UserPointer: 00000002

            NtTib.SelfTib: 7ffda000

                  SelfPcr: f773f000

                     Prcb: f773f120

                     Irql: 00000000

                      IRR: 00000000

                      IDR: ffffffff

            InterruptMode: 00000000

                      IDT: f77456e0

                      GDT: f77452e0

                      TSS: f773fef0

            CurrentThread: 8963cb90

               NextThread: 00000000

               IdleThread: f7741fa0

                DpcQueue: < NO DPCs: Not much to look at then
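One quick way to tell whether a processor is idle from the !pcr output is to compare CurrentThread with IdleThread: if they are the same address, the processor is running its idle thread. Here is a small sketch of that comparison, assuming the field layout shown above; is_idle is a made-up helper name.

```python
import re

def is_idle(pcr_text):
    """True when CurrentThread equals IdleThread in a single !pcr listing."""
    fields = dict(
        re.findall(r"(CurrentThread|IdleThread):\s*([0-9a-fA-F]+)", pcr_text)
    )
    return fields["CurrentThread"].lower() == fields["IdleThread"].lower()

# The relevant fields from the two fabricated !pcr listings in this post.
CPU0 = """\
            CurrentThread: 8056cd00
               NextThread: 00000000
               IdleThread: 8056cd00
"""
CPU1 = """\
            CurrentThread: 8963cb90
               NextThread: 00000000
               IdleThread: f7741fa0
"""

for number, text in enumerate((CPU0, CPU1)):
    state = "idle" if is_idle(text) else "busy, so go look at its stack with kv"
    print(f"Processor {number}: {state}")
```

In the sample, processor 0 is idle and processor 1 is running thread 8963cb90, so processor 1 is the one whose stack is worth a look.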

 

6) !locks - Look for deadlocks and contention

The following output is of interest.

A thread ID with <*> next to it has exclusive access to the resource, and all the other threads are waiting on that thread to finish its work. Typically you would !thread the owner thread marked <*> (e.g., !thread 87bddda0) to see what it is doing. If two threads each have exclusive access to a different resource, and each thread is in the other resource’s exclusive waiters list, you have a deadlock. The following is an example of what a deadlock might look like: thread 87bddda0 owns resource 0x81234567 and is waiting on 0x83245678, while thread 87abc123 owns 0x83245678 and is waiting on 0x81234567. In this case you would !thread each owner and evaluate the logic of the code in each stack that allowed the threads to get into this state.

 

1: kd> !locks

**** DUMP OF ALL RESOURCE OBJECTS ****

KD: Scanning for held locks......

Resource @ 0x8a50ee98 Shared 4 owning threads

     Threads: 896856d0-01<*> 89686778-01<*> 896862d0-01<*> 89685da0-01<*>

KD: Scanning for held locks............................................................

Resource @ 0x896da1bc Exclusively owned

     Threads: 896e3b20-01<*>

KD: Scanning for held locks..

Resource @ 0x81234567 Shared 1 owning threads

    Contention Count = 15292

    NumberOfSharedWaiters = 1

    NumberOfExclusiveWaiters = 39

     Threads: 87bddda0-01<*> 806d2020-01

     Threads Waiting On Exclusive Access:

              80ced020 80c036f8 80cdc7a0 80c438b0

              80e6cda0 80f96987 8007fd60 8004dc10

              80d7b020 80a2dd70 80b89620 80b58020

              8036eda0 87abc123 80606da0 8056e890

              802b3630 80cc7590 80d64020 80f7dda0

              80129580 80b73da0 806d2578 80b505d8

      

KD: Scanning for held locks................

Resource @ 0x83245678 Exclusively owned

    Contention Count = 4827

    NumberOfExclusiveWaiters = 35

     Threads: 87abc123-01<*>

     Threads Waiting On Exclusive Access:

              803e6aa0 80876020 80240020 80f56588

              808174f0 80bd6b28 80c3c448 8046d6c8

              801e8da0 80356518 80b4c978 8069e020

              80cb9020 87bddda0 80c65020 86daaac0

              80379020 80fe4020
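Spotting that cross-reference by eye gets painful when the waiter lists are long. The sketch below builds a "waiter waits on owner" graph from saved !locks text and looks for a cycle. It is an illustration only: it assumes the output layout shown above, it treats every thread on a "Threads:" line as an owner (starred or not), and parse_locks, wait_graph and find_cycle are hypothetical names.

```python
import re
from collections import defaultdict

RESOURCE = re.compile(r"^Resource @ (0x[0-9a-fA-F]+)")
OWNER = re.compile(r"([0-9a-fA-F]{8})-\d+")      # e.g. "87bddda0-01<*>"
ADDRESS = re.compile(r"\b[0-9a-fA-F]{8}\b")      # bare thread address

def parse_locks(text):
    """Return {resource address: (owner set, waiter set)}."""
    resources, current, in_waiters = {}, None, False
    for raw in text.splitlines():
        line = raw.strip()
        match = RESOURCE.match(line)
        if match:
            current = resources[match.group(1)] = (set(), set())
            in_waiters = False
        elif current is None or not line:
            continue                              # skip preamble and blank lines
        elif line.startswith("Threads Waiting On"):
            in_waiters = True
        elif line.startswith("Threads:"):
            current[0].update(OWNER.findall(line))
            in_waiters = False
        elif line.startswith("KD:"):
            in_waiters = False
        elif in_waiters:
            current[1].update(ADDRESS.findall(line))
    return resources

def wait_graph(resources):
    """Map each waiting thread to the set of owner threads it is blocked behind."""
    graph = defaultdict(set)
    for owners, waiters in resources.values():
        for waiter in waiters:
            graph[waiter].update(owners - {waiter})
    return graph

def find_cycle(graph):
    """Return a list of threads that wait on each other in a loop, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            return visiting[visiting.index(node):]
        visiting.append(node)
        for nxt in sorted(graph.get(node, ())):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in sorted(graph):
        cycle = visit(start)
        if cycle:
            return cycle
    return None

# Trimmed from the two contended resources in the fabricated output above.
SAMPLE = """\
Resource @ 0x81234567 Shared 1 owning threads
     Threads: 87bddda0-01<*> 806d2020-01
     Threads Waiting On Exclusive Access:
              8036eda0 87abc123 80606da0 8056e890
Resource @ 0x83245678 Exclusively owned
     Threads: 87abc123-01<*>
     Threads Waiting On Exclusive Access:
              80cb9020 87bddda0 80c65020 86daaac0
"""

print(find_cycle(wait_graph(parse_locks(SAMPLE))))
```

On the sample, the cycle it reports consists of 87bddda0 and 87abc123, which are the two owners you would !thread next.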

 

 

8) !process 0 0 - Search for drwtsn32. This would indicate that we have a process that has crashed and is in the process of being dumped. This could cause a server hang. Look at the PEB for drwtsn32 and get its command line to see what process is being dumped. You should be able to do this by taking its EPROCESS address (the value after PROCESS in the !process 0 0 output) and doing a .process ADDRESS;.reload;!peb

The following shows how to extract the command line for any process; the same steps work for Dr. Watson.

 

1: kd> .process 89f31020

Implicit process is now 89f31020

1: kd> .reload

Loading Kernel Symbols

...........................................................................................................................................

Loading User Symbols

...............................

Loading unloaded module list

...............

1: kd> !peb

PEB at 7ffdf000

    InheritedAddressSpace: No

    ReadImageFileExecOptions: Yes

    BeingDebugged: No

    ImageBaseAddress: 01000000

    Ldr 77fc23a0

    Ldr.Initialized: Yes

    Ldr.InInitializationOrderModuleList: 00171ef8 . 00176c90

    Ldr.InLoadOrderModuleList: 00171e90 . 00176c80

    Ldr.InMemoryOrderModuleList: 00171e98 . 00176c88

            Base TimeStamp Module

         1000000 3e80245d Mar 24 05:41:49 2003 \??\P:\WINDOWS\system32\winlogon.exe

        77f40000 3e802494 Mar 25 05:42:44 2003 P:\WINDOWS\system32\ntdll.dll

        77e40000 44c60ec8 Jul 25 08:30:00 2006 P:\WINDOWS\system32\kernel32.dll

        77ba0000 3e802496 Mar 25 05:42:46 2003 P:\WINDOWS\system32\msvcrt.dll

        77da0000 3e802495 Mar 25 05:42:45 2003 P:\WINDOWS\system32\ADVAPI32.dll

        77c50000 40566fc9 Mar 15 23:08:57 2004 P:\WINDOWS\system32\RPCRT4.dll

        77d00000 45e7bafc Mar 02 00:49:48 2007 P:\WINDOWS\system32\USER32.dll

        77c00000 45e7bafc Mar 02 00:49:48 2007 P:\WINDOWS\system32\GDI32.dll

        75970000 3e8024a2 Mar 25 05:42:58 2003 P:\WINDOWS\system32\USERENV.dll

        75810000 3e8024a3 Mar 25 05:42:59 2003 P:\WINDOWS\system32\NDdeApi.dll

        761b0000 3e8024a0 Mar 25 05:42:56 2003 P:\WINDOWS\system32\CRYPT32.dll

       

    SubSystemData: 00000000

    ProcessHeap: 00070000

    ProcessParameters: 00020000

    WindowTitle: '< Name not readable >'

    ImageFile: '\??\P:\WINDOWS\system32\winlogon.exe'

    CommandLine: 'winlogon.exe' < HERE IS THE COMMAND LINE.. No args in this case

 

( output is truncated ... )

9) Look at the handle table size. If it’s over 10000 you may have trouble. If you do have a handle leak, refer to the TalkBackVideo "Understanding handle leaks and how to use !htrace to find them".

1: kd> !process 0 0

**** NT ACTIVE PROCESS DUMP ****

PROCESS 8a613270 SessionId: none Cid: 0004 Peb: 00000000 ParentCid: 0000

    DirBase: 0acc0000 ObjectTable: e1001d10 HandleCount: 2510.

    Image: System

PROCESS 8a294328 SessionId: none Cid: 0274 Peb: 7ffdf000 ParentCid: 0004

    DirBase: ef1ac000 ObjectTable: e14ac1d0 HandleCount: 124.

    Image: smss.exe

PROCESS 8a103424 SessionId: 0 Cid: 02a4 Peb: 7ffdf000 ParentCid: 0274

    DirBase: ed804000 ObjectTable: e18caa68 HandleCount: 1171.

    Image: csrss.exe

PROCESS 8a104343 SessionId: 0 Cid: 02bc Peb: 7ffdf000 ParentCid: 0274

    DirBase: ed539000 ObjectTable: e18c67b0 HandleCount: 498.

    Image: winlogon.exe

PROCESS 8a0f6634 SessionId: 0 Cid: 02e8 Peb: 7ffdf000 ParentCid: 02bc

    DirBase: ece72000 ObjectTable: e1668e40 HandleCount: 568.

    Image: services.exe

PROCESS 8a123423 SessionId: 0 Cid: 02f4 Peb: 7ffdf000 ParentCid: 02bc

    DirBase: ecd7a000 ObjectTable: e16684a0 HandleCount: 30000. < This is bad

    Image: lsass.exe

PROCESS 89f96453 SessionId: 0 Cid: 03e0 Peb: 7ffdf000 ParentCid: 02e8

    DirBase: eb99c000 ObjectTable: e16bb570 HandleCount: 500.

    Image: svchost.exe

PROCESS 8a0c6532 SessionId: 0 Cid: 042c Peb: 7ffdf000 ParentCid: 02e8

    DirBase: eb6d7000 ObjectTable: e1731170 HandleCount: 156.

    Image: svchost.exe

PROCESS 8a0a8d88 SessionId: 0 Cid: 0460 Peb: 7ffdf000 ParentCid: 02e8

    DirBase: eb58f000 ObjectTable: e17372e8 HandleCount: 124.

    Image: svchost.exe

PROCESS 89f77678 SessionId: 0 Cid: 0474 Peb: 7ffdf000 ParentCid: 02e8

    DirBase: eb484000 ObjectTable: e17305b8 HandleCount: 1457.

    Image: svchost.exe
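On a busy server the !process 0 0 listing runs to hundreds of entries, so I find it easier to let a script pick out the large handle counts from the saved log. A minimal sketch, assuming the three-line PROCESS / HandleCount / Image layout shown above; leaky_processes is a made-up name and 10000 is simply the rule of thumb from this post.

```python
import re

# One match per process entry: EPROCESS address, Cid, HandleCount and Image.
PROCESS_BLOCK = re.compile(
    r"PROCESS\s+(?P<eprocess>[0-9a-fA-F]{8,16})\b.*?Cid:\s*(?P<cid>[0-9a-fA-F]+)"
    r".*?HandleCount:\s*(?P<handles>\d+)\..*?Image:\s*(?P<image>\S+)",
    re.DOTALL,
)

def leaky_processes(text, threshold=10000):
    """Return [(image, cid, handle count)] for processes above `threshold` handles."""
    return [
        (m["image"], m["cid"], int(m["handles"]))
        for m in PROCESS_BLOCK.finditer(text)
        if int(m["handles"]) > threshold
    ]

# Three entries copied from the fabricated !process 0 0 output in this post.
SAMPLE = """\
**** NT ACTIVE PROCESS DUMP ****
PROCESS 8a0f6634 SessionId: 0 Cid: 02e8 Peb: 7ffdf000 ParentCid: 02bc
    DirBase: ece72000 ObjectTable: e1668e40 HandleCount: 568.
    Image: services.exe
PROCESS 8a123423 SessionId: 0 Cid: 02f4 Peb: 7ffdf000 ParentCid: 02bc
    DirBase: ecd7a000 ObjectTable: e16684a0 HandleCount: 30000. < This is bad
    Image: lsass.exe
PROCESS 89f96453 SessionId: 0 Cid: 03e0 Peb: 7ffdf000 ParentCid: 02e8
    DirBase: eb99c000 ObjectTable: e16bb570 HandleCount: 500.
    Image: svchost.exe
"""

for image, cid, handles in leaky_processes(SAMPLE):
    print(f"{image} (Cid {cid}) has {handles} handles")
```

The only entry it flags in the sample is lsass.exe with 30000 handles, the same one marked "This is bad" above.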

 

9) !process 0 17 system - Check the worker threads in the system process (search for srv! to find server worker threads). The 17 flags are needed to get the thread stacks; with flags of 0 you only get the process summary. These are the server service threads. What are they doing? Are they blocked on I/O or waiting for a resource?

10) !process 0 17 csrss.exe - Look for 16 LPC server threads.

What are they doing? Are they blocked?

11) !stacks 2 - This will dump every call stack on the server. You may need to go through and evaluate every stack on the server. Look for critical sections, etc.

15) !qlocks - This will allow you to check the state of all the queued spin locks on the machine. For further information on spinlocks refer to the Windows Internals book.

 

1: kd> !qlocks

Key: O = Owner, 1-n = Wait order, blank = not owned/waiting, C = Corrupt

                       Processor Number

    Lock Name 0 1 << Nothing to worry about here.

KE - Dispatcher

MM - Expansion

MM - PFN

MM - System Space

CC - Vacb

CC - Master

EX - NonPagedPool

IO - Cancel

EX - WorkQueue

IO - Vpb

IO - Database

IO - Completion

NTFS - Struct

AFD - WorkQueue

CC - Bcb

MM - NonPagedPool

16) !process 0 17 winlogon.exe - Look for hung LPC calls. If you find an LPC call calling out of winlogon, you can follow the call with the !lpc debugger command. This will allow you to see what the thread is doing in the other process.

 

If you have further questions on any of these commands, please refer to the debugger.chm file in the Windows debugger tools install.

Good luck and happy debugging.

 

“This debugger is mine, there are many like it but this one is mine!” Jeff Dailey