Wait Chain Traversal

John Robbins

Code download available at: Bugslayer2007_07.exe (175 KB)


Basics of Wait Chain Traversal
Mixed Debugging on x64
Wrap Up

Having concentrated on writing development tools for years, I always look at new operating system releases with an eye toward seeing if there’s anything new I can use to solve difficult development problems. For example, Windows® XP introduced vectored exception handling, which allows you to intercept the exception handling chain for logging and diagnostic purposes. In the November 2006 issue of MSDN® Magazine, I discussed my attempt at using that functionality to write an exception logger for Microsoft® .NET-based applications.

With Windows Vista™, Microsoft didn’t disappoint. In fact, Windows Vista has a very interesting new API called Wait Chain Traversal (WCT), which allows you to determine when and why a process is deadlocked. That sounded extremely exciting to me, so of course I had to jump in with both feet. The good news is that WCT will report exactly what synchronization object you are deadlocking on. The bad news is that it only reports a limited set of synchronization primitives. Even with that limitation, it’s still a very useful API and something you’ll want to have in your debugging toolkit.

In this column, I want to discuss the WCT API, its usage, and its limitations. As part of this column, I give you a tool that pinpoints all the deadlocks supported by WCT. Because I insisted on writing the tool in .NET, I also get to show you a descent into the depths of interop despair and how I was able to get the WCT API working from the .NET Framework.

Basics of Wait Chain Traversal

Figure 1 lists the synchronization primitives that the WCT API supports for deadlock reporting. While there are a few other synchronization primitives you might deadlock on, such as an event, the list covers a majority of the problem areas for most applications. If you’re focused on .NET, you might be wondering if a sync block is supported. If you haven’t heard of sync blocks before, that’s because you don’t use them directly; sync blocks are the internal construct used to implement the lock keyword along with the Monitor class. As the common language runtime (CLR) provides the implementation for sync blocks, they are not known to the native WCT API.

Figure 1 WCT-Supported Synchronization Primitives

Synchronization Primitive    Description
ALPC                         The internal Windows mechanism used for Remote Procedure Calls
COM                          COM calls across threads and processes
Critical Sections            The lightweight, per-process, native synchronization type
Mutexes                      The handle-based synchronization type generally used to coordinate multiple process synchronization
SendMessage                  Sending a message to another thread or window blocks until that message is processed
Process and Thread Handles   Waiting on processes and threads

The WCT API works its magic by looking at whether a thread is waiting in a blocking call. If the thread is in a blocking call, WCT looks up the object the thread is blocking on and determines, if appropriate, the name of the object and which thread or process owns that object. For example, if a thread in process A is waiting on process B to complete execution and process B calls SendMessage to a window owned by that thread in process A, you have a deadlock that WCT can detect and report.

The only problem I see with the WCT API is that it does not report deadlocks when using WaitForMultipleObjects, which reduces its usefulness in many applications. For example, .NET synchronization uses WaitForMultipleObjects for all synchronization coordination. While WCT won’t report that the thread is deadlocked, it will report that the thread is blocked. Fortunately, the WCT API can report the waiting times so you can manually decide if a particular thread is deadlocked through inspection and comparison using a tool like LockWatcher (part of this column’s code download).
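The manual comparison described above can be sketched in a few lines of portable C++. This is only a heuristic under stated assumptions: ThreadSample and LooksStalled are hypothetical names, not part of the WCT API or of LockWatcher, and the fields simply mirror the wait time and context switch values that WCT reports for a thread node. The reasoning is that a thread that stays blocked across two snapshots, without a single context switch in between, has not run during that interval.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of one thread node, taken from one wait chain query.
// The field names mirror the thread data WCT reports; the type itself is
// an illustration and not part of any Windows header.
struct ThreadSample {
    std::uint32_t threadId;
    bool          blocked;          // the node status was "blocked"
    std::uint32_t waitTime;         // wait time as reported by WCT
    std::uint32_t contextSwitches;  // context switches as reported by WCT
};

// Heuristic only: treat the thread as a deadlock suspect when it was blocked
// in both snapshots, never ran in between (same context switch count), and
// its wait time did not go down.
inline bool LooksStalled(const ThreadSample& earlier, const ThreadSample& later) {
    return earlier.threadId == later.threadId
        && earlier.blocked && later.blocked
        && earlier.contextSwitches == later.contextSwitches
        && later.waitTime >= earlier.waitTime;
}
```

A suspect flagged this way still needs inspection: a thread legitimately waiting on a long-running operation looks the same, so the interval between the two snapshots should be longer than any wait you consider normal.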

WCT is a typical Win32® API, meaning you have to grab an opaque handle, pass that handle to the function that records the data, and call a dedicated close function (CloseThreadWaitChainSession rather than CloseHandle) when done. The Platform SDK documentation shows all that’s required to initialize and use the WCT API. For initialization, there’s one slight bump in that you’ll always want to call the RegisterWaitChainCOMCallback function to register the CoGetCallState and CoGetActivationState COM functions so WCT can report COM ownership information. As these functions are not documented, you’ll need to get them through GetProcAddress from ole32.dll. Personally, I think the WCT API should automatically handle this small chore when you initialize it to look for COM deadlocks.

To initialize WCT, call the OpenThreadWaitChainSession function exported from advapi32.dll:

HWCT WINAPI OpenThreadWaitChainSession(
  DWORD Flags, PWAITCHAINCALLBACK callback
);

Once initialized, you can call GetThreadWaitChain whenever you want to determine the blocked threads in an application:

BOOL WINAPI GetThreadWaitChain(
  HWCT WctHandle, DWORD_PTR Context,
  DWORD Flags, DWORD ThreadId,
  LPDWORD NodeCount, PWAITCHAIN_NODE_INFO NodeInfoArray,
  LPBOOL IsCycle
);

While you can initialize the WCT API to do all operations asynchronously, nearly everyone who uses it will use the synchronous version. In addition to the thread ID, GetThreadWaitChain takes a Flags parameter that sets the type of data you’re interested in receiving. By default, the Flags parameter limits the search to threads inside the specified process. If you want to get information from other processes, you’ll need to specify the three flags WCT_OUT_OF_PROC_COM_FLAG, WCT_OUT_OF_PROC_CS_FLAG, and WCT_OUT_OF_PROC_FLAG. You can do so easily with the summary flag that ORs these three together, WCTP_GETINFO_ALL_FLAGS.

The key information returned by GetThreadWaitChain is an array of WAITCHAIN_NODE_INFO structures that describes the state of the thread and what it’s currently blocked on. Additionally, GetThreadWaitChain will also return whether the thread is deadlocked based on the analysis engine inside the WCT code. On the whole, the WCT API is relatively simple, and while it won’t tell you all your deadlocks, it’s accurate for what it does report.

When a thread is blocked, the WCT API reports in the array of WAITCHAIN_NODE_INFO structures the thread and the cause of the deadlock. For example, if you have a deadlock where thread A owns a critical section X and wants to acquire critical section Y, and thread B owns critical section Y and wants to acquire X, the deadlock is reported as follows in the array:

Index 1: Thread A’s ID
Index 2: Waiting on a critical section
Index 3: Thread B’s ID
Index 4: Waiting on a critical section
Index 5: Thread A’s ID

Note that the critical section address is not reported. However, for other objects, such as mutexes, if it has a name, the WCT API will report that name.
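To make the alternating thread/object pattern of the node array concrete, here is a minimal sketch that walks such an array and prints it with increasing indentation. ChainNode, FormatChain, and ReturnsToFirstThread are hypothetical names, and ChainNode is a deliberately simplified stand-in for WAITCHAIN_NODE_INFO so that the example compiles without the Windows headers. In real code the IsCycle value returned by GetThreadWaitChain remains the authoritative answer; the last function only shows why thread A appears again at the end of the array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for one entry of the node array.
struct ChainNode {
    bool          isThread;  // true: thread node, false: lock object node
    std::uint32_t threadId;  // meaningful only for thread nodes
    std::string   lockKind;  // e.g. "CriticalSection", for lock nodes
    std::string   lockName;  // empty when the object has no name
};

// Prints one node per line, indenting each step further down the chain.
inline std::string FormatChain(const std::vector<ChainNode>& chain) {
    std::ostringstream out;
    for (std::size_t i = 0; i < chain.size(); ++i) {
        const ChainNode& node = chain[i];
        out << std::string(i * 3, ' ');
        if (node.isThread) {
            out << "TID: " << node.threadId;
        } else {
            out << node.lockKind;
            if (!node.lockName.empty()) out << " Name: " << node.lockName;
        }
        out << '\n';
    }
    return out.str();
}

// The chain closes on itself when its last node is the starting thread again.
inline bool ReturnsToFirstThread(const std::vector<ChainNode>& chain) {
    return chain.size() > 1
        && chain.front().isThread && chain.back().isThread
        && chain.front().threadId == chain.back().threadId;
}
```

Feeding it the five nodes from the example above (thread A, critical section, thread B, critical section, thread A) yields an indented listing in which the first and last lines carry the same thread ID.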

To get the most out of the WCT API, you’ll always want to name your handles so you can see exactly what’s causing your deadlock. Of course, if you’re naming handles, those handles are now visible outside the process, which can lead to serious synchronization problems. For example, say you give a mutex a name such as MyHappyMutex. You have two processes running, A and B, and both are waiting on the same mutex name. If process A releases the mutex, both process A and process B could grab it. Thus, you could have a serious debugging challenge. If you are going to name your handles and you want to ensure they are unique to your process, you’ll want to append to the name the process ID or some other string guaranteed to be unique to the process.

While naming your handles does help with your debugging, some people might be nervous about security attacks because, once a handle is global, another process can open the handle. For certain scenarios, this can be a valid concern; if you are worried at the prospect, at least consider naming your handles in debug builds.
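The process-unique naming suggested above takes only a small helper. This is a sketch, not code from LockWatcher: MakeProcessUniqueName is a hypothetical function, the underscore separator is an arbitrary choice, and the process ID is passed in as a plain number so the example stays portable. On Windows you would typically pass the result of GetCurrentProcessId and hand the string to the call that creates the object.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds "<baseName>_<processId>" so that two processes
// using the same base name end up with different kernel object names.
inline std::string MakeProcessUniqueName(const std::string& baseName,
                                         unsigned long processId) {
    return baseName + "_" + std::to_string(processId);
}
```

With this, process 5828 would create MyHappyMutex_5828 while process 5796 would create MyHappyMutex_5796, so the names still show up in the wait chain output but can no longer collide across processes.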


While the WCT API is designed for native code, my plan was to write a tool that uses it from .NET code as I had some ideas for additional analysis. Additionally, .NET applications are easier to write since you don’t have to worry as much about memory so I thought I could implement it faster. Little did I know I was about to run into some real head-scratching problems with interop.

Figure 2 shows the native WAITCHAIN_NODE_INFO structure. If you look at it closely, you’ll see that it contains a union of two overlapping structures. For most structures P/Invoke is no trouble, but whenever there’s a C++ union floating around you know you’re in for some serious pain. Nearly all structures and APIs from Windows are defined at the pinvoke.net Web site, but a sinking feeling told me that I was on my own with this nasty native structure.

Figure 2 Native WAITCHAIN_NODE_INFO Structure

typedef struct _WAITCHAIN_NODE_INFO
{
    WCT_OBJECT_TYPE ObjectType;
    WCT_OBJECT_STATUS ObjectStatus;

    union {
        struct {
            WCHAR ObjectName[WCT_OBJNAME_LENGTH];
            LARGE_INTEGER Timeout;    // Not implemented in v1
            BOOL Alertable;           // Not implemented in v1
        } LockObject;

        struct {
            DWORD ProcessId;
            DWORD ThreadId;
            DWORD WaitTime;
            DWORD ContextSwitches;
        } ThreadObject;
    };
} WAITCHAIN_NODE_INFO, *PWAITCHAIN_NODE_INFO;


My first hope was that I could possibly get by with defining two separate structures on the .NET side. On other projects, I’ve been able to declare separate DllImportAttribute P/Invoke methods with the different .NET side declarations to ease the work on defining the structure. With the WCT code, I’d be assuming possible internal implementation details, but the .NET structure definition would be trivial. Running the native Wait Chain Traversal sample (msdn2.microsoft.com/ms681418.aspx) on a deadlocking sample quickly showed me that there was enough difference in the returned deadlock information that this first attempt at a solution wasn’t going to be enough.

That meant I needed to turn to the fun of LayoutKind.Explicit to define the individual offsets inside the structure. A quick native sample that returned the size of the WAITCHAIN_NODE_INFO structure showed it is 280 bytes on both 32-bit and 64-bit Windows, so that eliminated the necessity of defining separate structures and programmatically determining which one to use based on the operating system.

My secret trick for easily finding structure offsets is to use WinDBG from the Debugging Tools for Windows package, which you can download from microsoft.com/whdc/devtools/debugging. Since WinDBG is a native-only debugger, you’ll need to debug a native C++ application that’s using the structure in question. To display a type you use the DT command, which conveniently shows all field offsets by default. In the case of the WAITCHAIN_NODE_INFO, I’ll want to pass the -r 1 and -v options to the DT command. The former tells WinDBG to recursively expand the structure one level and the latter dumps all the information about the structure. Figure 3 shows the result of the command on Windows Vista x64.

Figure 3 Result of WinDBG DT Command

0:000> dt -v -r 1 NodeInfoArray[0]
Local var [AddrFlags 90  AddrOff 0000000000000050  Reg/Val rsp (7)] @ 0x12d970 
[0] [16] struct _WAITCHAIN_NODE_INFO, 4 elements, 0x118 bytes
   +0x000 ObjectType       : Enum _WCT_OBJECT_TYPE,  11 total enums
8 ( WctThreadType )
   +0x004 ObjectStatus     : Enum _WCT_OBJECT_STATUS,  11 total enums
3 ( WctStatusBlocked )
   +0x008 LockObject       : struct _WAITCHAIN_NODE_INFO::<unnamed-tag>::<unnamed-tag>, 
                             3 elements, 0x110 bytes
      +0x000 ObjectName       : [128]  “۰”
      +0x100 Timeout          : union _LARGE_INTEGER, 4 elements, 0x8 bytes
         +0x000 LowPart          : 0
         +0x004 HighPart         : 0
         +0x000 u                : struct <unnamed-tag>, 2 elements, 0x8 bytes
         +0x000 QuadPart         : 0
      +0x108 Alertable        : 0
   +0x008 ThreadObject     : struct _WAITCHAIN_NODE_INFO::<unnamed-tag>::<unnamed-tag>, 
                             4 elements, 0x10 bytes
      +0x000 ProcessId        : 0x6f0
      +0x004 ThreadId         : 0x6f4
      +0x008 WaitTime         : 0x3005ce
      +0x00c ContextSwitches  : 0x2fb

The first interesting piece of information is 0x118 (280 decimal) in the third line of the output in Figure 3, which is the size of the structure. As you would expect, the first two enumerations are at offsets 0x0 and 0x4. The two overlapping members of the union, LockObject and ThreadObject, both start at offset 0x8. Taking LockObject first, the Unicode character array starts at 0x8 and runs for 0x100 bytes. That means the Timeout low part starts at 0x108 and the Timeout high part starts at 0x10C. Finally, the Alertable field is at 0x110. Once you see the pattern, the offsets for ThreadObject are trivial: add 0x8 to each offset WinDBG shows for its fields.
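If you would rather have the compiler confirm the layout than read it from a debugger, a portable mirror of the structure with static_assert checks does the job. The sketch below is an assumption-laden stand-in and not the Windows header: the types whose names end in Mirror are hypothetical, fixed-width integers replace WCHAR, LARGE_INTEGER, BOOL, and DWORD, and the 128-element name length is taken from the WinDBG output in Figure 3. The expected values come from Figure 3 and from the field offsets used in Figure 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Name length as shown by WinDBG in Figure 3 ([128]).
const std::size_t kObjNameLength = 128;

// Hypothetical mirror of the LockObject member.
struct LockObjectMirror {
    std::uint16_t ObjectName[kObjNameLength];  // WCHAR is 16 bits on Windows
    alignas(8) std::int64_t Timeout;           // LARGE_INTEGER, 8-byte aligned
    std::int32_t Alertable;                    // BOOL
};

// Hypothetical mirror of the ThreadObject member.
struct ThreadObjectMirror {
    std::uint32_t ProcessId;
    std::uint32_t ThreadId;
    std::uint32_t WaitTime;
    std::uint32_t ContextSwitches;
};

// Hypothetical mirror of WAITCHAIN_NODE_INFO.
struct NodeInfoMirror {
    std::int32_t ObjectType;    // enumeration, 4 bytes
    std::int32_t ObjectStatus;  // enumeration, 4 bytes
    union {
        LockObjectMirror   LockObject;
        ThreadObjectMirror ThreadObject;
    };
};

// Converts an offset inside LockObject into an offset inside the whole node.
constexpr std::size_t AbsoluteLockOffset(std::size_t offsetInsideLockObject) {
    return offsetof(NodeInfoMirror, LockObject) + offsetInsideLockObject;
}

static_assert(sizeof(NodeInfoMirror) == 0x118, "node must be 280 bytes");
static_assert(offsetof(NodeInfoMirror, LockObject) == 0x8, "union starts at 0x8");
static_assert(offsetof(NodeInfoMirror, ThreadObject) == 0x8, "members overlap");
static_assert(offsetof(LockObjectMirror, Timeout) == 0x100, "Timeout follows the name");
static_assert(offsetof(LockObjectMirror, Alertable) == 0x108, "Alertable follows Timeout");
```

The absolute offsets then fall out as 0x108 for the low part of Timeout, 0x10C for the high part (four bytes later), and 0x110 for Alertable, with the trailing four bytes of padding bringing the total to 0x118.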

After coding my .NET version of the WAITCHAIN_NODE_INFO structure using the FieldOffset attributes, I ran into my first major bump. The LockObject union has a 128 Unicode character array, ObjectName, as the first item. No matter how I tried to define that field, I was getting a TypeLoadException every time my code accessed my defined structure. For type-safety reasons as well as issues related to garbage collection, fields that point to reference types can’t overlap other fields, and as arrays and strings are both reference types, they’re unusable in this situation.


I was about to give up when I was reminded of the unsafe keyword and a new use of the fixed keyword in C# that was introduced with the .NET Framework 2.0. I could define the structure as unsafe and use a fixed char array definition for the ObjectName field; this allows for an unsafe inline array within a structure. It looks like an array in the structure, but the field actually ends up being accessed through a pointer; I could then use the String constructor that takes a character pointer and access the actual value.

After defining my structure and adding the /unsafe compiler switch, I was very happy to see that my test application reported a critical section deadlock exactly as the native test sample did. Expanding my test application to have one thread acquire a critical section and call SendMessage to a window on another thread waiting on the same critical section worked as well. Adding a deadlock test to block on the Explorer process also showed up correctly.

Thinking everything was going well, I added a test that deadlocked on a named mutex. The Platform SDK WCT sample showed the name of the mutex as:

\Sessions\1\BaseNamedObjects\Deadlock Mutex A

The \Sessions\1 indicates my interactive login session, and BaseNamedObjects is the kernel object table. In the test application, I’d named the mutex Deadlock Mutex A. However, my P/Invoke masterpiece of a structure said the name was only "\Sessi".

Thinking that I didn’t have the structure size right, I did some checks with Marshal.SizeOf, and the .NET side was reporting that it thought the structure was the proper 280 bytes. It also dawned on me that if I had the structure size wrong, the previous tests on critical sections would not have worked at all because the native GetThreadWaitChain function would have stepped all over the .NET memory for the array. Additionally, when I ran through the array for those previous deadlocks, I would have not seen the correct data in the array in my C# code.

At this point, I was a little confused, as the critical section deadlock test said my structure was the right size, but a named mutex was saying that the structure might be too small. My first thought was that I needed to see exactly what the native GetThreadWaitChain function was writing, so I ran my .NET application under WinDBG on my named mutex deadlock. Setting a breakpoint on ADVAPI32!GetThreadWaitChain, finding the array parameter, and stepping out of the function showed that the native side was copying in the complete name of the mutex into the array. However, that data was certainly not getting back into my C# code!

Unfortunately, I needed to see what .NET was doing with the array in native code. Of course, I’m running the x64 version of Windows Vista, so that meant that mixed debugging—where you can single step both managed and native sides of your application at the same time—is not supported. In the simple case where you have only one binary, set the platform to x86, which forces the application to be 32-bit—then you can do mixed debugging on 64-bit Windows Vista. However, for real-world applications that have many assemblies, it’s far faster to do your mixed debugging on a true 32-bit computer and not change your build. As I was writing this column, I received several questions about mixed debugging on x64 systems and can report it’s somewhat possible. At the end of the column, I’ll discuss some of the tricks I’ve been using to see both the native and managed sides when running full-bore x64 .NET Framework code.

After starting to spelunk through far too much assembly language, I had a brainstorm. Because I was only seeing "\Sessi" in the fixed array, I was wondering if the .NET interop code was only marshaling data when other fields existed at the same location. It almost seemed like the thread union portion, which is 12 bytes long, determined the length of the name union portion. To test my theory, I tossed in an extra Int32 field for the thread union and said that it began at 0x18 in the structure layout. You can imagine my shock when I ran the application and got back a string of "\Session"!

As it turns out, there is a bug with how the marshaler interacts with the code generated by the C# compiler for the fixed keyword when used to create a char array (this bug should be fixed in the Visual Studio "Orcas" release). To work around the issue, I needed to modify the struct so that it is blittable, meaning that the types that make up the struct must all have the same native representation as they do managed. The list of blittable types includes Byte, SByte, Int16, UInt16, Int32, UInt32, Int64, UInt64, IntPtr, UIntPtr, Single, and Double. You’ll notice, however, that Char is conspicuously missing from the list. In fact, Char is not blittable, because it can have multiple representations in native code (as either one or two bytes). To work around my problem, I changed the field type of the fixed array to be ushort rather than char; then in my code that turns this array into a String, I first cast from the resulting ushort* into the char* that the String constructor expects. Figure 4 shows the final structure that works correctly.

Figure 4 Final WAITCHAIN_NODE_INFO Definition

[StructLayout ( LayoutKind.Explicit, Size=280 )]
internal unsafe struct WAITCHAIN_NODE_INFO
{
    [FieldOffset ( 0x0 )]
    public WCT_OBJECT_TYPE ObjectType;
    [FieldOffset ( 0x4 )]
    public WCT_OBJECT_STATUS ObjectStatus;

    // The name union.
    [FieldOffset ( 0x8 )]
    private fixed ushort RealObjectName [ WCT_OBJNAME_LENGTH ];
    [FieldOffset ( 0x108 )]
    public Int32 TimeOutLowPart;
    [FieldOffset ( 0x10C )]
    public Int32 TimeOutHiPart;
    [FieldOffset ( 0x110 )]
    public Int32 Alertable;

    // The thread union.
    [FieldOffset ( 0x8 )]
    public Int32 ProcessId;
    [FieldOffset ( 0xC )]
    public Int32 ThreadId;
    [FieldOffset ( 0x10 )]
    public Int32 WaitTime;
    [FieldOffset ( 0x14 )]
    public Int32 ContextSwitches;

    // Does the work to get the ObjectName field.
    public String ObjectName ( )
    {
        fixed ( WAITCHAIN_NODE_INFO* p = &this )
        {
            // An empty name yields an empty string.
            return (p->RealObjectName [ 0 ] != '\0') ?
                new String ( (char*)p->RealObjectName ) :
                String.Empty;
        }
    }
}
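The idea behind the ObjectName accessor, storing the name as plain 16-bit integers and only treating them as characters when reading, can also be shown outside of C#. The following C++ sketch is an analogue and not a translation of the LockWatcher code: NameFromUnits is a hypothetical helper, and unlike the String constructor that takes a character pointer, it also stops at the buffer capacity in case the terminating zero is missing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Reads a fixed-size buffer of 16-bit code units and returns the characters
// up to the first zero unit, or up to 'capacity' if no terminator is found.
inline std::u16string NameFromUnits(const std::uint16_t* units,
                                    std::size_t capacity) {
    std::u16string name;
    for (std::size_t i = 0; i < capacity && units[i] != 0; ++i) {
        // Reinterpret the stored integer as a UTF-16 code unit.
        name.push_back(static_cast<char16_t>(units[i]));
    }
    return name;
}
```

A buffer whose first unit is zero produces an empty string, which matches the empty-name case for nodes such as critical sections that carry no name.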


Once I got the WAITCHAIN_NODE_INFO structure figured out, I was able to finish the LockWatcher application. Now you can use it to find deadlocks in your applications. To look for deadlocks in a single process or a group of specific processes, just pass the process ID values on the command line. If you don’t specify any process IDs, LockWatcher will report on all processes on the computer. If you’re looking at all processes, you’ll want to start LockWatcher from a command window that has elevated rights so it has full permission to inspect data in other elevated processes.

By default, LockWatcher only shows the threads that are blocked and any blocked items reported through the WCT API. To see the wait time and context switch information, pass the -a command line option. One nice feature of LockWatcher is the -t <seconds> option, where LockWatcher will continually inspect the process for deadlocks on the interval you specify.

When LockWatcher reports a deadlock, you’ll see output such as the following, which is an example of a critical section deadlock:

Process : DEAD.EXE, PID : 5828
TID: 5392
**Following thread is DEADLOCKED!
TID:  424
   CriticalSection Status: Owned
      TID: 5492
         CriticalSection Status: Owned
            TID:  424
**Following thread is DEADLOCKED!
TID: 5492
   CriticalSection Status: Owned
      TID:  424
         CriticalSection Status: Owned
            TID: 5492

For the second thread, you can read the output as follows: thread 5492 is waiting on a critical section owned by thread 424, and thread 424 is in turn waiting on a critical section owned by thread 5492.

As I discussed earlier, the WCT API does not report deadlocks when using WaitForMultipleObjects. The following is the output of a deadlock where thread 5652 owns the Deadlock Mutex A and has called WaitForMultipleObjects to wait on the two other thread handles.

Process : DEAD.EXE, PID : 5796
TID: 5652
TID: 4776
   Mutex Status: Owned Name: \Sessions\1\BaseNamedObjects\Deadlock Mutex A
      TID: 5652
TID: 4920
   Mutex Status: Owned Name: \Sessions\1\BaseNamedObjects\Deadlock Mutex A
      TID: 5652

By looking at the output, you can piece together part of the deadlock, but not the whole story. My goal for LockWatcher was to build an algorithm on top of the returned data from the WCT API to identify deadlocks if WaitForMultipleObjects is used. Working from the previous example, it’s quite easy to see the work I had to do. My next step was to whip up a .NET sample that deadlocked on two Mutex class instances and had the following output:

Process : DEADDOTNET.EXE, PID : 3644
TID: 4976
TID: 6120
TID: 2152
TID: 5852

As you can see, the WCT API is not reporting anything useful, so unless I was going to start doing code injection, stack walking, and parameter deciphering, there was not much I could do. Even though the WCT API is limited, it will still help find far more deadlocks than if you did not have it. To see how LockWatcher reports different deadlocks, I included the appropriately named DEAD program, which I wrote in native C++, as a test program.
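For the native WaitForMultipleObjects case shown earlier, one way to layer analysis on top of the WCT data is a wait-for graph with cycle detection. The sketch below is an illustration under stated assumptions and not the algorithm LockWatcher implements: WaitGraph and IsInWaitCycle are hypothetical names, the edges from the blocked threads to the mutex owner are the ones WCT reports, and the edges from the owner back to the threads it waits on are assumed to come from some other source, since WCT does not supply them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

typedef std::uint32_t ThreadId;
// Maps a waiting thread to the threads it is waiting on, either directly
// (thread handles) or through a lock those threads own.
typedef std::map<ThreadId, std::vector<ThreadId> > WaitGraph;

namespace detail {
// Depth-first search: can 'target' be reached from 'from' along wait edges?
inline bool Reaches(const WaitGraph& graph, ThreadId from, ThreadId target,
                    std::set<ThreadId>& seen) {
    WaitGraph::const_iterator it = graph.find(from);
    if (it == graph.end()) return false;  // 'from' is not waiting on anything
    for (std::size_t i = 0; i < it->second.size(); ++i) {
        ThreadId next = it->second[i];
        if (next == target) return true;
        // Visit each thread once so the search ends even on cyclic graphs.
        if (seen.insert(next).second && Reaches(graph, next, target, seen)) {
            return true;
        }
    }
    return false;
}
}  // namespace detail

// True when 'start' ends up waiting on itself by following wait edges.
inline bool IsInWaitCycle(const WaitGraph& graph, ThreadId start) {
    std::set<ThreadId> seen;
    return detail::Reaches(graph, start, start, seen);
}
```

With only the WCT-reported edges (4776 and 4920 each waiting on 5652) no cycle exists; adding the two missing edges from 5652 to the thread handles closes it. Note that a cycle is conclusive only when the owner waits for all of its handles; with a wait-any call, one signaled handle would be enough to break the wait.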

Mixed Debugging on x64

Earlier, I discussed how to recompile your application for an x86 platform to allow mixed managed and native debugging. However, as part of my exploration of the P/Invoke challenges I faced with the WAITCHAIN_NODE_INFO structure, I did a bit of mixed debugging on a .NET application running as an x64 binary. While not a supported scenario from Microsoft, I thought I’d tell you about some of the tricks I used to allow me to see both sides without going insane.

Managed debugging does not go through the standard Windows Debugging API (msdn2.microsoft.com/ms679276.aspx), but uses a scheme reliant on vectored exception handling. If you poke at what’s going on in the CLR Debugging API (msdn2.microsoft.com/ms404520.aspx), you can see that it uses the usual single step and breakpoint exceptions, just as a normal debugger does. Because managed debugging does not go through the Windows Debugging API, you won’t get the error you get when you attach a native debugger to an application that’s already running under a debugger. Thus, once you start debugging your managed application in Visual Studio®, you can attach a native debugger to the managed application.

While you can attempt to use a second instance of Visual Studio as the native debugger, it’s nearly impossible to do anything because you’ll be breaking on every single step the managed debugging API triggers. Because there’s no way in Visual Studio to automatically continue on exceptions, the only hope you’ll have is to enjoy the fun of WinDBG.

While WinDBG is harder to use, the lovely Event Filtering will let you ignore the single step exceptions that are triggered constantly by the CLR Debugging API. After you’ve attached WinDBG to the managed application, you’ll immediately stop at the initial, or loader, breakpoint.

The first command you’ll want to use is the following:

sxd -c "gn" sse

The SXD (Set Exception Disabled) command sets the Single Step Exception (SSE) to disabled, which means you’ll not stop when they occur. Since the CLR Debugging API controls everything with the single step, I’ve also set the first chance command to have WinDBG automatically continue and pass them on to the debuggee as not handled. Remember that you’ve set this command when you do stop inside the 64-bit native portion because you won’t be able to single step anything with it on. To turn single stepping back on when you need it, execute the following command:

sxe -c "" sse

Once you’ve got the single-stepping under control, you need to plan for how you’re going to deal with the CLR Debugging API’s breakpoints as well as the ones you want to set in native code. When I was trying to figure out what was going on in my P/Invoke problems, I needed to see the parameters to GetThreadWaitChain, which is in advapi32.dll. Consequently, I needed a breakpoint on that address via the following command:

bu ADVAPI32!GetThreadWaitChain

The big trick I needed to accomplish was ignoring all the breakpoints coming from the CLR Debugging API. Fortunately, WinDBG treats breakpoint exceptions like any other exceptions so I can continue using the SX (Set Exception) family of commands to have WinDBG perform a command whenever a breakpoint exception is hit. Even more fortunate, WinDBG supports rudimentary control flow commands known as debugger command programs. With the .if and .else commands, you can execute conditional logic on your debugger commands.

To ignore all breakpoint exceptions except the location where I set my breakpoint, the following command takes care of everything on an x64 machine:

sxe -c ".if (rip==ADVAPI32!GetThreadWaitChain){.echo STOPPED AT YOUR BU!}.else{gn}" bpe

The SXE (Set Exception Enabled) says for every BPE (Break Point Exception), execute the command on the first chance exception. The command checks to see whether the instruction register, RIP, is executing the first instruction in GetThreadWaitChain. (For a 32-bit machine, substitute EIP as the instruction register.) If it is, the debugger will report the stop by echoing the text to the screen. If it is not, the debugger will execute the GN (Go Not Handled) command to let the debuggee have the exception, thus not confusing the CLR Debugging API.

While a little bit convoluted, these steps make it possible to do a modicum of mixed debugging on an x64 system. Keep in mind that it is very easy to mess up the managed application if you are not super careful as to what you are doing. It’s highly unlikely that the techniques I presented here are supported—or even condoned—by Microsoft. However, when you need to see both the native and managed side of your application on an x64 system, at least you have a hint to get you going.

Wrap Up

What a long, strange trip this column has been! What started out as a seemingly simple piece turned into a long trip through P/Invoke and mixed debugging under x64, and led to the ability to detect some of the deadlocks in your applications. Given the Microsoft track record of API improvements over the years, my hope is that you’ll see WaitForMultipleObjects support in WCT in a future service pack. When we get that support, I’ll update LockWatcher to support it!

Tip 78 If you are doing any ASP.NET development, you know that there are times when you just have to look at the raw HTTP information going across the wire. In those cases, Fiddler (www.fiddler2.com) comes to the rescue. Not only can you see every byte, but you can change and tweak the data with a really cool scripting API. And the newest version, Fiddler 2, supports HTTPS interception, even more reason to make it a mandatory part of your toolkit.

Tip 79 Gregg Miskelly came up with a brilliant debugging trick on his blog. In managed code, you don’t have the address of an object, so setting a per instance breakpoint is almost impossible. However, Gregg points out that if you do the cool Make Object Id trick on the instance, you can set a per instance breakpoint by setting a conditional breakpoint to this == 1#. That’s one I could have used a million times over!

Send your questions and comments for John to slayer@microsoft.com.

John Robbins is a cofounder of Wintellect, a software consulting, education, and development firm that specializes in both .NET and Windows. His latest book is Debugging Microsoft .NET 2.0 Applications (Microsoft Press, 2006). You can contact John at www.wintellect.com.