Profiler Stack Walking in the .NET Framework 2.0: Basics and Beyond
Microsoft .NET Framework 2.0
Common Language Runtime (CLR)
Summary: Describes how you can program your profiler to walk managed stacks in the common language runtime (CLR) of the .NET Framework. (14 printed pages)
Synchronous and Asynchronous Calls
Mixing It Up
Be on Your Best Behavior
Enough Is Enough
Credit Where Credit Is Due
About the Author
This article is targeted toward anyone interested in building a profiler to examine managed applications. I will describe how you can program your profiler to walk managed stacks in the common language runtime (CLR) of the .NET Framework. I'll try to keep the mood light, because the topic itself can be heavy going at times.
The profiling API in version 2.0 of the CLR has a new method named DoStackSnapshot that lets your profiler walk the call stack of the application you're profiling. Version 1.1 of the CLR exposed similar functionality through the in-process debugging interface. But walking the call stack is easier, more accurate, and more stable with DoStackSnapshot. The DoStackSnapshot method uses the same stack walker used by the garbage collector, security system, exception system, and so on. So you know it's got to be right.
Access to a full stack trace gives users of your profiler the ability to get the big picture of what's going on in an application when something interesting happens. Depending on the application and on what a user wants to profile, you can imagine a user wanting a call stack when an object is allocated, when a class is loaded, when an exception is thrown, and so on. Even getting a call stack for something other than an application event—for example, a timer event—would be interesting for a sampling profiler. Looking at hot spots in code becomes more enlightening when you can see who called the function that called the function that called the function containing the hot spot.
I'm going to focus on getting stack traces with the DoStackSnapshot API. Another way to get stack traces is by building shadow stacks: you can hook FunctionEnter and FunctionLeave to keep a copy of the managed call stack for the current thread. Shadow stack building is useful if you need stack information at all times during application execution, and if you don't mind the performance cost of having your profiler's code run on every managed call and return. The DoStackSnapshot method is best if you need slightly sparser reporting of stacks, such as in response to events. Even a sampling profiler taking stack snapshots every few milliseconds is much sparser than building shadow stacks. So DoStackSnapshot is well suited for sampling profilers.
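For comparison, here is a minimal sketch of the shadow-stack approach. It assumes your enter and leave hooks forward to two hypothetical helpers, OnFunctionEnter and OnFunctionLeave; those names, and the FunctionID stand-in typedef, are illustrative only. It also ignores tail calls and exception unwinding, which a real shadow stack would have to handle.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for the FunctionID typedef that corprof.h provides.
using FunctionID = std::uintptr_t;

// One shadow stack per thread. No locking is needed because each thread
// only ever touches its own copy.
thread_local std::vector<FunctionID> t_shadowStack;

// Hypothetical helper, called from the profiler's function-enter hook.
void OnFunctionEnter(FunctionID funcId) {
    t_shadowStack.push_back(funcId);
}

// Hypothetical helper, called from the profiler's function-leave hook.
void OnFunctionLeave(FunctionID funcId) {
    // This sketch expects enter and leave notifications to be balanced.
    assert(!t_shadowStack.empty() && t_shadowStack.back() == funcId);
    t_shadowStack.pop_back();
}

// Returns the current thread's managed stack, leaf frame first, which is
// the same order in which DoStackSnapshot reports frames.
std::vector<FunctionID> CaptureShadowStack() {
    return std::vector<FunctionID>(t_shadowStack.rbegin(), t_shadowStack.rend());
}
```

The cost is visible in the sketch: every managed call and return pays for a push or a pop, whether or not anyone ever asks for the stack.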
Take a Stack Walk on the Wild Side
It's very useful to be able to get call stacks whenever you want them. But with power comes responsibility. A profiler user will not want stack walking to result in an access violation (AV) or a deadlock in the runtime. As a profiler writer, you must wield your power with care. I will talk about how to use DoStackSnapshot, and how to do so carefully. As you'll see, the more you want to do with this method, the harder it is to get it right.
Let's take a look at our subject. Here's what your profiler calls (you can find this in the ICorProfilerInfo2 interface in Corprof.idl):
HRESULT DoStackSnapshot(
    [in] ThreadID thread,
    [in] StackSnapshotCallback *callback,
    [in] ULONG32 infoFlags,
    [in] void *clientData,
    [in, size_is(contextSize), length_is(contextSize)] BYTE context[],
    [in] ULONG32 contextSize);
The following code is what the CLR calls on your profiler. (You can also find this in Corprof.idl.) You pass a pointer to your implementation of this function in the callback parameter from the preceding example.
typedef HRESULT __stdcall StackSnapshotCallback(
    FunctionID funcId,
    UINT_PTR ip,
    COR_PRF_FRAME_INFO frameInfo,
    ULONG32 contextSize,
    BYTE context[],
    void *clientData);
It's like a sandwich. When your profiler wants to walk the stack, you call DoStackSnapshot. Before the CLR returns from that call, it calls your StackSnapshotCallback function several times, once for each managed frame or for each run of unmanaged frames on the stack. Figure 1 shows this sandwich.
Figure 1. A "sandwich" of calls during profiling
As you can see from my notations, the CLR notifies you of the frames in the reverse order from how they were pushed onto the stack—leaf frame first (pushed last), main frame last (pushed first).
What do all the parameters for these functions mean? I'm not ready to discuss them all yet, but I'll discuss a few of them, starting with DoStackSnapshot. (I'll get to the rest in a few moments.) The infoFlags value comes from the COR_PRF_SNAPSHOT_INFO enumeration in Corprof.idl, and it enables you to control whether the CLR will give you register contexts for the frames that it reports. You can specify any value you like for clientData and the CLR will give it back to you in your StackSnapshotCallback call.
In StackSnapshotCallback, the CLR uses the funcId parameter to pass you the FunctionID value of the currently walked frame. This value is 0 if the current frame is a run of unmanaged frames, which I'll talk about later. If funcId is nonzero, you can pass funcId and frameInfo to other methods, such as GetFunctionInfo2 and GetCodeInfo2, to get more info about the function. You can get this function information right away, during your stack walk, or alternatively save the funcId values and get the function information later, which reduces your impact on the running application. If you get the function information later, remember that a frameInfo value is valid only inside the callback that gives it to you. Although it's okay to save the funcId values for later use, don't save the frameInfo for later use.
When you return from StackSnapshotCallback, you will typically return S_OK and the CLR will continue walking the stack. If you like, you can return S_FALSE, which stops the stack walk. Your DoStackSnapshot call will then return CORPROF_E_STACKSNAPSHOT_ABORTED.
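To make the callback contract concrete, here is a hedged sketch of a callback that records frames into a fixed-size buffer passed through clientData and returns S_FALSE when the buffer is full. Everything except the callback's parameter list is an assumption for illustration: the typedefs are stand-ins so the code compiles without corprof.h, FrameBuffer and RecordFrame are hypothetical names, SimulateWalk is a test double that plays the CLR's part, and kSimulatedAborted is a placeholder rather than the real value of CORPROF_E_STACKSNAPSHOT_ABORTED.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins so the sketch compiles without the Windows SDK. A real profiler
// gets these types and constants from the platform headers and corprof.h.
using HRESULT = std::int32_t;
using FunctionID = std::uintptr_t;
using UINT_PTR = std::uintptr_t;
using COR_PRF_FRAME_INFO = std::uintptr_t;
using ULONG32 = std::uint32_t;
using BYTE = unsigned char;
constexpr HRESULT S_OK = 0;
constexpr HRESULT S_FALSE = 1;
// Placeholder only; NOT the real value of CORPROF_E_STACKSNAPSHOT_ABORTED.
constexpr HRESULT kSimulatedAborted = -1;

constexpr std::size_t kMaxFrames = 4;  // tiny on purpose, to show the abort path

// Hypothetical buffer handed to the walk through clientData.
struct FrameBuffer {
    FunctionID funcIds[kMaxFrames] = {};
    UINT_PTR ips[kMaxFrames] = {};
    std::size_t count = 0;
};

// Has the shape of StackSnapshotCallback (the real one is also __stdcall).
HRESULT RecordFrame(FunctionID funcId, UINT_PTR ip,
                    COR_PRF_FRAME_INFO /*frameInfo*/, ULONG32 /*contextSize*/,
                    BYTE /*context*/[], void* clientData) {
    assert(clientData != nullptr);
    auto* buffer = static_cast<FrameBuffer*>(clientData);
    if (buffer->count == kMaxFrames) {
        return S_FALSE;  // buffer full: ask the CLR to stop the walk
    }
    // funcId is 0 for a run of unmanaged frames. frameInfo is deliberately
    // not saved, because it is valid only inside this callback.
    buffer->funcIds[buffer->count] = funcId;
    buffer->ips[buffer->count] = ip;
    ++buffer->count;
    return S_OK;
}

using SnapshotCallback = HRESULT (*)(FunctionID, UINT_PTR, COR_PRF_FRAME_INFO,
                                     ULONG32, BYTE[], void*);

// Test double that plays the CLR's role: reports frames leaf first and stops
// when the callback returns S_FALSE. It is not part of any real API.
HRESULT SimulateWalk(const FunctionID* frames, std::size_t frameCount,
                     SnapshotCallback callback, void* clientData) {
    for (std::size_t i = 0; i < frameCount; ++i) {
        const UINT_PTR fakeIp = 0x1000 + i;
        if (callback(frames[i], fakeIp, 0, 0, nullptr, clientData) == S_FALSE) {
            return kSimulatedAborted;
        }
    }
    return S_OK;
}
```

The callback neither allocates nor takes a lock, and it saves only funcId and ip, so any lookup of function information can be deferred until after the walk.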
Synchronous and Asynchronous Calls
You can call DoStackSnapshot in two ways, synchronously and asynchronously. A synchronous call is the easiest to get right. You make a synchronous call when the CLR calls one of your profiler's ICorProfilerCallback(2) methods, and in response you call DoStackSnapshot to walk the stack of the current thread. This is useful when you want to see what the stack looks like at an interesting notification point like ObjectAllocated. To perform a synchronous call, you call DoStackSnapshot from within your ICorProfilerCallback(2) method, passing zero or null for the parameters I haven't told you about.
An asynchronous stack walk occurs when you walk the stack of a different thread or forcefully interrupt a thread to perform a stack walk (on itself or on another thread). Interrupting a thread involves hijacking the instruction pointer of the thread to force it to execute your own code at arbitrary times. This is insanely dangerous for too many reasons to list here. Please, just don't do it. I'll restrict my description of asynchronous stack walks to non-hijacking uses of DoStackSnapshot to walk a separate target thread. I call this "asynchronous" because the target thread was executing at an arbitrary point at the time the stack walk begins. This technique is commonly used by sampling profilers.
Walking All Over Someone Else
Let's break down the cross-thread—that is, the asynchronous—stack walk a little. You have two threads: the current thread and the target thread. The current thread is the thread executing DoStackSnapshot. The target thread is the thread whose stack is being walked by DoStackSnapshot. You specify the target thread by passing its thread ID in the thread parameter to DoStackSnapshot. What happens next is not for the faint of heart. Remember, the target thread was executing arbitrary code when you asked to walk its stack. So the CLR suspends the target thread, and it stays suspended the whole time that it is being walked. Can this be done safely?
I'm glad you asked. This is indeed dangerous, and I'll talk later about how to do it safely. But first, I'm going to get into mixed-mode stacks.
Mixing It Up
A managed application is not likely to spend all of its time in managed code. PInvoke calls and COM interop allow managed code to call into unmanaged code, and sometimes back again with delegates. And managed code calls directly into the unmanaged runtime (CLR) to do JIT compilation, handle exceptions, perform garbage collection, and so on. So when you do a stack walk, you will probably encounter a mixed-mode stack—some frames are managed functions, and others are unmanaged functions.
Grow Up, Already!
Before I continue, a brief interlude. Everyone knows that stacks on our modern PCs grow (that is, "push") to smaller addresses. But when we visualize these addresses in our minds or on whiteboards, we disagree on how to sort them vertically. Some of us imagine the stack growing up (little addresses on top); some see it growing down (little addresses on the bottom). We're divided on this issue in our team as well. I choose to side with every debugger I've ever used: call-stack traces and memory dumps tell me that the little addresses are "above" the big addresses. So stacks grow up; main is at the bottom, the leaf callee is at the top. If you disagree, you'll have to do some mental rearranging to get through this part of the article.
Waiter, There Are Holes in My Stack
Now that we're speaking the same language, let's look at a mixed-mode stack. Figure 2 illustrates an example mixed-mode stack.
Figure 2. A stack with managed and unmanaged frames
Stepping back a bit, it's worthwhile to understand why DoStackSnapshot exists in the first place. It's there to help you walk managed frames on the stack. If you tried to walk managed frames yourself, you would get unreliable results, particularly on 32-bit systems, because of some wacky calling conventions used in managed code. The CLR understands these calling conventions, and DoStackSnapshot can therefore help you decode them. However, DoStackSnapshot is not a complete solution if you want to be able to walk the entire stack, including unmanaged frames.
Here's where you have a choice:
Option 1: Do nothing and report stacks with "unmanaged holes" to your users, or ...
Option 2: Write your own unmanaged stack walker to fill in those holes.
When DoStackSnapshot comes across a block of unmanaged frames, it calls your StackSnapshotCallback function with funcId set to 0, as I mentioned earlier.
If the unmanaged block consists of more than one unmanaged frame, the CLR still calls StackSnapshotCallback only once. Remember, the CLR is making no effort to decode the unmanaged block—it has special insider information that helps it skip over the block to the next managed frame, and that's how it progresses. The CLR doesn't necessarily know what's inside the unmanaged block. That's for you to figure out, hence Option 2.
That First Step Is a Doozy
No matter which option you choose, filling in the unmanaged holes isn't the only hard part. Just beginning the walk can be a challenge. Take a look at the stack above. There's unmanaged code at the top. Sometimes you'll be lucky, and the unmanaged code will be COM or PInvoke code. If so, the CLR is smart enough to know how to skip it, and begins the walk at the first managed frame (D in the example). However, you might still want to walk the top-most unmanaged block in order to report as complete a stack as possible.
Even if you don't want to walk the top-most block, you might be forced to anyway—if you're not lucky, that unmanaged code is not COM or PInvoke code, but helper code in the CLR itself, such as code to do JIT compiling or garbage collection. If that's the case, the CLR won't be able to find the D frame without your help. So an unseeded call to DoStackSnapshot will result in the error CORPROF_E_STACKSNAPSHOT_UNMANAGED_CTX or CORPROF_E_STACKSNAPSHOT_UNSAFE. (By the way, it's really worthwhile to visit corerror.h.)
Notice that I used the word "unseeded." DoStackSnapshot takes a seed context using the context and contextSize parameters.
If you pass null for the context parameter, the stack walk is unseeded, and the CLR starts at the top. However, if you pass a non-null value for the context parameter, representing the CPU state at some spot lower down on the stack (such as pointing to the D frame), the CLR performs a stack walk seeded with your context. It ignores the real top of the stack and starts wherever you point it.
OK, not quite true. The context you pass to DoStackSnapshot is more of a hint than an outright directive. If the CLR is certain it can find the first managed frame (because the top-most unmanaged block is PInvoke or COM code), it will do that and ignore your seed. Don't take it personally, though. The CLR is trying to help you by providing the most accurate stack walk it can. Your seed is useful only if the top-most unmanaged block is helper code in the CLR itself, because we have no information to help us skip it. Therefore, your seed is used only when the CLR cannot determine by itself where to start the walk.
You might wonder how you can provide the seed to us in the first place. If the target thread is not yet suspended, you can't just walk the target thread's stack to find the D frame and thus calculate your seed context. And yet I'm telling you to calculate your seed context by doing your unmanaged walk before calling DoStackSnapshot and thus before DoStackSnapshot takes care of suspending the target thread for you. Does the target thread need to be suspended by you and by the CLR? Actually, yes.
I think it's time to choreograph this ballet. But before I get too deep, note that the issue of whether and how to seed a stack walk applies only to asynchronous walks. If you're doing a synchronous walk, DoStackSnapshot will always be able to find its way to the top-most managed frame without your help—no seed necessary.
All Together Now
For the truly adventuresome profiler that is doing an asynchronous, cross-thread, seeded stack walk while filling in the unmanaged holes, here's what a stack walk would look like. Assume that the stack illustrated here is the same stack you saw in Figure 2, just broken up a bit.
Profiler and CLR actions:
1. You suspend the target thread. (The target thread's suspend count is now 1.)
2. You get the target thread's current register context.
3. You determine whether the register context points to unmanaged code; that is, you call ICorProfilerInfo2::GetFunctionFromIP and check whether you get back a FunctionID value of 0.
4. Because in this example the register context does point to unmanaged code, you perform an unmanaged stack walk until you find the top-most managed frame (Function D).
5. You call DoStackSnapshot with your seed context, and the CLR suspends the target thread again. (Its suspend count is now 2.) The sandwich begins.
   a. The CLR calls your StackSnapshotCallback function with the FunctionID for D.
   b. The CLR calls your StackSnapshotCallback function with FunctionID equal to 0. You must walk this block yourself. You can stop when you reach the first managed frame. Alternatively, you can cheat and delay your unmanaged walk until sometime after your next callback, because the next callback will tell you exactly where the next managed frame begins and thus where your unmanaged walk should end.
   c. The CLR calls your StackSnapshotCallback function with the FunctionID for C.
   d. The CLR calls your StackSnapshotCallback function with the FunctionID for B.
   e. The CLR calls your StackSnapshotCallback function with FunctionID equal to 0. Again, you must walk this block yourself.
   f. The CLR calls your StackSnapshotCallback function with the FunctionID for A.
   g. The CLR calls your StackSnapshotCallback function with the FunctionID for Main.
6. You resume the target thread. Its suspend count is now 0, so the thread physically resumes.
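The ordering of those steps can be sketched as a small driver. This is only an outline under stated assumptions: WalkHooks and SampleTargetThread are hypothetical names, and each hook stands in for work a real profiler would do with OS thread APIs, its own unmanaged stack walker, and ICorProfilerInfo2. The point of the sketch is the shape: your own suspension is always undone, even when the snapshot fails.

```cpp
#include <cassert>
#include <functional>

// Hypothetical hooks. A real profiler would back them with OS thread APIs,
// its own unmanaged stack walker, and ICorProfilerInfo2.
struct WalkHooks {
    std::function<void()> suspendTarget;            // step 1
    std::function<bool()> topFrameIsUnmanaged;      // steps 2 and 3
    std::function<void()> findSeedByUnmanagedWalk;  // step 4
    std::function<bool(bool seeded)> snapshot;      // step 5; false on failure
    std::function<void()> resumeTarget;             // step 6
};

// Runs one sample and returns whether the snapshot succeeded.
bool SampleTargetThread(const WalkHooks& hooks) {
    assert(hooks.suspendTarget && hooks.topFrameIsUnmanaged &&
           hooks.findSeedByUnmanagedWalk && hooks.snapshot &&
           hooks.resumeTarget);

    hooks.suspendTarget();

    // Undo our own suspension on every exit path, including a failed
    // snapshot, so the target thread is never left suspended.
    struct ResumeOnExit {
        const WalkHooks& hooks;
        ~ResumeOnExit() { hooks.resumeTarget(); }
    } resumeOnExit{hooks};

    bool seeded = false;
    if (hooks.topFrameIsUnmanaged()) {
        // The top of the stack is unmanaged, so compute a seed context by
        // walking down to the first managed frame.
        hooks.findSeedByUnmanagedWalk();
        seeded = true;
    }
    return hooks.snapshot(seeded);
}
```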
Be on Your Best Behavior
OK, this is way too much power without some serious caution. In the most advanced case, you're responding to timer interrupts and suspending application threads arbitrarily to walk their stacks. Yikes!
Being good is hard and involves rules that are not obvious at first. So let's dive in.
The Bad Seed
Let's start with an easy rule: do not use a bad seed. If your profiler supplies an invalid (non-null) seed when you call DoStackSnapshot, the CLR will give you bad results. It will look at the stack where you point it, and make assumptions about what the values on the stack are supposed to represent. That will cause the CLR to dereference what are assumed to be addresses on the stack. Given a bad seed, the CLR will dereference values off into some unknown place in memory. The CLR does everything it can to avoid an all-out second-chance AV, which would tear down the process you are profiling. But you really should make an effort to get your seed right.
Woes of Suspension
Other aspects of suspending threads are complicated enough that they require multiple rules. When you decide to do cross-thread walking, you've decided at a minimum to ask the CLR to suspend threads on your behalf. Moreover, if you want to walk the unmanaged block at the top of the stack, you've decided to suspend threads by yourself without invoking the wisdom of the CLR on whether this is a good idea at the moment.
If you took computer science classes, you probably remember the "dining philosophers" problem. A group of philosophers is sitting at a table, each with one fork on the right and one on the left. According to the problem, they each need two forks to eat. Each philosopher picks up his right fork, but then no one can pick up his left fork because each philosopher is waiting for the philosopher to his left to put down the needed fork. And if the philosophers are seated at a circular table, you've got a cycle of waiting and a lot of empty stomachs. The reason they all starve is that they break a simple rule of deadlock avoidance: if you need multiple locks, always take them in the same order. Following this rule would avoid the cycle where A waits on B, B waits on C, and C waits on A.
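As a small illustration of the lock-ordering rule, here is a sketch of the philosophers in which each one always picks up the lower-numbered fork first. All names (g_forks, Dine, RunDinner) are made up for this example. The philosopher in the last seat ends up reaching for fork 0 before fork 4, which is exactly what breaks the cycle.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kSeats = 5;
std::mutex g_forks[kSeats];  // fork i sits between seat i-1 and seat i

// One philosopher: eats `rounds` meals, always locking the lower-numbered
// fork first so that every thread acquires forks in the same global order.
void Dine(int seat, int rounds, int* mealsEaten) {
    assert(seat >= 0 && seat < kSeats);
    const int left = seat;
    const int right = (seat + 1) % kSeats;
    const int first = std::min(left, right);
    const int second = std::max(left, right);
    for (int i = 0; i < rounds; ++i) {
        std::lock_guard<std::mutex> lowFork(g_forks[first]);
        std::lock_guard<std::mutex> highFork(g_forks[second]);
        ++*mealsEaten;  // each philosopher writes only to its own counter
    }
}

// Seats all philosophers, waits for them to finish, and returns the total
// number of meals eaten.
int RunDinner(int rounds) {
    std::vector<int> meals(kSeats, 0);
    std::vector<std::thread> philosophers;
    for (int seat = 0; seat < kSeats; ++seat) {
        philosophers.emplace_back(Dine, seat, rounds, &meals[seat]);
    }
    for (auto& philosopher : philosophers) {
        philosopher.join();
    }
    int total = 0;
    for (int eaten : meals) {
        total += eaten;
    }
    return total;
}
```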
Suppose an application follows the rule and always takes locks in the same order. Now a component comes along (your profiler, for example) and starts arbitrarily suspending threads. The complexity has increased substantially. What if the suspender now needs to take a lock held by the suspendee? Or what if the suspender needs a lock held by a thread that's waiting for a lock held by another thread that's waiting for a lock held by the suspendee? Suspension adds a new edge to our thread-dependency graph, which can introduce cycles. Let's take a look at some specific problems.
Problem 1: The suspendee owns locks that are needed by the suspender or that are needed by threads that the suspender depends on.
Problem 1a: The locks are CLR locks.
As you can imagine, the CLR performs a lot of thread synchronization and therefore has several locks that are used internally. When you call DoStackSnapshot, the CLR checks whether the target thread owns a CLR lock that the current thread (the thread that is calling DoStackSnapshot) needs in order to perform the stack walk. If it does, the CLR refuses to perform the suspension, and DoStackSnapshot immediately returns with the error CORPROF_E_STACKSNAPSHOT_UNSAFE. At this point, if you suspended the thread yourself before your call to DoStackSnapshot, you resume the thread yourself, and you've avoided a problem.
Problem 1b: The locks are your own profiler's locks.
This problem is really more of a common-sense issue. You might have your own thread synchronization to do here and there. Imagine that an application thread (Thread A) encounters a profiler callback and runs some of your profiler code that takes one of the profiler's locks. Then Thread B needs to walk Thread A, which means Thread B will suspend Thread A. You need to remember that while Thread A is suspended, you shouldn't have Thread B try to take any of the profiler's own locks that Thread A might own. For example, Thread B will execute StackSnapshotCallback during the stack walk, so you shouldn't take any locks during that callback that could be owned by Thread A.
Problem 2: While you suspend the target thread, the target thread tries to suspend you.
You might say, "That can't happen!" Believe it or not, it can, if:
- Your application runs on a multiprocessor box, and
- Thread A runs on one processor and Thread B runs on another, and
- Thread A tries to suspend Thread B while Thread B tries to suspend Thread A.
In that case, it's possible that both suspensions win, and both threads end up suspended. Because each thread is waiting for the other to wake it up, they stay suspended forever.
This problem is more disconcerting than Problem 1, because you can't rely on the CLR to detect before you call DoStackSnapshot that the threads will suspend each other. And after you've performed the suspension, it's too late!
Why is the target thread trying to suspend the profiler? In a hypothetical, poorly-written profiler, the stack-walking code, along with the suspension code, might be executed by any number of threads at arbitrary times. Imagine that Thread A is trying to walk Thread B at the same time that Thread B is trying to walk Thread A. They both try to suspend each other simultaneously, because they're both executing the SuspendThread portion of the profiler's stack-walking routine. Both win, and the application being profiled is deadlocked. The rule here is obvious—don't allow your profiler to execute stack-walking code (and thus suspension code) on two threads simultaneously!
A less obvious reason that the target thread might try to suspend your walking thread is due to the inner workings of the CLR. The CLR suspends application threads to help with tasks like garbage collection. If your walker tries to walk (and thus suspend) the thread performing the garbage collection at the same time that the garbage collector thread tries to suspend your walker, the process will be deadlocked.
But it's easy to avoid the problem. The CLR suspends only the threads it needs to suspend in order to do its work. Imagine that there are two threads involved in your stack walk. Thread W is the current thread (the thread performing the walk). Thread T is the target thread (the thread whose stack is walked). As long as Thread W has never executed managed code, and is therefore not subject to CLR garbage collection, the CLR will never try to suspend Thread W. This means it's safe for your profiler to have Thread W suspend Thread T.
If you're writing a sampling profiler, it's quite natural to ensure all of this. You will typically have a separate thread of your own creation that responds to timer interrupts and that walks the stacks of other threads. Call this your sampler thread. Because you create the sampler thread yourself and have control over what it executes (and it therefore never executes managed code), the CLR will have no reason to suspend it. Designing your profiler so that it creates its own sampling thread to do all the stack walking also avoids the problem of the "poorly-written profiler" described earlier. The sampler thread is the only thread of your profiler trying to walk or suspend other threads, so your profiler will never try to directly suspend the sampler thread.
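One way to structure such a sampler thread is sketched below. The Sampler class and its sampleOnce callback are hypothetical names, and the sketch uses a timed wait on a condition variable where a real profiler might use an OS timer. What matters is that a single native thread, created by the profiler, is the only place where suspension and stack walking ever happen.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Owns the one native thread that performs all sampling. sampleOnce is
// where a real profiler would suspend a target, walk it, and resume it.
class Sampler {
public:
    Sampler(std::chrono::milliseconds interval, std::function<void()> sampleOnce)
        : interval_(interval), sampleOnce_(std::move(sampleOnce)) {
        assert(sampleOnce_);
        thread_ = std::thread([this] { Run(); });
    }
    ~Sampler() { Stop(); }
    Sampler(const Sampler&) = delete;
    Sampler& operator=(const Sampler&) = delete;

    // Asks the sampler thread to exit and waits for it. Safe to call twice
    // from the same thread.
    void Stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wake_.notify_all();
        if (thread_.joinable()) {
            thread_.join();
        }
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        // wait_for returns false when the interval elapses without a stop
        // request, which is the signal to take one sample.
        while (!wake_.wait_for(lock, interval_, [this] { return stop_; })) {
            lock.unlock();  // never hold our own lock while sampling
            sampleOnce_();
            lock.lock();
        }
    }

    const std::chrono::milliseconds interval_;
    const std::function<void()> sampleOnce_;
    std::mutex mutex_;
    std::condition_variable wake_;
    bool stop_ = false;
    std::thread thread_;
};
```

Because nothing else in the profiler calls the sampling routine, two threads can never try to suspend each other from inside it.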
This is our first nontrivial rule, so for emphasis let me repeat it:
Rule 1: Only a thread that has never run managed code should suspend another thread.
Nobody Likes to Walk a Corpse
If you're performing a cross-thread stack walk, you must ensure that your target thread remains alive for the duration of the walk. Just because you pass the target thread as a parameter to the DoStackSnapshot call doesn't mean you've implicitly added any kind of lifetime reference to it. The application can make the thread go away at any time. If that happens while you're trying to walk the thread, you could easily cause an access violation.
Fortunately, the CLR notifies profilers when a thread is about to be destroyed, using the aptly named ThreadDestroyed callback defined in the ICorProfilerCallback(2) interface. It's your responsibility to implement ThreadDestroyed and have it wait until any stack walk of that thread has finished. This is interesting enough to qualify as our next rule:
Rule 2: Override the ThreadDestroyed callback and have your implementation wait until you are done walking the stack of the thread to be destroyed.
Following Rule 2 blocks the CLR from destroying the thread until you are done walking that thread's stack.
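Here is one possible shape for that synchronization, as a sketch only. WalkTracker and its methods are hypothetical; the assumption is that your ThreadCreated and ThreadDestroyed callbacks forward to OnThreadCreated and OnThreadDestroyed, and that your single sampler thread calls BeginWalk before suspending the target and EndWalk after resuming it. That way the sampler never touches the tracker's lock while the target is suspended.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <set>
#include <thread>

using ThreadID = std::uintptr_t;  // stand-in for the corprof.h typedef

class WalkTracker {
public:
    void OnThreadCreated(ThreadID id) {
        std::lock_guard<std::mutex> lock(mutex_);
        live_.insert(id);
    }
    // Sampler thread, BEFORE suspending the target. Returns false if the
    // thread is already gone, in which case it must not be walked.
    bool BeginWalk(ThreadID id) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (live_.count(id) == 0) {
            return false;
        }
        assert(!walking_);  // one sampler thread means one walk at a time
        walking_ = true;
        target_ = id;
        return true;
    }
    // Sampler thread, AFTER resuming the target.
    void EndWalk() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            walking_ = false;
        }
        walkDone_.notify_all();
    }
    // Called from ThreadDestroyed: blocks while this thread is being walked.
    void OnThreadDestroyed(ThreadID id) {
        std::unique_lock<std::mutex> lock(mutex_);
        walkDone_.wait(lock, [&] { return !(walking_ && target_ == id); });
        live_.erase(id);
    }

private:
    std::mutex mutex_;
    std::condition_variable walkDone_;
    std::set<ThreadID> live_;
    bool walking_ = false;
    ThreadID target_ = 0;
};

// Small demonstration: destruction of thread 7 cannot complete until the
// walk of thread 7 has ended.
bool DemoDestroyWaitsForWalk() {
    WalkTracker tracker;
    tracker.OnThreadCreated(7);
    if (!tracker.BeginWalk(7)) {
        return false;
    }
    std::atomic<bool> destroyed{false};
    std::thread dying([&] {
        tracker.OnThreadDestroyed(7);
        destroyed = true;
    });
    const bool blockedDuringWalk = !destroyed.load();
    tracker.EndWalk();
    dying.join();
    return blockedDuringWalk && destroyed.load() && !tracker.BeginWalk(7);
}
```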
Garbage Collection Helps You Make a Cycle
Things can get a little confusing at this point. Let's start with the text of the next rule, and decipher it from there:
Rule 3: Do not hold a lock during a profiler call that can trigger garbage collection.
I mentioned earlier that it is a bad idea for your profiler to hold one of its own locks if the owning thread might be suspended, and if the thread might be walked by another thread that needs the same lock. Rule 3 helps you avoid a more subtle problem. Here, I'm saying you shouldn't hold any of your own locks if the owning thread is about to call an ICorProfilerInfo(2) method that might trigger a garbage collection.
A couple of examples should help. For the first example, assume that Thread B is doing the garbage collection. The sequence is:
1. Thread A takes and now owns one of your profiler locks.
2. Thread B calls the profiler's GarbageCollectionStarted callback.
3. Thread B blocks on the profiler lock from Step 1.
4. Thread A executes the GetClassFromTokenAndTypeArgs function.
5. The GetClassFromTokenAndTypeArgs call tries to trigger a garbage collection, but detects that a garbage collection is already in progress.
6. Thread A blocks, waiting for the garbage collection currently in progress (Thread B) to complete. However, Thread B is waiting for Thread A because of your profiler lock.
Figure 3 illustrates the scenario in this example:
Figure 3. A deadlock between the profiler and the garbage collector
The second example is a slightly different scenario. The sequence is:
1. Thread A takes and now owns one of your profiler locks.
2. Thread B calls the profiler's ModuleLoadStarted callback.
3. Thread B blocks on the profiler lock from Step 1.
4. Thread A executes the GetClassFromTokenAndTypeArgs function.
5. The GetClassFromTokenAndTypeArgs call triggers a garbage collection.
6. Thread A (which is now doing the garbage collection) waits for Thread B to be ready to be collected. But Thread B is waiting for Thread A because of your profiler lock.
Figure 4 illustrates the second example.
Figure 4. A deadlock between the profiler and a pending garbage collection
Have you digested the madness? The crux of the problem is that garbage collection has its own synchronization mechanisms. The result in the first example occurs because only one garbage collection can occur at a time. This is admittedly a fringe case, because garbage collections generally don't occur so often that one has to wait for another, unless you're operating under stressful conditions. Even so, if you profile long enough this scenario will occur, and you need to be prepared for it.
The result in the second example occurs because the thread performing the garbage collection must wait for the other application threads to be ready for collection. The problem arises when you introduce one of your own locks into the mix, thus forming a cycle. In both cases Rule 3 is broken by allowing Thread A to own one of the profiler locks and then call GetClassFromTokenAndTypeArgs. (Actually, calling any method that might trigger a garbage collection is sufficient to doom the process.)
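A common way to stay on the right side of Rule 3 is to copy what you need while holding the lock, release it, and only then call into the CLR. The sketch below shows that pattern. Every name in it is hypothetical: ResolveClassMayTriggerGC merely stands in for a call such as GetClassFromTokenAndTypeArgs and returns a made-up value, and the per-thread counter exists only so the sketch can assert that no profiler lock is held at the point of the call.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

using ClassID = std::uintptr_t;  // stand-in for the corprof.h typedef

std::mutex g_profilerLock;
std::vector<std::uint32_t> g_pendingTokens;  // protected by g_profilerLock

// Counts how many profiler locks the current thread holds.
thread_local int t_profilerLocksHeld = 0;

// Lock guard that also maintains the per-thread count.
class ProfilerLockGuard {
public:
    explicit ProfilerLockGuard(std::mutex& m) : lock_(m) { ++t_profilerLocksHeld; }
    ~ProfilerLockGuard() { --t_profilerLocksHeld; }
    ProfilerLockGuard(const ProfilerLockGuard&) = delete;
    ProfilerLockGuard& operator=(const ProfilerLockGuard&) = delete;

private:
    std::lock_guard<std::mutex> lock_;
};

// Hypothetical stand-in for a profiling API call that may trigger a
// garbage collection. The returned value is fake.
ClassID ResolveClassMayTriggerGC(std::uint32_t token) {
    // Rule 3: the caller must not hold any profiler lock here.
    assert(t_profilerLocksHeld == 0);
    return static_cast<ClassID>(token) + 1000;
}

// Drains the pending tokens under the lock, then resolves them with the
// lock released.
std::vector<ClassID> ResolvePending() {
    std::vector<std::uint32_t> tokens;
    {
        ProfilerLockGuard guard(g_profilerLock);
        tokens.swap(g_pendingTokens);
    }  // lock released before any call that may trigger a collection

    std::vector<ClassID> resolved;
    for (std::uint32_t token : tokens) {
        resolved.push_back(ResolveClassMayTriggerGC(token));
    }
    return resolved;
}
```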
You probably have several questions by now.
Q. How do you know which ICorProfilerInfo(2) methods might trigger a garbage collection?
Q. What does this have to do with stack walking? There's no mention of DoStackSnapshot.
A. True. And DoStackSnapshot is not even one of those ICorProfilerInfo(2) methods that trigger a garbage collection. The reason I'm discussing Rule 3 here is that it's precisely those adventuresome programmers asynchronously walking stacks from arbitrary samples who will be most likely to implement their own profiler locks, and thus be prone to falling into this trap. Indeed, Rule 2 essentially tells you to add synchronization into your profiler. It is quite likely that a sampling profiler will have other synchronization mechanisms as well, perhaps to coordinate reading and writing shared data structures at arbitrary times. Of course, it's still possible for a profiler that never touches DoStackSnapshot to encounter this issue.
Enough Is Enough
I'm going to finish with a quick summary of the highlights. Here are the important points to remember:
- Synchronous stack walks involve walking the current thread in response to a profiler callback. These don't require seeding, suspending, or any special rules.
- Asynchronous walks require a seed if the top of the stack is unmanaged code and not part of a PInvoke or COM call. You supply a seed by directly suspending the target thread and walking it yourself until you find the top-most managed frame. If you don't supply a seed in this case, DoStackSnapshot may return a failure code or skip some frames at the top of the stack.
- If you need to suspend threads, remember that only a thread that has never run managed code should suspend another thread.
- When performing asynchronous walks, always override the ThreadDestroyed callback to block the CLR from destroying a thread until that thread's stack walk is complete.
- Do not hold a lock while your profiler calls into a CLR function that can trigger a garbage collection.
For more information about the profiling API, see Profiling (Unmanaged) on the MSDN Web site.
Credit Where Credit Is Due
I'd like to include a note of thanks to the rest of the CLR Profiling API team, because writing these rules has truly been a team effort. Special thanks to Sean Selitrennikoff, who provided an earlier incarnation of much of this content.
About the Author
David has been a developer at Microsoft for longer than you'd think, given his limited knowledge and maturity. Although no longer allowed to check in code, he still offers ideas for new variable names. David is an avid fan of Count Chocula and owns his own car.