February 2013

Volume 28 Number 02

Async Programming - Async Causality Chain Tracking

By Andrew Stasyuk

With the advent of C# 5, Visual Basic .NET 11, the Microsoft .NET Framework 4.5 and .NET for Windows Store apps, the asynchronous programming experience has been streamlined greatly. New async and await keywords (Async and Await in Visual Basic) allow developers to maintain the same abstraction they were used to when writing synchronous code.

A lot of effort was put into Visual Studio 2012 to improve asynchronous debugging with tools such as Parallel Stacks, Parallel Tasks, Parallel Watch and the Concurrency Visualizer. However, in terms of being on par with the synchronous code debugging experience, we’re not quite there yet.

One of the more prominent issues that breaks the abstraction and reveals internal plumbing behind the async/await façade is the lack of call stack information in the debugger. In this article, I’m going to provide means to bridge this gap and improve the asynchronous debugging experience in your .NET 4.5 or Windows Store app.

Let’s settle on essential terminology first.

Definition of a Call Stack

MSDN documentation (bit.ly/Tukvkm) used to define call stack as “the series of method calls leading from the beginning of the program to the statement currently being executed at run time.” This notion was perfectly valid for the single-threaded, synchronous programming model, but now that parallelism and asynchrony are gaining momentum, more precise taxonomy is necessary.

For the purpose of this article, it’s important to distinguish the causality chain from the return stack. Within the synchronous paradigm, these two terms are mostly identical (I’ll mention the exceptional case later). In asynchronous code, the aforementioned definition describes a causality chain.

On the other hand, the statement currently being executed, when finished, will lead to a series of methods continuing their execution. This series constitutes the return stack. Alternatively, for readers familiar with the continuation passing style (Eric Lippert has a fabulous series on this topic, starting at bit.ly/d9V0Dc), the return stack might be defined as a series of continuations that are registered to execute, should the currently executing method complete.

In a nutshell, the causality chain answers the question, “How did I get here?” while return stack is the answer for, “Where do I go next?” For example, if you’ve got a deadlock in your application, you might be able to find out what caused it from the former, while the latter would let you know what the consequences are. Note that while a causality chain always tracks back to the program entry point, the return stack is cut off at the point where the result of asynchronous operation is not observed (for example, async void methods or work scheduled via ThreadPool.QueueUserWorkItem).

There’s also a notion of stack trace being a copy of a synchronous call stack preserved for diagnostics; I’ll use these two terms interchangeably.

Be aware that there are several unspoken assumptions in the preceding definitions:

  • “Method calls” referred to in the first definition generally imply “methods that have not completed yet,” which bear the physical meaning of “being on stack” in the synchronous programming model. However, while we’re generally not interested in methods that have already returned, it’s not always possible to distinguish them during asynchronous debugging. In this case, there’s no physical notion of “being on stack” and all continuations are equally valid elements of a causality chain.
  • Even in synchronous code, a causality chain and return stack aren’t always identical. One particular case when a method might be present in one, but missing from the other, is a tail call. Though not directly expressible in C# and Visual Basic .NET, it may be coded in Intermediate Language (IL) (“tail.” prefix) or produced by the just-in-time (JIT) compiler (especially in a 64-bit process).
  • Last, but not least, causality chains and return stacks can be nonlinear. That is, in the most general case, they’re directed graphs having current statement as a sink (causality graph) or source (return graph). Nonlinearity in asynchronous code is due to forks (parallel asynchronous operations originating from one) and joins (continuation scheduled to run upon completion of a set of parallel asynchronous operations). For the purpose of this article, and due to platform limitations (explained later), I’ll consider only linear causality chains and return stacks, which are subsets of corresponding graphs.
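
To make the fork and join terminology concrete, here is a minimal sketch. The names ForkJoinSample, SumAsync and WorkAsync are made up for illustration and don't appear in the article's sample code. Three operations fork from one point, and a single continuation joins them:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class ForkJoinSample
{
    // Fork: three independent operations start from one point.
    // Join: one continuation runs only after all of them complete.
    public static async Task<int> SumAsync()
    {
        Task<int>[] forks = { WorkAsync(1), WorkAsync(2), WorkAsync(3) };
        int[] results = await Task.WhenAll(forks); // join point
        return results.Sum();
    }

    static async Task<int> WorkAsync(int value)
    {
        await Task.Delay(10 * value);
        return value;
    }
}
```

While execution is paused after the join, the causality graph of SumAsync has three incoming branches, one per WorkAsync call. A linear chain can represent only one of them.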

Luckily, if asynchrony is introduced into a program by using async and await keywords with no forks or joins, and all async methods are awaited, the causality chain is still identical to the return stack, just as in synchronous code. In this case, both of them are equally useful in orienting yourself in the control flow.

On the other hand, causality chains are rarely equal to return stacks in programs employing explicitly scheduled continuations, a notable example being Task Parallel Library (TPL) dataflow. This is due to the nature of data flowing from a source block to a target block, never returning to the former.

Existing Tools

Consider a quick example:

static void Main() { OperationAsync().Wait(); }
async static Task OperationAsync()
{
  await Task.Delay(1000);
  Console.WriteLine("Where is my call stack?");
}

By extrapolating the abstraction developers were used to in synchronous debugging, they would expect to see the following causality chain/return stack when execution is paused at the Console.WriteLine method:

ConsoleSample.exe!ConsoleSample.Program.OperationAsync() Line 19
ConsoleSample.exe!ConsoleSample.Program.Main() Line 13

But if you try this, you’ll find that in the Call Stack window the Main method is missing, while the stack trace starts directly in the OperationAsync method preceded by [Resuming Async Method]. Parallel Stacks has both methods; however, it doesn’t show that Main calls OperationAsync. Parallel Tasks doesn’t help either, showing “No tasks to display.”

Note: At this point the debugger is aware of the Main method being part of the call stack—you might have noticed that by the gray background behind the call to OperationAsync. The CLR and Windows Runtime (WinRT) have to know where to continue execution after the topmost stack frame returns; thus, they do indeed store return stacks. In this article, though, I’ll only delve into causality tracking, leaving return stacks as a topic for another article.

Preserving Causality Chains

In fact, causality chains are never stored by the runtime. Even call stacks that you see when debugging synchronous code are, in essence, return stacks—as was just said, they’re necessary for the CLR and Windows Runtime to know which methods to execute after the topmost frame returns. The runtime doesn’t need to know what caused a particular method to execute.

To be able to view causality chains during live and post-mortem debugging, you have to explicitly preserve them along the way. Presumably, this would require storing (synchronous) stack trace information at every point where continuation is scheduled and restoring this data when continuation starts to execute. These stack trace segments could then be stitched together to form a causality chain.
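
The store-and-stitch idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: ContinueWithCausality is a hypothetical helper, not a framework API, and it returns the stitched text rather than keeping it in any storage. It captures the synchronous stack segment at the point where a continuation is scheduled and joins it with the segment seen when the continuation runs:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class StitchingSample
{
    // Hypothetical helper: schedules a continuation and remembers the
    // synchronous stack segment of the scheduling site.
    public static Task<string> ContinueWithCausality(this Task task)
    {
        // Skip this helper's own frame; file info is not needed here.
        string schedulingSegment = new StackTrace(1, false).ToString();
        return task.ContinueWith(_ =>
        {
            // Segment of the thread that actually runs the continuation.
            string runningSegment = new StackTrace(false).ToString();
            return runningSegment
                + "--- scheduled from ---" + Environment.NewLine
                + schedulingSegment;
        });
    }
}
```

The resulting string lists the continuation's frames first, followed by the frames that scheduled it, which is the order a debugger's Call Stack window would use.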

We’re more interested in transferring causality information across await constructs, as this is where abstraction of similarity with synchronous code breaks. Let’s see how and when this data can be captured.

As Stephen Toub points out (bit.ly/yF8eGu), provided that FooAsync returns a Task, the following code:

await FooAsync();

is transformed by the compiler to a rough equivalent of this:

var t = FooAsync();
var currentContext = SynchronizationContext.Current;
t.ContinueWith(delegate
{
  if (currentContext == null)
    RestOfMethod();
  else
    currentContext.Post(delegate { RestOfMethod(); }, null);
}, TaskScheduler.Current);

From looking at the expanded code, it appears there are at least two extension points that might allow for capturing causality information: TaskScheduler and SynchronizationContext. Indeed, both offer similar pairs of virtual methods where it should be possible to capture call stack segments at the right moments: QueueTask/TryDequeue on TaskScheduler and Post/OperationStarted on SynchronizationContext.

Unfortunately, you can only substitute the default TaskScheduler when explicitly scheduling a delegate via the TPL API, such as Task.Run, Task.ContinueWith, TaskFactory.StartNew and so on. This means that whenever a continuation is scheduled outside a running task, the default TaskScheduler will be in force. Thus, the TaskScheduler-based approach won’t be able to capture the necessary information.

As for SynchronizationContext, although it’s possible to override the default instance of this class for the current thread by calling the SynchronizationContext.SetSynchronizationContext method, this has to be done for every thread in the application. Thus, you’d have to be able to control thread lifetime, which is infeasible if you aren’t planning to re-implement a thread pool. Moreover, Windows Forms, Windows Presentation Foundation (WPF) and ASP.NET all provide their own implementations of SynchronizationContext in addition to the base SynchronizationContext class, whose default implementation schedules work to the thread pool. Hence, your implementation would have to behave differently depending on the origin of the thread in which it’s working.
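
For illustration only, a context that captures the poster's stack might look like the following sketch. CapturingSynchronizationContext and CurrentCause are hypothetical names. The sketch assumes the base class behavior of posting to the thread pool, and it doesn't reinstall itself on the pool thread:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical context: records the stack of whoever posts a callback,
// then delegates to the base implementation (thread pool).
sealed class CapturingSynchronizationContext : SynchronizationContext
{
    // Stack segment that caused the callback now running on this thread.
    [ThreadStatic] public static string CurrentCause;

    public override void Post(SendOrPostCallback d, object state)
    {
        string cause = new StackTrace(1, false).ToString();
        base.Post(s =>
        {
            CurrentCause = cause;           // restore before the callback runs
            try { d(s); }
            finally { CurrentCause = null; } // don't leak into later work items
        }, state);
    }

    public override SynchronizationContext CreateCopy()
    {
        return new CapturingSynchronizationContext();
    }
}
```

Because the callback runs on a pool thread where this context isn't current, any await after the first one would bypass it. That is exactly the per-thread installation problem described above.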

Also note that when awaiting a custom awaitable, it’s entirely up to implementation whether to use SynchronizationContext to schedule a continuation.

Luckily, there are two extension points suitable for our scenario: subscribing to TPL events without having to modify the existing codebase, or explicitly opting in by slightly modifying every await expression in the application. The first approach only works in desktop .NET applications, while the second can accommodate Windows Store apps. I’ll detail both in the following sections.

Introducing EventSource

The .NET Framework supports Event Tracing for Windows (ETW), having defined event providers for practically every aspect of the runtime (bit.ly/VDfrtP). Particularly, TPL fires events that allow you to track Task lifetime. Although not all of these events are documented, you can obtain their definitions yourself by delving into mscorlib.dll with a tool such as ILSpy or Reflector or peeking into framework reference source (referencesource.microsoft.com/) and searching for the TplEtwProvider class. Of course, the usual reflection disclaimer applies: If the API isn’t documented, there’s no guarantee that empirically observed behavior will be retained in the next release.

TplEtwProvider inherits from System.Diagnostics.Tracing.EventSource, which was introduced in the .NET Framework 4.5 and is now a recommended way to fire ETW events in your application (previously you had to deal with manual ETW manifest generation). In addition, EventSource allows for consumption of events in process, by subscribing to them via EventListener, also new in the .NET Framework 4.5 (more on this momentarily).

The event provider can be identified by either a name or GUID. Each particular event type is in turn identified by event ID and, optionally, a keyword to distinguish from other unrelated types of events fired by this provider (TplEtwProvider doesn’t use keywords). There are optional Task and Opcode parameters that you might find useful for filtering, but I’ll rely solely on event ID. Each event also defines the level of verbosity.

TPL events have a variety of uses besides causality chains, such as tracking of tasks in-flight, telemetry and so on. They don’t fire for custom awaitables, though.

Introducing EventListener

In the .NET Framework 4, in order to capture ETW events, you had to be running an out-of-process ETW listener, such as Windows Performance Recorder or Vance Morrison’s PerfView, and then correlate captured data with the state you observed in the debugger. This posed additional problems, as data was stored outside process memory space and crash dumps didn’t include it, which made this solution less suitable for post-mortem debugging. For example, if you rely on Windows Error Reporting to provide dumps, you won’t get any ETW traces and thus causality information will be missing.

However, starting in the .NET Framework 4.5, it’s possible to subscribe to TPL events (and other events fired by EventSource inheritors) via System.Diagnostics.Tracing.EventListener (bit.ly/XJelwF). This allows the capture and preservation of stack trace segments in the process memory space. Therefore, a mini-dump with heap should be enough to extract causality information. In this article, I’ll only detail EventListener-based subscriptions.

It’s worth mentioning that the advantage of an out-of-process listener is that you can always get the call stacks by listening to the Stack ETW Events (either relying on an existing tool or doing tedious stack walking and module address tracking yourself). When subscribing to the events using EventListener, you can’t get call stack information in Windows Store apps, because the StackTrace API is prohibited. (An approach that works for Windows Store apps is described later.)

In order to subscribe to events, you have to inherit from EventListener, override the OnEventSourceCreated method and make sure that an instance of your listener gets created in every AppDomain of your program (subscription is per application domain). After EventListener is instantiated, this method will be called to notify the listener of event sources that are being created. It will also provide notifications for all event sources that existed before the listener was created. After filtering event sources either by name or GUID (performance-wise, comparing GUIDs is a better idea), a call to EnableEvents subscribes the listener to the source:

private static readonly Guid tplGuid =
  new Guid("2e5dba47-a3d2-4d16-8ee0-6671ffdcd7b5");
protected override void OnEventSourceCreated(EventSource eventSource)
{
  if (eventSource.Guid == tplGuid)
    EnableEvents(eventSource, EventLevel.LogAlways);
}

To process events, you need to implement abstract method OnEventWritten. For the purpose of preserving and restoring stack trace segments, you need to capture the call stack right before an asynchronous operation is scheduled, and then, when it starts execution, associate a stored stack trace segment with it. To correlate these two events, you can use the TaskID parameter. Parameters passed to a corresponding event-firing method in an event source are boxed into a read-only object collection and passed in as the Payload property of EventWrittenEventArgs.
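
Because the TPL events are undocumented, the payload mechanics are easier to try out on an event source you control. The following sketch uses a hypothetical provider named "Sample-Causality" with a single event. Neither is part of the framework. It shows how an argument passed to WriteEvent arrives boxed in the Payload collection:

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.Tracing;

[EventSource(Name = "Sample-Causality")] // hypothetical provider name
sealed class SampleEventSource : EventSource
{
    public static readonly SampleEventSource Log = new SampleEventSource();

    [Event(1, Level = EventLevel.Informational)]
    public void OperationScheduled(int operationId)
    {
        if (IsEnabled()) WriteEvent(1, operationId);
    }
}

sealed class SampleListener : EventListener
{
    public readonly ConcurrentQueue<int> ScheduledIds =
        new ConcurrentQueue<int>();

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Name comparison keeps the sketch short; the article prefers GUIDs.
        if (eventSource.Name == "Sample-Causality")
            EnableEvents(eventSource, EventLevel.Informational);
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventId == 1)
            ScheduledIds.Enqueue((int)eventData.Payload[0]); // unbox argument 0
    }
}
```

The payload index follows the order of the event method's parameters, which is why the TPL handlers shown in this article read the task ID from a fixed position.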

Interestingly, there are special fast paths for EventSource events that are consumed as ETW (not via EventListener), where boxing doesn’t occur for their arguments. This does provide a performance improvement, but it’s mostly zeroed out due to cross-process machinery.

In the OnEventWritten method, you need to distinguish between event sources (in case you subscribe to more than one) and identify the event itself. The stack trace will be captured (stored) when TaskScheduled or TaskWaitBegin events fire, and associated with a newly started asynchronous operation (restored) in TaskWaitEnd. You also need to pass in taskId as the correlation identifier. Figure 1 shows the outline of how the events will be handled.

Figure 1 Handling of TPL Events in the OnEventWritten Method

protected override void OnEventWritten(EventWrittenEventArgs eventData)
{
  if (eventData.EventSource.Guid == tplGuid)
  {
    int taskId;
    switch (eventData.EventId)
    {
      case 7: // Task scheduled
        taskId = (int)eventData.Payload[2];
        stackStorage.StoreStack(taskId); // capture the current stack segment
        break;
      case 10: // Task wait begin
        taskId = (int)eventData.Payload[2];
        bool waitBehaviorIsSynchronous =
          (int)eventData.Payload[3] == 1;
        if (!waitBehaviorIsSynchronous)
          stackStorage.StoreStack(taskId);
        break;
      case 11: // Task wait end
        taskId = (int)eventData.Payload[2];
        stackStorage.RestoreStack(taskId); // reattach the stored segment
        break;
    }
  }
}

Note: Explicit values (“magic numbers”) in code are a bad programming practice and are used here only for brevity. The accompanying sample code project has them conveniently structured in constants and enumerations to avoid duplication and risk of typos.

Note that in TaskWaitBegin, I check for TaskWaitBehavior being synchronous, which happens when a task being awaited is executed synchronously or has already completed. In this case, a synchronous call stack is still in place, so it doesn’t need to be stored explicitly.

Async-Local Storage

Whatever data structure you choose to preserve call stack segments needs the following quality: Stored value (causality chain) should be preserved for every asynchronous operation, following control flow along the way across await boundaries and continuations, bearing in mind that continuations may execute on different threads.

This suggests a thread-local-like variable that would preserve its value pertaining to the current asynchronous operation (a chain of continuations), instead of a particular thread. It can be roughly named “async-local storage.”

The CLR already has a data structure called ExecutionContext that’s captured on one thread and restored on the other (where continuation gets to execute), thus being passed along with control flow. This is essentially a container that stores other contexts (SynchronizationContext, CallContext and so on) that might be needed to continue execution in exactly the same environment, where they were interrupted. Stephen Toub has the details at bit.ly/M0amHk. Most importantly, you can store arbitrary data in CallContext (by calling its static methods LogicalSetData and LogicalGetData), which seems to suit the aforementioned purpose.

Bear in mind that CallContext (actually, internally there are two of them: LogicalCallContext and IllogicalCallContext) is a heavy object, designed to flow across remoting boundaries. When no custom data is stored, the runtime doesn’t initialize the contexts, sparing the cost of maintaining them with the control flow. As soon as you call the CallContext.LogicalSetData method, a mutable ExecutionContext and several Hashtables have to be created and passed along or cloned from then on.

Unfortunately, ExecutionContext (together with all its constituents) is captured before the described TPL events fire and restored shortly afterward. Thus, any custom data saved in CallContext in between is discarded after ExecutionContext is restored, which makes it unsuitable for our particular purpose.

In addition, the CallContext class isn’t available in the .NET for Windows Store apps subset, so an alternative is needed for this scenario.

One way to build an async-local storage that would work around these problems is to maintain the value in thread-local storage (TLS) while the synchronous portion of code is executing. Then, when the TaskWaitBegin event fires, store the value in a shared (non-TLS) dictionary, keyed by the TaskID. When the counterpart event, TaskWaitEnd, fires, remove the preserved value from the dictionary and save it back to TLS, possibly on a different thread.
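
A minimal sketch of this hand-off follows. AsyncLocalSketch and its members are hypothetical names chosen for illustration. The StackStorage and AsyncLocal<T> classes in the sample code may be organized differently, and the sketch stores a plain string instead of stack segments:

```csharp
using System;
using System.Collections.Concurrent;

static class AsyncLocalSketch
{
    // Value of the operation currently running on this thread.
    [ThreadStatic] static string current;

    // Values of suspended operations, keyed by task ID.
    static readonly ConcurrentDictionary<int, string> parked =
        new ConcurrentDictionary<int, string>();

    public static string Current
    {
        get { return current; }
        set { current = value; }
    }

    // Call from the TaskWaitBegin handler (asynchronous waits only).
    public static void OnWaitBegin(int taskId)
    {
        parked[taskId] = current;
    }

    // Call from the TaskWaitEnd handler, possibly on a different thread.
    public static void OnWaitEnd(int taskId)
    {
        string value;
        if (parked.TryRemove(taskId, out value)) current = value;
    }

    // Call when the thread picks up unrelated work.
    public static void Clear()
    {
        current = null;
    }
}
```

The dictionary is the only shared state, so a concurrent collection is enough. The thread-static field needs no locking because each thread sees its own copy.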

As you might know, values stored in TLS are preserved even after a thread is returned to the thread pool and gets new work to execute. So, at some point, the value has to be removed from TLS (otherwise, some other asynchronous operation executing on this thread later might access the value stored by the previous operation as if it were its own). You can’t do this in the TaskWaitBegin event handler because, in case of nested awaits, TaskWaitBegin and TaskWaitEnd events occur multiple times, once per await, and a stored value might be needed in between, such as in the following snippet:

async Task OuterAsync()
{
  await InnerAsync();
}
async Task InnerAsync()
{
  await Task.Delay(1000);
}

Instead, it’s safe to consider that the value in TLS is eligible to be cleared when the current asynchronous operation is no longer being executed on a thread. Because the CLR doesn’t have an in-process event that would notify of a thread being recycled back to the thread pool (there’s an ETW one; see bit.ly/ZfAWrb), for this purpose I’ll use ThreadPoolDequeueWork fired by FrameworkEventSource (also undocumented), which occurs when a new operation is started on a thread pool thread. This leaves out non-pooled threads, for which you’d have to manually clean the TLS, such as when a UI thread returns to the message loop.

For a working implementation of this concept together with stack segments capturing and concatenation, please refer to the StackStorage class in the accompanying source code download. There’s also a cleaner abstraction, AsyncLocal<T>, which allows you to store any value and transfer it with the control flow to subsequent asynchronous continuations. I’ll use it as causality chain storage for Windows Store apps scenarios.

Tracing Causality in Windows Store Apps

The described approach would still hold up in a Windows Store scenario if the System.Diagnostics.StackTrace API were available. For better or for worse, it isn’t, which means you can’t get any information about call stack frames above the current one from within your code. Thus, even while TPL events are still supported, a call to TaskWaitBegin or TaskWaitEnd is buried deep in the framework method calls, so you have no information about your code that caused these events to fire.

Luckily, .NET for Windows Store apps (as well as the .NET Framework 4.5) provides CallerMemberNameAttribute (bit.ly/PsDH0p) and its peers CallerFilePathAttribute and CallerLineNumberAttribute. When optional method arguments are decorated with these, the compiler will initialize the arguments with corresponding values at compile time. For example, the following code will output “Main() in c:\Full\Path\To\Program.cs at line 14”:

static void Main(string[] args)
{
  LogCurrentFrame();
}
static void LogCurrentFrame([CallerMemberName] string name = null,
  [CallerFilePath] string path = null,
  [CallerLineNumber] int line = 0)
{
  Console.WriteLine("{0}() in {1} at line {2}", name, path, line);
}

This only allows the logging method to get information about the calling frame, which means you have to ensure it gets called from all the methods you want captured in the causality chain. One convenient location for this would be decorating each await expression with a call to an extension method, like this:

await WorkAsync().WithCausality();

Here, the WithCausality method captures the current frame, appends it to causality chain and returns a Task or awaitable (depending on what WorkAsync returns), which upon completion of the original one removes the frame from the causality chain.

As multiple different things can be awaited, there should be multiple overloads of WithCausality. This is straightforward for a Task<T> (and even easier for a Task):

public static Task<T> WithCausality<T>(this Task<T> task,
  [CallerMemberName] string member = null,
  [CallerFilePath] string file = null,
  [CallerLineNumber] int line = 0)
{
  var removeAction =
    AddFrameAndCreateRemoveAction(member, file, line);
  return task.ContinueWith(t => { removeAction(); return t.Result; });
}
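
For the non-generic Task case, a self-contained sketch could look like this. The body of AddFrameAndCreateRemoveAction here is a guess at what such a helper might do, not the sample's implementation. For brevity it keeps frames in one process-wide list rather than in async-local storage, and CausalitySketch and Snapshot are illustrative names:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

static class CausalitySketch
{
    // Simplification: one shared list instead of async-local storage.
    static readonly List<string> frames = new List<string>();

    public static string[] Snapshot()
    {
        lock (frames) return frames.ToArray();
    }

    // Hypothetical helper: records a frame, returns an action that removes it.
    static Action AddFrameAndCreateRemoveAction(
        string member, string file, int line)
    {
        string frame = string.Format(
            "{0}() in {1} at line {2}", member, file, line);
        lock (frames) frames.Add(frame);
        return () => { lock (frames) frames.Remove(frame); };
    }

    public static Task WithCausality(this Task task,
        [CallerMemberName] string member = null,
        [CallerFilePath] string file = null,
        [CallerLineNumber] int line = 0)
    {
        var removeAction = AddFrameAndCreateRemoveAction(member, file, line);
        return task.ContinueWith(t =>
        {
            removeAction();
            t.GetAwaiter().GetResult(); // rethrow the original failure, if any
        });
    }
}
```

One caveat of this simple wrapper is that a canceled task surfaces as a faulted one, because the exception is rethrown inside the continuation. A fuller version would propagate cancellation explicitly.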

However, it’s trickier for custom awaitables. As you might know, the C# compiler allows you to await an instance of any type that follows a particular pattern (see bit.ly/AmAUIF), which makes writing overloads that would accommodate any custom awaitable impossible using static typing only. You may make a few shortcut overloads for awaitables predefined in the framework, such as YieldAwaitable or ConfiguredTaskAwaitable—or the ones defined in your solution—but in general you have to resort to the Dynamic Language Runtime (DLR). Handling all the cases requires a lot of boilerplate code, so feel free to look into the accompanying source code for details.

It’s also worth noting that in case of nested awaits, WithCausality methods will be executed from inner to outer (as await expressions are evaluated), so care must be taken to assemble the stack in the correct order.

Viewing Causality Chains

Both described approaches keep causality information in memory as lists of call stack segments or frames. However, walking them and concatenating into a single causality chain for display is tedious to do by hand.

The easiest option to automate this is to leverage the debugger evaluator. In this case, you author a public static property (or method) on a public class, which, when called, walks the list of stored segments and returns a concatenated causality chain. Then you can evaluate this property during debugging and see the result in the text visualizer.
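
Such a property might be shaped as follows. This is a sketch with hypothetical names (CausalityViewer, Segments, CausalityChain), and the segment list stands in for whatever storage your tracker actually uses:

```csharp
using System;
using System.Collections.Generic;

public static class CausalityViewer
{
    // Hypothetical storage: oldest segment first, newest last.
    public static readonly List<string> Segments = new List<string>();

    // Evaluate CausalityViewer.CausalityChain in the Watch window.
    public static string CausalityChain
    {
        get
        {
            var copy = new List<string>(Segments);
            copy.Reverse(); // most recent segment on top, like a call stack
            string separator =
                Environment.NewLine + "[Async Call]" + Environment.NewLine;
            return string.Join(separator, copy);
        }
    }
}
```

Both the class and the property are public and static so that the expression can be typed into a Watch window from any stack frame.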

Unfortunately, this approach doesn’t work in two situations. One occurs when the topmost stack frame is in native code, which is quite a common scenario for debugging application hangs, as kernel-based synchronization primitives do call into native code. The debugger evaluator would just display, “Cannot evaluate expression because the code of the current method is optimized” (Mike Stall describes these limitations in detail at bit.ly/SLlNuT).

The other issue is with post-mortem debugging. You can actually open a mini-dump in Visual Studio and, surprisingly (given that there’s no process to debug, only its memory dump), you’re allowed to examine property values (run property getters) and even call some methods! This amazing piece of functionality is built into the Visual Studio debugger and works by interpreting a watch expression and all methods that it calls into (in contrast to live debugging, where compiled code gets executed).

Obviously, there are limitations. For example, while doing dump debugging, you can’t in any way call into native methods (meaning that you can’t even execute a delegate, because its Invoke method is generated in native code) or access some restricted APIs (such as System.Reflection). Interpreter-based evaluation is also expectedly slow—and, sadly, due to a bug, the evaluation timeout for dump debugging is limited to 1 second in Visual Studio 2012, regardless of configuration. This, given the number of method calls required to traverse the list of stack trace segments and iterate over all frames, prohibits the use of the evaluator for this purpose.

Luckily, the debugger always allows access to field values (even in dump debugging or when the top stack frame is in native code), which makes it possible to crawl through the objects constituting a stored causality chain and reconstruct it. This is obviously tedious, so I wrote a Visual Studio extension that does this for you (see accompanying sample code). Figure 2 shows what the final experience looks like. Note that the graph on the right is also generated by this extension and represents the async equivalent of Parallel Stacks.

Figure 2 Causality Chain for an Asynchronous Method and “Parallel” Causality for All Threads

Comparison and Caveats

Neither causality-tracking approach is free. The second one (caller-info-based) is more lightweight, as it doesn’t involve the expensive StackTrace API, relying instead on the compiler to provide caller frame information during compile time, which means “free” in a running program. However, it still uses eventing infrastructure with its cost to support AsyncLocal<T>. On the other hand, the first approach provides more data, not skipping frames without awaits. It also automatically tracks several other situations where Task-based asynchrony arises without await, such as the Task.Run method; however, it does not work with custom awaitables.

An additional benefit of the TPL events-based tracker is that existing asynchronous code doesn’t have to be modified, while for the caller info attributes-based approach, you have to alter every await statement in your program. But only the latter supports Windows Store apps.

The TPL events tracker also suffers from a lot of boilerplate framework code in stack trace segments, though it can be easily filtered out by frame namespace or class name. See the sample code for a list of common filters.

Another caveat concerns loops in asynchronous code. Consider the following snippet:

async static Task Loop()
{
  for (int i = 0; i < 10; i++)
  {
    await FirstAsync();
    await SecondAsync();
    await ThirdAsync();
  }
}

By the end of the method, its causality chain would grow to more than 30 segments, repeatedly alternating between FirstAsync, SecondAsync and ThirdAsync frames. For a finite loop, this may be tolerable, though it’s still a waste of memory to store duplicate frames 10 times. However, in some cases, a program might introduce a valid infinite loop, for example, in the case of a message loop. Moreover, infinite repetition might be introduced without loop or await constructs—a timer rescheduling itself on every tick is a perfect example. Tracking an infinite causality chain is a sure way to run out of memory, so the amount of data stored has to be reduced to a finite amount somehow.

This issue doesn’t affect the caller-info-based tracker, as it removes a frame from the list immediately upon the start of a continuation. There are two (combinable) approaches to fix this for the TPL events scenario. One is to cut older data based on the rolling maximum storage amount. The other is to represent loops efficiently and avoid duplication. For both approaches, you might also detect common infinite loop patterns and cut the causality chain explicitly at these points.
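
The two remedies can be combined in a small container, sketched here. BoundedChain is a hypothetical name, and the folding is deliberately naive: it collapses only immediate repetitions of the same segment, not longer cycles such as the three-step loop shown earlier:

```csharp
using System.Collections.Generic;

sealed class BoundedChain
{
    readonly int capacity;

    // Each entry is a segment's text plus how many times it repeated.
    readonly LinkedList<KeyValuePair<string, int>> segments =
        new LinkedList<KeyValuePair<string, int>>();

    public BoundedChain(int capacity)
    {
        this.capacity = capacity;
    }

    public int Count
    {
        get { return segments.Count; }
    }

    public void Append(string segment)
    {
        var last = segments.Last;
        if (last != null && last.Value.Key == segment)
        {
            // Fold an immediate repetition into a counter.
            last.Value = new KeyValuePair<string, int>(
                segment, last.Value.Value + 1);
            return;
        }
        segments.AddLast(new KeyValuePair<string, int>(segment, 1));
        if (segments.Count > capacity)
            segments.RemoveFirst(); // rolling maximum: drop the oldest
    }
}
```

Detecting longer cycles would require comparing runs of several segments, which is more involved than this sketch attempts.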

Feel free to refer to the accompanying sample project to see how loop folding might be implemented.

As stated, the TPL events API only lets you capture a causality chain, not a graph. This is because the Task.WaitAll and Task.WhenAll methods are implemented as countdowns, where continuation is scheduled only when the last task comes in completed and the counter reaches zero. Thus, only the last completed task forms a causality chain.
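
The countdown behavior can be imitated with a simplified stand-in. This sketch is not the framework's implementation. WhenAllReportingLast is a hypothetical method that ignores faults and cancellation, never completes for an empty array, and reports which task arrived last:

```csharp
using System.Threading;
using System.Threading.Tasks;

static class CountdownSketch
{
    // Completes when every task has finished; the result is the index of
    // the task whose completion brought the counter to zero.
    public static Task<int> WhenAllReportingLast(params Task[] tasks)
    {
        var tcs = new TaskCompletionSource<int>();
        int remaining = tasks.Length;
        for (int i = 0; i < tasks.Length; i++)
        {
            int index = i; // capture a per-iteration copy
            tasks[i].ContinueWith(_ =>
            {
                // Only the last finisher triggers the join continuation.
                if (Interlocked.Decrement(ref remaining) == 0)
                    tcs.SetResult(index);
            }, TaskContinuationOptions.ExecuteSynchronously);
        }
        return tcs.Task;
    }
}
```

Whatever continues after the returned task runs as a consequence of that one last completion. The earlier finishers only decremented the counter, which is why they leave no trace in a chain built from scheduling events.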

Wrapping Up

In this article, you’ve learned the difference between a call stack, a return stack and a causality chain. You should now be aware of extension points that the .NET Framework provides to track scheduling and execution of asynchronous operations and be able to leverage these to capture and preserve causality chains. The approaches described cover tracking causality in classic and Windows Store apps, both in live and post-mortem debugging scenarios. You also learned about the concept of async-local storage and its possible implementation for Windows Store apps.

Now you can go ahead and incorporate causality tracking into your asynchronous codebase or use async-local storage in parallel calculations; explore the event sources that the .NET Framework 4.5 and .NET for Windows Store apps offer to build something new, such as a tracker for unfinished tasks in your program; or use this extension point to fire your own events to fine-tune the performance of your application.

Andriy (Andrew) Stasyuk is a software development engineer in test II on the Managed Languages team at Microsoft. He has seven years of experience as a participant, task author, jury member, and coach at various national and international programming contests. He worked in financial software development at Paladyne/Broadridge Financial Solutions Inc. and Deutsche Bank AG before moving to Microsoft. His main interests in programming are algorithms, parallelism and brainteasers.

Thanks to the following technical experts for reviewing this article: Vance Morrison and Lucian Wischik