September 2012

Volume 27 Number 09

Windows with C++ - The Pursuit of Efficient and Composable Asynchronous Systems

By Kenny Kerr | September 2012

The implementation of computer hardware heavily influenced the design of the C programming language to follow an imperative approach to computer programming. This approach describes a program as a sequence of statements that embody the program’s state. This was an intentional choice by C designer Dennis Ritchie. It allowed him to produce a viable alternative to assembly language. Ritchie also adopted a structured and procedural design, which has proven to be effective at improving the quality and maintainability of programs, leading to the creation of vastly more sophisticated and powerful system software.

A particular computer’s assembly language typically consists of the set of instructions supported by the processor. The programmer can refer to registers—literally small amounts of memory on the processor itself—as well as addresses in main memory. Assembly language will also contain some instructions for jumping to different locations in the program, providing a simplistic way to create reusable routines. In order to implement functions in C, a small amount of memory called the “stack” is reserved. For the most part, this stack, or call stack, stores information about each function that’s called so the program can automatically store state—both local and shared with its caller—and know where execution should resume once the function completes. This is such a fundamental part of computing today that most programmers don’t give it a second thought, yet it’s an incredibly important part of what makes it possible to write efficient and comprehensible programs. Consider the following code:

int sum(int a, int b) { return a + b; }
int main()
{
  int x = sum(3, 4);
  return sum(x, 5);
}

Given the assumption of sequential execution, it’s obvious—if not explicit—what the state of the program will be at any given point. These functions would be meaningless without first assuming there’s some automatic storage for function arguments and return values, as well as some way for the program to know where to resume execution when the function calls return. For C and C++ programmers, it’s the stack that makes this possible and allows us to write simple and efficient code. Unfortunately, it’s also our dependency on the stack that causes C and C++ programmers a world of hurt when it comes to asynchronous programming. Traditional systems programming languages such as C and C++ must adapt in order to remain competitive and productive in a world filled with increasingly asynchronous operations. Although I suspect C programmers will continue to rely on traditional techniques to accomplish concurrency for some time, I’m hopeful that C++ will evolve more quickly and provide a richer language with which to write efficient and composable asynchronous systems.

Last month I explored a simple technique that you can use today with any C or C++ compiler to implement lightweight cooperative multitasking by simulating coroutines with macros. Although adequate for the C programmer, it presents some challenges for the C++ programmer, who naturally and rightly relies on local variables among other constructs that break the abstraction. In this column, I’m going to explore one possible future direction for C++ to directly support asynchronous programming in a more natural and composable way.

Tasks and Stack Ripping

As I mentioned in my last column (msdn.microsoft.com/magazine/jj553509), concurrency doesn’t imply threaded programming. This conflation of two separate issues is prevalent enough to cause some confusion. Because the C++ language originally didn’t provide any explicit support for concurrency, programmers naturally used different techniques to achieve the same end. As programs became more complex, it became necessary—and perhaps obvious—to divide programs into logical tasks. Each task would be a sort of mini program with its own stack. Typically, an OS implements this with threads, and each thread is given its own stack. This allows tasks to run independently and often preemptively, depending on the scheduling policy and the availability of multiple processing cores. However, each task, or mini C++ program, is simple to write and can execute sequentially thanks to its stack isolation and the state the stack embodies. This one-thread-per-task approach has some obvious limitations, however. The per-thread overhead is prohibitive in many cases. Even if it were not so, the lack of cooperation between threads leads to much complexity due to the need to synchronize access to shared state or communicate between threads.
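Here’s a minimal sketch of the one-thread-per-task approach, using standard C++11 threads rather than a particular OS API (copy_task is just a placeholder name for illustration):

#include <thread>

void copy_task()
{
  // Ordinary sequential code; local variables live on this
  // thread's private stack.
}

int main()
{
  std::thread first(copy_task);  // one thread, and one stack, per task
  std::thread second(copy_task);
  first.join();
  second.join();
}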

Another approach that has gained much popularity is event-driven programming. It’s perhaps more evident that concurrency doesn’t imply threaded programming when you consider the many examples of event-driven programming in UI development and libraries relying on callback functions to implement a form of cooperative task management. But the limitations of this approach are at least as problematic as those for the one-thread-per-task approach. Immediately your clean, sequential program becomes a web—or, optimistically, a spaghetti stack—of callback functions instead of a cohesive sequence of statements and function calls. This is sometimes called stack ripping, because a routine that was previously a single function call is now ripped into two or more functions. This in turn also frequently leads to a ripple effect throughout a program. Stack ripping is disastrous if you care at all about complexity. Instead of one function, you now have at least two. Instead of relying on automatic storage of local variables on the stack, you must now explicitly manage the storage for this state, as it must survive between one stack location and another. Simple language constructs such as loops must be rewritten to accommodate this separation. Finally, debugging stack-ripped programs is much harder because the state of the program is no longer embodied in the stack and must often be manually “reassembled” in the programmer’s head. Consider the example of a simple flash storage driver for an embedded system from my last column, expressed with synchronous operations to provide obviously sequential execution:

#include <cstdint>
typedef std::uint32_t uint32; // fixed-width typedefs assumed by the driver interface
typedef std::uint8_t uint8;

void storage_read(void * buffer, uint32 size, uint32 offset);
void storage_write(void * buffer, uint32 size, uint32 offset);
int main()
{
  uint8 buffer[1024];
  storage_read(buffer, sizeof(buffer), 0);
  storage_write(buffer, sizeof(buffer), 1024);
}

It’s not hard to figure out what’s going on here. A 1KB buffer that’s backed by the stack is passed to the storage_read function, suspending the program until the data has been read into the buffer. This same buffer is then passed to the storage_write function, suspending the program until the transfer completes. At this point, the program returns safely, automatically reclaiming the stack space that was used for the copy operation. The obvious downside is that the program isn’t doing useful work while suspended, waiting for the I/O to complete.

In my last column I demonstrated a simple technique for implementing cooperative multitasking in C++ in a way that lets you return to a sequential style of programming. However, without the ability to use local variables, it’s somewhat limited. Although stack management remains automatic as far as function calls and returns go, the loss of automatic stack variables is a pretty severe limitation. Still, it beats full-blown stack ripping. Consider what the preceding code might look like using a traditional event-driven approach and you can plainly see stack ripping in action. First, the storage functions would need to be redeclared to accommodate some sort of event notification, commonly by means of a callback function:

typedef void (* storage_done)(void * context);
void storage_read(void * b, uint32 s, uint32 o, storage_done, void * context);
void storage_write(void * b, uint32 s, uint32 o, storage_done, void * context);

Next, the program itself would need to be rewritten to implement the appropriate event handlers:

void write_done(void *)
{
  // ... signal completion ...
}
void read_done(void * b)
{
  storage_write(b, 1024, 1024, write_done, nullptr);
}
int main()
{
  uint8 buffer[1024];
  storage_read(buffer, sizeof(buffer), 0, read_done, buffer);
  // ... wait for completion signal ...
}

This is clearly far more complex than the earlier synchronous approach, yet it’s very much the norm today among C and C++ programs. Notice how the copy operation that was originally confined to the main function is now spread over three functions. Not only that, but you almost need to reason about the program in reverse, as write_done must be declared before read_done, which in turn must be declared before the main function. Still, this program is somewhat simplistic, and you can imagine how much more cumbersome it would become as the chain of events grew in any real-world application.

C++11 has made some notable steps toward an elegant solution, but we’re not quite there yet. Although C++11 now has much to say about concurrency in the standard library, it’s still largely silent in the language itself. The libraries also don’t go far enough to let the programmer easily write complex, composable asynchronous programs. Nevertheless, great work has been done, and C++11 provides a good foundation for further refinements. First, I’m going to show you what C++11 offers, then what’s missing and, finally, a possible solution.

Closures and Lambda Expressions

In general terms, a closure is a function coupled with some state identifying any nonlocal information that the function needs in order to execute. Consider the TrySubmitThreadpoolCallback function I covered in my thread pool series last year (msdn.microsoft.com/magazine/hh335066):

void CALLBACK callback(PTP_CALLBACK_INSTANCE, void * state) { ... }
int main()
{
  void * state = ...
  TrySubmitThreadpoolCallback(callback, state, nullptr);
  ...
}

Notice how the Windows function accepts both a function and some state. This is in fact a closure in disguise; it certainly doesn’t look like your typical closure, but the functionality is the same. Arguably, function objects achieve the same end. Closures as a first-class concept rose to fame in the functional programming world, but C++11 has made strides to support the concept as well, in the form of lambda expressions:

#include <cstdio>
#include <functional>
using std::function;

void submit(function<void()> f) { f(); }
int main()
{
  int state = 123;
  submit([state]() { printf("%d\n", state); });
}

In this example there’s a simple submit function that we can pretend will cause the provided function object to execute in some other context. The function object is created from a lambda expression in the main function. This simple lambda expression includes the necessary attributes to qualify as a closure and the conciseness to be convincing. The [state] part indicates what state is to be “captured,” and the rest is effectively an anonymous function that has access to this state. You can plainly see that the compiler will create the moral equivalent of a function object to pull this off. Had the submit function been a template, the compiler might even have optimized away the function object itself, leading to performance gains in addition to the syntactic gains. The bigger question, however, is whether this is really a valid closure. Does the lambda expression really close the expression by binding the nonlocal variable? This example should clarify at least part of the puzzle:

int main()
{
  int state = 123;
  auto f = [state]() { printf("%d\n", state); };
  state = 0;
  submit(f);
}

This program prints “123” and not “0” because the state variable was captured by value rather than by reference. I can, of course, tell it to capture the variable by reference:

int main()
{
  int state = 123;
  auto f = [&]() { printf("%d\n", state); };
  state = 0;
  submit(f);
}

Here I’m specifying the default capture mode to capture variables by reference and letting the compiler figure out that I’m referring to the state variable. As is expected, the program now dutifully prints “0” rather than “123.” The problem, of course, is that the storage for the variable is still bound to the stack frame in which it was declared. If the submit function delays execution and the stack unwinds, then the state would be lost and your program would be incorrect.
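To make the failure mode concrete, here’s a minimal sketch with a hypothetical submit_later function (not part of the earlier example) that merely stores the function object and invokes it after the enclosing frame has already returned:

#include <cstdio>
#include <functional>
using std::function;

function<void()> deferred; // runs later, after the frame below is gone

void submit_later(function<void()> f) { deferred = f; }

void broken()
{
  int state = 123;
  submit_later([&]() { printf("%d\n", state); }); // captures state by reference
} // state's stack storage is reclaimed here

int main()
{
  broken();
  deferred(); // undefined behavior: the captured reference now dangles
}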

Dynamic languages such as JavaScript get around this problem by merging the imperative world of C with a functional style that relies far less on the stack, with each object essentially being an unordered associative container. C++11 provides the shared_ptr and make_shared templates, which provide efficient alternatives even if they’re not quite as concise. So, lambda expressions and smart pointers solve part of the problem by allowing closures to be defined in context and allowing state to be freed from the stack without too much syntactic overhead. It’s not ideal, but it’s a start.
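As a sketch of that idea, reusing the hypothetical submit_later from above, the state can be moved to the heap with make_shared and captured by value, so the closure owns it outright and no longer depends on the stack frame:

#include <cstdio>
#include <functional>
#include <memory>
using std::function;
using std::make_shared;

function<void()> deferred;

void submit_later(function<void()> f) { deferred = f; }

void safe()
{
  // The state now lives on the heap; copying the shared_ptr into
  // the closure keeps it alive independently of this frame.
  auto state = make_shared<int>(123);
  submit_later([state]() { printf("%d\n", *state); });
}

int main()
{
  safe();
  deferred(); // prints "123"; the state lives as long as the closure does
}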

Promises and Futures

At first glance, another C++11 feature called futures might appear to provide the answer. You can think of futures as enabling explicitly asynchronous function calls. Of course, the challenge is in defining what exactly that means and how it gets implemented. It’s easier to explain futures with an example. A future-enabled version of the original synchronous storage_read function might look like this:

// void storage_read(void * b, uint32 s, uint32 o);
future<void> storage_read(void * b, uint32 s, uint32 o);

Notice that the only difference is that the return type is wrapped in a future template. The idea is that the new storage_read function will begin or queue the transmission before returning a future object. This future can then be used as a synchronization object to wait for the operation to complete:

int main()
{
  uint8 buffer[1024];
  auto f = storage_read(buffer, sizeof(buffer), 0);
  ...
  f.wait();
  ...
}

This might be called the consumer end of the asynchronous equation. The provider end, which the storage_read function abstracts away, is equally simple. The storage_read function would need to create a promise, queue it along with the parameters of the request, and return the associated future. Again, this is easier to understand in code:

future<void> storage_read(void * b, uint32 s, uint32 o)
{
  promise<void> p;
  auto f = p.get_future();
  begin_transfer(move(p), b, s, o); // the driver takes ownership of the promise
  return f;
}

Once the operation completes, the storage driver can signal to the future that it’s ready:

p.set_value();

What value is this? Well, no value at all, because we’re using the promise and future specializations for void, but you can imagine a file system abstraction built on top of this storage driver that might include a file_read function. This function might need to be called without knowing the size of a particular file. It could then return the actual number of bytes transferred:

future<int> file_read(void * b, uint32 s, uint32 o);

In this scenario, a promise with type int would also be used, thus providing a channel through which to communicate the number of bytes actually transferred:

promise<int> p;
auto f = p.get_future();
...
p.set_value(123);
...
f.wait();
printf("bytes %d\n", f.get());

The future provides the get method through which the result may be obtained. Great, we have a way of waiting on the future, and all our problems are solved! Well, not so fast. Does this really solve our problem? Can we kick off multiple operations concurrently? Yes. Can we easily compose aggregate operations or even just wait on any or all outstanding operations? No. In the original synchronous example, the read operation necessarily completed before the write operation began. So futures do not in fact get us very far. The problem is that the act of waiting on a future is still a synchronous operation and there’s no standard way to compose a chain of events. There’s also no way to create an aggregate of futures. You might want to wait for not one but any number of futures. You might need to wait for all futures or just the first one that’s ready.
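To see the limitation, consider the copy operation rewritten against the future-returning storage functions (assuming a future-returning storage_write to match storage_read): each wait still blocks the calling thread, and there’s no way to express that the write should simply follow the read:

int main()
{
  uint8 buffer[1024];
  auto fr = storage_read(buffer, sizeof(buffer), 0);
  fr.wait(); // blocks this thread until the read completes
  auto fw = storage_write(buffer, sizeof(buffer), 1024);
  fw.wait(); // blocks again; "what happens next" can only be expressed by blocking
}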

Futures in the Future

The problem with futures and promises is that they don’t go far enough and arguably are completely flawed. Methods such as wait and get, both of which block until the result is ready, are antithetical to concurrency and asynchronous programming. Instead of get we need something such as try_get that will attempt to retrieve the result if it’s available, but return immediately, regardless:

int bytes;
if (f.try_get(bytes))
{
  printf("bytes %d\n", bytes);
}

Going further, futures should provide a continuation mechanism so we can simply associate a lambda expression with the completion of the asynchronous operation. This is when we start to see the composability of futures:

int main()
{
  uint8 buffer[1024];
  auto fr = storage_read(buffer, sizeof(buffer), 0);
  auto fw = fr.then([&]()
  {
    return storage_write(buffer, sizeof(buffer), 1024);
  });
  ...
}

The storage_read function returns the read future (fr), and a lambda expression is used to construct a continuation of this future using its then method, resulting in a write future (fw). Because futures are always returned, you might prefer a more implicit but equivalent style:

auto f = storage_read(buffer, sizeof(buffer), 0).then([&]()
{
  return storage_write(buffer, sizeof(buffer), 1024);
});

In this case there’s only a single explicit future representing the culmination of all the operations. This might be called sequential composition, but parallel AND and OR composition would also be essential for most nontrivial systems (think WaitForMultipleObjects). In this case we would need a pair of wait_any and wait_all variadic functions. Again, these would return futures, allowing us to provide a lambda expression as a continuation of the aggregate using the then method as before. It might also be useful to pass the completed future to the continuation in cases where the specific future that completed isn’t apparent.
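Here’s a hypothetical sketch of AND composition in that style, assuming a variadic wait_all that itself returns a future so a continuation can be chained with then:

int main()
{
  uint8 b1[1024];
  uint8 b2[1024];
  auto f1 = storage_read(b1, sizeof(b1), 0);
  auto f2 = storage_read(b2, sizeof(b2), 1024);
  // Hypothetical: the aggregate future completes only once both reads have,
  // so the write is chained as a continuation rather than a blocking wait.
  auto f = wait_all(f1, f2).then([&]()
  {
    return storage_write(b1, sizeof(b1), 2048);
  });
}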

For a more exhaustive look at the future of futures, including the essential topic of cancellation, please see Artur Laksberg and Niklas Gustafsson’s paper, “A Standard Programmatic Interface for Asynchronous Operations,” at bit.ly/MEgzhn.

Stay tuned for the next installment, where I’ll dig deeper into the future of futures and show you an even more fluid approach to writing efficient and composable asynchronous systems.


Kenny Kerr is a software craftsman with a passion for native Windows development. Reach him at kennykerr.ca.

Thanks to the following technical expert for reviewing this article: Artur Laksberg