November 2012

Volume 27 Number 11

Windows with C++ - The Evolution of Synchronization in Windows and C++

By Kenny Kerr | November 2012

When I first started writing concurrent software, C++ had no support for synchronization. Windows itself had only a handful of synchronization primitives, all of which were implemented in the kernel. I tended to use a critical section unless I needed to synchronize across processes, in which case I used a mutex. In general terms, both of these are locks, or lock objects.

The mutex takes its name from the concept of “mutual exclusion,” a fundamental form of synchronization. It refers to the guarantee that only one thread can access some resource at a time. The critical section takes its name from the actual section of code that might be accessing such a resource. To ensure correctness, only one thread can execute this critical section of code at a time. These two lock objects have different capabilities, but it’s helpful to remember that they’re just locks: both provide mutual exclusion guarantees, and both can be used to demarcate critical sections of code.

Today the synchronization landscape has changed dramatically, and the C++ programmer has a plethora of choices. Windows now supports many more synchronization functions, and C++ itself at long last provides an interesting collection of concurrency and synchronization capabilities for those using a compiler that supports the C++11 standard.

In this month’s column I’m going to explore the state of synchronization in Windows and C++. I’ll start with a review of the synchronization primitives provided by Windows itself and then consider the alternatives provided by the Standard C++ Library. If portability is your main concern, then the new C++ library additions will be very appealing. If, however, portability is less of a concern and performance is paramount, then getting familiar with what Windows now offers will be important. Let’s dive right in.

Critical Section

First up is the critical section object. This lock is used heavily by countless applications but has a sordid history. When I first started using critical sections, they were really simple. To create such a lock, all you needed was to allocate a CRITICAL_SECTION structure and call the InitializeCriticalSection function to prepare it for use. This function doesn’t return a value, implying that it can’t fail. Back in those days, however, it was necessary for this function to create various system resources, notably a kernel event object, and it was possible that in extremely low-memory situations this would fail, resulting in a structured exception being raised. Still, this was rather rare, so most developers ignored this possibility.

With the popularity of COM, the use of critical sections skyrocketed because many COM classes used critical sections for synchronization, but in many cases there was little to no actual contention to speak of. When multiprocessor computers became more widespread, the critical section’s internal event saw even less usage because the critical section would briefly spin in user mode while waiting to acquire the lock. A small spin count meant that many short-lived periods of contention could avoid a kernel transition, greatly improving performance.

Around this time some kernel developers realized that they could dramatically improve the scalability of Windows if they deferred the creation of critical section event objects until there was enough contention to necessitate their presence. This seemed like a good idea until the developers realized that although InitializeCriticalSection now couldn’t possibly fail, the EnterCriticalSection function (used to wait for lock ownership) was no longer reliable. This couldn’t be overlooked as easily, because it introduced a variety of failure conditions that would’ve made critical sections all but impossible to use correctly and would’ve broken countless applications. Still, the scalability wins couldn’t be ignored.

A kernel developer finally arrived at a solution in the form of a new, and undocumented, kernel event object called a keyed event. You can read a little about it in the book, “Windows Internals,” by Mark E. Russinovich, David A. Solomon and Alex Ionescu (Microsoft Press, 2012), but basically, instead of requiring an event object for every critical section, a single keyed event could be used for all critical sections in the system. This works because a keyed event object is just that: It relies on a key, which is just a pointer-sized identifier that’s naturally address-space local.

There was surely a temptation to update critical sections to use keyed events exclusively, but because many debuggers and other tools rely on the internals of critical sections, the keyed event was only used as a last resort if the kernel failed to allocate a regular event object.

This may sound like a lot of irrelevant history but for the fact that the performance of keyed events was significantly improved during the Windows Vista development cycle, and this led to the introduction of a completely new lock object that was both simpler and faster—but more on that in a minute.

Because the critical section object is now immune to failure under low-memory conditions, it really is very straightforward to use. Figure 1 provides a simple wrapper.

Figure 1 The Critical Section Lock

class lock
{
  CRITICAL_SECTION h;
  // Copying is disabled by declaring, but not defining, these members
  // (the pre-C++11 noncopyable idiom).
  lock(lock const &);
  lock const & operator=(lock const &);
public:
  lock()
  {
    InitializeCriticalSection(&h);
  }
  ~lock()
  {
    DeleteCriticalSection(&h);
  }
  void enter()
  {
    EnterCriticalSection(&h);
  }
  bool try_enter()
  {
    return 0 != TryEnterCriticalSection(&h);
  }
  void exit()
  {
    LeaveCriticalSection(&h);
  }
  CRITICAL_SECTION * handle()
  {
    return &h;
  }
};

The EnterCriticalSection function I already mentioned is complemented by a TryEnterCriticalSection function that provides a nonblocking alternative. The LeaveCriticalSection function releases the lock, and DeleteCriticalSection releases any kernel resources that might have been allocated along the way.

So the critical section is a reasonable choice. It performs quite well as it attempts to avoid kernel transitions and resource allocation. Still, it has a bit of baggage that it must carry due to its history and application compatibility.
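
As a quick illustration of the wrapper in Figure 1, here’s a minimal sketch that guards a shared counter (the counter and function are hypothetical):

lock counter_lock;
int counter = 0; // shared state guarded by counter_lock

void increment()
{
  counter_lock.enter();
  ++counter; // the critical section: only one thread at a time
  counter_lock.exit();
}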

Mutex

The mutex object is a true kernel-synchronization object. Unlike critical sections, a mutex lock always consumes kernel-allocated resources. The benefit, of course, is that the kernel is then able to provide cross-process synchronization due to its awareness of the lock. As a kernel object, it provides the usual attributes—such as a name—that can be used to open the object from other processes or just to identify the lock in a debugger. You can also specify an access mask to restrict access to the object. As an intraprocess lock, it’s overkill, a little more complicated to use and a lot slower. Figure 2 provides a simple wrapper for an unnamed mutex that’s effectively process-local.

Figure 2 The Mutex Lock

#ifdef _DEBUG
  #include <crtdbg.h>
  #define ASSERT(expression) _ASSERTE(expression)
  #define VERIFY(expression) ASSERT(expression)
  #define VERIFY_(expected, expression) ASSERT(expected == expression)
#else
  #define ASSERT(expression) ((void)0)
  #define VERIFY(expression) (expression)
  #define VERIFY_(expected, expression) (expression)
#endif
class lock
{
  HANDLE h;
  lock(lock const &);
  lock const & operator=(lock const &);
public:
  lock() :
    h(CreateMutex(nullptr, false, nullptr))
  {
    ASSERT(h);
  }
  ~lock()
  {
    VERIFY(CloseHandle(h));
  }
  void enter()
  {
    VERIFY_(WAIT_OBJECT_0, WaitForSingleObject(h, INFINITE));
  }
  bool try_enter()
  {
    return WAIT_OBJECT_0 == WaitForSingleObject(h, 0);
  }
  void exit()
  {
    VERIFY(ReleaseMutex(h));
  }
  HANDLE handle()
  {
    return h;
  }
};

The CreateMutex function creates the lock, and the common CloseHandle function closes the process handle, which effectively decrements the lock’s reference count in the kernel. Waiting for lock ownership is accomplished with the general-purpose WaitForSingleObject function, which checks and optionally waits for the signaled state of a variety of kernel objects. Its second parameter indicates how long the calling thread should block while waiting to acquire the lock. The INFINITE constant is—not surprisingly—an indefinite wait, while a value of zero prevents the thread from waiting at all and will only acquire the lock if it’s free. Finally, the ReleaseMutex function releases the lock.
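
The wrapper in Figure 2 creates an unnamed, process-local mutex. To illustrate the cross-process capability mentioned earlier, here’s a minimal sketch that supplies a name (the name itself is hypothetical):

// The first process to call this creates the mutex; later callers
// open the existing kernel object instead.
HANDLE h = CreateMutexW(nullptr,
                        FALSE, // don't acquire ownership on creation
                        L"MyCompany.MyApp.Lock");
ASSERT(h);

if (ERROR_ALREADY_EXISTS == GetLastError())
{
  // Another process created the mutex first.
}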

The mutex lock is a big hammer with a lot of power, but it comes at a cost to performance and complexity. The wrapper in Figure 2 is littered with assertions to indicate the possible ways that it can fail, but it’s the performance implications that disqualify the mutex lock in most cases.

Event

Before I talk about a high-performance lock, I need to introduce one more kernel-synchronization object, one to which I alluded already. Although not actually a lock, in that it doesn’t provide a facility to directly implement mutual exclusion, the event object is vitally important for the coordination of work among threads. In fact, it’s the same object used internally by the critical section lock—and besides, it comes in handy when implementing all kinds of concurrency patterns in an efficient and scalable way.

The CreateEvent function creates the event and—like the mutex—the CloseHandle function closes the handle, releasing the object in the kernel. Because it isn’t actually a lock, it has no acquire/release semantics. Rather, it’s the very embodiment of the signaling functionality provided by many of the kernel objects. To understand how the signaling works, you need to appreciate that an event object can be created in one of two states. If you pass TRUE to CreateEvent’s second parameter, then the resulting event object is said to be a manual-reset event; otherwise an auto-reset event is created.

A manual-reset event requires that you manually set and reset the object’s signaled state. The SetEvent and ResetEvent functions are provided for this purpose. An auto-reset event automatically resets (changes from signaled to nonsignaled) when a waiting thread is released. So an auto-reset event is useful when one thread needs to coordinate with one other thread, whereas a manual-reset event is useful when one thread needs to coordinate with any number of threads. Calling SetEvent on an auto-reset event will release at most one thread, whereas with a manual-reset event, that call will release all waiting threads. Like the mutex, waiting for an event to become signaled is achieved with the WaitForSingleObject function. Figure 3 provides a simple wrapper for an unnamed event that can be constructed in either mode.

Figure 3 The Event Signal

class event
{
  HANDLE h;
  event(event const &);
  event const & operator=(event const &);
public:
  explicit event(bool manual = false) :
    h(CreateEvent(nullptr, manual, false, nullptr))
  {
    ASSERT(h);
  }
  ~event()
  {
    VERIFY(CloseHandle(h));
  }
  void set()
  {
    VERIFY(SetEvent(h));
  }
  void clear()
  {
    VERIFY(ResetEvent(h));
  }
  void wait()
  {
    VERIFY_(WAIT_OBJECT_0, WaitForSingleObject(h, INFINITE));
  }
};
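
To make the difference between the two modes concrete, here’s a minimal sketch using the wrapper in Figure 3 (the waiting functions are hypothetical):

event go(true); // manual-reset: stays signaled until explicitly cleared
event token;    // auto-reset: resets as soon as one waiter is released

// Any number of threads may block here; a single go.set() releases them all.
void wait_for_start()
{
  go.wait();
}

// Each token.set() releases at most one of the threads blocked here.
void wait_for_turn()
{
  token.wait();
}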

Slim Reader/Writer Lock

The name of the Slim Reader/Writer (SRW) lock might be a mouthful, but the operative word is “slim.” Programmers might overlook this lock because of its ability to distinguish between shared readers and exclusive writers, perhaps thinking that this is overkill when all they need is a critical section. As it turns out, this is the simplest lock to deal with and also by far the fastest, and you certainly don’t need to have shared readers in order to use it. It has this speedy reputation not only because it relies on the efficient keyed event object, but also because it’s mostly implemented in user mode and only falls back to the kernel if contention is such that the thread would be better off sleeping. Again, the critical section and mutex objects provide additional features you might require, such as recursive or inter-process locks, but more often than not, all you need is a fast and lightweight lock for internal use.

This lock relies exclusively on the keyed events I mentioned before, and as such it’s extremely lightweight despite providing a great deal of functionality. The SRW lock requires only a pointer-sized amount of storage, which is allocated by the calling process rather than the kernel. For this reason, the initialization function, InitializeSRWLock, can’t fail and merely ensures the lock contains the appropriate bit pattern before being used.

Waiting for lock ownership is achieved using either the AcquireSRWLockExclusive function for a so-called writer lock or the AcquireSRWLockShared function for a reader lock, although the exclusive and shared terminology is more apt. There are corresponding release and try-acquire functions, as one would expect, for both exclusive and shared modes. Figure 4 provides a simple wrapper for an exclusive-mode SRW lock; a sketch of the shared-mode additions follows the figure. Notice, however, that there’s no destructor because there are no resources to free.

Figure 4 The SRW Lock

class lock
{
  SRWLOCK h;
  lock(lock const &);
  lock const & operator=(lock const &);
public:
  lock()
  {
    InitializeSRWLock(&h);
  }
  void enter()
  {
    AcquireSRWLockExclusive(&h);
  }
  bool try_enter()
  {
    return 0 != TryAcquireSRWLockExclusive(&h);
  }
  void exit()
  {
    ReleaseSRWLockExclusive(&h);
  }
  SRWLOCK * handle()
  {
    return &h;
  }
};
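
As promised, here’s a sketch of what the shared-mode additions to the wrapper might look like:

void enter_shared()
{
  AcquireSRWLockShared(&h);
}
bool try_enter_shared()
{
  return 0 != TryAcquireSRWLockShared(&h);
}
void exit_shared()
{
  ReleaseSRWLockShared(&h);
}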

Condition Variable

The final synchronization object I need to introduce is the condition variable. This is perhaps the one with which most programmers will be unfamiliar. I have, however, noticed a renewed interest in condition variables in recent months. This might have something to do with C++11, but the idea isn’t new, and support for this concept has been around for some time on Windows. Indeed, the Microsoft .NET Framework has supported the condition variable pattern since the very first release, although it was merged into the Monitor class, limiting its usefulness in some ways. But this renewed interest is also thanks to the amazing keyed events, which allowed condition variables to be introduced in Windows Vista, and they have only improved since. Although a condition variable is merely a concurrency pattern and, thus, can be implemented with other primitives, its inclusion in the OS means that it can achieve amazing performance and frees the programmer from having to ensure the correctness of such code. Indeed, some concurrency patterns are almost impossible to implement correctly using synchronization primitives alone, without help from the OS itself.

The condition variable pattern is quite common if you think about it. A program needs to wait for some condition to be met before it can proceed. Evaluating this condition involves acquiring a lock to evaluate some shared state. If, however, the condition hasn’t yet been met, then the lock must be released to allow some other thread to fulfill the condition. The evaluating thread must then wait until such time as the condition is met before once again acquiring the lock. Once the lock is reacquired, the condition must again be evaluated to avoid the obvious race condition. Implementing this is harder than it seems because there are, in fact, other pitfalls to worry about—and implementing it in an efficient way is harder still. The following pseudocode illustrates the problem:

lock-enter
while (!condition-eval)
{
  lock-exit
  condition-wait
  lock-enter
}
// Do interesting stuff here
lock-exit

But even in this illustration there lies a subtle bug. To function correctly, the thread must begin waiting on the condition before the lock is exited, yet simply reordering the two steps wouldn’t work because the lock would then never be released. The ability to atomically release one object and wait on another is so critical that Windows provides the SignalObjectAndWait function to do so for certain kernel objects. But because the SRW lock lives mostly in user mode, a different solution is needed. Enter condition variables.

Like the SRW lock, the condition variable occupies only a single pointer-sized amount of storage and is initialized with the fail-safe InitializeConditionVariable function. As with SRW locks, there are no resources to free, so when the condition variable is no longer required the memory can simply be reclaimed.

Because the condition itself is program-specific, it’s left to the caller to write the pattern as a while loop whose body is a single call to the SleepConditionVariableSRW function. This function atomically releases the SRW lock and waits to be woken when the condition might be met. There’s also a corresponding SleepConditionVariableCS function if you wish to use condition variables with a critical section lock instead.

The WakeConditionVariable function is called to wake a single waiting, or sleeping, thread. The woken thread will reacquire the lock before returning. Alternatively, the WakeAllConditionVariable function can be used to wake all waiting threads. Figure 5 provides a simple wrapper with the necessary while loop. Note that it’s possible for the sleeping thread to be woken spuriously, and the while loop ensures that the condition is always rechecked after the thread reacquires the lock. It’s also important to note that the predicate is always evaluated while holding the lock.

Figure 5 The Condition Variable

class condition_variable
{
  CONDITION_VARIABLE h;
  condition_variable(condition_variable const &);
  condition_variable const & operator=(condition_variable const &);
public:
  condition_variable()
  {
    InitializeConditionVariable(&h);
  }
  template <typename T>
  void wait_while(lock & x, T predicate)
  {
    while (predicate())
    {
      VERIFY(SleepConditionVariableSRW(&h, x.handle(), INFINITE, 0));
    }
  }
  void wake_one()
  {
    WakeConditionVariable(&h);
  }
  void wake_all()
  {
    WakeAllConditionVariable(&h);
  }
};

Blocking Queue

To give this some shape, I’ll use a blocking queue as an example. Let me stress that I do not recommend blocking queues in general. You may be better served using an I/O completion port or the Windows thread pool, which is just an abstraction over the former, or even the Concurrency Runtime’s concurrent_queue class. Anything nonblocking is generally preferred. Still, a blocking queue is a simple concept to grasp and something many developers seem to find useful. Admittedly, not every program needs to scale, but every program needs to be correct. A blocking queue also provides ample opportunity to employ synchronization for correctness—and, of course, ample opportunity to get it wrong.

Consider implementing a blocking queue with just a lock and an event. The lock protects the shared queue, and the event signals to a consumer that a producer has pushed something onto the queue. Figure 6 provides a simple example using an auto-reset event. I used this event mode because the push method queues only a single element and, thus, I only want one consumer to be woken to pop it off the queue. The push method acquires the lock, queues the element and then signals the event to wake up any waiting consumer. The pop method acquires the lock and then waits until the queue isn’t empty before dequeuing an element and returning it. Both methods use a lock_block class, which simply calls the lock’s enter method in its constructor and the exit method in its destructor.
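
Here’s a minimal sketch of that helper, following the conventions of the other wrappers:

class lock_block
{
  lock & m_lock;
  lock_block(lock_block const &);
  lock_block const & operator=(lock_block const &);
public:
  explicit lock_block(lock & l) : m_lock(l)
  {
    m_lock.enter(); // acquire on construction
  }
  ~lock_block()
  {
    m_lock.exit(); // release on destruction
  }
};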

Figure 6 Auto-Reset Blocking Queue

template <typename T>
class blocking_queue
{
  std::deque<T> q;
  lock x;
  event e;
  blocking_queue(blocking_queue const &);
  blocking_queue const & operator=(blocking_queue const &);
public:
  blocking_queue()
  {
  }
  void push(T const & value)
  {
    lock_block block(x);
    q.push_back(value);
    e.set();
  }
  T pop()
  {
    lock_block block(x);
    while (q.empty())
    {
      x.exit(); e.wait(); // Bug!
      x.enter();
    }
    T v = q.front();
    q.pop_front();
    return v;
  }
};

However, notice the likely deadlock because the exit and wait calls aren’t atomic. If the lock were a mutex, I could use the SignalObjectAndWait function, but the performance of the blocking queue would suffer.
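
To illustrate, here’s a sketch of what pop’s wait loop might look like with SignalObjectAndWait, assuming the mutex wrapper from Figure 2 and an event wrapper extended with a hypothetical handle method returning its HANDLE:

while (q.empty())
{
  // Atomically release the mutex and begin waiting on the event,
  // closing the window between the exit and wait calls.
  VERIFY_(WAIT_OBJECT_0,
    SignalObjectAndWait(x.handle(), e.handle(), INFINITE, FALSE));
  x.enter(); // reacquire the lock before re-evaluating the condition
}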

Another option is to use a manual-reset event. Rather than signaling whenever an element is queued, simply define two states. The event can be signaled for as long as there are elements in the queue and nonsignaled when it’s empty. This will also perform a lot better because there are fewer calls into the kernel to signal the event. Figure 7 provides an example of this. Notice how the push method sets the event if the queue has one element. This avoids unnecessary calls to the SetEvent function. The pop method dutifully clears the event if it finds the queue empty. As long as there are multiple queued elements, any number of consumers can pop elements off the queue without involving the event object, thus improving scalability.

Figure 7 Manual-Reset Blocking Queue

template <typename T>
class blocking_queue
{
  std::deque<T> q;
  lock x;
  event e;
  blocking_queue(blocking_queue const &);
  blocking_queue const & operator=(blocking_queue const &);
public:
  blocking_queue() :
    e(true) // manual
  {
  }
  void push(T const & value)
  {
    lock_block block(x);
    q.push_back(value);
    if (1 == q.size())
    {
      e.set();
    }
  }
  T pop()
  {
    lock_block block(x);
    while (q.empty())
    {
      x.exit();
      e.wait();
      x.enter();
    }
    T v = q.front();
    q.pop_front();
    if (q.empty())
    {
      e.clear();
    }
    return v;
  }
};

In this case there’s no potential deadlock in the exit-wait-enter sequence because another consumer can’t steal the event, given that it’s a manual-reset event. It’s hard to beat that in terms of performance. Nevertheless, an alternative (and perhaps more natural) solution is to use a condition variable instead of an event. This is easily done with the condition_variable class in Figure 5 and is similar to the manual-reset blocking queue, although it’s a bit simpler. Figure 8 provides an example. Notice how the semantics and concurrency intentions become clearer as the higher-level synchronization objects are employed. This clarity helps to avoid concurrency bugs that often plague more-obscure code.

Figure 8 Condition Variable Blocking Queue

template <typename T>
class blocking_queue
{
  std::deque<T> q;
  lock x;
  condition_variable cv;
  blocking_queue(blocking_queue const &);
  blocking_queue const & operator=(blocking_queue const &);
public:
  blocking_queue()
  {
  }
  void push(T const & value)
  {
    lock_block block(x);
    q.push_back(value);
    cv.wake_one();
  }
  T pop()
  {
    lock_block block(x);
    cv.wait_while(x, [&]()
    {
      return q.empty();
    });
    T v = q.front();
    q.pop_front();
    return v;
  }
};

Finally, I should mention that C++11 now provides a lock, called a mutex, as well as a condition_variable. The C++11 mutex has nothing to do with the Windows mutex. Likewise, the C++11 condition_variable isn’t based on the Windows condition variable. This is good news in terms of portability: these types can be used anywhere you can find a conforming C++ compiler. On the other hand, the C++11 implementation in the Visual C++ 2012 release performs quite poorly compared to the Windows SRW lock and condition variable. Figure 9 provides an example of a blocking queue implemented with the C++11 Standard Library types.

Figure 9 C++11 Blocking Queue

template <typename T>
class blocking_queue
{
  std::deque<T> q;
  std::mutex x;
  std::condition_variable cv;
  blocking_queue(blocking_queue const &);
  blocking_queue const & operator=(blocking_queue const &);
public:
  blocking_queue()
  {
  }
  void push(T const & value)
  {
    std::lock_guard<std::mutex> lock(x);
    q.push_back(value);
    cv.notify_one();
  }
  T pop()
  {
    std::unique_lock<std::mutex> lock(x);
    cv.wait(lock, [&]()
    {
      return !q.empty();
    });
    T v = q.front();
    q.pop_front();
    return v;
  }
};
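
As a quick usage sketch, assuming the class in Figure 9 along with the <deque>, <mutex> and <condition_variable> headers it relies on:

#include <cstdio>
#include <thread>

int main()
{
  blocking_queue<int> q;

  std::thread producer([&]
  {
    for (int i = 0; i != 10; ++i)
    {
      q.push(i);
    }
  });

  std::thread consumer([&]
  {
    for (int i = 0; i != 10; ++i)
    {
      printf("%d\n", q.pop());
    }
  });

  producer.join();
  consumer.join();
}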

The Standard C++ Library implementation will undoubtedly improve in time, as will the library’s support for concurrency in general. The C++ committee has taken some small, conservative steps toward concurrency support that should be recognized, but the work is not yet done. As I discussed in my last three columns, the future of C++ concurrency is still in question. For now, the combination of some excellent synchronization primitives in Windows and a state-of-the-art C++ compiler makes for a compelling toolkit for producing lightweight and scalable concurrency-safe programs.


Kenny Kerr is a software craftsman with a passion for native Windows development. Reach him at kennykerr.ca.

Thanks to the following technical expert for reviewing this article: Mohamed Ameen Ibrahim