Concurrent Affairs

Build a Richer Thread Synchronization Lock

Jeffrey Richter

Code download available at: ConcurrentAffairs0603.exe (115 KB)

Contents

What Is an Optex?
Synchronization Fairness
Avoid Convoys: Building a Better Optex
Critical Regions

In my last column, I showed the various thread synchronization mechanisms employed by the Microsoft® .NET Framework (see Concurrent Affairs: Performance-Conscious Thread Synchronization). I examined the performance characteristics of these mechanisms and determined that the Interlocked methods performed best because the calling thread never has to transition to kernel mode. I then looked at how to build some simple locks using the Interlocked methods.

While the Interlocked methods are fast, they are, unfortunately, very limited—they can only manipulate 32-bit and 64-bit values atomically. Frequently, an object may have several fields in it that you need to modify as a single atomic unit. While you cannot use the Interlocked methods to modify multiple fields as a single atomic unit, you can leverage the Interlocked methods to build a more sophisticated thread synchronization mechanism. Here, I'll build a thread synchronization lock, which I call an Optex (optimized mutual exclusion lock).

What Is an Optex?

An Optex works much like a Win32 CRITICAL_SECTION or the common language runtime (CLR) System.Threading.Monitor class. The lock grants one thread at a time access to the resource that is logically guarded by the Optex. However, unlike a CRITICAL_SECTION and the Monitor class, my Optex lock doesn't support recursion, so a thread cannot acquire the lock multiple times without releasing the lock. This lack of recursion, however, does allow one thread to acquire the lock and a different thread to release the lock—a useful feature in some scenarios.

As you'll see, my Optex lock executes really fast when there is no contention on the lock. In fact, when there is no contention, the thread attempting to acquire the lock will execute some quick field assignments and return; the thread will execute entirely in user mode. The thread will not transition into kernel mode and it will not wait on a kernel object. On the other hand, when there is contention for the lock, the lock becomes much slower because threads will transition to kernel mode and they will wait on a semaphore kernel object. Also, this lock will be pretty equitable, waking threads that wait on the lock using a FIFO algorithm. However, this fair and equitable approach is not guaranteed. Figure 1 shows the code for my Optex lock, and Figure 2 demonstrates how to define a class that uses an Optex object in order to allow only one thread at a time access to any instance of the SomeResource class.

Figure 2 Define Class That Uses Optex

public class SomeResource {
    private Optex m_optex = new Optex();

    public void AccessResource() {
        try {
            m_optex.Enter();
            // Access the resource...
        }
        finally {
            m_optex.Exit();
        }
    }
}

Figure 1 Optex Lock

public sealed class Optex : IDisposable {
    private Int32 m_Waiters = 0;
    private Semaphore m_WaiterLock = new Semaphore(0, Int32.MaxValue);

    public Optex() { }

    public void Dispose() {
        if (m_WaiterLock != null) {
            m_WaiterLock.Close();
            m_WaiterLock = null;
        }
    }

    public void Enter() {
        Thread.BeginCriticalRegion();

        // Add ourself to the set of threads interested in the Optex
        if (Interlocked.Increment(ref m_Waiters) == 1) {
            // If we were the first thread to show interest, we got it.
            return;
        }

        // Another thread has the Optex, we need to wait for it
        m_WaiterLock.WaitOne();
        // When WaitOne returns, this thread now has the Optex
    }

    public void Exit() {
        // Subtract ourself from the set of threads interested in the Optex
        if (Interlocked.Decrement(ref m_Waiters) > 0) {
            // Other threads are waiting, wake 1 of them
            m_WaiterLock.Release(1);
        }
        Thread.EndCriticalRegion();
    }
}

As you can see, an Optex has just two data fields: m_Waiters and m_WaiterLock. The m_Waiters field is initialized to zero, indicating that no threads own the lock.

When a thread comes along and calls Enter, Interlocked.Increment is used to add 1 to the m_Waiters field. If m_Waiters is incremented to 1, then it was 0 before. Therefore, the calling thread can own the lock and Enter returns immediately. In this case Enter executes very quickly. On the other hand, if m_Waiters is incremented to a value greater than 1, then other threads are trying to enter or have entered the Optex, and the Optex waits on the semaphore. In this case, Enter is slow because the thread transitions to kernel mode. But this isn't really a problem because the thread needs to wait anyway until the lock is available.

Optex's Exit method is fairly straightforward. First, a thread never sleeps or enters a wait state when releasing a lock. When Exit is called, it subtracts one from m_Waiters. If the result becomes zero, then no other threads are waiting on the Optex and Exit simply returns; here performance is very good. If, however, decrementing m_Waiters results in a value that is greater than zero, then at least one other thread is waiting for the lock via the semaphore. In this case, the Exit method releases one of these waiting threads by signaling the semaphore and then returns immediately. Here performance is good but not great because the thread that's calling Exit must transition from user mode to kernel mode and back to signal the semaphore.

The Optex class is very simple and yet quite effective. It would be easy to add additional features to Optex if you so desired. For example, you could add a TryEnter method which would allow a thread to do something else if the lock is not available. TryEnter could even accept a timeout value instead of just assuming zero. It would also be possible to modify the Enter method: if contention is detected, the thread could spin in user mode several times trying to get the lock before transitioning to kernel mode to wait on the semaphore. This could improve performance if a thread happens to call Enter at just the wrong time.
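To make the TryEnter idea concrete, here is a minimal sketch of how it might be added to the Figure 1 lock. The class name OptexWithTryEnter and the TryEnter method are my own illustrative assumptions, not part of the code download. The trick is to use Interlocked.CompareExchange rather than Interlocked.Increment so that the caller claims the lock only when no thread owns it or waits on it, and otherwise backs off without disturbing the waiter count:

```csharp
using System;
using System.Threading;

// Hypothetical extension of the Figure 1 Optex; names are illustrative
public sealed class OptexWithTryEnter : IDisposable {
    private Int32 m_Waiters = 0;
    private Semaphore m_WaiterLock = new Semaphore(0, Int32.MaxValue);

    public void Dispose() {
        if (m_WaiterLock != null) { m_WaiterLock.Close(); m_WaiterLock = null; }
    }

    // Never blocks: claims the lock only if no thread currently owns it
    // or waits on it (transitions m_Waiters from 0 to 1)
    public Boolean TryEnter() {
        Thread.BeginCriticalRegion();
        if (Interlocked.CompareExchange(ref m_Waiters, 1, 0) == 0)
            return true;    // Lock was free; this thread now owns it
        Thread.EndCriticalRegion();
        return false;       // Lock was owned; caller can do other work
    }

    // Enter and Exit are identical to Figure 1
    public void Enter() {
        Thread.BeginCriticalRegion();
        if (Interlocked.Increment(ref m_Waiters) == 1) return;
        m_WaiterLock.WaitOne();
    }

    public void Exit() {
        if (Interlocked.Decrement(ref m_Waiters) > 0)
            m_WaiterLock.Release(1);
        Thread.EndCriticalRegion();
    }
}
```

A timeout-accepting TryEnter would be subtler: a thread that increments m_Waiters and then times out of its semaphore wait must carefully undo its effect on the count, so I have left that variant out of this sketch.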

In addition, you could easily add recursion support so that a single thread could acquire the lock multiple times without deadlocking, making the lock behave exactly like a Win32 CRITICAL_SECTION or the CLR Monitor class. The possibilities are endless.
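One way to layer recursion on top of the Figure 1 design is sketched below; the class RecursiveOptex and its member names are my invention for illustration. The owner records its managed thread ID after acquiring the underlying lock; a thread that already owns the lock just bumps a recursion count. Note the unsynchronized read of m_OwningThreadId is safe here only because a non-owner can never observe its own ID in that field:

```csharp
using System;
using System.Threading;

// Hypothetical recursion-aware variant; names are illustrative,
// with the non-recursive Figure 1 lock inlined to keep it self-contained
public sealed class RecursiveOptex : IDisposable {
    private Int32 m_Waiters = 0;
    private Semaphore m_WaiterLock = new Semaphore(0, Int32.MaxValue);

    // Recursion state: only ever written by the thread that owns the lock
    private Int32 m_OwningThreadId = 0;   // 0 = no owner
    private Int32 m_RecursionCount = 0;

    public void Dispose() {
        if (m_WaiterLock != null) { m_WaiterLock.Close(); m_WaiterLock = null; }
    }

    public void Enter() {
        Int32 currentThreadId = Thread.CurrentThread.ManagedThreadId;

        // If this thread already owns the lock, just bump the count
        if (m_OwningThreadId == currentThreadId) { m_RecursionCount++; return; }

        Thread.BeginCriticalRegion();
        if (Interlocked.Increment(ref m_Waiters) > 1)
            m_WaiterLock.WaitOne();   // Another thread owns it; wait

        m_OwningThreadId = currentThreadId;
        m_RecursionCount = 1;
    }

    public void Exit() {
        if (m_OwningThreadId != Thread.CurrentThread.ManagedThreadId)
            throw new InvalidOperationException("Lock not owned by this thread");

        if (--m_RecursionCount > 0) return;   // Still held recursively

        m_OwningThreadId = 0;
        if (Interlocked.Decrement(ref m_Waiters) > 0)
            m_WaiterLock.Release(1);          // Wake one waiter
        Thread.EndCriticalRegion();
    }
}
```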

Synchronization Fairness

In Windows®, threads waiting on a kernel object (mutex, semaphore, event, file handle, and so forth) are, for the most part, processed fairly. This means that when the kernel object becomes signaled, Windows wakes up the waiting threads using a FIFO algorithm. This algorithm, by the way, does not take thread priority into account. So, if a lower-priority thread starts waiting on a mutex and then a higher-priority thread later waits on the same mutex, Windows will wake the lower-priority thread since it started waiting first.

I say that waiting is mostly fair because there are times when this fairness is violated. For example, when debugging a process, breakpoints cause all threads in the process to be suspended. If any of these threads happen to be waiting on a thread synchronization object, then Windows pulls these threads out of the FIFO queue so they are not waiting on the thread synchronization object anymore. When the execution resumes after the breakpoint, the threads are resumed and they race back to re-enter the FIFO queue. But, these threads enter the FIFO queue from the back like all other threads, so they lose their original position in line. Furthermore, if two threads were waiting and they both race back to the queue, they may get back into the queue in a different order than they were in before.

In Windows there are other ways that waiting threads can be prevented from being processed in a FIFO fashion. For example, waiting threads can be prematurely awakened due to asynchronous procedure calls (APCs), especially if the threads are in an alertable state. It turns out that the CLR makes fairness even less likely to occur. The culprit of this added unfairness is the garbage collector.

When the CLR wants to start a garbage collection, it will determine which threads are currently executing managed code and which are currently executing unmanaged code. After making this determination, the CLR will suspend the threads that are executing managed code. Threads that are currently executing unmanaged code will hijack themselves when they attempt to return back to managed code. It doesn't happen frequently, but every once in a rare while there is a small window of time in which a thread meets these conditions: it is in managed code, the CLR thinks it needs to suspend this thread, the thread calls into the unmanaged Win32 WaitForSingleObject or WaitForMultipleObjects functions, and the CLR suspends the thread while in one of these functions.

In addition, the CLR always puts waiting threads in an alertable state, which makes it even more likely that the threads can be awakened and pulled out of the queue. The CLR does this to handle APIs like Thread's Interrupt method. The point of all this is to say that locks in the CLR and in Windows are not fair. And, in fact, it is entirely possible (but very unlikely) that a thread waiting on a lock will not get it at all if there is a steady stream of threads that are attempting to enter the lock, especially if these threads are also allocating objects causing frequent garbage collections to occur.

Fortunately, most applications access locks infrequently so fairness is not an issue. In fact, most applications will run fine without fair thread synchronization, but at least you should be aware of this issue.

Avoid Convoys: Building a Better Optex

As it turns out, trying to have a lock treat its waiting threads fairly causes a problem known as the convoy problem. Simply stated, the convoy problem results when the lock remains in contention for a long time, causing waiting threads to constantly transition to kernel mode. Here the lock exhibits worse performance than if you had just used a kernel object (such as a mutex) in the first place. Since the Optex class shown earlier tries to treat its waiting threads fairly, it can cause the convoy problem. Now, let me explain the convoy problem in more detail.

You'll notice that when an Optex lock is under contention, the lock never becomes free. That is, the m_Waiters field never goes to zero (indicating that the lock is free) until there are no threads waiting on the lock. So, when a thread calls Enter, if the lock is already owned, the calling thread transitions to kernel mode to wait. When the thread that owns the lock calls Exit, it detects that another thread is waiting and it also transitions to kernel mode to set the semaphore. Unfortunately, the way that the Optex class is currently written, the thread leaving the lock actually owns the lock longer than necessary because it transitions to kernel mode to release the semaphore. And, releasing the semaphore causes Windows to do a thread reschedule in order to context-switch to another thread. All of this—holding the lock longer, rescheduling, context switching—ends up hurting performance significantly. It also means that the lock tends to remain under contention for a longer time and this means that threads calling Enter continuously call WaitOne, causing them to transition to kernel mode. Since the lock rarely becomes free, threads keep transitioning to kernel mode and the lock performs worse than if you used a kernel object to begin with.

Figure 3 shows a new version of the Optex class that fixes the convoy problem and improves performance. In this new version, the m_LockState field contains two pieces of information: whether the lock is free or owned (bit 0) and the number of threads waiting on the lock (bits 1 through 31). I also defined a few constants in the Optex class to help manipulate the m_LockState field. Naturally, the bits of this Int32 field will be manipulated using various interlocked methods. In fact, I needed to define a few helper methods (using the techniques discussed in my previous column) such as InterlockedAdd, InterlockedOr, and IfThen to make the code easier to write and more maintainable. You can see these methods implemented towards the bottom of the Optex class.

Figure 3 Optex.cs

using System;
using System.Diagnostics;
using System.Threading;

namespace MsdnMag.Threading {
    public sealed class Optex : IDisposable {
        // Bit 0:     0=Lock is free, 1=Lock is owned
        // Bits 1-31: Number of waiters
        private Int32 m_LockState = c_lsFree;
        private const Int32 c_lsFree  = 0x00000000;
        private const Int32 c_lsOwned = 0x00000001;
        private const Int32 c_1Waiter = 0x00000002;

        private Semaphore m_WaiterLock = new Semaphore(0, Int32.MaxValue);

        public Optex() { }

        public void Dispose() {
            if (m_WaiterLock != null) {
                m_WaiterLock.Close();
                m_WaiterLock = null;
            }
        }

        public void Enter() {
            Thread.BeginCriticalRegion();
            while (true) {
                // Turn on the "owned" bit
                Int32 ls = InterlockedOr(ref m_LockState, c_lsOwned);

                // If lock was free, this thread got it, return
                if ((ls & c_lsOwned) == c_lsFree) return;

                // Another thread owned the lock, add 1 waiter
                if (IfThen(ref m_LockState, ls, ls + c_1Waiter)) {
                    // If successfully added 1, wait for lock
                    m_WaiterLock.WaitOne();
                }
                // We weren't able to add 1 waiter or waiter woke,
                // attempt to get the lock
            }
        }

        public void Exit() {
            // Pre-condition:  Lock's state must be Owned
            // Post-condition: Lock's state must become Free
            //                 (the lock is never passed)

            // Phase 1: Free the lock
            Int32 ls = InterlockedAnd(ref m_LockState, ~c_lsOwned);
            if (ls == c_lsOwned) {
                // If no waiters, nothing to do, we can just return
            } else {
                // Phase 2: Possibly wake waiters
                // If lock is free, try to subtract 1 from the
                // number of waiters
                ls &= ~c_lsOwned;
                if (IfThen(ref m_LockState, ls, ls - c_1Waiter)) {
                    // We successfully subtracted 1, wake 1 waiter
                    m_WaiterLock.Release(1);
                } else {
                    // Lock's state changed by other thread,
                    // other thread will deal with it
                }
            }
            Thread.EndCriticalRegion();
        }

        #region Interlocked Helpers
        private static Int32 InterlockedAnd(ref Int32 target, Int32 with) {
            Int32 i, j = target;
            do {
                i = j;
                j = Interlocked.CompareExchange(ref target, i & with, i);
            } while (i != j);
            return j;
        }

        private static Int32 InterlockedOr(ref Int32 target, Int32 with) {
            Int32 i, j = target;
            do {
                i = j;
                j = Interlocked.CompareExchange(ref target, i | with, i);
            } while (i != j);
            return j;
        }

        private static Boolean IfThen(ref Int32 val, Int32 @if, Int32 then) {
            return (Interlocked.CompareExchange(ref val, then, @if) == @if);
        }

        private static Boolean IfThen(ref Int32 val, Int32 @if, Int32 then, out Int32 prevVal) {
            prevVal = Interlocked.CompareExchange(ref val, then, @if);
            return (prevVal == @if);
        }
        #endregion
    }
}

This version of the Enter method contains a while loop that continues until the calling thread acquires the lock. Inside the loop, InterlockedOr is called to turn on the owned bit, attempting to take the lock. If InterlockedOr returns c_lsFree (0), then the lock was previously unowned, the calling thread has acquired it, and it can return from Enter immediately; the result is very good performance.

If InterlockedOr doesn't return c_lsFree, then another thread owns the lock and the calling thread must increment the count of waiting threads and then wait on the semaphore. When this thread wakes, it will loop around and attempt to acquire the lock just like any other thread. This is a bit inefficient and has two potential problems, which I'll address shortly. For now, let's look at the new version of the Exit method.

To avoid the convoy problem, you'll want the Exit method to release the lock entirely in user mode. Then, if another thread comes along, it can acquire the lock without contention and without transitioning to kernel mode, thereby executing with great performance. However, another thread could have tried to enter the lock earlier, had contention, and gone into kernel mode to wait until the lock was released. This thread does need to be awakened but I'm going to do this differently than in the previous version of the Optex class.

When Exit is called you know that the lock must be owned (m_LockState must have the c_lsOwned bit set) and you want the lock to be free before the method returns. The first thing Exit does is call InterlockedAnd to turn off bit 0 making the lock free. If InterlockedAnd returns c_lsOwned, then this thread owned the lock and no other threads are waiting on the lock. We expect this no-contention situation to be very common, and in this case Exit simply returns; again, performance is very good.

If InterlockedAnd doesn't return c_lsOwned, it means that some other bits are turned on, indicating that one or more threads are waiting on the lock. Now, if the lock is still free, I will try to decrement the count of waiting threads by 1 (by calling my IfThen method to subtract c_1Waiter from the count). If Exit successfully subtracts one waiter, then I'll wake one waiting thread by releasing the semaphore. If the call to IfThen fails to subtract 1 from the count of waiting threads, then another thread must have snuck in and taken the lock; therefore, I will not wake a waiting thread. The thread that snuck in will wake a waiting thread when it calls Exit to leave the lock.

Now, let me address the two problems I spoke of earlier. The first problem is that this lock is unfair because it is possible that a waiting thread got awakened, but the lock was given to another thread that snuck in by calling Enter. However, I discussed fairness in the previous section and in most situations this unfairness is not an issue worth worrying about. The second problem is that Exit woke up a waiting thread which looped around and then simply went back to waiting on the lock again. This, of course, hurts performance. However, it's all a numbers game. For most applications, the performance of this technique—which eliminates the convoy problem—outperforms the other technique of passing the lock from thread to thread. It is unlikely that the awakened thread will miss the lock and have to go back to wait on it again; usually, the awakened thread will get the lock, improving performance. The big performance win, of course, is that most threads that call Enter will not see contention on the lock and they will not transition to kernel mode.

In all my testing, this new version of the Optex class definitely outperforms the previous version and I would never use the first version in any production code. By the way, the Monitor class's Enter and Exit methods are implemented to be unfair in order to avoid the convoy problem. The result is improved performance, so feel free to use the Monitor Enter/Exit methods in your own code.

Critical Regions

You'll notice that in my SpinWaitLock code in my last column and in the two versions of the Optex code here, I have all the Enter methods call Thread's BeginCriticalRegion method and I have Exit call Thread's EndCriticalRegion method. Let me explain what these calls are about.

The Microsoft SQL Server™ team really needed the CLR to be much more robust and reliable than versions 1.0 or 1.1. Towards this end, version 2.0 now supports critical regions. A critical region is a block of code that might be dependent on state shared between multiple threads. Certainly, the fields of a thread synchronization lock object qualify as shared state. The resource that is being protected by the lock itself would be considered shared state as well.

The CLR would like to know when a thread has entered into a critical region in case an unhandled exception is thrown from within the region. You see, if an unhandled exception occurs, then it is likely that state shared between threads is corrupt, and allowing other threads in the AppDomain to run and access the shared state is just asking for trouble. Usually a host process (like SQL Server) will want to know that an unhandled exception occurred in a critical region and the host will unload the entire AppDomain.

Code can notify the CLR that a critical region is starting by calling Thread's BeginCriticalRegion method. Internally, all this method does is increment a counter associated with the calling thread by 1. Code notifies the CLR that it is leaving a critical region by calling Thread's EndCriticalRegion method, which internally decrements the thread's counter by 1. A host application (such as SQL Server) can determine when a thread executing in an AppDomain has experienced an unhandled exception. The host can then determine if the thread was in the process of accessing shared state (inside a critical region) and the host can decide how to handle this situation. For example, the host could unload the AppDomain (along with all its state) or the host could shut down the CLR entirely within the host's process.

Microsoft has updated the Monitor and Mutex classes to call BeginCriticalRegion and EndCriticalRegion appropriately. However, if you are building your own locks as I've done in this column, then you must call the methods manually. Or, if you're accessing shared state without locks, then you should be calling these methods manually as well in order to properly support hosts such as SQL Server.
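As a minimal sketch of bracketing lock-free shared-state access (the SharedCounter class and its members are my own illustrative names, not from the download), the calls pair naturally with try/finally so the counter is decremented even if the update throws:

```csharp
using System;
using System.Threading;

// Hypothetical lock-free shared state; names are illustrative
static class SharedCounter {
    private static Int32 s_Count = 0;   // shared state, protected by no lock

    public static Int32 Add(Int32 n) {
        // Tell the CLR that shared state is being manipulated so a host
        // (like SQL Server) can escalate if this thread fails in here
        Thread.BeginCriticalRegion();
        try {
            return Interlocked.Add(ref s_Count, n);
        }
        finally {
            Thread.EndCriticalRegion();
        }
    }
}
```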

Send your questions and comments for Jeffrey to mmsync@microsoft.com.

Jeffrey Richter is a cofounder of Wintellect (www.Wintellect.com), a training and consulting firm. He is the author of several books, including Applied Microsoft .NET Framework Programming (Microsoft Press, 2002). Jeffrey is also a contributing editor to MSDN Magazine and has been consulting with Microsoft since 1990.