The CHESS scheduler – the key to finding concurrency bugs
Hey everyone, I am Madan Musuvathi, the architect of CHESS. In this blog, I will describe the CHESS scheduler, a key component of CHESS. Before I start, if you haven’t seen Tom’s previous post about the CHESS Devlabs release, you should. Download CHESS, play around, and let us know what you think.
When CHESS attaches to your program, it takes complete control over the scheduling of threads in your program. You are no longer at the mercy of the Windows scheduler, which can nondeterministically preempt your threads whenever it wants. Instead, CHESS, and indirectly you, the user, decide how the threads interleave during execution. This allows CHESS to systematically drive your program along possible interleavings, hunting for bugs. Also, once CHESS finds a bug, the repro is as easy as telling the CHESS scheduler to execute the interleaving that produced that bug.
To explain how the CHESS scheduler controls the interleaving, lets look at a simple program. (This is our favorite example, you will see it again and again.)
This example involves two threads, a deposit thread that adds $100 to the balance, and a withdraw thread that deducts $100 from the balance. If the balance starts out as $100, we expect the final value to be $100. Both threads protect their accesses to the shared variable balance with a critical section. But, the withdraw thread does not hold the critical section lock for its entire computation, resulting in an ‘atomicity’ violation. The assert at the end fires if the deposit thread executes after the withdraw thread leaves the critical section but before it enters the critical section for the second time. This interleaving is shown below:
CHESS can find this bug by systematically exploring all interleavings.
First, CHESS instruments the input program to make calls to the CHESS scheduler. We use various techniques to perform this instrumentation (which will not be described this blog post). This is how the code above will look after the instrumentation.
By default, CHESS instruments a call to the CHESS scheduler before every synchronization operation. In mchess, you can use the /volatile option to instrument a call before volatile accesses as well.
What does the scheduler to do at ChessSchedule() calls? The goal is to insert delays to force different interleavings. To see this, let us look at a “first-cut” solution, where the scheduler inserts random delays, as shown below:
Now, depending on how the random numbers are seeded, we are likely to get different interleavings of the two threads on each run. Convince yourselves that for particular random sleep sequences, one can emulate the interleaving that results in the bug shown above. But, there are still many problems with this solution. For instance, many (in fact, infinitely many) random sequences will result in the same interleaving, causing a wastage of test resources. Also, starting from the same random seed, you are still not guaranteed to see the same interleaving --- the Windows scheduler still decides how long a thread sleeps and when to run threads. So, repro is still a problem.
CHESS improves upon this “first-cut” solution by selectively blocking and enabling threads, in such a way that the Windows scheduler has absolutely no control. The CHESS scheduler creates a semaphore for each thread. At each call to ChessSchedule(), the calling thread queries the scheduler to see if the thread should block, and if so, the it blocks on its semaphore. Simultaneously, the scheduler releases the semaphore of the thread that needs to run next. By doing so, the scheduler is able to orchestrate an interleaving of its choice. For instance, the buggy interleaving can be achieved as follows:
Initially, the semaphore for the two threads “withSem” and “depSem” are not available. At the beginning of the program, the scheduler decides that it wants the withdraw thread to run by releasing withSem. Note, the deposit thread will block on depSem while the withdraw thread makes progress. Also, as the deposit thread is blocked, the Windows scheduler has no choice but to schedule the withdraw thread. After the withdraw thread leaves the critical section, it calls into the CHESS scheduler again (at the instrumentation before the next EnterCriticalSection). At this point, the scheduler decides to block this thread and enables the deposit thread by releasing depSem. Now, the withdraw thread blocks on its semaphore while the deposit thread makes progress. Again, as the Windows scheduler sees that the only enabled thread is the deposit thread, it has no choice but to schedule that thread.
By picking different choices at the instrumentation points, the scheduler is able to force different interleavings. Another interleaving example is shown below:
Repro is now simple. Before an execution, the scheduler dumps to disk the choices it is going to make at each ChessSchedule() call. For repro, the scheduler simply reads these choices from the disk.
There are still some intricacies involved: as the scheduler inserts artificial synchronizations with these semaphores, these can conflict with other synchronizations in the program resulting in a deadlock. Imagine, a thread holds a lock and then blocks on the semaphore that CHESS has created for that thread. Now, the CHESS scheduler schedules another thread that simply goes and blocks on the lock, we have a deadlock that CHESS introduced. While we try hard to find your bugs, we don’t want you to be chasing false bugs that we generated. The CHESS scheduler is smart enough and such deadlocks don’t happen. If you are a concurrency wizard, you probably know how hard this can be, especially when programs can use a variety of synchronizations, including spinlocks. But, as we have solved that problem, you can sleep tight and be a blissful user of CHESS :)