Masking bugs with dual processors

A while back, I had to track down a bizarre bug in a test build. A tester reported that on his machine, some code that grabbed a decompressed video image took several seconds to run. On my machine, it was near instantaneous.

My machine was certainly faster than his, but it didn't seem like it should be that much faster. I turned on some tracing, ran the code on my machine, and looked at what it was really doing. Nothing seemed out of the ordinary at first glance.

Then I collected the same traces on the test machine. The trace looked completely different -- it was decompressing hundreds more frames than my machine was. Not only was his machine slower in general, it was also doing far more work.

One interesting difference -- my machine was dual proc (and hyperthreaded), and his was single proc (no hyperthreading). I used Task Manager's Set Affinity command to bind the test app to a single processor on my PC. Sure enough, my machine now took several seconds, and the trace looked a lot like the one from the slower machine.
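
Task Manager's Set Affinity dialog is the interactive way to do this; if you want the same thing in an automated repro, a minimal Win32 sketch (error handling trimmed) might look like this:

```cpp
#include <windows.h>
#include <stdio.h>

int main()
{
    // Confine this process to CPU 0 only, simulating a
    // single-processor machine -- the same thing Task Manager's
    // Set Affinity dialog does.
    if (!SetProcessAffinityMask(GetCurrentProcess(), 1))
    {
        printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }

    // ... run the repro scenario here ...
    return 0;
}
```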

A quick review: most A/V engines (such as DirectShow) involve at least a couple of threads. Typically, the app calls into the A/V engine on one thread and issues commands (such as pause, stop, run, seek). The A/V engine will, on the calling thread, do some setup, and then kick off another thread to pump the media samples through the various components (parser, codec, renderer). That way, the app can get control back and go do something else (like process user input) while the A/V engine plays audio and video.
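
For concreteness, here's roughly what the app-thread side of that looks like with DirectShow (COM error handling trimmed; the streaming threads are created inside the filters, not by this code):

```cpp
#include <dshow.h>
#pragma comment(lib, "strmiids.lib")
#pragma comment(lib, "ole32.lib")

void PlayFile(const wchar_t* path)
{
    CoInitialize(NULL);

    IGraphBuilder* pGraph = NULL;
    CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                     IID_IGraphBuilder, (void**)&pGraph);

    // Builds the parser -> codec -> renderer chain for us.
    pGraph->RenderFile(path, NULL);

    IMediaControl* pControl = NULL;
    pGraph->QueryInterface(IID_IMediaControl, (void**)&pControl);

    // Run() returns almost immediately; the filters spin up their own
    // streaming threads to pump samples, and this thread is free to go
    // back to processing user input.
    pControl->Run();

    // ... app does other work while audio/video plays ...

    pControl->Stop();
    pControl->Release();
    pGraph->Release();
    CoUninitialize();
}
```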

With DirectShow, a seek command first issues a flush. The idea is that since any component in the graph can be holding buffered samples, you want to clear those out before you jump to the new time. So the app thread calls seek, which turns into a flush, which travels through the various filters in the graph.
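
From the app's point of view, the seek is a single call into IMediaSeeking; the flush traffic all happens under the covers on that same thread. Something like:

```cpp
#include <dshow.h>

// Seek the graph to an absolute position (REFERENCE_TIME is in
// 100-nanosecond units). Under the covers, this one call pushes
// BeginFlush/EndFlush through every pin connection before the
// graph resumes at the new position.
void SeekTo(IGraphBuilder* pGraph, REFERENCE_TIME rtTarget)
{
    IMediaSeeking* pSeeking = NULL;
    pGraph->QueryInterface(IID_IMediaSeeking, (void**)&pSeeking);

    pSeeking->SetPositions(&rtTarget, AM_SEEKING_AbsolutePositioning,
                           NULL, AM_SEEKING_NoPositioning);
    pSeeking->Release();
}
```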

The base renderer class responds to BeginFlush by setting an event that unblocks the rendering thread. That way, if the renderer is blocked -- waiting for a sample's presentation time to arrive, or because the graph is paused -- it wakes up and releases the streaming thread. In this case, the graph was paused, so the renderer was blocking the streaming thread.
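
Stripped down to bare Win32 primitives, the pattern is something like this -- the shape of the mechanism, not the actual base-class code:

```cpp
#include <windows.h>

HANDLE g_hRenderTime; // auto-reset: signaled when it's time to render
HANDLE g_hFlush;      // manual-reset: signaled for the duration of a flush

void Init()
{
    g_hRenderTime = CreateEvent(NULL, FALSE, FALSE, NULL);
    g_hFlush      = CreateEvent(NULL, TRUE,  FALSE, NULL);
}

// Streaming thread: block until the sample's render time (or until the
// pause ends), unless a flush arrives first. Returns false if we woke
// up because of a flush, in which case the sample gets dropped.
bool WaitToRender()
{
    HANDLE h[2] = { g_hRenderTime, g_hFlush };
    return WaitForMultipleObjects(2, h, FALSE, INFINITE) == WAIT_OBJECT_0;
}

void OnBeginFlush() { SetEvent(g_hFlush); }   // release the streaming thread
void OnEndFlush()   { ResetEvent(g_hFlush); } // back to normal blocking
```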

There were two bugs. The first was that one of the upstream filters was calling the downstream filter's input pin's BeginFlush() method and only afterward, once it returned, setting the event that would tell its own streaming thread -- the one pushing samples toward the renderer -- to stop. The second was that a codec filter, between that filter and the renderer, was returning S_OK instead of S_FALSE from its IMemInputPin::Receive implementation while the flush was in progress.
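
In sketch form, with illustrative names (this is not the actual filter's code), the first bug was an ordering problem:

```cpp
#include <windows.h>
#include <dshow.h>

// Illustrative only -- hypothetical names, not the real filter.
class CMyPushFilter
{
    IPin*  m_pDownstreamInput;  // the codec's input pin
    HANDLE m_hStopPushing;      // tells our streaming thread to stop

public:
    // Buggy ordering: flush downstream first, stop our thread second.
    // In the gap between the two calls, the streaming thread keeps
    // delivering samples into a renderer that is already flushing.
    void StartFlush_Buggy()
    {
        m_pDownstreamInput->BeginFlush();
        SetEvent(m_hStopPushing);   // too late -- the race is open
    }

    // Fixed ordering: stop our own streaming thread before telling
    // the rest of the graph to flush.
    void StartFlush_Fixed()
    {
        SetEvent(m_hStopPushing);
        m_pDownstreamInput->BeginFlush();
    }
};
```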

And here we get to a design decision that exacerbated the bugs: the app thread was running at low priority.

The BeginFlush command flowed through the graph, on the low-priority app thread, until it reached the renderer, which set the event to unblock itself. The streaming thread, meanwhile, was running at normal priority. The renderer dropped the sample, knowing it was in the middle of a flush, and returned back up the stack to the buggy filter -- which hadn't yet gotten the command to stop sending samples and, since it got S_OK back from the codec filter, had no idea a flush was in progress. So it sent another sample, which was decompressed and dropped by the renderer -- and so on.
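
To see why the S_OK return mattered, here's roughly what a pushing loop looks like -- GetNextSample is a hypothetical stand-in for wherever the samples actually come from:

```cpp
#include <dshow.h>

IMediaSample* GetNextSample(); // hypothetical sample source

// Illustrative streaming-thread loop (not the actual filter's code).
// A return code other than S_OK from Receive() -- S_FALSE during a
// flush, for example -- is the downstream filter's way of saying
// "stop sending." Because the codec passed back S_OK even while the
// renderer was flushing, a loop like this never saw the signal.
void PushLoop(IMemInputPin* pDownstream)
{
    for (;;)
    {
        IMediaSample* pSample = GetNextSample();
        if (!pSample)
            break;

        HRESULT hr = pDownstream->Receive(pSample);
        pSample->Release();

        if (hr != S_OK)     // S_FALSE or an error: stop pushing
            break;
    }
}
```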

The only thing that stopped it was the NT thread scheduler, which, if a runnable thread has been starved of the CPU for about three seconds, boosts its priority to keep it from being starved indefinitely. So the app thread eventually got boosted, got to set the event telling the streaming thread to stop, and everything finished.

Two observations from this. First: a few years ago, the problem with multi-threaded programming was that almost no one had dual-proc machines, so there was a whole class of bugs that single-processor machines masked. Now almost everyone in development has a dual-proc (or at least hyperthreaded -- kind of a proc and a half) box, so now you get the opposite: bugs masked by having multiple processors.

The second is that, in this case, there was really no reason for the calling thread to run at low priority. (Of course, if it hadn't, these two bugs probably would have been masked.) It was set to low priority to avoid blocking the UI and other tasks with the relatively unimportant job of grabbing a bitmap from a video file. But it didn't matter -- that low-priority thread wasn't doing any of the heavy lifting. All of the video decompression and other hard work was happening on threads created by the various DirectShow filters, and those were all running at normal priority.
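
The shape of the problem, with a hypothetical worker standing in for the filter threads:

```cpp
#include <windows.h>

// Hypothetical stand-in for the decode work the filters do.
DWORD WINAPI DecodeWorker(LPVOID)
{
    // Threads created with CreateThread start at THREAD_PRIORITY_NORMAL
    // regardless of the creating thread's priority -- so the heavy
    // lifting here runs at normal priority no matter what the caller did.
    // ... decompress frames ...
    return 0;
}

void GrabBitmap()
{
    // This only deprioritizes the cheap command-issuing work below,
    // not the decode threads where the real CPU time goes.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);

    HANDLE hWorker = CreateThread(NULL, 0, DecodeWorker, NULL, 0, NULL);
    // ... issue graph commands, wait for the frame ...
    WaitForSingleObject(hWorker, INFINITE);
    CloseHandle(hWorker);
}
```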

So: try code on both multi-proc and single-proc boxes, and when you set a thread's priority, verify that it's actually accomplishing what you think it is.