Where did all the CPU go?

Today I was debugging a busy hang that didn't fit the usual pattern, and I thought it was interesting enough to share.

The customer reported that their process was hanging at 100% CPU. OK, that sounds like a busy hang.

So, I got my three dumps at one-minute intervals as I usually do. I loaded them into WinDbg and ran the !runaway command, which gave me a list of runtimes for each of the threads. There were about 5 threads in the 2-3 minute range and a LOT of threads with practically no runtime. I checked the uptime of the process and it had been running for about 3 hours. 5 threads at 2:30 each is about 12 minutes, so those threads couldn’t have kept the CPU pegged for long. I checked with the customer and they told me that the system had been hung for around 40 minutes.
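To make the mismatch concrete, here is a minimal sketch of the sums in Python. The 2:30 per thread and the 40 minutes are the rough figures quoted above. The helper name is just for illustration, and the comparison assumes a single CPU pegged for the whole hang (more CPUs would only widen the gap).

```python
from datetime import timedelta


def total_cpu_time(per_thread_times):
    """Add up the CPU time charged to each busy thread."""
    return sum(per_thread_times, timedelta())


# Five threads at roughly 2 min 30 s each (approximate figures from the dump).
busy_threads = [timedelta(minutes=2, seconds=30)] * 5
cpu_used = total_cpu_time(busy_threads)

# The customer reported roughly 40 minutes of hang.
hang_duration = timedelta(minutes=40)

# Share of the hang that the busy threads can account for,
# assuming one CPU pegged for the whole period.
accounted_fraction = cpu_used / hang_duration

print(cpu_used)            # 0:12:30
print(accounted_fraction)  # 0.3125
```

So the threads that !runaway flags as busy explain less than a third of the CPU time that went missing.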


Those sums didn’t seem to gel. Busy hangs normally have a single thread eating the CPU, though there can be more if spinlocks are involved. I dumped out the stacks for all the threads and found the usual worker threads doing normal worker-thread things and a lot of threads in a very odd state. When I say “a lot”, I mean getting on for 1000 threads. So, it looked like the load was not confined to a single thread, yet something was eating CPU cycles.


I should have worked it out a bit quicker but I hadn’t come across one of these for quite a while.


Thread switches are not free. Generally, the overhead is not that bad, but when there are a lot of threads it can swamp the real work that the app is doing. How many threads should you have? It depends on your application. If you make a lot of blocking calls, such as calls to a DB, then you probably care more about scalability than raw performance. In that case, you want quite a lot of threads because the thread switch overhead is dwarfed by the time spent in the blocking calls. The switching will still eat CPU, but not too badly. If you are writing a high-performance app that rarely blocks then the rules are different. You want not many more than 1 thread per CPU, or per virtual CPU in the case of hyperthreading. Extra threads just wait their turn and cost you cycles.
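As a rough sketch of that rule of thumb, here it is in Python. The function name and the blocking_factor of 8 are my own illustrative assumptions, not tuned values; the right multiplier depends on how long your blocking calls really take, so measure before you trust it.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def suggested_worker_count(cpus, mostly_blocking, blocking_factor=8):
    """Rule-of-thumb pool size; blocking_factor is an illustrative guess."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    if mostly_blocking:
        # Threads spend most of their time waiting (e.g. on a DB),
        # so oversubscribing the CPUs is acceptable.
        return cpus * blocking_factor
    # CPU-bound work: not many more than one thread per logical CPU.
    return cpus + 1


# os.cpu_count() reports logical CPUs, so hyperthreads are included.
logical_cpus = os.cpu_count() or 1
workers = suggested_worker_count(logical_cpus, mostly_blocking=False)

with ThreadPoolExecutor(max_workers=workers) as pool:
    squares = list(pool.map(lambda n: n * n, range(5)))

print(workers, squares)
```

The useful property of a bounded pool is that a bug elsewhere cannot quietly grow the thread count into the hundreds.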


In this case, the process was doing lots of thread context switches and that was killing performance. The threads were in short wait states so they were constantly waking and then blocking again. The real worker threads were being starved so no real work was happening.
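A back-of-envelope model shows how short waits across many threads add up. Everything numeric here is an assumption for illustration: the 10 ms wait, the 5 microsecond switch cost and the 4 CPUs are made-up inputs, not measurements from the customer's system. Only the thread count of about 1000 comes from the dumps.

```python
def switch_overhead_fraction(thread_count, wait_seconds, switch_cost_seconds, cpus):
    """Estimate the fraction of total CPU capacity lost to context switches.

    Simplified model: every thread wakes once per wait interval, and each
    wake costs one switch in and one switch out. The result is capped at 1.0.
    """
    wakeups_per_second = thread_count / wait_seconds
    switch_seconds_per_second = wakeups_per_second * 2 * switch_cost_seconds
    return min(1.0, switch_seconds_per_second / cpus)


# Hypothetical inputs: ~1000 leaked threads, each waking every 10 ms,
# 5 microseconds per switch, on a 4-CPU box.
overhead = switch_overhead_fraction(
    thread_count=1000,
    wait_seconds=0.010,
    switch_cost_seconds=5e-6,
    cpus=4,
)
print(f"{overhead:.0%} of CPU capacity spent switching")
```

The model ignores cache and TLB effects, which usually make the real cost worse, so treat the figure as a floor rather than an estimate.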


Oh, the bug? A thread leak that we fixed 2 years back. The fix was turned off by default because the condition is rare (it requires a specific condition in a specific version of a third-party database) and there are advantages to having it disabled under all other circumstances. We turned it on and all was well.


Signing out