If you can’t stand the heat…
You should get out of the PC kitchen. This is another silent system killer that most people don’t want to acknowledge. (Though I will admit it’s gotten easier over the last 2-3 years, as Intel, AMD, nVidia, and ATI have cranked up the wattage to the point where even the most stubborn have to recognize heat as a design issue.) While not often a problem for brand-name computers (which are built with high tolerances and an eye towards good heat dissipation characteristics), it can kill a homemade desktop or server. I won't go into solutions, just explain why it's hard to work on these when you're on the other end of a phone line or e-mail thread.
This is another one that can be a nightmare to work on, at least from the perspective of someone troubleshooting the operating system. The way it manifests itself is very similar to random memory problems: blue screens and access violations with no discernible pattern. The way I usually go at it? Open the case up and point a big ol’ box fan into it. Low tech? Sure. Effective? Heck yeah.
One of the axioms we live by is that a software problem should be consistently reproducible. Sometimes figuring out the parameters for reproducing a problem can be tricky, but if we see a closely related set of behaviors across multiple failures, we can feel good that it is something we can fix in software. Bad hardware, on the other hand, plays by no one's rules.
Someone taking a 30,000-foot view of the problem might say: “It is consistent. I run for this long, and it always blue screens!” When we dig into the details though, a different picture emerges. What the CPU was doing at the time of each crash, in terms of software, could be drastically different from one failure to the next. Running Notepad, SQL, Minesweeper, core OS functions, it doesn’t matter. You have to look at the state of the system itself, and see if the CPU is doing exactly what it should be, or if RAM has conspicuous patterns that don’t match anything software would likely create, or a device is returning noise instead of data. Getting to the root cause of a problem like this can be terribly difficult without the right tools, especially when you only have a snapshot of the system provided by a memory.dmp file, instead of the live (or more appropriately, freshly dead) system sitting in front of you.
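To make the "closely related set of behaviors" test a little more concrete, here is a rough Python sketch of the kind of first-pass triage I mean. It is not a real tool, and everything in it is an assumption for illustration: the record layout, the driver names, the sample crashes, and especially the 0.6 threshold are all made up. It just asks how many crashes share the same stop code and faulting module.

```python
from collections import Counter

def dominant_signature_share(crashes):
    """Return the most common (stop_code, module) pair and its share of all crashes."""
    if not crashes:
        raise ValueError("need at least one crash record")
    counts = Counter((c["stop_code"], c["module"]) for c in crashes)
    signature, hits = counts.most_common(1)[0]
    return signature, hits / len(crashes)

def looks_like_software(crashes, threshold=0.6):
    """Heuristic only: True when one signature accounts for most of the crashes.

    The 0.6 threshold is an arbitrary assumption, not a rule.
    """
    _, share = dominant_signature_share(crashes)
    return share >= threshold

# Hypothetical crash summaries, as if copied by hand from several dump files.
# The module names are invented for this example.
consistent = [
    {"stop_code": 0x50, "module": "exampledrv.sys"},
    {"stop_code": 0x50, "module": "exampledrv.sys"},
    {"stop_code": 0x50, "module": "exampledrv.sys"},
    {"stop_code": 0x0A, "module": "otherdrv.sys"},
    {"stop_code": 0x50, "module": "exampledrv.sys"},
]
scattered = [
    {"stop_code": 0x50, "module": "exampledrv.sys"},
    {"stop_code": 0x0A, "module": "otherdrv.sys"},
    {"stop_code": 0x1E, "module": "thirddrv.sys"},
    {"stop_code": 0x7F, "module": "fourthdrv.sys"},
    {"stop_code": 0x0A, "module": "fifthdrv.sys"},
]

print(looks_like_software(consistent))  # one signature dominates
print(looks_like_software(scattered))   # no signature dominates
```

A scattered result doesn't prove heat, or bad hardware of any kind. It only says the failures aren't clustering the way a single software bug usually would, which is the point where the box fan starts to look like a reasonable next step.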