How a Bluescreen Button (NMI) can Save Your Bacon
I know, another title that seems ridiculous. Why in the world would anyone want a button that intentionally bluescreens the system?! When you’re confronted with a hard hang, though (no response from the mouse or keyboard), you’re in for a heck of a time trying to figure out what’s wrong without one. That’s where the NMI button comes in handy.
Many people are already familiar with the mechanism introduced in Windows 2000 for these kinds of issues. The gist is that by setting a registry key, you can enable a key sequence (at the local keyboard only) that will bluescreen the machine. Thus if you’re having problems with hangs, you can get a memory.dmp and send it to your OEM or Microsoft for analysis.
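For reference, here is a minimal sketch of what setting that registry key looks like. The article doesn’t name the key, so the details below are taken from Microsoft’s documentation as I understand it: the value is CrashOnCtrlScroll (REG_DWORD, set to 1) under the keyboard driver’s Parameters key, i8042prt for PS/2 keyboards and kbdhid for USB keyboards on Windows versions that support it. The function name is just illustrative. The script only builds the reg.exe command line; you would run the command from an elevated prompt and reboot.

```python
# Sketch: build the reg.exe command that enables the keyboard crash sequence.
# Key paths and the value name come from Microsoft's documentation,
# not from this article; treat them as assumptions and verify for your OS.
KEYBOARD_DRIVER_KEYS = {
    "ps2": r"HKLM\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters",
    "usb": r"HKLM\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters",
}


def crash_on_ctrl_scroll_command(keyboard="ps2"):
    """Return the reg.exe argument list that sets CrashOnCtrlScroll to 1."""
    key = KEYBOARD_DRIVER_KEYS[keyboard]
    return ["reg", "add", key,
            "/v", "CrashOnCtrlScroll",
            "/t", "REG_DWORD",
            "/d", "1",
            "/f"]


# Print the command instead of running it, so nothing changes by accident.
print(" ".join(crash_on_ctrl_scroll_command("ps2")))
```

Once the value is set and the machine has been restarted, the documented sequence is to hold the right CTRL key and press SCROLL LOCK twice.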
However, this mechanism can’t cover every scenario that results in a hang. The keyboard interrupt typically has a fairly low priority relative to the other devices on the system. If your hang isn’t the result of a deadlock in the kernel itself, the key sequence may never get through to initiate the crash. It’s simply too easy for other devices and drivers to turn off that interrupt while doing their own I/O.
This is where the Non-Maskable Interrupt (NMI) comes in to save the day. As the name implies, this is an interrupt that cannot be masked (turned off) by software. When the interrupt is generated, the CPU will always get it, and the interrupt handler (which you also must explicitly enable in the registry) will start the process of bluescreening the box. The handler then breaks into the kernel debugger if one is attached, or generates a STOP 0x00000080 (NMI_HARDWARE_FAILURE) blue screen if not.
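As a rough sketch of the "explicitly enable in the registry" step: the article doesn’t name the value, so this assumes the one Microsoft documents for this purpose, NMICrashDump (REG_DWORD, set to 1) under the CrashControl key. The function name is hypothetical. As far as I know, Windows 8 and Windows Server 2012 and later no longer need this value, so check the documentation for your version. As before, the script only builds the reg.exe command line rather than changing anything.

```python
# Sketch: build the reg.exe command that enables the NMI crash dump handler.
# The key path and value name come from Microsoft's documentation,
# not from this article; treat them as assumptions and verify for your OS.
CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"


def nmi_crash_dump_command(enable=True):
    """Return the reg.exe argument list that sets NMICrashDump to 1 or 0."""
    return ["reg", "add", CRASH_CONTROL_KEY,
            "/v", "NMICrashDump",
            "/t", "REG_DWORD",
            "/d", "1" if enable else "0",
            "/f"]


# Print the command instead of running it, so nothing changes by accident.
print(" ".join(nmi_crash_dump_command()))
```

A restart is needed before the setting takes effect, and how you actually trigger the NMI (a physical button, a jumper, or a management controller) depends on the server vendor.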
Now, if the NMI doesn’t work, you can be confident that something is seriously wrong with your system, and it’s probably hardware. The CPU typically has to be in an unknown state for this feature to fail. It’s time to contact your hardware vendor, and quick.
If you think no one uses this feature, you’d be surprised. A number of major server vendors do in fact ship systems with this button, but they keep it hidden (for good reason) and don’t really use it as a feature to sell the box. They consider it purely diagnostic.
Personally, I’d want every system in my server room to have this mechanism. I don’t want to sit through two or three hangs before I can even begin to troubleshoot. I want the dump captured the first time, every time.