Crash Dump Analysis
I'm sure that many of you have had the unfortunate experience of watching the Windows Blue Screen Of Death (BSOD) while working, and have possibly lost important data. A common reaction is to blame Microsoft and continue working after the reboot as if nothing had happened. Another unfortunate experience is seeing an application crash while you are using it. In this case a window comes up asking whether you want to send the data to Microsoft for analysis (the same window also comes up after a BSOD). Many people are wary of what that data might contain, so they select "No". The goal of this post is to help you understand what is going on in the background in each of the two cases and to clear up some misconceptions.
QUESTION 1: What causes all these reboots?
First of all, because of the architecture of the Windows kernel in the NT/2000/XP/Vista series, an application cannot corrupt data that belongs to another application or to the kernel. This means that each application is isolated and cannot harm the system. The worst thing that can happen is that the application does something invalid and crashes, without any further implications for the rest of the system. On the other hand, the Windows kernel and the device drivers have unlimited access to the system. If the kernel or a driver misbehaves, it can corrupt the whole system. It follows that the cause of every blue screen lies either in the Windows kernel or in a device driver. That's why the system keeps working without a problem when an application crashes, whereas a bug in the kernel or in a device driver brings the whole system down.
Now that we've identified the possible causes of the crashes, it's time to go further. According to the reports that were sent to Microsoft until April 2004 (by all those people who pressed "Yes" when they were asked to send the data to Microsoft), the reasons for the crashes can be split as follows:
- Third-party device drivers: 70%
- Unknown, because of severe memory corruption: 15%
- Hardware error: 10%
- Microsoft code: 5%
This shows that Microsoft is not the one to blame. The main cause for these crashes is poorly written third-party (non-Microsoft) device drivers.
QUESTION 2: Why does the system crash, when there is a kernel-mode error?
So, from the above analysis it is clear that the system can survive an application crash, whereas a kernel-mode error causes it to reboot. Why can't the system continue after finding a kernel-mode error? What happens is that, while kernel-mode code (either a device driver or the Windows kernel) is executing, some discrepancy is found. For example, a pointer might be pointing to an invalid address, a data structure might have invalid values, etc. Even though this particular problem was detected, the "bad" code that caused it might have corrupted more data, possibly including basic kernel structures, so continuing would be unsafe. That's why the function KeBugCheckEx is called (inside the kernel, this type of forced crash is called a "bugcheck"). It writes logging information to the page file, paints the blue screen and shows some information about the crash on the screen.
QUESTION 3: How do we configure whether we want to send anything to Microsoft?
Even though it's not well known, you can configure what will be sent to Microsoft. Somebody might want to report only the crashes that have to do with the operating system (after the BSOD). Somebody else might want to report only the crashes from applications (either for all of them or only for some particular applications). In order to configure this, you need to go to
Control Panel | System | Advanced | Error Reporting
From there you'll be able to select which types of errors you want to report. Actually, even if you select a particular type of error to be reported, Windows will ask you again, after a program that belongs in that category crashes. So, by selecting a category, it doesn't mean that all crashes will be reported automatically.
QUESTION 4: Exactly what kind of data is sent to Microsoft?
Before answering this question, it is useful to understand what information is stored, when the system or an application crashes. Let's talk first about the files that are generated after a system crash. In order to set this, you need to go to
Control Panel | System | Advanced | Startup and Recovery -> Settings
In the "System Failure" part you'll be able to configure whether you want to write the crash to the system log, whether you want to send an administrative alert and whether you want to reboot after the crash. You also have the option of creating a memory dump, and you can select the directory in which you want to save it. There are 3 types of memory dumps:
- Small memory dump or minidump (64 KB for 32-bit systems, 128 KB for 64-bit systems): It includes a minimum amount of information about the system before the crash, e.g. the bugcheck code, the loaded drivers, information for the current process and thread, etc.
- Kernel memory dump: This includes all the kernel-mode memory that was in physical memory at the time of the crash. There is no fixed size for this dump, however it should be around 50-100 MB for most "normal" systems (with <= 2 GB of RAM).
- Full memory dump: This includes all the physical memory. The size of the file will be the same as the physical memory of the system.
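To make the three dump types a bit more concrete, here is a minimal Python sketch that translates the configured dump type into a readable name. It assumes that the setting from the dialog above is stored in the CrashDumpEnabled value under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl and that the values 0-3 have the meanings shown; that is my understanding of the XP-era layout, so please verify it on your own system. The function names are hypothetical.

```python
# Assumed meaning of the CrashDumpEnabled registry value (verify on your system).
DUMP_TYPES = {
    0: "none",
    1: "full memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
}


def describe_dump_setting(value):
    """Return a readable name for a CrashDumpEnabled value."""
    return DUMP_TYPES.get(value, "unknown (%d)" % value)


def read_dump_setting():
    """Read the configured dump type from the registry (Windows only)."""
    import winreg  # imported lazily so the rest of the sketch loads anywhere

    path = r"SYSTEM\CurrentControlSet\Control\CrashControl"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return describe_dump_setting(value)
```

On a Windows machine, read_dump_setting() should report the same choice that is selected in the "Startup and Recovery" dialog.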
Here it's worth mentioning that at the time of the crash, the information is written to the pagefile. Therefore, the pagefile must be configured to be larger than the size of the dump. This might be a problem especially in the case of the Full memory dump. In order to set the size of the pagefile, you need to go to
Control Panel | System | Advanced | Performance Settings -> Advanced | Virtual Memory -> Change
After the system reboots, the information is copied from the page file to the file that was specified above. The reason that the information is not written directly to that file is that the kernel doesn't know the root of the problem at the time of the crash, so it tries to use as few drivers as possible (the pagefile is already open and in use, so theoretically it's the safest destination).
On the other hand, in order to configure the data that is stored when an application crashes, you need to open a command prompt (or go to Start | Run) and execute drwtsn32.exe. This application is called Dr. Watson, and it lets you configure the destination file for the dump as well as the type of the dump. Here the only options are the minidump and the full memory dump. There is also an option to create an old-style NT-compatible full memory dump. In addition, the "Application Errors" list shows the application crashes that have been logged on the system. You can select any of them and click "View" if you are interested in looking at exactly what the log file includes.
So, now it's time to answer the initial question: What exactly is sent to Microsoft? The answer is that, regardless of the crash dump type that you have selected, the only thing that is sent to Microsoft is a minidump (both for kernel-mode and user-mode crashes). It would be impractical to send a 100 MB kernel-mode dump or a 1 GB full memory dump, so only the small minidump (64 KB on 32-bit systems) is sent. Apart from the minidump, the report also includes an XML file with basic information about the version of the operating system and the loaded drivers, which you can look at when you are asked whether you want to send the data to Microsoft. There is no personal or private information in it. In fact, you can open the minidumps and check the included information yourself. I'll also show a way of analyzing the minidumps and looking at the data that Microsoft has access to.
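The pagefile requirement above can be sketched as a small check in Python. This is only an illustration under stated assumptions: the size used for the kernel memory dump is simply the upper end of the rough 50-100 MB range mentioned earlier, not a guaranteed limit, and the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def estimated_dump_size(dump_type, ram_bytes, is_64bit=False):
    """Rough dump size in bytes for 'small', 'kernel' or 'full'."""
    if dump_type == "small":
        return 128 * KB if is_64bit else 64 * KB
    if dump_type == "kernel":
        # Assumption: upper end of the rough range quoted in the text.
        return 100 * MB
    if dump_type == "full":
        return ram_bytes
    raise ValueError("unknown dump type: %r" % dump_type)


def pagefile_is_large_enough(dump_type, ram_bytes, pagefile_bytes, is_64bit=False):
    """True if the pagefile is larger than the estimated dump."""
    return pagefile_bytes > estimated_dump_size(dump_type, ram_bytes, is_64bit)
```

For example, with 1 GB of RAM and a pagefile of exactly 1 GB, the check fails for a full memory dump, which is the situation the text warns about.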
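Before opening a dump in a debugger, you can already tell what kind of file it is from its first bytes. The Python sketch below relies on the signatures as I understand them: user-mode minidumps start with "MDMP", while kernel-mode dumps start with "PAGEDUMP" (32-bit) or "PAGEDU64" (64-bit). Treat these as assumptions and the function names as hypothetical.

```python
def classify_dump(header):
    """Classify a dump from the first 8 bytes of the file."""
    if header[:4] == b"MDMP":
        return "user-mode minidump"
    if header[:8] == b"PAGEDU64":
        return "64-bit kernel-mode dump"
    if header[:8] == b"PAGEDUMP":
        return "32-bit kernel-mode dump"
    return "unknown"


def classify_dump_file(path):
    """Read the first 8 bytes of the file at 'path' and classify it."""
    with open(path, "rb") as dump_file:
        return classify_dump(dump_file.read(8))
```

Note that the kernel-mode signature only tells you that the file is a kernel-mode dump; it does not distinguish a small memory dump from a kernel or full memory dump.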
QUESTION 5: What does Microsoft do with this information?
When the minidump is received by Microsoft, it goes through some preprocessing and is stored on a server. If many minidumps that seem to have the same problem are received, a team analyzes them and finds the root of the problem. Afterwards, a webpage is created that explains the cause of the problem. Most often it points to the driver causing the problem and gives a link to the manufacturer's webpage, so that a new version can be downloaded (if it exists). After the webpage is created, anybody who submits a dump with the same problem is shown the corresponding webpage, which helps them find the solution. Therefore, somebody who clicks "Yes" when asked to submit a crash dump might either find the solution to the problem right away or help Microsoft find the solution and present it to users in the future.
QUESTION 6: How can we analyze a crash dump?
Fortunately, there is a tool that can be used to analyze both the user-mode and the kernel-mode crash dumps: windbg. Microsoft has included the dump analysis algorithms in windbg, so in some basic cases it's easy to find the cause of the problem. Of course, there are many cases in which several data structures are corrupted and it's impossible to pinpoint the problem automatically. In those cases, more advanced manual methods are used by the Online Crash Analysis (OCA) team at Microsoft.
In order to perform the analysis yourself (either because you are unable to submit the data or because there is no answer on Microsoft's website), you need to open windbg, go to "File" | "Open Crash Dump" and select the dump file that you want to analyze. As I wrote in my previous post, you need to set the path to the symbol files and reload them. If this step is omitted, it won't be possible to analyze the file.
The next step is to execute the command (this might take some time):
!analyze -v
At the top of the output you'll see the bugcheck code, its description and some additional information (e.g. the address of the invalid memory that was accessed, whether it was a Read or a Write operation, etc). Further down you'll see the call stack of the dump under the title STACK_TEXT. This includes all the functions that were active when the crash occurred. The function at the top is the most recent one (it was called by the function below it, which was called by the function below it, etc), whereas the function at the bottom is the oldest function in the stack. The system crashed because one of these functions did something invalid (e.g. passed or received an invalid argument that forced it to perform an invalid operation). Of course, it's possible that the data was corrupted by another function that is not in the call stack. Fortunately, windbg has already done an automated analysis and points to the module that most probably caused the crash. You should look at the following fields:
- SYMBOL_NAME: Exactly where the invalid operation was caused (module + function)
- MODULE_NAME: The name of the module that caused the crash
- IMAGE_NAME: The file, in which the problematic code resides
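If you save the output of !analyze to a text file (for example through a windbg log file), the three fields above are easy to pull out with a script. Here is a minimal Python sketch; it assumes that each field appears on its own line as NAME: value, and the sample values in the usage note are made up.

```python
import re

FIELDS = ("SYMBOL_NAME", "MODULE_NAME", "IMAGE_NAME")


def extract_fields(analysis_text):
    """Return a dict with the first value found for each field of interest."""
    result = {}
    for line in analysis_text.splitlines():
        match = re.match(r"^\s*([A-Z_]+):\s*(\S.*)$", line)
        if match and match.group(1) in FIELDS:
            # Keep only the first occurrence of each field.
            result.setdefault(match.group(1), match.group(2).strip())
    return result
```

For a log that contains the (hypothetical) line "IMAGE_NAME:  problematic_driver.sys", the returned dictionary maps "IMAGE_NAME" to "problematic_driver.sys". Lines such as "STACK_TEXT:" that carry no value on the same line are ignored.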
In order to find more information about the problematic code you can execute:
lm kv m MODULE_NAME*
For example, if MODULE_NAME is problematic_driver, you should execute:
lm kv m problematic_driver*
lm stands for "list modules", k stands for "kernel modules", v stands for verbose and m stands for "match".
Another option is to take the problematic file name from the IMAGE_NAME field, search for it on the hard drive and either look at its properties, in order to identify its manufacturer, or search for it on the internet. Afterwards, you might need to update the buggy driver.
Of course, it's possible that windbg's automatic analysis was not able to pinpoint the faulting driver. The reason might be that the call stack was corrupted, that some important code or data structure was overwritten, that there was a memory leak, etc. In that case, you will have to continue manually in order to detect the problem. From this point on there is no automated way to proceed, so I can only provide a few useful commands that might help you find more information about your system.
First you can execute
!process 0 0
The first command prints information about all the running processes. This way you might be able to find a suspicious process. The second command shows information about the current process. If you execute
then you'll see more information about the particular process. You can find the addresses of the processes from the "!process 0 0" command. The process information includes information about its threads. You can find more information about each thread by executing
and if the thread belongs to a driver with pending IRPs, then you can find more information about them by executing
Also, as I wrote above, in order to see all the loaded modules you can try
lm kv
In order to find more information about the used memory (and possibly detect memory leaks), you can execute:
!vm and !memusage
Finally, it's possible that the system hangs and does not crash. In order to debug it, you need to force a crash. One way to do that is to go to the registry key HKLM\System\CurrentControlSet\Services\i8042prt\Parameters and set the value CrashOnCtrlScroll to 1. This works only for PS/2 keyboards (not for USB). When the system hangs, you can keep the right control key pressed and press the scroll lock key twice. This will cause a crash, which you will be able to debug using windbg.
A useful command in that case is:
!locks (prints the locks, which are currently held, provided that there is at least one additional thread waiting on them).
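If you need to enable the keyboard-initiated crash on several machines, it can be handy to generate a .reg file instead of editing the registry by hand. The Python sketch below only builds the text of such a file for the CrashOnCtrlScroll setting described above. The function name is hypothetical, and I'm assuming the usual "Windows Registry Editor Version 5.00" file format; a restart is normally needed before the setting takes effect.

```python
def crash_on_ctrl_scroll_reg(enabled=True):
    """Return the text of a .reg file that sets CrashOnCtrlScroll to 1 or 0."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters]",
        '"CrashOnCtrlScroll"=dword:%08x' % (1 if enabled else 0),
        "",
    ]
    return "\r\n".join(lines)
```

You can write the returned text to a file such as crash_on_ctrl_scroll.reg (a made-up name) and import it by double-clicking it on the target machine.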
Another tool that can help you if !analyze cannot find the root cause is the Driver Verifier. This tool enables additional system checks that make the system crash immediately when a driver does something invalid, without allowing it to corrupt more data. This way, the crash dump will point directly to the driver. In order to execute it, you need to open a command prompt with administrative privileges (or go to Start | Run) and execute "verifier.exe".
There are some small differences between the User Interface in Windows 2000, Windows XP and Vista, so I'll explain the Windows XP interface. What you'll see is a window and some tasks that you're called to select. You need to select the task "Create custom settings (for code developers)", and then "Select individual settings from a full list". After that you'll see a screen with the following options:
- Special pool: This option forces the memory allocation routines to operate on a special pool. For example, if a driver wants to allocate 100 bytes, it is given a pointer to the last 100 bytes of an otherwise unused page. The rest of the page is filled with a specific signature. Also, the pages before and after this particular page are marked invalid. So, if the driver tries to write something after the end of the allocated space, there will be a page fault and the Driver Verifier will crash the system immediately. If the driver writes before the beginning of the allocated space, then when it frees the memory the Driver Verifier will check the signature of the page, find that it's invalid and crash the system. The crash dump that is generated from this crash will point directly to the faulting driver.
- Pool tracking: Each space allocation is marked with a special tag that is different for each driver. When the driver is unloaded, the Driver Verifier will check for the corresponding tags and if it finds any, then this means that the driver has a memory leak, so the system will crash.
- Force IRQL checking: Whenever a driver raises the IRQL to DPC/dispatch level or above, the Driver Verifier will cause all the pageable memory to be paged out to disk. So, if the driver tries to access this memory, the system will crash with a bugcheck code equal to IRQL_NOT_LESS_OR_EQUAL.
- I/O verification: All the IRPs are allocated from a special pool. If any of them is completed with an invalid I/O status, then the system crashes.
- Enhanced I/O verification (used to be "I/O Verification level 2" in Windows 2000): This includes even deeper tests for the IRPs. The I/O manager checks whether the drivers complete asynchronous IRPs correctly, whether they manage the device stack locations correctly and whether they delete the device objects only once.
- DMA checking: The I/O manager makes sure that all the drivers configure the DMA operations correctly, otherwise it crashes the system.
- Deadlock detection: This option enables deadlock detection. When a deadlock is detected, the system crashes and you can use the !deadlock command from windbg to find more information about what is causing it.
- Low Resources simulation: 7 minutes after the boot completes (so all the drivers have been loaded), the I/O manager starts failing random memory allocations. This way, if a driver doesn't check the status of a memory allocation, the system will crash.
- Disk Integrity Verification (only in Windows Server 2003): Windows keeps checksums of written data, so after each read it checks whether the data is still valid. If there is a discrepancy, the system crashes.
- SCSI Verification (included automatically, when a SCSI miniport driver is monitored): It includes some additional SCSI-related checks.
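To illustrate the idea behind the "Special pool" option from the list above, here is a toy model in Python. It is in no way the real implementation: the page size, the signature byte and all the names are assumptions chosen for the example, and a Python exception stands in for the bugcheck.

```python
PAGE_SIZE = 4096
FILL = 0xA5  # hypothetical signature byte


class BugCheck(Exception):
    """Stands in for the crash that the Driver Verifier would force."""


class SpecialPoolPage:
    def __init__(self, size):
        if not 0 < size <= PAGE_SIZE:
            raise ValueError("size must fit in one page")
        # The allocation is placed at the very end of the page.
        self.start = PAGE_SIZE - size
        self.page = bytearray([FILL]) * PAGE_SIZE

    def write(self, offset, data):
        """Write 'data' at 'offset' relative to the start of the allocation."""
        begin = self.start + offset
        end = begin + len(data)
        if begin < 0 or end > PAGE_SIZE:
            # The neighbouring pages are invalid: immediate page fault.
            raise BugCheck("access outside the page")
        self.page[begin:end] = data

    def free(self):
        """Check the signature in front of the allocation when it is freed."""
        if any(byte != FILL for byte in self.page[:self.start]):
            raise BugCheck("signature corrupted before the allocation")
```

In this model, writing one byte past the end of a 100-byte allocation fails immediately, whereas a write just before the allocation is only noticed when free() is called, which mirrors the two cases described for the special pool.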
You should select all of the existing options apart from "Low Resources Simulation" (because it involves very heavy operations and the system will be really slow). In the next screen, you need to select the drivers that you want to debug. You should start by selecting the drivers that you consider suspicious (either because windbg pointed to them or because the crashes started after you installed them, etc). Reboot, run the system for some time and observe its behaviour. If the system continues crashing, but not because of the drivers that you specified (i.e. the crash dump files are still vague and don't pinpoint a specific driver), the next step is to select all the unsigned drivers. If you still don't see any result, the next step is to enable the Driver Verifier for bigger groups of drivers, until you find the buggy one.
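The same selection can also be made from the command line, because verifier.exe accepts a /flags bit mask and a /driver list. The Python sketch below builds such a command with everything enabled except the low resources simulation. The bit values are the ones I remember from Microsoft's documentation, so treat them as an assumption and check them against "verifier /?" on your system; the driver name is hypothetical.

```python
# Assumed bit values for the verifier /flags option (verify with "verifier /?").
VERIFIER_FLAGS = {
    "special_pool": 0x01,
    "force_irql_checking": 0x02,
    "low_resources_simulation": 0x04,
    "pool_tracking": 0x08,
    "io_verification": 0x10,
    "deadlock_detection": 0x20,
    "enhanced_io_verification": 0x40,
    "dma_checking": 0x80,
}


def verifier_command(drivers, exclude=("low_resources_simulation",)):
    """Build a verifier.exe command line for the given driver file names."""
    mask = 0
    for name, bit in VERIFIER_FLAGS.items():
        if name not in exclude:
            mask |= bit
    return "verifier /flags 0x%X /driver %s" % (mask, " ".join(drivers))
```

Calling verifier_command(["problematic_driver.sys"]) returns "verifier /flags 0xFB /driver problematic_driver.sys". As with the graphical interface, the settings take effect after a reboot.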
Another useful tool for debugging memory leaks is poolmon, which is part of the Windows Support Tools and can be found here (for Windows XP). This tool displays information about memory allocations (both from paged pool and from non-paged pool), as well as discrepancies between allocations and deallocations. You just execute the tool from a command prompt (there is no User Interface) and select the types of memory pool that you want to look at. If the amount of allocated memory increases constantly, this means that there is a possible memory leak. You can find an overview here and a more detailed explanation here.
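The "increases constantly" rule can be applied to a series of snapshots with a few lines of Python. In the sketch below I assume that you have already collected, by whatever means, a list of snapshots that map each pool tag to the number of bytes currently allocated under it. The function name and the tags in the usage note are made up.

```python
def growing_tags(samples):
    """Return the tags whose usage grows strictly from each sample to the next.

    'samples' is a list of dicts (pool tag -> bytes in use), oldest first.
    """
    if len(samples) < 2:
        return []
    suspects = []
    for tag in samples[0]:
        series = [sample.get(tag) for sample in samples]
        if None in series:
            continue  # the tag is missing from some snapshot, so skip it
        if all(earlier < later for earlier, later in zip(series, series[1:])):
            suspects.append(tag)
    return sorted(suspects)
```

For example, given three snapshots in which the tag "Leak" goes from 1000 to 2000 to 3500 bytes while "Okay" goes from 500 to 400 to 500, only "Leak" is reported. A growing tag is just a hint, of course: a driver may legitimately allocate more memory while the load on the system increases.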