Why is my Failover Clustering node blue screening with a Stop 0x0000009E?

John Marlin here from the Windows Cluster Support Team again and today I want to talk about the Stop 0x0000009E and hang detection in Windows Server 2008, 2008 R2, and 2012 Failover Clustering. Just to set some expectations for the blog, I am not going to tell you exactly what the problem is; rather, I am going to show you what you will see depending on the settings you have in place and what the ramifications of those settings are. Some would see this as a flaw or a problem caused by Failover Clustering, but I want to put you at ease: the blue screen is not caused by Failover Clustering. We are just reacting to a hanging or degraded condition that Windows is experiencing.

First, a brief explanation of the hang detection we have for Failover Clustering. The Cluster Service incorporates a detection mechanism that can detect unresponsiveness in user-mode components. This detection is a big deal in the high availability market; it is something no one else incorporates. The Cluster Network Driver monitors the health of the Cluster based on periodic communication between its user-mode and kernel-mode components. This periodic communication between user mode and kernel mode is a heartbeat. We track these heartbeats through what is called a watchdog timer. This “watchdog” keeps counting from a set number down to zero. If the event it is monitoring occurs before it reaches zero, it resets to the starting number and starts counting down again. If the timer reaches zero, it performs some action that has been predefined or configured.
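
To make the countdown-and-reset behavior concrete, here is a purely illustrative PowerShell sketch of the idea. It is not how the actual kernel-mode watchdog in netft.sys is implemented, and Receive-Heartbeat and Invoke-RecoveryAction are hypothetical placeholders standing in for the real heartbeat and the configured recovery action:

 # Purely illustrative sketch of the countdown-and-reset idea - NOT the real netft.sys watchdog.
 # Receive-Heartbeat and Invoke-RecoveryAction are hypothetical placeholders.
 $timeout   = 60                         # seconds; plays the role of ClusSvcHangTimeout
 $remaining = $timeout
 while ($true) {
     Start-Sleep -Seconds 1
     if (Receive-Heartbeat) {
         $remaining = $timeout           # heartbeat arrived in time, so reset the countdown
     } else {
         $remaining = $remaining - 1     # no heartbeat this second, keep counting down
         if ($remaining -le 0) {
             Invoke-RecoveryAction       # timer hit zero: log an event, terminate, or bugcheck
             $remaining = $timeout
         }
     }
 }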

From a Windows perspective, watchdog timers can detect that basic kernel or user services are not executing. Resource starvation issues (including memory leaks, lock contention, and scheduling priority misconfiguration) can block critical user-mode components without blocking deferred procedure calls (DPCs) or draining the non-paged memory pool.

Kernel components can extend watchdog timer functionality to user mode by periodically monitoring critical applications. This bug check indicates that a user-mode health check failed in a way that prevented a graceful shutdown. Taking the bug check restores critical services by restarting the node or enabling application failover to other servers.

To see your current Failover Clustering settings for these, you can run the command:

 C:\> cluster /cluster:clustername /prop
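
If the node is running Windows Server 2008 R2 or 2012, where the Failover Clustering PowerShell module is available, you should also be able to read the same properties with Get-Cluster (clustername below is a placeholder, just as in the cluster.exe examples):

 PS C:\> Import-Module FailoverClusters
 PS C:\> Get-Cluster -Name clustername | Format-List ClusSvcHangTimeout, HangRecoveryAction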

The Failover Clustering service has two properties that control this behavior:

ClusSvcHangTimeout
This property controls how long we wait between heartbeats before determining that the Cluster Service has stopped responding. The default for the ClusSvcHangTimeout is 60 seconds. If you want to change the setting, you would issue the command:

 C:\> cluster /cluster:clustername /prop ClusSvcHangTimeout=x
 * where x is in seconds <<– default is 60 seconds
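
Where the PowerShell module is available (2008 R2 and 2012), the equivalent should be a simple property assignment on the cluster object; the 60 below is just the default value, used as an example:

 PS C:\> (Get-Cluster -Name clustername).ClusSvcHangTimeout = 60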

HangRecoveryAction
This property controls the action to take if the user-mode processes have stopped responding. For the HangRecoveryAction, we have 4 different settings with 3 being the default.

0 = Disables the heartbeat and monitoring mechanism.
1 = Logs an event in the system log of the Event Viewer.
2 = Terminates the Cluster Service.
3 = Causes a Stop error (Bugcheck) on the cluster node.  <<– default for 2008

If you want to change the setting, you would issue the command:

 C:\> cluster /cluster:clustername /prop HangRecoveryAction=x
  * where x is the action to take
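
Again, where PowerShell is available, the same change should be possible through a property assignment (3, the default, is used here only as an example value):

 PS C:\> (Get-Cluster -Name clustername).HangRecoveryAction = 3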

Since HangRecoveryAction=3 (bugcheck the box) is the default, I will start with this one. This setting will actually call into Windows to bugcheck the machine and create a dump file (MEMORY.DMP). The dump file created will be based on the settings in Windows (Kernel Dump as a default). On one hand, you may ask why you would want to blue screen your box and cause a brief production outage. On the other hand, if the node is in a hung or degraded state, powering the machine off forcefully may be your only recourse in order to move the services over to another node. When hangs occur, connectivity and/or productivity can be severely impacted.
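
Because the type of dump you get is driven by the Windows crash settings, it is worth confirming what is configured before you need it. One way is to look at the standard CrashControl registry values; in this key, a CrashDumpEnabled value of 2 corresponds to the default kernel dump:

 PS C:\> Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl' | Format-List CrashDumpEnabled, DumpFile, AutoReboot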

Keep in mind the following scenario of a hung machine. If Failover Clustering detects this problem in, say, one minute and forces a failover that takes another 2 minutes to bring everything online, you have been down 3 minutes. If this were not in place and this occurred, it may take users several minutes to notice there is some sort of problem. They may wait several more minutes before calling the helpdesk to report the problem. Then the helpdesk takes several minutes to log the problem. On it goes before someone can eventually get to the machine to see what is going on. Say they go ahead and hard power off the machine to get your services back into production. What if this whole process took 45 minutes? In a company that values high availability, this additional 42 minutes could have cost you thousands or even millions of dollars!

What if it was determined that you needed to get Microsoft involved at this point? What data can you provide? In most cases of hung or degraded machines, the engineer would want the following:

  • System Event Log
  • Application Event Log
  • Performance Log (if any)
  • Pool Monitor Log (if any)
  • Dump file (if any)

If we did not have this setting, you would be left with only the event logs. If nothing there points to anything concrete, which is the case most of the time, you would need to configure the system to capture more data and wait for this to happen again. With the Failover Clustering HangRecoveryAction setting in place, you would instead have a dump file (a snapshot in time) to go through that could point out the cause of the hang so it can be corrected now.

So, say you have this problem. What is going to happen is that only the box having the issue will bugcheck and reboot. Because a reboot occurred, all resources that were present on this node are going to move to another node and come online to get you back into production. On the reboot of this node, you would see the following event in the System Event Log:

Event Type:  Information
Event ID:  1001
Source:  BugCheck
Description:  The computer has rebooted from a bugcheck.  The bugcheck was 0x0000009E (process id, timeout value, reserved, reserved).

The Stop Error values (in parentheses) will vary. These are the meanings of the entries:

process id  =  Process that failed to satisfy a health check within the configured timeout
timeout value  =  Health monitoring timeout (seconds)
reserved  =  will always be zeroes
reserved  =  will always be zeroes
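
If you want to check whether a node logged this event after it came back up, and the event is still in the System log, a query along these lines should find it (the provider name 'BugCheck' matches the Source shown above):

 PS C:\> Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='BugCheck'; Id=1001} | Format-List TimeCreated, Message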

Now that we have seen the event, let's look at a dump file. The dump file I am using is from a 64-bit machine.

 0: kd> .bugcheck
Bugcheck code 0000009E
Arguments fffffa80`0fdef7e0 00000000`0000003c 00000000`00000000 00000000`00000000

Looking at the process from the first bugcheck argument above, we can see that it is the Cluster Service.

 0: kd> !process fffffa800fdef7e0 0
PROCESS fffffa800fdef7e0
SessionId: 0  Cid: 0a40    Peb: 7fffffd8000  ParentCid: 02e8
DirBase: 2355da000  ObjectTable: fffff880089cb830  HandleCount: 4288.
Image: clussvc.exe

Looking at the thread that called the bugcheck, we see this:

 0: kd> !thread
THREAD fffff80001dc4b80  Cid 0000.0000  Teb: 0000000000000000 Win32Thread: 0000000000000000 RUNNING on processor 0
Not impersonating
DeviceMap                 fffff880000061c0
Owning Process            fffff80001dc50c0       Image:         Idle
Attached Process          fffffa80072d4110       Image:         System
Wait Start TickCount      0              Ticks: 108665 (0:00:28:15.184)
Context Switch Count      5054015
UserTime                  00:00:00.000
KernelTime                00:20:09.319
Win32 Start Address nt!KiIdleLoop (0xfffff80001caab00)
Stack Init fffff80004331db0 Current fffff80004331d40
Base fffff80004332000 Limit fffff8000432c000 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 IoPriority 0 PagePriority 0
Child-SP          RetAddr           : Args to Child             : Call Site
fffff800`04331a18 fffffa60`011d63c8 :  *** removed for space ***  : nt!KeBugCheckEx
fffff800`04331a20 fffff800`01ca88b3 :  *** removed for space ***  : netft!NetftWatchdogTimerDpc+0xb8
fffff800`04331a70 fffff800`01ca9238 :  *** removed for space ***  : nt!KiTimerListExpire+0x333
fffff800`04331ca0 fffff800`01ca9a9f :  *** removed for space ***  : nt!KiTimerExpiration+0x1d8
fffff800`04331d10 fffff800`01caab62 :  *** removed for space ***  : nt!KiRetireDpcList+0x1df
fffff800`04331d80 fffff800`01e785c0 :  *** removed for space ***  : nt!KiIdleLoop+0x62
fffff800`04331db0 00000000`fffff800 :  *** removed for space ***  : nt!zzz_AsmCodeRange_End+0x4
fffff800`0432b0b0 00000000`00000000 :  *** removed for space ***  : 0xfffff800

From a debugging perspective, all we see is that the Cluster Service timed out its health monitoring, so the watchdog called into KeBugCheckEx. One point I wanted to stress again is that even though Failover Clustering triggered the dump, it is not the cause or the focus of your problem resolution steps moving forward. There was something bad occurring with the system that we detected and reacted to. While it may appear extreme, it is one of the better options to ensure availability and faster recovery.

In dumps such as these, you would not want to focus on the Cluster Service and what it was doing, but rather approach it from a generic hang-debugging stance. Something in user mode caused the Failover Clustering service to become unresponsive, so user-mode processes and general hang debugging are your focus. For this blog, I am not going to go into debugging hang dumps. For more information on debugging hang dumps, you should visit our NTDebugging Blog site for steps, tricks, and tips. Something else to consider is that since we create a dump based on the Windows crash settings, the default of a kernel dump may or may not show you the exact cause, since user-mode address space is not kept. The crash setting of Complete Dump may need to be set for any future stop errors.
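
If you decide you need a complete dump for the next occurrence, one way to make that change is shown below. This assumes a CrashDumpEnabled value of 1 for a complete dump; keep in mind that the page file must be large enough to hold all of physical memory plus a little overhead, and the node must be rebooted for the change to take effect:

 PS C:\> Set-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl' -Name CrashDumpEnabled -Value 1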

Let’s look at what happens if you change the HangRecoveryAction to terminate the Cluster Service. If you want to change the setting, you would issue the command:

 C:\> cluster /cluster:clustername /prop HangRecoveryAction=2

If we get a hang that we detect and need to react to, we would see the following in the System Event Log.

Event ID:  4870
Source:  Microsoft-Windows-FailoverClustering
Description:  User mode health monitoring has detected that the system is not being responsive. The Failover cluster virtual adapter has lost contact with the Cluster Server process with a process ID ‘%1’, for ‘%2’ seconds. Recovery action will be taken.

* where %1 is the Process ID you would see in Task Manager
* where %2 is the value of ClusSvcHangTimeout

Event ID:  7031
Source:  Service Control Manager
Description:  The Cluster Service service terminated unexpectedly.

If you generate a Cluster Log, you would see the below:

processid:threadid GMT-time [ERR] Watchdog timer timeout for the client process (ID x) and it will terminate the client process.

* where x is the Process ID you would see in Task Manager
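
If you need to generate a Cluster Log to look for that entry, on Windows Server 2008 you would use cluster.exe (cluster log /gen), while on 2008 R2 and 2012 the PowerShell equivalent should be along these lines, with the destination folder here being only an example:

 PS C:\> Get-ClusterLog -Destination C:\Temp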

At that point, we are going to attempt to terminate the Cluster Service in order to move everything over to another node so that you can get back to production. When we are terminating the Cluster Service, taking resources offline, sending out notifications, etc., we are going to use user-mode space to accomplish some of these tasks. If you have a hang in user mode, we may not be able to complete them. The reality is that the machine is in this degraded/hung state. We are going to try to gracefully recover from this state, and if we cannot, you may be looking at having to hard power the machine off to get things properly moved over anyway.

Troubleshooting this may be more difficult, as all you would have to look through would be the Event Logs and a Cluster Log (if generated). The Cluster Log would only show you what is going on with the Cluster, so it will most likely be of no use unless there were actual resource failures prior to the termination. An example would be a File Server resource failure with an Error 1130 (not enough server storage). You would then need to review the System Event Log for any performance type errors (2019 nonpaged pool, 2020 paged pool, etc.) or whether any other services failed shortly beforehand. But even then, you are not going to find the root cause. If you want to keep this setting, you would want to look at the following:

  1. Use Task Manager to work with applications or services consuming large amounts of memory
  2. Generate a System Diagnostics Report (perfmon /report)
  3. Start Resource Monitor (perfmon /res)
  4. Open Event Viewer and view events related to failover clustering (see the example after this list)
  5. Run Performance Monitor over a longer period of time and look for anything there
  6. Any other hang-monitoring utilities you may use
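
For item 4, if you would rather pull the failover clustering events out of the System log with PowerShell instead of clicking through Event Viewer, a filter along these lines should work (the provider name is the Source shown in the events earlier in this post):

 PS C:\> Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Microsoft-Windows-FailoverClustering'} -MaxEvents 50 | Format-List TimeCreated, Id, Message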

Now, let’s look at what happens if you change the HangRecoveryAction to simply log an event. If you want to change the setting, you would issue the command:

 C:\> cluster /cluster:clustername /prop HangRecoveryAction=1

If we get a hang that we detect and need to react to, we would only see the following in the System Event Log.

Event ID: 4869
Source:  Microsoft-Windows-FailoverClustering
Description:  User mode health monitoring has detected that the system is not being responsive. The Failover cluster virtual adapter has lost contact with the ‘C:\Windows\Cluster\clussvc.exe’ process with a process ID ‘%1’, for ‘%2’ seconds. Please use Performance Monitor to evaluate the health of the system and determine which process may be negatively impacting the system.

* where %1 is the Process ID you would see in Task Manager
* where %2 is the value of ClusSvcHangTimeout

This is all we are going to do. If a hanging condition occurs over a long period, you could see this event repeat every 60 seconds (or whatever value you have set for ClusSvcHangTimeout). Since we do not react in any other way, we would basically be at the mercy of Windows and how it reacts. If it hangs, then we may or may not be able to fail anything over. If it is not affecting the Cluster Service or any resources, we would just run along like nothing is going on. We could also see problems that do affect the resources and get inadvertent failovers due to loss of communication between the nodes, resource failures, etc. Just like the prior action, you would need to:

  1. Use Task Manager to work with applications or services consuming large amounts of memory
  2. Generate a System Diagnostics Report (perfmon /report)
  3. Start Resource Monitor (perfmon /res)
  4. Open Event Viewer and view events related to failover clustering
  5. Run Performance Monitor over a longer period and look for anything there
  6. Any other hang-monitoring utilities you may use

The last action we have is to disable the health monitoring entirely. If you want to change the setting, you would issue the command:

 C:\> cluster /cluster:clustername /prop HangRecoveryAction=0

If we get a hang, then we do nothing, as we will detect nothing. Like the action of 1, we are only going to do anything if the hang causes communication issues between the nodes or causes resources to fail. We will react to that, but that would be it.

I hope that this gives you a better knowledge and understanding of this feature. Remember, just because we create a dump or terminate the service does not mean that Failover Clustering caused the issue or the downtime. On the contrary, Failover Clustering just reacted based on what the hang detection settings are, and it gets you back into production more quickly with the benefit of additional data that can be reviewed to assist in resolving the true problem. Look at this from a performance perspective and treat it as you would any other stand-alone system that has sluggishness, hangs, etc.