Resource Failure

05/31/2018

Whether you are creating a resource DLL or trying to maximize the availability of a resource, it is important to understand how the Cluster service detects and responds to resource failure and how you can affect that process.

- At intervals determined by a resource's LooksAlivePollInterval property, the Cluster service checks whether the resource appears operational. The Cluster service invokes the Resource Monitor to call the LooksAlive entry point function of the resource DLL that manages the resource. This function implements resource-specific procedures to detect whether the resource is working properly and then communicates the results back through the Resource Monitor to the Cluster service. If LooksAlive reports failure, the Cluster service immediately calls the IsAlive entry point function. LooksAlive is not called for all resources: a resource DLL can be implemented in a way that prevents LooksAlive checks. For more information, see Open.
- At intervals determined by a resource's IsAlivePollInterval property, the Cluster service checks whether the resource is still operational. The Cluster service invokes the Resource Monitor to call the IsAlive entry point function of the resource DLL that manages the resource. This function implements resource-specific procedures to detect whether the resource is working properly and then communicates the results back through the Resource Monitor to the Cluster service. If IsAlive reports failure, the resource is considered failed.
- A resource DLL can be designed to report failure at any time, even in the intervals between IsAlive checks. For more information, see Open.
- When a resource has failed and its RestartAction property is set to allow a restart, the Cluster service attempts to restart the resource on the same node. The resource's RestartThreshold property determines the maximum number of restart attempts that can occur within the time period specified by the RestartPeriod property. If the Cluster service exceeds the maximum number of restart attempts within that period and the resource is still not operational, the Cluster service may attempt failover (see next bullet).
- When a resource has failed and restart attempts on the same node have also failed, its RestartAction property determines whether the group to which the resource belongs is failed over to another node (see Failover).
- If a failed resource does not fail over successfully, it remains in a failed state for an interval determined by the RetryPeriodOnFailure property. Then, if RestartAction is set to allow restarts, the Cluster service makes a final attempt to bring the resource online. If this attempt does not succeed, the resource remains in its failed state until the problem is corrected manually.
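
The interaction between RestartThreshold and RestartPeriod can be pictured with a small model. The following Python sketch is illustrative only: the class name `RestartPolicy`, the method `on_failure`, and the sliding-window accounting are assumptions made for this example, and the article does not specify exactly how the Cluster service counts attempts. The sketch also assumes that RestartAction allows both restart and failover.

```python
from collections import deque


class RestartPolicy:
    """Hypothetical model of RestartThreshold / RestartPeriod accounting."""

    def __init__(self, restart_threshold, restart_period_ms):
        self.restart_threshold = restart_threshold
        self.restart_period_ms = restart_period_ms
        self._attempts = deque()  # timestamps (ms) of recent restart attempts

    def on_failure(self, now_ms):
        """Decide what to do when the resource is reported failed at now_ms."""
        # Forget attempts that are older than the RestartPeriod window
        # (sliding window: an assumption of this sketch).
        while self._attempts and now_ms - self._attempts[0] >= self.restart_period_ms:
            self._attempts.popleft()
        if len(self._attempts) < self.restart_threshold:
            self._attempts.append(now_ms)
            return "restart"   # try again on the same node
        return "failover"      # threshold exhausted within the period


policy = RestartPolicy(restart_threshold=3, restart_period_ms=900_000)
decisions = [policy.on_failure(t) for t in (0, 1_000, 2_000, 3_000)]
```

In this model the first three failures each trigger a restart on the same node, and the fourth failure inside the same period is escalated to failover. Once earlier attempts age out of the window, restarts are allowed again.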

You can control how failure is detected and handled for a resource by adjusting the RestartThreshold, RestartPeriod, LooksAlivePollInterval, IsAlivePollInterval, RetryPeriodOnFailure, and RestartAction properties. Developers of custom resource types can implement LooksAlive and IsAlive with failure detection procedures specific to the resource being supported. For more information, see Implementing LooksAlive and Implementing IsAlive.
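
The two-level check described above (LooksAlive first, then an immediate IsAlive if LooksAlive reports failure) can be modeled in a few lines. This is a behavioral sketch in Python, not resource DLL code: `poll_once` and the two stand-in check functions are hypothetical names, and the real entry points are implemented in the resource DLL and called through the Resource Monitor.

```python
def poll_once(looks_alive, is_alive):
    """Model one LooksAlive poll.

    looks_alive and is_alive are hypothetical zero-argument callables
    standing in for the resource DLL entry points; each returns True
    when the resource passes that check.
    """
    if looks_alive():
        return "online"
    # LooksAlive reported failure, so run the thorough check right away
    # instead of waiting for the next IsAlivePollInterval.
    if is_alive():
        return "online"
    return "failed"


calls = []  # records which checks ran, in order


def cheap_check_fails():
    calls.append("LooksAlive")
    return False


def thorough_check_passes():
    calls.append("IsAlive")
    return True


state = poll_once(cheap_check_fails, thorough_check_passes)
```

Here the cheap check fails but the thorough check passes, so the resource stays online. It is considered failed only when the thorough check also fails.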

Note

Both resources and resource types have the IsAlivePollInterval and LooksAlivePollInterval properties. Resource type properties apply to all resources of that type in the cluster. Resource properties apply to an individual resource and can override resource type properties. For example, the Physical Disk resource type's IsAlivePollInterval property might be set to 300 milliseconds. All Physical Disk resources in the cluster would default to this property value. However, an administrator could set an individual Physical Disk resource's IsAlivePollInterval property to 500 milliseconds, overriding the property value of the resource type.
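
The override rule in this note amounts to a simple lookup order: use the resource's own value if one is set, otherwise fall back to the resource type's value. The sketch below assumes plain dictionaries for the property stores and uses a missing key (or `None`) to mean "not set on the resource". Both are conventions of this example, not the cluster's actual storage format, and `effective_interval` is a hypothetical helper.

```python
def effective_interval(name, resource_props, type_props):
    """Return the poll interval (ms) in effect for one resource."""
    value = resource_props.get(name)
    # None is this sketch's marker for "not set on the resource".
    return type_props[name] if value is None else value


# Values taken from the example in the note above.
physical_disk_type = {"IsAlivePollInterval": 300}
disk_default = {}                              # inherits the type value
disk_override = {"IsAlivePollInterval": 500}   # administrator override

inherited = effective_interval("IsAlivePollInterval", disk_default, physical_disk_type)
overridden = effective_interval("IsAlivePollInterval", disk_override, physical_disk_type)
```

With these inputs, `inherited` is 300 and `overridden` is 500, matching the Physical Disk example.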
