Health Monitoring and Tracking

A health model defines what it means for a system to be healthy (operating within normal conditions) or unhealthy (failed or degraded), and it defines the transitions into and out of those states. Good information about a system's health is necessary for maintaining and diagnosing running systems. The contents of the health model become the basis for the system events and instrumentation on which monitoring and automated recovery are built.
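The idea of states and transitions can be sketched as a small state machine. The following is only an illustration under stated assumptions: the names `Health`, `ALLOWED`, and `HealthTracker` are hypothetical, and the rule that a failed system must pass through the degraded state before it is considered healthy again is an arbitrary choice made to show how a model can constrain transitions.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"    # operating within normal conditions
    DEGRADED = "degraded"  # still running, but outside normal conditions
    FAILED = "failed"      # not operating


# Hypothetical transition table: which states may follow which.
# The FAILED -> DEGRADED-only rule is an illustrative assumption.
ALLOWED = {
    Health.HEALTHY: {Health.DEGRADED, Health.FAILED},
    Health.DEGRADED: {Health.HEALTHY, Health.FAILED},
    Health.FAILED: {Health.DEGRADED},
}


class HealthTracker:
    """Tracks the current health state and records each transition."""

    def __init__(self):
        self.state = Health.HEALTHY
        self.events = []  # (old_state, new_state, reason) tuples

    def transition(self, new_state, reason):
        """Move to new_state; return False if the state is unchanged."""
        if new_state == self.state:
            return False
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"transition {self.state.value} -> {new_state.value} "
                "is not defined in the model"
            )
        # Each transition becomes an event that monitoring can consume.
        self.events.append((self.state, new_state, reason))
        self.state = new_state
        return True
```

Recording each transition as an event mirrors the point that the health model is the basis for system events. A real model would usually define states and transitions per component or service rather than for the system as a whole.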

To keep an application up and running, the operations team needs to watch the application's health metrics, detect symptoms of a problem, diagnose the cause, and fix the problem before the application performs unacceptably. This is referred to as health monitoring and tracking.

To create a health model, the modeler needs to do the following:

  • Collect the right information about the application at the right time, both when it is running normally and when something fails.
  • Document all management instrumentation exposed by an application or service.
  • Document all service health states and transitions that the application can experience when running.
  • Identify the steps and determine the instrumentation (events, traces, performance counters, and Windows Management Instrumentation (WMI) objects/probes) necessary to detect, verify, diagnose, and recover from bad or degraded health states.
  • Document all dependencies, diagnostic steps, and possible recovery actions.
  • Identify which conditions will require intervention from an administrator.
  • Improve the model by incorporating feedback from customers, product support, and testing resources.
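One way to keep the documentation described above in a reviewable form is to record each health state as structured data. The sketch below assumes a simple in-memory representation. The names `StateEntry`, `needing_administrator`, and `missing_detection`, along with the sample entries, are hypothetical and do not come from any particular management tool.

```python
from dataclasses import dataclass, field


@dataclass
class StateEntry:
    """One documented health state in a (hypothetical) health model."""
    name: str
    detected_by: list  # instrumentation: events, traces, counters, probes
    diagnostic_steps: list = field(default_factory=list)
    recovery_actions: list = field(default_factory=list)
    needs_administrator: bool = False


def needing_administrator(model):
    """Names of states whose handling requires administrator intervention."""
    return [entry.name for entry in model if entry.needs_administrator]


def missing_detection(model):
    """Names of states with no instrumentation documented to detect them."""
    return [entry.name for entry in model if not entry.detected_by]


# Illustrative entries only; a real model is built from the application.
MODEL = [
    StateEntry("healthy", detected_by=["heartbeat probe succeeds"]),
    StateEntry(
        "degraded",
        detected_by=["queue length counter above threshold"],
        diagnostic_steps=["check downstream dependency"],
        recovery_actions=["restart worker process"],
    ),
    StateEntry(
        "failed",
        detected_by=[],  # gap: no instrumentation documented yet
        recovery_actions=["restore from backup"],
        needs_administrator=True,
    ),
]

print(needing_administrator(MODEL))
print(missing_detection(MODEL))
```

Helpers such as `missing_detection` make gaps in the model visible, which supports the last step in the list: improving the model as feedback arrives.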