Monitoring for reliability

Monitoring and diagnostics are crucial for resiliency. If something fails, you need to know that it failed, when it failed, and why.

Checklist

How do you monitor and measure application health?


  • The application is instrumented with semantic logs and metrics.
  • Application logs are correlated across components.
  • All components are monitored and correlated with application telemetry.
  • Key metrics, thresholds, and indicators are defined and captured.
  • A health model has been defined based on performance, availability, and recovery targets.
  • Azure Service Health events are used to alert on applicable service level events.
  • Azure Resource Health events are used to alert on resource health events.
  • Long-running workflows are monitored for failures.
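
To make a few of the checklist items concrete, here is a minimal sketch in Python using only the standard library. It shows one possible way to emit semantic (structured) logs that share a correlation ID across components, and to evaluate a simple health model against thresholds. The component names, the `fields` convention, the metric names, and the threshold values are all hypothetical and chosen for illustration. In a real deployment the logs would typically be shipped to a telemetry service rather than to an in-memory stream.

```python
import contextvars
import io
import json
import logging
import uuid

# Hypothetical context variable carrying the correlation ID of the current request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so individual fields stay queryable."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "component": record.name,
            "event": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Assumed convention: extra={"fields": {...}} adds top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(component, stream):
    """Create a logger for one component that writes JSON lines to `stream`."""
    logger = logging.getLogger(component)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def evaluate_health(metrics, thresholds):
    """Classify health as 'healthy', 'degraded', or 'unhealthy'.

    `thresholds` maps a metric name to (degraded_limit, unhealthy_limit);
    higher metric values are treated as worse.
    """
    state = "healthy"
    for name, (degraded, unhealthy) in thresholds.items():
        value = metrics[name]
        if value >= unhealthy:
            return "unhealthy"
        if value >= degraded:
            state = "degraded"
    return state


# Two components log into the same stream under one correlation ID.
stream = io.StringIO()
api_log = build_logger("api", stream)
worker_log = build_logger("worker", stream)

correlation_id.set(str(uuid.uuid4()))
api_log.info("order_received", extra={"fields": {"order_id": 42}})
worker_log.info("order_processed", extra={"fields": {"order_id": 42, "duration_ms": 180}})

records = [json.loads(line) for line in stream.getvalue().splitlines()]

# Illustrative thresholds only; real values come from your own targets.
thresholds = {"p95_latency_ms": (500, 2000), "error_rate": (0.01, 0.05)}
status = evaluate_health({"p95_latency_ms": 750, "error_rate": 0.002}, thresholds)
print(records[0]["correlation_id"] == records[1]["correlation_id"], status)
```

Because every record carries the same `correlation_id`, a query over the collected logs can reconstruct a single request's path through both components. The health function is deliberately simple: it maps captured metrics onto the states a health model would define from performance and availability targets.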

Azure services for monitoring

Reference architecture

Next step