Health modeling for mission-critical workloads

Monitoring applications and infrastructure is an important part of any deployment. For a mission-critical workload, monitoring is a critical part of the deployment. Monitoring application health and key metrics of Azure resources helps you understand whether the environment is working as expected.

Fully understanding these metrics and evaluating the overall health of a workload requires a holistic view of all of the monitored data. A health model assists with evaluating the overall health status by displaying a clear indication of the health of the workload instead of raw metrics. The status is often presented as "traffic light" indicators such as red, yellow, or green. A health model with clear indicators makes it intuitive for an operator to understand the overall health of the workload and respond quickly to issues that arise.
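As a rough illustration of how raw signals roll up into such an indicator, the following minimal C# sketch (illustrative names only, not part of the reference implementation) treats a node in the model as being only as healthy as its least healthy dependency:

using System.Collections.Generic;
using System.Linq;

// Illustrative only: health states ordered so that a higher value is worse.
public enum HealthStatus { Green = 0, Yellow = 1, Red = 2 }

public record HealthNode(string Name, HealthStatus OwnStatus, IReadOnlyList<HealthNode> Children)
{
    // A node's effective status is the worst status reported by itself or any of its children.
    public HealthStatus EffectiveStatus =>
        Children.Select(child => child.EffectiveStatus)
                .Append(OwnStatus)
                .Max();
}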

Health modeling can be expanded into the following operational tasks for the mission-critical deployment:

  • Application Health Service - Application component on the compute cluster that provides an API to determine the health of a stamp.

  • Monitoring - Collection of performance and application counters that evaluate the health and performance of the application and infrastructure.

  • Alerting - Actionable alerts of issues or outages in the infrastructure and application.

  • Failure analysis - Breakdown and analysis of any failures including documentation of root cause.

These tasks make up a comprehensive health model for the mission-critical infrastructure. Development of a health model can and should be an exhaustive and integral part of any mission-critical deployment.

For more information, see Health modeling and observability of mission-critical workloads on Azure.

Application Health Service

The Application Health Service (HealthService) is an application component that resides with the Catalog Service (CatalogService) and the Background Processor (BackgroundProcessor) within the compute cluster. The HealthService provides a REST API for Azure Front Door to call to determine the health of a stamp. The HealthService is a complex component because it reflects the state of its dependencies, in addition to its own.

When the compute cluster is down, the health service won't respond. When the services are up and running, the HealthService performs periodic checks against the following components in the infrastructure:

  • It attempts to do a query against Azure Cosmos DB.

  • It attempts to send a message to Event Hubs. The message is filtered out by the background worker.

  • It looks up a state file in the storage account. This file can be used to turn off a region even while the other checks still report healthy, and it can be used to communicate with other processes. For example, if the stamp is to be vacated for maintenance purposes, the file can be deleted to force an unhealthy state and reroute traffic.

  • It queries the health model to determine if all operational metrics are within the predetermined thresholds. When the health model indicates the stamp is unhealthy, traffic shouldn't be routed to the stamp even though the other tests the HealthService performs return successfully. The Health Model takes a more complete view of the health status into account.

All health check results are cached in memory for a configurable number of seconds, by default 10. This operation does potentially add a small latency in detecting outages, but it ensures not every HealthService query requires backend calls, thus reducing load on the cluster and downstream services.

This caching pattern is important, because the number of HealthService queries grows significantly when using a global router like Azure Front Door: Every edge node in every Azure datacenter that serves requests will call the Health Service to determine if it has a functional backend connection. Caching the results reduces extra cluster load generated by health checks.
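To make the shape of this probe concrete, the following minimal sketch shows a health endpoint of the kind Front Door could call, assuming an ASP.NET Core minimal API on .NET 6 or later with implicit usings. The route, names, and placeholder check are illustrative and not the reference implementation:

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Front Door probes this endpoint: HTTP 200 keeps the stamp in rotation,
// while an error status (here 503) takes it out of rotation.
app.MapGet("/health/stamp", async () =>
{
    bool healthy = await RunAllChecksAsync(); // placeholder for the dependency checks listed above
    return healthy
        ? Results.Ok("HEALTHY")
        : Results.StatusCode(StatusCodes.Status503ServiceUnavailable);
});

app.Run();

// Placeholder: the real implementation runs the Cosmos DB, Event Hubs, blob, and health model checks.
static Task<bool> RunAllChecksAsync() => Task.FromResult(true);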

Configuration

The HealthService and the CatalogService share most configuration settings, except for the following settings, which are used exclusively by the HealthService:

| Setting | Description |
| --- | --- |
| HealthServiceCacheDurationSeconds | Controls the expiration time of the memory cache, in seconds. |
| HealthServiceStorageConnectionString | Connection string for the storage account where the status file should be present. |
| HealthServiceBlobContainerName | Storage container where the status file should be present. |
| HealthServiceBlobName | Name of the status file. The health check looks for this file. |
| HealthServiceOverallTimeoutSeconds | Timeout for the whole check. Defaults to 3 seconds. If the check doesn't finish in this interval, the service reports unhealthy. |
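For illustration only, these settings could be represented by an options class like the following. The class name and the binding call are assumptions, not the actual implementation:

// Hypothetical options class mirroring the settings above.
public class HealthServiceOptions
{
    public int HealthServiceCacheDurationSeconds { get; set; } = 10;
    public string HealthServiceStorageConnectionString { get; set; } = "";
    public string HealthServiceBlobContainerName { get; set; } = "";
    public string HealthServiceBlobName { get; set; } = "";
    public int HealthServiceOverallTimeoutSeconds { get; set; } = 3;
}

// Binding in Program.cs could then look like:
// builder.Services.Configure<HealthServiceOptions>(builder.Configuration);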

Implementation

All checks are performed asynchronously and in parallel. If any of them fails, the whole stamp is considered unavailable.

Check results are cached in memory, using the standard, non-distributed ASP.NET Core MemoryCache. Cache expiration is controlled by SysConfig.HealthServiceCacheDurationSeconds and is set to 10 seconds by default.

Note

The SysConfig.HealthServiceCacheDurationSeconds configuration setting reduces the additional load generated by health checks, because not every request results in a downstream call to the dependent services.
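A condensed sketch of that pattern follows: the checks are supplied as delegates, run in parallel with Task.WhenAll, the stamp is treated as unhealthy if any check fails, and the aggregated result is cached in IMemoryCache. Type and member names are illustrative, not the actual reference code:

using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class StampHealthChecker // illustrative name
{
    private readonly IMemoryCache _cache;
    private readonly TimeSpan _cacheDuration;    // from SysConfig.HealthServiceCacheDurationSeconds
    private readonly Func<Task<bool>>[] _checks; // Cosmos DB, Event Hubs, blob state file, health model

    public StampHealthChecker(IMemoryCache cache, TimeSpan cacheDuration, Func<Task<bool>>[] checks)
    {
        _cache = cache;
        _cacheDuration = cacheDuration;
        _checks = checks;
    }

    public async Task<bool> IsStampHealthyAsync()
    {
        return await _cache.GetOrCreateAsync("StampHealth", async entry =>
        {
            // Cache the aggregated result so that not every probe triggers downstream calls.
            entry.AbsoluteExpirationRelativeToNow = _cacheDuration;

            // Run all dependency checks asynchronously and in parallel.
            bool[] results = await Task.WhenAll(_checks.Select(check => check()));

            // The stamp counts as healthy only if every single check succeeds.
            return results.All(ok => ok);
        });
    }
}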

The following table details the health checks for the components in the infrastructure:

| Component | Health check |
| --- | --- |
| Storage account blob | The blob check currently serves two purposes: (1) test that Blob Storage is reachable, because the storage account is used by other components in the stamp and is considered a critical resource; and (2) manually "turn off" a region by deleting the state file. A design decision was made that the check only looks for the presence of a state file in the specified blob container; the content of the file isn't processed. A more sophisticated system could read the content of the file and return a different status based on it, for example HEALTHY, UNHEALTHY, or MAINTENANCE. Removal of the state file disables the stamp, so ensure the health file is present after deploying the application. Absence of the health file causes the service to always respond with UNHEALTHY, and Front Door won't recognize the backend as available. The file is created by Terraform and should be present after the infrastructure deployment. (A sketch of this probe follows the table.) |
| Event Hubs | Event Hubs health reporting is handled by the EventHubProducerService. This service reports healthy if it's able to send a new message to Event Hubs. For filtering, this message has an identifying property added to it: HEALTHCHECK=TRUE. The message is ignored on the receiving end. The AlwaysOn.BackgroundProcessor.EventHubProcessorService.ProcessEventHandlerAsync() method checks for the HEALTHCHECK property. |
| Azure Cosmos DB | Azure Cosmos DB health reporting is handled by the CosmosDbService, which reports healthy if it's (1) able to connect to the Azure Cosmos DB database and perform a query, and (2) able to write a test document to the database. The test document has a short Time-to-Live set; Azure Cosmos DB automatically removes it. The HealthService performs two separate probes: if Azure Cosmos DB is in a state where reads work and writes don't, the two probes ensure an alert is triggered. |
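As a rough sketch, the blob and Event Hubs probes described in the table could look like the following with the Azure.Storage.Blobs and Azure.Messaging.EventHubs client libraries. Connection strings and names would come from the configuration settings listed earlier; the actual reference code differs:

using System;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;
using Azure.Storage.Blobs;

public static class DependencyProbes // illustrative
{
    // Blob state file probe: healthy only if the file exists.
    // Deleting the file forces the stamp to report unhealthy and takes it out of rotation.
    public static async Task<bool> StateFileExistsAsync(string connectionString, string containerName, string blobName)
    {
        var blob = new BlobClient(connectionString, containerName, blobName);
        return await blob.ExistsAsync();
    }

    // Event Hubs probe: healthy if a message marked with HEALTHCHECK=TRUE can be sent.
    // The BackgroundProcessor filters these messages out on the receiving side.
    public static async Task<bool> CanSendHealthCheckMessageAsync(string connectionString, string eventHubName)
    {
        await using var producer = new EventHubProducerClient(connectionString, eventHubName);

        var message = new EventData(BinaryData.FromString("health probe"));
        message.Properties["HEALTHCHECK"] = "TRUE";

        try
        {
            await producer.SendAsync(new[] { message });
            return true;
        }
        catch
        {
            return false;
        }
    }
}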

Azure Cosmos DB queries

For the read-only probe, the following query is used. It doesn't fetch any data and doesn't have a large effect on overall load:

SELECT GetCurrentDateTime()

The write query creates a dummy ItemRating with minimum content:

var testRating = new ItemRating()
{
    Id = Guid.NewGuid(),
    CatalogItemId = Guid.NewGuid(), // Create some random (=non-existing) item id
    CreationDate = DateTime.UtcNow,
    Rating = 1,
    TimeToLive = 10 // will be auto-deleted after 10sec
};

await AddNewRatingAsync(testRating);
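For illustration, the read probe could be issued through the Microsoft.Azure.Cosmos SDK roughly like this. This is a sketch only; the class, method, and variable names are assumptions:

using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class CosmosDbProbe // illustrative
{
    // Read probe: run the lightweight query and drain the (single) result page.
    // Any exception (timeout, unavailability, authorization failure) marks the check as unhealthy.
    public static async Task<bool> CanQueryAsync(Container container)
    {
        try
        {
            FeedIterator<object> iterator = container.GetItemQueryIterator<object>("SELECT GetCurrentDateTime()");
            while (iterator.HasMoreResults)
            {
                await iterator.ReadNextAsync();
            }
            return true;
        }
        catch
        {
            return false;
        }
    }
}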

Monitoring

Azure Log Analytics is used as the central store for logs and metrics of all application and infrastructure components. Azure Application Insights is used for all application monitoring data. Each stamp in the infrastructure has a dedicated Log Analytics workspace and Application Insights instance. A separate Log Analytics workspace is used for the globally shared resources such as Front Door and Azure Cosmos DB.

All stamps are short-lived and continuously replaced with each new release. The per-stamp Log Analytics workspaces are deployed as global resources in a separate monitoring resource group, apart from the other stamp resources. These resources don't share the lifecycle of a stamp.

For more information, see Unified data sink for correlated analysis.

Monitoring: Data sources

  • Diagnostic settings: All Azure services used for Azure Mission-Critical are configured to send all their diagnostic data, including logs and metrics, to the deployment-specific (global or stamp) Log Analytics workspace. This process happens automatically as part of the Terraform deployment. New options are identified automatically and added as part of terraform apply.

  • Kubernetes monitoring: Diagnostic settings are used to send AKS logs and metrics to Log Analytics. AKS is configured to use Container Insights. Container Insights deploys the OMSAgentForLinux via a Kubernetes DaemonSet on each node in the AKS clusters. The OMSAgentForLinux collects extra logs and metrics from within the Kubernetes cluster and sends them to its corresponding Log Analytics workspace. These extra logs and metrics contain more granular data about pods, deployments, services, and the overall cluster health. To gain more insights from components like ingress-nginx, cert-manager, and others deployed to Kubernetes alongside the mission-critical workload, it's possible to use Prometheus scraping. Prometheus scraping configures the OMSAgentForLinux to scrape Prometheus metrics from various endpoints within the cluster.

  • Application Insights telemetry: Application Insights is used to collect telemetry data from the application. The code has been instrumented to collect data on the performance of the application with the Application Insights SDK. Critical information, such as the resulting status code and duration of dependency calls, and counters for unhandled exceptions, is collected. This information is used in the Health Model and is available for alerting and troubleshooting (see the sketch after this list).
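As a minimal sketch of such instrumentation: the registration call is the standard Application Insights SDK for ASP.NET Core, while the class and event names below are illustrative placeholders:

using System.Collections.Generic;
using Microsoft.ApplicationInsights;

// In Program.cs, registering the SDK enables automatic collection of requests,
// dependency calls (including status codes and durations), and unhandled exceptions:
//   builder.Services.AddApplicationInsightsTelemetry();

// Custom signals can be emitted through the injected TelemetryClient:
public class CatalogTelemetry // illustrative name
{
    private readonly TelemetryClient _telemetry;

    public CatalogTelemetry(TelemetryClient telemetry) => _telemetry = telemetry;

    public void TrackItemCreated(string itemId) =>
        _telemetry.TrackEvent("CatalogItemCreated",
            new Dictionary<string, string> { ["itemId"] = itemId });
}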

Monitoring: Application Insights availability tests

To monitor the availability of the individual stamps and the overall solution from an outside point of view, Application Insights Availability Tests are set up in two places:

  • Regional availability tests: These tests are set up in the regional Application Insights instances and are used to monitor the availability of the stamps. These tests target the clusters and the static storage accounts of the stamps directly. To call the ingress points of the clusters directly, requests need to carry the correct Front Door ID header; otherwise they're rejected by the ingress controller (see the sketch after this list).

  • Global availability test: These tests are set up in the global Application Insights instance and are used to monitor the availability of the overall solution by pinging Front Door. Two tests are used: One to test an API call against the CatalogService and one to test the home page of the website.
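As referenced in the regional tests above, a direct probe against a stamp has to present the Front Door ID header. A rough sketch of such a call follows; the header name X-Azure-FDID is the standard Front Door identifier header, and the URL and ID values are placeholders:

using System.Net.Http;
using System.Threading.Tasks;

public static class StampProbe // illustrative
{
    // Calls the cluster ingress directly, presenting the Front Door ID the ingress controller expects.
    // Without the correct header value, the ingress rejects the request.
    public static async Task<bool> ProbeStampDirectlyAsync(HttpClient client, string stampUrl, string frontDoorId)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, stampUrl);
        request.Headers.Add("X-Azure-FDID", frontDoorId);

        using HttpResponseMessage response = await client.SendAsync(request);
        return response.IsSuccessStatusCode;
    }
}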

Monitoring: Queries

Azure Mission-Critical uses Kusto Query Language (KQL) queries, implemented as functions, to retrieve data from Log Analytics. These queries are stored as individual files in our code repository, separated for global and stamp deployments. They're imported and applied automatically via Terraform as part of each infrastructure pipeline run.

This approach separates the query logic from the visualization layer. The Log Analytics queries are called either directly from code, for example from the HealthService API, or from a visualization tool such as Azure Dashboards, Azure Monitor Workbooks, or Grafana.
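For illustration, calling one of these saved query functions from .NET could look roughly like the following, using the Azure.Monitor.Query client library. The workspace ID and the function name StampHealthScore are placeholders:

using System;
using System.Threading.Tasks;
using Azure;
using Azure.Identity;
using Azure.Monitor.Query;
using Azure.Monitor.Query.Models;

public static class HealthQueries // illustrative
{
    // Runs a Log Analytics function (saved KQL query) and returns the raw result.
    public static async Task<LogsQueryResult> GetStampHealthAsync(string workspaceId)
    {
        var client = new LogsQueryClient(new DefaultAzureCredential());

        Response<LogsQueryResult> result = await client.QueryWorkspaceAsync(
            workspaceId,
            "StampHealthScore | limit 100",
            new QueryTimeRange(TimeSpan.FromMinutes(10)));

        return result.Value;
    }
}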

Monitoring: Visualization

For visualizing the results of our Log Analytics health queries, we've used Grafana in our reference implementation. Grafana is used to show the results of Log Analytics queries and doesn't contain any logic itself. The Grafana stack isn't part of the solution's deployment lifecycle, but released separately.

For more information, see Visualization.

Alerting

Alerts are an important part of the overall operations strategy. Proactive monitoring, such as the use of dashboards, should be combined with alerts that raise immediate attention to issues.

These alerts form an extension of the health model, by alerting the operator to a change in health state, either to a degraded/yellow state or to an unhealthy/red state. By setting the alert to the root node of the Health Model, the operator is immediately aware of any business-level effect on the state of the solution: after all, this root node turns yellow or red if any of the underlying user flows or resources report yellow or red metrics. The operator can direct their attention to the Health Model visualization for troubleshooting.

For more information, see Automated incident response.

Failure analysis

Composing the failure analysis is mostly a theoretical planning exercise. This theoretical exercise should be used as input for the automated failure injections that are part of the continuous validation process. By simulating the failure modes defined here, we can validate the resiliency of the solution against these failures to ensure they won't lead to outages.

The following table lists example failure cases of the various components of the Azure Mission-Critical reference implementation.

| Service | Risk | Impact/Mitigation/Comment | Outage |
| --- | --- | --- | --- |
| Microsoft Entra ID | Microsoft Entra ID becomes unavailable. | Currently no possible mitigation in place. A multi-region approach won't mitigate any outages because it's a global service. This service is a hard dependency. Microsoft Entra ID is used for control plane operations like the creation of new AKS nodes, pulling container images from ACR, or accessing Key Vault on pod startup. It's expected that existing, running components should be able to keep running when Microsoft Entra ID experiences issues. It's likely that new pods or AKS nodes will be unable to spawn. If scale operations are required during this time, it could lead to a decreased user experience and potentially to outages. | Partial |
| Azure DNS | Azure DNS becomes unavailable and DNS resolution fails. | If Azure DNS becomes unavailable, the DNS resolution for user requests and between different components of the application will likely fail. Currently no possible mitigation in place for this scenario. A multi-region approach won't mitigate any outages because it's a global service. Azure DNS is a hard dependency. External DNS services as backup wouldn't help, since all the PaaS components used rely on Azure DNS. Bypassing DNS by switching to IP addresses isn't an option, because Azure services don't have static, guaranteed IP addresses. | Full |
| Front Door | General Front Door outage. | If Front Door goes down entirely, there's no mitigation. This service is a hard dependency. | Yes |
| Front Door | Routing/frontend/backend configuration errors. | Can happen due to mismatch in configuration when deploying. Should be caught in testing stages. Frontend configuration with DNS is specific to each environment. Mitigation: Rolling back to the previous configuration should fix most issues. Because changes take a couple of minutes to deploy in Front Door, they will cause an outage. | Full |
| Front Door | Managed TLS/SSL certificate is deleted. | Can happen due to mismatch in configuration when deploying. Should be caught in testing stages. Technically the site would still work, but TLS/SSL certificate errors will prevent users from accessing it. Mitigation: Re-issuing the certificate can take around 20 minutes, plus fixing and re-running the pipeline. | Full |
| Azure Cosmos DB | Database/collection is renamed. | Can happen due to mismatch in configuration when deploying – Terraform would overwrite the whole database, which could result in data loss. Data loss can be prevented by using database/collection level locks. The application won't be able to access any data. App configuration needs to be updated and pods restarted. | Yes |
| Azure Cosmos DB | Regional outage | Azure Mission-Critical has multi-region writes enabled. If there's a failure on a read or write, the client retries the current operation. All future operations are permanently routed to the next region in order of preference. In case the preference list had one entry (or was empty) but the account has other regions available, it will route to the next region in the account list. | No |
| Azure Cosmos DB | Extensive throttling due to lack of RUs. | Depending on the number of RUs and the load balancing employed at the Front Door level, certain stamps could run hot on Azure Cosmos DB utilization while other stamps can serve more requests. Mitigation: Better load distribution or more RUs. | No |
| Azure Cosmos DB | Partition full | The Azure Cosmos DB logical partition size limit is 20 GB. If data for a partition key within a container reaches this size, additional write requests will fail with the error "Partition key reached maximum size". | Partial (DB writes disabled) |
| Azure Container Registry | Regional outage | The container registry uses Traffic Manager to fail over between replica regions. Any request should be automatically rerouted to another region. At worst, Docker images can't be pulled for a few minutes by AKS nodes while DNS failover happens. | No |
| Azure Container Registry | Image(s) deleted | No images can be pulled. This outage should only affect newly spawned/rebooted nodes. Existing nodes should have the images cached. Mitigation: If detected quickly, rerunning the latest build pipelines should bring the images back into the registry. | No |
| Azure Container Registry | Throttling | Throttling can delay scale-out operations, which can result in temporarily degraded performance. Mitigation: Azure Mission-Critical uses the Premium SKU, which provides 10k read operations per minute. Container images are optimized and have only small numbers of layers. ImagePullPolicy is set to IfNotPresent to use locally cached versions first. Comment: Pulling a container image consists of multiple read operations, depending on the number of layers. The number of read operations per minute is limited and depends on the ACR SKU size. | No |
| Azure Kubernetes Service | Cluster upgrade fails | AKS node upgrades should occur at different times across the stamps. If one upgrade fails, the other cluster shouldn't be affected. Cluster upgrades should deploy in a rolling fashion across the nodes to prevent all nodes from becoming unavailable. | No |
| Azure Kubernetes Service | Application pod is killed when serving a request. | This could result in end-user facing errors and poor user experience. Mitigation: Kubernetes by default removes pods in a graceful way. Pods are removed from services first and the workload receives a SIGTERM with a grace period to finish open requests and write data before terminating. The application code needs to understand SIGTERM, and the grace period might need to be adjusted if the workload takes longer to shut down. | No |
| Azure Kubernetes Service | Compute capacity unavailable in region to add more nodes. | Scale up/out operations will fail, but it shouldn't affect existing nodes and their operation. Ideally traffic should shift automatically to other regions for load balancing. | No |
| Azure Kubernetes Service | Subscription reaches CPU core quota to add new nodes. | Scale up/out operations will fail, but it shouldn't affect existing nodes and their operation. Ideally traffic should shift automatically to other regions for load balancing. | No |
| Azure Kubernetes Service | Let's Encrypt TLS/SSL certificates can't be issued/renewed. | The cluster should report unhealthy towards Front Door and traffic should shift to other stamps. Mitigation: Investigate the root cause of the issue/renew failure. | No |
| Azure Kubernetes Service | Resource requests/limits are configured incorrectly, so pods reach 100% CPU utilization and fail requests. | The application retry mechanism should be able to recover failed requests. Retries could cause a longer request duration, without surfacing the error to the client. Excessive load will eventually cause failure. | No (if load not excessive) |
| Azure Kubernetes Service | 3rd-party container images / registry unavailable | Some components like cert-manager and ingress-nginx require downloading container images and Helm charts from external container registries (outbound traffic). In case one or more of these repositories or images are unavailable, new instances on new nodes (where the image isn't already cached) might not be able to start. Possible mitigation: In some scenarios it could make sense to import 3rd-party container images into the per-solution container registry. This adds additional complexity and should be planned and operationalized carefully. | Partially (during scale and update/upgrade operations) |
| Event Hubs | Messages can't be sent to Event Hubs | The stamp becomes unusable for write operations. The health service should automatically detect this and take the stamp out of rotation. | No |
| Event Hubs | Messages can't be read by the BackgroundProcessor | Messages will queue up. Messages shouldn't get lost since they're persisted. Currently, this failure isn't covered by the Health Service. There should be monitoring/alerting in place on the worker to detect errors in reading messages. Mitigation: The stamp should be manually disabled until the problem is fixed. | No |
| Storage account | Storage account becomes unusable by the worker for Event Hubs checkpointing | The stamp won't process messages from Event Hubs. The storage account is also used by the HealthService. It's expected that issues with storage are detected by the HealthService and the stamp is taken out of rotation. It can be expected that other services in the stamp will be affected concurrently. | No |
| Storage account | Static website encounters issues. | If serving of the static website encounters issues, this failure should be detected by Front Door. Traffic won't be sent to this storage account. Caching at Front Door can also alleviate this issue. | No |
| Key Vault | Key Vault unavailable for GetSecret operations. | At the start of new pods, the AKS CSI driver fetches all secrets from Key Vault and fails, so pods will be unable to start. There's currently an automatic update every 5 minutes; the update will fail, and errors show up in kubectl describe pod, but the pod keeps working. | No |
| Key Vault | Key Vault unavailable for GetSecret or SetSecret operations. | New deployments can't be executed. Currently, this failure might cause the entire deployment pipeline to stop, even if only one region is affected. | No |
| Key Vault | Key Vault throttling | Key Vault has a limit of 1,000 operations per 10 seconds. Because of the automatic update of secrets, you could in theory hit this limit if you had many (thousands of) pods in a stamp. Possible mitigation: Decrease the update frequency even further or turn it off completely. | No |
| Application | Misconfiguration | Incorrect connection strings or secrets injected into the app. Should be mitigated by automated deployment (the pipeline handles configuration automatically) and blue-green rollout of updates. | No |
| Application | Expired credentials (stamp resource) | If, for example, the Event Hubs SAS token or storage account key was changed without properly updating it in Key Vault so that the pods can use it, the respective application component will fail. This failure should then affect the Health Service, and the stamp should be taken out of rotation automatically. Mitigation: Use Microsoft Entra ID-based authentication for all services that support it. AKS requires pods to authenticate using Microsoft Entra Workload ID (preview). Use of Pod Identity was considered in the reference implementation, but Pod Identity wasn't deemed stable enough at the time and was decided against for the current reference architecture. Pod Identity could be a solution in the future. | No |
| Application | Expired credentials (globally shared resource) | If, for example, the Azure Cosmos DB API key was changed without properly updating it in all stamp Key Vaults so that the pods can use it, the respective application components will fail. This failure would bring all stamps down at the same time and cause a workload-wide outage. For a possible way around the need for keys and secrets by using Microsoft Entra authentication, see the previous item. | Full |
| Virtual network | Subnet IP address space exhausted | If the IP address space on a subnet is exhausted, no scale-out operations, such as creating new AKS nodes or pods, can happen. It won't lead to an outage but might decrease performance and impact user experience. Mitigation: Increase the IP space (if possible). If that's not an option, it might help to increase the resources per node (larger VM SKUs) or per pod (more CPU/memory), so that each pod can handle more traffic, thus decreasing the need for scale-out. | No |

Next steps

Deploy the reference implementation to get a full understanding of the resources and their configuration used in this architecture.