Failure mode analysis (FMA) is a process for building resiliency into a system, by identifying possible failure points in the system. The FMA should be part of the architecture and design phases, so that you can build failure recovery into the system from the beginning.
Here is the general process to conduct an FMA:
Identify all of the components in the system. Include external dependencies, such as identity providers, third-party services, and so on.
For each component, identify potential failures that could occur. A single component may have more than one failure mode. For example, you should consider read failures and write failures separately, because the impact and possible mitigation steps will be different.
Rate each failure mode according to its overall risk. Consider factors such as the likelihood of the failure and its impact on the application, in terms of availability, data loss, monetary cost, and business disruption.
For each failure mode, determine how the application will respond and recover. Consider tradeoffs in cost and application complexity.
As a starting point for your FMA process, this article contains a catalog of potential failure modes and their mitigation steps. The catalog is organized by technology or Azure service, plus a general category for application-level design. The catalog isn't exhaustive, but covers many of the core Azure services.
Note
Failures should be distinguished from errors. A failure is an unexpected event within a system that prevents it from continuing to function normally. For example, a hardware malfunction that causes a network partition is a failure. Usually, failures require intervention or a specific design for that class of failures. In contrast, errors are an expected part of normal operations; they are dealt with immediately, and the system continues to operate at the same capacity afterward. For example, errors discovered during input validation can be handled through business logic.
Detection. Possible causes:
Expected shutdown: an operator shuts down the application, or the app is unloaded because it was idle (only if the Always On setting is disabled).
Unexpected shutdown: the app crashes, or an App Service VM instance becomes unavailable.
Application_End logging will catch the app domain shutdown (soft process crash) and is the only way to catch application domain shutdowns.
Recovery:
If the shutdown was expected, use the application's shutdown event to shut down gracefully. For example, in ASP.NET, use the Application_End method.
To prevent the app from being unloaded while idle, enable the Always On setting in the web app. See Configure web apps in Azure App Service.
To prevent an operator from shutting down the app, set a resource lock with ReadOnly level. See Lock resources with Azure Resource Manager.
Diagnostics. Application logs and web server logs. See Enable diagnostics logging for web apps in Azure App Service.
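A minimal sketch of the Application_End logging described above, assuming a classic ASP.NET (System.Web) app; the Trace sink is only illustrative, and any logging framework would do.

```csharp
// Global.asax.cs -- log the app domain shutdown and why it happened.
using System;
using System.Web;
using System.Web.Hosting;

public class Global : HttpApplication
{
    protected void Application_End()
    {
        // HostingEnvironment.ShutdownReason reports why the app domain is
        // shutting down, so expected and unexpected shutdowns can be
        // distinguished in the application logs.
        var reason = HostingEnvironment.ShutdownReason;
        System.Diagnostics.Trace.TraceInformation(
            "Application_End fired at {0:o}. Shutdown reason: {1}",
            DateTime.UtcNow, reason);
    }
}
```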
Detection. Authenticate users and include user ID in application logs.
Recovery:
Diagnostics. Log all authentication requests.
Detection. Monitor the application health through the Azure portal (see Monitor Azure web app performance) or implement the health endpoint monitoring pattern.
Recovery. Use multiple deployment slots and roll back to the last-known-good deployment. For more information, see Basic web application.
Detection. Possible failure modes include:
Recovery:
Handle AuthenticationFailed events.
Detection. Catch Microsoft.Rest.Azure.CloudException errors.
Recovery:
The Search .NET SDK automatically retries after transient failures. Any exceptions thrown by the client SDK should be treated as nontransient errors.
The default retry policy uses exponential back-off. To use a different retry policy, call SetRetryPolicy on the SearchIndexClient or SearchServiceClient class. For more information, see Automatic Retries.
Diagnostics. Use Search Traffic Analytics.
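A hedged sketch of the SetRetryPolicy call mentioned above, assuming the older Microsoft.Azure.Search .NET SDK; the service name, API key, and retry values are placeholders.

```csharp
using System;
using Microsoft.Azure.Search;
using Microsoft.Rest.TransientFaultHandling;

class SearchRetryConfig
{
    static SearchServiceClient CreateClient()
    {
        var client = new SearchServiceClient(
            "my-search", new SearchCredentials("<admin-api-key>"));

        // Replace the default exponential back-off policy: retry up to
        // 5 times, backing off between 1 and 10 seconds (2 s delta).
        client.SetRetryPolicy(
            new RetryPolicy<HttpStatusCodeErrorDetectionStrategy>(
                5,
                TimeSpan.FromSeconds(1),
                TimeSpan.FromSeconds(10),
                TimeSpan.FromSeconds(2)));

        return client;
    }
}
```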
Detection. Catch Microsoft.Rest.Azure.CloudException errors.
Recovery:
The Search .NET SDK automatically retries after transient failures. Any exceptions thrown by the client SDK should be treated as nontransient errors.
The default retry policy uses exponential back-off. To use a different retry policy, call SetRetryPolicy on the SearchIndexClient or SearchServiceClient class. For more information, see Automatic Retries.
Diagnostics. Use Search Traffic Analytics.
Detection. Catch the exception. For .NET clients, this will typically be System.Web.HttpException. Other clients may have other exception types. For more information, see Cassandra error handling done right.
Recovery:
Diagnostics. Application logs.
Detection. The RoleEnvironment.Stopping event is fired.
Recovery. Override the RoleEntryPoint.OnStop method to gracefully clean up. For more information, see The Right Way to Handle Azure OnStop Events (blog).
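A minimal sketch of the OnStop override described above, assuming a worker role built on Microsoft.WindowsAzure.ServiceRuntime; the signaling fields and work loop are illustrative.

```csharp
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    private readonly CancellationTokenSource cancellation = new CancellationTokenSource();
    private readonly ManualResetEvent runCompleted = new ManualResetEvent(false);

    public override void Run()
    {
        try
        {
            while (!cancellation.IsCancellationRequested)
            {
                // Do a unit of work here, checking the token between units.
                Thread.Sleep(1000);
            }
        }
        finally
        {
            runCompleted.Set();
        }
    }

    public override void OnStop()
    {
        // Signal Run() to stop, then block OnStop until in-flight work
        // finishes, so the role shuts down gracefully.
        cancellation.Cancel();
        runCompleted.WaitOne();
        base.OnStop();
    }
}
```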
Detection. Catch System.Net.Http.HttpRequestException or Microsoft.Azure.Documents.DocumentClientException.
Recovery:
The client SDK automatically retries failed attempts. To set the number of retries and the maximum wait time, configure ConnectionPolicy.RetryOptions. Exceptions that the client raises are either beyond the retry policy or are not transient errors.
If Azure Cosmos DB throttles the client, it returns an HTTP 429 error, surfaced as a DocumentClientException. If you're getting error 429 consistently, consider increasing the throughput value of the collection.
If you're designing a multi-region solution, replicate the database across regions and set the PreferredLocations parameter. This is an ordered list of Azure regions. All reads will be sent to the first available region in the list. If the request fails, the client will try the other regions in the list, in order. For more information, see How to set up global distribution in Azure Cosmos DB for NoSQL.
Diagnostics. Log all errors on the client side.
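A sketch of the retry and PreferredLocations configuration above, assuming the older Microsoft.Azure.DocumentDB SDK (DocumentClient); the account URI, key, region names, and retry values are placeholders.

```csharp
using System;
using Microsoft.Azure.Documents.Client;

class CosmosClientFactory
{
    static DocumentClient Create()
    {
        var policy = new ConnectionPolicy();

        // Control automatic retries on throttled (HTTP 429) requests.
        policy.RetryOptions.MaxRetryAttemptsOnThrottledRequests = 9;
        policy.RetryOptions.MaxRetryWaitTimeInSeconds = 30;

        // Ordered list of regions for reads; the client uses the first
        // available region and fails over down the list if a request fails.
        policy.PreferredLocations.Add("West US 2");
        policy.PreferredLocations.Add("East US");

        return new DocumentClient(
            new Uri("https://<account>.documents.azure.com:443/"),
            "<account-key>",
            policy);
    }
}
```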
Detection. Catch System.Net.Http.HttpRequestException or Microsoft.Azure.Documents.DocumentClientException.
Recovery:
The client SDK automatically retries failed attempts. To set the number of retries and the maximum wait time, configure ConnectionPolicy.RetryOptions. Exceptions that the client raises are either beyond the retry policy or are not transient errors.
If Azure Cosmos DB throttles the client, it returns an HTTP 429 error, surfaced as a DocumentClientException. If you're getting error 429 consistently, consider increasing the throughput value of the collection.
Diagnostics. Log all errors on the client side.
Detection. After N retry attempts, the write operation still fails.
Recovery:
Diagnostics. Use storage metrics.
Detection. Application specific. For example, the message contains invalid data, or the business logic fails for some reason.
Recovery:
Move the message to a separate queue. Run a separate process to examine the messages in that queue.
Consider using Azure Service Bus Messaging queues, which provide dead-letter queue functionality for this purpose.
Note
If you're using Storage queues with WebJobs, the WebJobs SDK provides built-in poison message handling. See How to use Azure queue storage with the WebJobs SDK.
Diagnostics. Use application logging.
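A sketch of the separate-queue approach above, using the Azure.Storage.Queues SDK; the queue names and dequeue-count threshold are illustrative, and ProcessAsync is a hypothetical application handler.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;

class PoisonMessageHandler
{
    const int MaxDequeueCount = 5;   // illustrative threshold

    public static async Task PumpAsync(string connectionString)
    {
        var queue = new QueueClient(connectionString, "orders");
        var poisonQueue = new QueueClient(connectionString, "orders-poison");
        await poisonQueue.CreateIfNotExistsAsync();

        var messages = await queue.ReceiveMessagesAsync(16);
        foreach (var message in messages.Value)
        {
            if (message.DequeueCount > MaxDequeueCount)
            {
                // Move the poison message aside so it stops blocking the queue;
                // a separate process examines the poison queue later.
                await poisonQueue.SendMessageAsync(message.MessageText);
                await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
                continue;
            }

            try
            {
                await ProcessAsync(message.MessageText);   // hypothetical handler
                await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
            }
            catch (Exception)
            {
                // Leave the message on the queue; it becomes visible again when
                // the visibility timeout expires, and DequeueCount increments.
            }
        }
    }

    static Task ProcessAsync(string body) => Task.CompletedTask;
}
```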
Detection. Catch StackExchange.Redis.RedisConnectionException.
Recovery:
Diagnostics. Use Azure Cache for Redis diagnostics.
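A sketch of falling back to the system of record when the cache read fails, using StackExchange.Redis; LoadFromDatabaseAsync is a hypothetical data-access method.

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

class CacheAsideReader
{
    private readonly IDatabase cache;

    public CacheAsideReader(ConnectionMultiplexer redis)
    {
        cache = redis.GetDatabase();
    }

    public async Task<string> GetValueAsync(string key)
    {
        try
        {
            RedisValue cached = await cache.StringGetAsync(key);
            if (cached.HasValue)
            {
                return (string)cached;
            }
        }
        catch (RedisConnectionException)
        {
            // Cache unavailable: treat it as a miss and read from the
            // original data store instead of failing the request.
        }

        return await LoadFromDatabaseAsync(key);   // hypothetical fallback
    }

    static Task<string> LoadFromDatabaseAsync(string key) => Task.FromResult("value");
}
```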
Detection. Catch StackExchange.Redis.RedisConnectionException.
Recovery:
Diagnostics. Use Azure Cache for Redis diagnostics.
Detection. Connection fails.
Recovery:
Enable zone redundancy. By enabling zone redundancy, Azure SQL Database automatically replicates your writes across multiple Azure availability zones within supported regions. For more information, see Zone-redundant availability.
Enable geo-replication. If you're designing a multi-region solution, consider enabling SQL Database active geo-replication.
Prerequisite: The database must be configured for active geo-replication. See SQL Database Active Geo-Replication.
The replica uses a different connection string, so you'll need to update the connection string in your application.
Detection. Catch System.InvalidOperationException errors.
Recovery:
Diagnostics. Application logs.
Detection. Azure SQL Database limits the number of concurrent workers, logins, and sessions. The limits depend on the service tier. For more information, see Azure SQL Database resource limits.
To detect these errors, catch System.Data.SqlClient.SqlException and check the value of SqlException.Number for the SQL error code. For a list of relevant error codes, see SQL error codes for SQL Database client applications: Database connection error and other issues.
Recovery. These errors are considered transient, so retrying may resolve the issue. If you consistently hit these errors, consider scaling the database.
Diagnostics. The sys.event_log query returns successful database connections, connection failures, and deadlocks.
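A sketch of checking SqlException.Number and retrying, as described above. The error numbers shown are a few of the commonly documented transient codes; consult the linked list for the full set. The connection string and query are placeholders.

```csharp
using System;
using System.Data.SqlClient;
using System.Threading.Tasks;

class SqlRetry
{
    // A few commonly documented transient error numbers for Azure SQL Database.
    static readonly int[] TransientErrorNumbers = { 10928, 10929, 40197, 40501, 40613 };

    public static async Task ExecuteWithRetryAsync(string connectionString, int maxAttempts = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var connection = new SqlConnection(connectionString))
                {
                    await connection.OpenAsync();
                    using (var command = new SqlCommand("SELECT 1", connection))
                    {
                        await command.ExecuteScalarAsync();
                    }
                }
                return;
            }
            catch (SqlException ex) when (
                Array.IndexOf(TransientErrorNumbers, ex.Number) >= 0 && attempt < maxAttempts)
            {
                // Transient error: back off before retrying.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }
}
```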
Detection. Catch exceptions from the client SDK. The base class for Service Bus exceptions is MessagingException. If the error is transient, the IsTransient property is true. For more information, see Service Bus messaging exceptions.
Recovery:
Detection. Catch exceptions from the client SDK. The base class for Service Bus exceptions is MessagingException. If the error is transient, the IsTransient property is true. For more information, see Service Bus messaging exceptions.
Recovery:
The Service Bus client automatically retries after transient errors. By default, it uses exponential back-off. After the maximum retry count or maximum timeout period, the client throws an exception.
If the queue quota is exceeded, the client throws QuotaExceededException. The exception message gives more details. Drain some messages from the queue before retrying, and consider using the Circuit Breaker pattern to avoid continued retries while the quota is exceeded. Also, make sure the BrokeredMessage.TimeToLive property isn't set too high.
Within a region, resiliency can be improved by using partitioned queues or topics. A non-partitioned queue or topic is assigned to one messaging store. If this messaging store is unavailable, all operations on that queue or topic will fail. A partitioned queue or topic is partitioned across multiple messaging stores.
Use zone redundancy to automatically replicate changes between multiple availability zones. If one availability zone fails, failover happens automatically. For more information, see Best practices for insulating applications against Service Bus outages and disasters.
If you're designing a multi-region solution, create two Service Bus namespaces in different regions, and replicate the messages. You can use either active replication or passive replication.
For more information, see GeoReplication sample and Best practices for insulating applications against Service Bus outages and disasters.
Detection. Examine the MessageId and DeliveryCount properties of the message.
Recovery:
If possible, design your message processing operations to be idempotent. Otherwise, store message IDs of messages that are already processed, and check the ID before processing a message.
Enable duplicate detection by creating the queue with RequiresDuplicateDetection set to true. With this setting, Service Bus automatically deletes any message that is sent with the same MessageId as a previous message within the duplicate detection history time window.
Diagnostics. Log duplicated messages.
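A sketch of enabling duplicate detection and setting a deterministic MessageId, assuming the older WindowsAzure.ServiceBus SDK (NamespaceManager and BrokeredMessage); the connection string, queue name, detection window, and orderId are placeholders.

```csharp
using System;
using Microsoft.ServiceBus;
using Microsoft.ServiceBus.Messaging;

class DuplicateDetectionSetup
{
    public static void SendOrder(string connectionString, string orderId)
    {
        var namespaceManager = NamespaceManager.CreateFromConnectionString(connectionString);

        if (!namespaceManager.QueueExists("orders"))
        {
            // Create the queue with duplicate detection enabled. Service Bus
            // drops any message whose MessageId matches one already seen
            // within the detection window.
            var description = new QueueDescription("orders")
            {
                RequiresDuplicateDetection = true,
                DuplicateDetectionHistoryTimeWindow = TimeSpan.FromMinutes(10)
            };
            namespaceManager.CreateQueue(description);
        }

        var client = QueueClient.CreateFromConnectionString(connectionString, "orders");

        // Use a deterministic MessageId (for example, derived from a business
        // key) so that resends of the same logical message are deduplicated.
        var message = new BrokeredMessage("order payload")
        {
            MessageId = orderId
        };
        client.Send(message);
    }
}
```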
Detection. Application specific. For example, the message contains invalid data, or the business logic fails for some reason.
Recovery:
There are two failure modes to consider.
The receiver detects the failure. In this case, move the message to the dead-letter queue and run a separate process to examine those messages later.
The receiver fails in the middle of processing the message, for example due to an unhandled exception. To handle this case, use PeekLock mode. In this mode, if the lock expires, the message becomes available to other receivers. If the message exceeds the maximum delivery count or the time-to-live, the message is automatically moved to the dead-letter queue.
For more information, see Overview of Service Bus dead-letter queues.
Diagnostics. Whenever the application moves a message to the dead-letter queue, write an event to the application logs.
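A sketch of PeekLock processing with an explicit dead-letter path, again assuming the older WindowsAzure.ServiceBus SDK; HandleOrder is a hypothetical handler and the queue name is a placeholder.

```csharp
using System;
using Microsoft.ServiceBus.Messaging;

class PeekLockProcessor
{
    public static void Start(string connectionString)
    {
        var client = QueueClient.CreateFromConnectionString(
            connectionString, "orders", ReceiveMode.PeekLock);

        var options = new OnMessageOptions { AutoComplete = false };

        client.OnMessage(message =>
        {
            try
            {
                HandleOrder(message.GetBody<string>());   // hypothetical handler
                message.Complete();                       // remove from the queue
            }
            catch (Exception ex)
            {
                // The receiver detected a bad message: move it to the
                // dead-letter queue so it can be inspected separately.
                message.DeadLetter("ProcessingFailed", ex.Message);
            }
        }, options);
    }

    static void HandleOrder(string body) { }
}
```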
Detection. The client receives errors when writing.
Recovery:
Retry the operation, to recover from transient failures. The retry policy in the client SDK handles this automatically.
Implement the Circuit Breaker pattern to avoid overwhelming storage.
If N retry attempts fail, perform a graceful fallback. For example, store the data in a local cache and forward it to storage later, or compensate the transaction if the write was part of a transactional scope.
Diagnostics. Use storage metrics.
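A sketch of tuning the client-side retry policy, assuming the Azure.Storage.Blobs SDK; the retry values are illustrative, and the application still needs its own fallback once retries are exhausted.

```csharp
using System;
using Azure.Core;
using Azure.Storage.Blobs;

class BlobClientFactory
{
    public static BlobServiceClient Create(string connectionString)
    {
        var options = new BlobClientOptions();

        // Exponential back-off: up to 5 retries, 1 s initial delay, 20 s cap.
        options.Retry.Mode = RetryMode.Exponential;
        options.Retry.MaxRetries = 5;
        options.Retry.Delay = TimeSpan.FromSeconds(1);
        options.Retry.MaxDelay = TimeSpan.FromSeconds(20);

        return new BlobServiceClient(connectionString, options);
    }
}
```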
Detection. The client receives errors when reading.
Recovery:
Diagnostics. Use storage metrics.
Detection. Network connection errors.
Recovery:
Diagnostics. Log events at service boundaries.
Detection. Configure a Load Balancer health probe that signals whether the VM instance is healthy. The probe should check whether critical functions are responding correctly.
Recovery. For each application tier, put multiple VM instances into the same availability set, and place a load balancer in front of the VMs. If the health probe fails, the Load Balancer stops sending new connections to the unhealthy instance.
Diagnostics. Use Load Balancer log analytics.
Detection. N/A
Recovery. Set a resource lock with ReadOnly
level. See Lock resources with Azure Resource Manager.
Diagnostics. Use Azure Activity Logs.
Detection. Pass a cancellation token to the WebJob function. For more information, see Graceful shutdown.
Recovery. Enable the Always On setting in the web app. For more information, see Run Background tasks with WebJobs.
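A sketch of a WebJobs function that honors the shutdown cancellation token mentioned above, assuming the WebJobs SDK storage queue binding; the queue name and DoWorkAsync are illustrative.

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public class Functions
{
    public static async Task ProcessQueueMessage(
        [QueueTrigger("workitems")] string message,
        CancellationToken cancellationToken)
    {
        // The SDK signals this token when the WebJob is shutting down,
        // so long-running work can stop cleanly instead of being killed.
        cancellationToken.ThrowIfCancellationRequested();
        await DoWorkAsync(message, cancellationToken);   // hypothetical work
    }

    static Task DoWorkAsync(string message, CancellationToken token) => Task.CompletedTask;
}
```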
Detection. Depends on the application. Typical symptoms:
Recovery:
Scale out to handle increased load.
Mitigate failures to avoid having cascading failures disrupt the entire application. Mitigation strategies include throttling and queue-based load leveling.
Diagnostics. Use App Service diagnostic logging. Use a service such as Azure Log Analytics, Application Insights, or New Relic to help understand the diagnostic logs.
A sample is available that uses Polly to handle these exceptions.
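As a hedged illustration of the kind of policy such a sample uses, the following Polly retry handles HttpRequestException with exponential back-off; the endpoint URL is a placeholder.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

class RemoteCaller
{
    static readonly HttpClient http = new HttpClient();

    // Retry 3 times, waiting 2^attempt seconds between attempts.
    static readonly IAsyncPolicy retryPolicy = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    public static Task<string> GetAsync() =>
        retryPolicy.ExecuteAsync(() => http.GetStringAsync("https://example.com/api/data"));
}
```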
Detection. After N retry attempts, it still fails.
Recovery:
Diagnostics. Log all operations (successful and failed), including compensating actions. Use correlation IDs, so that you can track all operations within the same transaction.
Detection. HTTP error code.
Recovery:
Diagnostics. Log all remote call failures.
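A sketch of the Circuit Breaker pattern with Polly for remote calls: after several consecutive failures the circuit opens and calls fail fast for a cooldown period instead of hammering the failing service. The thresholds and fallback are illustrative.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

class RemoteServiceClient
{
    static readonly HttpClient http = new HttpClient();

    // Open the circuit after 5 consecutive failures; stay open for 30 seconds.
    static readonly IAsyncPolicy breaker = Policy
        .Handle<HttpRequestException>()
        .CircuitBreakerAsync(5, TimeSpan.FromSeconds(30));

    public static async Task<string> CallAsync(string url)
    {
        try
        {
            return await breaker.ExecuteAsync(() => http.GetStringAsync(url));
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open: fail fast, serve a cached or default response,
            // and log the condition instead of retrying immediately.
            return null;
        }
    }
}
```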
See Resiliency and dependencies in the Azure Well-Architected Framework. Building failure recovery into the system from the beginning, as part of the architecture and design phases, reduces the risk of failure.