Error handling for resilient applications in Azure
Ensuring that your application can recover from errors is critical when working in a distributed system. You test your applications to prevent errors and failures, but you also need to be prepared for the cases that testing doesn't catch. Knowing how to handle errors and contain the impact of failures matters because no test suite covers everything.
Many parts of a distributed system are outside your control and beyond your means to test, such as the underlying cloud infrastructure and third-party runtime dependencies. You can be sure that something will fail eventually, so prepare for it.
Key points
- Uncover issues or failures in your application's retry logic.
- Configure request timeouts to manage inter-component calls.
- Implement retry logic to handle transient application failures and transient failures with internal or external dependencies.
- Configure and test health probes for your load balancers and traffic managers.
- Segregate read operations from update operations across application data stores.
Transient fault handling
Track the number of transient exceptions and retries over time to uncover issues or failures in your application's retry logic. A trend of increasing exceptions over time may indicate that the service is having an issue and may fail. To learn more, reference Retry service specific guidance.
Use the Retry pattern, paying particular attention to its issues and considerations. Avoid overwhelming dependent services by implementing the Circuit Breaker pattern. Review and incorporate additional best practices guidance for Transient fault handling. When calling systems that implement the Throttling pattern, ensure that your retries aren't counterproductive.
A reference implementation is available here. It uses Polly and IHttpClientBuilder to implement the Circuit Breaker pattern.
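The reference implementation uses Polly in C#; as a language-neutral illustration, the following is a minimal Python sketch of the Retry pattern with exponential backoff and jitter. The function name, parameters, and the choice of which exceptions count as transient are assumptions for the example, not part of any library API.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=4, base_delay=0.5,
                       transient_errors=(ConnectionError, TimeoutError)):
    """Retry `operation` on transient errors, backing off exponentially."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except transient_errors:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the failure to the caller
            # Full jitter spreads retries out and avoids synchronized retry storms
            # that could overwhelm a recovering service.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            time.sleep(delay)
```

Only errors you expect to be transient should trigger a retry; retrying a permanent failure (such as an authentication error) wastes time and can amplify load on the dependency.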
Request timeouts
When making a service call or a database call, ensure that appropriate request timeouts are set. Database connection timeouts are typically set to 30 seconds. For guidance on how to troubleshoot, diagnose, and prevent SQL connection errors, see transient errors for SQL Database.
Leverage design patterns that encapsulate robust timeout strategies, such as the Choreography pattern or the Compensating Transaction pattern.
A reference implementation is available on GitHub.
Cascading Failures
The Circuit Breaker pattern provides stability while the system recovers from a failure and minimizes the impact on performance. It can help to maintain the response time of the system by quickly rejecting a request for an operation that's likely to fail, rather than waiting for the operation to time out, or never return.
A circuit breaker might be able to test the health of a service by sending a request to an endpoint exposed by the service. The service should return information indicating its status.
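The behavior described above can be sketched in a few lines of Python. This is a deliberately minimal circuit breaker (closed, open, and half-open states); the class name, thresholds, and the choice to signal an open circuit with `RuntimeError` are assumptions for the example, not a production implementation.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, probes after a cooldown."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                # Fail fast instead of waiting on an operation that's likely to fail.
                raise RuntimeError("circuit open: rejecting call")
            # Cooldown elapsed: half-open, allow one trial call to probe health.
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        # A success closes the circuit and resets the failure count.
        self.failure_count = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error and can fall back or degrade gracefully, which keeps response times predictable and gives the failing service room to recover.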
Retry pattern. Describes how an application can handle anticipated temporary failures when it tries to connect to a service or network resource by transparently retrying an operation that has previously failed.
Samples related to this pattern are here.
Application Health Probes
Configure and test health probes for your load balancers and traffic managers. Ensure that your health endpoint checks the critical parts of the system and responds appropriately.
- For Azure Front Door and Azure Traffic Manager, the health probe determines whether to fail over to another region. Your health endpoint should check any critical dependencies that are deployed within the same region.
- For Azure Load Balancer, the health probe determines whether to remove a VM from rotation. The health endpoint should report the health of the VM. Don't include other tiers or external services. Otherwise, a failure that occurs outside the VM will cause the load balancer to remove the VM from rotation.
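A health endpoint typically aggregates checks on its critical dependencies into a single pass/fail result. The following Python sketch shows that aggregation, assuming hypothetical check callables per dependency; the 200/503 status codes match the convention load balancers and traffic managers expect from a probe endpoint.

```python
def health_status(dependency_checks):
    """Aggregate dependency checks into an HTTP-style health result.

    `dependency_checks` maps a dependency name to a zero-argument callable
    that returns True when healthy. Any failure (or raised exception) marks
    the probe unhealthy so the load balancer or traffic manager can react.
    """
    results = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status_code = 200 if all(results.values()) else 503
    return status_code, results
```

Which checks belong in `dependency_checks` follows the scoping rules above: regional dependencies for a front-door probe, but only the VM's own health for a load-balancer probe.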
Samples related to health probes are here.
- An ARM template that deploys an Azure Load Balancer and health probes that detect the health of the sample service endpoint.
- An ASP.NET Core Web API that shows configuration of health checks at startup.
Command and Query Responsibility Segregation (CQRS)
Achieve the levels of scale and performance needed for your solution by segregating read and write interfaces through the CQRS pattern.
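The core of the segregation can be sketched as two narrow interfaces over separate stores. This Python example is a structural illustration only: the domain (orders), class names, and in-memory dict stores are assumptions, and in a real system the read model is usually updated asynchronously via events rather than inline.

```python
class OrderCommands:
    """Write side: validates commands and applies changes to the write store."""

    def __init__(self, write_store, read_store):
        self.write_store = write_store
        self.read_store = read_store

    def place_order(self, order_id, item):
        if order_id in self.write_store:
            raise ValueError("order already exists")
        self.write_store[order_id] = {"item": item, "status": "placed"}
        # Project the change into the denormalized read model
        # (often done asynchronously in a real system).
        self.read_store[order_id] = f"{item} (placed)"


class OrderQueries:
    """Read side: serves queries from the read model and performs no writes."""

    def __init__(self, read_store):
        self.read_store = read_store

    def get_order_summary(self, order_id):
        return self.read_store.get(order_id)
```

Because the read side never touches the write store, each side can be scaled, cached, and tuned independently, which is the scale and performance benefit the pattern targets.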
Next step
Related links
- For information on transient faults, see Troubleshoot transient connection errors.
- For guidance on implementing health monitoring in your application, see Health Endpoint Monitoring pattern.
Go back to the main article: Testing