Design for self healing

Design your application to be self healing when failures occur

In a distributed system, failures happen. Hardware can fail. The network can have transient failures. Rarely, an entire service or region may experience a disruption, but even those events must be planned for.

Therefore, design an application to be self healing when failures occur. This requires a three-pronged approach:

  • Detect failures.
  • Respond to failures gracefully.
  • Log and monitor failures, to give operational insight.

How you respond to a particular type of failure may depend on your application's availability requirements. For example, if you require very high availability, you might automatically fail over to a secondary region during a regional outage. However, that will incur a higher cost than a single-region deployment.

Also, don't just consider big events like regional outages, which are generally rare. You should focus as much, if not more, on handling local, short-lived failures, such as network connectivity failures or failed database connections.

Recommendations

Retry failed operations. Transient failures may occur due to momentary loss of network connectivity, a dropped database connection, or a timeout when a service is busy. Build retry logic into your application to handle transient failures. For many Azure services, the client SDK implements automatic retries. For more information, see Transient fault handling and the Retry pattern.
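As an illustration of the retry logic described above, here is a minimal Python sketch. It is not taken from any Azure SDK: the `TransientError` class, the `retry` helper, and its default values are hypothetical names chosen for this example, and exponential backoff with jitter is one common retry strategy rather than the only option.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    The delay ceiling doubles after each failed attempt (capped at max_delay)
    and the actual wait is randomized so that many clients do not retry in
    lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Only errors known to be transient are retried here; other exceptions propagate immediately, since retrying a permanent failure only adds load.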

Protect failing remote services (Circuit Breaker). It's good to retry after a transient failure, but if the failure persists, you can end up with too many callers hammering a failing service. This can lead to cascading failures, as requests back up. Use the Circuit Breaker pattern to fail fast (without making the remote call) when an operation is likely to fail.
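The fail-fast behavior can be sketched as a small state machine. The following Python example is a simplified, single-threaded illustration; the `CircuitBreaker` class, its thresholds, and the `CircuitOpenError` exception are hypothetical names, and a production implementation would also need locking and monitoring.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the remote service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; remote call skipped")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a timeout, which keeps requests from backing up behind the failing service.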

Isolate critical resources (Bulkhead). Failures in one subsystem can sometimes cascade. This can happen if a failure causes some resources, such as threads or sockets, not to get freed in a timely manner, leading to resource exhaustion. To avoid this, partition a system into isolated groups, so that a failure in one partition does not bring down the entire system.
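One way to picture the partitioning is to give each dependency its own fixed budget of concurrent calls. The sketch below assumes an in-process design using semaphores; the `Bulkhead` class and `BulkheadFullError` are hypothetical names for illustration, and real systems often partition at the level of connection pools, processes, or separate service instances instead.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a partition has no free capacity left."""


class Bulkhead:
    """Caps the number of concurrent calls allowed into one partition."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately rather than queueing, so a slow dependency
        # cannot tie up every thread in the application.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("partition is at capacity")
        try:
            return operation()
        finally:
            self._slots.release()
```

With one `Bulkhead` per downstream service, a service that hangs can exhaust only its own slots; calls routed through other bulkheads continue to run.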

Perform load leveling. Applications may experience sudden spikes in traffic that can overwhelm services on the backend. To avoid this, use the Queue-Based Load Leveling pattern to queue work items to run asynchronously. The queue acts as a buffer that smooths out peaks in the load.
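The buffering idea can be shown with an in-process queue. This is only a sketch: it uses Python's standard `queue.Queue` as a stand-in for a durable message queue, and `start_worker` and the `None` shutdown sentinel are conventions invented for this example. The handler is assumed not to raise.

```python
import queue
import threading


def start_worker(work_queue, handle):
    """Start a background thread that drains work_queue at its own pace."""

    def run():
        while True:
            item = work_queue.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                handle(item)
            finally:
                work_queue.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


# Producers enqueue work as fast as it arrives; a bounded queue applies
# back-pressure (put() blocks) once the buffer is full.
work_queue = queue.Queue(maxsize=100)
```

The producer's burst rate and the worker's processing rate are decoupled: a spike only makes the queue longer for a while instead of overwhelming the backend.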

Fail over. If an instance can't be reached, fail over to another instance. For things that are stateless, like a web server, put several instances behind a load balancer or traffic manager. For things that store state, like a database, use replicas and fail over. Depending on the data store and how it replicates, this may require the application to deal with eventual consistency.
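For stateless instances, failover is usually handled by the load balancer, but the underlying idea can be sketched on the client side. In the Python example below, `call_with_failover` and the endpoint names are hypothetical, and `ConnectionError` stands in for whatever "unreachable" looks like in a given client library.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order and return the first successful response."""
    last_error = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except ConnectionError as exc:
            last_error = exc  # instance unreachable: move on to the next one
    raise ConnectionError("all instances unreachable") from last_error
```

For example, `call_with_failover(["primary", "secondary"], send)` calls `send("primary")` first and falls back to `send("secondary")` only if the first call raises `ConnectionError`.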

Compensate failed transactions. In general, avoid distributed transactions, as they require coordination across services and resources. Instead, compose an operation from smaller individual transactions. If the operation fails midway through, use Compensating Transactions to undo any step that already completed.
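The following sketch shows the shape of this approach: each step is paired with an action that undoes it, and on failure the completed steps are undone in reverse order. The `run_with_compensation` helper is a hypothetical name for illustration; it assumes the compensating actions themselves succeed, which a real system cannot take for granted.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs in order.

    If an action raises, run the compensations of the steps that already
    completed, most recent first, then re-raise the original error.
    """
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise
```

A booking flow might pass pairs such as (reserve seat, release seat) and (charge card, refund card); if a later step fails, the refund runs before the release.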

Checkpoint long-running transactions. Checkpoints can provide resiliency if a long-running operation fails. When the operation restarts (for example, it is picked up by another VM), it can be resumed from the last checkpoint.
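A minimal sketch of checkpointing, assuming the work is a list of items and the checkpoint is a small JSON file on storage that survives a restart. The function name, file format, and `next_index` field are hypothetical choices for this example.

```python
import json
import os


def process_with_checkpoints(items, handle, checkpoint_path):
    """Process items in order, recording progress after each one."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]  # resume where the last run stopped

    for index in range(start, len(items)):
        handle(items[index])
        # Write to a temporary file and rename it, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)
```

If the process dies after `handle` returns but before the checkpoint is written, that item is handled again on restart, so the handler should be idempotent.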

Degrade gracefully. Sometimes you can't work around a problem, but you can provide reduced functionality that is still useful. Consider an application that shows a catalog of books. If the application can't retrieve the thumbnail image for the cover, it might show a placeholder image. Entire subsystems might be noncritical for the application. For example, in an e-commerce site, showing product recommendations is probably less critical than processing orders.
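The book catalog example might look like the following sketch. The `cover_url` function, the logger name, and the placeholder path are hypothetical; the point is only that the failure is logged and a reduced result is returned instead of an error.

```python
import logging

logger = logging.getLogger("catalog")  # hypothetical logger name

PLACEHOLDER_COVER = "/static/placeholder-cover.png"  # hypothetical asset path


def cover_url(book_id, fetch_thumbnail_url):
    """Return the cover thumbnail URL, or a placeholder if it cannot be fetched."""
    try:
        return fetch_thumbnail_url(book_id)
    except Exception:
        # Log the failure for operational insight, then serve reduced functionality.
        logger.warning("thumbnail lookup failed for book %s", book_id, exc_info=True)
        return PLACEHOLDER_COVER
```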

Throttle clients. Sometimes a small number of users create excessive load, which can reduce your application's availability for other users. In this situation, throttle the client for a certain period of time. See the Throttling pattern.
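One common way to enforce a per-client limit is a token bucket, sketched below. This is an assumption about how throttling could be implemented, not a description of the Throttling pattern's reference implementation; the `TokenBucket` class and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the time elapsed since the last request.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over quota: the caller should reject or delay the request
```

In practice an application would keep one bucket per client ID, so that a single heavy user runs out of tokens without affecting anyone else.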

Block bad actors. Just because you throttle a client, it doesn't mean the client was acting maliciously. It just means the client exceeded its service quota. But if a client consistently exceeds its quota or otherwise behaves badly, you might block it. Define an out-of-band process for users to request getting unblocked.
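The distinction between throttling and blocking can be sketched as a counter of quota violations per client. Everything in this example is hypothetical, including the `Blocklist` class and the threshold of three violations; what counts as "consistently" exceeding a quota is a policy decision for each application.

```python
class Blocklist:
    """Blocks a client after repeated quota violations."""

    def __init__(self, max_violations=3):
        self.max_violations = max_violations
        self.violations = {}
        self.blocked = set()

    def record_violation(self, client_id):
        """Call this each time a client is throttled."""
        count = self.violations.get(client_id, 0) + 1
        self.violations[client_id] = count
        if count >= self.max_violations:
            self.blocked.add(client_id)

    def is_blocked(self, client_id):
        return client_id in self.blocked

    def unblock(self, client_id):
        """Invoked by the out-of-band review process, not by the client itself."""
        self.blocked.discard(client_id)
        self.violations.pop(client_id, None)
```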

Use leader election. When you need to coordinate a task, use Leader Election to select a coordinator. That way, the coordinator is not a single point of failure. If the coordinator fails, a new one is selected. Rather than implement a leader election algorithm from scratch, consider an off-the-shelf solution such as ZooKeeper.

Test with fault injection. All too often, the success path is well tested but not the failure path. A system could run in production for a long time before a failure path is exercised. Use fault injection to test the resiliency of the system to failures, either by triggering actual failures or by simulating them.
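Simulated failures can be as simple as a wrapper that makes a dependency fail some fraction of the time. The sketch below is an illustration only; `with_fault_injection` and its parameters are hypothetical names, and dedicated fault-injection tooling offers far more control.

```python
import random


def with_fault_injection(operation, failure_rate, rng=None, error=ConnectionError):
    """Wrap operation so that it raises `error` with probability failure_rate."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise error("injected fault")
        return operation(*args, **kwargs)

    return wrapped
```

In a test, wrapping a client call with `failure_rate=1.0` forces the failure path every time, which makes it possible to assert that retries, fallbacks, and logging behave as intended. Passing a seeded `random.Random` keeps intermediate rates reproducible.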

Embrace chaos engineering. Chaos engineering extends the notion of fault injection by randomly injecting failures or abnormal conditions into production instances.

For a structured approach to making your applications self healing, see Design reliable applications for Azure.