
Designing resilient applications for Azure

In a distributed system, failures will happen. Hardware can fail. The network can have transient failures. Rarely, an entire service or region may experience a disruption, but even those events must be planned for.

Building a reliable application in the cloud is different from building a reliable application in an enterprise setting. While historically you may have purchased higher-end hardware to scale up, in a cloud environment you must scale out instead of scaling up. Costs for cloud environments are kept low through the use of commodity hardware. Instead of focusing on preventing failures and optimizing "mean time between failures," in this new environment the focus shifts to "mean time to restore." The goal is to minimize the effect of a failure.

This article provides an overview of how to build resilient applications in Microsoft Azure. It starts with a definition of the term resiliency and related concepts. Then it describes a process for achieving resiliency, using a structured approach over the lifetime of an application, from design and implementation to deployment and operations.

What is resiliency?

Resiliency is the ability of a system to recover from failures and continue to function. It's not about avoiding failures, but about responding to failures in a way that avoids downtime or data loss. The goal of resiliency is to return the application to a fully functioning state following a failure.

Two important aspects of resiliency are high availability and disaster recovery.

  • High availability (HA) is the ability of the application to continue running in a healthy state, without significant downtime. By "healthy state," we mean the application is responsive, and users can connect to the application and interact with it.
  • Disaster recovery (DR) is the ability to recover from rare but major incidents: non-transient, wide-scale failures, such as a service disruption that affects an entire region. Disaster recovery includes data backup and archiving, and may include manual intervention, such as restoring a database from backup.

One way to think about HA versus DR is that DR starts when the impact of a fault exceeds the ability of the HA design to handle it.

When you design for resiliency, you must understand your availability requirements. How much downtime is acceptable? This is partly a function of cost. How much will potential downtime cost your business? How much should you invest in making the application highly available? You also have to define what it means for the application to be available. For example, is the application "down" if a customer can submit an order but the system cannot process it within the normal timeframe? Also consider the probability of a particular type of outage occurring, and whether a mitigation strategy is cost-effective.

Another common term is business continuity (BC), which is the ability to perform essential business functions during and after adverse conditions, such as a natural disaster or a downed service. BC covers the entire operation of the business, including physical facilities, people, communications, transportation, and IT. This article focuses on cloud applications, but resilience planning must be done in the context of overall BC requirements.

Data backup is a critical part of DR. If the stateless components of an application fail, you can always redeploy them. But if data is lost, the system can't return to a stable state. Data must be backed up, ideally in a different region in case of a region-wide disaster.

Backup is distinct from data replication. Data replication involves copying data in near-real-time, so that the system can fail over quickly to a replica. Many database systems support replication; for example, SQL Server supports SQL Server Always On Availability Groups. Data replication can reduce how long it takes to recover from an outage, by ensuring that a replica of the data is always standing by. However, data replication won't protect against human error. If data gets corrupted because of human error, the corrupted data just gets copied to the replicas. Therefore, you still need to include long-term backup in your DR strategy.

Process to achieve resiliency

Resiliency is not an add-on. It must be designed into the system and put into operational practice. Here is a general model to follow:

  1. Define your availability requirements, based on business needs.
  2. Design the application for resiliency. Start with an architecture that follows proven practices, and then identify the possible failure points in that architecture.
  3. Implement strategies to detect and recover from failures.
  4. Test the implementation by simulating faults and triggering forced failovers.
  5. Deploy the application into production using a reliable, repeatable process.
  6. Monitor the application to detect failures. By monitoring the system, you can gauge the health of the application and respond to incidents if necessary.
  7. Respond if there are incidents that require manual intervention.

In the remainder of this article, we discuss each of these steps in more detail.

Defining your resiliency requirements

Resiliency planning starts with business requirements. Here are some approaches for thinking about resiliency in those terms.

Decompose by workload

Many cloud solutions consist of multiple application workloads. The term "workload" in this context means a discrete capability or computing task, which can be logically separated from other tasks, in terms of business logic and data storage requirements. For example, an e-commerce app might include the following workloads:

  • Browse and search a product catalog.
  • Create and track orders.
  • View recommendations.

These workloads might have different requirements for availability, scalability, data consistency, disaster recovery, and so forth. Again, these are business decisions.

Also consider usage patterns. Are there certain critical periods when the system must be available? For example, a tax-filing service can't go down right before the filing deadline, a video streaming service must stay up during a big sports event, and so on. During the critical periods, you might have redundant deployments across several regions, so the application could fail over if one region failed. However, a multi-region deployment is more expensive, so during less critical times, you might run the application in a single region.


Two important metrics to consider are the recovery time objective and recovery point objective.

  • Recovery time objective (RTO) is the maximum acceptable time that an application can be unavailable after an incident. If your RTO is 90 minutes, you must be able to restore the application to a running state within 90 minutes from the start of a disaster. If you have a very low RTO, you might keep a second deployment continually running on standby, to protect against a regional outage.

  • Recovery point objective (RPO) is the maximum duration of data loss that is acceptable during a disaster. For example, if you store data in a single database, with no replication to other databases, and perform hourly backups, you could lose up to an hour of data.

RTO and RPO are business requirements. Conducting a risk assessment can help you define the application's RTO and RPO. Another common metric is mean time to recover (MTTR), which is the average time that it takes to restore the application after a failure. MTTR is an empirical fact about a system. If MTTR exceeds the RTO, then a failure in the system will cause an unacceptable business disruption, because it won't be possible to restore the system within the defined RTO.


In Azure, the Service Level Agreement (SLA) describes Microsoft's commitments for uptime and connectivity. If the SLA for a particular service is 99.9%, it means you should expect the service to be available 99.9% of the time.


The Azure SLA also includes provisions for obtaining a service credit if the SLA is not met, along with specific definitions of "availability" for each service. That aspect of the SLA acts as an enforcement policy.

You should define your own target SLAs for each workload in your solution. An SLA makes it possible to evaluate whether the architecture meets the business requirements. For example, if a workload requires 99.99% uptime, but depends on a service with a 99.9% SLA, that service cannot be a single point of failure in the system. One remedy is to have a fallback path in case the service fails, or take other measures to recover from a failure in that service.

The following table shows the potential cumulative downtime for various SLA levels.

SLA     | Downtime per week | Downtime per month | Downtime per year
99%     | 1.68 hours        | 7.2 hours          | 3.65 days
99.9%   | 10.1 minutes      | 43.2 minutes       | 8.76 hours
99.95%  | 5 minutes         | 21.6 minutes       | 4.38 hours
99.99%  | 1.01 minutes      | 4.32 minutes       | 52.56 minutes
99.999% | 6 seconds         | 25.9 seconds       | 5.26 minutes
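
The figures in the table follow from simple arithmetic: the allowed downtime is the unavailable fraction (1 minus the SLA) multiplied by the length of the period. The following Python sketch reproduces that calculation. The function name and the period lengths (a 30-day month and a 365-day year) are assumptions chosen because they match the table; they are not taken from any Azure documentation.

```python
# Convert an availability target into the downtime it allows per period.
HOURS_PER_WEEK = 7 * 24
HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month, which matches the table
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year


def allowed_downtime_hours(sla_percent: float, period_hours: float) -> float:
    """Return the hours of downtime that an SLA permits over a period."""
    return (1 - sla_percent / 100) * period_hours


for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_week = allowed_downtime_hours(sla, HOURS_PER_WEEK) * 60
    per_month = allowed_downtime_hours(sla, HOURS_PER_MONTH) * 60
    per_year = allowed_downtime_hours(sla, HOURS_PER_YEAR) * 60
    print(f"{sla}%: {per_week:.2f} min/week, "
          f"{per_month:.2f} min/month, {per_year:.2f} min/year")
```

The same function can be used to check a target you define for your own workload against a measurement window of any length.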

Of course, higher availability is better, everything else being equal. But as you strive for more 9s, the cost and complexity of achieving that level of availability grow. An uptime of 99.99% translates to about 4.3 minutes of total downtime per month. Is it worth the additional complexity and cost to reach five 9s? The answer depends on the business requirements.

Here are some other considerations when defining an SLA:

  • To achieve four 9s (99.99%), you probably can't rely on manual intervention to recover from failures. The application must be self-diagnosing and self-healing.
  • Beyond four 9s, it is challenging to detect outages quickly enough to meet the SLA.
  • Think about the time window that your SLA is measured against. The smaller the window, the tighter the tolerances. It probably doesn't make sense to define your SLA in terms of hourly or daily uptime.

Composite SLAs

Consider an App Service web app that writes to Azure SQL Database. At the time of this writing, these Azure services have the following SLAs:

  • App Service Web Apps = 99.95%
  • SQL Database = 99.99%


What is the maximum downtime you would expect for this application? If either service fails, the whole application fails. In general, the probability of each service failing is independent, so the composite SLA for this application is 99.95% × 99.99% = 99.94%. That's lower than the individual SLAs, which isn't surprising, because an application that relies on multiple services has more potential failure points.

On the other hand, you can improve the composite SLA by creating independent fallback paths. For example, if SQL Database is unavailable, put transactions into a queue, to be processed later.


With this design, the application is still available even if it can't connect to the database. However, it fails if the database and the queue both fail at the same time. The expected fraction of time for a simultaneous failure is 0.0001 × 0.001, so the composite SLA for this combined path is:

  • Database OR queue = 1.0 − (0.0001 × 0.001) = 99.99999%

The total composite SLA is:

  • Web app AND (database OR queue) = 99.95% × 99.99999% = ~99.95%
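
The two rules used above (multiply availabilities when every component is required, and multiply failure probabilities when any one component is enough) can be sketched in a few lines of Python. The helper names `serial` and `with_fallback` are hypothetical, and the 99.9% queue SLA is inferred from the 0.001 failure figure in the text rather than stated there. Both rules assume that failures are independent.

```python
def serial(*slas: float) -> float:
    """Every component is required, so multiply the availabilities."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def with_fallback(*slas: float) -> float:
    """Any one component is enough: 1 minus the chance that all fail."""
    all_fail = 1.0
    for sla in slas:
        all_fail *= 1.0 - sla
    return 1.0 - all_fail


WEB_APP = 0.9995
SQL_DATABASE = 0.9999
QUEUE = 0.999  # inferred from the 0.001 failure probability used in the text

no_fallback = serial(WEB_APP, SQL_DATABASE)       # web app AND database
db_or_queue = with_fallback(SQL_DATABASE, QUEUE)  # database OR queue
total = serial(WEB_APP, db_or_queue)              # web app AND (database OR queue)

print(f"Without fallback: {no_fallback:.4%}")
print(f"Database OR queue: {db_or_queue:.5%}")
print(f"With fallback: {total:.4%}")
```

Note that the fallback path raises the composite figure only up to the SLA of the web app itself, which remains a single required component.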

But there are tradeoffs to this approach. The application logic is more complex, you are paying for the queue, and there may be data consistency issues to consider.

SLA for multi-region deployments. Another HA technique is to deploy the application in more than one region, and use Azure Traffic Manager to fail over if the application fails in one region. For a two-region deployment, the composite SLA is calculated as follows.

Let N be the composite SLA for the application deployed in one region. The expected chance that the application will fail in both regions at the same time is (1 − N) × (1 − N). Therefore,

  • Combined SLA for both regions = 1 − (1 − N)(1 − N) = N + (1 − N)N

Finally, you must factor in the SLA for Traffic Manager. At the time of this writing, the SLA for Traffic Manager is 99.99%.

  • Composite SLA = 99.99% × (combined SLA for both regions)
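
As a worked example of the formula above, the following Python sketch computes the composite SLA for a multi-region deployment. The function name is hypothetical, and the single-region value N = 99.95% is an assumed input chosen only for illustration. The calculation also assumes that regional failures are independent and ignores downtime during failover.

```python
def multi_region_sla(single_region_sla: float,
                     traffic_manager_sla: float,
                     regions: int = 2) -> float:
    """Composite SLA for identical deployments behind a traffic router."""
    # The application is down only if every region is down at the same time.
    all_regions_fail = (1.0 - single_region_sla) ** regions
    combined = 1.0 - all_regions_fail
    # The router itself is a required component, so multiply by its SLA.
    return traffic_manager_sla * combined


N = 0.9995                 # assumed single-region composite SLA
TRAFFIC_MANAGER = 0.9999   # figure quoted in the text

print(f"Two regions: {multi_region_sla(N, TRAFFIC_MANAGER):.4%}")
print(f"Three regions: {multi_region_sla(N, TRAFFIC_MANAGER, regions=3):.4%}")
```

With these inputs, the result is bounded by the Traffic Manager SLA: adding a third region changes the figure very little, because the router becomes the limiting component.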

Also, failing over is not instantaneous and can result in some downtime during a failover. See Traffic Manager endpoint monitoring and failover.

The calculated SLA number is a useful baseline, but it doesn't tell the whole story about availability. Often, an application can degrade gracefully when a non-critical path fails. Consider an application that shows a catalog of books. If the application can't retrieve the thumbnail image for the cover, it might show a placeholder image. In that case, failing to get the image does not reduce the application's uptime, although it affects the user experience.

Redundancy and designing for failure

Failures can vary in the scope of their impact. Some hardware failures, such as a failed disk, may affect a single host machine. A failed network switch could affect a whole server rack. Less common are failures that disrupt a whole data center, such as loss of power in a data center. Rarely, an entire region could become unavailable.

One of the main ways to make an application resilient is through redundancy. But you need to plan for this redundancy when you design the application. Also, the level of redundancy that you need depends on your business requirements. Not every application needs redundancy across regions to guard against a regional outage. In general, there is a tradeoff between greater redundancy and reliability versus higher cost and complexity.

Azure has a number of features to make an application redundant at every level of failure, from an individual VM to an entire region.

Single VM. Azure provides an uptime SLA for single VMs. Although you can get a higher SLA by running two or more VMs, a single VM may be reliable enough for some workloads. For production workloads, we recommend using two or more VMs for redundancy.

Availability sets. To protect against localized hardware failures, such as a disk or network switch failing, deploy two or more VMs in an availability set. An availability set consists of two or more fault domains that share a common power source and network switch. VMs in an availability set are distributed across the fault domains, so if a hardware failure affects one fault domain, network traffic can still be routed to the VMs in the other fault domains. For more information about availability sets, see Manage the availability of Windows virtual machines in Azure.

Availability zones (preview). An availability zone is a physically separate zone within an Azure region. Each availability zone has a distinct power source, network, and cooling. Deploying VMs across availability zones helps to protect an application against datacenter-wide failures.

Paired regions. To protect an application against a regional outage, you can deploy the application across multiple regions, using Azure Traffic Manager to distribute internet traffic to the different regions. Each Azure region is paired with another region. Together, these form a regional pair. With the exception of Brazil South, regional pairs are located within the same geography in order to meet data residency requirements for tax and law enforcement jurisdiction purposes.

When you design a multi-region application, take into account that network latency across regions is higher than within a region. For example, if you are replicating a database to enable failover, use synchronous data replication within a region, but asynchronous data replication across regions.

                 | Availability set | Availability zone        | Paired region
Scope of failure | Rack             | Datacenter               | Region
Request routing  | Load Balancer    | Cross-zone Load Balancer | Traffic Manager
Network latency  | Very low         | Low                      | Mid to high
Virtual network  | VNet             | VNet                     | Cross-region VNet peering (preview)

Designing for resiliency

During the design phase, you should perform a failure mode analysis (FMA). The goal of an FMA is to identify possible points of failure, and define how the application will respond to those failures.

  • How will the application detect this type of failure?
  • How will the application respond to this type of failure?
  • How will you log and monitor this type of failure?

For more information about the FMA process, with specific recommendations for Azure, see Azure resiliency guidance: Failure mode analysis.

Example of identifying failure modes and detection strategy

Failure point: Call to an external web service / API.

Failure mode           | Detection strategy
Service is unavailable | HTTP 5xx
Throttling             | HTTP 429 (Too Many Requests)
Authentication         | HTTP 401 (Unauthorized)
Slow response          | Request times out
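
The detection strategies in the table can be turned into a small classifier that the calling code uses to decide how to respond and what to log. The following Python sketch is one way to express that mapping. The `FailureMode` enum and the `classify` function are hypothetical names, not part of any Azure SDK.

```python
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    SERVICE_UNAVAILABLE = "service is unavailable"
    THROTTLING = "throttling"
    AUTHENTICATION = "authentication"
    SLOW_RESPONSE = "slow response"


def classify(status_code: Optional[int],
             timed_out: bool = False) -> Optional[FailureMode]:
    """Map the outcome of an HTTP call to a failure mode, or None if healthy."""
    if timed_out:
        return FailureMode.SLOW_RESPONSE
    if status_code is None:
        return None
    if status_code == 429:
        return FailureMode.THROTTLING
    if status_code == 401:
        return FailureMode.AUTHENTICATION
    if 500 <= status_code <= 599:
        return FailureMode.SERVICE_UNAVAILABLE
    return None


print(classify(503))
print(classify(429))
print(classify(None, timed_out=True))
```

Keeping the classification in one place makes it easier to log each failure mode consistently and to attach a different response (retry, back off, refresh credentials) to each one.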

Resiliency strategies

This section provides a survey of some common resiliency strategies. Most of these are not limited to a particular technology. The descriptions in this section summarize the general idea behind each technique, with links to further reading.

Retry transient failures

Transient failures can be caused by momentary loss of network connectivity, a dropped database connection, or a timeout when a service is busy. Often, a transient failure can be resolved simply by retrying the request. For many Azure services, the client SDK implements automatic retries, in a way that is transparent to the caller; see Retry service specific guidance.

Each retry attempt adds to the total latency. Also, too many failed requests can cause a bottleneck, as pending requests accumulate in the queue. These blocked requests might hold critical system resources such as memory, threads, database connections, and so on, which can cause cascading failures. To avoid this, increase the delay between each retry attempt, and limit the total number of failed requests.
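
A minimal sketch of that advice, assuming a hypothetical `TransientError` exception that marks failures worth retrying: the delay doubles after each failed attempt, and the number of attempts is capped. The names and default values are illustrative only; where an Azure client SDK already retries for you, prefer its built-in policy.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
          sleep=time.sleep):
    """Call operation(), retrying transient failures with a growing delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempt limit reached: surface the failure
            # Exponential backoff: 0.5 s, 1 s, 2 s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay)
```

The `sleep` parameter exists so that tests can replace the real delay with a recorder. In production code you would usually also add random jitter to the delay so that many clients do not retry in lockstep; that refinement is left out here to keep the sketch short.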


For more information, see Retry Pattern.

Load balance across instances

For scalability, a cloud application should be able to scale out by adding more instances. This approach also improves resiliency, because unhealthy instances can be removed from rotation.


Replicate data

Replicating data is a general strategy for handling non-transient failures in a data store. Many storage technologies provide built-in replication, including Azure SQL Database, Cosmos DB, and Apache Cassandra.

It's important to consider both the read and write paths. Depending on the storage technology, you might have multiple writable replicas, or a single writable replica and multiple read-only replicas.

To maximize availability, replicas can be placed in multiple regions. However, this increases the latency when replicating the data. Typically, replicating across regions is done asynchronously, which implies an eventual consistency model and potential data loss if a replica fails.

Degrade gracefully

If a service fails and there is no failover path, the application may be able to degrade gracefully while still providing an acceptable user experience. For example:

  • Put a work item on a queue, to be handled later.
  • Return an estimated value.
  • Use locally cached data.
  • Show the user an error message. (This option is better than having the application stop responding to requests.)
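
As an illustration of the "use locally cached data" option, the following Python sketch wraps a remote lookup and falls back to the last value it saw when the call fails. The `CachedFallback` class is a hypothetical helper, and treating `ConnectionError` as the failure signal is an assumption; a real client would raise its own exception types.

```python
from typing import Callable, Dict, Optional, Tuple


class CachedFallback:
    """Serve the last known value when the backing service is unreachable."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> Tuple[Optional[str], bool]:
        """Return (value, is_stale). The value is None if nothing is cached."""
        try:
            value = self._fetch(key)
        except ConnectionError:
            # Degraded path: a stale answer is better than no answer.
            return self._cache.get(key), True
        self._cache[key] = value
        return value, False
```

Returning the staleness flag lets the caller decide how to present the data, for example by showing a notice that the information may be out of date.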

Throttle high-volume users

Sometimes a small number of users create excessive load. That can have an impact on other users, reducing the overall availability of your application.

When a single client makes an excessive number of requests, the application might throttle the client for a certain period of time. During the throttling period, the application refuses some or all of the requests from that client (depending on the exact throttling strategy). The threshold for throttling might depend on the customer's service tier.

Throttling does not imply the client was necessarily acting maliciously, only that it exceeded its service quota. In some cases, a consumer might consistently exceed their quota or otherwise behave badly. In that case, you might go further and block the user. Typically, this is done by blocking an API key or an IP address range.
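
One simple throttling strategy is a fixed-window counter per client: each client may make a limited number of requests per time window, and further requests in that window are refused. The Python sketch below shows the idea. The class name, the limits, and the in-memory dictionary are assumptions for illustration; a multi-instance service would need shared state, and other strategies (such as token buckets) smooth bursts better.

```python
import time


class FixedWindowThrottle:
    """Allow at most `limit` requests per client in each time window."""

    def __init__(self, limit: int, window_seconds: float,
                 clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._windows = {}  # client_id -> (window_start, request_count)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window expired: start fresh
        if count >= self.limit:
            self._windows[client_id] = (start, count)
            return False  # over quota: the caller would typically return HTTP 429
        self._windows[client_id] = (start, count + 1)
        return True
```

To vary the threshold by service tier, you could keep one throttle instance per tier, or look up the limit for each client inside `allow`.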

For more information, see Throttling Pattern.

Use a circuit breaker

The Circuit Breaker pattern can prevent an application from repeatedly trying an operation that is likely to fail. This is similar to a physical circuit breaker, a switch that interrupts the flow of current when a circuit is overloaded.

The circuit breaker wraps calls to a service. It has three states:

  • Closed. This is the normal state. The circuit breaker sends requests to the service, and a counter tracks the number of recent failures. If the failure count exceeds a threshold within a given time period, the circuit breaker switches to the Open state.
  • Open. In this state, the circuit breaker immediately fails all requests, without calling the service. The application should use a mitigation path, such as reading data from a replica or simply returning an error to the user. When the circuit breaker switches to Open, it starts a timer. When the timer expires, the circuit breaker switches to the Half-open state.
  • Half-open. In this state, the circuit breaker lets a limited number of requests go through to the service. If they succeed, the service is assumed to be recovered, and the circuit breaker switches back to the Closed state. Otherwise, it reverts to the Open state. The Half-open state prevents a recovering service from suddenly being inundated with requests.
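
The three states can be sketched as a small Python class. This is a simplified illustration, not a production implementation: it counts consecutive failures rather than failures within a time window, it lets a single trial request through in the Half-open state, and it is not thread-safe. The class and exception names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker fails a request without calling the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; service not called")
            self.state = "half-open"  # timer expired: allow a trial request
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        # Success: the service is assumed healthy again.
        self.state = "closed"
        self.failures = 0
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()  # (re)start the Open timer
```

A caller would catch `CircuitOpenError` and take the mitigation path, such as reading from a replica or returning an error to the user.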

For more information, see Circuit Breaker Pattern.

Use load leveling to smooth out spikes in traffic

Applications may experience sudden spikes in traffic, which can overwhelm services on the backend. If a backend service cannot respond to requests quickly enough, it may cause requests to queue (back up), or cause the service to throttle the application.

To avoid this, you can use a queue as a buffer. When there is a new work item, instead of calling the backend service immediately, the application queues a work item to run asynchronously. The queue acts as a buffer that smooths out peaks in the load.
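
The following Python sketch shows the shape of this approach inside a single process, using the standard library `queue` and `threading` modules as stand-ins for a durable message queue and a backend worker. The function name `level_load` and the single-worker design are assumptions for illustration; in Azure you would typically use a queueing service between separate components instead.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to exit


def level_load(items, handle):
    """Buffer a burst of work items and process them at the worker's pace."""
    buffer = queue.Queue()
    results = []

    def worker():
        while True:
            item = buffer.get()
            if item is _STOP:
                break
            results.append(handle(item))  # stand-in for the backend call

    consumer = threading.Thread(target=worker)
    consumer.start()

    # The producer enqueues the whole burst without waiting for the backend.
    for item in items:
        buffer.put(item)
    buffer.put(_STOP)

    consumer.join()
    return results


print(level_load(range(5), lambda order_id: order_id * 2))
```

The producer side finishes enqueueing almost immediately, however slow `handle` is; the backend sees a steady stream of one item at a time rather than the full burst.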

For more information, see Queue-Based Load Leveling Pattern.

Isolate critical resources

Failures in one subsystem can sometimes cascade, causing failures in other parts of the application. This can happen if a failure causes some resources, such as threads or sockets, not to get freed in a timely manner, leading to resource exhaustion.

To avoid this, you can partition a system into isolated groups, so that a failure in one partition does not bring down the entire system. This technique is sometimes called the Bulkhead pattern. For example:

  • Partition a database (for example, by tenant) and assign a separate pool of web server instances for each partition.
  • Use separate thread pools to isolate calls to different services. This helps to prevent cascading failures if one of the services fails. For an example, see the Netflix Hystrix library.
  • Use containers to limit the resources available to a particular subsystem.
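
The thread-pool variant of this technique can be sketched with Python's standard `concurrent.futures` module: each downstream service gets its own fixed-size pool, so calls that hang on one service cannot use up the threads needed for another. The `Bulkheads` class and the pool names are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict


class Bulkheads:
    """Give each downstream service its own bounded thread pool."""

    def __init__(self, pool_sizes: Dict[str, int]):
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, service: str, fn, *args) -> Future:
        # Work for one service only ever runs on that service's threads.
        return self._pools[service].submit(fn, *args)

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

For example, with `Bulkheads({"catalog": 4, "orders": 4})`, a catalog service that stops responding can tie up at most four threads, and order calls continue to run on their own pool.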


Apply compensating transactions

A compensating transaction is a transaction that undoes the effects of another completed transaction.

In a distributed system, it can be very difficult to achieve strong transactional consistency. Compensating transactions are a way to achieve consistency by using a series of smaller, individual transactions that can be undone at each step.

For example, to book a trip, a customer might reserve a car, a hotel room, and a flight. If any of these steps fails, the entire operation fails. Instead of trying to use a single distributed transaction for the entire operation, you can define a compensating transaction for each step. For example, to undo a car reservation, you cancel the reservation. In order to complete the whole operation, a coordinator executes each step. If any step fails, the coordinator applies compensating transactions to undo any steps that were completed.
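
The coordinator described above can be sketched as a short Python function that runs each step in order and, on failure, runs the compensations for the completed steps in reverse. The function name and the (action, compensation) pair representation are assumptions for illustration. A real coordinator would also persist its progress and handle the case where a compensation itself fails.

```python
def run_with_compensation(steps):
    """Run (action, compensation) pairs; undo completed steps on failure."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        # Undo in reverse order, most recent step first.
        for compensate in reversed(completed):
            compensate()
        raise


log = []


def book_flight():
    raise RuntimeError("no seats available")


trip = [
    (lambda: log.append("car reserved"), lambda: log.append("car cancelled")),
    (lambda: log.append("hotel reserved"), lambda: log.append("hotel cancelled")),
    (book_flight, lambda: log.append("flight cancelled")),
]

try:
    run_with_compensation(trip)
except RuntimeError:
    pass

print(log)
```

In this run, the flight step fails, so the hotel and car reservations are cancelled in that order and the flight compensation is never invoked, because the flight was never booked.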

有关详细信息,请参阅补偿事务模式For more information, see Compensating Transaction Pattern.

Testing for resiliency

Generally, you can't test resiliency in the same way that you test application functionality (by running unit tests and so on). Instead, you must test how the end-to-end workload performs under failure conditions that occur only intermittently.

Testing is an iterative process. Test the application, measure the outcome, analyze and address any failures that result, and repeat the process.

Fault injection testing. Test the resiliency of the system during failures, either by triggering actual failures or by simulating them. Here are some common failure scenarios to test:

  • Shut down VM instances.
  • Crash processes.
  • Expire certificates.
  • Change access keys.
  • Shut down the DNS service on domain controllers.
  • Limit available system resources, such as RAM or number of threads.
  • Unmount disks.
  • Redeploy a VM.

Measure the recovery times and verify that your business requirements are met. Test combinations of failure modes as well. Make sure that failures don't cascade and are handled in an isolated way.
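
Simulated faults can be as simple as a wrapper that makes a dependency fail on a chosen fraction of calls. The sketch below is one hypothetical way to do that in Python; `FaultInjector` and `call_with_retry` are illustrative names, not part of any Azure tooling, and the scenarios in the list above (such as shutting down a VM) need infrastructure-level tools instead.

```python
import random


class FaultInjector:
    """Wrap a callable and raise a simulated fault on a fraction of calls."""

    def __init__(self, func, failure_rate, seed=0):
        self.func = func
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = random.Random(seed)    # seeded so that test runs are repeatable
        self.injected = 0                 # number of faults injected so far

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def call_with_retry(func, attempts=5):
    """Call func until it succeeds or the attempts are used up."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


if __name__ == "__main__":
    flaky = FaultInjector(lambda: "ok", failure_rate=0.5)
    print(call_with_retry(flaky), "after", flaky.injected, "injected faults")
```

Wrapping a dependency this way lets a test confirm that the calling code recovers when faults are intermittent, and that it fails in a controlled way when the dependency is completely unavailable.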

This is another reason why it's important to analyze possible failure points during the design phase. The results of that analysis should be inputs into your test plan.

Load testing. Load test the application using a tool such as Visual Studio Team Services or Apache JMeter. Load testing is crucial for identifying failures that only happen under load, such as the backend database being overwhelmed or service throttling. Test for peak load, using production data or synthetic data that is as close to production data as possible. The goal is to see how the application behaves under real-world conditions.

Resilient deployment

Once an application is deployed to production, updates are a possible source of errors. In the worst case, a bad update can cause downtime. To avoid this, the deployment process must be predictable and repeatable. Deployment includes provisioning Azure resources, deploying application code, and applying configuration settings. An update may involve all three, or a subset.

The crucial point is that manual deployments are prone to error. Therefore, we recommend an automated, idempotent process that you can run on demand and re-run if something fails.
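
To make "idempotent" concrete, here is a small Python sketch that converges an in-memory record of resources toward a desired state. The resource names and settings are hypothetical, and a plain dictionary stands in for the real environment. The point is only that running the step a second time makes no further changes.

```python
def apply_desired_state(current, desired):
    """Converge `current` toward `desired` and return the list of changes made.

    Both arguments map a resource name to a dict of settings. Resources that
    already match are left alone, so re-running the function is safe.
    """
    changes = []
    for name, spec in desired.items():
        if current.get(name) != spec:
            kind = "create" if name not in current else "update"
            changes.append((kind, name))
            current[name] = dict(spec)  # copy so later edits to `desired` don't leak in
    return changes


if __name__ == "__main__":
    environment = {}
    desired = {"web": {"sku": "S1"}, "db": {"tier": "Standard"}}
    print(apply_desired_state(environment, desired))  # first run creates both
    print(apply_desired_state(environment, desired))  # second run changes nothing
```

If a run fails partway through, running it again only applies the changes that are still missing.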

Two concepts related to resilient deployment are infrastructure as code and immutable infrastructure.

  • Infrastructure as code is the practice of using code to provision and configure infrastructure. Infrastructure as code may use a declarative approach or an imperative approach (or a combination of both). Resource Manager templates are an example of a declarative approach. PowerShell scripts are an example of an imperative approach.
  • Immutable infrastructure is the principle that you shouldn't modify infrastructure after it's deployed to production. Otherwise, you can get into a state where ad hoc changes have been applied, so it's hard to know exactly what changed, and hard to reason about the system.

Another question is how to roll out an application update. We recommend techniques such as blue-green deployment or canary releases, which push updates in a highly controlled way to minimize possible impacts from a bad deployment.

  • Blue-green deployment is a technique where an update is deployed into a production environment separate from the live application. After you validate the deployment, switch the traffic routing to the updated version. For example, Azure App Service Web Apps enables this with staging slots.
  • Canary releases are similar to blue-green deployments. Instead of switching all traffic to the updated version, you roll out the update to a small percentage of users, by routing a portion of the traffic to the new deployment. If there is a problem, back off and revert to the old deployment. Otherwise, route more of the traffic to the new version, until it gets 100% of the traffic.
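
One way to picture canary routing is a function that assigns each user to a stable bucket and sends the lowest buckets to the new deployment. The Python sketch below is an assumption-laden illustration: the `route` function and the hashing scheme are hypothetical, and in practice a load balancer or platform feature usually does the traffic splitting.

```python
import hashlib


def route(user_id, canary_percent):
    """Return "canary" for roughly canary_percent of users, otherwise "stable".

    The bucket is derived from a hash of the user ID, so a given user keeps
    the same bucket and stays on the canary as the percentage is raised.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return "canary" if bucket < canary_percent else "stable"


if __name__ == "__main__":
    for percent in (0, 10, 50, 100):
        on_canary = sum(route(f"user-{i}", percent) == "canary" for i in range(1000))
        print(f"{percent}% target: {on_canary} of 1000 users on the canary")
```

Raising `canary_percent` step by step corresponds to routing more traffic to the new version, and setting it back to 0 corresponds to reverting to the old deployment.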

Whatever approach you take, make sure that you can roll back to the last-known-good deployment, in case the new version is not functioning. Also, if errors occur, the application logs must indicate which version caused the error.

Monitoring and diagnostics

Monitoring and diagnostics are crucial for resiliency. If something fails, you need to know that it failed, and you need insights into the cause of the failure.

Monitoring a large-scale distributed system poses a significant challenge. Think about an application that runs on a few dozen VMs. It's not practical to log into each VM, one at a time, and look through log files, trying to troubleshoot a problem. Moreover, the number of VM instances is probably not static. VMs get added and removed as the application scales in and out, and occasionally an instance may fail and need to be reprovisioned. In addition, a typical cloud application might use multiple data stores (Azure storage, SQL Database, Cosmos DB, Redis cache), and a single user action may span multiple subsystems.

You can think of the monitoring and diagnostics process as a pipeline with several distinct stages:


  • Instrumentation. The raw data for monitoring and diagnostics comes from a variety of sources, including application logs, web server logs, OS performance counters, database logs, and diagnostics built into the Azure platform. Most Azure services have a diagnostics feature that you can use to determine the cause of problems.
  • Collection and storage. Raw instrumentation data can be held in various locations and in various formats (for example, application trace logs, IIS logs, performance counters). These disparate sources are collected, consolidated, and put into reliable storage.
  • Analysis and diagnosis. After the data is consolidated, it can be analyzed to troubleshoot issues and provide an overall view of application health.
  • Visualization and alerts. In this stage, telemetry data is presented in such a way that an operator can quickly notice problems or trends. Examples include dashboards or email alerts.

Monitoring is not the same as failure detection. For example, your application might detect a transient error and retry, resulting in no downtime. But it should also log the retry operation, so that you can monitor the error rate and get an overall picture of application health.
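
A minimal Python sketch of this idea follows. It retries a transient failure and writes a warning for each failed attempt, so that the error rate remains visible even when the caller sees no failure. `TransientError`, the `retry` helper, and the logger name are hypothetical.

```python
import logging
import time

logger = logging.getLogger("orders")


class TransientError(Exception):
    """Stands in for a timeout, a throttling response, or a similar fault."""


def retry(operation, attempts=3, delay=0.0):
    """Call operation, retrying on TransientError and logging every failure."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            # The caller may never see this failure, but monitoring should.
            logger.warning(
                "transient failure on attempt %d of %d: %s", attempt, attempts, exc
            )
            if attempt == attempts:
                raise
            time.sleep(delay)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("timeout")
        return "done"

    print(retry(flaky))
```

A production retry policy would normally add a back-off delay between attempts; the delay here defaults to zero only to keep the sketch short.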

Application logs are an important source of diagnostics data. Best practices for application logging include:

  • Log in production. Otherwise, you lose insight where you need it most.
  • Log events at service boundaries. Include a correlation ID that flows across service boundaries. If a transaction flows through multiple services and one of them fails, the correlation ID will help you pinpoint why the transaction failed.
  • Use semantic logging, also known as structured logging. Unstructured logs make it hard to automate the consumption and analysis of the log data, which is needed at cloud scale.
  • Use asynchronous logging. Otherwise, the logging system itself can cause the application to fail by causing requests to back up, as they block while waiting to write a logging event.
  • Application logging is not the same as auditing. Auditing may be done for compliance or regulatory reasons. As such, audit records must be complete, and it's not acceptable to drop any while processing transactions. If an application requires auditing, this should be kept separate from diagnostics logging.
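
Several of these practices can be combined using only the Python standard library. The sketch below, offered as one possible arrangement rather than a recommended implementation, emits JSON log entries that carry a correlation ID and hands them to a background thread through a queue. The logger name, field names, and `build_logger` helper are hypothetical.

```python
import json
import logging
import logging.handlers
import queue
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Format each record as one JSON object per line (structured logging)."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            "correlation_id": getattr(record, "correlation_id", None),
        })


def build_logger(sink):
    """Return (logger, listener).

    The logger only puts records on a queue. The listener's background thread
    writes them to `sink`, so request threads do not block on log I/O.
    """
    log_queue = queue.Queue()
    sink.setFormatter(JsonFormatter())
    listener = logging.handlers.QueueListener(log_queue, sink)
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.handlers = [logging.handlers.QueueHandler(log_queue)]
    logger.propagate = False
    listener.start()
    return logger, listener


if __name__ == "__main__":
    logger, listener = build_logger(logging.StreamHandler(sys.stdout))
    correlation_id = uuid.uuid4().hex  # would normally arrive with the incoming request
    logger.info("order received", extra={"correlation_id": correlation_id})
    logger.info("payment requested", extra={"correlation_id": correlation_id})
    listener.stop()  # flushes queued records before exit
```

Passing the same correlation ID to every service that handles a request is what allows the entries to be joined later. The queue here is unbounded, so a real system would also need to decide what happens when the sink cannot keep up.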

For more information about monitoring and diagnostics, see Monitoring and diagnostics guidance.

Manual failure responses

Previous sections have focused on automated recovery strategies, which are critical for high availability. However, sometimes manual intervention is needed.

  • Alerts. Monitor your application for warning signs that may require proactive intervention. For example, if you see that SQL Database or Cosmos DB consistently throttles your application, you might need to increase your database capacity or optimize your queries. In this example, even though the application might handle the throttling errors transparently, your telemetry should still raise an alert so that you can follow up.
  • Manual failover. Some systems cannot fail over automatically and require a manual failover.
  • Operational readiness testing. If your application fails over to a secondary region, you should perform an operational readiness test before you fail back to the primary region. The test should verify that the primary region is healthy and ready to receive traffic again.
  • Data consistency check. If a failure happens in a data store, there may be data inconsistencies when the store becomes available again, especially if the data was replicated.
  • Restoring from backup. For example, if SQL Database experiences a regional outage, you can geo-restore the database from the latest backup.

Document and test your disaster recovery plan. Evaluate the business impact of application failures. Automate the process as much as possible, and document any manual steps, such as manual failover or data restoration from backups. Regularly test your disaster recovery process to validate and improve the plan.


This article discussed resiliency from a holistic perspective, emphasizing some of the unique challenges of the cloud. These include the distributed nature of cloud computing, the use of commodity hardware, and the presence of transient network faults.

Here are the major points to take away from this article:

  • Resiliency leads to higher availability and a lower mean time to recover from failures.
  • Achieving resiliency in the cloud requires a different set of techniques from traditional on-premises solutions.
  • Resiliency does not happen by accident. It must be designed and built in from the start.
  • Resiliency touches every part of the application lifecycle, from planning and coding to operations.
  • Test and monitor!