
Overview of the resiliency pillar

Building a reliable application in the cloud is different from traditional application development. Historically, you may have purchased redundant, higher-end hardware to minimize the chance of an entire application platform failing. In the cloud, we acknowledge up front that failures will happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component.

To assess your workload using the tenets found in the Azure architecture framework, see the Azure architecture review.

Reliable applications are:

  • Resilient: they recover gracefully from failures and continue to function with minimal downtime and data loss before full recovery.
  • Highly available (HA): they run as designed in a healthy state with no significant downtime.

Understanding how these elements work together, and how they affect cost, is essential to building a reliable application. It can help you determine how much downtime is acceptable, the potential cost to your business, and which functions are necessary during a recovery.

This article provides a brief overview of building reliability into each step of the Azure application design process. Each section includes a link to an in-depth article on how to integrate reliability into that specific step in the process. If you're looking for reliability considerations for individual Azure services, review the Resiliency checklist for specific Azure services.

Build for reliability

This section describes six steps for building a reliable Azure application. Each step links to a section that further defines the process and terms.

  1. Define requirements. Develop availability and recovery requirements based on decomposed workloads and business needs.
  2. Use architectural best practices. Follow proven practices, identify possible failure points in the architecture, and determine how the application will respond to failure.
  3. Test with simulations and forced failovers. Simulate faults, trigger forced failovers, and test detection of and recovery from these failures.
  4. Deploy the application consistently. Release to production using reliable and repeatable processes.
  5. Monitor application health. Detect failures, monitor indicators of potential failures, and gauge the health of your applications.
  6. Respond to failures and disasters. Identify when a failure occurs, and determine how to address it based on established strategies.

Define requirements

Identify your business needs, and build your reliability plan to address them. Consider the following:

  • Identify workloads and usage. A workload is a distinct capability or task that is logically separated from other tasks, in terms of business logic and data storage requirements. Each workload has different requirements for availability, scalability, data consistency, and disaster recovery.

  • Plan for usage patterns. Usage patterns also play a role in requirements. Identify differences in requirements during critical and non-critical periods. For example, a tax-filing application can't fail during a filing deadline. To ensure uptime, plan redundancy across several regions in case one fails. Conversely, to minimize costs during non-critical periods, you can run your application in a single region.

  • Establish availability metrics: mean time to recovery (MTTR) and mean time between failures (MTBF). MTTR is the average time it takes to restore a component after a failure. MTBF is how long a component can reasonably be expected to run between outages. Use these measures to determine where to add redundancy and to determine service-level agreements (SLAs) for customers.

  • Establish recovery metrics: recovery time objective (RTO) and recovery point objective (RPO). RTO is the maximum acceptable time an application can be unavailable after an incident. RPO is the maximum duration of data loss that is acceptable during a disaster. To derive these values, conduct a risk assessment and make sure you understand the cost and risk of downtime or data loss in your organization.

    Note

    If the MTTR of any critical component in a highly available setup exceeds the system RTO, a failure in the system might cause an unacceptable business disruption. That is, you can't restore the system within the defined RTO.

  • Determine workload availability targets. To ensure that application architecture meets your business requirements, define target SLAs for each workload. Account for the cost and complexity of meeting availability requirements, in addition to application dependencies.

  • Understand service-level agreements. In Azure, the SLA describes the Microsoft commitments for uptime and connectivity. If the SLA for a particular service is 99.9 percent, you should expect the service to be available 99.9 percent of the time.

    Define your own target SLAs for each workload in your solution, so you can determine whether the architecture meets the business requirements. For example, if a workload requires 99.99 percent uptime but depends on a service with a 99.9 percent SLA, that service can't be a single point of failure in the system.
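
The arithmetic behind these metrics can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the availability formula is the common steady-state approximation MTBF / (MTBF + MTTR), and the composite calculations assume that component failures are independent, which may not hold in a real deployment.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def allowed_downtime_minutes(sla: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget implied by an SLA (for example 0.999) over a period."""
    return (1.0 - sla) * period_minutes


def serial_sla(*slas: float) -> float:
    """Composite availability when every dependency must be up (assumes independence)."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def redundant_sla(sla: float, copies: int) -> float:
    """Composite availability when any one of `copies` replicas is enough (assumes independence)."""
    return 1.0 - (1.0 - sla) ** copies


def components_exceeding_rto(mttr_hours_by_component: dict, rto_hours: float) -> list:
    """Names of components whose MTTR is longer than the system RTO."""
    return [name for name, mttr in mttr_hours_by_component.items() if mttr > rto_hours]


if __name__ == "__main__":
    print(allowed_downtime_minutes(0.999))   # yearly downtime budget at 99.9 percent
    print(serial_sla(0.999, 0.999))          # two chained 99.9 percent services
    print(redundant_sla(0.999, 2))           # the same service deployed twice
    print(components_exceeding_rto({"db": 4.0, "web": 0.5}, rto_hours=2.0))
```

Under these assumptions, a single 99.9 percent dependency cannot satisfy a 99.99 percent target on its own, while two independent copies of it can, which is one way to read the single-point-of-failure example above.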

For more information about developing requirements for reliable applications, see Application design for resiliency.

Use architectural best practices

During the architectural phase, focus on implementing practices that meet your business requirements, identify failure points, and minimize the scope of failures.

  • Perform a failure mode analysis (FMA). FMA builds resiliency into an application early in the design stage. It helps you identify the types of failures your application might experience, the potential effects of each, and possible recovery strategies.

  • Create a redundancy plan. The level of redundancy required for each workload depends on your business needs and factors into the overall cost of your application.

  • Design for scalability. A cloud application must be able to scale to accommodate changes in usage. Begin with discrete components, and design the application to respond automatically to load changes whenever possible. Keep scaling limits in mind during design so you can expand easily in the future.

  • Plan for subscription and service requirements. You might need additional subscriptions to provision enough resources to meet your business requirements for storage, connections, throughput, and more.

  • Use load-balancing to distribute requests. Load-balancing distributes your application's requests to healthy service instances by removing unhealthy instances from rotation.

  • Implement resiliency strategies. Resiliency is the ability of a system to recover from failures and continue to function. Implement resiliency design patterns, such as isolating critical resources, using compensating transactions, and performing asynchronous operations whenever possible.

  • Build availability requirements into your design. Availability is the proportion of time your system is functional and working. Take steps to ensure that application availability conforms to your service-level agreement. For example, avoid single points of failure, decompose workloads by service-level objective, and throttle high-volume users.

  • Manage your data. How you store, back up, and replicate data is critical.

    • Choose replication methods for your application data. Your application data is stored in various data stores and might have different availability requirements. Evaluate the replication methods and locations for each type of data store to ensure that they satisfy your requirements.
    • Document and test your failover and failback processes. Clearly document instructions to fail over to a new data store, and test them regularly to make sure they are accurate and easy to follow.
    • Protect your data. Back up and validate data regularly, and make sure no single user account has access to both production and backup data.
    • Plan for data recovery. Make sure that your backup and replication strategy provides for data recovery times that meet your service-level requirements. Account for all types of data your application uses, including reference data and databases.
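
As one illustration of the load-balancing practice in the list above, the following Python sketch shows a round-robin balancer that removes unhealthy instances from rotation. The class and method names are hypothetical and do not correspond to any Azure API; a managed load balancer would normally do this for you based on health probe results.

```python
class RoundRobinBalancer:
    """Hands out instances in turn, skipping any that are marked unhealthy."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._healthy = set(self._instances)
        self._next = 0  # index of the next instance to consider

    def mark_unhealthy(self, instance):
        """Take an instance out of rotation (for example after a failed probe)."""
        self._healthy.discard(instance)

    def mark_healthy(self, instance):
        """Return a known instance to rotation."""
        if instance in self._instances:
            self._healthy.add(instance)

    def pick(self):
        """Return the next healthy instance, or raise if none is available."""
        total = len(self._instances)
        for _ in range(total):
            candidate = self._instances[self._next % total]
            self._next += 1
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy instances available")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["a", "b", "c"])
    balancer.mark_unhealthy("b")
    print([balancer.pick() for _ in range(4)])
```

With instance "b" marked unhealthy, requests alternate between "a" and "c" until "b" is marked healthy again.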

Test with simulations and forced failovers

Testing for reliability requires measuring how the end-to-end workload performs under failure conditions that occur only intermittently.

  • Test for common failure scenarios by triggering actual failures or by simulating them. Use fault injection testing to test common scenarios (including combinations of failures) and recovery time.
  • Identify failures that occur only under load. Test for peak load, using production data or synthetic data that is as close to production data as possible, to see how the application behaves under real-world conditions.
  • Run disaster recovery drills. Have a disaster recovery plan in place, and test it periodically to make sure it works.
  • Perform failover and failback testing. Ensure that your application's dependent services fail over and fail back in the correct order.
  • Run simulation tests. Testing real-life scenarios can highlight issues that need to be addressed. Scenarios should be controllable and non-disruptive to the business. Inform management of simulation testing plans.
  • Test health probes. Configure health probes for load balancers and traffic managers to check critical system components. Test them to make sure that they respond appropriately.
  • Test monitoring systems. Be sure that monitoring systems are reliably reporting critical information and accurate data to help identify potential failures.
  • Include third-party services in test scenarios. Test possible points of failure due to third-party service disruption, in addition to recovery.

Testing is an iterative process. Test the application, measure the outcome, analyze and address any failures, and repeat the process.
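
A fault injection test can be as small as a wrapper that forces a dependency to fail a fixed number of times. The Python sketch below is a hypothetical illustration, not a specific testing tool: `FaultInjector`, `TransientError`, and `call_with_retry` are names invented for this example, and the retry helper stands in for whatever recovery logic your application actually uses.

```python
import time


class TransientError(Exception):
    """Stands in for a temporary failure of a dependency."""


class FaultInjector:
    """Wraps a callable and makes its first `failures` calls raise TransientError."""

    def __init__(self, func, failures):
        self._func = func
        self._remaining = failures
        self.calls = 0  # total number of calls observed, including failed ones

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self._remaining > 0:
            self._remaining -= 1
            raise TransientError("injected fault")
        return self._func(*args, **kwargs)


def call_with_retry(func, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    flaky = FaultInjector(lambda: "ok", failures=2)
    # Pass a no-op sleep so the test measures behavior rather than waiting.
    print(call_with_retry(flaky, sleep=lambda seconds: None), flaky.calls)
```

Injecting the `sleep` function keeps the test fast and deterministic. The same wrapper can be configured with more failures than the retry budget allows, to confirm that the application surfaces the error instead of hanging.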

For more information about testing for application reliability, see Testing Azure applications for resiliency and availability.

Deploy the application consistently

Deployment includes provisioning Azure resources, deploying application code, and applying configuration settings. An update may involve all three tasks or a subset of them.

After an application is deployed to production, updates are a possible source of errors. Minimize errors with predictable and repeatable deployment processes.

  • Automate your application deployment process. Automate as many processes as possible.
  • Design your release process to maximize availability. If your release process requires services to go offline during deployment, your application is unavailable until they come back online. Take advantage of platform staging and production features. Use blue-green or canary releases to deploy updates, so if a failure occurs, you can quickly roll back the update.
  • Have a rollback plan for deployment. Design a rollback process to return to a last known good version and to minimize downtime if a deployment fails.
  • Log and audit deployments. If you use staged deployment techniques, more than one version of your application is running in production. Implement a robust logging strategy to capture as much version-specific information as possible.
  • Document the application release process. Clearly define and document your release process, and ensure that it's available to the entire operations team.
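
The blue-green release, rollback, and audit practices above can be sketched together in Python. This is a simplified model under stated assumptions: `BlueGreenRouter` is a hypothetical class, the `health_check` callable is supplied by the caller, and a real deployment would switch traffic through platform features such as staging slots rather than an in-memory attribute.

```python
class BlueGreenRouter:
    """Tracks which version receives traffic and keeps an audit trail of changes."""

    def __init__(self, live_version):
        self.live = live_version
        self.previous = None   # last known good version, if any
        self.history = []      # audit log of (action, version) tuples

    def release(self, new_version, health_check):
        """Switch traffic to new_version only if health_check(new_version) passes."""
        self.history.append(("deploy", new_version))
        if not health_check(new_version):
            self.history.append(("rejected", new_version))
            return False
        self.previous, self.live = self.live, new_version
        self.history.append(("switched", new_version))
        return True

    def rollback(self):
        """Return traffic to the last known good version."""
        if self.previous is None:
            raise RuntimeError("no previous version to roll back to")
        self.history.append(("rollback", self.previous))
        self.live, self.previous = self.previous, None


if __name__ == "__main__":
    router = BlueGreenRouter("v1")
    router.release("v2", health_check=lambda version: True)
    router.rollback()
    print(router.live, router.history)
```

Because the old version stays deployed until the switch is confirmed, rolling back is a routing change rather than a redeployment, which is the property that makes this approach quick to reverse.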

For more information about application reliability and deployment, see Deploying Azure applications for resiliency and availability.

Monitor application health

Implement best practices for monitoring and alerts in your application so you can detect failures and alert an operator to fix them.

  • Implement health probes and check functions. Run them regularly from outside the application to identify degradation of application health and performance.

  • Check long-running workflows. Catching issues early can minimize the need to roll back the entire workflow or to execute multiple compensating transactions.

  • Maintain application logs.

    • Log applications in production and at service boundaries.
    • Use semantic and asynchronous logging.
    • Separate application logs from audit logs.
  • Measure remote call statistics, and share the data with the application team. To give your operations team an instantaneous view into application health, summarize remote call metrics, such as latency, throughput, and errors at the 95th and 99th percentiles. Perform statistical analysis on the metrics to uncover errors that occur within each percentile.

  • Track transient exceptions and retries over an appropriate time frame. A trend of increasing exceptions over time indicates that the service is having an issue and may fail.

  • Set up an early warning system. Identify the key performance indicators (KPIs) of an application's health, such as transient exceptions and remote call latency, and set appropriate threshold values for each of them. Send an alert to operations when the threshold value is reached.

  • Operate within Azure subscription limits. Azure subscriptions have limits on certain resource types, such as the number of resource groups, cores, and storage accounts. Watch your use of resource types.

  • Monitor third-party services. Log your invocations and correlate them with your application's health and diagnostic logging using a unique identifier.

  • Train multiple operators to monitor the application and to perform manual recovery steps. Make sure there is always at least one trained operator active.
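
The percentile summaries and early warning thresholds described above can be sketched as follows. This Python example is illustrative only: the function names, metric names, and limits are hypothetical, and the percentile uses the simple nearest-rank method rather than the interpolation that a monitoring product might apply.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct percent of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_thresholds(latencies_ms, transient_errors, *, p95_limit_ms, p99_limit_ms, error_limit):
    """Return the names of the KPIs that have reached their alert thresholds."""
    alerts = []
    if percentile(latencies_ms, 95) >= p95_limit_ms:
        alerts.append("latency_p95")
    if percentile(latencies_ms, 99) >= p99_limit_ms:
        alerts.append("latency_p99")
    if transient_errors >= error_limit:
        alerts.append("transient_errors")
    return alerts


if __name__ == "__main__":
    samples = list(range(1, 101))  # synthetic latencies of 1 ms to 100 ms
    print(percentile(samples, 95), percentile(samples, 99))
    print(check_thresholds(samples, 3, p95_limit_ms=90, p99_limit_ms=120, error_limit=5))
```

In practice the returned names would be routed to whatever alerting channel your operators use. The thresholds themselves should come from observed baselines rather than the placeholder numbers shown here.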

For more information about monitoring for application reliability, see Monitoring Azure application health.

Respond to failures and disasters

Create a recovery plan, and make sure that it covers data restoration, network outages, dependent service failures, and region-wide service disruptions. Consider your VMs, storage, databases, and other Azure platform services in your recovery strategy.

  • Plan for Azure support interactions. Before the need arises, establish a process for contacting Azure support.
  • Document and test your disaster recovery plan. Write a disaster recovery plan that reflects the business impact of application failures. Automate the recovery process as much as possible, and document any manual steps. Regularly test your disaster recovery process to validate and improve the plan.
  • Fail over manually when required. Some systems can't fail over automatically and require a manual failover. If an application fails over to a secondary region, perform an operational readiness test. Verify that the primary region is healthy and ready to receive traffic again before failing back. Determine what the reduced application functionality is and how the app informs users of temporary problems.
  • Prepare for application failure. Prepare for a range of failures, including faults that are handled automatically, those that result in reduced functionality, and those that cause the application to become unavailable. The application should inform users of temporary issues.
  • Recover from data corruption. If a failure happens in a data store, check for data inconsistencies when the store becomes available again, especially if the data was replicated. Restore corrupt data from a backup.
  • Recover from a network outage. You might be able to use cached data to run locally with reduced application functionality. If not, consider accepting application downtime or failing over to another region. Store your data in an alternate location until connectivity is restored.
  • Recover from a dependent service failure. Determine which functionality is still available and how the application should respond.
  • Recover from a region-wide service disruption. Region-wide service disruptions are uncommon, but you should have a strategy to address them, especially for critical applications. You might be able to redeploy the application to another region or redistribute traffic.
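
Running on cached data during a network outage, as mentioned in the list above, can be sketched as a fallback wrapper. This Python example is a minimal illustration under assumptions: `fetch_with_fallback` and `Result` are hypothetical names, the cache is a plain dictionary, and `ConnectionError` stands in for whatever exception your client library raises when the network is unavailable.

```python
from typing import Any, NamedTuple


class Result(NamedTuple):
    value: Any
    degraded: bool  # True when the value came from the cache instead of the live service


def fetch_with_fallback(fetch_remote, cache, key):
    """Try the live service first; fall back to cached data if the network is down."""
    try:
        value = fetch_remote(key)
    except ConnectionError:
        if key in cache:
            # Reduced functionality: possibly stale data, flagged so the UI can tell users.
            return Result(cache[key], degraded=True)
        raise  # nothing cached: the caller must handle downtime or fail over
    cache[key] = value  # refresh the cache on every successful call
    return Result(value, degraded=False)


if __name__ == "__main__":
    cache = {}
    print(fetch_with_fallback(lambda key: "fresh", cache, "profile"))

    def offline(key):
        raise ConnectionError("network unreachable")

    print(fetch_with_fallback(offline, cache, "profile"))
```

The `degraded` flag gives the application a hook for informing users of a temporary problem, and the re-raised exception marks the point where a failover decision would be made.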

For more information about responding to failures and disaster recovery, see Failure and disaster recovery for Azure applications.