
Disaster recovery for Azure applications

Disaster recovery (DR) focuses on recovering from a catastrophic loss of application functionality. For example, if an Azure region hosting your application becomes unavailable, you need a plan for running your application or accessing your data in another region.

Business and technology owners must determine how much functionality is required during a disaster. This level of functionality can take a few forms: completely unavailable, partially available via reduced functionality or delayed processing, or fully available.

Resiliency and high availability strategies are intended to handle temporary failure conditions. Executing a disaster recovery plan involves people, processes, and supporting applications that allow the system to continue functioning. Your plan should include rehearsing failures and testing the recovery of databases to ensure the plan is sound.

Azure disaster recovery features

As with availability considerations, Azure provides resiliency technical guidance designed to support disaster recovery. There is also a relationship between the availability features of Azure and disaster recovery. For example, the management of roles across fault domains increases the availability of an application. Without that management, an unhandled hardware failure would become a "disaster" scenario. Leveraging these availability features and strategies is an important part of disaster-proofing your application. However, this article goes beyond general availability issues to more serious (and rarer) disaster events.

Multiple datacenter regions

Azure maintains datacenters in many regions around the world. This infrastructure supports several disaster recovery scenarios, such as system-provided geo-replication of Azure Storage to secondary regions. You can also easily and inexpensively deploy a cloud service to multiple locations around the world. Compare this with the cost and difficulty of building and maintaining your own datacenters in multiple regions. Deploying data and services to multiple regions helps protect your application from a major outage in a single region. As you design your disaster recovery plan, it's important to understand the concept of paired regions. For more information, see Business continuity and disaster recovery (BCDR): Azure Paired Regions.

Azure Site Recovery

Azure Site Recovery provides a simple way to replicate Azure VMs between regions. It has minimal management overhead, because you don't need to provision any additional resources in the secondary region. When you enable replication, Site Recovery automatically creates the required resources in the target region, based on the source VM settings. It provides automated continuous replication, and enables you to perform application failover with a single click. You can also run disaster recovery drills by testing failover, without affecting your production workloads or ongoing replication.

Azure Traffic Manager

When a region-specific failure occurs, you must redirect traffic to services or deployments in another region. It is most effective to handle this via services such as Azure Traffic Manager, which automates the failover of user traffic to another region if the primary region fails. Understanding the fundamentals of Traffic Manager is important when designing an effective DR strategy.

Traffic Manager uses the Domain Name System (DNS) to direct client requests to the most appropriate endpoint based on a traffic-routing method and the health of the endpoints. In the following diagram, users connect to a Traffic Manager URL (http://myATMURL.trafficmanager.net), which abstracts the actual site URLs (http://app1URL.cloudapp.net and http://app2URL.cloudapp.net). User requests are routed to the proper underlying URL based on your configured Traffic Manager routing method. For the sake of this article, we will be concerned with only the failover option.

Figure: Routing via Azure Traffic Manager

When configuring Traffic Manager, you provide a new Traffic Manager DNS prefix, which users will use to access your service. Traffic Manager now abstracts load balancing one level higher than the regional level. The Traffic Manager DNS name maps to a CNAME for all the deployments that it manages.

Within Traffic Manager, you specify a prioritized list of deployments that users will be routed to when failure occurs. Traffic Manager monitors the deployment endpoints. If the primary deployment becomes unavailable, Traffic Manager routes users to the next deployment on the priority list.
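The priority-list behavior can be modeled in a few lines. The following is a minimal sketch in Python, not the Traffic Manager API: the `Endpoint` class, the `pick_endpoint` function, and the health flags are hypothetical names introduced for illustration, and real health status comes from Traffic Manager's own endpoint monitoring.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical model of one deployment in the priority list."""
    name: str
    priority: int   # lower number = preferred deployment
    healthy: bool   # assumed to be set by an external health probe


def pick_endpoint(endpoints):
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority) if healthy else None


# Host names taken from the example URLs in this article.
endpoints = [
    Endpoint("app1URL.cloudapp.net", priority=1, healthy=True),
    Endpoint("app2URL.cloudapp.net", priority=2, healthy=True),
]
```

While the primary deployment is healthy, `pick_endpoint(endpoints)` returns it. Marking it unhealthy makes the same call return the next deployment on the list, which is the failover behavior described above.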

Although Traffic Manager decides where to go during a failover, you can decide whether your failover domain is dormant or active while you're not in failover mode (which is unrelated to Traffic Manager). Traffic Manager detects a failure in the primary site and rolls over to the failover site, regardless of whether that site is currently serving users.

For more information on how Azure Traffic Manager works, refer to the Azure Traffic Manager documentation.

Azure disaster scenarios

The following sections cover several different types of disaster scenarios. Region-wide service disruptions are not the only cause of application-wide failures. Poor design and administrative errors can also lead to outages. It's important to consider the possible causes of a failure during both the design and testing phases of your recovery plan. A good plan takes advantage of Azure features and augments them with application-specific strategies. The chosen response is determined by the importance of the application, the recovery point objective (RPO), and the recovery time objective (RTO).

Application failure

Azure automatically handles failures that result from the underlying hardware or operating system software in the host virtual machine. Azure creates a new role instance and adds it to the available pool. If more than one role instance was already running, Azure shifts processing to the other running role instances while replacing the failed node.

Serious application errors can occur without any underlying failure of the hardware or operating system. The application might fail due to catastrophic exceptions caused by bad logic or data integrity issues. You must include sufficient telemetry in the application code so that a monitoring system can detect failure conditions and notify an application administrator. An administrator who has full knowledge of the disaster recovery processes can decide whether to trigger a failover process or accept an availability outage while resolving the critical errors.

Data corruption

Azure automatically stores Azure SQL Database and Azure Storage data three times redundantly within different fault domains in the same region. If you use geo-replication, the data is stored three additional times in a different region. However, if your users or your application corrupts the data in the primary copy, the corruption quickly replicates to the other copies. Unfortunately, this results in multiple copies of corrupt data.

To manage potential corruption of your data, you have two options. First, you can manage a custom backup strategy. You can store your backups in Azure or on-premises, depending on your business requirements or governance regulations. Another option is to use the point-in-time restore option to recover a SQL database. For more information, see the data strategies for disaster recovery section below.

Network outage

When parts of the Azure network are inaccessible, you may be unable to access your application or data. If one or more role instances are unavailable due to network issues, Azure uses the remaining available instances of your application. If your application cannot access its data because of an Azure network outage, you can potentially run with reduced application functionality locally by using cached data. You need to design the disaster recovery strategy to run with reduced functionality in your application. For some applications, this might not be practical.

Another option is to store data in an alternate location until connectivity is restored. If reducing functionality is not an option, the remaining options are application downtime or failover to an alternate region. The design of an application running with reduced functionality is as much a business decision as a technical one. This is discussed further in the section on reduced application functionality.

Failure of a dependent service

Azure provides many services that can experience periodic downtime. For example, Azure Redis Cache is a multi-tenant service that provides caching capabilities to your application. It's important to consider what happens in your application if the dependent service is unavailable. In many ways, this scenario is similar to the network outage scenario. However, considering each service independently results in potential improvements to your overall plan.

Azure Redis Cache provides caching to your application from within your cloud service deployment, which provides disaster recovery benefits. First, the service now runs on roles that are local to your deployment. Therefore, you're better able to monitor and manage the status of the cache as part of your overall management processes for the cloud service. This type of caching also exposes new features such as high availability for cached data, which preserves cached data if a single node fails by maintaining duplicate copies on other nodes.

Note that high availability decreases throughput and increases latency because write operations must also update any secondary copies. The amount of memory required to store the cached data is effectively doubled, which must be taken into account during capacity planning. This example demonstrates that each dependent service might have capabilities that improve your overall availability and resistance to catastrophic failures.

With each dependent service, you should understand the implications of a service disruption. In the caching example, it might be possible to access the data directly from a database until you restore your cache. This would result in reduced performance while providing full access to application data.
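The cache fallback described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: `CacheUnavailable`, `read_with_fallback`, and the `cache_get` and `db_get` callables are hypothetical stand-ins, not part of any Azure or Redis client library.

```python
class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache is down."""


def read_with_fallback(key, cache_get, db_get):
    """Return (value, source), preferring the cache when it responds."""
    try:
        value = cache_get(key)
        if value is not None:
            return value, "cache"
    except CacheUnavailable:
        pass  # the cache is down; fall through to the slower database read
    return db_get(key), "database"
```

Returning the source alongside the value lets telemetry record how often reads are served from the database, which is one way to see the reduced performance while the cache is being restored.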

Region-wide service disruption

The previous failures have primarily been failures that can be managed within the same Azure region. However, you must also prepare for the possibility that there is a service disruption of the entire region. If a region-wide service disruption occurs, the locally redundant copies of your data are not available. If you have enabled geo-replication, there are three additional copies of your blobs and tables in a different region. If Microsoft declares the region lost, Azure remaps all of the DNS entries to the geo-replicated region.


Be aware that you don't have any control over this process, and it will occur only for a region-wide service disruption. Consider using Azure Site Recovery to achieve better RPO and RTO. Site Recovery allows an application to decide what is an acceptable outage, and when to fail over to the replicated VMs.

Azure-wide service disruption

In disaster planning, you must consider the entire range of possible disasters. One of the most severe service disruptions would involve all Azure regions simultaneously. As with other service disruptions, you might decide to accept the risk of temporary downtime in that event. Widespread service disruptions that span regions are much rarer than isolated service disruptions involving dependent services or single regions.

However, you may decide that certain mission-critical applications require a backup plan for a multi-region service disruption. This plan might include failing over to services in an alternative cloud or a hybrid on-premises and cloud solution.

Reduced application functionality

A well-designed application typically uses services that communicate with each other through the implementation of loosely coupled information-interchange patterns. A DR-friendly application requires separation of responsibilities at the service level. This prevents the disruption of a dependent service from bringing down the entire application. For example, consider a web commerce application for Company Y. The following modules might constitute the application:

  • Product Catalog allows users to browse products.
  • Shopping Cart allows users to add/remove products in their shopping cart.
  • Order Status shows the shipping status of user orders.
  • Order Submission finalizes the shopping session by submitting the order with payment.
  • Order Processing validates the order for data integrity and performs a quantity availability check.

When a service dependency in this application becomes unavailable, how does the service function until the dependency recovers? A well-designed system implements isolation boundaries through separation of responsibilities, both at design time and at runtime. You can categorize every failure as recoverable or non-recoverable. Non-recoverable errors will bring down the service, but you can mitigate a recoverable error through alternatives. Certain problems addressed by automatically handling faults and taking alternate actions are transparent to the user. During a more serious service disruption, the application might be completely unavailable. A third option is to continue handling user requests with reduced functionality.

For instance, if the database for hosting orders goes down, the Order Processing service loses its ability to process sales transactions. Depending on the architecture, it might be difficult or impossible for the Order Submission and Order Processing services of the application to continue. If the application is not designed to handle this scenario, the entire application might go offline. However, if the product data is stored in a different location, then the Product Catalog module can still be used for viewing products. However, other parts of the application are unavailable, such as ordering or inventory queries.

Deciding what reduced application functionality is available is both a business decision and a technical decision. You must decide how the application will inform the users of any temporary problems. In the example above, the application might allow viewing products and adding them to a shopping cart. However, when the user attempts to make a purchase, the application notifies the user that the ordering functionality is temporarily unavailable. This isn't ideal for the customer, but it does prevent an application-wide service disruption.
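One way to reason about reduced functionality is to map each module to the dependencies it needs and expose only the modules whose dependencies are healthy. The sketch below is illustrative only: the dependency names (`product_db`, `session_store`, `orders_db`, `payment_gateway`) and the mapping itself are assumptions made for this example, not something the Company Y scenario specifies.

```python
# Hypothetical mapping from each module to the services it requires.
FEATURE_DEPENDENCIES = {
    "product_catalog": {"product_db"},
    "shopping_cart": {"product_db", "session_store"},
    "order_status": {"orders_db"},
    "order_submission": {"orders_db", "payment_gateway"},
    "order_processing": {"orders_db"},
}


def available_features(healthy):
    """List the modules whose dependencies are all currently healthy."""
    healthy = set(healthy)
    return sorted(
        name for name, deps in FEATURE_DEPENDENCIES.items() if deps <= healthy
    )


def user_message(feature, healthy):
    """Return a notice for an unavailable module, or None if it is usable."""
    if feature in available_features(healthy):
        return None
    return f"{feature.replace('_', ' ').title()} is temporarily unavailable."
```

With the orders database down, only the catalog and cart remain available in this model, and a purchase attempt produces a notice instead of an application-wide failure.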

Data strategies for disaster recovery

Proper data handling is a challenging aspect of a disaster recovery plan. During the recovery process, data restoration typically takes the most time. Different choices for reducing functionality result in difficult challenges for data recovery from failure and consistency after failure.

One consideration is the need to restore or maintain a copy of the application's data. You will use this data for reference and transactional purposes at a secondary site. An on-premises deployment requires an expensive and lengthy planning process to implement a multiple-region disaster recovery strategy. Conveniently, most cloud providers, including Azure, readily allow the deployment of applications to multiple regions. These regions are geographically distributed in such a way that multiple-region service disruption should be extremely rare. The strategy for handling data across regions is one of the contributing factors for the success of any disaster recovery plan.

The following sections discuss disaster recovery techniques related to data backups, reference data, and transactional data.

Backup and restore

Regular backups of application data can support some disaster recovery scenarios. Different storage resources require different techniques.

SQL Database

For the Basic, Standard, and Premium SQL Database tiers, you can take advantage of point-in-time restore to recover your database. For more information, see Overview: Cloud business continuity and database disaster recovery with SQL Database. Another option is to use Active Geo-Replication for SQL Database. This automatically replicates database changes to secondary databases in the same Azure region or even in a different Azure region. This provides a potential alternative to some of the more manual data synchronization techniques presented in this article. For more information, see Overview: SQL Database Active Geo-Replication.

You can also use a more manual approach for backup and restore. Use the DATABASE COPY command to create a backup copy of the database with transactional consistency. You can also use the import/export service of Azure SQL Database, which supports exporting databases to BACPAC files (compressed files containing your database schema and associated data) that are stored in Azure Blob storage.

The built-in redundancy of Azure Storage creates two replicas of the backup file in the same region. However, the frequency of running the backup process determines your RPO, which is the amount of data you might lose in disaster scenarios. For example, imagine that you perform a backup at the top of each hour, and a disaster occurs two minutes before the top of the hour. You lose 58 minutes of data recorded after the last backup was performed. Also, to protect against a region-wide service disruption, you should copy the BACPAC files to an alternate region. You then have the option of restoring those backups in the alternate region. For more details, see Overview: Cloud business continuity and database disaster recovery with SQL Database.
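The 58-minute figure follows directly from the backup schedule. As a small worked example in Python (the function names and the calendar date are arbitrary choices for illustration), the data-loss window is the time elapsed since the most recent top-of-the-hour backup:

```python
from datetime import datetime, timedelta


def last_hourly_backup(moment):
    """Return the most recent top-of-the-hour backup time before `moment`."""
    return moment.replace(minute=0, second=0, microsecond=0)


def data_loss_window(disaster_time):
    """Return how much recorded data falls after the last hourly backup."""
    return disaster_time - last_hourly_backup(disaster_time)


# A disaster at 9:58, two minutes before the 10:00 backup (arbitrary date).
disaster = datetime(2024, 1, 1, 9, 58)
loss = data_loss_window(disaster)
```

Here `loss` equals `timedelta(minutes=58)`. In the worst case the window approaches the full backup interval, so shortening the interval is the direct way to tighten the RPO under this scheme.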

Azure Storage

For Azure Storage, you can develop a custom backup process or use one of many third-party backup tools. Note that most application designs have additional complexities where storage resources reference each other. For example, consider a SQL database that has a column that links to a blob in Azure Storage. If the backups do not happen simultaneously, the database might have a pointer to a blob that was not backed up before the failure. The application or disaster recovery plan must implement processes to handle this inconsistency after a recovery.
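A post-recovery consistency check for this case could look like the following sketch. It assumes the restored database rows and the list of restored blob names have already been loaded into memory; `find_dangling_references` and the `blob_name` column are hypothetical names, and no Azure SDK calls are shown.

```python
def find_dangling_references(rows, restored_blob_names, column="blob_name"):
    """Return the rows whose blob pointer has no matching restored blob."""
    restored = set(restored_blob_names)
    return [row for row in rows if row[column] not in restored]


# Example: the database backup ran after invoice-3 was written,
# but the blob backup ran before it.
rows = [
    {"id": 1, "blob_name": "invoice-1.pdf"},
    {"id": 2, "blob_name": "invoice-2.pdf"},
    {"id": 3, "blob_name": "invoice-3.pdf"},
]
restored_blobs = ["invoice-1.pdf", "invoice-2.pdf"]
dangling = find_dangling_references(rows, restored_blobs)
```

What to do with the dangling rows (flag them, remove them, or ask users to upload again) is an application decision that the recovery plan should state explicitly.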

Other data platforms

Other infrastructure-as-a-service (IaaS) hosted data platforms, such as Elasticsearch or MongoDB, have their own capabilities and considerations when creating an integrated backup and restore process. For these data platforms, the general recommendation is to use any native or available integration-based replication or snapshotting capabilities. If those capabilities do not exist or are not suitable, then consider using Azure Backup Service or managed/unmanaged disk snapshots to create a point-in-time copy of application data. In all cases, it's important to determine how to achieve consistent backups, especially when application data spans multiple file systems, or when multiple drives are combined into a single file system using volume managers or software-based RAID.

Reference data pattern for disaster recovery

Reference data is read-only data that supports application functionality. It typically does not change frequently. Although backup and restore is one method to handle region-wide service disruptions, the RTO is relatively long. When you deploy the application to a secondary region, some strategies can improve the RTO for reference data.

Because reference data changes infrequently, you can improve the RTO by maintaining a permanent copy of the reference data in the secondary region. This eliminates the time required to restore backups in the event of a disaster. To meet the multiple-region disaster recovery requirements, you must deploy the application and the reference data together in multiple regions. As mentioned in Reference data pattern for high availability, you can deploy reference data to the role itself, to external storage, or to a combination of both.

The reference data deployment model within compute nodes implicitly satisfies the disaster recovery requirements. Reference data deployment to SQL Database requires that you deploy a copy of the reference data to each region. The same strategy applies to Azure Storage. You must deploy a copy of any reference data that's stored in Azure Storage to the primary and secondary regions.


You must implement your own application-specific backup routines for all data, including reference data. Geo-replicated copies across regions are used only in a region-wide service disruption. To prevent extended downtime, deploy the mission-critical parts of the application's data to the secondary region. For an example of this topology, see the active-passive model.

Transactional data pattern for disaster recovery

Implementation of a fully functional disaster mode strategy requires asynchronous replication of the transactional data to the secondary region. The practical time windows within which the replication can occur will determine the RPO characteristics of the application. You might still recover the data that was lost from the primary region during the replication window. You might also be able to merge with the secondary region later.

The following architecture examples provide some ideas on different ways of handling transactional data in a failover scenario. It's important to note that these examples are not exhaustive. For example, intermediate storage locations such as queues might be replaced with Azure SQL Database. The queues themselves might be either Azure Storage or Azure Service Bus queues (see Azure queues and Service Bus queues - compared and contrasted). Server storage destinations might also vary, such as Azure tables instead of SQL Database. In addition, worker roles might be inserted as intermediaries in various steps. The intent is not to emulate these architectures exactly, but to consider various alternatives in the recovery of transactional data and related modules.

Replication of transactional data in preparation for disaster recovery

Consider an application that uses Azure Storage queues to hold transactional data. This allows worker roles to process the transactional data to the server database in a decoupled architecture. This requires the transactions to use some form of temporary caching if the front-end roles require the immediate query of that data. Depending on the level of data-loss tolerance, you might choose to replicate the queues, the database, or all of the storage resources. With only database replication, if the primary region goes down, you can still recover the data in the queues when the primary region comes back.

The following diagram shows an architecture where the server database is synchronized across regions.


The biggest challenge to implementing this architecture is the replication strategy between regions. The Azure SQL Data Sync service enables this type of replication. As of this writing, the service is in preview and is not yet recommended for production environments. For more information, see Overview: Cloud business continuity and database disaster recovery with SQL Database. For production applications, you must invest in a third-party solution or create your own replication logic in code. Depending on the architecture, the replication might be bidirectional, which is more complex.

One potential implementation might make use of the intermediate queue in the previous example. The worker role that processes the data to the final storage destination might make the change in both the primary region and the secondary region. These are not trivial tasks, and complete guidance for replication code is beyond the scope of this article. Invest significant time and testing into the approach for replicating data to the secondary region. Additional processing and testing can help ensure that the failover and recovery processes correctly handle any possible data inconsistencies or duplicate transactions.
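To make the dual-write idea concrete, here is a deliberately simplified Python sketch. It is not replication guidance: `RegionStore` is an in-memory stand-in for a regional database, `process_message` stands in for the worker role, and the transaction ID used for duplicate detection is an assumption about the message format.

```python
class RegionStore:
    """In-memory stand-in for a regional database (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.available = True

    def apply(self, txn_id, payload):
        """Write once per transaction ID; return False for a duplicate."""
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        if txn_id in self.rows:
            return False
        self.rows[txn_id] = payload
        return True


def process_message(message, primary, secondary, retry_queue):
    """Apply a queued transaction to both regions."""
    primary.apply(message["id"], message["payload"])
    try:
        secondary.apply(message["id"], message["payload"])
    except ConnectionError:
        # Keep the message so the secondary region can catch up later.
        retry_queue.append(message)
```

Keying writes on a transaction ID makes reprocessing the same message harmless, which addresses the duplicate-transaction concern. The retry queue covers only the case where the secondary is unreachable; a production design would also need to handle partial failures in the primary write and ordering between regions.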


Most of this paper focuses on platform as a service (PaaS). However, hybrid applications that use Azure Virtual Machines have additional replication and availability options. These hybrid applications use infrastructure as a service (IaaS) to host SQL Server on virtual machines in Azure. This allows traditional availability approaches in SQL Server, such as AlwaysOn Availability Groups or Log Shipping. Some techniques, such as AlwaysOn, work only between on-premises SQL Server instances and Azure virtual machines. For more information, see High availability and disaster recovery for SQL Server in Azure Virtual Machines.

Reduced application functionality for transaction capture

Consider a second architecture that operates with reduced functionality. The application in the secondary region deactivates nonessential functionality, such as reporting, business intelligence (BI), or draining queues. It accepts only the most important types of transactional workflows, as defined by business requirements. The system captures the transactions and writes them to queues. The system might postpone processing the data during the initial stage of the service disruption. If the system in the primary region is reactivated within the expected time window, the worker roles in the primary region can drain the queues. This process eliminates the need for database merging. If the primary region service disruption goes beyond the tolerable window, the application in the secondary region can start processing the queues.
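The decision described above (capture now, process later) can be sketched as a small state check. Everything here is illustrative: `SecondaryRegion`, `tick`, and the four-hour `TOLERABLE_WINDOW` are hypothetical names and values, and the in-memory deque stands in for a durable queue.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical limit; in practice the business defines this value.
TOLERABLE_WINDOW = timedelta(hours=4)


class SecondaryRegion:
    """Captures transactions during an outage and defers processing."""

    def __init__(self, outage_started: datetime):
        self.outage_started = outage_started
        self.pending = deque()   # captured but unprocessed transactions
        self.processed = []      # transactions processed locally

    def capture(self, transaction) -> None:
        """Accept a critical transaction and queue it without processing."""
        self.pending.append(transaction)

    def tick(self, now: datetime, primary_available: bool) -> str:
        """Decide who drains the queue: 'primary', 'secondary', or 'wait'."""
        if primary_available:
            # Primary is back in time: its workers drain the queue,
            # so no database merge is needed.
            return "primary"
        if now - self.outage_started > TOLERABLE_WINDOW:
            # Outage exceeded the tolerable window: process locally.
            while self.pending:
                self.processed.append(self.pending.popleft())
            return "secondary"
        return "wait"
```

For example, with an outage that started at midnight, calling `tick` one hour later returns `"wait"`, while calling it five hours later drains the queue locally and returns `"secondary"`.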

In this scenario, the database in the secondary region contains incremental transactional data that must be merged after the primary region is reactivated. The following diagram shows this strategy for temporarily storing transactional data until the primary region is restored.


For more discussion of data management techniques for resilient Azure applications, see Failsafe: Guidance for Resilient Cloud Architectures.

Deployment topologies for disaster recovery

You must prepare mission-critical applications to handle region-wide service disruptions. Incorporate a multiple-region deployment strategy into the operational planning.

Multiple-region deployments might involve IT processes to publish the application and reference data to the secondary region after a disaster. If the application requires instant failover, the deployment process might involve an active/passive setup or an active/active setup. This type of deployment has existing instances of the application running in the alternate region. A routing service such as Azure Traffic Manager provides load-balancing services at the DNS level. It can detect service disruptions and route users to different regions when needed.
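The routing behavior described above can be illustrated with a small priority-based selection function. This is only a sketch of the logic, not the Traffic Manager API: `pick_endpoint`, the endpoint names, and the health dictionary are hypothetical.

```python
def pick_endpoint(endpoints, health):
    """Return the name of the preferred healthy endpoint, or None.

    endpoints: list of (name, priority) pairs; a lower number is preferred.
    health: dict mapping endpoint name to the latest probe result (bool).
    Endpoints with no probe result are treated as unhealthy.
    """
    healthy = [entry for entry in endpoints if health.get(entry[0], False)]
    if not healthy:
        return None
    return min(healthy, key=lambda entry: entry[1])[0]


endpoints = [("primary-region", 1), ("secondary-region", 2)]

# Normal operation: both probes pass, so traffic goes to the primary.
normal = pick_endpoint(endpoints, {"primary-region": True, "secondary-region": True})

# Service disruption: the primary probe fails, so traffic moves to the secondary.
failover = pick_endpoint(endpoints, {"primary-region": False, "secondary-region": True})
```

Because the decision is made at the DNS level, clients only see the change once cached DNS records expire, so the record time-to-live affects how quickly users actually move.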

Successful disaster recovery in Azure means building recovery into the solution from the start. The cloud provides additional options for recovering from failures during a disaster that are not available from a traditional hosting provider. Specifically, you can dynamically and quickly allocate resources in a different region, avoiding the cost of idle resources prior to a failure.

The following sections cover different deployment topologies for disaster recovery. Typically, additional availability comes at the price of increased cost or complexity.

Single-region deployment

A single-region deployment is not really a disaster recovery topology, but is meant to contrast with the other architectures. Single-region deployments are common for applications in Azure; however, they do not meet the requirements of a disaster recovery topology.

The following diagram depicts an application running in a single Azure region. Azure Traffic Manager and the use of fault and upgrade domains increase availability of the application within the region.


In this scenario, the database is a single point of failure. Though Azure replicates the data across different fault domains to internal replicas, this replication occurs only within the same region. The application cannot withstand a catastrophic failure. If the region becomes unavailable, then so do the fault domains, including all service instances and storage resources.

For all but the least critical applications, you must devise a plan to deploy your applications across multiple regions. Also weigh RTO and cost constraints when choosing a deployment topology.

Let's now look at specific approaches to supporting failover across different regions. These examples all use two regions to describe the process.

Failover using Azure Site Recovery

When you enable Azure VM replication using Azure Site Recovery, it creates several resources in the secondary region:

  • Resource group.
  • Virtual network (VNet).
  • Storage account.
  • Availability sets to hold VMs after failover.

Data writes on the VM disks in the primary region are continuously transferred to the storage account in the secondary region. Recovery points are generated in the target storage account every few minutes. When you initiate a failover, the recovered VMs are created in the target resource group, VNet, and availability set. During a failover, you can choose any available recovery point.
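One common reason to pick a specific recovery point is to roll back to a moment before data became corrupted. The sketch below shows that selection logic only; it does not call Azure Site Recovery, and `latest_point_at_or_before` and the sample timestamps are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def latest_point_at_or_before(points, cutoff):
    """Return the most recent recovery point not later than cutoff, or None."""
    ordered = sorted(points)
    index = bisect_right(ordered, cutoff)
    return ordered[index - 1] if index else None


# Hypothetical recovery points generated every five minutes.
recovery_points = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 5),
    datetime(2024, 1, 1, 12, 10),
    datetime(2024, 1, 1, 12, 15),
]

# Suppose corruption was first observed at 12:12; recover to the point before it.
chosen = latest_point_at_or_before(recovery_points, datetime(2024, 1, 1, 12, 12))
```

The gap between the chosen point and the moment of failure is the data you lose, which is why the interval between recovery points bounds the achievable recovery point objective.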

Redeployment to a secondary Azure region

In the approach of redeployment to a secondary region, only the primary region has applications and databases running. The secondary region is not set up for an automatic failover. So when a disaster occurs, you must spin up all the parts of the service in the new region. This includes uploading a cloud service to Azure, deploying the cloud service, restoring the data, and changing DNS to reroute the traffic.

Although this is the most affordable of the multiple-region options, it has the worst RTO characteristics. In this model, the service package and database backups are stored either on-premises or in the Azure Blob storage instance of the secondary region. However, you must deploy a new service and restore the data before it resumes operation. Even with full automation of the data transfer from backup storage, provisioning a new database environment consumes a lot of time. Moving data from the backup disk storage to the empty database in the secondary region is the most expensive part of the restore process. You must do this, however, to bring the new database to an operational state because it isn't replicated.

The best approach is to store the service packages in Blob storage in the secondary region. This eliminates the need to upload the package to Azure, which is what happens when you deploy from an on-premises development machine. You can quickly deploy the service packages to a new cloud service from Blob storage by using PowerShell scripts.

This option is practical only for non-critical applications that can tolerate a high RTO. For instance, this might work for an application that can be down for several hours but is required to be available within 24 hours.

Diagram: Redeployment to a secondary Azure region


An active-passive topology is the choice that many companies favor. This topology provides improvements to the RTO with a relatively small increase in cost over the redeployment approach. In this scenario, there is again a primary and a secondary Azure region. All of the traffic goes to the active deployment in the primary region. The secondary region is better prepared for disaster recovery because the database is running in both regions. Additionally, a synchronization mechanism is in place between them. This standby approach can involve two variations: a database-only approach or a complete deployment in the secondary region.

Database only

In the first variation of the active-passive topology, only the primary region has a deployed cloud service application. However, unlike the redeployment approach, both regions are synchronized with the contents of the database. (For more information, see the section on transactional data pattern for disaster recovery.) When a disaster occurs, there are fewer activation requirements. You start the application in the secondary region, change connection strings to the new database, and change the DNS entries to reroute traffic.

As with the redeployment approach, you should have already stored the service packages in Azure Blob storage in the secondary region for faster deployment. However, you don't incur the majority of the overhead that a database restore operation requires, because the database is ready and running. This saves a significant amount of time, making this an affordable DR pattern (and the one most frequently used).


Full replica

In the second variation of the active-passive topology, both the primary region and the secondary region have a full deployment. This deployment includes the cloud services and a synchronized database. However, only the primary region is actively handling network requests from the users. The secondary region becomes active only when the primary region experiences a service disruption. In that case, all new network requests route to the secondary region. Azure Traffic Manager can manage this failover automatically.

Failover occurs faster than in the database-only variation because the services are already deployed. This topology provides a very low RTO. The secondary failover region must be ready to go immediately after failure of the primary region.

Along with a quicker response time, this topology pre-allocates and deploys backup services, avoiding the possibility of a lack of space to allocate new instances during a disaster. This is important if your secondary Azure region is nearing capacity. No service-level agreement (SLA) guarantees that you can instantly deploy one or more new cloud services in any region.

For the fastest response time with this model, you must have similar scale (number of role instances) in the primary and secondary regions. Despite the advantages, paying for unused compute instances is costly, and this might not be the most prudent financial choice. Because of this, it's more common to use a slightly scaled-down version of cloud services in the secondary region. Then you can quickly fail over and scale out the secondary deployment if necessary. You should automate the failover process so that when the primary region becomes inaccessible, you activate additional instances, depending on the load. This might involve the use of an autoscaling mechanism like virtual machine scale sets.

The following diagram shows the model where the primary and secondary regions contain a fully deployed cloud service in an active-passive topology.



In an active-active topology, the cloud services and database are fully deployed in both regions. Unlike the active-passive model, both regions receive user traffic. This option yields the quickest recovery time. The services are already scaled to handle a portion of the load at each region. DNS is already enabled to use the secondary region. There's additional complexity in determining how to route users to the appropriate region. Round-robin scheduling might be possible. It's more likely that certain users would use a specific region where the primary copy of their data resides.

In case of failover, simply disable DNS to the primary region. This routes all traffic to the secondary region.

Even in this model, there are some variations. For example, the following diagram depicts a primary region that owns the master copy of the database. The cloud services in both regions write to that primary database. The secondary deployment can read from the primary or replicated database. Replication in this example is one-way.


There is a downside to the active-active architecture in the preceding diagram. The second region must access the database in the first region because the master copy resides there. Performance drops off significantly when you access data from outside a region. In cross-region database calls, you should consider some type of batching strategy to improve the performance of these calls. For more information, see How to use batching to improve SQL Database application performance.
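To make the batching idea concrete, the following sketch groups rows so that each cross-region call carries several of them. It is an illustration under stated assumptions: `RemoteDatabase` is a hypothetical stand-in that only counts round trips, and the batch size of 4 is arbitrary.

```python
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk


class RemoteDatabase:
    """Hypothetical stand-in that counts cross-region round trips."""

    def __init__(self):
        self.round_trips = 0
        self.rows = []

    def insert_many(self, rows):
        self.round_trips += 1  # one network call per batch
        self.rows.extend(rows)


db = RemoteDatabase()
for chunk in batched(range(10), size=4):
    db.insert_many(chunk)
```

Ten rows sent one at a time would cost ten round trips; in batches of four they cost three. The tradeoff is that a failed call now affects a whole batch, so the retry and error-handling logic has to work per batch.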

An alternative architecture might involve each region accessing its own database directly. In that model, some type of bidirectional replication is required to synchronize the databases in each region.

With the previous topologies, decreasing RTO generally increases costs and complexity. The active-active topology deviates from this cost pattern. In the active-active topology, you might not need as many instances in the primary region as you would in the active-passive topology. If you have 10 instances in the primary region in an active-passive architecture, you might need only 5 in each region in an active-active architecture. Both regions now share the load. This can save costs compared with an active-passive topology in which the passive region keeps a warm standby of 10 instances waiting for failover.
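The instance arithmetic above can be written out as a short calculation. The function below is a simplified sketch: it assumes two regions, a full warm standby in the active-passive case, and an even split of load in the active-active case. `total_instances` is a hypothetical helper, not part of any Azure tooling.

```python
import math


def total_instances(topology: str, peak_instances: int) -> int:
    """Total running instances across two regions for a given peak load."""
    if topology == "active-passive":
        # Primary carries the full load; the passive region mirrors it as a warm standby.
        return peak_instances * 2
    if topology == "active-active":
        # Each region carries half the load (rounded up).
        return 2 * math.ceil(peak_instances / 2)
    raise ValueError(f"unknown topology: {topology}")


active_passive = total_instances("active-passive", 10)
active_active = total_instances("active-active", 10)
```

With 10 instances needed at peak, the warm-standby design runs 20 instances and the active-active design runs 10. The saving holds only in normal operation: after a failover, the surviving active-active region has to scale out to carry the whole load.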

Keep in mind that until you restore the primary region, the secondary region might receive a sudden surge of new users. If there are 10,000 users in each region when the primary region experiences a service disruption, the secondary region suddenly has to handle 20,000 users. Monitoring rules on the secondary region must detect this increase and double the instances in the secondary region. For more information on this, see the section on failure detection.
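A monitoring rule of this kind reduces to a sizing calculation. The sketch below assumes, purely for illustration, that one instance can serve 2,000 users; `instances_needed` and that capacity figure are hypothetical and would come from your own load testing.

```python
import math

USERS_PER_INSTANCE = 2_000  # hypothetical capacity from load testing


def instances_needed(users: int, users_per_instance: int = USERS_PER_INSTANCE) -> int:
    """Return the number of instances required to serve the given user count."""
    if users_per_instance < 1:
        raise ValueError("users_per_instance must be at least 1")
    return math.ceil(users / users_per_instance)


before_failover = instances_needed(10_000)  # secondary region's normal share
after_failover = instances_needed(20_000)   # secondary region absorbs all users
scale_out_by = after_failover - before_failover
```

Under this assumption the secondary region goes from 5 to 10 instances, which matches the doubling described above. Rounding up matters: 10,001 users would already require a sixth instance.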

Hybrid on-premises and cloud solution

One additional strategy for disaster recovery is to architect a hybrid application that runs on-premises and in the cloud. Depending on the application, the primary region might be either location. Consider the previous architectures and imagine the primary or secondary region as an on-premises location.

There are some challenges in these hybrid architectures. First, most of this article has addressed PaaS architecture patterns. Typical PaaS applications in Azure rely on Azure-specific constructs such as roles, cloud services, and Traffic Manager. Creating an on-premises solution for this type of PaaS application would require a significantly different architecture. This might not be feasible from a management or cost perspective.

However, a hybrid solution for disaster recovery has fewer challenges for traditional architectures that have been migrated to the cloud, such as IaaS-based architectures. IaaS applications use virtual machines in the cloud that can have direct on-premises equivalents. You can also use virtual networks to connect machines in the cloud with on-premises network resources. This opens up several options that are not available to PaaS-only applications. For example, SQL Server can take advantage of disaster recovery solutions such as AlwaysOn Availability Groups and database mirroring. For details, see High availability and disaster recovery for SQL Server in Azure virtual machines.

IaaS solutions also provide an easier path for on-premises applications to use Azure as the failover option. You might have a fully functioning application in an existing on-premises region. However, what if you lack the resources to maintain a geographically separate region for failover? You might decide to use virtual machines and virtual networks to get your application running in Azure. In that case, define processes that synchronize data to the cloud. The Azure deployment then becomes the secondary region to use for failover. The primary region remains the on-premises application. For more information about IaaS architectures and capabilities, see the Virtual Machines documentation.

Alternative cloud

There are situations where the broad capabilities of Microsoft Azure still may not meet internal compliance rules or policies required by your organization. Even the best preparation and design to implement backup systems during a disaster are inadequate during a global service disruption of a cloud service provider.

You should compare availability requirements with the cost and complexity of increased availability. Perform a risk analysis, and define the RTO and RPO for your solution. If your application cannot tolerate any downtime, you might consider using an additional cloud solution. Unless the entire Internet goes down, another cloud solution might still be available if Azure becomes globally inaccessible.

As with the hybrid scenario, the failover deployments in the previous disaster recovery architectures can also exist within another cloud solution. Alternative cloud DR sites should be used only for solutions whose RTO allows very little, if any, downtime. Note that a solution that uses a DR site outside Azure will require more work to configure, develop, deploy, and maintain. It's also more difficult to implement proven practices in a cross-cloud architecture. Although cloud platforms have similar high-level concepts, the APIs and architectures are different.

If your DR strategy relies upon multiple cloud platforms, it's valuable to include abstraction layers in the design of the solution. This eliminates the need to develop and maintain two different versions of the same application for different cloud platforms in case of disaster. As with the hybrid scenario, the use of Azure Virtual Machines or Azure Container Service might be easier in these cases than the use of cloud-specific PaaS designs.
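An abstraction layer of this kind might look like the following sketch. The interface and class names (`ObjectStore`, `InMemoryStore`, `archive_order`) are hypothetical; in a real system each provider would get its own adapter that wraps that provider's SDK behind the same interface, whereas here both are in-memory stand-ins.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Provider-neutral interface that the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(ObjectStore):
    """Stand-in adapter; a real one would wrap a specific cloud SDK."""

    def __init__(self, provider: str):
        self.provider = provider
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_order(store: ObjectStore, order_id: str, body: bytes) -> str:
    """Application logic: identical regardless of which cloud backs the store."""
    key = f"orders/{order_id}"
    store.put(key, body)
    return key


azure_store = InMemoryStore("azure")
other_store = InMemoryStore("other-cloud")
for store in (azure_store, other_store):
    archive_order(store, "1001", b"example order")
```

Only the adapters differ between platforms, so a single version of `archive_order` runs against either one. The cost is that the interface is limited to features both platforms share.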


Some of the patterns that we just discussed require quick activation of offline deployments as well as restoration of specific parts of a system. Automation scripts can activate resources on demand and deploy solutions rapidly. The DR-related automation examples below use Azure PowerShell, but the Azure CLI and the Service Management REST API are also good options.

Automation scripts manage aspects of DR that Azure does not handle transparently. This produces consistent and repeatable results, minimizing human error. Predefined DR scripts also reduce the time to rebuild a system and its constituent parts during a disaster. You don't want to be working out manually how to restore your site while it's down and losing money every minute.

Test your scripts repeatedly from start to finish. After verifying their basic functionality, make sure to test them in disaster simulation. This helps uncover defects in the scripts or processes.

A best practice with automation is to create a repository of PowerShell scripts or command-line interface (CLI) scripts for Azure disaster recovery. Clearly mark and categorize them for quick access. Designate a primary person to manage the repository and versioning of the scripts. Document them well with explanations of parameters and examples of script use. Also ensure that you keep this documentation in sync with your Azure deployments. This underscores the purpose of having a primary person in charge of all parts of the repository.

Failure detection

To correctly handle problems with availability and disaster recovery, you must be able to detect and diagnose failures. Perform advanced server and deployment monitoring to quickly recognize when a system or its components suddenly become unavailable. Monitoring tools that assess the overall health of the cloud service and its dependencies can perform part of this work. One suitable Microsoft tool is System Center 2016. Third-party tools can also provide monitoring capabilities. Most monitoring solutions track key performance counters and service availability.

Although these tools are vital, you must plan for fault detection and reporting within a cloud service. You must also plan to properly use Azure Diagnostics. Custom performance counters or event-log entries can also be part of the overall strategy. This provides more data during failures to quickly diagnose the problem and restore full capabilities. It also provides additional metrics that the monitoring tools can use to determine application health. For more information, see Enabling Azure Diagnostics in Azure Cloud Services. For a discussion of how to plan for an overall "health model," see Failsafe: Guidance for Resilient Cloud Architectures.
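As one example of in-application fault detection, a service can track probe results for a dependency and report it unhealthy only after several consecutive failures, so that a single transient error does not trigger a failover. This is a hypothetical sketch; `HealthMonitor` and the threshold of 3 are assumptions, not part of Azure Diagnostics.

```python
class HealthMonitor:
    """Tracks consecutive probe failures for one dependency."""

    def __init__(self, failure_threshold: int = 3):
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True while the dependency counts as healthy."""
        if probe_ok:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.failure_threshold


monitor = HealthMonitor(failure_threshold=3)
probe_results = [True, False, False, True, False, False, False]
health_history = [monitor.record(result) for result in probe_results]
```

In this run the dependency is reported unhealthy only on the last probe, after three failures in a row. The `consecutive_failures` value is also the sort of figure that could be published as a custom performance counter for external monitoring tools to read.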

Disaster simulation

Simulation testing involves creating small real-life situations on the work floor to observe how the team members react. Simulations also show how effective the solutions in the recovery plan are. Execute simulations so that the created scenarios don't disrupt actual business, while still feeling like real situations.

Consider architecting a type of "switchboard" in the application to manually simulate availability issues. For instance, a soft switch can make an ordering module malfunction by triggering database access exceptions in it. You can take similar lightweight approaches for other modules at the network interface level.
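A switchboard of this kind can be very small. In the sketch below, flipping a switch for a module makes its database access raise an exception until the switch is restored. All names (`Switchboard`, `DatabaseAccessError`, `place_order`, the `"ordering"` module key) are hypothetical.

```python
class DatabaseAccessError(Exception):
    """Raised when the simulated database is unavailable."""


class Switchboard:
    """Soft switches that force named modules to fail on demand."""

    def __init__(self):
        self._failing = set()

    def fail(self, module: str) -> None:
        self._failing.add(module)

    def restore(self, module: str) -> None:
        self._failing.discard(module)

    def check(self, module: str) -> None:
        """Raise if the module's switch is set to fail; otherwise do nothing."""
        if module in self._failing:
            raise DatabaseAccessError(f"simulated database failure in {module}")


def place_order(switchboard: Switchboard, order_id: str) -> dict:
    """Ordering module entry point; consults the switchboard before database work."""
    switchboard.check("ordering")
    return {"order_id": order_id, "status": "accepted"}


switchboard = Switchboard()
normal_result = place_order(switchboard, "1001")

switchboard.fail("ordering")      # start the simulated outage
try:
    place_order(switchboard, "1002")
    outage_observed = False
except DatabaseAccessError:
    outage_observed = True

switchboard.restore("ordering")   # end the simulation
recovered_result = place_order(switchboard, "1003")
```

Because the switch is restored with a single call, the simulation stays fully controllable, which matters when a drill has to be stopped quickly.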

The simulation highlights any issues that were inadequately addressed. The simulated scenarios must be completely controllable. This means that, even if the recovery plan seems to be failing, you can restore the situation back to normal without causing any significant damage. It's also important that you inform higher-level management about when and how the simulation exercises will be executed. This plan should detail the time or resources affected during the simulation. Also define the measures of success when testing your disaster recovery plan.

If you are using Azure Site Recovery, you can execute a test failover to Azure to validate your replication strategy or perform a disaster recovery drill without any data loss or downtime. A test failover does not affect the ongoing VM replication or your production environment.

Several other techniques can test disaster recovery plans. However, most of them are simply variations of these basic techniques. The intent of this testing is to evaluate the feasibility of the recovery plan. Disaster recovery testing focuses on the details to discover gaps in the basic recovery plan.

Service-specific guidance

The following topics describe disaster recovery for specific Azure services:

Service | Topic
Cloud Services | What to do in the event of an Azure service disruption that impacts Azure Cloud Services
Key Vault | Azure Key Vault availability and redundancy
Storage | What to do if an Azure Storage outage occurs
SQL Database | Restore an Azure SQL Database or failover to a secondary
Virtual machines | What to do in the event that an Azure service disruption impacts Azure virtual machines
Virtual networks | Virtual Network – Business Continuity