Run a web application in multiple Azure regions for high availability

This reference architecture shows how to run an Azure App Service application in multiple regions to achieve high availability.

Figure: Reference architecture for a web application with high availability

Download a Visio file of this architecture.

Architecture

This architecture builds on the one shown in Improve scalability in a web application. The main differences are:

  • Primary and secondary regions. This architecture uses two regions to achieve higher availability. The application is deployed to each region. During normal operations, network traffic is routed to the primary region. If the primary region becomes unavailable, traffic is routed to the secondary region.
  • Front Door. Front Door routes incoming requests to the primary region. If the application running in that region becomes unavailable, Front Door fails over to the secondary region.
  • Geo-replication of SQL Database and/or Cosmos DB.

A multi-region architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use Front Door to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.

There are several general approaches to achieving high availability across regions:

  • Active/passive with hot standby. Traffic goes to one region, while the other waits on hot standby. Hot standby means the VMs in the secondary region are allocated and running at all times.
  • Active/passive with cold standby. Traffic goes to one region, while the other waits on cold standby. Cold standby means the VMs in the secondary region are not allocated until they are needed for failover. This approach costs less to run, but generally takes longer to come online during a failure.
  • Active/active. Both regions are active, and requests are load balanced between them. If one region becomes unavailable, it is taken out of rotation.

This reference architecture focuses on active/passive with hot standby, using Front Door for failover.

Recommendations

Your requirements might differ from the architecture described here. Use the recommendations in this section as a starting point.

Regional pairing

Each Azure region is paired with another region within the same geography. In general, choose regions from the same regional pair (for example, East US 2 and Central US). Benefits of doing so include:

  • If there is a broad outage, recovery of at least one region out of every pair is prioritized.
  • Planned Azure system updates are rolled out to paired regions sequentially to minimize possible downtime.
  • In most cases, regional pairs reside within the same geography to meet data residency requirements.

However, make sure that both regions support all of the Azure services needed for your application. See Services by region. For more information about regional pairs, see Business continuity and disaster recovery (BCDR): Azure Paired Regions.

Resource groups

Consider placing the primary region, secondary region, and Front Door into separate resource groups. This lets you manage the resources deployed to each region as a single collection.

Front Door configuration

Routing. Front Door supports several routing mechanisms. For the scenario described in this article, use priority routing. With this setting, Front Door sends all requests to the primary region unless the endpoint for that region becomes unreachable. At that point, it automatically fails over to the secondary region. Assign different priority values to the backends in the backend pool: 1 for the active region, and 2 or higher for the standby or passive region.
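
As a rough illustration of priority routing, the following Python sketch picks the healthy backend with the lowest priority value. It is a simplified model, not Front Door's actual implementation. The `Backend` type, the backend names, and the choice to return `None` when nothing is healthy are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Backend:
    name: str       # hypothetical label for the region's endpoint
    priority: int   # lower value is preferred (1 = active region)
    healthy: bool   # outcome of the most recent health evaluation


def pick_backend(backends: Sequence[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the lowest priority value.

    Returns None when no backend is healthy (an assumption of this sketch).
    """
    healthy = [b for b in backends if b.healthy]
    return min(healthy, key=lambda b: b.priority, default=None)


# Normal operation: both regions healthy, so the primary is chosen.
normal = [Backend("primary", 1, True), Backend("secondary", 2, True)]

# Primary region unreachable: traffic fails over to the secondary.
outage = [Backend("primary", 1, False), Backend("secondary", 2, True)]
```

With both regions healthy, `pick_backend(normal)` selects the primary. When the primary is marked unhealthy, the same call on `outage` selects the secondary without any change to the priority values.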

Health probe. Front Door uses an HTTP (or HTTPS) probe to monitor the availability of each backend. The probe gives Front Door a pass/fail test for failing over to the secondary region. It works by sending a request to a specified URL path. If it doesn't get a 200 response within the timeout period, the probe fails. You can configure the health probe frequency, the number of samples required for evaluation, and the number of successful samples required for the backend to be marked as healthy. If Front Door marks the backend as degraded, it fails over to the other backend. For details, see Health Probes.
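
The sample-based evaluation can be pictured with a small Python sketch. It keeps the most recent probe results in a fixed-size window and reports the backend as healthy only if enough of them succeeded. The class name, the parameter names, and the rule that a backend counts as healthy until the window is full are assumptions for illustration, not Front Door's documented behavior.

```python
from collections import deque
from typing import Optional


class ProbeWindow:
    """Sliding window over the most recent health probe results."""

    def __init__(self, sample_size: int, successful_samples_required: int) -> None:
        if not 0 < successful_samples_required <= sample_size:
            raise ValueError("required successes must be between 1 and sample_size")
        self.samples = deque(maxlen=sample_size)
        self.required = successful_samples_required

    def record(self, status_code: Optional[int]) -> None:
        # None models a probe that timed out; only HTTP 200 counts as a pass.
        self.samples.append(status_code == 200)

    def is_healthy(self) -> bool:
        # Assumption: treat the backend as healthy until the window is full.
        if len(self.samples) < self.samples.maxlen:
            return True
        return sum(self.samples) >= self.required


window = ProbeWindow(sample_size=4, successful_samples_required=2)
for status in (200, 200, 500, None):
    window.record(status)
healthy_before = window.is_healthy()   # two of the last four probes passed

window.record(None)                    # one more timeout pushes out a success
healthy_after = window.is_healthy()    # only one of the last four passed
```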

As a best practice, create a health probe path in your application backend that reports the overall health of the application. This health probe should check critical dependencies such as the App Service apps, storage queue, and SQL Database. Otherwise, the probe might report a healthy backend when critical parts of the application are actually failing. On the other hand, don't use the health probe to check lower priority services. For example, if an email service goes down, the application can switch to a second provider or just send emails later. For further discussion of this design pattern, see Health Endpoint Monitoring Pattern.
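
A health probe path of this kind might aggregate its checks along the following lines. This is a minimal, framework-independent sketch. The check names and the 503 status for a failing dependency are assumptions, and the lambdas stand in for real calls to SQL Database, the storage queue, and so on.

```python
from typing import Callable, Dict, Tuple


def overall_health(
    critical_checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, str]]:
    """Run every critical dependency check and build a probe response.

    Returns (HTTP status code, per-dependency report). The status is 200
    only when all critical checks pass; otherwise it is 503.
    """
    report: Dict[str, str] = {}
    for name, check in critical_checks.items():
        try:
            ok = check()
        except Exception:
            # A check that raises is treated the same as a failed check.
            ok = False
        report[name] = "ok" if ok else "failing"
    status = 200 if all(v == "ok" for v in report.values()) else 503
    return status, report


# Hypothetical checks. A low-priority email service is deliberately left out,
# so that an email outage alone does not trigger a regional failover.
status, report = overall_health({
    "sql_database": lambda: True,
    "storage_queue": lambda: False,
})
```

In a web framework, the handler for the probe path would return `status` as the HTTP status code and could include `report` in the response body for diagnostics.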

SQL Database

Use Active Geo-Replication to create a readable secondary replica in a different region. You can have up to four readable secondary replicas. Fail over to a secondary database if your primary database fails or needs to be taken offline. Active Geo-Replication can be configured for any database in any elastic database pool.

Cosmos DB

Cosmos DB supports geo-replication across regions with multiple write regions. Alternatively, you can designate one region as the writable region and the others as read-only replicas. If there is a regional outage, you can fail over by selecting another region to be the write region. The client SDK automatically sends write requests to the current write region, so you don't need to update the client configuration after a failover. For more information, see Global data distribution with Azure Cosmos DB.

Note

All of the replicas belong to the same resource group.

Storage

For Azure Storage, use read-access geo-redundant storage (RA-GRS). With RA-GRS storage, the data is replicated to a secondary region. You have read-only access to the data in the secondary region through a separate endpoint. If there is a regional outage or disaster, the Azure Storage team might decide to perform a geo-failover to the secondary region. There is no customer action required for this failover.
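
The separate read-only endpoint follows a naming convention: the account name in the host is suffixed with `-secondary`. The following Python sketch derives the secondary URL from a primary one. The account name `contoso` and the helper name are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def secondary_endpoint(primary_url: str) -> str:
    """Derive the RA-GRS read-only URL from a primary storage URL.

    Appends "-secondary" to the account name, which is the first label
    of the host name.
    """
    parts = urlsplit(primary_url)
    account, dot, rest = parts.netloc.partition(".")
    if not account or not dot or not rest:
        raise ValueError(f"not a storage endpoint URL: {primary_url!r}")
    if account.endswith("-secondary"):
        return primary_url  # already points at the secondary endpoint
    return urlunsplit(parts._replace(netloc=f"{account}-secondary.{rest}"))


# "contoso" is a hypothetical storage account name.
blob_url = "https://contoso.blob.core.windows.net/images/logo.png"
read_only_url = secondary_endpoint(blob_url)
```

Requests sent to the secondary endpoint can only read data. Writes still have to go to the primary endpoint.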

For Queue storage, create a backup queue in the secondary region. During failover, the app can use the backup queue until the primary region becomes available again. That way, the application can still process new requests.
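
One way to sketch the backup queue approach is a small wrapper that tries the primary queue first and falls back to the backup queue when the send fails. The two send functions are hypothetical stand-ins for your queue client calls, and treating `ConnectionError` as the signal for a regional failure is an assumption of this example.

```python
from typing import Callable


def enqueue_with_fallback(
    message: str,
    send_primary: Callable[[str], None],
    send_backup: Callable[[str], None],
) -> str:
    """Send a message to the primary queue, or to the backup queue on failure.

    Returns the name of the queue that accepted the message.
    """
    try:
        send_primary(message)
        return "primary"
    except ConnectionError:
        # Primary region unavailable: keep accepting work through the backup.
        send_backup(message)
        return "backup"


# In-memory lists stand in for the two queues in this demonstration.
backup_queue = []


def failing_send(message: str) -> None:
    raise ConnectionError("primary region unavailable")


used = enqueue_with_fallback("order-42", failing_send, backup_queue.append)
```

Workers then need to drain both queues, so that messages written to the backup queue during the outage are still processed after the primary region recovers.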

Availability considerations

Consider these points when designing for high availability across regions.

Azure Front Door

Azure Front Door automatically fails over if the primary region becomes unavailable. When Front Door fails over, there is a period of time (usually about 20-60 seconds) when clients cannot reach the application. The duration is affected by the following factors:

  • Frequency of health probes. The more frequently health probes are sent, the faster Front Door can detect downtime or a backend coming back healthy.
  • Sample size configuration. This configuration controls how many samples are required for the health probe to detect that the primary backend has become unreachable. If this value is too low, you could get false positives from intermittent issues.
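
To see how these two settings interact, the following Python sketch gives a back-of-the-envelope estimate of detection time. It assumes a simplified model in which a backend is marked degraded once fewer than the required number of the last `sample_size` probes succeeded, starting from a window of all successes. It ignores probe timeouts, probes sent from multiple edge locations, and propagation delays, so treat the result as a rough figure only.

```python
def failures_to_degrade(sample_size: int, successful_samples_required: int) -> int:
    """Consecutive failed probes needed before the backend counts as degraded.

    Starting from a window of all successes, the backend is degraded once
    fewer than `successful_samples_required` successes remain in the window.
    """
    if not 0 < successful_samples_required <= sample_size:
        raise ValueError("required successes must be between 1 and sample_size")
    return sample_size - successful_samples_required + 1


def estimated_detection_seconds(
    probe_interval_seconds: int,
    sample_size: int,
    successful_samples_required: int,
) -> int:
    """Approximate time to detect an outage under the simplified model."""
    return probe_interval_seconds * failures_to_degrade(
        sample_size, successful_samples_required
    )


# Hypothetical settings: probe every 10 s, 4 samples, 2 successes required.
fast = estimated_detection_seconds(10, 4, 2)   # 3 failed probes * 10 s
slow = estimated_detection_seconds(30, 4, 2)   # 3 failed probes * 30 s
```

Under these assumptions, tripling the probe interval triples the estimated detection time, while requiring more successful samples reduces the number of failures needed and makes the backend more sensitive to intermittent issues.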

Front Door is a possible failure point in the system. If the service fails, clients cannot access your application during the downtime. Review the Front Door service level agreement (SLA) and determine whether using Front Door alone meets your business requirements for high availability. If not, consider adding another traffic management solution as a fallback. If the Front Door service fails, change your canonical name (CNAME) records in DNS to point to the other traffic management service. This step must be performed manually, and your application will be unavailable until the DNS changes are propagated.

SQL Database

The recovery point objective (RPO) and estimated recovery time (ERT) for SQL Database are documented in Overview of business continuity with Azure SQL Database.

Storage

RA-GRS storage provides durable storage, but it's important to understand what can happen during an outage:

  • If a storage outage occurs, there will be a period of time when you don't have write access to the data. You can still read from the secondary endpoint during the outage.

  • If a regional outage or disaster affects the primary location and the data there cannot be recovered, the Azure Storage team may decide to perform a geo-failover to the secondary region.

  • Data replication to the secondary region is performed asynchronously. Therefore, if a geo-failover is performed, some data loss is possible if the data can't be recovered from the primary region.

  • Transient failures, such as a network outage, will not trigger a storage failover. Design your application to be resilient to transient failures. Mitigation options include:

    • Read from the secondary region.
    • Temporarily switch to another storage account for new write operations (for example, to queue messages).
    • Copy data from the secondary region to another storage account.
    • Provide reduced functionality until the system fails back.
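
The first mitigation option, reading from the secondary region, could be combined with a short retry loop as in the Python sketch below. The reader callables are hypothetical stand-ins for your storage client calls, and the retry count, the backoff, and the use of `ConnectionError` as the transient-failure signal are assumptions for illustration.

```python
import time
from typing import Callable, Tuple


def read_with_secondary_fallback(
    read_primary: Callable[[], bytes],
    read_secondary: Callable[[], bytes],
    attempts: int = 3,
    base_delay_seconds: float = 0.5,
) -> Tuple[bytes, str]:
    """Retry the primary read a few times, then read from the secondary.

    Returns the data and the name of the endpoint that served it.
    """
    for attempt in range(attempts):
        try:
            return read_primary(), "primary"
        except ConnectionError:
            # Exponential backoff between retries of the primary endpoint.
            time.sleep(base_delay_seconds * 2 ** attempt)
    # Replication is asynchronous, so this copy may lag behind the primary.
    return read_secondary(), "secondary"


# Demonstration with a primary that always fails.
calls = []


def flaky_primary() -> bytes:
    calls.append("primary")
    raise ConnectionError("transient network failure")


data, source = read_with_secondary_fallback(
    flaky_primary, lambda: b"replicated-copy", attempts=3, base_delay_seconds=0.0
)
```

Returning the source alongside the data lets the caller decide whether possibly stale data is acceptable, or whether to offer reduced functionality instead.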

For more information, see What to do if an Azure Storage outage occurs.

Cost considerations

Use the pricing calculator to estimate costs. The recommendations in this section may help you reduce costs.

Azure Front Door

Azure Front Door billing has three components: outbound data transfers, inbound data transfers, and routing rules. For more information, see Azure Front Door Pricing. The pricing chart does not include the cost of accessing data from the backend services and transferring it to Front Door. Those costs are billed based on data transfer charges, described in Bandwidth Pricing Details.

Azure Cosmos DB

There are two factors that determine Azure Cosmos DB pricing:

  • Provisioned throughput, measured in Request Units per second (RU/s).

    Cosmos DB allocates the resources required to guarantee the RU/s that you specify. You are billed hourly for the maximum throughput provisioned in that hour. Because those resources are dedicated to your container or database, you are charged for the specified throughput even if you don't run any workload.

  • Consumed storage. You are billed a flat rate for the total amount of storage (in GB) consumed by your data and indexes in a given hour.
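
The two pricing factors can be combined into a simple estimate, as in the Python sketch below. The rates are placeholder numbers chosen to keep the arithmetic easy, not actual Azure prices, and quoting throughput per 100 RU/s per hour and storage per GB per hour are assumptions of this example. Use the pricing calculator for real figures.

```python
from typing import Sequence, Tuple


def estimate_cosmos_cost(
    hours: Sequence[Tuple[int, float]],
    rate_per_100_rus_hour: float,
    rate_per_gb_hour: float,
) -> float:
    """Estimate the Cosmos DB cost over a series of hours.

    `hours` holds one (max provisioned RU/s, GB stored) pair per hour.
    Throughput is charged on what was provisioned, not on what was used.
    """
    total = 0.0
    for max_rus, stored_gb in hours:
        total += max_rus / 100 * rate_per_100_rus_hour  # provisioned throughput
        total += stored_gb * rate_per_gb_hour           # consumed storage
    return total


# Placeholder rates (NOT real prices), picked so the numbers are easy to follow.
RATE_PER_100_RUS_HOUR = 0.5
RATE_PER_GB_HOUR = 0.25

# Hour 1: 400 RU/s provisioned. Hour 2: scaled up to 1000 RU/s. 10 GB stored.
two_hours = estimate_cosmos_cost(
    [(400, 10), (1000, 10)], RATE_PER_100_RUS_HOUR, RATE_PER_GB_HOUR
)

# An idle hour at 400 RU/s still incurs the throughput and storage charges.
idle_hour = estimate_cosmos_cost(
    [(400, 10)], RATE_PER_100_RUS_HOUR, RATE_PER_GB_HOUR
)
```

Because the charge depends on provisioned rather than consumed throughput, lowering the provisioned RU/s during quiet periods reduces the estimate even when the workload itself is unchanged.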

For more information, see the cost section in Microsoft Azure Well-Architected Framework.

Manageability considerations

If the primary database fails, perform a manual failover to the secondary database. See Restore an Azure SQL Database or failover to a secondary. The secondary database remains read-only until you fail over.

DevOps considerations

This architecture follows the multi-region deployment recommendation described in the DevOps section of the Azure Well-Architected Framework.

This architecture builds on the one shown in Improve scalability in a web application. See the DevOps considerations section of that article.