Run an N-tier application in multiple Azure Stack Hub regions for high availability

This reference architecture shows a set of proven practices for running an N-tier application across multiple Azure Stack Hub regions, in order to achieve availability and a robust disaster recovery infrastructure. In this document, Traffic Manager is used to achieve high availability. If Traffic Manager is not a preferred choice in your environment, a pair of highly available load balancers can be substituted.

Note

The Traffic Manager used in the architecture below must be configured in Azure, and the endpoints used to configure the Traffic Manager profile must be publicly routable IP addresses.

Architecture

This architecture builds on the one shown in N-tier application with SQL Server.

Figure: Highly available network architecture for Azure N-tier applications

  • Primary and secondary regions. Use two regions to achieve higher availability. One is the primary region. The other region is for failover.

  • Azure Traffic Manager. Traffic Manager routes incoming requests to one of the regions. During normal operations, it routes requests to the primary region. If that region becomes unavailable, Traffic Manager fails over to the secondary region. For more information, see the section Traffic Manager configuration.

  • Resource groups. Create separate resource groups for the primary region and the secondary region. This gives you the flexibility to manage each region as a single collection of resources. For example, you could redeploy one region without taking down the other one. Link the resource groups, so that you can run a query to list all the resources for the application.

  • Virtual networks. Create a separate virtual network for each region. Make sure the address spaces do not overlap.

  • SQL Server Always On Availability Group. If you are using SQL Server, we recommend SQL Always On Availability Groups for high availability. Create a single availability group that includes the SQL Server instances in both regions.

  • VNET to VNET VPN Connection. Because VNET Peering is not yet available on Azure Stack Hub, use a VNET to VNET VPN connection to connect the two VNETs. For more information, see VNET to VNET in Azure Stack Hub.
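The requirement that the two virtual networks use non-overlapping address spaces can be checked before deployment. The following is a minimal sketch using Python's standard `ipaddress` module. The network names and CIDR ranges are hypothetical values chosen for illustration; substitute the address spaces you actually plan to use.

```python
import ipaddress


def overlapping_pairs(address_spaces):
    """Return every pair of named address spaces whose CIDR ranges overlap."""
    networks = {
        name: ipaddress.ip_network(cidr) for name, cidr in address_spaces.items()
    }
    names = sorted(networks)
    return [
        (first, second)
        for index, first in enumerate(names)
        for second in names[index + 1:]
        if networks[first].overlaps(networks[second])
    ]


# Hypothetical address spaces for the two regions (assumptions, not from the article).
planned = {
    "primary-vnet": "10.0.0.0/16",
    "secondary-vnet": "172.16.0.0/16",
}

# An empty list means no two address spaces overlap.
print(overlapping_pairs(planned))
```

Running a check like this in a deployment pipeline catches an overlap before the VPN connection is created, when it is still cheap to fix.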

Recommendations

A multi-region architecture can provide higher availability than deploying to a single region. If a regional outage affects the primary region, you can use Traffic Manager to fail over to the secondary region. This architecture can also help if an individual subsystem of the application fails.

There are several general approaches to achieving high availability across regions:

  • Active/passive with hot standby. Traffic goes to one region, while the other waits on hot standby. Hot standby means the VMs in the secondary region are allocated and running at all times.

  • Active/passive with cold standby. Traffic goes to one region, while the other waits on cold standby. Cold standby means the VMs in the secondary region are not allocated until needed for failover. This approach costs less to run, but generally takes longer to come online during a failure.

  • Active/active. Both regions are active, and requests are load balanced between them. If one region becomes unavailable, it is taken out of rotation.

This reference architecture focuses on active/passive with hot standby, using Traffic Manager for failover. You could deploy a small number of VMs for hot standby and then scale out as needed.

Traffic Manager configuration

Consider the following points when configuring Traffic Manager:

  • Routing. Traffic Manager supports several routing algorithms. For the scenario described in this article, use priority routing (formerly called failover routing). With this setting, Traffic Manager sends all requests to the primary region, unless the primary region becomes unreachable. At that point, it automatically fails over to the secondary region. See Configure Failover routing method.

  • Health probe. Traffic Manager uses an HTTP (or HTTPS) probe to monitor the availability of each region. The probe checks for an HTTP 200 response for a specified URL path. As a best practice, create an endpoint that reports the overall health of the application, and use this endpoint for the health probe. Otherwise, the probe might report a healthy endpoint when critical parts of the application are actually failing. For more information, see Health Endpoint Monitoring pattern.
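The health probe recommendation above can be illustrated with a small aggregate health endpoint. This is a minimal sketch using only the Python standard library, not the article's implementation. The `/health` path, the port, and the `check_database` and `check_cache` functions are hypothetical placeholders; a real application would replace them with checks against its own critical subsystems.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database():
    # Hypothetical placeholder: replace with a real query against the data tier.
    return True


def check_cache():
    # Hypothetical placeholder: replace with a real check of another subsystem.
    return True


CHECKS = {"database": check_database, "cache": check_cache}


def overall_health(checks):
    """Run every check; report 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed subsystem.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, results = overall_health(CHECKS)
        body = json.dumps(results).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the health endpoint; call this from the application's entry point."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Because the endpoint returns 200 only when every check passes, a probe pointed at it reports the region as unhealthy as soon as one critical subsystem fails, rather than only when the web tier itself is down.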

When Traffic Manager fails over, there is a period of time when clients cannot reach the application. The duration is affected by the following factors:

  • The health probe must detect that the primary region has become unreachable.

  • DNS servers must update the cached DNS records for the IP address, which depends on the DNS time-to-live (TTL). The default TTL is 300 seconds (5 minutes), but you can configure this value when you create the Traffic Manager profile.
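The two factors above can be combined into a rough estimate of the window during which clients might be unable to reach the application. The sketch below is a simplified model under stated assumptions, not a formula published for the service: it assumes the probe must fail a configured number of consecutive times before the endpoint is marked down, and it ignores probe timeouts and client-side caching. The probe interval and tolerated failure count are illustrative values; only the 300-second TTL comes from the text.

```python
def estimated_failover_seconds(probe_interval, tolerated_failures, dns_ttl):
    """Rough estimate: time to detect the failure plus time for DNS caches to expire."""
    # Assumption: the endpoint is marked down after one failed probe plus
    # `tolerated_failures` further consecutive failures.
    detection = probe_interval * (tolerated_failures + 1)
    return detection + dns_ttl


# Illustrative values (assumptions): 30-second probes, 3 tolerated failures.
# The 300-second TTL is the default mentioned above.
print(estimated_failover_seconds(probe_interval=30, tolerated_failures=3, dns_ttl=300))  # 420
```

Under this model the TTL dominates the estimate, which is why lowering it is usually the first lever for reducing failover time, at the cost of more DNS queries.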

For details, see About Traffic Manager Monitoring.

If Traffic Manager fails over, we recommend performing a manual failback rather than implementing an automatic failback. Otherwise, you can create a situation where the application flips back and forth between regions. Verify that all application subsystems are healthy before failing back.

Note that Traffic Manager automatically fails back by default. To prevent this, manually lower the priority of the primary region after a failover event. For example, suppose the primary region is priority 1 and the secondary is priority 2. After a failover, set the primary region to priority 3, to prevent automatic failback. When you are ready to switch back, update the priority to 1.

The following Azure CLI command updates the priority:

az network traffic-manager endpoint update --resource-group <resource-group> --profile-name <profile> \
    --name <endpoint-name> --type externalEndpoints --priority 3

Another approach is to temporarily disable the endpoint until you are ready to fail back:

az network traffic-manager endpoint update --resource-group <resource-group> --profile-name <profile> \
    --name <endpoint-name> --type externalEndpoints --endpoint-status Disabled
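To see why lowering the priority or disabling the endpoint prevents automatic failback, it can help to model priority routing in a few lines. The following is an illustrative simulation, not Traffic Manager's actual implementation. It assumes that traffic goes to the healthy, enabled endpoint with the lowest priority number; the `Endpoint` class and `select_endpoint` function are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    priority: int
    healthy: bool = True
    enabled: bool = True


def select_endpoint(endpoints):
    """Pick the healthy, enabled endpoint with the lowest priority number."""
    candidates = [e for e in endpoints if e.enabled and e.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.priority).name


primary = Endpoint("primary", priority=1)
secondary = Endpoint("secondary", priority=2)
endpoints = [primary, secondary]

print(select_endpoint(endpoints))  # primary: normal operation

primary.healthy = False
print(select_endpoint(endpoints))  # secondary: failover

primary.priority = 3               # operator lowers the priority after failover
primary.healthy = True             # the primary region recovers
print(select_endpoint(endpoints))  # secondary: no automatic failback

primary.priority = 1               # operator fails back after verification
print(select_endpoint(endpoints))  # primary
```

Setting `enabled=False` on the primary endpoint has the same effect in this model as the priority change: the recovered region is never selected until the operator restores it.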

Depending on the cause of a failover, you might need to redeploy the resources within a region. Before failing back, perform an operational readiness test. The test should verify things like:

  • VMs are configured correctly. (All required software is installed, IIS is running, and so on.)

  • Application subsystems are healthy.

  • Functional testing. (For example, the database tier is reachable from the web tier.)
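Part of such a readiness test can be automated. The sketch below checks TCP reachability from the machine it runs on, for example from a web-tier VM to the database tier. It is a minimal example using Python's standard `socket` module; the function names are introduced here for illustration, and a successful TCP connection only shows that a port is open, not that the service behind it is working correctly.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def readiness_report(targets, timeout=3.0):
    """Check each named (host, port) target and return a name -> bool mapping."""
    return {
        name: tcp_reachable(host, port, timeout=timeout)
        for name, (host, port) in targets.items()
    }
```

For example, calling `readiness_report({"sql": ("10.0.2.4", 1433)})` from a web-tier VM would report whether the SQL Server port is reachable. The address shown is a hypothetical placeholder; 1433 is the default SQL Server port.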

Configure SQL Server Always On Availability Groups

Prior to Windows Server 2016, SQL Server Always On Availability Groups required a domain controller, and all nodes in the availability group had to be in the same Active Directory (AD) domain.

To configure the availability group:

  • At a minimum, place two domain controllers in each region.

  • Give each domain controller a static IP address.

  • Create a VPN connection to enable communication between the two virtual networks.

  • For each virtual network, add the IP addresses of the domain controllers (from both regions) to the DNS server list. You can use the following CLI command. For more information, see Change DNS servers.

    az network vnet update --resource-group <resource-group> --name <vnet-name> --dns-servers 10.0.0.4 10.0.0.6 172.16.0.4 172.16.0.6
    
  • Create a Windows Server Failover Clustering (WSFC) cluster that includes the SQL Server instances in both regions.

  • Create a SQL Server Always On Availability Group that includes the SQL Server instances in both the primary and secondary regions. For the steps, see Extending Always On Availability Group to Remote Azure Datacenter (PowerShell).

    • Put the primary replica in the primary region.

    • Put one or more secondary replicas in the primary region. Configure these to use synchronous commit with automatic failover.

    • Put one or more secondary replicas in the secondary region. Configure these to use asynchronous commit, for performance reasons. (Otherwise, all T-SQL transactions have to wait on a round trip over the network to the secondary region.)

Note

Asynchronous commit replicas don't support automatic failover.

Availability considerations

With a complex N-tier app, you may not need to replicate the entire application in the secondary region. Instead, you might replicate only the critical subsystems needed to support business continuity.

Traffic Manager is a possible failure point in the system. If the Traffic Manager service fails, clients cannot access your application during the downtime. Review the Traffic Manager SLA, and determine whether using Traffic Manager alone meets your business requirements for high availability. If not, consider adding another traffic management solution as a fallback. If the Azure Traffic Manager service fails, change your CNAME records in DNS to point to the other traffic management service. (This step must be performed manually, and your application will be unavailable until the DNS changes are propagated.)

For the SQL Server cluster, there are two failover scenarios to consider:

  • All of the SQL Server database replicas in the primary region fail. For example, this could happen during a regional outage. In that case, you must manually fail over the availability group, even though Traffic Manager automatically fails over on the front end. Follow the steps in Perform a Forced Manual Failover of a SQL Server Availability Group, which describes how to perform a forced failover by using SQL Server Management Studio, Transact-SQL, or PowerShell in SQL Server 2016.

    Warning

    With forced failover, there is a risk of data loss. Once the primary region is back online, take a snapshot of the database and use tablediff to find the differences.

  • Traffic Manager fails over to the secondary region, but the primary SQL Server database replica is still available. For example, the front-end tier might fail, without affecting the SQL Server VMs. In that case, Internet traffic is routed to the secondary region, and that region can still connect to the primary replica. However, there will be increased latency, because the SQL Server connections are going across regions. In this situation, you should perform a manual failover as follows:

    1. Temporarily switch a SQL Server database replica in the secondary region to synchronous commit. This ensures there won't be data loss during the failover.

    2. Fail over to that replica.

    3. When you fail back to the primary region, restore the asynchronous commit setting.

Manageability considerations

When you update your deployment, update one region at a time to reduce the chance of a global failure from an incorrect configuration or an error in the application.

Test the resiliency of the system to failures. Here are some common failure scenarios to test:

  • Shut down VM instances.

  • Put pressure on resources such as CPU and memory.

  • Disconnect or delay the network.

  • Crash processes.

  • Expire certificates.

  • Simulate hardware faults.

  • Shut down the DNS service on the domain controllers.

Measure the recovery times and verify they meet your business requirements. Test combinations of failure modes as well.

Next steps