
High availability and Azure SQL Database

The goal of the high availability architecture in Azure SQL Database is to guarantee that your database is up and running 99.99% of the time, without you having to worry about the impact of maintenance operations and outages. Azure automatically handles critical servicing tasks, such as patching, backups, and Windows and SQL upgrades, as well as unplanned events such as underlying hardware, software, or network failures. When the underlying SQL instance is patched or fails over, the downtime is not noticeable if you employ retry logic in your app. Azure SQL Database can recover quickly even in the most critical circumstances, ensuring that your data is always available.
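
The retry logic mentioned above can take many forms. The following is a minimal sketch of one common approach, exponential backoff with a little jitter. `TransientError` and `run_with_retry` are hypothetical names used only for illustration; a real application would catch whatever transient-error type its database driver raises.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a transient connection error from a driver."""


def run_with_retry(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                   sleep=time.sleep):
    """Call operation(); on TransientError, wait and retry with backoff.

    The delay doubles after each failed attempt, capped at max_delay.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Small random jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter is injectable so the helper can be exercised in tests without actually waiting.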

The high availability solution is designed to ensure that committed data is never lost due to failures, that maintenance operations do not affect your workload, and that the database is not a single point of failure in your software architecture. There are no maintenance windows or downtimes that require you to stop the workload while the database is upgraded or maintained.

Two high-availability architectural models are used in Azure SQL Database:

  • Standard availability model, which is based on a separation of compute and storage. It relies on the high availability and reliability of the remote storage tier. This architecture targets budget-oriented business applications that can tolerate some performance degradation during maintenance activities.
  • Premium availability model, which is based on a cluster of database engine processes. It relies on the fact that there is always a quorum of available database engine nodes. This architecture targets mission-critical applications with high IO performance and high transaction rates, and guarantees minimal performance impact on your workload during maintenance activities.

Azure SQL Database runs on the latest stable version of the SQL Server Database Engine and the Windows OS, and most users do not notice that upgrades are performed continuously.

Basic, Standard, and General Purpose service tier availability

These service tiers use the standard availability architecture. The following figure shows four different nodes with the separated compute and storage layers.

Separation of compute and storage

The standard availability model includes two layers:

  • A stateless compute layer that runs the sqlservr.exe process and contains only transient and cached data, such as TempDB and the model database on the attached SSD, and the plan cache, buffer pool, and columnstore pool in memory. This stateless node is operated by Azure Service Fabric, which initializes sqlservr.exe, controls the health of the node, and performs failover to another node if necessary.
  • A stateful data layer with the database files (.mdf/.ldf) that are stored in Azure Blob storage. Azure Blob storage has built-in data availability and redundancy features. It guarantees that every record in the log file and every page in the data file is preserved even if the SQL Server process crashes.

Whenever the database engine or the operating system is upgraded, or a failure is detected, Azure Service Fabric moves the stateless SQL Server process to another stateless compute node with sufficient free capacity. Data in Azure Blob storage is not affected by the move, and the data/log files are attached to the newly initialized SQL Server process. This process guarantees 99.99% availability, but a heavy workload may experience some performance degradation during the transition because the new SQL Server instance starts with a cold cache.
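
To put the 99.99% figure in perspective, the short sketch below converts an availability percentage into a downtime budget. The function name and the choice of a 365-day year and a 30-day month are assumptions made for illustration; they are not the terms of any formal SLA.

```python
def downtime_budget_minutes(availability_percent, period_days):
    """Return the maximum unavailable time, in minutes, over the period."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * 24 * 60


# Illustrative periods: a 365-day year and a 30-day month.
per_year = downtime_budget_minutes(99.99, 365)
per_month = downtime_budget_minutes(99.99, 30)
print(f"{per_year:.2f} minutes per year, {per_month:.2f} minutes per month")
```

Under these assumptions, 99.99% availability leaves a budget of roughly 52.6 minutes per year, or about 4.3 minutes per 30-day month.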

Premium and Business Critical service tier availability

The Premium and Business Critical service tiers use the premium availability model, which integrates compute resources (the SQL Server Database Engine process) and storage (locally attached SSD) on a single node. High availability is achieved by replicating both compute and storage to additional nodes, creating a three- to four-node cluster.

Cluster of database engine nodes

The underlying database files (.mdf/.ldf) are placed on the attached SSD storage to provide very low latency IO to your workload. High availability is implemented using a technology similar to SQL Server Always On Availability Groups. The cluster includes a single primary replica (SQL Server process) that is accessible for read-write customer workloads, and up to three secondary replicas (compute and storage) containing copies of the data. The primary node constantly pushes changes to the secondary nodes in order and ensures that the data is synchronized to at least one secondary replica before committing each transaction. This process guarantees that if the primary node crashes for any reason, there is always a fully synchronized node to fail over to. The failover is initiated by Azure Service Fabric. Once the secondary replica becomes the new primary node, another secondary replica is created to ensure that the cluster has enough nodes (a quorum set). Once failover is complete, SQL connections are automatically redirected to the new primary node.
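
The commit rule described above (a transaction commits only after at least one secondary has the change) can be illustrated with a toy in-memory model. This is a deliberately simplified sketch, not how the service is implemented; the `Replica` and `Cluster` classes and their methods are hypothetical.

```python
class Replica:
    """Toy replica that hardens log records when healthy."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []

    def apply(self, record):
        """Store the record and acknowledge it; unhealthy replicas do not."""
        if not self.healthy:
            return False
        self.log.append(record)
        return True


class Cluster:
    """Toy cluster: one primary plus secondaries, with synchronous commit."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)
        self.committed = []

    def commit(self, record):
        # Push the change to every secondary and count acknowledgements.
        acks = sum(1 for s in self.secondaries if s.apply(record))
        if acks < 1:
            raise RuntimeError("no synchronized secondary; commit refused")
        self.primary.log.append(record)
        self.committed.append(record)
        return acks

    def fail_over(self):
        # Promote the first healthy secondary holding every committed record.
        for candidate in self.secondaries:
            if candidate.healthy and candidate.log == self.committed:
                self.secondaries.remove(candidate)
                self.primary = candidate
                return candidate
        raise RuntimeError("no fully synchronized secondary available")
```

Because `commit` refuses to proceed without an acknowledgement, any record in `committed` exists on at least one secondary, which is what lets `fail_over` find a fully synchronized candidate in this model.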

As an extra benefit, the premium availability model includes the ability to redirect read-only SQL connections to one of the secondary replicas. This feature is called Read Scale-Out. It provides 100% additional compute capacity at no extra charge to off-load read-only operations, such as analytical workloads, from the primary replica.
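
A client typically opts in to a read-only replica through its connection string. The sketch below builds an ODBC-style connection string and sets the `ApplicationIntent` keyword, which SQL Server client drivers use to declare a read-only workload. The helper name, driver name, and server and database names are placeholders chosen for illustration, and authentication settings are omitted.

```python
def build_connection_string(server, database, read_only=False):
    """Build an ODBC-style connection string (credentials omitted)."""
    parts = {
        "Driver": "{ODBC Driver 17 for SQL Server}",  # assumed driver name
        "Server": f"tcp:{server},1433",
        "Database": database,
        "Encrypt": "yes",
        # ReadOnly asks to be routed to a readable secondary replica.
        "ApplicationIntent": "ReadOnly" if read_only else "ReadWrite",
    }
    return ";".join(f"{key}={value}" for key, value in parts.items()) + ";"


# Hypothetical server and database names.
reporting = build_connection_string(
    "example-server.database.windows.net", "example-db", read_only=True
)
print(reporting)
```

Keeping the intent as an explicit parameter makes it easy to send analytical queries through one connection and read-write traffic through another.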

Hyperscale service tier availability

The Hyperscale service tier architecture is described in Distributed functions architecture.

Hyperscale functional architecture

The availability model in Hyperscale includes four layers:

  • A stateless compute layer that runs the sqlservr.exe processes and contains only transient and cached data, such as the non-covering RBPEX cache, TempDB, and the model database on the attached SSD, and the plan cache, buffer pool, and columnstore pool in memory. This stateless layer includes the primary compute replica and, optionally, a number of secondary compute replicas that can serve as failover targets.
  • A stateless storage layer formed by page servers. This layer is the distributed storage engine for the sqlservr.exe processes running on the compute replicas. Each page server contains only transient and cached data, such as the covering RBPEX cache on the attached SSD and data pages cached in memory. Each page server has a paired page server in an active-active configuration to provide load balancing, redundancy, and high availability.
  • A stateful transaction log storage layer formed by the compute node running the Log service process, the transaction log landing zone, and transaction log long-term storage. The landing zone and long-term storage use Azure Storage, which provides availability and redundancy for the transaction log, ensuring data durability for committed transactions.
  • A stateful data storage layer with the database files (.mdf/.ndf) that are stored in Azure Storage and are updated by page servers. This layer uses the data availability and redundancy features of Azure Storage. It guarantees that every page in a data file is preserved even if processes in other layers of the Hyperscale architecture crash, or if compute nodes fail.

Compute nodes in all Hyperscale layers run on Azure Service Fabric, which controls the health of each node and performs failovers to available healthy nodes as necessary.

For more information on high availability in Hyperscale, see Database High Availability in Hyperscale.

Zone redundant configuration

By default, the cluster of nodes for the premium availability model is created in the same datacenter. With the introduction of Azure Availability Zones, SQL Database can place different replicas of a Business Critical database in different availability zones in the same region. To eliminate a single point of failure, the control ring is also duplicated across multiple zones as three gateway rings (GW). The routing to a specific gateway ring is controlled by Azure Traffic Manager (ATM). Because the zone redundant configuration in the Premium or Business Critical service tiers does not create additional database redundancy, you can enable it at no extra cost. By selecting a zone redundant configuration, you can make your Premium or Business Critical databases resilient to a much larger set of failures, including catastrophic datacenter outages, without any changes to the application logic. You can also convert any existing Premium or Business Critical databases or pools to the zone redundant configuration.

Because zone redundant databases have replicas in different datacenters with some distance between them, the increased network latency may increase the commit time and thus affect the performance of some OLTP workloads. You can always return to the single-zone configuration by disabling the zone redundancy setting. This process is an online operation similar to a regular service tier upgrade. At the end of the process, the database or pool is migrated from a zone redundant ring to a single-zone ring, or vice versa.

Important

Zone redundant databases and elastic pools are currently supported only in the Premium and Business Critical service tiers in select regions. When using the Business Critical tier, zone redundant configuration is available only when the Gen5 compute hardware is selected. For up-to-date information about the regions that support zone redundant databases, see Services support by region.
This feature is not available in Managed Instance.

The following diagram illustrates the zone redundant version of the high availability architecture:

Zone redundant high availability architecture

Accelerated Database Recovery (ADR)

Accelerated Database Recovery (ADR) is a new SQL database engine feature that greatly improves database availability, especially in the presence of long-running transactions. ADR is currently available for single databases, elastic pools, and Azure SQL Data Warehouse.

Testing database fault resiliency

High availability is a fundamental part of the Azure SQL Database platform and works transparently for your database application. However, you may want to test how the automatic failover operations initiated during planned or unplanned events would affect the application before you deploy it to production. You can call a special API to restart the database or the elastic pool, which in turn triggers the failover. For a zone redundant database or elastic pool, the API call redirects the client connections to the new primary in a different availability zone (AZ). So in addition to testing how failover affects the existing database sessions, you can also verify whether it affects end-to-end performance. Because the restart operation is intrusive and a large number of restarts could stress the platform, only one failover call is allowed every 30 minutes for each database or elastic pool. For details, see Database failover and Elastic pool failover.
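
A test harness that triggers failovers may want to respect the 30-minute limit on its own side rather than rely on the platform rejecting the call. The following is a minimal sketch of such a client-side guard. The `FailoverThrottle` class and the resource identifiers are hypothetical, and the sketch does not call the failover API itself.

```python
from datetime import datetime, timedelta, timezone


class FailoverThrottle:
    """Client-side guard: at most one failover per resource per interval."""

    def __init__(self, min_interval=timedelta(minutes=30)):
        self.min_interval = min_interval
        self._last_call = {}  # resource id -> time of the last allowed call

    def try_acquire(self, resource_id, now=None):
        """Return True and record the call if a failover may be requested."""
        now = now or datetime.now(timezone.utc)
        last = self._last_call.get(resource_id)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_call[resource_id] = now
        return True


throttle = FailoverThrottle()
if throttle.try_acquire("example-server/example-db"):  # hypothetical id
    print("OK to request a failover for this database")
```

The limit is tracked per database or elastic pool, so a call for one resource does not block a call for another.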

Important

The Failover command is currently not available for Hyperscale databases and managed instances.

Conclusion

Azure SQL Database features a built-in high availability solution that is deeply integrated with the Azure platform. It depends on Service Fabric for failure detection and recovery, on Azure Blob storage for data protection, and on Availability Zones for higher fault tolerance. In addition, Azure SQL Database uses the Always On Availability Groups technology from SQL Server for replication and failover. The combination of these technologies enables applications to fully realize the benefits of a mixed storage model and to support the most demanding SLAs.

Next steps