IaaS with SQL AlwaysOn - Tuning Failover Cluster Network Thresholds

This article describes how to tune failover cluster network thresholds.

Symptom

When you run Windows Failover Cluster nodes in IaaS with SQL Server AlwaysOn, we recommend changing the cluster settings to a more relaxed monitoring state. The default cluster settings are restrictive and can cause unnecessary outages. They are designed for highly tuned on-premises networks and do not account for the latency that a multi-tenant environment such as Windows Azure (IaaS) can introduce.

Windows Server Failover Clustering constantly monitors the network connections and health of the nodes in a Windows cluster. If a node is not reachable over the network, recovery action is taken to bring applications and services online on another node in the cluster. Latency in communication between cluster nodes can lead to the following error:

Error 1135 (system event log)

Cluster node Node1 was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Cluster.log example:

0000ab34.00004e64::2014/06/10-07:54:34.099 DBG   [NETFTAPI] Signaled NetftRemoteUnreachable event, local address 10.xx.x.xxx:3343 remote address 10.x.xx.xx:3343
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] got event: Remote endpoint 10.xx.xx.xxx:~3343~ unreachable from 10.xx.x.xx:~3343~
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Marking Route from 10.xxx.xxx.xxxx:~3343~ to 10.xxx.xx.xxxx:~3343~ as down
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [NDP] Checking to see if all routes for route (virtual) local fexx::xxx:5dxx:xxxx:3xxx:~0~ to remote xxx::cxxx:xxxd:xxx:dxxx:~0~ are down
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [NDP] All routes for route (virtual) local fxxx::xxxx:5xxx:xxxx:3xxx:~0~ to remote fexx::xxxx:xxxx:xxxx:xxxx:~0~ are down
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [CORE] Node 8: executing node 12 failed handlers on a dedicated thread
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: Cleaning up connections for n12.
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [Nodename] Clearing 0 unsent and 15 unacknowledged messages.
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: n12 node object is closing its connections
0000ab34.00008b68::2014/06/10-07:54:34.099 INFO  [DCM] HandleNetftRemoteRouteChange
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 1: Old: 05.936, Message: Response, Route sequence: 150415, Received sequence: 150415, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:28.000, Ticks since last sending: 4
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: closing n12 node object channels
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 2: Old: 06.434, Message: Request, Route sequence: 150414, Received sequence: 150402, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:27.665, Ticks since last sending: 36
0000ab34.0000a8ac::2014/06/10-07:54:34.099 INFO  [DCM] HandleRequest: dcm/netftRouteChange
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 3: Old: 06.934, Message: Response, Route sequence: 150414, Received sequence: 150414, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:27.165, Ticks since last sending: 4
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 4: Old: 07.434, Message: Request, Route sequence: 150413, Received sequence: 150401, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:26.664, Ticks since last sending: 36
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <realLocal>10.xxx.xx.xxx:~3343~</realLocal>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <realRemote>10.xxx.xx.xxx:~3343~</realRemote>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <virtualLocal>fexx::xxxx:xxxx:xxxx:xxxx:~0~</virtualLocal>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <virtualRemote>fexx::xxxx:xxxx:xxxx:xxxx:~0~</virtualRemote>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Delay>1000</Delay>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Threshold>5</Threshold>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Priority>140481</Priority>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Attributes>2147483649</Attributes>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO  </struct mscs::FaultTolerantRoute>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO   removed
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   [QUORUM] Node 8: Lost quorum (3 4 5 6 7 8)
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   [QUORUM] Node 8: goingAway: 0, core.IsServiceShutdown: 0
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   lost quorum (status = 5925)
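
When reviewing a long cluster.log, it can help to pull out just the heartbeat counters and the route's Delay and Threshold values. The following Python sketch does this for lines shaped like the ones above. It is illustrative only: the function name `summarize_route_failure` and the regular expressions are assumptions based on this one log excerpt, not a documented log format, and the sketch extracts the fields without interpreting them.

```python
import re

# Matches the "Heartbeats counter/threshold: 5/5" field of an [IM] route history line.
HEARTBEAT_RE = re.compile(r"Heartbeats counter/threshold:\s*(\d+)/(\d+)")
# Matches the <Delay> and <Threshold> elements of the FaultTolerantRoute dump.
ROUTE_FIELD_RE = re.compile(r"<(Delay|Threshold)>(\d+)</\1>")


def summarize_route_failure(log_lines):
    """Collect heartbeat counters and route settings from cluster.log lines."""
    summary = {"heartbeats": [], "Delay": None, "Threshold": None}
    for line in log_lines:
        hb = HEARTBEAT_RE.search(line)
        if hb:
            # Store (counter, threshold) as integers, in log order.
            summary["heartbeats"].append((int(hb.group(1)), int(hb.group(2))))
        field = ROUTE_FIELD_RE.search(line)
        if field:
            # Key is "Delay" or "Threshold", as named in the log element.
            summary[field.group(1)] = int(field.group(2))
    return summary


# Three lines copied from the cluster.log example above.
sample = [
    "0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 1: Old: 05.936, Message: Response, Route sequence: 150415, Received sequence: 150415, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:28.000, Ticks since last sending: 4",
    "0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Delay>1000</Delay>",
    "0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Threshold>5</Threshold>",
]
print(summarize_route_failure(sample))
```

For the sample lines, the summary contains one heartbeat entry of 5/5, a Delay of 1000, and a Threshold of 5, which matches the values visible in the excerpt.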

Cause

Two settings configure the connectivity health monitoring of the cluster.

Delay: This defines the frequency at which cluster heartbeats are sent between nodes. The delay is the time, in milliseconds, before the next heartbeat is sent. Within the same cluster, the delay between nodes on the same subnet can differ from the delay between nodes on different subnets.

Threshold: This defines the number of heartbeats that can be missed before the cluster takes recovery action. Within the same cluster, the threshold between nodes on the same subnet can differ from the threshold between nodes on different subnets.

By default, Windows Server 2016 sets SameSubnetThreshold to 10 and SameSubnetDelay to 1000 ms. For example, if connectivity monitoring fails for 10 seconds, the failover threshold is reached and the unreachable node is removed from cluster membership. The resources are then moved to another available node in the cluster, and cluster errors are reported, including cluster error 1135 (shown above).
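
The arithmetic behind this example can be sketched in a few lines of Python. This is an illustration only: the function name `heartbeat_timeout_seconds` is hypothetical and not part of any cluster tooling, and it assumes the tolerated outage is simply the delay multiplied by the threshold.

```python
def heartbeat_timeout_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time without heartbeats before recovery action starts.

    Assumption: the window equals the delay (milliseconds between
    heartbeats) multiplied by the threshold (missed heartbeats tolerated).
    """
    if delay_ms <= 0 or threshold <= 0:
        raise ValueError("delay_ms and threshold must be positive")
    return delay_ms * threshold / 1000.0


# Windows Server 2016 same-subnet defaults quoted above: 1000 ms, 10 heartbeats.
print(heartbeat_timeout_seconds(1000, 10))  # 10.0
# The cluster.log example shows <Delay>1000</Delay> and <Threshold>5</Threshold>.
print(heartbeat_timeout_seconds(1000, 5))   # 5.0
```

Under the same assumption, raising a threshold from 5 to 20 with a 1000 ms delay widens the window from about 5 seconds to about 20 seconds.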

Resolution

In an IaaS environment, relax the cluster network configuration settings.

Steps to verify current configuration

Check the current cluster network configuration settings by using the get-cluster command:

C:\Windows\system32> get-cluster | fl *subnet*

Default, minimum, maximum, and recommended values for each supported OS

Description           OS       Min  Max  Default  Recommended
CrossSubnetThreshold  2008 R2  3    20   5        20
CrossSubnetThreshold  2012     3    120  5        20
CrossSubnetThreshold  2012 R2  3    120  5        20
CrossSubnetThreshold  2016     3    120  20       20
SameSubnetThreshold   2008 R2  3    10   5        10
SameSubnetThreshold   2012     3    120  5        10
SameSubnetThreshold   2012 R2  3    120  5        10
SameSubnetThreshold   2016     3    120  10       10
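
Before applying a new value, it can be useful to check it against the documented range for the operating system in use. The Python sketch below encodes the table above as a lookup. The names `THRESHOLD_LIMITS` and `check_threshold` are hypothetical helpers for illustration; the numbers are copied from the table, and the sketch does not query a real cluster.

```python
# (min, max, default, recommended) per setting and OS, copied from the table above.
THRESHOLD_LIMITS = {
    ("CrossSubnetThreshold", "2008 R2"): (3, 20, 5, 20),
    ("CrossSubnetThreshold", "2012"): (3, 120, 5, 20),
    ("CrossSubnetThreshold", "2012 R2"): (3, 120, 5, 20),
    ("CrossSubnetThreshold", "2016"): (3, 120, 20, 20),
    ("SameSubnetThreshold", "2008 R2"): (3, 10, 5, 10),
    ("SameSubnetThreshold", "2012"): (3, 120, 5, 10),
    ("SameSubnetThreshold", "2012 R2"): (3, 120, 5, 10),
    ("SameSubnetThreshold", "2016"): (3, 120, 10, 10),
}


def check_threshold(setting: str, os_version: str, value: int) -> bool:
    """Return True if value lies within the documented min/max range."""
    minimum, maximum, _default, _recommended = THRESHOLD_LIMITS[(setting, os_version)]
    return minimum <= value <= maximum


# 20 exceeds the documented maximum of 10 for SameSubnetThreshold on 2008 R2.
print(check_threshold("SameSubnetThreshold", "2008 R2", 20))  # False
# On 2016 the documented maximum is 120, so 20 is within range.
print(check_threshold("SameSubnetThreshold", "2016", 20))     # True
```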

The values for Threshold reflect the current recommendations regarding the scope of deployment, as described in the following article:

Fine tuning failover cluster network thresholds in Windows Server 2012 R2

Threshold defines the number of heartbeats that can be missed before the cluster takes recovery action. Within the same cluster, the threshold between nodes on the same subnet can differ from the threshold between nodes on different subnets.

Recommendations for changing to more relaxed settings for multi-tenant environments like Azure (IaaS)

Note

Increasing the resiliency of your cluster environment by adjusting the cluster network configuration settings can result in increased downtime, because a node that has genuinely failed takes longer to be detected. For more information, see Tuning Failover Cluster Network Thresholds.

  1. Modify to more relaxed settings:

    Note

    Changing the cluster threshold takes effect immediately; you don't have to restart the cluster or any resources.

    The following settings are recommended for both same-subnet and cross-region deployments of AlwaysOn availability groups.

    C:\Windows\system32> (get-cluster).SameSubnetThreshold = 20
    
    C:\Windows\system32> (get-cluster).CrossSubnetThreshold = 20
    
  2. Verify the changes:

    C:\Windows\system32> get-cluster | fl *subnet*
    


References

For more information on tuning Windows cluster network configuration settings, see Tuning Failover Cluster Network Thresholds.

For information on using cluster.exe to tune Windows cluster network configuration settings, see How to Configure Cluster Networks for a Failover Cluster.