IaaS with SQL Server - Tuning failover cluster network thresholds

This article describes how to adjust the network thresholds of a failover cluster.

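Symptoms

When you run Windows failover cluster nodes in IaaS with a SQL Server Always On availability group, we recommend changing the cluster settings to a more relaxed monitoring state. The out-of-the-box cluster settings are restrictive and can cause unneeded outages. The defaults are designed for highly tuned on-premises networks and do not account for the latency that a multi-tenant environment such as Windows Azure (IaaS) can introduce.

Windows Server Failover Clustering constantly monitors the network connections and the health of the nodes in a Windows cluster. If a node cannot be reached over the network, the cluster takes recovery action and brings applications and services online on another node in the cluster. Latency in the communication between cluster nodes can lead to the following error: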

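Error 1135 (system event log)

Cluster node Node1 was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Cluster.log example: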

0000ab34.00004e64::2014/06/10-07:54:34.099 DBG   [NETFTAPI] Signaled NetftRemoteUnreachable event, local address 10.xx.x.xxx:3343 remote address 10.x.xx.xx:3343
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] got event: Remote endpoint 10.xx.xx.xxx:~3343~ unreachable from 10.xx.x.xx:~3343~
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Marking Route from 10.xxx.xxx.xxxx:~3343~ to 10.xxx.xx.xxxx:~3343~ as down
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [NDP] Checking to see if all routes for route (virtual) local fexx::xxx:5dxx:xxxx:3xxx:~0~ to remote xxx::cxxx:xxxd:xxx:dxxx:~0~ are down
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [NDP] All routes for route (virtual) local fxxx::xxxx:5xxx:xxxx:3xxx:~0~ to remote fexx::xxxx:xxxx:xxxx:xxxx:~0~ are down
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [CORE] Node 8: executing node 12 failed handlers on a dedicated thread
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: Cleaning up connections for n12.
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [Nodename] Clearing 0 unsent and 15 unacknowledged messages.
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: n12 node object is closing its connections
0000ab34.00008b68::2014/06/10-07:54:34.099 INFO  [DCM] HandleNetftRemoteRouteChange
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 1: Old: 05.936, Message: Response, Route sequence: 150415, Received sequence: 150415, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:28.000, Ticks since last sending: 4
0000ab34.00007328::2014/06/10-07:54:34.099 INFO  [NODE] Node 8: closing n12 node object channels
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 2: Old: 06.434, Message: Request, Route sequence: 150414, Received sequence: 150402, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:27.665, Ticks since last sending: 36
0000ab34.0000a8ac::2014/06/10-07:54:34.099 INFO  [DCM] HandleRequest: dcm/netftRouteChange
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 3: Old: 06.934, Message: Response, Route sequence: 150414, Received sequence: 150414, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:27.165, Ticks since last sending: 4
0000ab34.00004b38::2014/06/10-07:54:34.099 INFO  [IM] Route history 4: Old: 07.434, Message: Request, Route sequence: 150413, Received sequence: 150401, Heartbeats counter/threshold: 5/5, Error: Success, NtStatus: 0 Timestamp: 2014/06/10-07:54:26.664, Ticks since last sending: 36
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <realLocal>10.xxx.xx.xxx:~3343~</realLocal>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <realRemote>10.xxx.xx.xxx:~3343~</realRemote>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <virtualLocal>fexx::xxxx:xxxx:xxxx:xxxx:~0~</virtualLocal>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <virtualRemote>fexx::xxxx:xxxx:xxxx:xxxx:~0~</virtualRemote>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Delay>1000</Delay>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Threshold>5</Threshold>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Priority>140481</Priority>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Attributes>2147483649</Attributes>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO  </struct mscs::FaultTolerantRoute>
0000ab34.00007328::2014/06/10-07:54:34.100 INFO   removed
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   [QUORUM] Node 8: Lost quorum (3 4 5 6 7 8)
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   [QUORUM] Node 8: goingAway: 0, core.IsServiceShutdown: 0
0000ab34.0000a7c0::2014/06/10-07:54:38.433 ERR   lost quorum (status = 5925)
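
When reading an excerpt like the one above, it can help to pull out the Delay and Threshold values that were in effect on the failed route. The following is a minimal sketch, not a supported tool: the function name route_settings is hypothetical, and the regular expression assumes the values appear exactly in the <Delay>...</Delay> and <Threshold>...</Threshold> form shown in this log.

```python
import re

# Matches the <Delay>N</Delay> and <Threshold>N</Threshold> fields as they
# appear in the cluster.log excerpt (assumption: this exact tag form).
SETTING_RE = re.compile(r"<(Delay|Threshold)>(\d+)</\1>")


def route_settings(log_lines):
    """Return the last Delay and Threshold values found in the given lines."""
    settings = {}
    for line in log_lines:
        match = SETTING_RE.search(line)
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


sample = [
    "0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Delay>1000</Delay>",
    "0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Threshold>5</Threshold>",
    "0000ab34.00007328::2014/06/10-07:54:34.100 INFO    <Priority>140481</Priority>",
]
print(route_settings(sample))
```

For the sample lines, the sketch reports a delay of 1000 and a threshold of 5, the same values that appear in the excerpt.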

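Cause

Two settings configure the connectivity health monitoring of the cluster.

Delay - Defines how often cluster heartbeats are sent between nodes. The delay is the time to wait before the next heartbeat is sent; the cluster property is configured in milliseconds. Within the same cluster, the delay can differ between nodes on the same subnet and nodes on different subnets.

Threshold - Defines how many heartbeats can be missed before the cluster takes recovery action. The threshold is a number of heartbeats. Within the same cluster, the threshold can differ between nodes on the same subnet and nodes on different subnets.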

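By default, Windows Server 2016 sets SameSubnetThreshold to 10 and SameSubnetDelay to 1000 milliseconds. For example, if connectivity monitoring fails for 10 seconds, the failover threshold is reached and the node is removed from cluster membership. The resources are then moved to another available node in the cluster. Cluster errors are reported, including cluster error 1135 (shown above).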
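
The 10-second figure follows from multiplying the delay by the threshold. The sketch below only illustrates that arithmetic; the function name heartbeat_tolerance_seconds is hypothetical, and the second call uses a threshold of 20 purely as an example of a relaxed value, not as a recommendation from this article.

```python
def heartbeat_tolerance_seconds(delay_ms, threshold):
    """Seconds of missed heartbeats tolerated before recovery action.

    delay_ms:  interval between heartbeats, in milliseconds
    threshold: number of consecutive heartbeats that may be missed
    """
    return delay_ms * threshold / 1000


# Windows Server 2016 defaults quoted in this article: 1000 ms x 10 heartbeats.
print(heartbeat_tolerance_seconds(1000, 10))  # 10.0

# Hypothetical relaxed threshold, for illustration only.
print(heartbeat_tolerance_seconds(1000, 20))  # 20.0
```

Raising either value lengthens the window in which transient network latency is tolerated, at the cost of slower detection of a node that has really failed.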

Resolution

To resolve this issue, relax the cluster network configuration settings. See Heartbeat and threshold.

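References

For information about tuning Windows cluster network configuration settings, see Tuning Failover Cluster Network Thresholds.

For information about using cluster.exe to tune Windows cluster network configuration settings, see How to Configure Cluster Networks for a Failover Cluster.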