Hello everyone,
Let me start by thanking you in advance; all help is welcome.
Current environment: 4-node Hyper-V Cluster 2016 (migrated from 2012 R2), Fibre Channel CSVs encrypted with BitLocker (Huawei OceanStor 2600 V3 SAN).
Following the Cluster OS Rolling Upgrade process, we migrated successfully from Cluster 2012 R2 to Cluster 2016 without any issues.
Source: https://learn.microsoft.com/en-us/windows-server/failover-clustering/cluster-operating-system-rolling-upgrade
Following the same guidance, we began the migration to Cluster 2019; however, the cluster service started crashing as soon as we added the first Hyper-V Server 2019 node to the 2016 cluster.
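In case it's useful, this is roughly how the mixed-mode state can be confirmed while the 2019 node is joined (standard Failover Clustering cmdlets, shown only as a sketch):

# Cluster functional level: 9 = Windows Server 2016; it stays at 9 while the
# cluster runs in mixed mode and is only raised later by Update-ClusterFunctionalLevel.
Get-Cluster | Select-Object Name, ClusterFunctionalLevel

# OS build of each node (14393 = 2016, 17763 = 2019) and its current state
Get-ClusterNode | Select-Object Name, State, MajorVersion, MinorVersion, BuildNumber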
Symptoms:
The cluster service crashes and restarts.
After that, the CSVs start to pause.
VMs crash and appear to power off.
Backups start to fail after the CSVs pause.
Since this problem appeared, we have kept the 2019 host paused, and everything works great with the other three nodes.
We reinstalled the 2019 node from scratch and the problem remains…
Here are some interesting entries from the cluster log, taken from two different attempts at resuming the node (before and after the OS reinstall):
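(For reference, cluster log excerpts like the ones below can be dumped with Get-ClusterLog; the destination folder is just an example:)

# Generate cluster.log for every node, with local-time timestamps, into C:\Temp
Get-ClusterLog -Destination C:\Temp -UseLocalTime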
21-11-2021
[MRR] Node 1: CRITICAL: 1 outstanding requests for this node, average mrr age: 58:22.940, max mrr age: 58:22.940 (rid: 2953589, stagglers: (3 4 5))
[MRR] Node 1: CRITICAL: 1 outstanding requests for this node, average mrr age: 59:22.940, max mrr age: 59:22.940 (rid: 2953589, stagglers: (3 4 5))
[MRR] Node 1: CRITICAL: request RID 2953589 timed out. Stragglers: (3 4 5). Action: /rcm/mrr/CreateCryptoContainersIfNecessary
[MRR] Node 1: CRITICAL: Local node was attempting to kill a quorum of other nodes ((3 4 5)) for stuck MRRs
MRR stall (status = 1359)
[RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
MRR stall (status = 1359), executing OnStop
In this case, we resumed the node (it was paused), saw it rebalancing the CSVs, and it started logging this event:
[MRR] Node 1: CRITICAL: 1 outstanding requests for this node, average mrr age: xx:xx.xxx, max mrr age: xx:xx.xxx (rid: 2953589, stagglers: (3 4 5)).
One event per minute, always with the same RID and with the MRR age (xx:xx.xxx) incrementing.
We paused the node with role drain 40 minutes after the resume; the CSVs rebalanced fine, but the MRR events kept occurring and incrementing until the age reached one hour, at which point the cluster service stopped and restarted…
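For clarity, by "resume" and "pause with role drain" we mean the standard node operations, done from Failover Cluster Manager or the equivalent cmdlets, roughly as follows (the node name is only a placeholder):

# Resume the paused 2019 host without failing roles back to it
Resume-ClusterNode -Name "HV2019-NODE" -Failback NoFailback

# Pause it again, draining the roles off the node, and wait for the drain to finish
Suspend-ClusterNode -Name "HV2019-NODE" -Drain -Wait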
30-11-2021 (this one is a bit more detailed)
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] Node 1: insert request record for RID 119474 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] SendRequest(<message action='/rcm/mrr/GimProvisionalPlacementDecisionRemove' GemId='0' target='RCMA' sender='-1'mrr='119474'/>) NumDest:65 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] Node 1: insert request record for RID 119478 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] SendRequest(<message action='/rcm/mrr/GimProvisionalPlacementDecisionRemove' GemId='0' target='RCMA' sender='-1'mrr='119478'/>) NumDest:65 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] Node 1: insert request record for RID 119480 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] SendRequest(<message action='/rcm/mrr/GimProvisionalPlacementDecisionRemove' GemId='0' target='RCMA' sender='-1'mrr='119480'/>) NumDest:65 logged due to expiring
……..
……..
000024e8.00003c20::2021/11/30-01:16:59.836 WARN [MRR] Node 1: CRITICAL: 3 outstanding requests for this node, average mrr age: 52.103, max mrr age: 52.106 (rid: 119474, stagglers: (3))
……..
……..
000024e8.00003c20::2021/11/30-01:17:59.837 WARN [MRR] Node 1: CRITICAL: 3 outstanding requests for this node, average mrr age: 1:52.104, max mrr age: 1:52.107 (rid: 119474, stagglers: (3))
……..
……..
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: request RID 119474 timed out. Stragglers: (3). Action: /rcm/mrr/GimProvisionalPlacementDecisionRemove
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: request RID 119478 timed out. Stragglers: (3). Action: /rcm/mrr/GimProvisionalPlacementDecisionRemove
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: request RID 119480 timed out. Stragglers: (3). Action: /rcm/mrr/GimProvisionalPlacementDecisionRemove
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: timeout is reached for mrr sent to nodes (3), kicking out of view
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: request RID 119480 timed out. Stragglers: (3). Action: /rcm/mrr/GimProvisionalPlacementDecisionRemove
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: timeout is reached for mrr sent to nodes (3), kicking out of view
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: 3 outstanding requests for this node, average mrr age: 2:52.104, max mrr age: 2:52.107 (rid: 119474, stagglers: (3))
000024e8.00002aac::2021/11/30-01:18:59.837 INFO [NETWORK] Requested to remove node(s) LX1-VI-HPV-10
000024e8.00002a94::2021/11/30-01:19:00.283 INFO [MM] Node 3 went down with fatal error. Initiating regroup.
000024e8.00002a94::2021/11/30-01:19:00.283 INFO [IM] got event: Node with FaultTolerantAddress fe80::7060:5dc3:443d:498f:~0~ has gone down with fatal error\crash
000024e8.00002aac::2021/11/30-01:19:00.283 INFO [RGP] node 1: I don't have a connection to node 3, ignoring node going down due to fatal error
000024e8.00002a94::2021/11/30-01:19:00.283 WARN [NETWORK] Node LX1-VI-HPV-10 going down with fatal error
000024e8.00002b68::2021/11/30-01:19:00.283 INFO [NODE] Node 1: Cleaning up connections for n3.
000024e8.00002b68::2021/11/30-01:19:00.283 INFO [MQ-LX1-VI-HPV-10] Clearing 0 unsent and 0 unacknowledged messages.
000024e8.00002b68::2021/11/30-01:19:00.283 INFO [NODE] Node 1: n3 node object is closing its connections
000024e8.00002b68::2021/11/30-01:19:00.283 INFO [NODE] Node 1: closing n3 node object channels
This time, we resumed the node (it was paused), saw it rebalancing the CSVs, and no [MRR] critical events were logged. Everything seemed fine.
After one hour we paused the node with role drain; the [MRR] critical events started, and this time, after roughly 3 minutes, the cluster service on one of the 2016 servers crashed (stopped and restarted)… We are fairly sure it was the 2019 server that forced the other node to shut down its cluster service…
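If it helps anyone correlate with the cluster.log, the FailoverClustering events around the crash can be pulled from the System log on the affected 2016 node with something like:

# Last two hours of FailoverClustering events on the node whose cluster service crashed
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    StartTime    = (Get-Date).AddHours(-2)
} | Select-Object TimeCreated, Id, LevelDisplayName, Message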
We have already checked all network configurations and made sure the encrypted CSVs are operational on all of the servers, including the 2019 server.
Please, we need help understanding what is causing this to happen.
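The encrypted-CSV checks amount to something like this (the volume path is only an example of a CSV mount point):

# CSV state and owner node
Get-ClusterSharedVolume | Select-Object Name, State, OwnerNode

# BitLocker status of a CSV mount point, run on the node that currently owns the volume
manage-bde -status "C:\ClusterStorage\Volume1"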