Hyper-V Cluster Rolling upgrade 2012 -> 2016 -> 2019 - Cluster service crashing

Ricardo Simão 1 Reputation point
2021-12-03T14:59:11.437+00:00

Hello to all,

Let me start by thanking you; all help is welcome.

Current environment: 4-node Hyper-V 2016 cluster (migrated from 2012 R2), CSVs over FC, encrypted with BitLocker (SAN: Huawei OceanStore 2600 V3).

Following the Cluster OS Rolling Upgrade process, we migrated successfully from Cluster 2012 R2 to Cluster 2016 without any issues.
Source: https://learn.microsoft.com/en-us/windows-server/failover-clustering/cluster-operating-system-rolling-upgrade
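For context, this is the kind of check we run to confirm the mixed-mode state during the rolling upgrade. A minimal sketch (Python shelling out to the FailoverClusters PowerShell cmdlets; it assumes the module is installed on the node):

# Sketch: show the cluster functional level and each node's OS version while
# the cluster runs in mixed mode. Assumes the FailoverClusters PowerShell
# module is available on this node.
import subprocess

def run_ps(command: str) -> str:
    """Run a PowerShell command and return its text output."""
    result = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# While in mixed mode the functional level stays at the lower value (9 for
# 2016); Update-ClusterFunctionalLevel is only run once every node is upgraded.
print(run_ps("Get-Cluster | Select-Object Name, ClusterFunctionalLevel"))

# Per-node OS build, to see which nodes are still 2016 and which are 2019.
print(run_ps("Get-ClusterNode | Format-Table Name, State, MajorVersion, MinorVersion, BuildNumber"))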

Using the same guideline, we began the migration to Cluster 2019; however, the cluster service started crashing as soon as we added the first Hyper-V Server 2019 node to the 2016 cluster.

Symptoms:
Cluster service crashes and restarts.
After that, the CSVs start to pause.
VMs crash and appear to power off.
Backups start to fail after the CSVs pause.

Since this problem appeared, we have kept the 2019 host paused, and everything works great with the other 3 nodes.
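For completeness, this is how we drain/pause and resume that node (a minimal sketch; the node name below is a placeholder):

# Sketch: the pause-with-drain / resume cycle mentioned above, using the
# FailoverClusters PowerShell cmdlets. "HV-2019-NODE" is a placeholder name.
import subprocess

def run_ps(command: str) -> None:
    subprocess.run(["powershell.exe", "-NoProfile", "-Command", command], check=True)

NODE = "HV-2019-NODE"

# Drain the roles off the 2019 node and leave it paused (our current mitigation).
run_ps(f"Suspend-ClusterNode -Name {NODE} -Drain -Wait")

# Resuming it is what precedes the MRR stall in our case.
run_ps(f"Resume-ClusterNode -Name {NODE} -Failback Immediate")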

We reinstalled the 2019 node from scratch and the problem remains…

Some interesting entries from the cluster log, from two different attempts at resuming the node (before and after reinstalling the OS):

21-11-2021

[MRR] Node 1: CRITICAL: 1 outstanding requests for this node, average mrr age: 58:22.940, max mrr age: 58:22.940 (rid: 2953589, stagglers: (3 4 5))
[MRR] Node 1: CRITICAL: 1 outstanding requests for this node, average mrr age: 59:22.940, max mrr age: 59:22.940 (rid: 2953589, stagglers: (3 4 5))
[MRR] Node 1: CRITICAL: request RID 2953589 timed out. Stragglers: (3 4 5). Action: /rcm/mrr/CreateCryptoContainersIfNecessary
[MRR] Node 1: CRITICAL: Local node was attempting to kill a quorum of other nodes ((3 4 5)) for stuck MRRs
MRR stall (status = 1359)
[RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
MRR stall (status = 1359), executing OnStop

In this case, we resumed the node (it was paused), saw it rebalancing the CSVs, and it started logging this event:
[MRR] Node 1: CRITICAL: 1 outstanding requests for this node, average mrr age: xx:xx.xxx, max mrr age: xx:xx.xxx (rid: 2953589, stagglers: (3 4 5)).
One event per minute, always with the same RID and with the MRR age (xx:xx.xxx) incrementing.

We paused the node with role drain 40 minutes after the resume, and the CSVs rebalanced OK, but the MRR event kept occurring and incrementing until the age reached one hour and the cluster service stopped and restarted...
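To make the stall easier to follow, one way to pull the recurring [MRR] CRITICAL lines out of a Get-ClusterLog dump (a minimal sketch in Python; the pattern is written against the timestamped excerpts in this post and may need adjusting for other builds):

# Sketch: extract the "[MRR] ... CRITICAL: N outstanding requests" lines from a
# cluster.log generated with Get-ClusterLog, so the growing age of the same RID
# is easy to follow over time.
import re

MRR_LINE = re.compile(
    r"(?P<time>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d+).*\[MRR\].*CRITICAL:"
    r" (?P<count>\d+) outstanding requests.*max mrr age: (?P<age>[\d:.]+)"
    r" \(rid: (?P<rid>\d+), stagglers: \((?P<nodes>[^)]*)\)\)"
)

def mrr_stalls(path):
    """Yield (timestamp, rid, max_age, straggler_nodes) for each matching line."""
    with open(path, errors="replace") as log:
        for line in log:
            match = MRR_LINE.search(line)
            if match:
                yield (match.group("time"), match.group("rid"),
                       match.group("age"), match.group("nodes"))

if __name__ == "__main__":
    for time, rid, age, nodes in mrr_stalls("cluster.log"):
        print(f"{time}  rid={rid}  max_age={age}  stragglers=({nodes})")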

30-11-2021 (this one is a bit more detailed)

000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] Node 1: insert request record for RID 119474 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] SendRequest(<message action='/rcm/mrr/GimProvisionalPlacementDecisionRemove' GemId='0' target='RCMA' sender='-1'mrr='119474'/>) NumDest:65 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] Node 1: insert request record for RID 119478 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] SendRequest(<message action='/rcm/mrr/GimProvisionalPlacementDecisionRemove' GemId='0' target='RCMA' sender='-1'mrr='119478'/>) NumDest:65 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] Node 1: insert request record for RID 119480 logged due to expiring
000024e8.00003e14::2021/11/30-01:16:22.734 WARN [MRR] SendRequest(<message action='/rcm/mrr/GimProvisionalPlacementDecisionRemove' GemId='0' target='RCMA' sender='-1'mrr='119480'/>) NumDest:65 logged due to expiring
……..
……..
000024e8.00003c20::2021/11/30-01:16:59.836 WARN [MRR] Node 1: CRITICAL: 3 outstanding requests for this node, average mrr age: 52.103, max mrr age: 52.106 (rid: 119474, stagglers: (3))
……..
……..
000024e8.00003c20::2021/11/30-01:17:59.837 WARN [MRR] Node 1: CRITICAL: 3 outstanding requests for this node, average mrr age: 1:52.104, max mrr age: 1:52.107 (rid: 119474, stagglers: (3))
……..
……..
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: request RID 119474 timed out. Stragglers: (3). Action: /rcm/mrr/GimProvisionalPlacementDecisionRemove
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: request RID 119478 timed out. Stragglers: (3). Action: /rcm/mrr/GimProvisionalPlacementDecisionRemove
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: request RID 119480 timed out. Stragglers: (3). Action: /rcm/mrr/GimProvisionalPlacementDecisionRemove
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: timeout is reached for mrr sent to nodes (3), kicking out of view
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: request RID 119480 timed out. Stragglers: (3). Action: /rcm/mrr/GimProvisionalPlacementDecisionRemove
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: timeout is reached for mrr sent to nodes (3), kicking out of view
000024e8.00003c20::2021/11/30-01:18:59.837 WARN [MRR] Node 1: CRITICAL: 3 outstanding requests for this node, average mrr age: 2:52.104, max mrr age: 2:52.107 (rid: 119474, stagglers: (3))
000024e8.00002aac::2021/11/30-01:18:59.837 INFO [NETWORK] Requested to remove node(s) LX1-VI-HPV-10
000024e8.00002a94::2021/11/30-01:19:00.283 INFO [MM] Node 3 went down with fatal error. Initiating regroup.
000024e8.00002a94::2021/11/30-01:19:00.283 INFO [IM] got event: Node with FaultTolerantAddress fe80::7060:5dc3:443d:498f:~0~ has gone down with fatal error\crash
000024e8.00002aac::2021/11/30-01:19:00.283 INFO [RGP] node 1: I don't have a connection to node 3, ignoring node going down due to fatal error
000024e8.00002a94::2021/11/30-01:19:00.283 WARN [NETWORK] Node LX1-VI-HPV-10 going down with fatal error
000024e8.00002b68::2021/11/30-01:19:00.283 INFO [NODE] Node 1: Cleaning up connections for n3.
000024e8.00002b68::2021/11/30-01:19:00.283 INFO [MQ-LX1-VI-HPV-10] Clearing 0 unsent and 0 unacknowledged messages.
000024e8.00002b68::2021/11/30-01:19:00.283 INFO [NODE] Node 1: n3 node object is closing its connections
000024e8.00002b68::2021/11/30-01:19:00.283 INFO [NODE] Node 1: closing n3 node object channels

In this case, we resumed the node (it was paused), saw it rebalancing the CSVs, and no [MRR] critical events were logged. Everything seemed fine.
After one hour we paused the node with role drain; the [MRR] critical events started, and this time, after more or less 3 minutes, the cluster service on one of the 2016 servers crashed (stopped and restarted)... We are pretty sure that it was the 2019 server that forced the other node to shut down its cluster service...

We have already checked all network configurations, making sure that the encrypted CSVs are operational on all of the servers, including the Server 2019 node.
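For reference, a minimal sketch of the kind of checks involved (it assumes the FailoverClusters PowerShell module; the BitLocker status itself can be checked per node with manage-bde -status):

# Sketch of the checks: the per-node state of each CSV and the cluster
# networks, via the FailoverClusters PowerShell cmdlets.
import subprocess

def run_ps(command: str) -> str:
    result = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# How every node sees each CSV (direct vs. redirected access, and why).
print(run_ps("Get-ClusterSharedVolumeState | "
             "Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason"))

# Cluster networks and their roles, to confirm the 2019 node is on the same networks.
print(run_ps("Get-ClusterNetwork | Format-Table Name, Role, State, Address"))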

Please, we need help to understand what's causing this to happen.


3 answers

  1. Limitless Technology 39,391 Reputation points
    2021-12-06T09:36:27.253+00:00

    Hi there,

    This issue occurs if you pause one node of a server cluster and then you restart the active cluster node. When the active node restarts, the paused node tries to bring resource groups online.

    Because this node is paused, the node cannot make additional connections, and it cannot bring the Quorum disk group online. Error code 70 corresponds to the following error message:

    The remote server has been paused or is in the process of being started.

    To resolve this issue, resume the paused cluster node before you restart the active cluster node. Before you resume a paused cluster node, you must first determine if a cluster node is paused.
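    A minimal sketch of that check (assuming the FailoverClusters PowerShell module; shelling out from Python just for illustration):

# Sketch: find any paused node and resume it before restarting the active
# node, as described above. Assumes the FailoverClusters PowerShell module.
import subprocess

def run_ps(command: str) -> str:
    result = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

paused_names = run_ps("(Get-ClusterNode | Where-Object State -eq 'Paused').Name")
for name in paused_names.splitlines():
    if name:
        print(f"Resuming paused node: {name}")
        run_ps(f"Resume-ClusterNode -Name {name}")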

    Do follow the link below for a better understanding:

    https://learn.microsoft.com/en-us/troubleshoot/windows-server/high-availability/cluster-service-stops-responding-a-cluster-node

    Hope this resolves your query!

    ----------------------------------------------------------------------------------------------------------------------------------------------------------

    --If the reply is helpful, please Upvote and Accept it as an answer--


  2. Ricardo Simão 1 Reputation point
    2021-12-09T11:58:21.127+00:00

    Hello LimitlessTechnology-2700,

    Thank you for your time and reply.
    We appreciate it!
    But in fact, what you referred to is not our case at all.
    Actually, putting the node on pause is what mitigates the issue; it is not what creates the issue after a restart.

    The problem arises when we try to add a Hyper-V 2019 node to a Hyper-V 2016 cluster; when we do that, we start to see the problems reported initially.


  3. Edd 1 Reputation point
    2022-09-05T13:26:27.857+00:00

    @Ricardo Simão

    Did you get to the bottom of your issue?

    Thanks
    Edd