Failover cluster crashes when 1 node gets restarted

ITNewb 1 Reputation point
2021-09-23T17:42:01.967+00:00

Hello,

We are having an issue with our failover cluster.

We have 4 nodes, all connected to a storage array and all running Server 2016. Whenever we restart just 1 node, it starts to take down the entire cluster. The cluster crashes and we can no longer connect to any node. The VMs get put into a failed state. We have to wait until the node boots up completely, then wait again until the cluster sees all the nodes.

This happens if we restart any node; it isn't a specific one. The same happens when a node crashes. I have tried migrating all VMs off the node before restarting, but the exact same issue happens. Once everything starts crashing, it can take hours before everything goes back to normal.

What could cause the cluster to crash when only 1 of 4 nodes restarts, and how can I fix this issue?

Hyper-V
Windows Server Clustering

1 answer

  1. Limitless Technology 39,351 Reputation points
    2021-09-24T07:28:54.95+00:00

    Hello @ITNewb ,

    This issue occurs if you pause one node of a server cluster and then you restart the active cluster node. When the active node restarts, the paused node tries to bring resource groups online.

    Because this node is paused, the node cannot make additional connections, and it cannot bring the Quorum disk group online. Error code 70 corresponds to the following error message:

    The remote server has been paused or is in the process of being started.

    To resolve this issue, resume the paused cluster node before you restart the active cluster node. Before you resume a paused cluster node, you must first determine if a cluster node is paused.
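As a rough illustration of the check-then-resume step described above, here is a minimal Python sketch. It assumes it runs on a cluster node where the FailoverClusters PowerShell module is available, and it shells out to the `Get-ClusterNode` and `Resume-ClusterNode` cmdlets. The helper names (`parse_nodes`, `paused_nodes`, `resume_command`, `resume_paused_nodes`) are made up for this example, not part of any Microsoft tooling.

```python
import json
import subprocess

# PowerShell that lists each node with its state rendered as text
# (e.g. "Up", "Down", "Paused") instead of a numeric enum value.
LIST_NODES_PS = (
    "Get-ClusterNode | "
    "Select-Object Name, @{n='State';e={$_.State.ToString()}} | "
    "ConvertTo-Json"
)


def parse_nodes(raw_json):
    """Turn ConvertTo-Json output into a list of {'Name', 'State'} dicts."""
    data = json.loads(raw_json)
    # ConvertTo-Json emits a bare object (not a list) when there is one node.
    return [data] if isinstance(data, dict) else list(data)


def paused_nodes(nodes):
    """Return the names of nodes whose state is 'Paused'."""
    return [n["Name"] for n in nodes if n["State"] == "Paused"]


def resume_command(name):
    """Build the PowerShell command that resumes one paused node."""
    return f"Resume-ClusterNode -Name '{name}'"


def run_powershell(command):
    """Run a command with powershell.exe and return its stdout (Windows only)."""
    result = subprocess.run(
        ["powershell.exe", "-NoProfile", "-Command", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def resume_paused_nodes():
    """Hypothetical helper: resume every paused node before a planned restart."""
    nodes = parse_nodes(run_powershell(LIST_NODES_PS))
    for name in paused_nodes(nodes):
        print(f"{name} is paused; resuming it before restarting any other node")
        run_powershell(resume_command(name))
```

Calling `resume_paused_nodes()` from an elevated session on one of the nodes would be the intended use. The same two cmdlets can of course be run directly in PowerShell; the wrapper only makes the "check first, then resume" order explicit.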

    See the following article for more details:

    https://learn.microsoft.com/en-us/troubleshoot/windows-server/high-availability/cluster-service-stops-responding-a-cluster-node

    I hope this answers your questions; if not, please post a follow-up.
    If an Answer is helpful, please click "Accept Answer" and upvote it : )
