NicSmith-0053 asked ErlandSommarskog commented

Windows Failover Cluster/Availability Groups - Issues because all servers of one site were unreachable

Hi everyone,

We had some issues during maintenance and hope someone can explain what happened.

Our cluster looks like this:

Site A:
- Node A1 (Quorum vote)
- Node A2 (Quorum vote)
- Node A3 (Quorum vote)
- Node A4 (NO Quorum vote)

Site B:
- Node B1 (Quorum vote)
- Node B2 (Quorum vote)
- Node B3 (Quorum vote)

Site C:
- File Share Witness

Every node has 4 NICs that are configured for teaming:
- NIC 1 + 2 teamed | Configured for Cluster and Client
- NIC 3 + 4 teamed | Configured for Cluster

So in total we have 4 VLANs on each node.

All nodes run Microsoft Windows Server 2016 Standard and Microsoft SQL Server 2016 Enterprise.

Each of our AGs has one automatic failover partner in each site, and normally the nodes in site A are primary.
For example AG1:
- A1 - PRIMARY
- B1 - SECONDARY

We had to perform firmware updates on our servers. So we did a manual failover of all AGs from site A to site B.
Everything was fine until the installation of the network drivers started.
They started at the same time on all 4 servers of site A.

When these 4 servers reported that they were not connected to the cluster networks, the same happened for all 3 servers on site B as well.
Server A1 wrote into the cluster log that it had lost quorum and was going to shut down the cluster service.
Some of the cluster resources (Availability Groups) also went down.

Server B1, for example, tried to bring its AGs online again, also trying to use the IPs from the VLANs of site A.

After a few minutes, when the servers on site A were back up again, server B1 was able to bring up its resources.
In the SQL log we saw that the states of the AGs changed from "PRIMARY_PENDING" to "PRIMARY_NORMAL". But at that time the DNS registration wasn't finished.
Every database changed from "RESOLVING" to "PRIMARY" except one.
The AG itself was fine but the database was stuck in state "Not Synchronized" on all nodes.
After the SQL Server started listening on the Availability Group Listener again, we could see the following line, with no other message mentioning this database in between:
"Nonqualified transactions are being rolled back in database DB1 for an Always On Availability Groups state change. Estimated rollback completion: 0%. This is an informational message only. No user action is required."
The database is very small, just a few hundred MB, and there couldn't have been many transactions since the application servers which use this database couldn't connect to it. But it was stuck in this rollback state. The percentage of completion didn't even change.
Trying to resume it didn't work. I had to abort the query after 3 - 4 minutes.
Because nothing changed within 30 minutes, we performed a failover of the affected AG from B1 to A1, and the database became healthy within seconds.

I have 3 questions I hope someone can answer:
1. Why were all 3 remaining nodes on site B no longer able to communicate when the 4 servers on site A were down? They should have been able to communicate with each other and also with the File Share Witness on site C.
Could it have something to do with the design described in the following article: https://docs.microsoft.com/en-us/troubleshoot/windows-server/high-availability/cluster-ip-resources-fail-2-node-2-site-fsw-cluster
2. How do you prevent the cluster/cluster resources from going down completely when there is just an issue with cross-VLAN communication?
3. Any idea what happened to DB1, which was stuck in "Not Synchronized" and in the rollback process until we triggered a failover?

Thanks

sql-server-general

TomPhillips-1744 answered NicSmith-0053 edited

Your question is more of a Windows Cluster question than a SQL Server question.

The cluster log file will tell you exactly why this happened. My guess is you lost quorum. So it took down the cluster until the quorum could be reestablished.

Best practice on a cluster is to stagger your reboots to avoid this problem.




Thanks for your answer.

Yes it's kind of a mixed question. Our main problem was question 3 regarding the database that was stuck in the "Not Synchronized" state.

Only server A1 reported that it had lost quorum and would shut down the cluster service.
But we are surprised because we have 3 quorum votes in site A, 3 quorum votes in site B and the File Share Witness in site C.

In the future we won't install firmware updates on all servers of one site at the same time, to be sure it won't happen again.

Criszhan-msft answered ErlandSommarskog commented

Hi,

"When these 4 servers reported that they are not connected to the cluster networks the same happened for all 3 servers on site B as well."

The cluster may have been shut down at that time.
Dynamic quorum management does not allow the cluster to sustain a simultaneous failure of a majority of voting members. To continue running, the cluster must always have a quorum majority at the time of a node shutdown or failure.
https://docs.microsoft.com/en-us/windows-server/failover-clustering/manage-cluster-quorum
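The difference between a simultaneous and a sequential loss of voters can be illustrated with a minimal Python sketch. It is a deliberately simplified model, not the cluster service's actual algorithm: it assumes one vote per running voting node, a witness that is always reachable and whose vote counts only when the number of running voting nodes is even, and it ignores network reachability and timing. The function names are hypothetical; the node names come from the question.

```python
def current_votes(nodes_up, witness_up=True):
    """Votes in effect under the simplified model.

    Assumption: each running voting node has one vote, and the witness
    vote counts only when the number of running voting nodes is even.
    """
    witness_vote = 1 if witness_up and len(nodes_up) % 2 == 0 else 0
    return len(nodes_up) + witness_vote


def survives(nodes_up, failing, witness_up=True):
    """True if the survivors keep a majority of the votes in effect right now."""
    total = current_votes(nodes_up, witness_up)
    witness_vote = total - len(nodes_up)
    survivors = set(nodes_up) - set(failing)
    return len(survivors) + witness_vote > total / 2


def sequential(nodes_up, order, witness_up=True):
    """Fail nodes one at a time, recalculating the votes after each loss."""
    nodes_up = set(nodes_up)
    for node in order:
        if not survives(nodes_up, {node}, witness_up):
            return False
        nodes_up.discard(node)
    return True


# Voting nodes from the question (A4 has no vote and is left out).
voters = {"A1", "A2", "A3", "B1", "B2", "B3"}

print(survives(voters, {"A1", "A2", "A3"}))          # True: 4 of 7 votes remain
print(survives(voters, {"A1", "A2", "A3", "B1"}))    # False: only 3 of 7 remain
print(sequential(voters, ["A1", "A2", "A3", "B1"]))  # True: votes are adjusted after each loss
```

In this model, losing the three voting nodes of site A at once still leaves a majority, so vote counting alone would not account for the outage; the sketch cannot capture whether the site B nodes could actually reach each other and the witness at that moment.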


Thanks for your answer.

We have 3 quorum votes on site A, 3 quorum votes on site B and the File Share Witness on site C.
From my understanding, site B plus the File Share Witness should have had the quorum majority at the time the servers of site A became unreachable.
Is that wrong?
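The arithmetic behind this expectation can be written down as a short Python sketch. The vote assignments are taken from the layout in the question; the `has_majority` helper and the dictionary keys are illustrative only and not part of any cluster API. It counts configured votes and assumes that every listed member can actually reach the others.

```python
# Configured quorum votes, taken from the layout described in the question.
# Node A4 is listed with NO quorum vote, so it contributes 0.
VOTES = {
    "A1": 1, "A2": 1, "A3": 1, "A4": 0,
    "B1": 1, "B2": 1, "B3": 1,
    "FSW": 1,  # File Share Witness in site C
}


def has_majority(reachable, votes=VOTES):
    """Return True if the reachable members hold more than half of all votes."""
    total = sum(votes.values())
    held = sum(votes[member] for member in reachable)
    return held > total / 2


print(sum(VOTES.values()))                       # 7
print(has_majority(["B1", "B2", "B3", "FSW"]))   # True: 4 of 7 votes
print(has_majority(["B1", "B2", "B3"]))          # False: 3 of 7 votes
```

On paper, site B plus the witness holds 4 of 7 votes, while site B without the witness does not. The count therefore depends on the site B nodes still seeing each other and the File Share Witness, which is something only the cluster log can confirm.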


Hello,
Please verify the cluster quorum configuration. Run the Validate a Configuration Wizard (Validate Cluster) in Failover Cluster Manager; you can select just "Validate Quorum Configuration".


Hi,
The validation of the quorum configuration looks fine to me:
104407-quorum-validation.jpg

