question

26000957 avatar image
0 Votes"
26000957 asked Cathyji-msft commented

SQL Server 2019 Availability Group Listener: What's the expected behavior for client connections during automatic failover

We've setup a SQL Server 2019 Enterprise High Availability cluster with an availability group listener (AGL) in front. We are experiencing strange (?) behavior when testing automatic (dirty) failovers compared to when we do manual/planned ones. The most pressing issue is when for clients using Microsoft ODBC Driver 17.7 for SQL Server on Centos 7. But the same happens for .net framework/core as well.

Scenarios


Manual/planned failover

Test

We tell the AG to fail over the cluster from the primary to the secondary replica.

AG-behavior

The AG fails over to the secondary replica.

Client behavior

The clients are instantly disconnected from the AG, the ODBC-drive tries to reconnect, succeeds and the client can continue working.

Automatic/dirty failover (doesn't work as we expect)

Test

We assert the clients are connected to the primary replica through the AG then pull the plug on the primary replica machine (to simulate a power outage).

AG-behavior

The AG fails over to the secondary replica.

Client behavior

The clients doesn't get disconnected and instead continues waiting for a response until the query timeout kicks in (this takes several minutes, way too long).

Question

From what we understand the AG is supposed to disconnect the clients both during manual/planned and automatic failovers, but for us this only happens during the manual/planned ones. Is this the inteded behavior for this solution or is there something wrong with our SQL Server/AG-configuration?

sql-server-general
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Cathyji-msft avatar image
0 Votes"
Cathyji-msft answered Cathyji-msft commented

Hi @JensHittenkofer-9041,

By default, the behavior of the SQL client libraries is to try all IP addresses returned by the DNS lookup - one after another (serially) until the all of the IP addresses have been exhausted and either a connection is made, or a connection timeout threshold has been reached. By default, the availability group listener directs incoming connections to the primary replica. Windows Failover Cluster Manager need time to check the primary replica is failed. Then AG automatic failover occurred. When primary replica failed, the client still try to connect to primary replica until the connection reach the connection timeout threshold. It will waste some time.

Did you set MultiSubnetFailover = TRUE in connection string, SQL Server Native Client will aggressively retry the TCP connection. a multi-subnet scenario, it will attempt connections in parallel, which you already know. a single subnet scenario, it will aggressively retry the TCP connection faster than the OS default TCP retransmit interval, so your reconnect time after an AG failover will be improved.

Refer to this MS document.


If the response is helpful, please click "Accept Answer" and upvote it, thank you.
Note: Please follow the steps in our documentation to enable e-mail notifications if you want to receive the related email notification for this thread.


· 2
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Hi,

Yes, we've tried with and without MultiSubnetFailover = TRUE. It doesn't make any difference.
The question is, when the AG fails over due to a manual/planned-failover all connected clients are actively disconnected, hence the clients notice it right away and reconnects to the new primary node. When the primary node goes down due to a power cut they aren't, and instead they wait until the query timeout occurs (too long). Is this the expected behavior for automatic failovers?

0 Votes 0 ·

Hi @JensHittenkofer-9041,

When you manual AG failover, the resource group will move to secondary replica as expect. But the resource group will try to online several times on primary node when you AG primary failed. It will move the resources to secondary replica when meet the max restart time. Then the AG failover.

Please check the settings for your AG resource policies.
81429-screenshot-2021-03-25-155044.jpg


0 Votes 0 ·
TomPhillips-1744 avatar image
0 Votes"
TomPhillips-1744 answered 26000957 edited

There is always an outage (perceived or not by users) on failover. How long the outage actually is, has a lot of dependencies.

In most cases, the driver does an internal retry and users are not impacted. In some cases, they may see a short outage, while the new server comes up. Normally, this is <30 seconds.

· 3
5 |1600 characters needed characters left characters exceeded

Up to 10 attachments (including images) can be used with a maximum of 3.0 MiB each and 30.0 MiB total.

Yeah.
The failover time for a manual/planned failover is 5-7 seconds and for an automatic 10-20 seconds.
The issue is that when we do a manual failover the connected clients are actively disconnected. While for an automatic failover, this doesn't happen and the clients wait for a timeout. This difference in behavior is an issue for us and we're interested in knowing if its "how its supposed to be" or if we can change something in our configuration to make the AG actively disconnect the clients during an automatic failover (as it does for a manual/planned one).

0 Votes 0 ·

Please see:
https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/failover-and-failover-modes-always-on-availability-groups?view=sql-server-ver15#HowAutoFoWorks

Automatic failover works the same as manual failover. The time it takes to become available is primarily dependent on how long it takes to roll back any uncommitted transactions.

Are your connections being made to the listener?

0 Votes 0 ·

Yes, our connections are being made to the listener. And I've read that document as well and thought so too. What is seems like is that when during a manual failover the primary SQL instance gets a signal from the AGL that it should disconnect all clients and become the secondary. But when we just pull the plug it (of course) won't have the chance to do this. We just assumed that since the connections always pass through the AGL it would be able to close them when this happens but that doesn't seem to be the case
(our supplier also has a support ticket open with Microsoft regarding this but it's very hard to get a straight answer so thought I would try my luck here).

0 Votes 0 ·