Reserve load balancer health probe port on secondary nodes of Always On Availability Groups on Azure VMs

Sivert Solem 16 Reputation points
2022-07-06T09:22:26.33+00:00

We have several two node + witness WSFC SQL Server Always On Availability Groups on Azure VMs, set up with a standard load balancer and health probes.

This has worked fine for the most part, but we have had a couple of incidents where failover failed because a health probe port on the new primary was in use by some other application, such as a monitoring agent.
As a result, WSFC was unable to start the IP address resource of the availability group, and our databases were unavailable for an extended period of time.

So I have two questions:

  1. How can we best reserve the health probe ports on all cluster nodes?
  2. Why isn't that done by WSFC when there are resources with defined probe ports?

We are using port 58888 for the cluster network name, and 59991... for the availability groups.
See the following output for the configuration of the Cluster IP Address resource:

PS C:\windows\system32> Get-ClusterResource "Cluster IP Address" | Get-ClusterParameter  
  
Object             Name                  Value               Type  
------             ----                  -----               ----  
Cluster IP Address Network               Cluster Network 1   String  
Cluster IP Address Address               10.10.10.2          String  
Cluster IP Address SubnetMask            255.255.255.255     String  
Cluster IP Address EnableNetBIOS         0                   UInt32  
Cluster IP Address OverrideAddressMatch  0                   UInt32  
Cluster IP Address EnableDhcp            0                   UInt32  
Cluster IP Address ProbePort             58888               UInt32  
Cluster IP Address ProbeFailureThreshold 0                   UInt32  
Cluster IP Address LeaseObtainedTime     01.01.0001 00:00:00 DateTime  
Cluster IP Address LeaseExpiresTime      01.01.0001 00:00:00 DateTime  
Cluster IP Address DhcpServer            255.255.255.255     String  
Cluster IP Address DhcpAddress           0.0.0.0             String  
Cluster IP Address DhcpSubnetMask        255.0.0.0           String  
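To collect the probe ports of every IP Address resource in one pass (the cluster IP and each availability group listener IP), something like the following sketch could be used. It assumes the FailoverClusters module is available on the node and that filtering on the resource type name 'IP Address' matches all relevant resources in this cluster.

```powershell
Import-Module FailoverClusters

# List the ProbePort parameter of every IP Address resource in the cluster,
# so the full set of ports that must stay free on each node is known.
Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq 'IP Address' } |
    Get-ClusterParameter -Name ProbePort
```

The resulting list is the set of ports that would need to be protected on every node, not just the current primary.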

We have followed this documentation.
https://learn.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/availability-group-manually-configure-tutorial-single-subnet?view=azuresql


2 answers

  1. vipullag-MSFT 24,106 Reputation points Microsoft Employee
    2022-07-22T17:52:04.117+00:00

    @Sivert Solem

    Firstly, apologies for the delay in responding here and any inconvenience this issue may have caused.

    I checked with the internal team to confirm a few things about your issue.

    You described the issue as “failover fails because a health probe port on the new primary is in use by some other application, such as a monitoring agent.”
    So you can try using a different port, or try to figure out which other application(s) are interfering with the Load Balancer resource.
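    One way to see what is holding the port on a given node is sketched below. It assumes the probe port 58888 from the question and uses the built-in Get-NetTCPConnection cmdlet; the ProcessName column is a calculated property added here for illustration.

    ```powershell
    $ProbePort = 58888  # probe port from the question; adjust per resource

    # Show any TCP endpoint currently using the probe port as its local port,
    # together with the name of the process that owns it.
    Get-NetTCPConnection -LocalPort $ProbePort -ErrorAction SilentlyContinue |
        Select-Object LocalAddress, LocalPort, RemotePort, State, OwningProcess,
            @{ Name = 'ProcessName'; Expression = { (Get-Process -Id $_.OwningProcess).ProcessName } }
    ```

    Run it on the secondary nodes: no output means nothing is using the port at that moment, while a row pointing to a monitoring agent or another process would identify the conflict.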

    I don't think this is a problem with WSFC itself, which works for many other customers running on-premises and in other clouds. It could be an issue specific to your Azure environment, which would have to be checked by the Network Support team (I would recommend opening a support case).

    The internal team has confirmed that the WSFC DNN resource works well with the SQL Server listener, so the Load Balancer does not need to be used at all. Ref: Configure DNN listener for availability group - SQL Server on Azure VMs

    Hope that helps.
    If you need further help on this, tag me in a comment.
    If the suggested response helped you resolve your issue, please 'Accept as answer', so that it can help others in the community looking for help on similar topics.


  2. Sivert Solem 16 Reputation points
    2022-08-15T14:10:47.72+00:00

    @vipullag-MSFT
    Sorry for not coming back to you earlier on this myself.
    I was unable to find my question again :P
    I was also unable to comment on your answer, for an unknown reason.

    DNNs must be relatively new.
    We've had the clusters for 1-2 years before this issue first happened.

    The documentation we followed uses Azure Load Balancer, and maybe the tutorial should be updated to use DNN if this is more reliable.

    When setting up with Azure Load Balancer, we configure a health probe as in point 7:
    https://learn.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/availability-group-manually-configure-tutorial-single-subnet?view=azuresql#configure-listener

    $ClusterNetworkName = "<MyClusterNetworkName>" # the cluster network name (Use Get-ClusterNetwork on Windows Server 2012 or higher to find the name)  
    $IPResourceName = "<IPResourceName>" # the IP Address resource name  
    $ListenerILBIP = "<n.n.n.n>" # the IP Address of the Internal Load Balancer (ILB). This is the static IP address for the load balancer you configured in the Azure portal.  
    [int]$ListenerProbePort = <nnnnn>  
      
    Import-Module FailoverClusters  
      
    Get-ClusterResource $IPResourceName | Set-ClusterParameter -Multiple @{"Address"="$ListenerILBIP";"ProbePort"=$ListenerProbePort;"SubnetMask"="255.255.255.255";"Network"="$ClusterNetworkName";"EnableDhcp"=0}  
    

    The Failover Cluster only consumes the port on the primary node of the cluster.
    As mentioned, we had a failover fail because the probe port was in use on the secondary.
    As such, we've had to add the port to the excluded port range list with netsh.

    It feels unnecessary to manually add this exclusion on all nodes.
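    For reference, the exclusion can be sketched as follows, using port 58888 from the cluster IP configuration; the availability group probe ports would each need their own line. As far as I understand, this keeps the port out of dynamic (ephemeral) port allocation but does not stop an application that explicitly binds to that exact port, and the add command may be rejected while the port is in use, so treat it as a sketch rather than a complete safeguard.

    ```powershell
    # Run in an elevated session on every cluster node.
    # Exclude the cluster probe port from dynamic port allocation, persisting across reboots.
    netsh int ipv4 add excludedportrange protocol=tcp startport=58888 numberofports=1 store=persistent

    # Verify that the exclusion is registered.
    netsh int ipv4 show excludedportrange protocol=tcp
    ```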

    Reading up on the DNN, that would require us to update connection strings with the DNN port number as well as the "MultiSubnetFailover=True" parameter.
    I am unable to verify at this time if our applications support the "MultiSubnetFailover=True" parameter.
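    A quick check for .NET-based clients is sketched below. The listener name and port are hypothetical placeholders, and the sketch only shows whether the System.Data.SqlClient connection string builder on that machine accepts the keyword; applications using other drivers (ODBC, JDBC, and so on) would need to be checked separately.

    ```powershell
    # Hypothetical placeholders: replace with the real DNN listener name and port.
    $DnnName = '<dnnListenerName>'
    $DnnPort = '<port>'

    # Build a connection string that targets the DNN listener on its port.
    # An unsupported keyword would make the indexer assignment throw.
    $builder = New-Object System.Data.SqlClient.SqlConnectionStringBuilder
    $builder['Data Source'] = "$DnnName,$DnnPort"
    $builder['Integrated Security'] = $true
    $builder['MultiSubnetFailover'] = $true

    # Print the resulting connection string for use in a test connection.
    $builder.ConnectionString
    ```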