Configure RHEL Cluster for SQL Server Availability Group

This document explains how to create a two-node availability group cluster for SQL Server on Red Hat Enterprise Linux. The clustering layer is based on the Red Hat Enterprise Linux (RHEL) HA add-on, which is built on top of Pacemaker.

Note

Access to Red Hat documentation requires a subscription.

For more details on cluster configuration, resource agent options, and management, see the RHEL reference documentation.

Note

At this point, the SQL Server service's integration with Pacemaker on Linux is not as tightly coupled as it is with WSFC on Windows. From within SQL Server, there is no knowledge of the presence of the cluster; all orchestration is from the outside in, and the service is controlled as a standalone instance by Pacemaker. Also, the virtual network name is specific to WSFC; there is no equivalent in Pacemaker. Always On dynamic management views (DMVs) that query cluster information are expected to return empty rows. You can still create a listener to use for transparent reconnection after failover, but you will have to manually register the listener name in the DNS server with the IP address used to create the virtual IP resource (as explained below).

The following sections walk through the steps to set up a failover cluster solution.

Configure Pacemaker for RHEL

  1. On both cluster nodes, open the Pacemaker firewall ports. To open these ports with firewalld, run the following command:

    sudo firewall-cmd --permanent --add-service=high-availability
    sudo firewall-cmd --reload
    

    If you're using another firewall that doesn't have a built-in high-availability configuration, the following ports need to be opened so that Pacemaker can communicate with the other nodes in the cluster:

    • TCP: Ports 2224, 3121, 21064
    • UDP: Port 5405
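
    If you prefer to open the individual ports rather than rely on a built-in high-availability service definition, commands along the following lines should work with firewalld (a sketch; adapt them to your firewall tooling):

    sudo firewall-cmd --permanent --add-port=2224/tcp
    sudo firewall-cmd --permanent --add-port=3121/tcp
    sudo firewall-cmd --permanent --add-port=21064/tcp
    sudo firewall-cmd --permanent --add-port=5405/udp
    sudo firewall-cmd --reload
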
  2. Install Pacemaker packages on each node.

    sudo yum install pacemaker pcs fence-agents-all resource-agents
    

  3. Set the password for the default user that is created when installing the Pacemaker and Corosync packages. Use the same password on both nodes.

    sudo passwd hacluster
    

  4. Enable and start the pcsd service and Pacemaker. This allows the nodes to rejoin the cluster after a reboot. Run the following commands on both nodes.

    sudo systemctl enable pcsd
    sudo systemctl start pcsd
    sudo systemctl enable pacemaker
    
  5. Create the cluster. To create the cluster, run the following commands:

    sudo pcs cluster auth <nodeName1> <nodeName2…> -u hacluster -p <password for hacluster>
    sudo pcs cluster setup --name <clusterName> <nodeName1> <nodeName2…> 
    sudo pcs cluster start --all
    
    Note

    If you previously configured a cluster on the same nodes, you need to use the '--force' option when running 'pcs cluster setup'. Note that this is equivalent to running 'pcs cluster destroy', and the Pacemaker service needs to be re-enabled using 'sudo systemctl enable pacemaker'.
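
    To verify that both nodes joined and the cluster is running, you can check the cluster status (a quick sanity check; the exact output depends on your pcs version):

    sudo pcs cluster status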

  6. Install the SQL Server resource agent. Run the following command on both nodes.

    sudo yum install mssql-server-ha
    

Configure fencing (STONITH)

Pacemaker cluster vendors require STONITH to be enabled and a fencing device configured for a supported cluster setup. Fencing configuration is needed to allow a surviving cluster node to forcibly remove a non-responsive node from the cluster. For details, see Pacemaker Clusters from Scratch and Red Hat High Availability Add-On with Pacemaker: Fencing.
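
As an illustration only, a fencing device for a node with an IPMI-based management interface might be created with a command along the following lines; the device name, address, and credentials are placeholders, and the appropriate fence agent depends on your hardware or hypervisor:

# Placeholder values shown; replace the fence agent and options for your environment
sudo pcs stonith create ipmi-fence-node1 fence_ipmilan pcmk_host_list=node1 ipaddr=10.0.0.101 login=admin passwd=<password> lanplus=1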

To continue the configuration and validate the cluster setup, disable STONITH (it can be configured at a later time):

sudo pcs property set stonith-enabled=false
Important

This is not supported by the clustering vendors in a production setup.

Create a SQL Server login for Pacemaker

  1. On all SQL Servers, create a SQL Server login for Pacemaker. The following Transact-SQL creates the login:

    USE [master]
    GO
    CREATE LOGIN [pacemakerLogin] WITH PASSWORD = N'ComplexP@$$w0rd!'
    
    ALTER SERVER ROLE [sysadmin] ADD MEMBER [pacemakerLogin]
    

    Alternatively, you can set the permissions at a more granular level. The Pacemaker login requires ALTER, CONTROL, and VIEW DEFINITION permissions. For more information, see GRANT Availability Group Permissions (Transact-SQL).

    The following Transact-SQL grants only the required permissions to the Pacemaker login. In the statement below, 'ag1' is the name of the availability group that will be added as a cluster resource.

    GRANT ALTER, CONTROL, VIEW DEFINITION ON AVAILABILITY GROUP::ag1 TO pacemakerLogin
    
  2. On all SQL Servers, save the credentials for the SQL Server login.

    echo 'pacemakerLogin' >> ~/pacemaker-passwd
    echo 'ComplexP@$$w0rd!' >> ~/pacemaker-passwd
    sudo mv ~/pacemaker-passwd /var/opt/mssql/secrets/passwd
    sudo chown root:root /var/opt/mssql/secrets/passwd
    sudo chmod 400 /var/opt/mssql/secrets/passwd # Only readable by root
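
    To confirm that the login works before Pacemaker takes over, you can test it with sqlcmd (assumes the sqlcmd tools are installed and the instance listens on the default port):

    sqlcmd -S localhost -U pacemakerLogin -P 'ComplexP@$$w0rd!' -Q 'SELECT @@SERVERNAME'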
    

Create availability group resource

To create the availability group resource, set properties as follows:

  • clone-max: Number of AG replicas, including the primary. For example, if you have one primary and one secondary, set this to 2. The default is the number of nodes in the cluster.
  • clone-node-max: Number of replicas that can be started on a node. Set this to 1 (or use the default, which is 1).

The following command creates the availability group resource and sets these properties:

sudo pcs resource create ag_cluster ocf:mssql:ag ag_name=ag1 \
--master meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
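
After the resource is created, verify that the availability group resource started and that one replica was promoted to master. A quick check (the output format depends on your pcs version):

sudo pcs status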

Create virtual IP resource

To create the virtual IP address resource, run the following command on one node. Use an available static IP address from the network. Replace the IP address between `<...>`:

sudo pcs resource create virtualip ocf:heartbeat:IPaddr2 ip=<10.128.16.240>

There is no virtual server name equivalent in Pacemaker. To use a connection string that points to a server name instead of an IP address, register the virtual IP resource's address and the desired virtual server name in DNS. For DR configurations, register the desired virtual server name and IP address with the DNS servers at both the primary and DR sites.
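
Once the name is registered in DNS, clients can connect with that name instead of the IP address. For example, with sqlcmd (a sketch; <virtualServerName>, <userName>, and <password> are placeholders for your environment):

# <virtualServerName>, <userName>, <password> are placeholders
sqlcmd -S <virtualServerName> -U <userName> -P '<password>' -Q 'SELECT @@SERVERNAME'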

Important

Due to a known issue, listener-like capabilities are not working properly in the current preview. We are working on a fix that will be available in upcoming releases.

Add colocation constraint

To add a colocation constraint, run the following command on one node.

sudo pcs constraint colocation add virtualip ag_cluster-master INFINITY with-rsc-role=Master

Add ordering constraint

The colocation constraint has an implicit ordering constraint: it moves the virtual IP resource before it moves the availability group resource. By default, the sequence of events is:

  1. The user issues 'pcs resource move' for the availability group primary from node1 to node2.
  2. The virtual IP resource stops on node 1.
  3. The virtual IP resource starts on node 2.

    Note

    At this point, the IP address temporarily points to node 2 while node 2 is still a pre-failover secondary.

  4. The availability group primary on node 1 is demoted to secondary.

  5. The availability group secondary on node 2 is promoted to primary.

To prevent the IP address from temporarily pointing to the node with the pre-failover secondary, add an ordering constraint.

To add an ordering constraint, run the following command on one node:

sudo pcs constraint order promote ag_cluster-master then start virtualip

Manual failover

Important

After you configure the cluster and add the availability group as a cluster resource, you cannot use Transact-SQL to fail over the availability group resources. SQL Server cluster resources on Linux are not coupled as tightly with the operating system as they are on a Windows Server Failover Cluster (WSFC). The SQL Server service is not aware of the presence of the cluster. All orchestration is done through the cluster management tools: in RHEL or Ubuntu, use 'pcs'; in SLES, use 'crm'.

Important

If the availability group is a cluster resource, there is a known issue in the current release where manual failover to an asynchronous replica does not work. This will be fixed in an upcoming release. Manual or automatic failover to a synchronous replica will succeed.

Manually fail over the availability group with 'pcs'. Do not initiate failover with Transact-SQL.

To manually fail over the availability group to node2, run the following command.

sudo pcs resource move ag_cluster-master node2 --master
Important

After you manually fail over a resource, you need to remove the location constraint that is automatically added during the move.

Remove the location constraint

During a manual move, the move command adds a location constraint for the resource to be placed on the new target node. To see the new constraint, run the following command after manually moving the resource:

sudo pcs constraint --full

You need to remove the location constraint so that future moves, including automatic failover, succeed.

To remove the constraint, run the following command. In the following command, ag_cluster-master is the name of the resource that was moved. Replace this name with the name of your resource:

sudo pcs resource clear ag_cluster-master 

Alternatively, you can run the following command to remove the location constraint. In the following command, cli-prefer-ag_cluster-master is the ID of the constraint that needs to be removed. 'sudo pcs constraint --full' returns this ID.

sudo pcs constraint remove cli-prefer-ag_cluster-master  
Note

Automatic failover does not add a location constraint, so no cleanup is necessary.

Managing synchronous commit mode

Warning

In some cases, a SQL Server availability group in synchronous commit mode on Linux may be vulnerable to data loss. The following details describe the root cause and the workaround to make sure data loss is avoided. This issue will be fixed in upcoming releases.

Pacemaker notification for Availability Group resource promotion

Before the CTP 1.4 release, the Pacemaker resource agent for availability groups could not know whether a replica marked as SYNCHRONOUS_COMMIT was really up-to-date. It was possible that the replica had stopped synchronizing with the primary without the agent being aware of it. Thus the agent could promote an out-of-date replica to primary, which, if successful, would cause data loss.

SQL Server vNext CTP 1.4 adds sequence_number to sys.availability_groups to solve this issue. sequence_number is a monotonically increasing BIGINT that represents how up-to-date the local AG replica is with respect to the rest of the system. Performing failovers, adding/removing replicas and other AG operations update this number. The number is updated on the primary, then pushed to secondaries. Thus a secondary that is up-to-date will have the same number as the primary.
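
To see the value on a given replica, you can query the new column directly (a simple illustrative query; run it on each replica to compare values):

-- Compare this value across replicas; an up-to-date secondary matches the primary
SELECT name, sequence_number FROM sys.availability_groups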

When Pacemaker decides to promote a replica to primary, it first sends a notification to all replicas to extract the sequence number and store it. Next, when Pacemaker actually tries to promote a replica to primary, the replica only promotes itself if its sequence number is the highest of all the sequence numbers and rejects the promote operation otherwise. In this way only the replica with the highest sequence number can be promoted to primary, ensuring no data loss.

Note that this is only guaranteed to work as long as at least one replica available for promotion has the same sequence number as the previous primary. Because the resource agent requires Pacemaker to send notifications to all replicas, the availability group resource needs to be configured with notify=true. For every existing ocf:mssql:ag resource, the user will need to update the resource configuration with notify=true before updating the mssql-server-ha package to CTP 1.4; otherwise, the resource will go into an error state.
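
For an existing resource, updating the meta attribute might look like the following, where ag_cluster-master is the master resource created earlier in this document (verify the exact resource name with 'sudo pcs status' first):

sudo pcs resource update ag_cluster-master meta notify=true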

How to avoid potential for data loss

In the following case, an availability group may still be vulnerable to data loss. In a cluster that includes five nodes, if there is an availability group with replicas on three instances of SQL Server, it is possible for the primary replica and one secondary replica to be ahead of the other secondary replica. When they are ahead, their sequence_number in sys.availability_groups will be higher than that of the other replica. In this condition, if the two replicas lose connectivity with the rest of the cluster, Pacemaker may see the remaining three nodes as a quorum and allow the out-of-date replica to promote itself to primary. In this situation, there is potential for data loss.

SQL Server vNext introduces a new feature to force a certain number of secondaries to be available before any transactions can be committed on the primary. REQUIRED_COPIES_TO_COMMIT lets you set the number of secondary replicas that must write the transaction to their database transaction logs before the transaction can proceed. You can use this option with CREATE AVAILABILITY GROUP or ALTER AVAILABILITY GROUP. See CREATE AVAILABILITY GROUP.

To avoid the potential data loss described above, set REQUIRED_COPIES_TO_COMMIT to the maximum number of synchronized replicas. Upcoming releases will have built-in logic to set REQUIRED_COPIES_TO_COMMIT automatically. When REQUIRED_COPIES_TO_COMMIT is set, transactions on the primary replica databases wait until the transaction is committed on the required number of synchronous secondary replica database transaction logs. If not enough synchronous secondary replicas are online, transactions stop until communication with sufficient secondary replicas resumes.

The following example sets REQUIRED_COPIES_TO_COMMIT = 2 for an availability group named [ag1].

ALTER AVAILABILITY GROUP [ag1]
SET (REQUIRED_COPIES_TO_COMMIT = 2)

In the example above, if the availability group has two secondary replicas, the primary waits until both secondary replicas acknowledge commits. If one becomes unresponsive, transactions are blocked.
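
To help choose a value, you can check how many secondary replicas are currently synchronized using the standard Always On DMVs; a query along these lines shows the synchronization state per replica:

-- Lists each replica and the synchronization state of its databases
SELECT ar.replica_server_name, drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id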