Configure Ubuntu Cluster and Availability Group Resource

This document explains how to create a two-node cluster on Ubuntu and add a previously created availability group as a resource in the cluster.

Note

At this point, SQL Server's integration with Pacemaker on Linux is not as coupled as with WSFC on Windows. From within SQL, there is no knowledge about the presence of the cluster, all orchestration is outside in and the service is controlled as a standalone instance by Pacemaker. Also, virtual network name is specific to WSFC, there is no equivalent of the same in Pacemaker. Always On dynamic management views that query cluster information will return empty rows. You can still create a listener to use it for transparent reconnection after failover, but you will have to manually register the listener name in the DNS server with the IP used to create the virtual IP resource (as explained below).

The following sections walk through the steps to set up a failover cluster solution.

Roadmap

The steps to create an availability group on Linux servers for high availability are different from the steps on a Windows Server failover cluster. The following list describes the high level steps:

  1. Configure SQL Server on the cluster nodes.

  2. Create the availability group.

  3. Configure a cluster resource manager, like Pacemaker. These instructions are in this document.

    The way to configure a cluster resource manager depends on the specific Linux distribution.

    Important

    Production environments require a fencing agent, like STONITH for high availability. The demonstrations in this documentation do not use fencing agents. The demonstrations are for testing and validation only.

    A Linux cluster uses fencing to return the cluster to a known state. The way to configure fencing depends on the distribution and the environment. At this time, fencing is not available in some cloud environments. See Support Policies for RHEL High Availability Clusters - Virtualization Platforms for more information.

  4. Add the availability group as a resource in the cluster.

Install and configure Pacemaker on each cluster node

  1. On all nodes open the firewall ports. Open the port for the Pacemaker high-availability service, SQL Server instance, and the availability group endpoint. The default TCP port for server running SQL Server is 1433.

    sudo ufw allow 2224/tcp
    sudo ufw allow 3121/tcp
    sudo ufw allow 21064/tcp
    sudo ufw allow 5405/udp
    
    sudo ufw allow 1433/tcp # Replace with TDS endpoint
    sudo ufw allow 5022/tcp # Replace with DATA_MIRRORING endpoint
    
    sudo ufw reload
    

    Alternatively, you can just disable the firewall:

    sudo ufw disable
    
  2. Install Pacemaker packages. On all nodes, run the following commands:

    sudo apt-get install pacemaker pcs fence-agents resource-agents
    
  3. Set the password for the default user that is created when installing Pacemaker and Corosync packages. Use the same password on both nodes.

    sudo passwd hacluster
    

Enable and start pcsd service and Pacemaker

The following command enables and starts pcsd service and pacemaker. This allows the nodes to rejoin the cluster after reboot.

sudo systemctl enable pcsd
sudo systemctl start pcsd
sudo systemctl enable pacemaker
Note

Enable pacemaker command will complete with the error 'pacemaker Default-Start contains no runlevels, aborting.' This is harmless, cluster configuration can continue. We are following up with cluster vendors for fixing this issue.

Create the Cluster

  1. Remove any existing cluster configuration.

    Running 'sudo apt-get install pcs' installs pacemaker, corosync, and pcs at the same time and starts running all 3 of the services. Starting corosync generates a template '/etc/cluster/corosync.conf' file. To have next steps succeed this file should not exist – so the workaround is to stop pacemaker / corosync and delete '/etc/cluster/corosync.conf', and then next steps will complete successfully. 'pcs cluster destroy' does the same thing, and you can use it as a one time initial cluster setup step.

    The following command removes any existing cluster configuration files and stops all cluster services. This permanently destroys the cluster. Run it as a first step in a pre-production environment. Run the following command on all nodes. Note that 'pcs cluster destroy' disabled the pacemaker service and needs to be reenabled.

    Warning

    The command will destroy any existing cluster resources.

    sudo pcs cluster destroy # On all nodes
    sudo systemctl enable pacemaker
    
  2. Create the cluster.

    Warning

    Due to a known issue that the clustering vendor is investigating, starting the cluster ('pcs cluster start') will fail with below error. This is because the log file configured in /etc/corosync/corosync.conf is wrong. To workaround this issue, change the log file to: /var/log/corosync/corosync.log. Alternatively you could create the /var/log/cluster/corosync.log file.

    Job for corosync.service failed because the control process exited with error code. 
    See "systemctl status corosync.service" and "journalctl -xe" for details.
    

The following command creates a two node cluster. Before you run the script, replace the values between **< ... >**. Run the following command on the primary SQL Server.

sudo pcs cluster auth **<nodeName1>** **<nodeName2>**  -u hacluster -p **<password for hacluster>**
sudo pcs cluster setup --name **<clusterName>** **<nodeName1>** **<nodeName2…>** 
sudo pcs cluster start --all
Note

If you previously configured a cluster on the same nodes, you need to use '--force' option when running 'pcs cluster setup'. Note this is equivalent to running 'pcs cluster destroy' and pacemaker service needs to be reenabled using 'sudo systemctl enable pacemaker'.

Configure fencing (STONITH)

Pacemaker cluster vendors require STONITH to be enabled and a fencing device configured for a supported cluster setup. When the cluster resource manager cannot determine the state of a node or of a resource on a node, fencing is used to bring the cluster to a known state again. Resource level fencing ensures mainly that there is no data corruption in case of an outage by configuring a resource. You can use resource level fencing, for instance, with DRBD (Distributed Replicated Block Device) to mark the disk on a node as outdated when the communication link goes down. Node level fencing ensures that a node does not run any resources. This is done by resetting the node and the Pacemaker implementation of it is called STONITH (which stands for "shoot the other node in the head"). Pacemaker supports a great variety of fencing devices, e.g. an uninterruptible power supply or management interface cards for servers. For more details, see Pacemaker Clusters from Scratch and Fencing and Stonith

Because the node level fencing configuration depends heavily on your environment, we will disable it for this tutorial (it can be configured at a later time):

sudo pcs property set stonith-enabled=false
Important

Disabling STONITH is just for testing purposes. If you plan to use Pacemaker in a production environment, you should plan a STONITH implementation depending on your environment and keep it enabled. Note that at this point there are no fencing agents for any cloud environments (including Azure) or Hyper-V. Consequentially, the cluster vendor does not offer support for running production clusters in these environments. We are working on a solution for this gap that will be available in future releases.

Set cluster property start-failure-is-fatal to false

Start-failure-is-fatal indicates whether a failure to start a resource on a node prevents further start attempts on that node. When set to false, the cluster will decide whether to try starting on the same node again based on the resource's current failure count and migration threshold. So, after failover occurs, Pacemaker will retry starting the availability group resource on the former primary once the SQL instance is available. Pacemaker will take care of demoting the replica to secondary and it will automatically rejoin the availability group. To update the property value to false run:

pcs property set start-failure-is-fatal=false

If the property has the default value of true, if first attempt to start the resource fails, user intervention is required after an automatic failover to cleanup the resource failure count and reset the configuration using: pcs resource cleanup <resourceName> command.

Install SQL Server resource agent for integration with Pacemaker

Run the following commands on all nodes.

sudo apt-get install mssql-server-ha

Create a SQL Server login for Pacemaker

  1. On all SQL Servers, create a Server login for Pacemaker. The following Transact-SQL creates a login:

    USE [master]
    GO
    CREATE LOGIN [pacemakerLogin] with PASSWORD= N'ComplexP@$$w0rd!'
    
    ALTER SERVER ROLE [sysadmin] ADD MEMBER [pacemakerLogin]
    

    Alternatively, you can set the permissions at a more granular level. The Pacemaker login requires ALTER, CONTROL, and VIEW DEFINITION PERMISSION. For more information, see GRANT Availability Group Permissions (Transact-SQL).

    The following Transact-SQL grants only the required permission to the Pacemaker login. In the statement below 'ag1' is the name of the availability group that will be added as a cluster resource.

    GRANT ALTER, CONTROL, VIEW DEFINITION ON AVAILABILITY GROUP::ag1 TO pacemakerLogin
    
  2. On all SQL Servers, save the credentials for the SQL Server login.

    echo 'pacemakerLogin' >> ~/pacemaker-passwd
    echo 'ComplexP@$$w0rd!' >> ~/pacemaker-passwd
    sudo mv ~/pacemaker-passwd /var/opt/mssql/secrets/passwd
    sudo chown root:root /var/opt/mssql/secrets/passwd
    sudo chmod 400 /var/opt/mssql/secrets/passwd # Only readable by root
    

Create availability group resource

To create the availability group resource, use pcs resource create command and set the resource properties. Below command creates a ocf:mssql:ag master/slave type resource for availability group with name ag1.

sudo pcs resource create ag_cluster ocf:mssql:ag ag_name=ag1 --master meta notify=true

Create virtual IP resource

To create the virtual IP address resource, run the following command on one node. Use an available static IP address from the network. Before you run the script, replace the values between **< ... >** with a valid IP address.

sudo pcs resource create virtualip ocf:heartbeat:IPaddr2 ip=**<10.128.16.240>**

There is no virtual server name equivalent in Pacemaker. To use a connection string that points to a string server name and not use the IP address, register the IP resource address and desired virtual server name in DNS. For DR configurations, register the desired virtual server name and IP address with the DNS servers on both primary and DR site.

Add colocation constraint

Almost every decision in a Pacemaker cluster, like choosing where a resource should run, is done by comparing scores. Scores are calculated per resource, and the cluster resource manager chooses the node with the highest score for a particular resource. (If a node has a negative score for a resource, the resource cannot run on that node.) We can manipulate the decisions of the cluster with constraints. Constraints have a score. If a constraint has a score lower than INFINITY, it is only a recommendation. A score of INFINITY means it is a must. We want to ensure that primary of the availability group and the virtual ip resource are run on the same host, so we will define a colocation constraint with a score of INFINITY. To add the colocation constraint, run the following command on one node.

sudo pcs constraint colocation add virtualip ag_cluster-master INFINITY with-rsc-role=Master

Add ordering constraint

The colocation constraint has an implicit ordering constraint. It moves the virtual IP resource before it moves the availability group resource. By default the sequence of events is:

  1. User issues pcs resource move to the availability group primary from node1 to node2.
  2. The virtual IP resource stops on nodeName1.
  3. The virtual IP resource starts on nodeName2.

    Note

    At this point, the IP address temporarily points to nodeName2 while nodeName2 is still a pre-failover secondary.

  4. The availability group primary on nodeName1 is demoted to secondary.

  5. The availability group secondary on nodeName2 is promoted to primary.

To prevent the IP address from temporarily pointing to the node with the pre-failover secondary, add an ordering constraint.

To add an ordering constraint, run the following command on one node:

sudo pcs constraint order promote ag_cluster-master then start virtualip
Important

After you configure the cluster and add the availability group as a cluster resource, you cannot use Transact-SQL to fail over the availability group resources. SQL Server cluster resources on Linux are not coupled as tightly with the operating system as they are on a Windows Server Failover Cluster (WSFC). SQL Server service is not aware of the presence of the cluster. All orchestration is done through the cluster management tools. In RHEL or Ubuntu use pcs.

Understand SQL Server resource agent for pacemaker

Before the CTP 1.4 release, the Pacemaker resource agent for availability groups could not know if a replica marked as SYNCHRONOUS_COMMIT was really up-to-date or not. It was possible that the replica had stopped synchronizing with the primary but was not aware. Thus the agent could promote an out-of-date replica to primary - which, if successful, would cause data loss.

SQL Server 2017 CTP 1.4 added sequence_number to sys.availability_groups to solve this issue. sequence_number is a monotonically increasing BIGINT that represents how up-to-date the local availability group replica is with respect to the rest of the replicas in the availability group. Performing failovers, adding or removing replicas, and other availability group operations update this number. The number is updated on the primary, then pushed to secondary replicas. Thus a secondary replica that is up-to-date will have the same sequence_number as the primary.

When Pacemaker decides to promote a replica to primary, it first sends a notification to all replicas to extract the sequence number and store it (we call this the pre-promote notification). Next, when Pacemaker actually tries to promote a replica to primary, the replica only promotes itself if its sequence number is the highest of all the sequence numbers from all replicas and rejects the promote operation otherwise. In this way only the replica with the highest sequence number can be promoted to primary, ensuring no data loss.

Note that this is only guaranteed to work as long as at least one replica available for promotion has the same sequence number as the previous primary. To ensure this, the default behavior is for the Pacemaker resource agent to automatically set REQUIRED_COPIES_TO_COMMIT such that at least one synchronous commit secondary replica is up to date and available to be the target of an automatic failover. With each monitoring action, the value of REQUIRED_COPIES_TO_COMMIT is computed (and updated if necessary) as ('number of synchronous commit replicas' / 2). Then, at failover time, the resource agent will require (total number of replicas - required_copies_to_commit replicas) to respond to the pre-promote notification to be able to promote one of them to primary. The replica with the highest sequence_number will be promoted to primary.

For example, let's consider the case of an availability group with three synchronous replicas - one primary replica and two synchronous commit secondary replicas.

  • REQUIRED_COPIES_TO_COMMIT is 3 / 2 = 1

  • The required number of replicas to respond to pre-promote action is 3 - 1 = 2. So 2 replicas have to be up for the failover to be triggered. This means that, in the case of primary outage, if one of the secondary replicas is unresponsive and only one of the secondaries responds to the pre-promote action, the resource agent cannot guarantee that the secondary that responded has the highest sequence_number, and a failover is not triggered.

A user can choose to override the default behavior, and configure the availability group resource to not set REQUIRED_COPIES_TO_COMMIT automatically as above.

Important

When REQUIRED_COPIES_TO_COMMIT is 0 there is risk of data loss. In the case of an outage of the primary, the resource agent will not automatically trigger a failover. The user has to decide if they want to wait for primary to recover or manually fail over.

To set REQUIRED_COPIES_TO_COMMIT to 0, run:

sudo pcs resource update <**ag1**> required_copies_to_commit=0

To revert to default computed value, run:

sudo pcs resource update <**ag1**> required_copies_to_commit=
Note

Updating resource properties causes all replicas to stop and restart. This means primary will temporarily be demoted to secondary, then promoted again which will casue temporary write unavailability. The new value for REQUIRED_COPIES_TO_COMMIT will only be set once replicas are restarted, so it won't be instantaneous with running the pcs command.

Balancing high availability and data protection

The above default behavior applies to the case of 2 synchronous replicas (primary + secondary) as well. Pacemaker will default REQUIRED_COPIES_TO_COMMIT = 1 to ensure the secondary replica is always up to date for maximum data protection.

Warning

This comes with higher risk of unavailability of the primary replica due to planned or unplanned outages on the secondary. The user can choose to change the default behavior of the resource agent and override the REQUIRED_COPIES_TO_COMMIT to 0:

sudo pcs resource update <**ag1**> required_copies_to_commit=0

Once overridden, the resource agent will use the new setting for REQUIRED_COPIES_TO_COMMIT and stop computing it. This means that users have to manually update it accordingly (for example, if they increase the number of replicas).

The tables below describes the outcome of an outage for primary or secondary replicas in different availability group resource configurations:

Availability group - 2 sync replicas

Primary outage One secondary replica outage
REQUIRED_COPIES_TO_COMMIT=0 User has to issue a manual FAILOVER.
Might have data loss.
New primary is R/W
Primary is R/W, running exposed to data loss
REQUIRED_COPIES_TO_COMMIT=1 * Cluster will automatically issue FAILOVER
No data loss.
New primary will reject all connections until former primary recovers and joins availability group as secondary.
Primary will reject all connections until secondary recovers.

* SQL Server resource agent for Pacemaker default behavior.

Availability group - 3 sync replicas

Primary outage One secondary replica outage
REQUIRED_COPIES_TO_COMMIT=0 User has to issue a manual FAILOVER.
Might have data loss.
New primary is R/W
Primary is R/W
REQUIRED_COPIES_TO_COMMIT=1 * Cluster will automatically issue FAILOVER.
No data loss.
New primary is RW
Primary is R/W

* SQL Server resource agent for Pacemaker default behavior.