Configure SLES Cluster for SQL Server Availability Group

This guide provides instructions to create a two-node cluster for SQL Server on SUSE Linux Enterprise Server (SLES) 12 SP2. The clustering layer is based on SUSE High Availability Extension (HAE) built on top of Pacemaker.

For more details on cluster configuration, resource agent options, management, best practices, and recommendations, see SUSE Linux Enterprise High Availability Extension 12 SP2.

Note

At this point, SQL Server's integration with Pacemaker on Linux is not as tightly coupled as its integration with WSFC on Windows. The SQL Server service on Linux is not cluster aware. Pacemaker handles all of the orchestration of the cluster resources as if SQL Server were a standalone instance. Also, the virtual network name is specific to WSFC; there is no equivalent in Pacemaker. On Linux, the Always On Availability Group Dynamic Management Views (DMVs) that report cluster information return empty rows. You can still create a listener to use for transparent reconnection after failover, but you have to manually register the listener name in the DNS server with the IP address used to create the virtual IP resource (as explained below).
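For example, if your DNS server accepts dynamic updates, the manual registration could look like the following minimal sketch. The listener name (aglistener), zone (contoso.com), DNS server address, key file path, and virtual IP are hypothetical placeholders; you can equally add a static A record through your usual DNS management tooling.

# Hedged sketch only: register a hypothetical listener name against the IP
# address that will back the IPaddr2 virtual IP resource created later.
nsupdate -k /etc/rndc.key <<'EOF'
server 10.128.18.10
update add aglistener.contoso.com 3600 A 10.128.18.200
send
EOF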

Roadmap

The steps to create an availability group on Linux servers for high availability are different from the steps on a Windows Server failover cluster. The following list describes the high level steps:

  1. Configure SQL Server on the cluster nodes.

  2. Create the availability group.

  3. Configure a cluster resource manager, like Pacemaker. These instructions are in this document.

    The way to configure a cluster resource manager depends on the specific Linux distribution.

    Important

    Production environments require a fencing agent, like STONITH, for high availability. The demonstrations in this documentation do not use fencing agents. The demonstrations are for testing and validation only.

    A Linux cluster uses fencing to return the cluster to a known state. The way to configure fencing depends on the distribution and the environment. At this time, fencing is not available in some cloud environments. See SUSE Linux Enterprise High Availability Extension.

  4. Add the availability group as a resource in the cluster.

Prerequisites

To complete the end-to-end scenario below, you need two machines to deploy the two-node cluster. The steps below outline how to configure these servers.

Set up and configure the operating system on each cluster node

The first step is to configure the operating system on the cluster nodes. For this walkthrough, use SLES 12 SP2 with a valid subscription for the HA add-on.

Install and configure SQL Server service on each cluster node

  1. Install and set up the SQL Server service on both nodes. For detailed instructions, see Install SQL Server on Linux.

  2. Designate one node as primary and the other as secondary, for purposes of configuration. Use these terms throughout this guide.

  3. Make sure the nodes that will be part of the cluster can communicate with each other.

    The following example shows /etc/hosts with additions for two nodes named SLES1 and SLES2.

    127.0.0.1   localhost
    10.128.18.128 SLES1
    10.128.16.77 SLES2
    

    All cluster nodes must be able to access each other via SSH. Tools like hb_report or crm_report (for troubleshooting) and Hawk's History Explorer require passwordless SSH access between the nodes; otherwise they can only collect data from the current node (see the sketch below). If you use a non-standard SSH port, use the -X option (see the man page). For example, if your SSH port is 3479, invoke crm_report with:

    crm_report -X "-p 3479" [...]
    

    For additional information, see the SLES Administration Guide - Miscellaneous section.
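    If the ha-cluster bootstrap scripts have not already set this up for you, the following is a minimal sketch of distributing root SSH keys between the two nodes (host names SLES1 and SLES2 from the example above); adapt it to your environment and security policies:

    # Run on SLES1 as root; repeat the equivalent on SLES2 for bidirectional access.
    # Generate a key pair if one does not already exist.
    ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa

    # Copy the public key to the other node so root can log in without a password.
    ssh-copy-id root@SLES2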

Create a SQL Server login for Pacemaker

  1. On all SQL Servers, create a server login for Pacemaker. The following Transact-SQL creates a login:

    USE [master]
    GO
    CREATE LOGIN [pacemakerLogin] WITH PASSWORD = N'ComplexP@$$w0rd!';
    
    ALTER SERVER ROLE [sysadmin] ADD MEMBER [pacemakerLogin];
    

    Alternatively, you can set the permissions at a more granular level. The Pacemaker login requires ALTER, CONTROL, and VIEW DEFINITION permissions. For more information, see GRANT Availability Group Permissions (Transact-SQL).

    The following Transact-SQL grants only the required permissions to the Pacemaker login. In the statement below, 'ag1' is the name of the availability group that will be added as a cluster resource.

    GRANT ALTER, CONTROL, VIEW DEFINITION ON AVAILABILITY GROUP::ag1 TO pacemakerLogin
    
  2. On all SQL Servers, save the credentials for the SQL Server login.

    echo 'pacemakerLogin' >> ~/pacemaker-passwd
    echo 'ComplexP@$$w0rd!' >> ~/pacemaker-passwd
    sudo mv ~/pacemaker-passwd /var/opt/mssql/secrets/passwd
    sudo chown root:root /var/opt/mssql/secrets/passwd
    sudo chmod 400 /var/opt/mssql/secrets/passwd # Only readable by root
    

Configure an Always On Availability Group

On Linux servers, configure the availability group and then configure the cluster resources. To configure the availability group, see Configure Always On Availability Group for SQL Server on Linux.

Install and configure Pacemaker on each cluster node

  1. Install the High Availability extension.

    For reference, see Installing SUSE Linux Enterprise Server and High Availability Extension.

  2. Install SQL Server resource agent package on both nodes.

    sudo zypper install mssql-server-ha
    

Set up the first node

Refer to the SLES installation instructions.

  1. Log in as root to the physical or virtual machine you want to use as a cluster node.
  2. Start the bootstrap script by executing:

    ha-cluster-init
    

    If NTP has not been configured to start at boot time, a message appears.

    If you decide to continue anyway, the script will automatically generate keys for SSH access and for the Csync2 synchronization tool and start the services needed for both.

  3. To configure the cluster communication layer (Corosync):

    a. Enter a network address to bind to. By default, the script will propose the network address of eth0. Alternatively, enter a different network address, for example the address of bond0.

    b. Enter a multicast address. The script proposes a random address that you can use as default.

    c. Enter a multicast port. The script proposes 5405 as default.

    d. To configure SBD (STONITH Block Device) as the fencing mechanism, enter a persistent path to the partition of the block device that you want to use for SBD. The path must be consistent across all nodes in the cluster.

    Finally, the script starts the Pacemaker service to bring the one-node cluster online and enables the Hawk2 web management interface. The URL to use for Hawk2 is displayed on the screen.

  4. For details of the setup process, check /var/log/sleha-bootstrap.log. You now have a running one-node cluster. Check the cluster status with crm status:

    crm status
    

    You can also see cluster configuration with crm configure show xml or crm configure show.

  5. The bootstrap procedure creates a Linux user named hacluster with the password linux. Replace the default password with a secure one as soon as possible:

    passwd hacluster
    

Add nodes to the existing cluster

If you have a cluster running with one or more nodes, add more cluster nodes with the ha-cluster-join bootstrap script. The script only needs access to an existing cluster node and will complete the basic setup on the current machine automatically. Follow the steps below:

If you have configured the existing cluster nodes with the YaST cluster module, make sure the following prerequisites are fulfilled before you run ha-cluster-join:

  • The root user on the existing nodes has SSH keys in place for passwordless login.
  • Csync2 is configured on the existing nodes. For details, refer to Configuring Csync2 with YaST.
  1. Log in as root to the physical or virtual machine that is supposed to join the cluster.
  2. Start the bootstrap script by executing:

    ha-cluster-join
    

    If NTP has not been configured to start at boot time, a message appears.

  3. If you decide to continue anyway, you will be prompted for the IP address of an existing node. Enter the IP address.
  4. If you have not already configured passwordless SSH access between both machines, you will also be prompted for the root password of the existing node. After logging in to the specified node, the script copies the Corosync configuration, configures SSH and Csync2, and brings the current machine online as a new cluster node. Apart from that, it starts the service needed for Hawk. If you have configured shared storage with OCFS2, it also automatically creates the mount point directory for the OCFS2 file system.
  5. Repeat the steps above for all machines you want to add to the cluster.

  6. For details of the process, check /var/log/ha-cluster-bootstrap.log.

  7. Check the cluster status with crm status. If you have successfully added a second node, the output will be similar to the following:

    crm status
    
    2 nodes configured
    1 resource configured
    Online: [ SLES1 SLES2 ]
    Full list of resources:   
    admin_addr     (ocf::heartbeat:IPaddr2):       Started SLES1
    
    Note

    admin_addr is the virtual IP cluster resource that was configured during the initial one-node cluster setup.

After adding all nodes, check if you need to adjust the no-quorum-policy in the global cluster options. This is especially important for two-node clusters. For more information, refer to Section 4.1.2, Option no-quorum-policy.
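As a sketch only, you can inspect this property with the crm shell and, if the SLES guidance for your version recommends it for a two-node cluster, change it; verify the recommended value in the referenced section before applying it:

# Show the currently configured cluster properties that mention no-quorum-policy
crm configure show | grep no-quorum-policy

# Example only: older SLES guidance for two-node clusters set the policy to ignore
crm configure property no-quorum-policy=ignore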

Configure the cluster resources for SQL Server

Refer to the SLES Administration Guide.

Create availability group resource

The following commands create and configure the availability group resource for two replicas of availability group [ag1]. Run them on one of the nodes in the cluster:

  1. Run crm configure to open the crm prompt:

    crm configure 
    
  2. In the crm prompt, run the command below to configure the resource properties.

    primitive ag_cluster \
       ocf:mssql:ag \
       params ag_name="ag1" 
    ms ms-ag_cluster ag_cluster \
       meta notify="true"
    commit
    

Create virtual IP resource

If you did not create the virtual IP resource when you ran ha-cluster-init, you can create this resource now. The following command creates a virtual IP resource. Replace <**0.0.0.0**> with an available address from your network and <**24**> with the number of bits in the CIDR subnet mask. Run it on one node.

crm configure \
primitive admin_addr \
   ocf:heartbeat:IPaddr2 \
   params ip=<**0.0.0.0**> \
      cidr_netmask=<**24**>
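After the resource is created, you can verify that the address was brought up on the node that owns it (admin_addr is the resource name used above):

# Confirm the virtual IP resource is started and see which node is hosting it
crm status

# On that node, the virtual IP should be visible on the network interface
ip addr show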

Add colocation constraint

Almost every decision in a Pacemaker cluster, like choosing where a resource should run, is made by comparing scores. Scores are calculated per resource, and the cluster resource manager chooses the node with the highest score for a particular resource. (If a node has a negative score for a resource, the resource cannot run on that node.) You can influence the decisions of the cluster with constraints. Constraints have a score. If a constraint has a score lower than INFINITY, it is only a recommendation. A score of INFINITY means it is mandatory. To ensure that the primary of the availability group and the virtual IP resource run on the same host, define a colocation constraint with a score of INFINITY.

To set a colocation constraint for the virtual IP to run on the same node as the master, run the following command on one node:

crm configure
colocation vip_on_master inf: \
    admin_addr ms-ag_cluster:Master
commit

Add ordering constraint

The colocation constraint has an implicit ordering constraint. It moves the virtual IP resource before it moves the availability group resource. By default the sequence of events is:

  1. The user issues a resource migration of the availability group master from node 1 to node 2.
  2. The virtual IP resource stops on node 1.
  3. The virtual IP resource starts on node 2. At this point, the IP address temporarily points to node 2 while node 2 is still a pre-failover secondary.
  4. The availability group master on node 1 is demoted to slave.
  5. The availability group slave on node 2 is promoted to master.

To prevent the IP address from temporarily pointing to the node with the pre-failover secondary, add an ordering constraint. To add an ordering constraint, run the following command on one node:

crm configure \
   order ag_first inf: ms-ag_cluster:promote admin_addr:start
Important

After you configure the cluster and add the availability group as a cluster resource, you cannot use Transact-SQL to fail over the availability group resources. SQL Server cluster resources on Linux are not coupled as tightly with the operating system as they are on a Windows Server Failover Cluster (WSFC). The SQL Server service is not aware of the presence of the cluster. All orchestration is done through the cluster management tools. In SLES, use crm.

Manually fail over the availability group with crm. Do not initiate failover with Transact-SQL. For instructions, see Failover.
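As a hedged sketch of what a manual failover looks like with the crm shell (resource and node names taken from the examples in this guide; see the Failover instructions for the authoritative procedure):

# Move the master role of the availability group resource to the other node
crm resource migrate ms-ag_cluster SLES2

# Remove the location constraint created by the migrate command once failover completes
crm resource unmigrate ms-ag_cluster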

For additional details, see the SLES Administration Guide.

Understand the SQL Server resource agent for Pacemaker

Before the CTP 1.4 release, the Pacemaker resource agent for availability groups could not know whether a replica marked as SYNCHRONOUS_COMMIT was really up to date or not. It was possible that the replica had stopped synchronizing with the primary without the agent being aware of it. Thus the agent could promote an out-of-date replica to primary, which, if successful, would cause data loss.

SQL Server 2017 CTP 1.4 added sequence_number to sys.availability_groups to solve this issue. sequence_number is a monotonically increasing BIGINT that represents how up-to-date the local availability group replica is with respect to the rest of the replicas in the availability group. Performing failovers, adding or removing replicas, and other availability group operations update this number. The number is updated on the primary, then pushed to secondary replicas. Thus a secondary replica that is up-to-date will have the same sequence_number as the primary.
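To inspect this value you can query sys.availability_groups, for example via sqlcmd; the following is a minimal sketch that assumes sqlcmd is installed and reuses the pacemakerLogin credentials created earlier:

# Compare the output across replicas: an up-to-date secondary reports the same
# sequence_number as the primary.
sqlcmd -S localhost -U pacemakerLogin -P 'ComplexP@$$w0rd!' \
   -Q "SELECT name, sequence_number FROM sys.availability_groups"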

When Pacemaker decides to promote a replica to primary, it first sends a notification to all replicas to extract the sequence number and store it (we call this the pre-promote notification). Next, when Pacemaker actually tries to promote a replica to primary, the replica only promotes itself if its sequence number is the highest of all the sequence numbers from all replicas and rejects the promote operation otherwise. In this way only the replica with the highest sequence number can be promoted to primary, ensuring no data loss.

Note that this is only guaranteed to work as long as at least one replica available for promotion has the same sequence number as the previous primary. To ensure this, the default behavior is for the Pacemaker resource agent to automatically set REQUIRED_COPIES_TO_COMMIT such that at least one synchronous commit secondary replica is up to date and available to be the target of an automatic failover. With each monitoring action, the value of REQUIRED_COPIES_TO_COMMIT is computed (and updated if necessary) as ('number of synchronous commit replicas' / 2). Then, at failover time, the resource agent will require (total number of replicas - required_copies_to_commit replicas) to respond to the pre-promote notification to be able to promote one of them to primary. The replica with the highest sequence_number will be promoted to primary.

For example, let's consider the case of an availability group with three synchronous replicas - one primary replica and two synchronous commit secondary replicas.

  • REQUIRED_COPIES_TO_COMMIT is 3 / 2 = 1

  • The required number of replicas to respond to pre-promote action is 3 - 1 = 2. So 2 replicas have to be up for the failover to be triggered. This means that, in the case of primary outage, if one of the secondary replicas is unresponsive and only one of the secondaries responds to the pre-promote action, the resource agent cannot guarantee that the secondary that responded has the highest sequence_number, and a failover is not triggered.

A user can choose to override the default behavior, and configure the availability group resource to not set REQUIRED_COPIES_TO_COMMIT automatically as above.

Important

When REQUIRED_COPIES_TO_COMMIT is 0, there is a risk of data loss. In the case of an outage of the primary, the resource agent will not automatically trigger a failover. The user has to decide whether to wait for the primary to recover or to manually fail over.

To set REQUIRED_COPIES_TO_COMMIT to 0, run:

sudo pcs resource update <**ag1**> required_copies_to_commit=0

To revert to default computed value, run:

sudo pcs resource update <**ag1**> required_copies_to_commit=
Note

Updating resource properties causes all replicas to stop and restart. This means the primary will temporarily be demoted to secondary, then promoted again, which causes temporary write unavailability. The new value for REQUIRED_COPIES_TO_COMMIT is only set once the replicas are restarted, so it won't take effect instantaneously when you run the pcs command.

Balancing high availability and data protection

The default behavior described above also applies to the case of two synchronous replicas (primary + secondary). Pacemaker defaults REQUIRED_COPIES_TO_COMMIT to 1 to ensure the secondary replica is always up to date, for maximum data protection.

Warning

This comes with a higher risk of unavailability of the primary replica due to planned or unplanned outages on the secondary. The user can choose to change the default behavior of the resource agent and override REQUIRED_COPIES_TO_COMMIT to 0:

sudo pcs resource update <**ag1**> required_copies_to_commit=0

Once overridden, the resource agent will use the new setting for REQUIRED_COPIES_TO_COMMIT and stop computing it. This means that users have to manually update it accordingly (for example, if they increase the number of replicas).

The tables below describe the outcome of an outage of the primary or of a secondary replica in different availability group resource configurations:

Availability group - 2 sync replicas

REQUIRED_COPIES_TO_COMMIT=0
  • Primary outage: User has to issue a manual FAILOVER. Might have data loss. New primary is R/W.
  • One secondary replica outage: Primary is R/W, running exposed to data loss.

REQUIRED_COPIES_TO_COMMIT=1 *
  • Primary outage: Cluster will automatically issue FAILOVER. No data loss. New primary will reject all connections until the former primary recovers and joins the availability group as a secondary.
  • One secondary replica outage: Primary will reject all connections until the secondary recovers.

* SQL Server resource agent for Pacemaker default behavior.

Availability group - 3 sync replicas

REQUIRED_COPIES_TO_COMMIT=0
  • Primary outage: User has to issue a manual FAILOVER. Might have data loss. New primary is R/W.
  • One secondary replica outage: Primary is R/W.

REQUIRED_COPIES_TO_COMMIT=1 *
  • Primary outage: Cluster will automatically issue FAILOVER. No data loss. New primary is R/W.
  • One secondary replica outage: Primary is R/W.

* SQL Server resource agent for Pacemaker default behavior.

Next steps

Configure availability group for SQL Server on Linux

Configure read-scale availability group for SQL Server on Linux