Business continuity and database recovery - SQL Server

Article
12/05/2022

Applies to: SQL Server 2016 (13.x) and later versions

This article provides an overview of business continuity solutions for high availability and disaster recovery in SQL Server, on Windows and Linux.

One common task everyone deploying SQL Server has to account for is making sure that all mission critical SQL Server instances and the databases within them are available when the business and end users need them, whether that is 9 to 5 or around the clock. The goal is to keep the business up and running with minimal or no interruption. This concept is also known as business continuity.

SQL Server 2017 (14.x) introduced many new features or enhancements to existing ones, some of which are for availability. The biggest addition to SQL Server 2017 (14.x) was support for SQL Server on Linux distributions. For a full list of the new features in SQL Server, see the following articles:

This article is focused on covering the availability scenarios in SQL Server 2017 (14.x) and later versions, as well as the new and enhanced availability features. The scenarios include hybrid ones that will be able to span SQL Server deployments on both Windows Server and Linux, and ones that can increase the number of readable copies of a database.

While this article doesn't cover availability options external to SQL Server, such as those provided by virtualization, everything discussed here applies to SQL Server installations inside a guest virtual machine whether in the public cloud or hosted by an on-premises hypervisor server.

SQL Server scenarios using the availability features

Always On availability groups, Always On failover cluster instances, and log shipping, can be used in various ways, and not necessarily just for availability purposes. There are four main ways the availability features can be used:

High availability
Disaster recovery
Migrations and upgrades
Scaling out readable copies of one or more databases

The following sections discuss the relevant features that can be used for that particular scenario. The one feature not covered is SQL Server replication. While not officially designated as an availability feature under the Always On umbrella, SQL Server replication is often used for making data redundant in certain scenarios. Merge replication isn't supported for SQL Server on Linux. For more information, see SQL Server Replication on Linux.

Important

The SQL Server availability features do not replace the requirement to have a robust, well tested backup and restore strategy, the most fundamental building block of any availability solution.

High availability

It's important to ensure that SQL Server instances or database are available in the case of a problem that is local to a data center or single region in the cloud. This section will cover how the SQL Server availability features can help with that task. All of the features described are available both on Windows Server and on Linux.

Availability groups

Introduced in SQL Server 2012 (11.x), availability groups (AGs) provide database-level protection by sending each transaction of a database to another instance, or replica, which contains a copy of that database in a special state. An AG can be deployed on Standard or Enterprise editions. The instances participating in an AG can be either standalone or failover cluster instances (FCIs, described in the next section). Since the transactions are sent to a replica as they happen, AGs are recommended where there are requirements for lower recovery point and recovery time objectives. Data movement between replicas can be synchronous or asynchronous, with Enterprise edition allowing up to three replicas (including the primary) as synchronous. An AG has one fully read/write copy of the database that is on the primary replica, while all secondary replicas can't receive transactions directly from end users or applications.

Note

Always On is an umbrella term for the availability features in SQL Server and covers both AGs and FCIs. Always On is not the name of the AG feature.

Before SQL Server 2022 (16.x), AGs only provide database-level, and not instance-level protection. Anything not captured in the transaction log or configured in the database will need to be manually synchronized for each secondary replica. Some examples of objects that must be synchronized manually are logins at the instance level, linked servers, and SQL Server Agent jobs.

Starting with SQL Server 2022 (16.x), you can manage metadata objects including users, logins, permissions and SQL Server Agent jobs at the AG level in addition to the instance level. For more information, see Contained availability groups.

An AG also has another component called the listener, which allows applications and end users to connect without needing to know which SQL Server instance is hosting the primary replica. Each AG would have its own listener. While the implementations of the listener are slightly different on Windows Server versus Linux, the functionality it provides and how it's used is the same. The diagram below shows a Windows Server-based AG that is using a Windows Server Failover Cluster (WSFC). An underlying cluster at the OS layer is required for availability whether it is on Linux or Windows Server. The example shows a simple configuration with two servers, or nodes, with a WSFC as the underlying cluster.

Diagram of a simple availability group.

Standard and Enterprise edition have different maximums when it comes to replicas. An AG in Standard edition, known as a basic availability group, supports two replicas (one primary and one secondary) with only a single database in the AG. Enterprise edition not only allows multiple databases to be configured in a single AG, but also can have up to nine total replicas (one primary, eight secondary). Enterprise edition also provides other optional benefits such as readable secondary replicas, the ability to make backups off of a secondary replica, and more.

Note

Database mirroring, which was deprecated in SQL Server 2012 (11.x), is not available on the Linux version of SQL Server, nor will it be added. Customers still using database mirroring should plan to migrate to AGs, which is the replacement for database mirroring.

When it comes to availability, AGs can provide either automatic or manual failover. Automatic failover can occur if synchronous data movement is configured and the database on the primary and secondary replica are in a synchronized state. As long as the listener is used and the application uses a later version of .NET Framework (3.5 with an update, or 4.0 and above), the failover should be handled with minimal to no effect on end users if a listener is utilized. Failing over to a secondary replica to make it the new primary replica can be configured to be automatic or manual, and is generally measured in seconds.

The list below highlights some differences with AGs on Windows Server versus Linux:

Owing to differences in the way the underlying cluster works on Linux and Windows Server, all failovers (manual or automatic) of AGs are done via the cluster on Linux. On Windows Server-based AG deployments, manual failovers must be done via SQL Server. Automatic failovers are handled by the underlying cluster on both Windows Server and Linux.
For SQL Server on Linux, the recommended configuration for AGs is a minimum of three replicas. This is due to the way that the underlying clustering works.
On Linux, the common name used by each listener is defined in DNS and not in the cluster like it is on Windows Server.

Starting with SQL Server 2017 (14.x), there are some new features and enhancements to AGs:

Cluster types
REQUIRED_SECONDARIES_TO_COMMIT
Enhanced Microsoft Distributor Transaction Coordinator (DTC) support for Windows Server-based configurations
Additional scale out scenarios for read-only databases (described later in this article)

Availability group cluster types

The built-in availability form of clustering in Windows Server is enabled via a feature named Failover Clustering. It allows you to build a WSFC to be used with an AG or FCI. Integration for AGs and FCIs is provided by cluster-aware resource DLLs shipped by SQL Server.

SQL Server on Linux supports multiple clustering technologies. Microsoft supports the SQL Server components, while our partners support the relevant clustering technology. For example, along with Pacemaker, SQL Server on Linux supports HPE Serviceguard and DH2i DxEnterprise as a cluster solution.

A Windows-based failover cluster and Linux cluster solution are more similar than different. Both provide a way to take individual servers and combine them in a configuration to provide availability, and have concepts of things like resources, constraints (even if implemented differently), failover, and so on.

For example, to support Pacemaker for both AG and FCI configurations including things like automatic failover, Microsoft provides the mssql-server-ha package, which is similar to, but not exactly the same as the resource DLLs in a WSFC, for Pacemaker. One of the differences between a WSFC and Pacemaker is that there's no network name resource in Pacemaker, which is the component that helps to abstract the name of the listener (or the name of the FCI) on a WSFC. DNS provides that name resolution on Linux.

Because of the difference in the cluster stack, some changes needed to be made for AGs because SQL Server has to handle some of the metadata that is natively handled by a WSFC. One such significant change is the introduction of a cluster type for an availability group. This is stored in sys.availability_groups in the cluster_type and cluster_type_desc columns. There are three cluster types:

WSFC
External
None

All AGs that require high availability must use an underlying cluster, which in the case of SQL Server 2017 (14.x) and later versions means WSFC or a Linux clustering agent. For Windows Server-based AGs that use an underlying WSFC, the default cluster type is WSFC and doesn't need to be set. For Linux-based AGs, when creating the AG, the cluster type must be set to External. The integration with an external cluster solution in Linux is configured after the AG is created, whereas on a WSFC, it's done at creation time.

A cluster type of None can be used with both Windows Server and Linux AGs. Setting the cluster type to None means that the AG doesn't require an underlying cluster. This means SQL Server 2017 (14.x) is the first version of SQL Server to support AGs without a cluster, but the tradeoff is that this configuration isn't supported as a high availability solution.

Important

Starting with SQL Server 2017 (14.x), you can't change a cluster type for an AG after it is created. This means that an AG cannot be switched from None to External or WSFC, or vice versa.

For those who are only looking to just add additional read-only copies of a database, or like what an AG provides for migration/upgrades but don't want to be tied to the additional complexity of an underlying cluster or even the replication, an AG with a cluster type of None is a perfect solution. For more information, see the sections Migrations and Upgrades and read-scale.

The screenshot below shows the support for the different kinds of cluster types in SQL Server Management Studio (SSMS). You must be running version 17.1 or later. The screenshot below is from version 17.2.

REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT

SQL Server 2016 (13.x) increased support for the number of synchronous replicas from two to three in Enterprise edition. However, if one secondary replica was synchronized but the other was having a problem, there was no way to control the behavior to tell the primary to either wait for the misbehaving replica or to allow it to move on. This means that the primary replica at some point would continue to receive write traffic even though the secondary replica wouldn't be in a synchronized state, which means that there's data loss on the secondary replica. Starting with SQL Server 2017 (14.x), you can control the behavior of what happens when there are synchronous replicas with REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT. This option works as follows:

There are three possible values: 0, 1, and 2
The value is the number of secondary replicas that must be synchronized, which has implications for data loss, AG availability, and failover
For WSFCs and a cluster type of None, the default value is 0, and can be manually set to 1 or 2
For a cluster type of External, by default, the cluster mechanism will set this value, and it can be overridden manually. For three synchronous replicas, the default value will be 1.

On Linux, the value for REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT is configured on the AG resource in the cluster. On Windows, it's set via Transact-SQL.

A value that is higher than 0 ensures higher data protection, because if the required number of secondary replicas isn't available, the primary won't be available until that is resolved. REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT also affects failover behavior since automatic failover couldn't occur if the right number of secondary replicas weren't in the proper state. On Linux, a value of 0 won't allow automatic failover, so on Linux, when using synchronous with automatic failover, the value must be set higher than 0 to achieve automatic failover. 0 on Windows Server is the behavior in SQL Server 2016 (13.x) and earlier versions.

Enhanced Microsoft Distributed Transaction Coordinator support

Before SQL Server 2016 (13.x), the only way to get availability in SQL Server for applications that require distributed transactions, which use DTC underneath the covers was to deploy FCIs. A distributed transaction can be done in one of two ways:

A transaction that spans more than one database in the same SQL Server instance
A transaction that spans more than one SQL Server instance or possibly involves a non-SQL Server data source

SQL Server 2016 (13.x) introduced partial support for DTC with AGs that covered the latter scenario. SQL Server 2017 (14.x) completes the story by supporting both scenarios with DTC.

In SQL Server 2017 (14.x) and later versions, DTC support can also be added to an AG after it's created. In SQL Server 2016 (13.x), enabling support for DTC to an AG could only be done when the AG was created.

Failover cluster instances

Clustered installations have been a feature of SQL Server since version 6.5. FCIs are a proven method of providing availability for the entire installation of SQL Server, known as an instance. This means that everything inside the instance, including databases, SQL Server Agent jobs, linked servers, and so on, will move to another server should the underlying server encounter a problem. All FCIs require some sort of shared storage, even if it's provided via networking. The FCI's resources can only be running and owned by one node at any given time. In the diagram below, the first node of the cluster owns the FCI, which also means it owns the shared storage resources associated with it denoted by the solid line to the storage.

Diagram of a Failover Cluster Instance.

After a failover, ownership changes as is seen in the diagram below.

Diagram of a Failover Cluster Instance, post failover.

There's zero data loss with an FCI, but the underlying shared storage is a single point of failure since there's one copy of the data. FCIs are often combined with another availability method, such as an AG or log shipping, to have redundant copies of databases. The additional method deployed should use physically separate storage from the FCI. When the FCI fails over to another node, it stops on one node and starts on another, not unlike powering off a server and turning it on. An FCI goes through the normal recovery process, meaning any transactions that need to be rolled forward will be, and any transactions that are incomplete will be rolled back. Therefore, the database is consistent from a data point to the time of the failure or manual failover, hence no data loss. Databases are only available after recovery is complete, so recovery time will depend on many factors, and will generally be longer than failing over an AG. The tradeoff is that when you fail over an AG, there may be extra tasks required to make a database usable, such as enabling a SQL Server Agent job.

Like an AG, FCIs abstract which node of the underlying cluster is hosting it. An FCI always retains the same name. Applications and end users never connect to the nodes; the unique name assigned to the FCI is used. An FCI can participate in an AG as one of the instances hosting either a primary or secondary replica.

The list below highlights some differences with FCIs on Windows Server versus Linux:

On Windows Server, an FCI is part of the installation process. An FCI on Linux is configured after installing SQL Server.
Linux only supports a single installation of SQL Server per host, so all FCIs will be a default instance. Windows Server supports up to 25 FCIs per WSFC.
The common name used by FCIs in Linux is defined in DNS, and should be the same as the resource created for the FCI.

Log shipping

If recovery point and recovery time objectives are more flexible, or databases aren't considered to be highly mission critical, log shipping is another proven availability feature in SQL Server. Based on SQL Server's native backups, the process for log shipping automatically generates transaction log backups, copies them to one or more instances known as a warm standby, and automatically applies the transaction log backups to that standby. Log shipping uses SQL Server Agent jobs to automate the process of backing up, copying, and applying the transaction log backups.

Diagram of Log Shipping.

Arguably the biggest advantage of using log shipping in some capacity is that it accounts for human error. The application of transaction logs can be delayed. Therefore, if someone issues something like an UPDATE without a WHERE clause, the standby may not have the change so you could switch to that while you repair the primary system. While log shipping is easy to configure, switching from the primary to a warm standby, known as a role change, is always manual. A role change is initiated via Transact-SQL, and like an AG, all objects not captured in the transaction log must be manually synchronized. Log shipping also needs to be configured per database, whereas a single AG can contain multiple databases.

Unlike an AG or FCI, log shipping has no abstraction for a role change, which applications must be able to handle. Techniques such as a DNS alias (CNAME) could be employed, but there are pros and cons, such as the time it takes for DNS to refresh after the switch.

Disaster recovery

When your primary availability location experiences a catastrophic event like an earthquake or flood, the business must be prepared to have its systems come online elsewhere. This section will cover how the SQL Server availability features can assist with business continuity.