AlwaysOn Availability Groups Troubleshooting and Monitoring Guide

04/27/2017

THIS TOPIC APPLIES TO: SQL Server (starting with 2012) Azure SQL Database Azure SQL Data Warehouse Parallel Data Warehouse

This guide helps you get started with troubleshooting common issues in AlwaysOn Availability Groups and with monitoring your availability groups. It provides original content and also serves as a landing page for useful information that is already published elsewhere.

While this guide cannot fully discuss all the issues that can occur on the large surface area covered by AlwaysOn Availability Groups, it can point you in the right direction in your root-cause analysis and resolution of the issues. As AlwaysOn Availability Groups is an integrated technology, many of the problems you encounter are only symptoms of other issues in your database system. Some issues are caused by settings within an availability group, such as an availability database being suspended. Other issues can be isolated to other aspects of SQL Server, such as SQL Server settings, database file deployments, and systemic performance issues unrelated to the availability group, replica, or database. Still other problems exist outside of SQL Server, such as network I/O, TCP/IP, Active Directory, and Windows Server Failover Clustering (WSFC). Often, problems that surface in an availability group, replica, or database require you to troubleshoot multiple technologies before you can identify the root cause.

Troubleshooting Scenarios

The table below contains links to the common troubleshooting scenarios for AlwaysOn Availability Groups. They are categorized by their scenario types, such as configuration, client connectivity, failover, and performance.

When you configure or run AlwaysOn Availability Groups, different tools can help you diagnose different types of issues. The table below provides links to useful information on the tools.

Monitoring AlwaysOn Availability Groups

The ideal time to troubleshoot an availability group is before a problem necessitates a failover, whether automatic or manual. This can be done by monitoring the availability group’s performance metrics and sending alerts when the availability replicas are performing outside the bounds of your service-level agreement (SLA). For example, if a synchronous secondary replica has performance issues that cause the estimated failover time to increase, you do not want to wait until an automatic failover occurs and you find out that the failover time exceeds your recovery time objective.

As AlwaysOn Availability Groups is a high availability and disaster recovery solution, the most important performance metrics to monitor are the estimated failover time, which affects your recovery time objective (RTO), and the potential data loss in a disaster, which affects your recovery point objective (RPO). You can gather these metrics from the data that SQL Server exposes at any given time, so you can be alerted to a problem in the HADR capabilities of your system before an actual failure event occurs. Therefore, it is important to familiarize yourself with the data synchronization process of AlwaysOn Availability Groups and gather the metrics accordingly.
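
The following is a minimal sketch, not a definitive implementation, of how the two metrics described above might be estimated and compared against an SLA. It assumes you have already collected, for one database on a secondary replica, the redo queue size, redo rate, and log send queue size (values of the kind exposed by the sys.dm_hadr_database_replica_states dynamic management view, in KB and KB/sec) together with the log generation rate on the primary. The class name, function names, and thresholds are hypothetical and chosen only for illustration. The redo estimate covers only the time needed to catch up on redo; failure detection time and failover overhead are not included.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReplicaDbMetrics:
    """Hypothetical snapshot of one database on a secondary replica."""
    redo_queue_kb: float      # log hardened on the secondary but not yet redone
    redo_rate_kb_s: float     # rate at which the secondary is redoing log
    log_send_queue_kb: float  # log on the primary not yet sent to the secondary
    log_gen_rate_kb_s: float  # rate at which the primary generates log


def _queue_drain_seconds(queue_kb: float, rate_kb_s: float) -> Optional[float]:
    """Seconds represented by a queue at a given rate, or None if unknown."""
    if queue_kb <= 0:
        return 0.0
    if rate_kb_s <= 0:
        return None  # queue exists but rate is zero: cannot estimate
    return queue_kb / rate_kb_s


def estimate_redo_seconds(m: ReplicaDbMetrics) -> Optional[float]:
    """Redo portion of the estimated failover time (relates to RTO)."""
    return _queue_drain_seconds(m.redo_queue_kb, m.redo_rate_kb_s)


def estimate_data_loss_seconds(m: ReplicaDbMetrics) -> Optional[float]:
    """Potential data loss expressed as seconds of log (relates to RPO)."""
    return _queue_drain_seconds(m.log_send_queue_kb, m.log_gen_rate_kb_s)


def check_sla(m: ReplicaDbMetrics, rto_s: float, rpo_s: float) -> List[str]:
    """Return alert messages for every estimate outside the SLA bounds."""
    alerts = []
    for label, value, limit in (
        ("RTO", estimate_redo_seconds(m), rto_s),
        ("RPO", estimate_data_loss_seconds(m), rpo_s),
    ):
        if value is None:
            alerts.append(f"{label}: cannot estimate (rate is zero)")
        elif value > limit:
            alerts.append(f"{label}: estimate {value:.1f}s exceeds {limit:.1f}s")
    return alerts
```

In practice a check like this would run on a schedule against freshly sampled values, so that an alert is raised while the replica is still falling behind rather than after a failover has already exceeded the objective.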

The table below points you to topics that can help you monitor the health of your AlwaysOn Availability Groups solution.

