Failover Detection Utility - Availability Group Failover Analysis Made Easy
To analyze the root cause for the failover of an Availability group, users are required to perform coordinated analysis of various logs including the Cluster Logs, SQL Error Logs, and the Availability groups extended events logs. This coordinated analysis of the logs can be difficult and requires extensive knowledge of the internals and the various logs associated with Availability Groups. The failover detection utility is meant to make this analysis easier and provide a quick root cause analysis for the unexpected failovers and/or failed failovers. The failover detection utility currently only supports analysis of Availability groups on Windows.
Failover Detection Utility
The failover detection utility can be downloaded here. The download consists of the following files.
- FailoverDetector.exe – The Failover analysis utility.
- Configuration.json – The configuration file for the failover analysis utility. The file consists of the following fields.
"Data Source Path": "F:\\DataLocation\\Data",
"Health Level": 3,
Data Source Path – Location for the Logs from the different replicas in the Availability Group.
Health Level – The failover policy level configured for the availability group. More information on the different failover conditions can be found here.
Instance – The list of replicas in the availability groups.
- Supporting DLL’s – Supporting DLLs for the Failover analysis utility.
- Data and Reports Folders – The Data and Analysis reports folders.
Preparing for Failover Analysis with the Failover Detection Utility
The first step in using the utility is to configure the configuration.json file to include the location of the data files and the details of the availability group for which analysis is being done. For a correct analysis, all replicas in the availability groups needs to be listed in the configuration file.
The next step is to capture the various logs from each of the replicas and add those under the data folder. The following files are required for the analysis.
- SQL error logs
- Always On Availability Groups Extended Event Logs
- System Health Extended Event Logs
- System log
- Windows cluster log
The first three categories of logs can be found in the LOG folder of the SQL instance installation path. For example, if you have a SQL Server 2017 instance named "SQL17RTM01" and If the installation path is on C drive, you will find the above three categories of logs under "C:\Program Files\Microsoft SQL Server\MSSQL14.SQL17RTM01\MSSQL\Log", and copy all AlwaysOn XEvent, System Health XEvent and SQL ERRORLOG files.
The last two categories of logs can be found. You can use commands mentioned below to collect these logs.
Get-ClusterLog -Node $server.ComputerNamePhysicalNetBIOS -Destination $server.ErrorLogPath
Get-Eventlog -ComputerName $server.ComputerNamePhysicalNetBIOS -LogName System | Export-CSV -Path $SystemLogExportPath
Organizing the files
The log files collected in the previous step should be placed in a folder with correct subfolder structure. This folder should be the folder path specified in the "Data Source Path" section of the configuration file "Configuration.json". The following is an example of the subfolder structure, assuming there are two replicas (instances) in the AG. One is "Replica1" and another one is "Replica2".
¦ ¦ AlwaysOn_health_0_131532583880140000.xel
¦ ¦ AlwaysOn_health_0_131532634780070000.xel
¦ ¦ AlwaysOn_health_0_131532680183890000.xel
¦ ¦ AlwaysOn_health_0_131532701164380000.xel
¦ ¦ ERRORLOG
¦ ¦ ERRORLOG.1
¦ ¦ ERRORLOG.2
¦ ¦ ERRORLOG.3
¦ ¦ SQLDUMPER_ERRORLOG.log
¦ ¦ system_health_0_131532583879830000.xel
¦ ¦ system_health_0_131532634779760000.xel
¦ ¦ system_health_0_131532680183430000.xel
¦ ¦ system_health_0_131532701164070000.xel
¦ ¦ Replica1_system.log
¦ ¦ Replica1_cluster.log
Executing the Utility
The FailoverDetection.exe can be executed with the following flags (or default mode without any flags)
Default Mode (No execution flags defined) - In the default mode, where no parameters are passed to the utility, the analysis tool will load the configuration file, copy log data from data source path to local workspace data folder, examine the files with the configuration settings, perform the analysis, and write the analysis result to a JSON report file. User should make sure the execution credential running tool has access to share folder defined in Configuration.json. When tool runs into access denial issues or the directory does not exist, the tool will exit with proper error message. After the tool copies the log files to the local tool directory, it scans the local data folder and compare with the AG configuration in Configuration.json. If it finds log data is not complete, the tool will show which files are missing. The tool will also alert the user that the failover root cause might not be properly identified due to missing files.
If analysis is successful, the utility persists the analysis result report as JSON file in Result directory. Analysis result will NOT be presented on the console, unless the "--Show" parameter is specified.
--Analyze - When "--Analyze" is specified as a parameter, the utility will load configuration file without copying log data. It assumes the log files have already been copied over. It does everything as default mode except copying log data. This option is useful if you already have the data in the local tool execution subdirectories and want to rerun the analysis.
- -Show -The utility after analyzing log data will display the results in the command console. Additionally, the results will be persisted to a JSON file in the results folder.
Currently supported failover root causes
The tool in current form can analyze and report failover reason if a failover is due to the following causes.
- Stopping SQL service: Admin or other process gracefully shuts down SQL Server hosting the AG primary. In this case, if automatic failover is enabled, Windows notifies WSFC to initiate a failover and picks one secondary as new primary.
- Shutting down Windows Server: The Windows Server which hosts the SQL Server that has the primary replica is shut down or restarted. In this case, if automatic failover is enabled, Windows notifies WSFC to initiate a failover and picks one secondary as new primary.
- Manual failover: An administrator performs a manual failover through Management Studio or through a direct Transact-SQL statement "ALTER AVAILABILITY GROUP <AG name> FAILOVER" or "ALTER AVAILABILITY GROUP <AG name> FAILOVER WITH DATA_LOSS" on the secondary replica that will become the primary replica.
SQL Service Internal Health Issue
- Too many memory dumps generated on the SQL Server hosting the primary replica: There are more than 100 SQL Server dumps on the primary replica since the last SQL Server restart, and there is at least one dump in the last 10 seconds. In this case, sp_server_diagnostics running in SQL Server determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
- Memory scribbler: SQL Server has a write access violation and the write address is more than 64KB. If there are more than three such memory corruptions since SQL Server was started, sp_server_diagnostics running in SQL Server determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
- Sick Spinlock: After an access violation, a spinlock is marked as sick if it backs off more than three times, which is the threshold. In this case, sp_server_diagnostics running in SQL Server determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
- Out of Memory: If no memory has been freed in 2 minutes, sp_server_diagnostics running in SQL Server determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
- Unresolved deadlock: If sp_server_diagnostics detects unresolved deadlock from query processing component, it determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
- Deadlocked scheduler: If sp_server_diagnostics detects deadlocked schedulers from query processing component, it determines that SQL Server system component is in error state and notifies failover WSFC to initiate a failover.
System Wide Issues
- Unexpected crash: SQL Server service was shut down unexpectedly. Resources host service (rhs.exe) does not detect lease check from SQL Server about availability group lease. This results an AG lease timeout signal to WSFC and WSFC will initiate a failover.
- Long dump: SQL Server is creating a dump file. During the process, threads handling the AG lease are frozen and which may result in a lease timeout. This results an AG lease timeout signal to WSFC and WSFC will initiate a failover if automatic failover is enabled.
- Quorum loss: AG resource is brought offline because quorum is lost. This could be because of a plethora of issues, which would need to be determined by a closure examination of the cluster logs.
- Network interface failure: Network interface used to communicate between cluster nodes fails. Primary and secondary replicas cannot communicate. WSFC will initiate quorum vote and determine a new primary to fail over to.
- Cluster disk failure: Network interface used to communicate between cluster nodes has failed, resulting in a communication loss between the cluster nodes. Cluster will initiate quorum vote and determine a new primary failover to.
- Cluster Service paused: Cluster service on primary was paused, resulting in primary unable to communicate with other nodes in the cluster. Cluster will initiate quorum vote and determine a new primary to fail over to.
- Node offline: When primary node is frozen or loses power, WSFC loses connection from and to the primary. Failover cluster decide to initiate failover and pick a primary from other possible nodes.
- High CPU utilization: System wide performance issue causes SQL Server service to be unable to respond to AG lease handler. For example, it is possible that some process on the node is consuming 100% CPU and hence the lease renewal threads cannot execute. This is a best effort estimation based on the inputs from the Cluster logs.
- High I/O: System wide performance issue causes SQL Server service to be unable to respond to AG lease handler. For example, it is possible that some process on the node is throttling the disks on the system and as such the lease renewal cannot be completed. This is again a best effort estimate based on the inputs from the cluster logs.
The Failover Detection utility can be used to analyze SQL Server 2012, 2014, 2016 and 2017 Availability groups, provided the servers are on the latest service packs and cumulative updates.
A few users have reported the following error while using the utility. We are looking into this and will post an update shortly.
“An unhandled exception of type ‘System.BadImageFormatException’ occurred in FailoverDetector.exe
Could not load file or assembly ‘Microsoft.SqlServer.XEvent.Linq, Version=126.96.36.199, Culture=neutral, PublicKeyToken=89845dcd8080cc91’ or one of its dependencies. An attempt was made to load a program with an incorrect format.”
New updated binaries for the Failover Detector Utility have been uploaded to GitHub. Before running the utility, please install the "Microsoft Visual C++ Redistributable for Visual Studio 2017" from here. On the download page, scroll down to the "Other Tools and Frameworks" section to download the redistributable (x64 version).
You may encounter the "Strong name validation failed" error while using the utility. To resolve the error use sn.exe (available in the zip file) to disable strong name validation. To disable strong name use the following command
sn.exe -Vr <<path to the DLLs>>