SQL 2012 Availability Group does not come up on one instance

I recently came across this interesting issue with SQL 2012 Always on Availability Groups, wherein after the network and IP were changed, the AG would not come up on one of the instances.

 

We checked the errorlogs on the server, found the following successful stacks for the failovers that had been attempted for the successful instance:

<

2013-01-30 13:03:21.09 spid1347 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-01-30 13:03:21.66 spid135 AlwaysOn: The local replica of availability group 'SQL2012CLUS02' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

2013-01-30 13:03:21.69 spid135 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-01-30 13:03:21.70 Server The Service Broker endpoint is in disabled or stopped state.

2013-01-30 13:03:21.72 Server The Service Broker endpoint is in disabled or stopped state.

2013-01-30 13:03:21.82 Server The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-01-30 13:03:21.82 Server The Service Broker endpoint is in disabled or stopped state.

2013-01-30 13:04:15.72 spid50s A connection for availability group 'SQL2012CLUS02' from availability replica 'CTGDNAV' with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] to 'CDGI-SQLPROD-02' with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] has been successfully established. This is an informational message only. No user action is required.

>

And

<

2013-01-30 13:04:36.51 spid754 AlwaysOn: The local replica of availability group 'SQL2012CLUS02' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

2013-01-30 13:04:36.51 spid754 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-01-30 13:04:36.52 Server The Service Broker endpoint is in disabled or stopped state.

2013-01-30 13:04:38.62 spid305 AlwaysOn: The local replica of availability group 'SQL2012CLUS02' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

2013-01-30 13:04:38.97 spid305 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-01-30 13:04:38.97 Server The Service Broker endpoint is in disabled or stopped state.

2013-01-30 13:04:39.13 Server The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availability group dashboard, SQL Server error log, Windows Server Failover Cluster management console or Windows Server Failover Cluster log.

2013-01-30 13:04:39.14 Server The Service Broker endpoint is in disabled or stopped state.

2013-01-30 13:04:46.01 spid30s A connection for availability group 'SQL2012CLUS02' from availability replica 'CTGDNAV' with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] to 'CDGI-SQLPROD-02' with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] has been successfully established. This is an informational message only. No user action is required.

>

We then checked the critical events for the AG in failover cluster manager, this was all i could find:

Cluster resource 'SQL2012CLUS02' in clustered service or application 'SQL2012CLUS02' failed.

 

I then collected the errorlog for the CDGI-SQLPROD-02 instance, and found this:

2013-01-30 13:04:14.890 spid1535 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'NOT_AVAILABLE' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more

2013-01-30 13:04:15.580 spid1536s The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'RESOLVING_NORMAL' to 'SECONDARY_NORMAL'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For m

2013-01-30 13:04:15.890 spid43s A connection for availability group 'SQL2012CLUS02' from availability replica 'CDGI-SQLPROD-02' with id [58F02E10-68CB-4EB2-B517-60306BCC0E72] to 'CTGDNAV' with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1] has been successfully established. This is an infor

2013-01-30 13:04:36.370 spid1540 The state of the local availability replica in availability group 'SQL2012CLUS02' has changed from 'SECONDARY_NORMAL' to 'RESOLVING_PENDING_FAILOVER'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster er

2013-01-30 13:04:38.080 Logon Error: 18456, Severity: 14, State: 5.

2013-01-30 13:04:38.080 Logon Login failed for user 'CLT-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

 

We can clearly see that the Login failure seems to be responsible for the failed failover of the AG. I tried adding the login manually, and restarted the instance, but the failover still failed. I then checked the event logs on CDGI-SQLPROD-02, found just this:

Log Name: Application

Source: MSSQLSERVER

Date: 1/30/2013 4:54:43 PM

Event ID: 35206

Task Category: Server

Level: Information

Keywords: Classic

User: N/A

Computer: CDGI-SQLPROD-02.CLT.com

Description:

A connection timeout has occurred on a previously established connection to availability replica 'CTGDNAV' with id [E4A205DD-0098-481F-94F4-B15ABDC3BAD1]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.

And then this:

Log Name: Application

Source: MSSQLSERVER

Date: 1/30/2013 4:54:33 PM

Event ID: 18456

Task Category: Logon

Level: Information

Keywords: Classic,Audit Failure

User: SYSTEM

Computer: CDGI-SQLPROD-02.CLT.com

Description:

Login failed for user 'CLT-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

The interesting thing here is that the connection attempt seems to be coming from the OS on the same box. I then captured a Profiler trace to confirm, and saw this:

 

Looked at the profiler trace, found the login error event:

ErrorLog 2013-01-31 11:21:30.55 Logon Error: 18456, Severity: 14, State: 5.

2013-01-31 11:21:30.55 Logon Login failed for user 'CLT-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>]

Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26102 CDGI-SQLPROD-02 CLT0 CDGI-SQLPROD-02 CLTCDGI-SQLPROD-02$ 5 14

EventLog Login failed for user 'CLTCDGI-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>] Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 0X184800000E0000001000000043004400470049002D00530051004C00500052004F0044002D00300032000000070000006D00610073007400650072000000 1 master 18456 26103 CDGI-SQLPROD-02 CLT0 CDGI-SQLPROD-02 CLTCDGI-SQLPROD-02$ 5 14

Audit Login Failed Login failed for user 'CLTCDGI-SQLPROD-02$'. Reason: Could not find a login matching the name provided. [CLIENT: <local machine>] Microsoft® Windows® Operating System CDGI-SQLPROD-02$ CLTCDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26104 1 - Nonpooled CDGI-SQLPROD-02 CLT 0 CDGI-SQLPROD-02 CLTCDGI-SQLPROD-02$ 5 0 1 - Non-DAC

User Error Message Login failed for user 'CLTCDGI-SQLPROD-02$'. Microsoft® Windows® Operating System CDGI-SQLPROD-02$ 3828 86 2013-01-31 11:21:30.550 1 master 18456 26105 CDGI-SQLPROD-02 CLT 0 CDGI-SQLPROD-02 CLTCDGI-SQLPROD-02$ 1 1 0 14

The profiler trace confirms our hunch. We then proceeded to run the following commands to add local system as a sysadmin on the problem instance:

 

USE [master]

GO

 

/****** Object:  Login [NT AUTHORITYSYSTEM]    Script Date: 01-02-2013 03:31:56 ******/

CREATE LOGIN [NT AUTHORITYSYSTEM] FROM WINDOWS WITH DEFAULT_DATABASE=[master],

DEFAULT_LANGUAGE=[us_english]

GO

 

GO

ALTER SERVER ROLE [sysadmin] ADD MEMBER [NT AUTHORITYsystem]

GO

 

After this, the failover worked perfectly fine.

 

Hope this helps.