Troubleshooting specific Login Failed error messages

A recent discussion on the #sqlhelp hashtag on Twitter got me researching into this issue. The error message was:

2010-10-19 02:56:59.380 Logon Error: 18456, Severity: 14, State: 11.

2010-10-19 02:56:59.380 Logon Login failed for user <Server name>\User1'. Reason: Token-based server access validation failed with an infrastructure error. Check for previous errors. [CLIENT: ]

Starting from SQL Server 2008, the login failed messages have the reasons for the login failure printed in the SQL Errorlog. This definitely saves us a lot of trouble of jumping through hoops to find out the corresponding meaning of a State number reported.

Additionally, there are Ring Buffers entries associated with Login Failures which can be very useful when troubleshooting such issues. This has been blogged about in an earlier blog post here:

http://blogs.msdn.com/b/psssql/archive/2008/03/24/how-it-works-sql-server-2005-sp2-security-ring-buffer-ring-buffer-security-error.aspx

So, why so much fuss about this error?
The above English text even though it is supposed to be helpful can be more of an evil in this case. To explain this simply, the error message is trying to tell you that the security information contained in the user’s token doesn’t have the necessary privileges to grant access to the SQL instance.

State 11 corresponds to “Valid login but server access failure” which again points to the same fact that the login is valid but is missing certain security privileges which would grant it access to the instance. The first option that half the SQL world would suggest to you would be to use “Run As Administrator” option and try the same operation from SSMS or SQLCMD.

If disabling UAC or if you don’t have UAC enabled, then here is what you should be trying:

1. Look into the SQL Errorlog and verify that that the login failed message for the user has a State 11. You can alternatively verify the state number from the SQL Server default traces as well (available from SQL Server 2005 and above).
2. Next look into the Ring Buffers output and find out what was the API that failed. Use the query at the end of the blog post to retrieve the information from the Ring Buffers.
3. Check if that login is directly mapped to one of the SQL Server logins by looking into the output of sys.server_principals.
4. If the login is directly mapped to the list of available logins in the SQL instance, then check if the SID of the login matches the SID of the Windows Login.

The output that I got from step#2 was:
Calling API Name: NLShimImpersonate
API Name: ImpersonateSecurityContext
Error Code: 0x139F (which translates to The group or resource is not in the correct state to perform the requested operation)

There weren’t any other API failures corresponding to the same time frame in the Ring Buffer output other than the one mentioned above. If this is the only error code that you have associated with a failed login, then it will not help you troubleshoot the login failure.

If you are not sure how the login that is encountering the Login Failed error is acquiring permissions to access the SQL instance, then you can use xp_logininfo to retrieve that information. The output of this Extended Stored Procedure will give you the permission path from where the login is inheriting it’s permissions to access the instance.

Eg: EXEC xp_logininfo '<domain or server name>\User1','all'

One of the common reasons this might happen is when an account SID changes due to some changes made at the Domain/Local Windows server level depending on where the account is created. The second reason that I found based on my research is that the when the account lacks the valid security privileges to the access the instance. The root cause analysis of the second cause is quite involved and outside the scope of this blog post.

I had reproduced the following error by doing the following:
1. Created a local windows group and added a user to it
2. Gave the windows group permission to access the SQL instance
3. Then dropped the windows group and created it again with the same name from Server Manager (on a Windows 2008 R2 box)

The other way that this issue can be reproduced is by dropping the windows user and adding it back again. The reason this issue occurs is because SQL Server maps logins using SIDs. When the SID changes at the Windows/Domain level, the SID stored in the SQL system catalog is not updated. When an attempt is made to use that login to access SQL, a SID mis-match occurs which results in the error.

Query to extract Security Ring Buffer information

-- Extract Ring Buffer Information for SQL Server 2008 instances and above

SELECT CONVERT (varchar(30), GETDATE(), 121) as runtime,

dateadd (ms, (a.[Record Time] - sys.ms_ticks), GETDATE()) as [Notification_Time],

a.* , sys.ms_ticks AS [Current Time]

FROM

(SELECT

x.value('(//Record/Error/ErrorCode)[1]', 'varchar(30)') AS [ErrorCode],

x.value('(//Record/Error/CallingAPIName)[1]', 'varchar(255)') AS [CallingAPIName],

x.value('(//Record/Error/APIName)[1]', 'varchar(255)') AS [APIName],

x.value('(//Record/Error/SPID)[1]', 'int') AS [SPID],

x.value('(//Record/@id)[1]', 'bigint') AS [Record Id],

x.value('(//Record/@type)[1]', 'varchar(30)') AS [Type],

x.value('(//Record/@time)[1]', 'bigint') AS [Record Time]

FROM (SELECT CAST (record as xml) FROM sys.dm_os_ring_buffers

WHERE ring_buffer_type = 'RING_BUFFER_SECURITY_ERROR') AS R(x)) a

CROSS JOIN sys.dm_os_sys_info sys

ORDER BY a.[Record Time] ASC

Regards,
Amit Banerjee
SEE,Microsoft SQL server support