You may see an instant timeout when connecting to a SQL Server Mirrored database from a .Net application

I can't take the credit for all this content as many of my colleagues were part of the investigation on this issue. But I feel its worthwhile to share the details as many customers are facing this problem.

Consider you have a .Net application which is connecting to a SQL Server 2005 or 2008 database which is mirrored to another instance of SQL Server using the .NET SQLClient Provider. Intermittently you may see connectivity errors from your application with the following stack trace:

Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning()
at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
at System.Data.SqlClient.TdsParserStateObject.ReadSni(DbAsyncResult asyncResult, TdsParserStateObject stateObj)
at System.Data.SqlClient.TdsParserStateObject.ReadNetworkPacket()
at System.Data.SqlClient.TdsParser.ConsumePreLoginHandshake(Boolean encrypt, Boolean trustServerCert, Boolean& marsCapable)
at System.Data.SqlClient.TdsParser.Connect(ServerInfo serverInfo, SqlInternalConnectionTds connHandler, Boolean ignoreSniOpenTimeout, Int64 timerExpire, Boolean encrypt, Boolean trustServerCert, Boolean integratedSecurity)
at System.Data.SqlClient.SqlInternalConnectionTds.AttemptOneLogin(ServerInfo serverInfo, String newPassword, Boolean ignoreSniOpenTimeout, TimeoutTimer timeout, SqlConnection owningObject)
at System.Data.SqlClient.SqlInternalConnectionTds.LoginWithFailover(Boolean useFailoverHost, ServerInfo primaryServerInfo, String failoverHost, String newPassword, Boolean redirectedUserInstance, SqlConnection owningObject, SqlConnectionString connectionOptions, TimeoutTimer timeout)

What is more surprising is that the timeout happens instantaneously, like within a second. Ideally you would expect to get a timeout after 15 seconds (the default connection timeout), buts that’s not the case.

If we take a Netmon trace, here is what we see:


In this example we can see the app server closes the connection close to 1 sec during the TDS: Prelogin. When the SQL Server responds back in 5 sec its too late and client will reset the connection.

As per the Connection Retry Algorithm for Mirroring, SqlClient code takes the connection timeout and multiplies it by 0.08 to determine the milliseconds to wait for first read (SniReadSync call). Now if this first SniReadSync fails due to a timeout (due to a slow response from the server), the connection is incorrectly set to a doomed state by the sqlClient .NET Provider and this causes the connection attempt to prematurely fail before the entire connection timeout expires.

Similarly if you look into the Connectivity Ring Buffer in SQL Server, you will notice that the entire login process got timed out in a short timeframe of milliseconds.

You can run the following query to get the ring buffer output from Management Studio:

 SELECT CAST(record AS XML) FROM sys.dm_os_ring_buffers 

You can find more about Connectivity Ring Buffer from here.

But all of the above were symptoms of this issue. So how can you be sure if you are actually hitting this problem? The answer is BID tracing. I would not go into details of what is BID tracing and how to collect it. You can read about it here. All you need to trace is System.Data.SNI.1 and System.Data.1 from the application server. Here is a sample of what you will see in the trace:

 <prov.DbConnectionHelper.ConnectionString_Set|API> 1#, 'Data 
Source=MySQLServer;Failover Partner=MySQLServer2; Initial Catalog= 
TestMirror;Integrated Security=SSPI;Connection Timeout=15'
 enter_02 <SNIReadSync|API|SNI> 2#{SNI_Conn}, pConn: 
007D9EA8{SNI_Conn*}, ppNewPacket: 0671E268{SNI_Packet**}, timeout: 1134 

enter_03 <Np::ReadSync|API|SNI> 3#, ppNewPacket: 
0671E268{SNI_Packet**}, iTimeOut: 1134 

<SNI_Packet::SNIPacketAllocateEx2|API|SNI> pConn: 007D9EA8{SNI_Conn*}, 
IOType: 0, consumer: 0 
<SNI_Packet::SNIPacketNew|API|SNI> pConn: 007D9EA8{SNI_Conn*}, IOType: 0 
ObtainIDa 5# <SNI_Packet::SNI_Packet|ID|SNI> 0624EEF0{.} 

<SNI_Packet::SNI_Packet|SNI> 5#{SNI_Packet} created by 2#{SNI_Conn} 

<SNI_Packet::SNIPacketNew|RET|SNI> 0624EEF0{SNI_Packet*} 

<SNI_Packet::SNIPacketAllocateEx2|RET|SNI> 0624EEF0{SNI_Packet*} 

<Np::ReadSync|ERR|SNI> ProviderNum: 4{ProviderNum}, SNIError: 
11{SNIError}, NativeError: 258{WINERR} 
<Np::ReadSync|RET|SNI> 258{WINERR} 
<SNIReadSync|RET|SNI> 258{WINERR}, Packet: 00000000 
<sc.SqlError.SqlError|ERR> infoNumber=-2, 
errorState=0, errorClass=11, errorMessage='Timeout expired. The timeout period 
elapsed prior to completion of the operation or the server is not responding.', 
procedure='', lineNumber=0 

<sc.SqlInternalConnectionTds.BreakConnection|RES|CPOOL> 4#, Breaking connection. 

<prov.DbConnectionInternal.DoomThisConnection|RES|INFO|CPOOL> 4#, Dooming 
<sc.SqlInternalConnectionTds.LoginFailure|RES|CPOOL> 4#

When Connection Timeout=15 sec (default value), actual Connection Timeout calculated= .08*15 ~ 1.2 sec

From the trace you can see the SNIReadSync is setting the timeout to 1134 milliseconds. This is close to 1.2 seconds and when the server does not respond back within that time, it incorrectly dooms the connection. Truly speaking you do not actually need a mirrored database for this. All you need to mention is "Failover Partner" in your connection string. That is enough to take this connection retry algorithm code path.

While this problem should be fixed in Visual Studio 2011, there a few workarounds that can easily avoid this.

1. You can increase the connection string timeout to 150 sec. This will give the first attempt of SniReadSync enough time to connect (150* .08=12 sec)

2. Add "Min Pool Size=20" in the connection string. This will always maintain a minimum of 20 connections in the pool and there will be less need of creating new connection, thus reducing the chance of this error.

3. Improve the network performance. Most of these issues can be a result of a slow network. Update your NIC drivers to the latest firmware version. We have seen network latency when your NIC card is not compatible with certain Scalable Networking Pack settings, most commonly the TCP Chimney offload and the Receive Side Scaling(RSS) . If you are on Windows Vista SP1 or above you may also consider disabling Receive Window Auto-Tuning. If you have NIC teaming enabled, disabling it would be a good option.