VSS backups might cause SQL to generate Non-Yielding Scheduler dumps if Backup verification is turned on

Found an interesting Non-Yielding scheduler recently. Opened the dump, and found function calls related to backup verification (such as validating the file name, verifying that the drive is part of the cluster group, etc.) at the top of the stack:

Child-SP          RetAddr           Call Site
00000000`29cda478 000007fe`fe21a776 ntdll!ZwAlpcSendWaitReceivePort
00000000`29cda480 000007fe`fe2bcc74 rpcrt4!LRPC_CCALL::SendReceive
00000000`29cda540 000007fe`fe2bcf25 rpcrt4!NdrpClientCall3
00000000`29cda800 000007fe`f8902196 rpcrt4!NdrClientCall3
00000000`29cdab90 000007fe`f89023bb clusapi!ConnectCluster
00000000`29cdac10 00000000`01e83b60 clusapi!OpenClusterImpl
00000000`29cdac80 00000000`01e903f3 sqlservr!FClusMgr::VerifyDriveInClusterGroup
00000000`29cdbd00 00000000`024d258f sqlservr!FileMgr::ValidateFileName
00000000`29cdc5e0 00000000`02dbf13f sqlservr!BackupFileList::GenerateVolumeUsageList
00000000`29cdc610 00000000`02dc03e0 sqlservr!BackupOperation::VerifyBackupSet
00000000`29cddb20 00000000`02dc3d2d sqlservr!BackupEntry::VerifyBackupSet
00000000`29cddcf0 00000000`022224d3 sqlservr!CStmtLoadVol::XretExecute
00000000`29cdde30 00000000`02225f9c sqlservr!CExecStmtLoopVars::ExecuteXStmtAndSetXretReturn
00000000`29cdde60 00000000`015bdad0 sqlservr!CMsqlExecContext::ExecuteStmts<1,0>
00000000`29cdee10 00000000`00724479 sqlservr!CMsqlExecContext::FExecute

I confirmed that SQL was indeed installed on a cluster. I also found these messages in the event logs, at around the same time as the Non-Yielding dump:-

Information 3/15/2012 1:03:59 AM SQLISPackage100 12289 None "Package ""Backup_TLog_EnterpriseSecurity"" finished successfully."
Information 3/15/2012 1:03:58 AM SQLISPackage100 12289 None "Package ""Backup_TLog_Caching"" finished successfully."
Information 3/15/2012 1:03:56 AM MSSQL$REPL 1440 Server Database mirroring is active with database 'AmsAuditing' as the principal copy. This is an informational message only. No user action is required.
Information 3/15/2012 1:03:55 AM MSSQL$REPL 1442 Server Database mirroring is inactive for database 'AmsAuditing'. This is an informational message only. No user action is required.

Warning 3/15/2012 1:03:55 AM VSS 4003 None Volume Shadow Copy Service warning: Writer received a Freeze event more than two minutes ago. The writer is still waiting for either an Abort or a Thaw event.
Operation:
   Gathering Writer Data
Context:
   Writer Class Id: {41e12264-35d8-479b-8e5c-9b23d1dad37e}
   Writer Name: Cluster Database
   Writer Instance ID: {db6da9b0-343e-43e2-ab1d-78501c1c1d32}

Warning 3/15/2012 1:03:55 AM VSS 4003 None Volume Shadow Copy Service warning: Writer received a Freeze event more than two minutes ago. The writer is still waiting for either an Abort or a Thaw event.
Operation:
   Gathering Writer Data
Context:
   Writer Class Id: {4dc3bdd4-ab48-4d07-adb0-3bee2926fd7f}
   Writer Name: Shadow Copy Optimization Writer
   Writer Instance ID: {58c69918-2ebe-4bf2-8035-4cdc660ff0c2}

Information 3/15/2012 1:03:23 AM MSSQL$REPL 17883 Server Process 0:0:0 (0x1754) Worker 0x00000016BF0BA1A0 appears to be non-yielding on Scheduler 29. Thread creation time: 12974041676579. Approx Thread CPU Used: kernel 0 ms, user 0 ms. Process Utilization 8%%. System Idle 91%%. Interval: 70095 ms.

Information 3/15/2012 1:03:01 AM MSSQL$REPL 18265 Backup Log was backed up. Database: EnterpriseSecurity, creation date(time): 2008/03/15(12:30:30), first LSN: 11537:1227:1, last LSN: 11537:1229:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'N:EnterpriseSecurity_backup_2012_03_15_010301_4748794.trn'}). This is an informational message only. No user action is required.
Information 3/15/2012 1:03:01 AM SQLISPackage100 12288 None "Package ""Backup_TLog_EnterpriseSecurity"" started."
Information 3/15/2012 1:02:04 AM MSSQL$REPL 18265 Backup Log was backed up. Database: Caching, creation date(time): 2008/03/15(12:30:10), first LSN: 31246:4891:1, last LSN: 31252:2022:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'N:Caching_backup_2012_03_15_010201_5628888.trn'}). This is an informational message only. No user action is required.
Information 3/15/2012 1:02:00 AM SQLISPackage100 12288 None "Package ""Backup_TLog_Caching"" started."

So if you read the message stack from the bottom up, you see the “Backup_Tlog_Caching” and “Backup_Tlog_EnterpriseSecurity” jobs starting, then the Backup log message, followed by the VSS warning, and finally we see both the “Backup_Tlog..” packages complete successfully.

After discussing with my (did i say genius?) TL, we found that this is what is happening:-

  • The customer is running SQL Backups, with the verify backup integrity option enabled.
  • The customer is also running VSS (volume) backups.

How the Non-Yielding situation arises

  • SQL Server runs the backup job, and the backup is successful
  • After that, it needs to verify the backup, which involves several steps such as checking if the disk is part of the cluster group, verify backupset, validate filename, etc. For these operations, SQL needs access to the disk.
  • However, the VSS backup is running at the same time on the target drive, and as a result, I/O on the drive is frozen. So the SQL request is blocked i.e. made to wait
  • While waiting, the SQL request hits the time threshold for a non-yielding scheduler dump i.e. it’s made to wait for long enough to trigger the non-yielding scheduler condition. This is why we see the non-yielding dump.

So the obvious workaround/solution to this situation would be to either change the schedule of the VSS backups, or (much more simple) remove the verify backup integrity option from the SQL backups (which would eliminate the need to call those cluster API’s, and hence not cause the SQL server thread to be blocked).

An interesting one, what say?