DTS/SSIS and debugging mysterious relational engine deadlocks

Over the last few days I've learnt a lot in how to debug SQL hangs thanks to Sunil from the SQL Server engine team.

For one of the customers last week, the problem seemed to be that their DTS2000 package worked great with SQL2000. However, upgrading the relational engine to SQL2005 and making the DTS2000 packages go against that version caused hangs during some of the dimension loads. The dimension load, to be clear, was doing an outer join between the source and destination table, and pushing to the destination table.

Turns out the problem was the table lock setting that's turned on by default in DTS2000. In SQL2000, if you were doing a bulk insert to a table with a clustered index and had asked for a table lock, the engine would silently ignore. Now (in SQL2005), we don't refuse, so the SELECT and X update caused the package to hang. The fix was to remove table lock.

Earlier today we got to the bottom of another hang with SSIS going against a SQL2005 database. This time, it turns out, one of the destinations that was getting tablock on a table with a clustered index and inserting about 10K rows. The locksĀ escalated to a table lock which prevented other writers. Not having used SQL Server trace flags myself as yet, it was great to see the problem go away by turning on trace flags 1211 and 1224, narrowing the problem down to lock escalation. There were several fixes that Sunil mentioned, one was to remove the clustered index, another was to limit the batch size to be around 4500, under the ~5000 that triggers the lock escalation.

The DMVs in SQL 2005 were invaluable in this investigation, specifically sys.dm_os_waiting_tasks.

While all of this is in BOL, and probably trivial to most folks familiar with SQL, the logic and debugging abilities of locking in SQL2005 are truly impressive to me. Of course, it doesn't hurt at all if someone knowledgeable is making sense of all this!

If you're running into such unexplicable hangs, undestand the output of sys.dm_os_waiting_tasks (and sp_who2, of course); check if you have table locks and clustered indices (remember that turning on the primary key in the table designer might create a clustered index on that column for you); and if you have readers or multiple writers against the same table. If you run into an interesting case, drop me a line!