CHECKDB (Part 1): How does CHECKDB get a consistent view of the database?
As you can guess from the title, I'm planning a long series over the summer to go into the guts of how CHECKDB works (both the consistency checks part and the repair part). And as you can guess from 'CHECKDB', I'm already bored with putting DBCC in front of it all the time and changing the font to Courier New to make it stand out. I don't do that in real life so why on the blog?
I have no idea how long it'll be - easily more than 20, probably less than 50, but there's a ton of info in my head that's desperate to get out (ever seen the scene in the movie Scanners where that guy's head explodes? Well, its not quite that bad but remembering that scene was fun). As part of this I'll need to go into some of the low-level structural details of the database, which will hopefully be interesting too. (How do I know all this stuff? Read my bio). If there's something I need to explain, post a comment with a question in and I'll do a post to answer it. I may even do a post on how to use DBCC PAGE...
So here's the problem statement: CHECKDB needs a consistent view of the database.
Why? Well, usually its running on a live database with all kinds of stuff going on. It needs to read and analyze the whole database but it can't do it instantaneously (this isn't Star Trek) so it has to take steps to ensure that what it reads is transactionally consistent.
Here's an example. Consider a transaction to insert a record into a table that is a heap and has a non-clustered index, with a concurrent CHECKDB that doesn't enforce a consistent view.
- The table record is inserted first, and then the non-clustered index record is inserted (that's just the way the operations are split up in the database engine)
- Because this hypothetical CHECKDB doesn't have a consistent view, it could read see the record in the table but not that in the index, conclude that the non-clustered index is out of sync with the table and flag an 8951 error (missing index row).
- How could this happen? Depending on the order in which the pages are read, the page on which the new non-clustered index record should go could be read before the page on which the new heap record should go. (I use record and row somewhat interchangeably to mean the physically-stored contents of a table or index row). If the index page read happens just before the record is inserted into the table page, and then the table page is read, then we see the inconsistent state.
The easy way to get the consistent state is through locking, which is what SQL Server 7.0 did. You can still do that in SQL Server 2000 and 2005 using the TABLOCK option. Another way to do it is to put the database into single-user or read-only mode.
However, excessive locking is a drag and taking the database essentially offline tends to irritate users so with SQL Server 2000 we came up with a neat way to get the consistent view and be able to run CHECKDB online - log analysis. In a nutshell, after we've read through all the database, we read the transaction log to make sure we didn't miss anything. Sounds simple, right? Dream on. Here's how it works:
- The log is read from the LSN of the 'begin tran' log record of the oldest transaction that is active at the time the database scan started, to the LSN at the time the database scan stops.
- Log records from transactions that commit during that time are used to generate REDO facts. (We have a chicken-and-egg situation here - it's difficult to explain how CHECKDB works without referencing some mechanisms that haven't been explained yet - I'll get to what facts are in part <single-digits-I-promise>.) The REDO facts either reinforce something we've already seen (in which case we ignore them) or provide information on something we haven't seen. For example:
- a page allocation log record would produce a REDO fact of 'page X is allocated to IAM chain Y' (yes, I'm throwing around unexplained terms again - unavoidable I'm afraid and I'll explain them later - or read Kalen's books)
- a row insertion record (such as from the index example above) would produce a REDO fact of 'a row with these index keys was inserted into page A of table B, index C at slot position S'
- Log records from transactions that rollback or don't commit during that time are used to generate UNDO facts. The UNDO facts either cancel something that we've already seen (e.g. the first half of the index example above, if it didn't commit while CHECKDB was doing the database scan) or reference something we haven't seen (in which case we ignore them). For example:
- page allocation log record would produce an UNDO fact of 'page X was deallocated from IAM chain Y'
- a row insert record would produce an UNDO fact of 'a row with these index keys was removed from page A of table B, index C at slot position S'
- As you may have realized, what we're essentially doing is our own log recovery, inside CHECKDB, but without actually affecting the database.
- This can get excruciatingly complicated (e.g. having to generate UNDO facts from the compensation log records that wrap sections of a cancelled index rebuild transaction...) I spent too many days of 2000 working out what was going on in the log and making tweaks to this code. However, it worked really well and we had online CHECKDB finally. The kudos for writing most of this stuff goes to Steve Lindell - while he was busy writing the online code I was up to my eyes writing DBCC INDEXDEFRAG (another post).
Back in late 2000, it became apparent that with all the new features we were planning for 'Yukon' (we had no idea it would be called SQL Server 2005 back then), including some changes to the transaction log to allow for fast recovery and deferred transactions and stuff like versioning and online index build, the transaction log analysis was a non-starter. While it had given us the holy-grail of online consistency checks, with all the added complications of Yukon it would become impossible to maintain and get right.
But what to use instead? Who would come to my rescue? Turns out that database snapshots would be my savior. (That is their eventual name. I preferred their first name COW databases - Copy-On-Write databases - and my TechEd slide deck has a nice animated cow in homage). The in-depth details of database snapshots are beyond the scope of this post. To put it simply:
- they use NTFS sparse-file technology
- database recovery is run when the snapshot is created, but the recovered database is stored in the snapshot, not the source database
- they only hold pages from the source database that have been changed since the database snapshot was created (either by the recovery process, or as part of normal operations on the source database)
- Books Online has a bunch more info about their use by DBCC - look in the 'DBCC Statements' section.
By moving to database snapshots we changed to using mainline server code to get our transactionally consistent view. This vastly reduced the complexity of the code and meant that someone else was responsible for finding and fixing its bugs during development :-)
So when CHECKDB starts, the first thing we do is work out whether we'd like to run online - if so we create a hidden database snapshot of the source database (i.e. CHECKDB's target database). That could cause you a problem - depending on your transaction load concurrent with the CHECKDB, the database snapshot can grow in size. As the we create a hidden one, you have no control over where we places the files - we just place them as alternate streams of the files comprising the source database. If you don't have room for this, just create your own database snapshot and check that.
Once the database snapshot is created, we're guaranteed a transactionally consistent view of the database and can merrily run our check algorithms against the database snapshot. Ah, you might say, but that means CHECKDB is checking the database as it was at some point in the past! Yes, I'd say, but that point is the start time of the CHECKDB, just as it was (essentially) with the log analysis mechanism in SQL Server 2000.
There are a few slight gotchas (all documented) with this approach:
- sparse files are only available with NTFS so online checks can't be run on databases stored on FAT or FAT32 volumes
- recovery cannot be run on TEMPDB, so online checks can't be run on TEMPDB (CHECKDB automatically switches to locking in that case)
And that's that. Now its time for breakfast - no oatmeal hopefully...