lSQL Q & AUnexpected Consistency Checks, Troubleshooting Memory Usage, More
Paul S. Randal
QI've noticed something unusual happening since I've moved some of our databases to SQL Server 2005. Whenever the server starts up, I see messages in the error log indicating that SQL Server is running consistency checks on the database as part of the startup process. It does this for some databases, but not for the system databases. The checks seem to run very quickly no matter how large the database is, and happen every time SQL Server starts. Can you explain what's going on?
AQuite a few people have been asking about this in various online forums. Here's an example of one of the error log messages in question:
2009-06-16 18:12:16.85 spid5s CHECKDB for database 'master' finished without errors on 2009-05-12 16:07:15.647 (local time). This is an informational message only; no user action is required.
It certainly does look like DBCC CHECKDB (which performs database-wide consistency checks) was run on the "master" database, but in reality it was not. This message is simply reporting a statistic about the database called the "last known good" time.
From SQL Server 2005 onward, whenever DBCC CHECKDB finishes consistency-checking a database without finding any consistency errors (that is, the database is corruption-free), the time that DBCC CHECKDB completed is noted in the boot page of the database (a single page in the database that stores critical metadata about it).
Every time a database is started up (on instance startup or database attach), the boot page is checked to see if there is a stored "last known good" time, and if so, it is reported in the error log. Unfortunately, there is no documented way to query this value, but you can use a number of undocumented commands to find it. Sankar Reddy, a fellow MVP, recently blogged a script that will report it for you.
The rationale behind storing the "last known good" time is that it can be very useful in a disaster recovery situation to know how long (potentially) a database might have been corrupt.
QI'd like to be able to determine how much of SQL Server's memory is being used by each database. I noticed a SQL Server 2005 instance suddenly using nearly all the available memory on the server and I'm concerned that there's a problem somewhere. Is it possible to figure out where this memory is being used from within SQL Server?
AThe good news is that this is not likely to be a problem. SQL Server 2005 will use as much memory as it's able to when necessary, but will respond to memory pressure requests from the OS to free up memory. The sudden memory usage you're seeing is probably the buffer pool expanding to allow more of the database to be held in memory.
The buffer pool (sometimes called the buffer cache) is part of the Storage Engine layer in SQL Server and is responsible for managing in-memory copies of portions of the data files in the various databases in the SQL Server instance. If a query starts that requires reading a lot of database pages into memory (for instance for a large table scan or a join), then the buffer pool may grab more server memory from the OS so it can expand. This allows it to accommodate the extra page images in memory without necessarily having to discard in-memory copies of other database pages that are being used for other queries.
As an aside, one of the benefits of adding more memory for SQL Server to use is that the buffer pool can be larger. This means that more data can be in memory at any particular time, potentially leading to reduced I/Os and better workload throughput.
There are other possible causes of SQL Server using a lot of memory, for instance if a very large number of different query plans are being cached in another area of memory called the plan cache, but my guess is that it's most likely to be the buffer pool, as I've described earlier.
In SQL Server 2005, you can determine what proportion of the buffer pool is being used for each database using the dynamic-management view sys.dm_os_buffer_descriptors. This simple query will tell you how many 8KB pages are in the buffer pool for each database:
SELECT (CASE WHEN ([is_modified] = 1) THEN 'Dirty' ELSE 'Clean' END) AS 'Page State', (CASE WHEN ([database_id] = 32767) THEN 'Resource Database' ELSE DB_NAME (database_id) END) AS 'Database Name', COUNT (*) AS 'Page Count'FROM sys.dm_os_buffer_descriptors GROUP BY [database_id], [is_modified] ORDER BY [database_id], [is_modified];GO
I explain this a little more in the blog post "Inside the Storage Engine: What's in the Buffer Pool?."
For the other sections of memory that SQL Server uses, you can use the DBCC MEMORYSTATUS command to monitor how much memory the SQL Server instance as a whole is using, but it does not allow the memory usage to be broken down per database. Take a look at Knowledge Base article 907877, which describes "How to use the DBCC MEMORYSTATUS command to monitor memory usage on SQL Server 2005."
QEvery so often, one of the databases on our SQL Server 2005 instance will become "suspect." We can't access the database, and its status is SUSPECT. Occasionally the status shows RECOVERY_PENDING. I know this is caused by corruption of some sort, but can you explain what it really means and how to recover? We usually end up having to restore from old backups and lose data, which is not ideal.
AThere's a lot of confusion about what these two database states mean, but you're correct that they're caused by some form of corruption. They both indicate that something has gone wrong with crash recovery.
If a database is not cleanly shut down (in other words, if there are uncommitted transactions when the database shuts down) then, when the database is started up again, it must go through crash recovery. Crash recovery is the process of ensuring that all transactions that were committed at the time of the database shut-down are properly reflected in the database, and all transactions that were uncommitted at the time of the shut-down are not reflected in any way in the database.
For a much deeper explanation of how recovery works, see my article on "Understanding Logging and Recovery in SQL Server" from the February 2009 issue.
A database knows whether it was cleanly shut down or not—this information is stored in the database boot page, which I described in the answer to the first question. If crash recovery is required, then the transaction log must be accessible, as it stores all the details of which (committed) transactions need to be replayed, and which (uncommitted) transactions must be rolled back. If the transaction log is not available (because it has been deleted, for instance), crash recovery cannot start and the database state becomes RECOVERY_PENDING. The RECOVERY_PENDING status means recovery could not start.
If the transaction log is available, crash recovery starts running. If it can't complete for any reason, the database is transactionally inconsistent and the status becomes SUSPECT. The SUSPECT status means recovery started but could not complete.
There are two reasons recovery can't complete. The first is corruption in the transaction log itself, resulting in a transaction log record that cannot be processed by SQL Server. The second is corruption in the data files, encountered while the recovery system is trying to apply a transaction log record to a database page, or to reverse the effects of a transaction log record on a database page.
One other issue can push a database into the SUSPECT state. If a transaction is cancelled by a user or application and the database encounters corruption while rolling back the effects of the transaction, then the roll-back cannot complete and the database is transactionally inconsistent. In this case, the database is automatically taken offline and the state set to SUSPECT.
There are two common ways to recover in this situation. The first is to restore from up-to-date backups. If your backups are old, you're likely to lose work and data and should re-evaluate your backup strategy with the aim of backing up more frequently to allow recovery without losing a lot of data. See my article "Understanding SQL Server Backups" in the July 2009 issue of TechNet Magazine for some tips on planning a backup strategy. If you're going to take the restore route, you should always attempt a tail-of-the-log backup, as explained in the article, as this will allow you to recover right up to the point of the problem that made the database SUSPECT.
If no backups are available, you can use a mechanism called EMERGENCY-mode repair. Take a look at my comprehensive blog post that describes this feature, explains its usage, and shows some examples.
QWe have synchronous database mirroring installed with SQL Server 2005 and we've noticed that sometimes it can take quite a few seconds for the database mirror to fail over when something goes wrong with the main server. I thought that synchronous database mirroring using a witness was supposed to provide instant failure detection. What's going on?
AThe use of a witness server with database mirroring simply allows the mirror server to automatically initiate failover. The witness agrees (or not) with the mirror as to whether it can "see" the principal server. If both the witness and the mirror cannot see the principal, the mirror initiates the failover and becomes the new principal. The presence of a witness server in a database mirroring configuration has no bearing on how fast a failure is detected or how fast a failover occurs.
Instant failure detection is a misconception—how quickly a failure is detected depends on the type of failure. Here are some examples:
- The SQL Server instance (hosting the principal database) crashes. As long as Windows is still running and responsive, the failure should be detected in at most one second. Every one second, the witness and mirror ping the principal. If the SQL Server instance is not listening on the TCP port that was configured, Windows knows this and can immediately respond that SQL Server is not there.
- The whole principal server crashes. In this case, Windows is not there to say that SQL Server is not listening on the defined TCP port—so there's nothing there to say there's nothing there. In that case, the failure will not be detected until the mirroring partner timeout elapses. This is the number of once-per-second pings that must be not responded to until the mirror declares a failure on the part of the principal. By default, this number is set to 10 pings (and so 10 seconds), but if it has been increased for any reason, the failure detection will take longer.
- A transaction log drive fails on the principal. Initially nothing will happen except that I/Os will start to queue up to the log drive. After 20 seconds, SQL Server will print a warning in the error log. It's not until 40 seconds have elapsed that SQL Server will declare the log drive offline and take the database offline as well, triggering a mirroring failure.
- A database page becomes corrupt. In this case, if a regular query hits the corruption, nothing at all happens to mirroring. However, if a transaction is rolling back and encounters the page corruption, the database will become SUSPECT, as I described in the previous answer, and that will trigger a mirroring failure immediately.
- If a file or filegroup goes offline in the principal database and the primary filegroup is not affected, then in Enterprise Edition, partial database availability will kick in and a failure will not occur. On Standard Edition, however, a failure will be triggered.
As you can see, the speed with which a mirroring failure is detected really depends on what kind of failure occurs and whether the mirroring partner timeout has occurred.
Many thanks to Kimberly L. Tripp of SQLskills.com for technically reviewing this month's column.
Paul S. Randal is the Managing Director of SQLskills.com, a Microsoft regional director and a SQL Server MVP. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. Paul wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. Randal is an expert on disaster recovery, high availability and database maintenance and is a regular presenter at conferences worldwide. He blogs at SQLskills.com/blogs/paul, and you can find him on Twitter at Twitter.com/PaulRandal.