How can you prove that hardware is the problem? Use stress to reduce stress...

Article
06/21/2006

I'm sure you've been there - something causes corruption in the database. You blame the hardware, the hardware guys blame the software. There's no smoking gun and the hardware diagnostics come back clean. What can you do?

SQLIOStress

This is the hardware diagnostic you really want to run. It simulates a very heavy SQL Server workload and should do a far better job of discovering flaws in your hardware setup than individual hardware vendors' diagnostics will. We recommend that you run it before installing a system as well as using it to expose hardware as the problem in difficult-to-diagnose corruption problems.

Youc an find info on it at https://support.microsoft.com/default.aspx?scid=kb;en-us;231619 and there will be an updated version released in the next month or so (I'll blog when it gets out).

Page checksums

This is a cool new feature of 2005. It's a per-database option that will write an error-detection checksum on each page when it is flushed from the buffer pool. When a page is subsequently read again, the checksum is recalculated and checked against that stored on the page..

Here are some points to note about page checksums (they debunk a bunch of common misconceptions):

Upgrading a database to SQL Server 2005 and turning on the option does not automatically protect all the pages, as a page has to be changed and written to disk after the database option is enabled to have a checksum written on it. Only when a page has been through this process is it protected. There is no tool or automatic way to force all pages to go through this process (we may provide one in future).
In benchmarking tests with a TPCH workload, we measured approx 2% performance degradation as a result of having checksums enabled.
The checksum cannot be used for error correction. Generating an error-correcting checksum would be a more complicated algorithm and so would be slower to compute.
The checksum is validated when a page is read for checking by any of the DBCC CHECK* commands (regardless of whether the PHYSICAL_ONLY option was used) so all existing page checksums can be checked by issuing a DBCC CHECKDB command.
Any existing checksums are checked when pages are read as part of taking a backup. In addition, the restore logic will also verify the page checksums as the pages are restored from the backup media, so there is solid assurance that the data from the backup is consistent.
Page checksums do not prevent in-memory corruptions from memory scribblers (where the page is read in, corrupted in memory from some rogue process, and then written out with a new checksum). In Enterprise Edition, there is a ‘checksum sniffer’ that runs constantly, randomly picking unchanged pages in the buffer pool and validating their checksums to see if the page has been scribbled on.

Bad page checksums will result in IO errors being reported:

IO failures will trigger read-retry logic, which will re-read the page several times to see if the error clears itself (if it does, a message is written to the SQL error log)
IO failures that persist through read-retry are logged in the error log and Windows event log, so monitoring these will allow you to be alerted to hardware problems quickly.
If an IO error is reported, the page is not taken offline in any way. Anything that subsequently touches the page will get another IO error. The exception to this is if the IO error is encountered during transaction rollback. In this case, the database is taken offline and must be brought back online manually.

Page checksums should vastly reduce the number of undiagnosed corruption problems (saving time and hassle for you and our support team). Now all you need to do is make sure you have that excellent backup strategy to go with it...

How can you prove that hardware is the problem? Use stress to reduce stress...

Additional resources