SQL Q&A: Dynamic Data and Disaster Recovery

This month’s solutions for SQL success run the gamut from an expanding tempdb and a clustering conundrum to half-baked disaster recovery plans.

Paul S. Randal

Filling Up Space

Q. One of the production servers for which I’m responsible has a problem. The tempdb grows really large every few days. This is a relatively new problem. I don’t see any difference in the number of connections to the server or memory usage. How can I monitor the situation to determine what’s using all the tempdb space?

A. There are quite a few reasons tempdb usage could grow:

  • Use of the versioning system (for snapshot isolation or online index operations, for example) could cause the version store in tempdb to grow.
  • A query plan could change because of out-of-date statistics, which in turn could introduce a query plan operator that spills a large amount of memory into tempdb.
  • Someone may have deployed new application code that uses temporary tables to store partially processed data.

Whatever it is, there are some simple ways for you to track down what’s going on. The first thing you should do is examine the overall tempdb space usage with the Dynamic Management View (DMV) sys.dm_db_file_space_usage. If you capture the results of this DMV every 30 seconds, for example, you’ll be able to tell whether the extra space usage is from the version store, user objects or objects created to assist query processing.
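
For example, a query along these lines captures the page counts from the DMV (a sketch only; in practice you’d insert the results into a logging table on a schedule rather than simply selecting them):

```sql
-- A minimal sketch: snapshot tempdb space usage so successive captures
-- can be compared over time. The counts are in 8KB pages.
USE tempdb;
GO
SELECT
    GETDATE()                                 AS [capture_time],
    SUM(user_object_reserved_page_count)      AS [user_object_pages],
    SUM(internal_object_reserved_page_count)  AS [internal_object_pages],
    SUM(version_store_reserved_page_count)    AS [version_store_pages],
    SUM(unallocated_extent_page_count)        AS [free_pages]
FROM sys.dm_db_file_space_usage;
GO
```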

If it’s the version store that’s taking up all the space, you can drill down further using the DMV sys.dm_tran_top_version_generators. You’ll need to join it with sys.partitions and sys.indexes to get truly useful information out of it, but that will let you know which tables have the most versions being generated.
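
One way to sketch that join is shown below; run it in the user database that owns the versioned tables, because sys.partitions and OBJECT_NAME are database-scoped:

```sql
-- A sketch of the version-store drill-down. rowset_id in the DMV maps
-- to sys.partitions.hobt_id in the database generating the versions.
SELECT
    OBJECT_NAME(p.object_id)              AS [table_name],
    i.name                                AS [index_name],
    vg.aggregated_record_length_in_bytes  AS [version_bytes]
FROM sys.dm_tran_top_version_generators AS vg
JOIN sys.partitions AS p
    ON p.hobt_id = vg.rowset_id
JOIN sys.indexes AS i
    ON i.object_id = p.object_id
    AND i.index_id = p.index_id
WHERE vg.database_id = DB_ID()
ORDER BY vg.aggregated_record_length_in_bytes DESC;
```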

If it’s anything else taking up space, you can drill down by capturing the results from sys.dm_db_task_space_usage at a similar frequency. Then join the DMV with sys.dm_exec_requests to find out which connections and queries are taking up space.
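
Here’s one way that drill-down might look; pulling the query text through sys.dm_exec_sql_text is an extra step beyond the basic join, included just to make the output easier to act on:

```sql
-- A rough sketch: which active requests are allocating tempdb pages,
-- and what they're running.
SELECT
    ts.session_id,
    ts.request_id,
    ts.user_objects_alloc_page_count,
    ts.internal_objects_alloc_page_count,
    t.[text] AS [query_text]
FROM sys.dm_db_task_space_usage AS ts
JOIN sys.dm_exec_requests AS r
    ON r.session_id = ts.session_id
    AND r.request_id = ts.request_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE ts.user_objects_alloc_page_count > 0
    OR ts.internal_objects_alloc_page_count > 0
ORDER BY ts.internal_objects_alloc_page_count DESC;
```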

If it turns out to be a long stored procedure, you may need to instrument the procedure to periodically output the amount of tempdb space it’s using so you can work out which statements within the procedure are the culprits. I’ve had to do this several times on client systems.
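
As a rough illustration of that instrumentation (dbo.DebugTempdbUsage and the step name are invented for this sketch), you could log the session’s allocation counters at key points inside the procedure:

```sql
-- Hypothetical instrumentation sketch; dbo.DebugTempdbUsage is an
-- invented logging table, not part of the procedure in the question.
CREATE TABLE dbo.DebugTempdbUsage (
    capture_time          DATETIME,
    step_name             VARCHAR (100),
    user_object_pages     BIGINT,
    internal_object_pages BIGINT);
GO
-- Inside the procedure, after each significant step:
INSERT INTO dbo.DebugTempdbUsage
SELECT
    GETDATE(),
    'After staging table load',
    su.user_objects_alloc_page_count,
    su.internal_objects_alloc_page_count
FROM sys.dm_db_session_space_usage AS su
WHERE su.session_id = @@SPID;
```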

You can find a lot more information on using these DMVs in the white paper, “Working with tempdb in SQL Server 2005.” (This paper also applies to later versions of SQL Server.)

Good Clusters

Q. I’ve been asked to design the schema for a database that will store data for a new application. I’ve read all kinds of advice about choosing “good” clustered index keys for my tables. Can you explain what makes a “good” clustered index key and why it matters so much?

A. This is a complex question, and nearly impossible to comprehensively answer here. In a nutshell, a “good” clustered index key is one that’s been carefully chosen to minimize poor performance and wasted space. The four qualities of a good clustered index key are: narrow, static, unique and ever-increasing:

  • Narrow (taking up as few bytes as possible): Every non-clustered index record includes the clustered index key, so the wider the key, the more space that duplicated information takes up in the non-clustered indexes.
  • Static (unchanging): Changes to key values are expensive. SQL Server does a key update as a delete+insert operation (see my blog post here), and any time a clustered index key is updated, all matching rows in non-clustered indexes also need to be updated. Key changes can also lead to empty space on data file pages if that key position in the index is not used again.
  • Unique: This avoids having SQL Server add a hidden four-byte column to “uniquify” duplicate key values—thus making the key wider.
  • Ever-increasing: New records are always inserted at the logical end of the index, avoiding the random inserts into the clustered index that cause expensive page-split operations. Page splits lead to logical fragmentation and wasted space on data file pages.

Given these qualities for a good clustered index key, there often isn’t a natural key that fits (for instance, one derived from the table data), so you have to use a surrogate key (for instance, an artificial table column). A BIGINT IDENTITY column is one example of a good surrogate key. Read more in-depth explanations and justifications in Kimberly Tripp’s blog category Clustering Key.
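
As a simple illustration (the table and column names are hypothetical), a surrogate key that satisfies all four qualities might look like this:

```sql
-- Hypothetical example of a narrow, static, unique, ever-increasing
-- clustered index key: an 8-byte BIGINT IDENTITY surrogate key.
CREATE TABLE dbo.Orders (
    OrderID      BIGINT IDENTITY (1, 1) NOT NULL,
    CustomerName NVARCHAR (100)         NOT NULL,
    OrderDate    DATETIME               NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID));
```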

Prepare for the Worst

Q. In the wake of the recent earthquakes in New Zealand and Japan, I examined our disaster recovery plan and found it’s really outdated. I’ve been unsuccessfully trying to get our company to revamp and test the plan. They simply don’t think we’ll ever have a disaster. Can you give me some hints on how to approach this with management?

A. I’m glad to hear you’re proactively analyzing your disaster recovery (DR) strategy in the wake of these recent disasters. Many companies are complacent and have the attitude you describe in your question. Although large-scale natural disasters are relatively rare, more localized issues like building fires or power outages are relatively common and a company shouldn’t assume it’s immune to random failures.

Even if you can’t get management on your side, there’s a lot of testing you can do yourself, like restoring copies of databases from backups. This tests your backup integrity and backup strategy, and you can ensure that the restore time meets the maximum allowable downtime requirements for any particular database. Very often, this is the first problem found during DR strategy testing. Data volumes grow over time and restore time increases commensurately.
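
A restore test can be as simple as the following sketch; the database name, backup path and logical file names are placeholders for your own environment, and timing the restore gives you the number to compare against your allowable-downtime target:

```sql
-- A sketch of a restore test; all names and paths are placeholders.
RESTORE DATABASE [SalesCopy]
FROM DISK = N'D:\Backups\Sales_Full.bak'
WITH MOVE N'Sales_Data' TO N'E:\TestRestores\SalesCopy.mdf',
     MOVE N'Sales_Log'  TO N'E:\TestRestores\SalesCopy.ldf',
     STATS = 10;  -- progress messages every 10 percent
GO
-- Verify the restored copy is structurally sound.
DBCC CHECKDB (N'SalesCopy') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```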

Other portions of the DR strategy are much harder to test yourself, such as failing over a database mirroring partnership or a failover cluster. Both of these require some application downtime (both to fail over and to fail back).

As far as convincing management is concerned, ask them whether they’d rather find out the DR strategy doesn’t work during a planned test with all staff on-hand to assist with the recovery, or when disaster strikes for real at 2 a.m. on a public holiday when only a bare-minimum staff is on-hand.

There are plenty of highly publicized incidents of companies suffering outages because a DR strategy was insufficient. Does management want their company to be the next one in the news? That may sound melodramatic, but it’s a fair point.

Disaster recovery is all about minimizing the cost to the company and its clients. If clients suffer because of an outage or lose faith in the ability of the company to recover quickly, they may take their business elsewhere. This obviously hurts the company’s bottom line.

As technologists, we need to ask management to think of IT disasters in terms of the financial impact on the company. I’ve found this to be an effective tactic in persuading management to invest time and money in revamping and testing the DR strategy. Read more about this in my recent blog post here.

Compressing Costs

Q. I’d really like to use the data-compression feature in SQL Server 2008 to reduce our storage costs, but I’ve read it’s only for data warehouses and that I’ll incur huge performance problems if I try to use it in an online transaction processing (OLTP) system. Is this true?

A. You’re correct that the data-compression feature was originally intended for data warehouse use. Data compression reduces the size of table and index records. This means more records fit on an 8KB data file page, and so fewer data file pages are required to store the data on disk. This translates into smaller disk space requirements for the database holding the compressed data, which in turn can lead to significant cost savings as less enterprise-class storage is required.

The trade-off, of course, is that the data needs to be decompressed before use. The data is not decompressed when it’s read into the SQL Server buffer pool (in-memory cache of data file pages). It’s only decompressed when it’s actually needed to satisfy a query. Decompression uses CPU resources, so one trade-off is space usage against CPU resources.

A typical data warehouse has a large amount of data (think hundreds of gigabytes to multiple terabytes). The access pattern for that data is typically a large amount of data read into the buffer pool, processed once and then not used again for long enough that it’s aged out of memory.

With this access pattern, it makes sense to minimize the number of read I/O operations by compressing the data to a much smaller size. This would require fewer SQL Server data file pages to store it and fewer I/O operations to read those pages. This usually leads to more rapid completion of these types of queries. So another trade-off is query speed against CPU resources (for decompressing data).

If you consider an OLTP workload, there’s usually much higher data volatility than in a data warehouse. This means if you use data compression, you’ll incur a high CPU cost because of the constant decompression of data being read and compression of data being inserted or updated. You’ll have to look more carefully at the trade-offs when considering data compression for an OLTP database.

Getting back to the question, although data compression was originally aimed at data warehouses, many SQL Server customers have found they have a large amount of CPU “head room” on their servers. They can afford the extra CPU usage and potentially longer query execution times to get the large space savings and storage cost savings associated with using data compression. Data compression can be useful for OLTP environments. Just ensure you evaluate the space savings and performance costs for your workload before going into production.

For space savings, you can use the sp_estimate_data_compression_savings procedure to give you an idea of the percentage savings you can expect. It’s important to do this because enabling (or disabling) data compression is done using a rebuild operation. This can be expensive in itself. For more information, see the white paper, “Data Compression: Strategy, Capacity Planning and Best Practices.”
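
As a sketch (dbo.FactSales is a hypothetical table), estimating the savings and then enabling PAGE compression might look like this:

```sql
-- Estimate the space savings from PAGE compression for a table.
EXEC sp_estimate_data_compression_savings
    @schema_name      = N'dbo',
    @object_name      = N'FactSales',
    @index_id         = NULL,   -- all indexes
    @partition_number = NULL,   -- all partitions
    @data_compression = N'PAGE';
GO
-- If the estimate looks worthwhile, enable compression. As noted
-- above, this rebuild can itself be an expensive operation.
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);
```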


**Paul S. Randal** is the managing director of SQLskills.com, a Microsoft regional director and a SQL Server MVP. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. He wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. Randal is an expert on disaster recovery, high availability and database maintenance, and is a regular presenter at conferences worldwide. He blogs at SQLskills.com/blogs/paul, and you can find him on Twitter at twitter.com/PaulRandal.