SQL Q & A: Bottlenecks and Transaction Logs
Sometimes it’s easy to determine the nature of a performance bottleneck—other times, not so much. The same goes with configuring transaction logs.
Paul S. Randal
Q: I’m trying to determine the nature of my system’s performance bottleneck. I’m convinced it’s the storage layer as I’m seeing the disk queue lengths go higher than two. I’ve read this is a good way to prove that SQL Server is overloading the storage. Is this true? If so, what can I do about it?
A: Unfortunately, what you’ve read is an urban legend. SQL Server is designed to use asynchronous IO and will happily push the disk queue length above two. Each thread that issues an IO then continues doing something else (potentially) until the IO is complete. SQL Server tries to maximize throughput from the IO subsystem by issuing concurrent asynchronous IOs. It will also perform operations like read-ahead when scanning large volumes of data.
In fact, some operations like DBCC CHECKDB will saturate the IO subsystem. It’s not uncommon to see disk queue lengths of several hundred. You can read more about this urban legend of disk queue lengths in this blog post.
The question then becomes, “What do you look at to determine if there’s an IO subsystem bottleneck?” There are two performance counters in the Physical Disk performance object. You should pay attention to these:
- Avg. Disk sec/Read
- Avg. Disk sec/Write
These give the time in milliseconds it takes to complete an IO. If these numbers are consistently higher (or have regular higher spikes) than the norm (which should be between 5ms and 12ms), then the physical disk is the IO bottleneck. Of course, that physical disk may be a SAN LUN, but you can’t dig deeper from Windows.
If you have multiple SQL Server data and log files on that physical disk, you may need to determine which files are causing the IO load. Use the dynamic management view (DMV) sys.dm_io_virtual_file_stats and perform some simple time-series analyses on the results.
If the DMV results don’t indicate a heavy load on that physical disk, a storage administrator may have placed files from other applications on that part of the IO subsystem. That workload could be what’s monopolizing the IO bandwidth. In that case, you’ll need to ask that particular administrator to move the SQL Server files to a dedicated portion of the IO subsystem.
If it’s purely SQL Server files on the subsystem and you can identify which ones are causing the excessive IO, consider these strategies:
- Look at the database query workload and determine if it’s performing excessive table scans because of an incorrect indexing strategy, or bad query plans caused by out-of-date statistics.
- Move some of the files to a different portion of the IO subsystem.
- Add more memory to the server to allow for a larger SQL Server buffer pool (in-memory cache of data file pages) and avoid so much read IO.
If none of these work, and it’s really just a case where the workload has outstripped the IO subsystem, move to a more capable IO subsystem. You could also consider enterprise-class flash-memory storage like Fusion-io.
Q: I’m laying out the storage requirements for some new servers and I’m having trouble determining how large to make the transaction logs. In the past, I’ve tried to base it on transaction size. Sometimes, though, it seems to double. Can you explain how I can arrive at a decent estimate?
A: There’s no easy formula for calculating the optimal transaction log size. Unfortunately, it’s also difficult to prevent transaction log growth unless you’ve turned off auto-grow for the log file. However, turning off auto-grow is never advisable.
The transaction log should always be able to auto-grow. This is particularly true for emergencies when your log file size monitoring fails. For example, say you have a SQL Server Agent alert on the Percent Log Used performance counter going higher than 90 percent, but the emergency contact notified by e-mail/pager is out sick. If the log can’t grow, all currently running transactions modifying the database will stop and roll back. This translates into downtime for your workload.
However, saying there’s no easy formula is a bit misleading. There are numerous operations that take up transaction log space. You can use the size of these operations to estimate your transaction log requirements. These may be operations occurring as part of your day-to-day workload, or more infrequent actions like database maintenance. You have to consider them all. The operation list includes:
- The largest single insert/update/delete transaction your workload performs (whether a single-statement implicit transaction affecting millions of table rows, or an explicit transaction that performs many operations).
- The largest bulk-operation your workload performs, such as a BULK INSERT. If you’re using the FULL recovery mode, you may be able to reduce the amount of transaction log generated by using the BULK_LOGGED recovery model. If you use the BULK_LOGGED recovery model for minimal-logging for some operations, it can impact your ability to recover after a disaster. See my blog post, “A SQL Server DBA myth a day: BULK_LOGGED recovery model.”
- An index rebuild of your largest clustered index. You may be able to use BULK_LOGGED here as well.
With all these operations, it’s not just the amount of transaction logs you need to consider, you also need to take into account the space the transaction log management system will “reserve” to allow for proper transaction rollback. If a transaction generates 100MB of transaction log records, the system will reserve approximately 100MB of empty space in the transaction log to guarantee it can abort the transaction and correctly roll back. It’s a safety mechanism to prevent a database becoming inconsistent. This is why you may have seen the transaction log grow, even though you think you’ve given it enough space for the largest transaction.
Another thing to consider is whether there are any reasons why transaction log records must remain in the log. This could lead to a growing transaction log having to grow still more. Some of the potential reasons include:
- The database is using the FULL or BULK_LOGGED recovery models and not performing transaction log backups (or performing them infrequently). You must backup log records before discarding them.
- There’s an unusually long-running transaction. This will prevent discarding all transaction log records generated since the long-running transaction started.
- Database mirroring is in use and some transaction log records haven’t been sent from the principal server to the mirror server. You can’t discard these until they’ve been sent.
- Transactional replication (or peer-to-peer replication) is in use and there are some transaction log records the Log Reader Agent job hasn’t processed.
If you’re seeing the transaction log growing and you’re not sure why it’s happening, ask SQL Server. Issue the query:
SELECT [log_reuse_wait_desc] FROM sys.databases WHERE [name] = 'dbmaint2008'; GO
The result will be the reason you can’t discard some log records and reuse the log space (called either “clearing” or “truncating” the log).
As you can see, there are quite a few things that can affect the size of the transaction log, even more if you’re considering the tempdb database as well. You can read a bit more on this topic in my blog post, “Importance of proper transaction log size management,” and in the TechNet Magazine article “Understanding SQL Server Logging and Recovery.”
Q: Can you explain why I can’t make SQL Server operations non-logged? I’ve read that table truncation is non-logged—why can’t there be a setting to make all operations non-logged so SQL Server runs faster? What if I don’t care about being able to recover after a disaster? I’d especially like to be able to ignore the tempdb transaction log.
A: What you’ve read about a TRUNCATE TABLE operation is not true. All operations on all databases are logged to some degree. Some are “minimally logged,” such as table truncation. Simply stated, a minimally logged operation is one where only the allocation and de-allocation of data file pages is logged. Any operations on table/index records on the pages aren’t logged. This speeds up operations and means less transaction log is generated—but there’s still some logging.
A table truncation is always minimally logged in all recovery models. The other minimally logged operations (such as index build/rebuild and bulk loads) are only minimally logged when using the SIMPLE or BULK_LOGGED recovery models.
The only truly non-logged operations in SQL Server are those affecting the version store in tempdb, which supports features like snapshot isolation and online index operations. These can be non-logged because there’s never a need to roll back a version store operation or run crash-recovery on the tempdb database.
This gets to the crux of your question. Why can’t operations in SQL Server be non-logged? SQL Server always has to be able to roll back an operation if something goes wrong. If there was no description (like the transaction log records) of what the operation had done, how would SQL Server know what to do during rollback? You can only achieve this through logging.
Even if you don’t care about crash-recovery, SQL Server still has to be able to roll back operations if the database runs out of space or encounters a corrupt sector, or if the query runs out of memory. If SQL Server can’t roll back an operation, the database becomes unusable and the workload stops.
This also applies to the tempdb database. Although logging is simplified and reduced in tempdb, you can never remove it entirely for the same reasons. Also, SQL Server has to be able to run crash-recovery after every crash to ensure each database is consistent. Otherwise the database is unusable. Bottom line: There’s no way to make operations non-logged in SQL Server and I wouldn’t expect that to change.
Paul S. Randal* is the managing director of SQLskills.com, a Microsoft regional director and a SQL Server MVP. He worked on the SQL Server Storage Engine team at Microsoft from 1999 to 2007. He wrote DBCC CHECKDB/repair for SQL Server 2005 and was responsible for the Core Storage Engine during SQL Server 2008 development. Randal is an expert on disaster recovery, high availability and database maintenance, and is a regular presenter at conferences worldwide. He blogs at SQLskills.com/blogs/paul, and you can find him on Twitter at Twitter.com/PaulRandal.*