SQL Server: Dig deep for performance issues
There are several tactics for determining what’s causing the most wait times in SQL Server, so you can direct your performance tuning.
Excerpted from “SQL Server DMV Starter Pack,” published by Red Gate Books (2010).
Glenn Berry, Louis Davidson and Tim Ford
Let’s get right down to the OS level. We’re looking at the worker threads that carry out the tasks required by our transactions. The Dynamic Management Objects (DMOs) in this category—all of which start with the name sys.dm_os_—provide extremely detailed information about the way SQL Server interacts with the OS and the hardware. You can use these DMOs to get answers to all sorts of performance questions:
- The wait_stats DMO tells you what kinds of things the SQL Server OS threads have been waiting on.
- Performance_counters tells you the SQL Server performance counter values and how they’re decoded.
- Ring_buffers, os_schedulers or wait_stats will alert you to a CPU-utilization concern.
- Sys_info will give you the characteristics of the machine upon which SQL Server is running.
- Sys_memory or process_memory will tell you how your memory as a whole is being used.
- Memory_cache_counters or buffer_descriptors tell you how the cache memory is being used.
All these queries work with SQL Server 2005, 2008 and 2008 R2. They also all require View Server State permission. Note: The objects, collectively, should be referred to as DMOs. However, using DMOs tends to cause some confusion with the entirely unrelated “Distributed Management Objects,” so it’s still quite common for database administrators to refer to DMOs collectively as “Dynamic Management Views, or DMVs.”
SQL Server waits
Arguably the most significant DMV in the Operating System category is sys.dm_os_wait_stats. Every time a session has to wait for some reason before the requested work can continue, SQL Server records the length of time waited, and the resource for which SQL Server is waiting. The sys.dm_os_wait_stats DMV exposes these wait statistics, aggregated across all session IDs, to give you a summary review of where the major waits are on a given instance.
This same DMV also exposes performance (PerfMon) counters, which provide specific resource-usage measurements (disk transfer rates, amount of CPU time consumed and so on). By correlating wait statistics with resource measurements, you can quickly locate the most contested resources on your system and highlight potential bottlenecks.
Waits and queues
Using waits and queues is an excellent basis for performance tuning, as explained in the white paper, "SQL Server 2005 Waits and Queues," by Tom Davidson. Essentially, each SQL Server request results in a number of “worker tasks” being initiated. A SQL Server Scheduler assigns each task to a worker thread. Normally there’s one SQL OS scheduler per CPU, and only one session per scheduler can run at any time.
It’s the scheduler’s job to spread the workload evenly across available worker threads. If a session’s worker thread is running on the processor, the status of the session will be Running, as exposed by the Status column of sys.dm_exec_requests DMV.
If a thread is ready to go, but the scheduler to which it’s assigned currently has another session running, it will be placed in the “runnable” queue. This simply means it’s in the queue to get on the processor. This is called a signal wait.
The signal wait time is exposed by the signal_wait_time_ms column. This solely refers to CPU wait time. If a session is waiting for another resource to become available, such as a locked page, or if a running session needs to perform I/O, then it’s moved to the wait list. This is a resource wait and the waiting session’s status will be recorded as “suspended.” The reason for the wait is recorded, and exposed in the wait_type column of the sys.dm_os_wait_stats DMV.
The total time spent waiting is exposed by the wait_time_ms column, so you can calculate the resource wait time, as follows:
Resource waits = Total waits – Signal waits (Or (wait_time_ms) - (signal_wait_time_ms))
Signal waits are unavoidable in online transaction processing (OLTP) systems, as these are made up of a large number of short transactions. The key metric, with regard to potential CPU pressure, is the signal wait as a percentage of the total waits.
A high percentage signal is a sign of CPU pressure. “High” often refers to more than 25 percent, but it depends on your system. Values greater than 10 percent to 15 percent can also be a worrying sign. Overall, wait statistics are an effective means to diagnose response times in your system. In very simple terms, you either work or you wait.
If response times are slow and you find no significant waits, or mainly signal waits, you know you need to focus on CPU. If you find the response time is mainly comprised of time spent waiting for other resources (such as network, I/O and so on), then you know you need to focus your tuning efforts on those resources.
Our first script in the OS category uses the sys.dm_os_wait_stats DMV, which returns information about all the waits encountered by executed threads. You can use this aggregated view to diagnose performance issues with SQL Server overall and with specific queries and batches.
This simple query calculates signal waits and resource waits as a percentage of the overall wait time, in order to diagnose potential CPU pressure:
-- Total waits are wait_time_ms (high signal waits indicate CPU pressure) SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20,2)) AS [%signal (cpu) waits] , CAST(100.0 * SUM(wait_time_ms - signal_wait_time_ms) / SUM(wait_time_ms) AS NUMERIC(20, 2)) AS [%resource waits]FROM sys.dm_os_wait_stats ;
This is useful to help confirm CPU pressure. Because signal waits are time spent waiting for a CPU to service a thread, if you record total signal waits above roughly 10 percent to 15 percent, that’s a good indicator of CPU pressure. These wait stats are cumulative since SQL Server was last restarted, so you need to know your baseline value for signal waits and watch the trend over time.
You can manually clear out the wait statistics, without restarting the server, by issuing a database consistency checking (DBCC) SQLPERF command, as follows:
-- Clear Wait Stats DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR) ;
If your SQL Server instance has been running for quite a while and you make a significant change, such as adding an important new index, you should consider clearing the old wait stats. Otherwise, the old cumulative wait stats will mask whatever impact your change has on the wait times.
Our second example script (see Figure 1) shows how using the sys.dm_os_wait_stats DMV will help determine the resources for which SQL Server is spending the most time waiting.
Figure 1 This script generates a report on the top causes of waits.
-- Isolate top waits for server instance since last restart -- or statistics clear WITH Waits AS ( SELECT wait_type , wait_time_ms / 1000. AS wait_time_s , 100. * wait_time_ms / SUM(wait_time_ms) OVER ( ) AS pct , ROW_NUMBER() OVER ( ORDER BY wait_time_ms DESC ) AS rn FROM sys.dm_os_wait_stats WHERE wait_type NOT IN ( 'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK', 'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE', 'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH', 'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN' ) ) SELECT W1.wait_type , CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s , CAST(W1.pct AS DECIMAL(12, 2)) AS pct , CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct FROM Waits AS W1 INNER JOIN Waits AS W2 ON W2.rn <= W1.rn GROUP BY W1.rn , W1.wait_type , W1.wait_time_s , W1.pct HAVING SUM(W2.pct) - W1.pct < 95 ; -- percentage threshold
The script in Figure 1 helps you locate the biggest bottlenecks at the instance level. This helps you focus your tuning efforts on the particular type of problem. For example, if the top cumulative wait types are disk I/O-related, then you’d want to investigate this issue further using disk-related DMV queries and PerfMon counters.
The DMV that exposes the PerfMon counters is sys.dm_os_performance_counters. This returns a row per performance counter maintained by the server. This is a useful DMV, but it can be frustrating to work with.
Depending on the value for cntr_type for a given row, you might have to dig deep to get meaningful information from this DMV. It’s a replacement for the old sys.sysperfinfo from SQL Server 2000.
The script in Figure 2 helps you investigate unusual conditions filling up your transaction log. It returns the recovery model, log reuse wait description, transaction log size, log space used, percentage of log used, compatibility level, and page verify option for each database on the current SQL Server instance.
Figure 2 Determine what’s filling up the transaction log using this script.
-- Recovery model, log reuse wait description, log file size, -- log usage size and compatibility level for all databases on instance SELECT db.[name] AS [Database Name] , db.recovery_model_desc AS [Recovery Model] , db.log_reuse_wait_desc AS [Log Reuse Wait Description] , ls.cntr_value AS [Log Size (KB)] , lu.cntr_value AS [Log Used (KB)] , CAST(CAST(lu.cntr_value AS FLOAT) / CAST(ls.cntr_value AS FLOAT) AS DECIMAL(18,2)) * 100 AS [Log Used %] , db.[compatibility_level] AS [DB Compatibility Level] , db.page_verify_option_desc AS [Page Verify Option]FROM sys.databases AS db INNER JOIN sys.dm_os_performance_counters AS lu ON db.name = lu.instance_name INNER JOIN sys.dm_os_performance_counters AS ls ON db.name = ls.instance_nameWHERE lu.counter_name LIKE ‘Log File(s) Used Size (KB)%’ AND ls.counter_name LIKE 'Log File(s) Size (KB)%' ;
The query in Figure 2 is helpful for evaluating an unfamiliar database server. It’s also generally more useful from a monitoring perspective. For example, if your log reuse wait description is something unusual like ACTIVE_TRANSACTION, and your transaction log is 85 percent full, then there should be some alarm bells going off.
Using this type of monitoring is extremely helpful for identifying the sources and causes of waits and other performance issues. This type of investigation can help you better direct your performance-tuning efforts in the right place.
Glenn Berry works as a database architect at NewsGator Technologies Inc. in Denver, Colo. He’s a SQL Server MVP and has a whole collection of Microsoft certifications, including MCITP, MCDBA, MCSE, MCSD, MCAD and MCTS, which proves that he likes to take tests.
Louis Davidson has been in the IT industry for 16 years as a corporate database developer and architect. He has been a SQL Server Microsoft MVP for six years and has written four books on database design. Currently he’s the data architect and sometimes DBA for the Christian Broadcasting Network, supporting offices in Virginia Beach, Va., and Nashville, Tenn.
Timothy Ford is a SQL Server MVP and has been working with SQL Server for more than 10 years. He’s the primary DBA and subject-matter expert for the SQL Server platform for Spectrum Health. He’s been writing about technology since 2007 for a variety of Web sites and maintains his own blog at thesqlagentman.com, covering SQL as well as telecommuting and professional development topics.
Learn more about “SQL Server DMV Starter Pack” at red-gate.com.