Tracing Considerations and Design

Although SQL Trace was both designed and tested with high efficiency in mind, users must remember that it can be made to do quite a large amount of work. Collecting millions of events per minute on an extremely busy server can be taxing to all server resources—from memory to store the buffers, to network bandwidth if the rowset provider is used, to I/O if the file provider is used. This does not mean that SQL Trace should not be used on an active system; to the contrary, SQL Trace is a fantastic tool for tracking down issues even on the busiest of database servers. However, application of a bit of forethought and planning is necessary in order to ensure that your tracing session successfully answers your questions and does not end up causing you new problems.

The SQL Server Profiler Question

By far the easiest way to make sure that your trace activity does not cause issues on your busy production server is to avoid use of the SQL Server Profiler tool except as a scripting utility (to script trace definitions) and for very small trace jobs, such as when filtering for activity done by a single spid.

Advice against using SQL Server Profiler on a busy system is something we’ve heard for quite some time, but we were not sure just how much impact it would really have. SQL Server MVP and load testing expert Linchi Shea agreed to help us find an answer, and conducted a few tests using a TPC-C benchmark tool that he has written.

Linchi ran the tests on an HP DL585 G1 server with four single-core 2.6-GHz AMD Opteron processors. The server had 16 GB of RAM with 12 GB allocated for the SQL Server 2005 SP2 (9.00.3042) instance. The test database was 15 GB in total size, with 10 GB allocated for data and 5 GB allocated for logs. The actual data size was approximately 9 GB at the start of each batch of test runs and grew to approximately 12 GB over the course of the test. Workloads ranged from 20 to 300 users over the course of each test, with processors in the 75 to 85 percent range and I/O saturated at high user levels.

Three tests were run: one baseline test with no tracing enabled, one test using the default template with results returned to SQL Server Profiler, and a third test using the default template with a server-side trace (results output to a file). The results are fairly astounding, showing an almost 10 percent penalty in transactions per second throughput for the Profiler-based trace, with even a small number of emulated users, as shown in Table 2-4.

Table 2-4 Results of Comparing Server-Side and Profiler-Based Traces


Linchi noted that the bandwidth utilization resulting from the Profiler trace was consuming over 35 percent of the 100 megabits available to his server, but he was unable to determine whether that was the actual cause of the slowdown. The good news is that Linchi found absolutely no performance difference between his baseline case and the test with the server-side trace. This test clearly indicates that server-side tracing is the correct approach for an already busy production environment, in order to avoid causing more issues.

Note The numbers and data shown here tell only some of the story. For a complete writeup on these tests, please refer to Linchi’s blog post on the topic, which can be found at the following URL:

It’s important to remember that this test shows only a narrow view of a broad topic. It does not answer the question of when a trace will break a system that’s already at the edge, or how much additional data can be collected before problems start. And it should be stressed that the results of this test should not be read to indicate that DBAs should go overboard and trace everything, server-side, without discretion. There will certainly be performance penalties, even with server-side traces, as the amount of data collected increases. However, the results do indicate that you should probably not worry too much if you do need to use SQL Trace to solve a problem, even on a production system.

Reducing Trace Overhead

Beyond simply not using SQL Server Profiler, there is quite a bit you can do to reduce the overhead of your traces. As shown in the preceding section, server-side traces are quite efficient, but that doesn’t mean that you can go too crazy with them. The main thing you should try to avoid is overstressing the I/O system by collecting too much data, which will end up causing blocking conditions.

SQL Trace provides the ability to filter your traces—make sure to take advantage of it. Filters naturally reduce I/O by limiting the amount of data that is collected. When working on designing your traces, try to think about how you can filter them ahead of time—as part of the trace definition—instead of afterwards, once you load the data into a trace table.

Sometimes the best way to filter your trace to avoid excessive overhead is to actually create multiple traces, each with a different set of filters. For example, in the “Debugging Deadlocks” section, the trace shown was impossible to filter by spid because the Deadlock graph event may be fired on any number of system spids. Rather than creating a single monolithic trace, you might instead create one trace that includes the RPC:Starting and SQL:BatchStarting events, as well as Lock:Acquired and Lock:Released, and Lock:Escalation. This trace can be filtered based on the two spids you’re focusing on.

A second, unfiltered trace can be simultaneously created, which includes only the Deadlock graph event. After capturing the required data, the events from both traces can be inserted into a single trace table, and treated as if they were captured by a single trace. If you also happened to capture the EventSequence column for all involved events, it will be even easier to piece things back together—and the combination of these two traces will not contain any results that you did not intend to capture.

By thinking somewhat laterally, it is possible to vastly reduce the I/O overhead of server-side tracing. Getting out of the mind-set of trying to squeeze everything into a single trace enables much more flexible use of filters and the ability to greatly narrow your trace’s focus.

Max File Size, Rollover, and Data Collection

Another I/O-related consideration is the maximum file size parameter available when creating a server-side trace. When working with a new system that you suspect is especially busy, be very cautious about server-side traces because of the very real possibility that if you haven’t been quite smart enough about your filters, the trace will quickly consume the drive on which you’re collecting. Some systems are busy enough to capture 100 or more megabytes per second, just from the TextData columns of the RPC:Completed and SQL:BatchCompleted events.

In order to avoid this problem and test a trace, we generally start with a maximum file size of 50 megabytes, with rollover files turned off. We enable the trace and watch it, making sure that it doesn’t immediately consume the entire file and stop. If it does do so, it’s time to rethink things and try again; if not, we keep going, perhaps setting a slightly larger maximum file size or configuring rollover files.

Rollover files can be incredibly useful if you’d like to run a longer-term server-side trace but still have some delayed visibility into data as it’s collected. To set this up, configure your trace to use a small enough maximum file size that it will roll over on a regular basis—say, every 20 minutes, assuming that’s the interval you’d like for your delayed data. Also configure the maximum number of rollover files to a large enough number to support the length of time you would like to run the trace.

Set up a SQL Server Agent job to periodically get the name of the current file, by querying the Path column of the sys.traces view. Check the folder that you’re tracing into for older trace files—perhaps using xp_cmdshell—and insert these into the trace table. Don’t forget to delete them when you’re done.

Running the trace in this way will give you delayed visibility into the data, while keeping things fairly efficient and ensuring that you don’t lose any events. Make sure to monitor the data collection from the trace tables and ensure that the insert itself isn’t taking an excessively long time or tying up the I/O system too much. It can sometimes be difficult to balance the needs of production databases with the requirement of being able to monitor them.

< Back      Next >



© Microsoft. All Rights Reserved.