SQL Server 2005 - RDTSC Truths and Myths Discussed
I posted some updates to my blog entry about the RDTSC and drift warnings in May. See the following link http://blogs.msdn.com/psssql/archive/2006/11/27/sql-server-2005-sp2-will-introduce-new-messages-to-the-error-log-related-to-timing-activities.aspx Because the CPU drift issue has continued to generate questions let me try to answer these directly.
Myth: Performance Problems
The statement I have seen frequently is that the drift warning equates to performance problems. This is not the case and extensive testing by the SQLOS team has shown that the even under forced RDTSC variations the Microsoft SQL Server database scheduler is not affected. The performance characteristics of the SQL Server are not affected by RDTSC drift, only the performance timing outputs.
RDTSC test is a utility developed by the SQL Server Escalation Team to help us determine computers that may have inaccurate performance data results. The utility runs with the standard SQLDiag/PSSDiag collection that Microsoft SQL Server support collects when troubleshooting performance cases.
Microsoft SQL Server support has encountered a small subset of issues were the RDTSC drift lead to incorrect performance tuning. Support used at the inaccurate query durations and tuned the query with the longest duration. Only later to find that the durational calculation was incorrect and tuning the query did not fully help the customer. To avoid this we examine the RDTSCTest output as a guideline for how accurate the data should be.
The RDTSCTest can easily show drift. How much drift; is the common question so let me try to answer the question as best as I can.
There is no exposed way to snapshot the RDTSC values for all CPUs on the computer at the same time. To do this the utility has to snapshot the current RDTSC. Then affinitize the thread to another CPU and snapshot the RDTSC. It continues this movement across all CPUs. Then it sorts the returned values and prints out the results. The problem with this is that every time that you move to a new CPU you generally have to take a ride through the NT scheduler. On many computers this can cause an 8 to 12ms context switch of the thread. So if you have a single socket, dual core system drift of 12 to 24ms is easy to encounter. If you take the calculation to a 64 core system you are at 768ms and that only accounts for context switching.
Generally the Microsoft SQL Server support team considers drift less than several seconds, noise. Again, it is only a guideline for administrators and support to be cautious when looking at the durational values from various performance outputs. (These areas are outlined in KB931279 - http://support.microsoft.com/kb/931279/en-us)
We also look for sustained drift patterns. For example, if the utility records a drift delta of 3 seconds for CPU #5 on the computer will it continue to show this when it is executed on successive runs and at the same delta rates. If it does not it is likely a noise situation.
There are also conditions on NUMA systems that the CPUs within a node may remain in sync but doing cross node, RDTSC comparisons is not valid. Make sure that you carefully examine the NUMA configuration information when you analyze the output also.
I have seen several e-mail messages and posts that the administrator should add the /usepmtimer to the boot switches. Use of the /usepmtimer does not affect the RDTSC instruction. Instead it modifies the source the Windows QueryPerformanceCounter data. This has NO APPLICATION for the SQL Server RDTSC issues. Microsoft SQL Server makes the direct read call to retrieve the RDTSC timer.
Truth: Manufacture Updates
KB931279 currently mentions an AMD update to synchronize the RDTSC timers of all CPUs. AMD has indicated that dual-core optimizer is not targeted at server applications.
The sentiment of the statement still stands. If you are receiving drift warnings it is a good practice to make sure that you have the latest firmware, micro-code and BIOS updates.
Truth: Power Now, Speed Step, …
When doing performance tuning of Microsoft SQL Server queries it is a good practice to disable the CPU stepping features. This may require BIOS or similar settings and a restart to accommodate the configuration. This is best way to stabilize RDTSC timings. This enables you to retrieve accurate and meaningful performance data from the Microsoft SQL Server.
Truth and Myth: Affinity Mask
The Microsoft SQL Server affinity mask can reduce the probability of encountering inaccurate performance data but does not prevent it. By setting the affinity mask the SQL Server worker threads stay on the same CPU so that the start and stop RDTSC values are from the same CPU and do not become polluted by thread CPU switching activities.
Consider the case in which stepping is still enabled. The CPU could change frequencies, one or more times between the start and end of the query. The duration is based on the common frequency of all CPUs. When dividing the elapsed duration by a constant frequency it can result in incorrect duration.
Truth: I/O Stall Warnings
SQL Server 2000 SP4 added the I/O stall detection. I have outlined this in detail in my whitepapers and KB articles. See http://support.microsoft.com/kb/897284/en-us for more information. Microsoft SQL Server 2005 updated Stalled I/O tracking to use the RDTSC counters also. When an I/O is posted the starting RDTSC is captured. When the Lazy Writer processes executes it checks the status of the I/O and if still pending and more than 15 seconds has elapsed the warning is recorded. Because the RDTSC counter is in play and the Lazy Writer can execute on a different CPU the RDTSC values may diverge. In an extreme case of drift 15+ seconds the warning could be falsely produced.
Trace Flag 8033
Trace flag –T8033 can be used to suppress the drift warnings. Do not enable this trace flag on Microsoft SQL Server 2005 unless you fully understand the ramifications of ignoring the drift warnings.
Bob Dorr - Microsoft Senior SQL Server Escalation Engineer