Disk Performance Internals
Storage is the slowest component of most computer systems and is therefore often a performance bottleneck. This article discusses the disk performance kernel provider, the partition manager. Knowing how this provider works shows how Windows tracks disk performance internally and how the disk-related counters are calculated, which helps when diagnosing storage performance issues.
Disk Performance Architecture
There are two sets of public interfaces for querying performance counter data: PDH (Performance Data Helper) and the registry interface. The registry interface to the performance data is older than the PDH interface and has more extensive functionality. However, the PDH interface is easier to use for most performance data collection tasks. The PDH interface is essentially a higher-level abstraction of the functionality that the registry interface provides.
Windows performance monitor leverages the PDH interface to get performance data. The performance data helper (PDH) interface calls the registry interface.
Perflib is a key component of the registry interface. It translates the request from the application and calls the collect procedure exported by a performance extension DLL. The extension DLL does the real work of data collection and returns the data to perflib in a standard format.
Extension DLLs should expose Open, Collect, and Close functions to be called by perflib. We can find the names of these functions by checking the registry:
Value Name: Close, Collect, Open
Here is a user mode call stack when an application uses the registry API RegQueryValueEx() to collect performance data:
0017f7ec 004e0000 perfdisk!CollectDiskObjectData+0xf8
0017f964 7702eaa9 advapi32!QueryExtensibleData+0x577
0017fd48 7702e962 advapi32!PerfRegQueryValue+0x5d8
0017fe38 770576f5 advapi32!LocalBaseRegQueryValue+0x313
0017fe9c 004011fc advapi32!RegQueryValueExW+0xa2
0017fec8 00401153 getperfdata!GetPerformanceData+0x3c
0017ff38 0040322a getperfdata!wmain+0x93
0017ff88 773deccb getperfdata!__tmainCRTStartup+0x15e
0017ff94 7798d24d kernel32!BaseThreadInitThunk+0xe
0017ffd4 7798d45f ntdll!__RtlUserThreadStart+0x23
0017ffec 00000000 ntdll!_RtlUserThreadStart+0x1b
Disk performance kernel device stack
Figure 2 shows the I/O manager stack used to gather disk performance statistics. The volume manager underneath the file system driver gathers Logical Disk statistics. On Windows Server 2008 or later, volmgr.sys handles Logical Disk statistics for both dynamic and basic disks. The partition manager, partmgr.sys, gathers Physical Disk statistics. These statistics are measured and collected for each request that passes through the I/O manager stack.
Physical Disk Statistics
Partition manager (partmgr.sys) saves performance information in the device extension’s counter context.
Logical Disk statistics
Volume manager (volmgr.sys) also saves performance metrics in its device extension.
How to track disk performance?
Performance information is tracked in the read and write dispatch routines and in the IO completion routines. There are 5 kinds of counter data tracked by partition manager:
1. Queue depth - Total concurrent IOs still in process and not yet completed.
2. Total counts of read and write requests.
3. Total read and write time for all IO requests. For example, if 2 write IOs have completed since the disk counter was enabled, one taking 1 sec and the other taking 2 sec, the write time counter will be 1 sec + 2 sec = 3 sec.
4. Total Idle time.
5. Total split IO (fragmented IO).
Let’s talk about them separately.
Queue depth:
When a new IRP is sent to partition manager, it increments the queue depth. Partition manager decrements the queue depth when it completes an IRP. Therefore, the value indicates how many concurrent IOs are still in process.
Total read and write count:
When a read or write IO completes, the partition manager's IRP completion routine is called and increments the read or write counter. Note that only completed IOs are counted here.
Total read and write time for IOs:
When any read or write IO starts, partition manager’s dispatch routine will record the current time stamp in the IO stack location of the IRP. When an IRP is completed the completion routine will use this time stamp and the current system time to calculate the time taken to complete this IO. Partition manager will then add this value to the appropriate counter in the device extension.
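The bookkeeping described so far (queue depth, completion counts, and accumulated IO time) can be sketched as a small user-mode model. This is only an illustration, not the actual partmgr.sys code: the `DiskCounters` class, the `dispatch` and `complete` helpers, and the 100 ns tick size are all assumptions made for the example. It reproduces the earlier case of two writes taking 1 sec and 2 sec.

```python
from dataclasses import dataclass

TICKS_PER_SEC = 10_000_000  # assumption: 100 ns ticks


@dataclass
class DiskCounters:
    """Hypothetical stand-in for the counter context in the device extension."""
    queue_depth: int = 0
    read_count: int = 0
    write_count: int = 0
    read_time: int = 0   # accumulated ticks
    write_time: int = 0  # accumulated ticks


def dispatch(counters, now):
    """An IO starts: bump the queue depth and return the stamp kept with the IO."""
    counters.queue_depth += 1
    return now


def complete(counters, is_write, start, now):
    """An IO completes: drop the queue depth and charge the elapsed time."""
    counters.queue_depth -= 1
    elapsed = now - start
    if is_write:
        counters.write_count += 1
        counters.write_time += elapsed
    else:
        counters.read_count += 1
        counters.read_time += elapsed


c = DiskCounters()
first = dispatch(c, now=0)
second = dispatch(c, now=0)
complete(c, is_write=True, start=first, now=1 * TICKS_PER_SEC)
complete(c, is_write=True, start=second, now=2 * TICKS_PER_SEC)
print(c.write_count, c.write_time / TICKS_PER_SEC, c.queue_depth)  # 2 3.0 0
```

Note that the two writes overlap in wall-clock time, yet the write time counter still grows by the full 3 sec, because each IO is charged its own elapsed time.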
Total Idle time:
When completing an IRP and decrementing the queue depth, partition manager checks whether the queue depth has reached 0. If so, the disk has transitioned from busy to idle, and partition manager saves the time stamp to Last Idle Clock in the counter context.
When a new IRP is sent to partition manager, it increments the queue depth and checks whether the queue depth has reached 1. If so, the disk has transitioned from idle to busy, and the idle time counter is increased by (current time stamp - Last Idle Clock).
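The two transitions above can be modeled in a few lines. The sketch below is hypothetical: the `IdleTracker` class and its method names are invented for illustration, time stamps are arbitrary integers, and we assume the disk starts out idle with Last Idle Clock set to the start time.

```python
class IdleTracker:
    """Toy model of the idle-time bookkeeping; not the real driver code."""

    def __init__(self, now=0):
        self.queue_depth = 0
        self.idle_time = 0
        self.last_idle_clock = now  # assumption: the disk starts idle

    def dispatch(self, now):
        self.queue_depth += 1
        if self.queue_depth == 1:
            # Idle -> busy: account for the idle period that just ended.
            self.idle_time += now - self.last_idle_clock

    def complete(self, now):
        self.queue_depth -= 1
        if self.queue_depth == 0:
            # Busy -> idle: remember when the idle period began.
            self.last_idle_clock = now


tracker = IdleTracker(now=0)
tracker.dispatch(now=10)   # idle from 0 to 10
tracker.dispatch(now=12)   # already busy, no idle time added
tracker.complete(now=15)
tracker.complete(now=20)   # queue drains, idle period starts at 20
tracker.dispatch(now=50)   # idle from 20 to 50
print(tracker.idle_time)   # 40
```

One consequence of this scheme is that an idle period is only added to the counter when the next IO arrives.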
Total split IO (fragmented IO):
When completing an IRP, partition manager checks whether the IRP is marked as IRP_ASSOCIATED_IRP. This flag is usually set by the file system driver when a large IO is split into multiple smaller IOs. Typically, when an IO spans several runs, each containing a contiguous block of data, NTFS creates an associated IRP for each run and sends it to the lower-level driver. Therefore, this counter can usually be used to track fragmented IOs.
Note: Disk performance statistics are saved to an array whose index corresponds to each processor. Most of the counters are saved to the index corresponding to the processor the IRP was completed on.
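As a rough illustration of the per-processor layout, the sketch below keeps one slot per processor and sums the slots when a total is needed. The processor count, the slot layout, and the idea that totals are produced by summing at query time are assumptions made for the example; the article only states that the statistics are stored per processor.

```python
NUM_PROCESSORS = 4  # hypothetical machine

# One slot of counters per processor, indexed by processor number.
per_cpu = [{"read_count": 0, "write_count": 0} for _ in range(NUM_PROCESSORS)]


def on_complete(cpu, is_write):
    """Charge a completed IO to the slot of the processor it completed on."""
    key = "write_count" if is_write else "read_count"
    per_cpu[cpu][key] += 1


def total(key):
    """Sum one counter across all processor slots (assumed query-time step)."""
    return sum(slot[key] for slot in per_cpu)


on_complete(cpu=0, is_write=False)
on_complete(cpu=2, is_write=False)
on_complete(cpu=2, is_write=True)
on_complete(cpu=3, is_write=False)
print(total("read_count"), total("write_count"))  # 3 1
```

A layout like this lets each processor update its own slot without contending with the others on a shared counter.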
How to convert to performance counter?
Now we understand how the kernel keeps track of these metrics. Next we need to map the kernel metrics to the performance counters shown in performance monitor, which are calculated from them. Each counter has a counter type, and each counter type has a different calculation. The counter type determines how the counter data is calculated, averaged, and displayed.
For example, Avg. Disk sec/Transfer has a counter type of PERF_AVERAGE_TIMER. The formula of PERF_AVERAGE_TIMER is: ((N1 - N0) / F) / (D1 - D0), where the numerator (N) represents the number of ticks counted during the last sample interval, F represents the frequency of the ticks, and the denominator (D) represents the number of reads and writes completed during the last sample interval. N is returned from the kernel as ReadTime + WriteTime in ticks. D is returned from partition manager or volume manager as read count + write count.
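To make the PERF_AVERAGE_TIMER formula concrete, here is a small worked example. The function name, the sample values, and the 100 ns tick frequency are assumptions chosen for illustration; returning 0 when no IO completed in the interval is also our own choice to avoid dividing by zero.

```python
def avg_disk_sec_per_transfer(n0, n1, d0, d1, freq):
    """((N1 - N0) / F) / (D1 - D0) for two raw samples."""
    completed = d1 - d0
    if completed == 0:
        return 0.0  # assumption: report 0 when nothing completed
    return ((n1 - n0) / freq) / completed


FREQ = 10_000_000  # assumed tick frequency (100 ns ticks)

# Between the two samples, 2 IOs completed and took 3 sec in total.
value = avg_disk_sec_per_transfer(n0=0, n1=3 * FREQ, d0=0, d1=2, freq=FREQ)
print(value)  # 1.5
```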
Disk Transfers/sec:
Counter type: PERF_COUNTER_COUNTER
Formula: (N1 - N0) / ((D1 - D0) / F), where N is returned from partition manager or volume manager as read count + write count, D1 - D0 is the number of ticks counted during the last sample interval, and F represents the frequency of the ticks.
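The rate formula above divides a change in a count by the elapsed time in seconds. A minimal worked example follows; the function name, the sample values, and the 100 ns tick frequency are assumptions for illustration.

```python
def transfers_per_sec(n0, n1, d0, d1, freq):
    """(N1 - N0) / ((D1 - D0) / F) for two raw samples."""
    elapsed_sec = (d1 - d0) / freq
    return (n1 - n0) / elapsed_sec


FREQ = 10_000_000  # assumed tick frequency (100 ns ticks)

# 500 IOs completed between two samples taken 2 sec apart.
rate = transfers_per_sec(n0=1000, n1=1500, d0=0, d1=2 * FREQ, freq=FREQ)
print(rate)  # 250.0
```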
Avg. Disk Queue Length:
Counter type: PERF_COUNTER_LARGE_QUEUELEN_TYPE
Formula: (N1 - N0) / (D1 - D0), where the numerator (N) represents queue depth and the denominator (D) represents the time elapsed during the sample interval.
Current Disk Queue Length:
Counter type: PERF_COUNTER_RAWCOUNT
Formula: None. Shows raw data as collected. It is the instantaneous value of the queue depth.
Disk Bytes/sec:
Counter type: PERF_COUNTER_BULK_COUNT
Formula: (N1 - N0) / ((D1 - D0) / F), where the numerator (N) represents the total ReadBytes + WriteBytes, the denominator (D) represents the number of ticks elapsed during the last sample interval, and F is the frequency of the ticks.
% Idle Time:
Counter type: PERF_PRECISION_100NS_TIMER
Formula: (N1 - N0) / (D1 - D0), where the numerator (N) represents the total IdleTime and the denominator (D) is the value of the private timer. The private timer has the same frequency as the 100 ns timer.
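Reading the % Idle Time formula as (N1 - N0) / (D1 - D0), a worked example looks like this. The function name and sample values are assumptions, and the multiplication by 100 reflects our assumption that the ratio is displayed as a percentage.

```python
def percent_idle_time(n0, n1, d0, d1):
    """(N1 - N0) / (D1 - D0), scaled to a percentage for display (assumed)."""
    return 100.0 * (n1 - n0) / (d1 - d0)


# Both values are in 100 ns units: 0.4 sec idle during a 1 sec interval.
idle = percent_idle_time(n0=0, n1=4_000_000, d0=0, d1=10_000_000)
print(idle)  # 40.0
```

Because N and D use the same 100 ns units, no frequency term is needed in the division.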
Note: Programmers should avoid calculating counters manually and should instead use pdh.dll. An example of what can go wrong when calculating this data manually is described in Performance Monitor Averages, the Right Way and the Wrong Way.
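One way a manual calculation can go wrong is shown below as an illustration; it is not necessarily the exact case covered by the referenced article. Averaging per-interval values of Avg. Disk sec/Transfer gives every interval the same weight, regardless of how many IOs it contained. The interval data and the 100 ns tick size are made up for the example.

```python
TICKS_PER_SEC = 10_000_000  # assumption: 100 ns ticks

# (completed IOs, total IO time in ticks) for two hypothetical sample intervals
intervals = [(1, 10_000_000), (99, 9_900_000)]

# Wrong way: average the per-interval averages.
per_interval = [ticks / TICKS_PER_SEC / ios for ios, ticks in intervals]
naive = sum(per_interval) / len(per_interval)

# Right way: total time divided by total IOs across the whole period.
total_ios = sum(ios for ios, _ in intervals)
total_ticks = sum(ticks for _, ticks in intervals)
weighted = total_ticks / TICKS_PER_SEC / total_ios

print(f"{naive:.4f} {weighted:.4f}")  # 0.5050 0.0199
```

A single slow IO in a quiet interval inflates the naive figure to about half a second, while the average over all 100 IOs is roughly 20 ms.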
How to measure disk performance?
In this section we discuss which counters are key to measuring disk performance. Generally, 4 counters are used for performance measurement: Disk Bytes/sec, % Idle Time, Avg. Disk sec/Transfer and Avg. Disk Queue Length.
Disk Bytes/sec
From the formula, Disk Bytes/sec is the number of bytes completed per second. Two things can affect this counter value:
1. How much load is generated against the disk or volume?
Assume there are no problems with disk performance and the load has not reached the storage bottleneck. Then this counter value is determined by the IO load generated by the application, such as a stress tool.
2. Disk performance
If the IO load has exceeded the storage bottleneck, this counter value will stop increasing as the load increases.
Conclusion: Since this counter value could be affected by IO load from an application we cannot use it as the key to determine disk performance.
% Idle Time
This counter value indicates how long the disk is in idle status without outstanding IO. It can help to determine how busy the disk is. However, even if the disk is busy with 0% Idle Time, we cannot say it suffers from a performance issue as it may still be able to complete all IOs in time.
Avg. Disk Queue Length
This counter indicates how many IOs are outstanding on average. If the disk can always complete IO immediately, the value should be 0. Therefore, it is also a measure of how busy the storage is. However, it does not impact the application directly, because the application does not care how many IOs are outstanding in total. The application is concerned with how fast each IO completes. In practice, if the queue depth is more than 10, we may say the storage is busy and could delay the IOs in the queue. However, if every IO still completes quickly, there is no impact on the application, which means the delay is still acceptable.
Avg. Disk sec/Transfer
This counter indicates how fast IOs are completed on average. This is one of the keys to an application's performance, as discussed above.
Dynamic counter loading feature
On Windows Server 2008 or later, the disk counters in the kernel provider can be dynamically enabled or disabled. If no handle to HKEY_PERFORMANCE_DATA is open, the kernel provider disables IO performance tracing by setting a flag in the device extension. Here is the call stack when the counter is being dynamically disabled:
Because the sample app from MSDN closes the handle after every call to RegQueryValueEx(), it disables and enables the disk counter intermittently. For any app that uses the registry API this way, some IOs start while the counter is disabled, so no time stamp is recorded, and complete after the counter has been enabled again. Such an IO produces a huge time difference, which is charged to sec/Transfer. KB 2470949 was released to address this issue on Windows Server 2008 R2.
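The effect of a missing time stamp can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that an unrecorded time stamp reads as zero and that ticks are 100 ns; the real driver behavior may differ in detail.

```python
TICKS_PER_SEC = 10_000_000  # assumption: 100 ns ticks


def elapsed_ticks(start_stamp, now):
    """Time charged to an IO at completion: current time minus the start stamp."""
    return now - start_stamp


now = 5_000_000_000  # hypothetical current time, in ticks

# Normal IO: stamped at dispatch, completes 5 ms later.
normal = elapsed_ticks(start_stamp=now - 50_000, now=now)

# IO dispatched while the counter was disabled: stamp never recorded (assumed 0).
bogus = elapsed_ticks(start_stamp=0, now=now)

# Average over 99 normal IOs plus the single bogus one.
avg_sec = (99 * normal + bogus) / 100 / TICKS_PER_SEC
print(normal / TICKS_PER_SEC, bogus / TICKS_PER_SEC, avg_sec)
```

Under these assumptions, one unstamped IO is charged 500 sec and pulls the average for 100 IOs from 5 ms up to about 5 sec.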