CLR Inside Out

Measure Early and Often for Performance, Part 2

Vance Morrison

Code download available at: CLRInsideOut2008_05.exe (205 KB)

Contents

MeasureIt Recap
Some Quick Observations
Validating the Data
Caveats
The Uneven Performance Landscape

In the last installment of CLR Inside Out, I stressed that to reliably create high-performance programs, you need to understand the performance of the individual components you use early in the design process (see msdn2.microsoft.com/magazine/cc424899). To do this you need performance data. Thus, measurement has to be an integral part of the design process.

At that time I also introduced a tool called MeasureIt that makes it very easy to create new benchmarks so you can quickly gather the data you need to make good design decisions. The raw numbers that a tool like MeasureIt provides are critically important, but it is also important to develop a general understanding of what the numbers mean. This understanding allows you to predict the performance of something before you actually measure it. That is what I'll discuss here.

MeasureIt Recap

If you have not already downloaded the MeasureIt tool, I strongly encourage you to do so now. It's in the code download for this column, found on the MSDN® Magazine Web site, and it consists of a single EXE file. Running it will generate a Web page displaying the results of running a suite of benchmarks. Additional documentation can be accessed after installation by running the following command:

measureIt /usersGuide

MeasureIt comes complete with its source code, which can be conveniently unpacked using the /edit qualifier. This makes adding a new benchmark as easy as writing one or two lines of code (and supplying the code to be timed). See the user's guide for more on exactly how to do this.
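
As a rough sketch of the shape such an addition takes (the timer1000 harness object and its Measure method here are illustrative, not necessarily the tool's exact API; the user's guide has the authoritative details), a new benchmark is just a name plus a delegate containing the code to time:

// Illustrative only -- consult the user's guide for the real harness API.
string aString = "aString";
timer1000.Measure("String.IndexOf", delegate {
    aString.IndexOf('g');
});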

MeasureIt benchmarks are associated with different performance areas, which can be indicated on the command line when the tool is launched. By default (with no command-line parameters), MeasureIt runs a suite of 50 or so benchmarks covering a broad variety of basic Microsoft® .NET Framework runtime operations. Abbreviated sample output is shown in Figure 1.

Figure 1 Sample Benchmarks

Name Median Mean StdDev Min Max Samples
NOTHING [count=1,000] 0.000 0.037 0.110 0.000 0.366 10
MethodCalls: EmptyStaticFunction() [count=1,000 scale=10.0] 1.000 1.103 0.496 0.857 2.577 10
ObjectOps: new Class() [count=1,000 scale=10.0] 5.060 10.223 13.927 3.340 51.215 10
ObjectOps: new FinalizableClass() [count=1,000 scale=10.0] 78.552 155.408 168.595 64.997 629.243 10
ObjectOps: (Class) Activator.CreateInstance(classType) 102.510 102.949 4.076 96.876 109.819 10
Arrays: localIntPtr[i] = 1 [count=1,000 scale=10.0] 0.713 0.664 0.076 0.574 0.773 10
Arrays: string[i] = aString [count=1,000 scale=10.0] 3.402 3.405 0.012 3.397 3.442 10
Delegates: aInstanceDelegate() [count=1,000 scale=10.0] 1.235 1.205 0.111 1.094 1.475 10
MethodReflection: Method.Invoke EmptyStaticFunction() 472.283 472.744 5.409 466.291 482.094 10
P/Invoke: FullTrustCall() [count=1,000] 6.184 6.254 0.793 5.469 7.599 10
P/Invoke: 10 FullTrustCall() (10 call average) 2.669 2.688 0.061 2.665 2.870 10
P/Invoke: 1 PartialTrustCall [count=1,000] 27.806 30.440 8.735 26.343 56.582 10

MeasureIt runs each benchmark 10 times and computes statistics based on the results. The reported values are scaled so that a single call to an empty method takes one unit of time. For example, because Figure 1 shows the median time of allocating a small object to be 5.06, you know it typically takes only a bit more than five times as long to allocate a small object as it does to call a method. But that is not the whole story. Notice that the maximum time for object allocation is more than 51 units. Thus there are times when it can take much longer than the average case. In fact, if the benchmark was unlucky enough to force a full garbage collection of the heap, it might spend considerably more time in the method than the maximum value that is reported here.

Already, however, you should be able to see the value of the MeasureIt harness. With almost no work at all, you can immediately get a rough idea of how expensive small allocations are. Just as important, because the harness collects multiple samples and computes statistics, you can also understand that some operations, such as object allocation, can have considerable variation. Having minimum, maximum, and standard deviation information lets you determine whether or not the measurements are stable enough to be relied upon.

Some Quick Observations

Figure 1 provides you with a rough idea of the cost of object allocation of small objects in the .NET Framework. However, you can quickly deduce quite a bit more. For example, you can surmise that objects with finalizers—declared in C# as ~ClassName()—are more than 10 times as expensive as objects that don't have finalizers. Worse, finalizers are inherited, which means that allocations of instances of any subclass will be similarly penalized. Thus, if at all possible, finalizers should generally be avoided on a class that will have many instances.
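
In source terms, the only difference between the two benchmarked classes is the finalizer declaration (a minimal sketch; the class names are illustrative):

class Plain { }                  // allocation takes the cheap 10-15 instruction path

class Finalizable {
    ~Finalizable() { }           // even an empty finalizer forces every allocation to
}                                // register the instance on the finalization queue

// If a class genuinely needs cleanup, prefer the IDisposable pattern and call
// GC.SuppressFinalize(this) from Dispose, so correctly disposed instances
// skip finalization entirely.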

You can also infer that allocating an object using reflection (as in Activator.CreateInstance) is more than 10 times as expensive as allocating an object normally. This will be a common theme with reflection APIs: they are expensive compared with their static equivalents, and you should not use them on performance-critical paths.
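
In code, the two allocation styles look like this (a sketch assuming Class has a parameterless constructor):

Class obj1 = new Class();                                 // roughly 5 units in Figure 1

Type classType = typeof(Class);
Class obj2 = (Class)Activator.CreateInstance(classType);  // roughly 100 units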

Furthermore, you can also conclude from the data that calling a method using reflection (as in MethodInfo.Invoke) is more than 450 times slower than calling the method statically (the ratio is so high in part because calling a method normally is very cheap). Likewise, you are also able to deduce that array access is cheap (less than the cost of a method call), but setting an element in an array of object references (as in string[]) is more than 4 times as expensive as a normal array set. You can also see from Figure 1 that invoking a delegate (pointer to a method) is only about 20 percent slower than invoking a method whose target is known at compile time.
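
When a method must be bound at run time, a common mitigation is to pay the reflection cost once and build a strongly typed delegate for the repeated calls. Here is a sketch, assuming EmptyStaticFunction is a public static void method taking no arguments:

// requires: using System; using System.Reflection;
MethodInfo method = typeof(Class).GetMethod("EmptyStaticFunction");

// Slow path: full reflection dispatch on every call (~470 units in Figure 1).
method.Invoke(null, null);

// Faster: a one-time Delegate.CreateDelegate, then near-normal call speed
// (comparable to the ~1.2 units measured for delegate invocation).
Action fast = (Action)Delegate.CreateDelegate(typeof(Action), method);
fast();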

Finally, you can tell that when security checks are suppressed, invoking unmanaged code (P/Invoke) is not too expensive (6 times a normal method call), and the average cost can get even cheaper (2.6 times) if many calls are made from the same method. However, if security checks are used (which is the default), the cost is significantly more (27 to 30 times a normal method call).

Already, some useful observations have been made from the built-in MeasureIt benchmarks. Moreover, since the MeasureIt download includes source code, you can get all the necessary details to drill deeper into the data. For example, you may want to know exactly how to suppress the security checks for P/Invoke calls (or even how to make a call to native code in the first place). To do so, just unpack the source using the following command:

measureIt /edit

Then search for the P/Invoke benchmark. You'll get the exact code you need, which you can examine, debug, and even modify for your own purposes.
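
What you will find is roughly the following pattern (a sketch; GetTickCount is just a convenient example of a native function, and the attribute lives in the System.Security namespace):

using System.Runtime.InteropServices;
using System.Security;

static class NativeMethods {
    // Default: the runtime performs a security check on each call.
    [DllImport("kernel32.dll")]
    public static extern uint GetTickCount();

    // With the check suppressed the call is several times cheaper, but the
    // assembly takes full responsibility for the safety of the native code.
    [SuppressUnmanagedCodeSecurity]
    [DllImport("kernel32.dll", EntryPoint = "GetTickCount")]
    public static extern uint GetTickCountFast();
}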

Validating the Data

While MeasureIt quickly tells you the costs of some of the basic operations performed by the .NET runtime, it will not tell you why the numbers are what they are. It is very possible that you are not measuring what you think you are measuring.

In last month's column I highlighted how easy it is to create misleading benchmarks (especially micro-benchmarks), underscoring the importance of validating performance measurements before relying on them. Remember that you must also validate the data from the built-in benchmarks.

Here is where some expertise in the internals of the .NET runtime is helpful. Knowing exactly how operations within the runtime are translated into machine instructions lets me estimate how long various operations should take. This is summarized in Figure 2. Because of the optimizations the hardware itself performs, instruction counts do not map precisely to execution time; however, they do provide a good ballpark estimate. If you ever find that measured execution time far exceeds what the instruction count suggests, it is more than likely that the data is incorrect. The data shown in Figure 2 is best thought of as a qualitative view of the operations, which nicely complements MeasureIt's quantitative view.

Figure 2 Performance of .NET Runtime Operations from MeasureIt

Operation Instruction Count Comments
Integer Arithmetic 1 Simple unchecked arithmetic compiles to its corresponding machine instruction and often has a cost of less than one machine cycle.
Floating Point Arithmetic 1 Compiles to x87 instructions (for 32-bit). Currently, the runtime does not do vectorization or strongly leverage newer hardware instructions (SSE2), so typically the best unmanaged compilers will do better on floating point intensive apps.
Instance Field Fetch or Set 1-10 Most operations take 1 instruction. However, setting a field that is an object reference (not a primitive type) requires the use of a write barrier routine that takes 6-10 instructions.
Static Field Fetch 1-12 Only 1 instruction is required in the common case of just-in-time (JIT) compiled code that runs in one AppDomain. However, hosts like ASP.NET can ask the runtime to generate code that can be shared across all AppDomains (saving a lot of memory and JIT time) at the expense of making a static field fetch about 12 instructions. The JIT compiler does hoist common operations like these out of loops, partially mitigating this cost. Code precompiled with native image generation (Ngen) is generated in such a way that it will work regardless of the code sharing requested, which means that even in the best case, Ngen-generated code takes 6 instructions to fetch a static field.
Non-Virtual or Static Method Call 1 Compiles to a single call instruction and is the fastest kind of call.
Virtual Method Call 2 Uses dispatch tables just like C++. This is the fastest kind of indirect call.
Interface Method Call 4-20 Interface methods are dispatched through stubs that are very fast (4 total instructions) when the particular call site almost always has the same target. When a single call site dispatches to many targets, it takes 10-20 instructions to do a hash table lookup to dispatch.
Delegate Dispatch 4-15 If there is a single target method and it is an instance (non-static) method, which is the common case, it takes 4 instructions and is comparable in speed to any other type of indirect dispatch. When the target is static, the arguments need to be shuffled to remove the unneeded "this" pointer that was passed, which can take several instructions. When a delegate has several subscribers (events can do this), a dispatch must have a loop and is more expensive, but it is a less common scenario.
Object Allocation 10-1,000+ The code path in "new" for most objects takes 10-15 instructions; however, some object types (such as finalizable objects) cost much more. In all cases, the allocation incurs additional costs that are paid later in bulk. These include the cost of clearing the memory (proportional to its size) and scanning costs during some garbage collections (GCs) for as long as the object lives. Objects that live a short time incur significantly less GC overhead than objects that live longer; however, the cost is still high enough that allocation should be minimized on performance-critical code paths.
Array Access 1-25 Array access is typically two instructions: bounds check and fetch. On simple loops where you iterate over all elements in an array, the JIT compiler can eliminate the bounds check, making it just one instruction. In fact, as a result of this optimization, using unsafe pointer access instead of array access will often not improve performance significantly. If the element being set is an object reference, the set requires a write barrier and type check, which increases the size of the code path to about 25 instructions.
Casting 4-100+ Successfully casting an object to its exact type is fast (4 instructions); however, casting to a superclass is slower, and casting to an interface is slower still. Casting arrays is even slower. Failed casts (often the case when using the C# "is" or "as" operators) are also relatively slow.
Locking 20-1,000+ Entering and exiting a lock (System.Threading.Monitor.Enter or the C# lock statement) is moderately expensive (10 to 15 times a call-ret), even in the best case. If there is contention for the lock, it can be significantly slower (see the sketch following this table).
P/Invoke Call 15-1,000+ Calling native code (P/Invoke) from managed code, in the best case when security checks are suppressed and no argument conversions are needed, is 15-20 instructions. It is much worse when strings are passed (requires conversion) and/or COM interop is involved.
Reflection 1,000-10,000+ Calling simple reflection APIs on types (such as System.Type.GetType) can be cheap (< 10 instructions); however, using reflection APIs to do almost anything else, such as call methods, set fields, or create objects, is significantly more expensive (10 to 100 times) than its non-reflection alternative. As a rule, reflection should not be used in performance-critical code paths.
Generics any Generic types can have performance similar to their non-generic counterpart. When the type argument of the generic type is a class (rather than a struct), the code is shared among all instances of the type. This sharing saves space, but it can mean that operations that depend on the type parameter may be slower than expected. If you use generics on a performance-critical path, you should measure.
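
As a concrete illustration of the Locking row above (a minimal sketch): the C# lock statement compiles down to Monitor.Enter and a Monitor.Exit in a finally block, so even an uncontended lock pays for that pair of calls.

// requires: using System.Threading;
static readonly object syncRoot = new object();
static int counter;

static void Increment() {
    lock (syncRoot) {            // Monitor.Enter(syncRoot)
        counter++;
    }                            // Monitor.Exit(syncRoot) in an implicit finally
}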

The information in Figure 2 is useful for validating the performance numbers that MeasureIt provides, but it also allows you to make rough predictions about the performance of your programs and the costs of various design trade-offs. For example, you can predict the cost trade-off between two code factoring designs that differ in how often they cross a managed-to-unmanaged (P/Invoke) boundary. You can also make predictions about how much would be saved by avoiding reflection methods or the cost of adding locks to a program to make it thread safe.
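
For instance, here is a sketch of the P/Invoke boundary trade-off (native.dll and its two exports are hypothetical): the chatty design pays the 15+ instruction transition once per element, while the chunky design pays it once per batch.

// requires: using System.Runtime.InteropServices;
// Hypothetical native exports, for illustration only.
[DllImport("native.dll")] static extern void ProcessOne(int value);
[DllImport("native.dll")] static extern void ProcessMany(int[] values, int count);

static void Chatty(int[] data) {
    for (int i = 0; i < data.Length; i++)
        ProcessOne(data[i]);         // one managed-to-native transition per element
}

static void Chunky(int[] data) {
    ProcessMany(data, data.Length);  // one transition for the entire batch
}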

Caveats

Personally, I have found the data from MeasureIt to be very helpful in making many performance decisions. However, the built-in benchmarks focus on CPU-intensive operations. If your application's performance is not limited by the CPU, a focus on CPU optimization is misguided. There are three common scenarios where the CPU is not the critical performance factor: response time is dominated by I/O costs or network latency; response time is dominated by memory caching costs; or, in a multithreaded application, response time suffers from thread serialization delays.

The performance of some applications is limited by the speed at which they can execute disk and network I/O, which is many times slower than actual instruction execution. This can happen on application startup when the application requests files that are not yet cached in memory (cold startup). It can also happen when your application consumes enough memory to cause many page faults. In general, if your application is not consuming 100 percent of a processor, then these other latencies are important. Most of the information I present here is not relevant to this case since this column focuses on identifying CPU-bound performance bottlenecks.

In other applications, response time is dominated by memory caching costs. This typically happens when you have large, in-memory data structures. The application looks CPU bound (it consumes 100 percent of a processor), but the CPU is actually waiting on the memory subsystem most of the time. You need a profiler with access to statistics on the memory subsystem to diagnose this properly, but large memory consumption (> 50MB) and a large number of page faults are usually an indication that memory issues are a factor. Generally, your design strategy in this case should be to make things small (even if it increases the number of instructions executed).

In multithreaded applications operating on multiprocessor hardware, it is not uncommon for the threads to be delayed as they wait to enter critical sections of code—code that must be accessed by only one thread at a time. These serialization delays tend to grow worse as the number of concurrently executing threads grows. This lock contention can show up either as being non-CPU bound (when the wait time is long and the waiting thread goes to sleep) or CPU bound (when the wait time is very short but happens very frequently).

Sadly, explaining these non-CPU issues well enough to impart good design intuition would require another full column or two. Just remember that memory consumption does matter (especially for large applications), and you should try to decrease the size of data structures that are larger than 1MB if the trade-off in instruction count is modest.

However, if your application consumes 100 percent of one CPU when it is running (with very few or no page faults), consumes a modest amount of memory when CPU bound (hot data structures are less than 1MB), and is not intended to be a highly parallelized application that relies on shared data, then your performance issue is likely related to the number of instructions that execute. Therefore, the information in Figure 2 and the data MeasureIt provides will be useful in predicting performance.

The Uneven Performance Landscape

The .NET runtime team goes to some trouble to optimize operations whenever possible. Generally speaking, this is a great thing, but it does create an uneven performance landscape because there are cases where the optimization can't be applied. We have already seen an example of this: finalizable allocation. Here are some more examples.

First of all, interface calls are highly tuned for the case in which a particular call site tends to dispatch to the same target. For example, you could have a routine that takes a List<object> and calls IComparable.CompareTo on each element. If it were passed a list of strings, the interface call would always go to String.CompareTo and would be fast. However, if the list contained many different types, that same call site would have to dispatch to many different targets. This latter case is significantly slower than the homogeneous case.
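
In code, the fast and slow cases can share the exact same call site and differ only in the data (a sketch; each element is compared with itself simply to exercise the dispatch):

// requires: using System; using System.Collections.Generic;
static void CompareAll(List<object> items) {
    for (int i = 0; i < items.Count; i++)
        ((IComparable)items[i]).CompareTo(items[i]);   // a single call site
}

// Homogeneous: the call site always lands on String.CompareTo, so the
// fast one-target dispatch stub is used.
List<object> strings = new List<object> { "a", "b", "c" };

// Heterogeneous: the same call site must dispatch to many targets,
// forcing the slower hash-lookup path.
List<object> mixed = new List<object> { "a", 1, 2.0, DateTime.Now };
CompareAll(strings);
CompareAll(mixed);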

Second, when primitive types (such as int) need to be passed to operations expecting an object, the value must be converted to an object (this operation is known as boxing). This requires an object allocation and thus is more expensive than expected. Often this boxing is inserted by the compiler automatically, making it all too easy to fall into this trap.
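
A sketch of the trap and the generic alternative:

// requires: using System.Collections; using System.Collections.Generic;
int i = 42;
object boxed = i;               // explicit box: a heap allocation

ArrayList oldList = new ArrayList();
oldList.Add(i);                 // implicit box inserted by the compiler

List<int> newList = new List<int>();
newList.Add(i);                 // no box: the generic list stores the int directly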

Third, methods that use the variable-argument params feature of C# (such as Console.WriteLine) need to allocate an array (and possibly box objects for some parameters) on each call. This makes the call much more expensive than a normal call (by a factor of 10). There are dozens more examples I could list, but these will suffice.
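
For example (a sketch):

int a = 1, b = 2, c = 3, d = 4;

// Binds to WriteLine(string format, params object[] args): each call
// allocates an object[4] and boxes all four integers.
Console.WriteLine("{0} {1} {2} {3}", a, b, c, d);

// Binds to the non-params WriteLine(string, object) overload: a is still
// boxed, but no hidden array is allocated.
Console.WriteLine("{0}", a);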

While people who know about the internals of the runtime can warn you about some of these, realistically the list will never be complete. This is why the information in Figure 2 and the data gathered from MeasureIt should only be used as a rough approximation. You never know when you might fall into a bad performance code path. Thus it pays to be cautious: if there is even a possibility that your hot code path might not be optimized, you should not rely solely on deductions made from Figure 2 but instead write a quick micro-benchmark that very accurately represents your hot code path and measure it yourself.
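
A minimal harness for such a spot check might look like the following (a sketch using System.Diagnostics.Stopwatch; MeasureIt's own harness adds warm-up logic and statistics and is the better choice for anything you intend to rely on):

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class QuickBench {
    static void Main() {
        const int iterations = 1000000;

        MyHotPath();                       // warm up: force JIT compilation first

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            MyHotPath();
        sw.Stop();

        Console.WriteLine("{0:f1} ns/call",
            sw.Elapsed.TotalMilliseconds * 1000000.0 / iterations);
    }

    [MethodImpl(MethodImplOptions.NoInlining)] // keep the placeholder from being inlined away
    static void MyHotPath() { /* the code being measured goes here */ }
}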

Send your questions and comments to clrinout@microsoft.com.

Vance Morrison is the Compiler Architect for the .NET runtime at Microsoft, where he has been involved in the design of .NET since its inception. Vance drove the design for the .NET Intermediate Language and was lead for the just-in-time compiler team.