Performance Quiz #9 : IList<T> List and array speed -- solution

Last week I posted Performance Quiz #9 for discussion. Well the comments have been just awesome and many people went right ahead and measured the results.  Some of you have discovered the key point in this particular test:

Arrays are magic.

In the test case as given there's a great deal of repeated work extracting the length of the array and accessing the items.  This is because of the unusual property that arrays have -- they implement IList<T> for potetially more than one T even due to inheritance.  In the interest of economy alone then it is worthwhile to consolidate the IList<T> implementations into some kind of secret helper but this has some consequences.

Luckily there are also alternatives -- the example that I provided is perhaps the worst performing choice.  As one reader pointed out you could make it much better by extracting the count from the IList once and then using that in the loop.  Note that the compiler cannot do this because IList<T>.Count is a virtual method and so we cannot assume anything about the implementation (it could sent a letter to grandma every time it is called and grammy wants her email!)

Here are the results from my program (source code attached, see link at the very bottom)

Test Case Milliseconds
Test1: Array 54
Test2: List<> 8
Test3: ArrayWrapper 14
Test4: Array via foreach 9
Test5: List<> via foreach            11
Test6: Array via special 6
Test7: List<> via special 8

The first two rows are the problem as given and we can easily see that the array is performing much more slowly than the List.  Let's see where the cost is:

Excl% Incl% Name
0.00 51.27 ArrayTest.Program.Test1(uint16[])
1.64 51.27   ArrayTest.Program.Test.Sum(System.Collections.Generic.IList`1)
2.38 25.64     System.SZArrayHelper.get_Item(int32)
5.47 23.26         JIT_IsInstanceOfArray(...)
4.72 7.26           ArrayIsInstanceOfNoGC(...)
2.63 10.02           SigTypeContext::InitTypeContext(...)
0.49 22.40     System.SZArrayHelper.get_Count()
5.36 21.91         JIT_IsInstanceOfArray(...)
4.30 6.99           ArrayIsInstanceOfNoGC(...)
2.40 9.07           SigTypeContext::InitTypeContext(...)

Well it pretty much leaps off the page, the cost is in those helpers and you can see they're doing a good bit of internal work to verify that we are talking about a simple array of a non-collectable type (ushort) and then more work happens to get to the real data.  The tragedy here is that the work done to validate what we need to do far exceeds the actual job at hand -- which in both cases is just extracting one integer from a well known location.  The price of abstraction....  Though I'm pretty sure there is room for improvement in there.

Let's look at some of the other alternatives.  First how did the List<T> fare:

Excl% Incl% Name
0.00 8.62 ArrayTest.Program.Test2(System.Collections.Generic.List`1)
1.06 8.62     ArrayTest.Program.Test.Sum(System.Collections.Generic.IList`1)
5.82 5.82         CLRStubOrUnknownAddress
1.54 1.54         System.Collections.Generic.List`1.get_Item(int32)

In the list case there's still a good bit of dispatch code but you can see it's much cheaper.  We're on the sweet path for regular interface dispatch.  The actual count function calls are stil there if we were to look at lower percentages but I pruned anything with inclusive time below 1% in this example so it's not showing up.  We're doing a lot better.

Now what about some of the other results?  Well, now we're in the bonus marks zone and I don't want to drown you in perf results you can get yourself so I'll summarize a bit.

In Test 3 I made a generic wrapper class to hold the array.  This adds a level of indirection but it avoids all the magic array processing.  My wrapper is highly lame (it only implements count and get item) but you can see that it's doing fairly well performance wise.  Much better than the raw array and it's a cheap object.

But the fun continues:  What if we replace that for loop in the original test with a foreach?  We can't get the ultra-special array foreach construct but we can get a nice standard foreach over an IList in both cases.  Both Test4 and Test5 do introduce one allocation for the enumerator (which could be an issue if there were gazillions of these lists) but for our test case that's not even measurable (indeed I get no samples in the allocator).

And the result?  It's pretty sweet:  The array problem goes away entirely because we only have to do the array testing one time.  After that we've got a nice bound up enumerator that knows all the details of the array -- no more redundant computations.   The array works great.. but the list is a bit slower?  Why would that be?  Well there's some extra safety checks in the enumerator access code for List<T> to make sure the enumerator isn't being used on an List that is being modified for instance.  Those checks slow down the list access a bit.

And the last two tests:  What if we special case the array in the sum function; assuming that was a very common/important case.  You can see that test6 gives the best result of all -- now there isn't any enumerator and we're on the most optimized code path of any -- foreach over a plain array.

But why is the List case faster in test7?  Shouldn't it be pretty much the same as test5?  After all there's an extra test and otherwise it's the same.  Now there's a mystery...  I actually ran a variety of control experiments, doing the tests in different order and with different array members to make sure this wasn't an anomolous result -- you can imagine with all these iterations cache effects would be important.  But no matter what I did the code in test7 always went faster than the code in test5.  Adding more iterations actually magnifies the effect.

The profile isn't especially helpful either -- both Test5 and Test7 have exactly the same call shape but Test7 spends less time in CLR stubs apparently.  At the moment my best theory is that we happen to get better cache alignment in Test7 than Test5 -- less instructions split across cache lines or fewer unfortunate cache evictions because of happenstance.

Such is the life of the performance engineer -- we sometimes get bit by a secondary effect.

So shall we continue?  What is happening to make Test7 faster than Test5?

And this seems like a great place to thank my colleague Vance Morrison for chiming in with the foreach and special case approaches when I first was bouncing this analysis around.  Thanks a ton Vance :)