Variability in Benchmarks

I’m not an especially great fan of micro-benchmarks. They’re handy as a complement to the larger tests, but I often find that they lead to confusion because they are interpreted as reflecting reality when nothing could be further from the truth.  The fact is that micro-benchmarks are intended to magnify some particular phenomenon so that it can be studied.  You can expect that normal things like cache pressure, and even memory consumption generally, will be hidden – often on purpose for a particular test or tests.  Nonetheless many people, including me, feel more comfortable when there is a battery of micro-benchmarks to back up the “real” tests, because micro-benchmarks are much more specific and therefore more actionable.

However, benchmarks in general have serious stability issues.  Sometimes even run-to-run stability over the same test bits is hard to achieve.  I have some words of caution regarding the variability of results in micro-benchmarks, and benchmarks generally.

Even if you do your best to address the largest sources of internal variability in a benchmark – with techniques like controlling GCs, adequate warm-up, synchronizing participating processes, making sensible affinity choices, and whatnot – you are still left with significant sources of what I will call external variability.
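
To make those internal controls concrete, here is a minimal sketch of such a measurement loop in Python.  The specific knobs – the warm-up count, the iteration count, pinning to CPU 0 – are placeholders for illustration, not recommendations.

```python
import gc
import os
import time

def measure(workload, iterations=30, warmup=5):
    """Minimal benchmark loop: pin to one CPU, warm up, quiesce the GC,
    then collect per-iteration timings.  All the knobs are illustrative."""
    # Pin this process to a single CPU so the scheduler can't migrate it
    # mid-run (Linux-only call; skipped elsewhere).
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {0})

    # Warm-up iterations: let caches, JITs, and lazy initialization settle.
    for _ in range(warmup):
        workload()

    samples = []
    for _ in range(iterations):
        # Collect garbage now and disable the collector so a GC pause
        # doesn't land in the middle of a timed iteration.
        gc.collect()
        gc.disable()
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
        gc.enable()

    return samples
```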

The symptom of external variability is that you’ll get nice tight results from your benchmark (a good benchmark might have a Coefficient of Variation of well below 1% – there’s a sketch of that calculation after this list) but the observed result is much slower (or, rarely, faster) than a typical run.  The underlying cause varies; some typical ones are:

  • Scheduler variability – the scheduler makes an unfortunate choice and sticks with it
    • The entire run is much slower than normal
  • Uncontrolled background activity external to the test
    • Part of the run is slower, maybe several consecutive test cases in a suite
  • Interference from attached devices
    • Incoming network packets or other sources of interrupts
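
A low coefficient of variation by itself doesn’t rule any of this out: a run can be internally tight and still be uniformly slow.  Here’s a small sketch of the two checks I have in mind (in Python; the 5% drift threshold and the baseline value are arbitrary choices for illustration):

```python
import statistics

def coefficient_of_variation(samples):
    """CV = standard deviation / mean, as a percentage.  A tight
    micro-benchmark run should come in well under 1%."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)

def run_is_suspect(samples, baseline_mean, tolerance=0.05):
    """Flag runs whose mean drifts more than `tolerance` (5% here,
    arbitrarily) from the historical baseline - even if the CV is tiny."""
    return abs(statistics.mean(samples) - baseline_mean) > tolerance * baseline_mean
```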

These problems are often not the dominant sources of variability at first, but as the benchmark is improved and the other factors are controlled for, they become the dominant ones.

Statistically, the underlying issue is this: repeated benchmark measurements in a run are not independent – a phenomenon affecting a particular iteration is highly likely to also affect the next iteration.  A closer statistical model might be independent measurements punctuated by a Poisson process that produces long-lasting disruptions.  "Waiter, there is a fish in my benchmark."
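
If you want to see what that looks like, here is a toy simulation of the model.  The rates, durations, and magnitudes are made up; the point is only the shape – independent samples with occasional disruptions that taint a whole stretch of consecutive iterations.

```python
import random

def simulate_run(iterations=1000, disruption_rate=0.01,
                 disruption_length=25, slowdown=1.5):
    """Toy model: i.i.d. samples punctuated by disruptions that arrive at
    a fixed rate and then slow down many consecutive iterations."""
    samples = []
    remaining = 0
    for _ in range(iterations):
        # Disruptions arrive at `disruption_rate` per iteration...
        if remaining == 0 and random.random() < disruption_rate:
            remaining = disruption_length
        value = random.gauss(1.0, 0.005)   # the "real" tight measurement
        if remaining > 0:
            value *= slowdown              # ...and every sample they touch is slow
            remaining -= 1
        samples.append(value)
    return samples
```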

My experience with this sort of thing, and I expect the battle-scarred will agree, is that reducing external variability is an ongoing war.  There isn’t really a cure.  But it does pay not to have a hair-trigger reflex on regression warnings, and to screen for false positives in your results.
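
One cheap screen – and this is only a sketch of the idea, with an arbitrary 3% threshold – is to insist that a regression reproduce across several fresh runs, and to compare medians rather than means so a single disrupted stretch doesn’t carry the verdict.

```python
import statistics

def confirmed_regression(historical_medians, fresh_runs, threshold=0.03):
    """Treat a regression as real only if every one of several fresh runs
    has a median at least `threshold` above the historical median.
    A single slow run, on its own, is not actionable."""
    baseline = statistics.median(historical_medians)
    return all(
        statistics.median(run) > (1.0 + threshold) * baseline
        for run in fresh_runs
    )
```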