Using telemetry to reveal, prevent, and act on real problems in the wild

I hesitate to bring this up but there's no sense hiding it.  For the last year I've been the development lead for the Internet Explorer Performance Team.  We've done some really cool things that I'm super proud of, like this recent gem.

However, things are usually not so rosy in my world.  There are lots of changes on any given day and many of them can affect things adversely.  So it's super important to be watchful.  And it's also important to be watchful in a smart way.  But even if you do all that, things can still fall through the cracks.  This post is about one such incident.  But first, a little background.

The IE performance team, amazingly, measures IE performance.  We do it in all of the development branches of IE (there are many) and in the master branch.  Additionally, our central performance team does measurements in literally every Windows development branch, but they run only a subset of the IE benchmarks there.

Most *new* IE performance problems, "regressions", originate in the branches where IE developers are working -- that is only natural.  However, even though upwards of 99% of changes not related to IE have no impact on IE benchmarks, there is still the 1% or so that do.  It doesn't make sense to burden every other team with checking every IE benchmark when 99% of the time they are unaffected, so my team is also responsible for finding cases where we have inadvertently been affected by something in Windows generally, and getting to the bottom of it.

We have a large suite of benchmarks to catch these things and on balance they do a very good job.  But they are inherently imperfect.  You just can't cover everything in a modest set of benchmarks because there aren't enough hours in the day to run every test you might want in every configuration.  So sometimes things get through.  And you can't really count on your benchmarks to give you a sense of what customers are really seeing (even if those customers are still just Microsoft employees trying out the latest build).  The diversity of configurations you can test in your lab is nothing like the diversity in the wild, to say nothing of the diversity of usage scenarios.  So while suites help a lot, you need to supplement them with telemetry.

Why is this so important?  Well, consider this recent example.  A fairly modest mistake was made in semaphore handling, largely unrelated to the usual things the IE engine has to do like formatting, layout, and so on.  And in fact none of the usual benchmarks detected anything abnormal.  So when it came time to commit our code to the master build we were happy to do so.  However, shortly thereafter, we started getting reports of badness.
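To make the failure mode concrete, here is a minimal hypothetical sketch of how a slip like this can hide from functional tests.  The function names and structure are invented and are not the actual IE code; the only detail borrowed from the real incident is the 5-second timeout discussed below.

```cpp
// Hypothetical sketch, not the real IE code: a worker releases a semaphore
// when an asynchronous image download completes, and a consumer waits on it
// with a 5-second timeout.  If an error path forgets the release, nothing
// fails outright; the wait just burns the full 5 seconds and takes the
// fallback path.
#include <windows.h>

HANDLE g_downloadDone = CreateSemaphoreW(nullptr, 0, 1, nullptr);

void OnDownloadComplete(bool succeeded)
{
    if (succeeded)
    {
        ReleaseSemaphore(g_downloadDone, 1, nullptr);
    }
    // BUG: on the failure path the semaphore is never released, so the
    // waiter below always times out instead of waking up immediately.
}

void WaitForImage()
{
    // Returns WAIT_TIMEOUT after exactly 5000 ms whenever the release is missed.
    DWORD result = WaitForSingleObject(g_downloadDone, 5000 /* ms */);
    if (result == WAIT_TIMEOUT)
    {
        // Fall back to a slower path: still functionally correct, just 5s late.
    }
}
```

Because the fallback still produces a correct result, functional tests pass, and a benchmark only notices if it happens to exercise the broken path while someone is watching the wall clock.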

Unfortunately, even in an engineer-rich environment like Microsoft you get a lot of reports that are not very actionable.  Very early on, two people reported that recent builds of IE were much slower and very nearly unusable.  However, neither of them had any data that could be used to track the problem down, and in both cases whatever it was went away by the next day.  This does not mean that they were not experiencing a real problem, but this is the kind of report that leaves you with little to do but cry about it.  Not especially helpful for improving things, though possibly therapeutic.  A few more days went by and others started making similar reports.  Most of the bug reports were just as un-actionable as the first.  I won't say unhelpful, because at least we knew something very bad had happened.  Thankfully, though, some very tenacious people were now gathering traces for us, some of our automation was starting to notice the problem, and automated reports with more logging were starting to arrive.  Things were looking up.

Shortly after the first traces arrived we had a positive diagnosis.  The problem was fairly simple and had even been specifically logged previously, and even fixed, but not correctly.  Those of us looking at this data and hearing the various reports also had the job of making sure that the reports all had the same root cause; we couldn't rule out the possibility that there were two, or three, or more different things people were experiencing that were producing this behavior.

Again, lucky for us, this particular bug was not insidious.  While it was regrettable, it was comparatively easy to fix, and soon a fix was available internally.  But the whole experience made me think about how we might have caught the problem sooner, and how we could have determined more conclusively that it really was only the one problem, with less manual examination of every trace we could get our hands on.

Here the telemetry really shines, and I wish we had looked at it more closely before we committed our changes because I suspect even with a small N there would have been enough data to show that something was wrong.

The following chart shows the distribution of navigation times in the wild (uncontrolled activity) for our internal users before the bug was introduced (in blue) and after the bug (in orange).  In this view the badness just leaps off the page: there is a giant spike at the 10s mark.  The nature of this bug is that it causes delays of exactly 5s during the download of images, due to hitting a 5s timeout, so a spike at 10s rather than 5s tells us that affected navigations were hitting the timeout more than once; the data shows clearly that we never see just one such delay.  It also shows that, disregarding this one (bad) problem, the new/orange version of IE was actually superior to the previous/blue version.
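If you want to run the same kind of sanity check on your own numbers, the aggregation itself is simple.  Here is a hedged sketch: the bucket boundaries and sample values are invented for illustration, and the real distributions of course come out of the telemetry pipeline rather than hard-coded vectors.

```cpp
// Minimal sketch: bucket per-navigation times (in ms) for two builds into a
// coarse histogram so that a pile-up at ~10s stands out immediately.
// All numbers below are illustrative, not the actual telemetry data.
#include <cstdio>
#include <map>
#include <vector>

// Round a navigation time down to its bucket's lower bound (in ms).
static int Bucket(int ms)
{
    static const int bounds[] = { 0, 125, 250, 500, 1000, 2500, 5000, 10000 };
    int bucket = bounds[0];
    for (int b : bounds)
    {
        if (ms >= b) bucket = b;
    }
    return bucket;
}

static void PrintHistogram(const char* label, const std::vector<int>& timesMs)
{
    std::map<int, int> counts;
    for (int ms : timesMs) ++counts[Bucket(ms)];
    printf("%s\n", label);
    for (const auto& kv : counts)
    {
        printf("  >= %5d ms : %d navigations\n", kv.first, kv.second);
    }
}

int main()
{
    std::vector<int> oldBuild = { 130, 180, 400, 650, 900, 1200, 2200 };     // "blue"
    std::vector<int> newBuild = { 120, 150, 350, 700, 10050, 10120, 10200 }; // "orange"
    PrintHistogram("old build (before the bug)", oldBuild);
    PrintHistogram("new build (with the bug)", newBuild);
    return 0;
}
```

Even with a handful of samples per build, the 10s bucket in the buggy build is impossible to miss, which is exactly the point about small N above.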

What's more, the telemetry provided us with two other interesting facts: all builds report a spike at around 125ms or so (why?), and the old build had a spike around 10s as well (why?).  You can be sure we'll be looking into both soon.

Lesson for me: look at this telemetry sooner; it's proving to be a useful tool in judging build quality.

Good lesson for everyone.

*Thanks to Amy Hartwig for providing me with the charts on a silver platter.