Knowing when to Ship: Part 2 - Interpreting Quality Metrics
In the first installment of this series, I detailed a list of key metrics used to determine when a product is ready to ship. Today, I'm going to talk about how I interpret these metrics.
Avoid out of context metrics
The first thing to remember when looking at quality metrics is to look at the data in groups (at least pairs). Individual metrics can be misleading if taken out of context. Consider pass rate, for example. Deciding to ship a product based on pass rate alone is a scary proposition. Without understanding how much of the product was covered by the tests, it is easy to have a false sense of security about the product's quality and ship it too soon. The same is true for code coverage: just because code was called does not mean it behaved correctly. Always consider the pass rate as well.
- High pass rate (~100%) and low code coverage (<50%)
- High code coverage (ex: 90%) and low pass rate
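To make the pairing concrete, here is a minimal Python sketch that checks pass rate and code coverage together and reports the two combinations listed above. The `QualitySnapshot` type, the `red_flags` function, and the cutoff for a "low" pass rate are assumptions made for illustration; only the ~100%, <50%, and 90% figures come from the list.

```python
from dataclasses import dataclass


@dataclass
class QualitySnapshot:
    pass_rate: float      # fraction of tests passing, 0.0 to 1.0
    code_coverage: float  # fraction of code exercised by the tests, 0.0 to 1.0


def red_flags(snapshot, high_pass=0.99, low_coverage=0.50,
              high_coverage=0.90, low_pass=0.90):
    """Return warnings that only show up when the two metrics are read together.

    low_pass is an assumed cutoff; pick whatever your team considers low.
    """
    flags = []
    if snapshot.pass_rate >= high_pass and snapshot.code_coverage < low_coverage:
        flags.append("high pass rate but low code coverage")
    if snapshot.code_coverage >= high_coverage and snapshot.pass_rate < low_pass:
        flags.append("high code coverage but low pass rate")
    return flags
```

For example, `red_flags(QualitySnapshot(pass_rate=1.0, code_coverage=0.40))` reports the first combination, even though a 100% pass rate looks reassuring when read alone.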
Beware of large amounts of code volatility / churn
To me, the exception that proves the rule of "avoid out of context metrics" is code volatility / churn. Let me take that back... there is still contextual data to consider when looking at code volatility: time. When looking at code volatility, be mindful of where the product is in its schedule. Is it in the initial development phase? In stabilization? The middle of the final test pass? Code changes. It's not a bad thing; it's how bugs get fixed. The important thing to remember is that the more the code changes, the more testing is required to ensure the product meets its quality requirements.
- Large amounts of code change close to release (including betas)
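One way to keep time in the picture is to look at churn and the release date together. The sketch below is only an illustration: the function name, the 5% churn threshold, and the 14-day window are hypothetical values I chose for the example, not recommendations.

```python
def churn_is_red_flag(lines_changed, total_lines, days_to_release,
                      churn_threshold=0.05, window_days=14):
    """Flag a large amount of code change that lands close to a release.

    churn_threshold and window_days are assumed values; tune them to
    your own product and schedule.
    """
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    churn = lines_changed / total_lines
    return days_to_release <= window_days and churn >= churn_threshold
```

The same 10% churn that is unremarkable four months before release (`days_to_release=120`) is flagged one week out (`days_to_release=7`), which is the point: the amount of change means little without the schedule.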
Beware of low code coverage percentage
Though I spoke briefly about code coverage in the "avoid out of context metrics" section, it deserves some additional (pardon the pun) coverage here. In my earlier post, I mentioned that I consider code coverage to be a measurement of risk. It is for this reason that I advise careful tracking of the code coverage of your product.
At MEDC, I said that code that has not been covered cannot have its quality quantified. I personally consider uncovered code likely to be buggy. Even if the code has been reviewed carefully by a group of very talented developers, there may be bugs hiding in it. Race conditions are a prime example of a bug that arises in code that "looks right" but does not actually behave correctly.
Additionally, as Ryan Chapman has stated in his MEDC performance sessions, the larger a binary is, the longer it takes to load. If functionality is not necessary for the application, as possibly evidenced by the code not being covered, it acts as "excess baggage" and can lead to longer load times.
Interpreting performance data is, to me, an interesting topic. When I approach performance, I have the following questions in mind:
- On what device was this data collected?
- How long did the test run / how many iterations of the scenario were measured?
- What reporting technique was used (fastest or "gymnastics" style)?
In my mind, the most important of these questions is the first one (what device). When comparing performance data, it is very important to compare results on the same device. For comparing performance, I define "device" to be the pairing of hardware plus operating system. The same hardware running different versions of the operating system (ex: Windows Mobile 5 and Windows Mobile 6) can display noticeable performance differences.
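A small sketch of that rule, in Python: treat the device as a (hardware, operating system) pair and refuse to compare two measurements unless the pairs match. The `Device`, `Measurement`, and `percent_change` names, and the sample hardware label, are hypothetical and exist only for this example.

```python
from typing import NamedTuple


class Device(NamedTuple):
    hardware: str
    os_version: str  # same hardware with a different OS is a different device


class Measurement(NamedTuple):
    device: Device
    seconds: float


def percent_change(baseline, candidate):
    """Percent change in run time, allowed only on the same device."""
    if baseline.device != candidate.device:
        raise ValueError("results were collected on different devices")
    return (candidate.seconds - baseline.seconds) / baseline.seconds * 100.0
```

With this definition, two runs on the same hardware under Windows Mobile 5 and Windows Mobile 6 are rejected rather than silently compared.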
I talk in more detail about performance testing, including reporting techniques, here.
It is also useful to couple code coverage data with performance results. Running the tests a second time, under code coverage, helps to verify that the intended scenario is actually being measured.
- Code coverage data does not match intended scenario
- Measuring micro-benchmarks using too few (< 10000) iterations or too short (< 1 sec) total run time
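To avoid the last item, a micro-benchmark harness can keep running until both minimums are met. The following is a rough Python sketch under that assumption; `run_benchmark` is a hypothetical helper, and the defaults simply mirror the 10000-iteration and 1-second figures above.

```python
import time


def run_benchmark(scenario, min_iterations=10_000, min_seconds=1.0):
    """Run scenario() until BOTH the iteration and run-time minimums are met.

    Returns (iterations, total_seconds, mean_seconds_per_iteration).
    Note: reading the clock on every pass adds a small overhead to each
    iteration, which matters for very short scenarios.
    """
    iterations = 0
    start = time.perf_counter()
    while True:
        scenario()
        iterations += 1
        elapsed = time.perf_counter() - start
        if iterations >= min_iterations and elapsed >= min_seconds:
            break
    return iterations, elapsed, elapsed / iterations
```

A call such as `run_benchmark(lambda: sum(range(100)))` will run at least 10000 iterations and for at least one second, whichever takes longer to satisfy.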
This posting is provided "AS IS" with no warranties, and confers no rights.