Measuring test effectiveness
Before joining Microsoft, I was a developer at a call center and consulting company in Miami. I developed the applications the phone reps (or TSRs) used to take fulfillment orders for some of the big cruise lines, such as Princess, Royal Caribbean, NCL, and Celebrity. I primarily used three different languages for the apps I developed: Visual Basic (3 and 4) and Delphi 1 for the database front-end type of applications (I loved DBF files :)) and C/C++ for the telephony-related projects. The database front-end apps were very typical, not all that exciting, but they were one of the main tools of the business. The telephony projects were much cooler in a geeky sort of way. They mostly consisted of ActiveX controls, hosted by the database front-end apps, that interacted with our digital phones (I believe they were Meridian phones). In the beginning, we worked on little advances, such as dialing a customer's number automatically via the digital phone instead of using a modem and good old ATDT. Towards the end we were working on caller-ID-based database lookups, so that a customer's record would be up on the rep's screen by the time the call was answered, and eventually some predictive-dialer-type projects.
I worked there for about three years. I wrote tons of code throughout that period but never, ever wrote a single test. Releasing a new version consisted of me trying out whatever the new change was on my dev machine and, a couple of minutes later, deploying the new executable. The deployment process was pathetic. I'd drop the executable on a drive on our Netware network using one of two filenames we alternated between (e.g. app1.exe and app2.exe). Whenever a rep finished with their current call, they'd close out the app and launch the new executable. Whenever I think back to those times, I think we were crazy. But then again, it actually worked pretty well, or at least I don't remember running into many problems. Granted, if I missed any bugs, I'd get back on my dev box, fix them, drop a new EXE for everyone to switch to, and hope that one worked better. The bottom line was that we got away with it relatively unscathed without spending much time testing.
We were far from a real software house. The business was making and answering calls, lots of them, and sometimes setting up similar call centers for other companies (using the same software I developed). At Microsoft, it's a whole other world. Shipping is obviously a much bigger deal. Patching products after they've shipped costs tons of money for both us and our customers. The set of supported system configurations is huge. Far from the couple hundred 486sx machines I “shipped” to at that previous job.
Because of this, Microsoft invests a ton into testing its products. I was amazed when I first joined that there were people, and I was going to be one of them, who wrote code solely for the purpose of testing some other person's code. I thought it was great. I figured if I worked hard enough on my areas, they'd certainly ship bug-free in the end. It clearly didn't go that way. Once I had been around for a couple of months, my biggest worry was knowing when I had done enough testing on a feature. One could pretty much go on forever developing test cases. Knowing when to stop is one of the hardest things to get a good feel for as a tester, especially early on. Over time, you get more experience, ship a few releases, and start to get a better understanding of how it all works. But you still struggle to find ways to better determine when you're really done. Developers rarely run into this. They're usually pretty confident about whether they're done coding (not counting bug fixes) or still have some work to do.
On the C# QA team, we're trying to find ways to get better at this. This problem doesn't just happen at the individual level; it happens at the team level as much or even more so. When you're deciding at the team level what testing has to be done to ship an upcoming release, it's easy to try to do as much as possible. The more you've verified works correctly, the higher your confidence in the quality of the product. When you look back on a release, though, trying to identify ways of being more productive and spending more of your time on the areas that give you the most returns, you find that there were a lot of things you could have done better, more efficiently, and/or much less of.
We spend a ton of time automating tests that are far from cheap to maintain but rarely actually find product issues after they've passed once. Some obviously do, but the tests that don't significantly outnumber those that do. Yet, we still tend to want to write more and more tests in general. The problem is that it's really hard to figure out which tests we should spend time on. We do all sorts of testing, from very systematic, highly granular tests to user-scenario-based testing to exploratory testing to app-building to unit testing. We use data, such as code coverage data, to identify areas we need to focus on more. All of these things help, but none of them really makes it clear what's working and what's not.
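One simple way to act on per-test coverage data is a greedy "minimal suite" pass: repeatedly keep the test that covers the most not-yet-covered code, and see how many tests add nothing new. The sketch below is hypothetical Python, not anything we actually run; the test names and block IDs are invented for illustration:

```python
# Hypothetical sketch: given per-test code-coverage data (each test mapped to
# the set of code blocks it covers), greedily pick a small subset of tests
# that still covers everything the full suite covers.

def greedy_minimal_suite(coverage_by_test):
    """Repeatedly pick the test that covers the most not-yet-covered blocks."""
    remaining = set().union(*coverage_by_test.values())
    suite = []
    while remaining:
        # Choose the test adding the largest number of uncovered blocks
        # (ties go to whichever test appears first in the dict).
        best = max(coverage_by_test,
                   key=lambda t: len(coverage_by_test[t] & remaining))
        gained = coverage_by_test[best] & remaining
        if not gained:
            break  # nothing left that any test can cover
        suite.append(best)
        remaining -= gained
    return suite

# Invented example data: block IDs stand in for basic blocks or lines.
coverage = {
    "test_login":    {1, 2, 3, 4},
    "test_checkout": {3, 4, 5, 6},
    "test_search":   {2, 3},        # fully shadowed by test_login
    "test_reports":  {7},
}
print(greedy_minimal_suite(coverage))
# → ['test_login', 'test_checkout', 'test_reports']
```

Tests left out of the greedy suite (like `test_search` above) aren't necessarily worthless, but they're candidates for a closer look at what they're buying us relative to their maintenance cost.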
In the end, we have tons of data, but when you ask a lead or an individual what the quality level of their areas is, you'll usually get an answer that's largely made up of a "gut feel." Depending on the person's experience, that gut feel could be very accurate or far from reality. If you ask the same question of developers, they'll immediately talk about their bug counts, incoming and resolved rates, etc. For some reason, bug data can't be used as easily on the QA side to determine where we stand. It has to do with what testing we've done so far, what we still have planned, and the never-ending list of unforeseen things that always seem to come up along the way. Hard data is a lot harder to come by, or at least it has been for us.
My manager (who has an entry-less blog at this point but keeps saying he's going to get around to it) has asked me, and the other two leads on our team, to come up with ideas on ways we can improve in this space. Some of the questions we're looking to get better answers for are:
- When do we finish automating tests?
- How do we measure the QA team's effectiveness?
- What's the product's quality at a certain point in time?
- At what point do we think we've found all the high-pri bugs?
- How can we measure the value of one test versus another test?
- How do we know we've improved from one cycle to the next?
If I come up with anything, I'll probably share it here, in hopes of getting some thoughts/opinions/ideas from any of you out there willing to share your experiences.