Another uninformed analysis of the problems at healthcare.gov

It’s been interesting, and frustrating, watching the drama around healthcare.gov play out. Here in the real Washington, we’ve got a state-based exchange and while the first day or two was pretty ugly, it’s humming along pretty well now. Clearly things aren’t so rosy for the feds.

There has been lots of hand-wringing over the “how could this happen” question. To anybody who has built complex transactional systems, it’s pretty obvious that we’re simply seeing the consequences of an unfortunate gap between unit testing in isolation and integration testing at scale.

Systems like healthcare.gov involve a ton of different subparts: some manage your user account, others check eligibility with various agencies, still more crunch all your data to help search and filter and make plan recommendations, and so on.

All of these pieces ultimately have to work together, but trying to build them in one big clump is basically impossible --- there’s just too much going on all at once. So good engineers break big problems down into little ones and build each part separately. They also unit test each part on its own to make sure it behaves the way it’s supposed to.

Once the pieces are all built and have been tested in isolation, you plug them together into a complete system. Theoretically, everything just snaps into place and you’re done. Of course, it never works out that way. Sometimes this is because the “rules” of how the pieces should snap together weren’t well thought through. Other times it’s because components share “dependencies” that nobody thought about.

Take this simple example. Team A builds a module that uses a certain amount of computer memory, say two gigabytes (2 GB). Team B builds another module that also uses 2 GB. Your computers have 3 GB of memory total. Each module might work great on its own, but what happens when you put them together on the same computer? Together they break the system, because the combined memory load is too great.
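For the code-minded, here is a rough sketch of that example in Python. The numbers come straight from the paragraph above; the `fits` function and the variable names are made up for illustration, and a real system would measure memory rather than hard-code it.

```python
# Toy model of the two-module example. All names here are hypothetical.
MACHINE_MEMORY_GB = 3


def fits(module_loads_gb, capacity_gb=MACHINE_MEMORY_GB):
    """Return True if the modules' combined memory fits on one machine."""
    return sum(module_loads_gb) <= capacity_gb


module_a_gb = 2  # Team A's module
module_b_gb = 2  # Team B's module

# "Unit tests": each module checked on its own passes.
print("A alone fits:", fits([module_a_gb]))
print("B alone fits:", fits([module_b_gb]))

# "Integration test": both on the same machine does not.
print("A and B together fit:", fits([module_a_gb, module_b_gb]))
```

Neither team did anything wrong by its own tests; the failure only exists in the combination.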

And of course, it’s not usually remotely this clear-cut. Maybe both modules only use 1 GB under normal circumstances, so even the integration test succeeds. But when lots of requests start coming in at once, Module A starts using up more space, up to 3 GB. The result is the same --- your site stops working --- but it only happens when both modules are put together and you run tests simulating a very high load on the system. The more complicated the interactions become, the harder it is to find the bugs in a test environment.
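The load-dependent version can be sketched the same way. Everything specific here is an assumption for the sake of illustration: the growth rate of 1 GB per 1,000 requests per second, the function names, and the idea that Module B stays flat. The point is only that the same integration check passes at low load and fails at high load.

```python
# Toy model: memory use that depends on load. All numbers besides the
# 1 GB baseline, 3 GB peak, and 3 GB machine are invented for illustration.
MACHINE_MEMORY_GB = 3.0


def module_a_memory_gb(requests_per_second):
    """Assumed: 1 GB baseline, +1 GB per 1,000 req/s, topping out at 3 GB."""
    return min(1.0 + requests_per_second / 1000.0, 3.0)


def module_b_memory_gb(requests_per_second):
    """Assumed: Module B stays at 1 GB no matter the load."""
    return 1.0


def system_survives(requests_per_second):
    """Integration check: do both modules fit on one machine at this load?"""
    total = (module_a_memory_gb(requests_per_second)
             + module_b_memory_gb(requests_per_second))
    return total <= MACHINE_MEMORY_GB


# A quiet integration test passes; a load test does not.
for rps in (0, 500, 1000, 1500, 2000):
    print(rps, "req/s ->", "OK" if system_survives(rps) else "out of memory")
```

If your integration tests only ever run at the quiet end of that range, you will never see the failure until real users show up.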

Things just get worse when the teams building each individual part get further and further away from each other. Remember when outsourced parts didn’t fit together on the Boeing 787? Same problem.

And wait --- we’re not done yet. Because you can’t even do real integration testing until everything is ready to put together, you’re always feeling squeezed for time when you get there, and maybe you rush a little, or a lot --- and maybe everybody just hopes it’ll be OK, because there’s a deadline with political consequences.

So that’s why the site didn’t work. Yes, it was avoidable and some folks probably should lose their jobs --- but it’s also not a shocker --- this happens all the time. And it’ll get fixed, too --- in a couple of months, the exchanges will be working fine and most people will have forgotten about the annoyance. Do you really think healthcare.gov is more complicated than systems run reliably by Medicare or the IRS or frankly the NSA? No. Freaking. Way.

Which leads me to my last point for this post --- the idea that these website problems have any bearing at all on the viability of Obamacare is so unbelievably stupid I can’t believe that anybody spends one second seriously even talking about it. It’s like claiming representative democracy is a failed experiment because sometimes there’s a long line to vote.

Please.