Volume 31 Number 7
By Michael Desmond | July 2016
In my last two columns, I described lessons that large-scale calamities carry for software development teams. Tragic examples abound, from the misapplied training that impacted the response at Three Mile Island in 1979 (“Going Solid,” msdn.com/magazine/mt703429) to cognitive biases that caused people to sharply underestimate risk in events as diverse as the 2010 Deepwater Horizon oil spill and the 2008 global financial meltdown (“Cognitive Bias,” msdn.com/magazine/mt707522).
Of course, these examples are external to software development. One incident closer to home is the Therac-25, a medical device that delivered radiation treatments to cancer patients. Over the course of 18 months between 1985 and 1987, the Therac-25 delivered major radiation overdoses to six patients in the United States and Canada, killing at least two. The manufacturer, Atomic Energy of Canada Ltd. (AECL), struggled to pinpoint the causes, even as incidents mounted.
The Therac-25 could deliver either a low-power electron beam or a high-power X-ray beam, depending on the prescribed treatment. But a race condition in the software made it possible to expose patients to a massive radiation dose if the operator switched quickly from X-ray mode to electron beam mode before the beam hardware could move into position. Two patients in Texas died after they were exposed to the unmoderated, high-power electron beam.
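The hazard here is a classic check-then-act race between fast software state and slow hardware. The actual Therac-25 code was PDP-11 assembly and is not reproduced here; the following is a minimal C sketch, with hypothetical names, of how a software-side mode change can outrun the hardware it is supposed to stay consistent with:

```c
#include <stdbool.h>

/* Illustrative sketch only (hypothetical types and names, not the
 * Therac-25 source): the software's beam-energy setting updates
 * instantly, while the beam-shaping hardware takes seconds to
 * follow. Firing in the gap leaves the two inconsistent. */
typedef enum { XRAY, ELECTRON } Mode;

typedef struct {
    Mode energy_mode;  /* software state: changes immediately      */
    Mode turret_mode;  /* hardware state: lags behind the software */
} Machine;

/* Operator edits the treatment mode at the console. */
void operator_edits_mode(Machine *m, Mode requested) {
    m->energy_mode = requested;  /* applied at once in software */
    /* the hardware move is merely queued; it completes later   */
}

/* Called when the slow hardware move finally finishes. */
void turret_move_completes(Machine *m) {
    m->turret_mode = m->energy_mode;
}

/* Safe only if beam power matches the hardware's position.
 * A robust design would enforce this check (or a hardware
 * interlock) before every activation. */
bool fire_is_safe(const Machine *m) {
    return m->energy_mode == m->turret_mode;
}
```

If treatment starts between `operator_edits_mode` and `turret_move_completes`, the two fields disagree and `fire_is_safe` returns false; absent such a check or a hardware interlock, the beam fires anyway.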
A second bug could cause the device to produce a deadly electron beam while the operator aligned the device on the patient. The culprit: a one-byte variable that the software incremented on each pass, rolling over to zero on every 256th increment. If the operator pushed the set button at the instant the variable rolled over to zero, the beam would activate unexpectedly. Several others may have been overdosed due to this software flaw.
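The arithmetic behind this failure is easy to see in miniature. This is not the Therac-25 code itself (which was PDP-11 assembly); it is a hedged C sketch, with hypothetical names, of an eight-bit flag that is incremented rather than set, so it periodically wraps to the one value that means "safe":

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch only: a one-byte flag where nonzero means
 * "setup incomplete, run the safety check" and zero means "setup
 * verified". Incrementing instead of assigning a fixed nonzero
 * value makes the flag wrap to zero every 256th time. */
static uint8_t setup_incomplete; /* hypothetical one-byte flag */

/* Called on each pass while setup is still unverified. */
void mark_setup_incomplete(void) {
    setup_incomplete++;  /* bug: wraps to 0 on the 256th call */
}

/* The safety check treats zero as "nothing left to verify". */
bool safety_check_bypassed(void) {
    return setup_incomplete == 0;
}
```

After 255 calls the flag is nonzero and the check still runs; on the 256th call it wraps to zero and the check is silently skipped. Assigning a constant nonzero value (or using a wider type) would remove the wraparound entirely.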
Remarkably, these code flaws had been present for years in older Therac models (including the similar Therac-20) without incident. The difference: The Therac-20 employed hardware interlocks that physically prevented an overdose from being administered—the machine would simply blow a fuse. But AECL engineers had replaced those interlocks with software on the Therac-25. And the software was not up to the task.
Nancy Leveson, a professor of Engineering Systems at the Massachusetts Institute of Technology, performed a detailed study of the Therac-25 incidents (you can read the PDF yourself at bit.ly/2a98jEx). She found that AECL was overconfident in its software, unrealistic in its assessment of risk, and deficient in defensive design elements such as error checking and handling. Remarkably, just one developer, whose identity and credentials AECL never disclosed, wrote and tested the Therac software in PDP-11 assembly language. Leveson found that AECL lacked a robust testing program and assumed that software reused from earlier Therac models would be free of flaws.
Therac-25 reads like a lesson in hubris. Unlike earlier models, the Therac-25 relied almost exclusively on software to ensure safe operation. And yet, AECL entrusted the device to a single, apparently unmanaged coder and to aging, reused software. Now, 30 years later, it's a lesson worth contemplating.
Michael Desmond is the Editor-in-Chief of MSDN Magazine.