
December 18, 2019 44 mins

Software bugs range from annoying to catastrophic. It's time to explore some of the most famous flaws in computer history!



Episode Transcript
Speaker 1 (00:04):
Welcome to TechStuff, a production of iHeartRadio's How Stuff Works. Hey there, and welcome to TechStuff. I'm your host, Jonathan Strickland. I'm an executive producer with iHeartRadio and I love all things tech, and I'm probably still running all over the United States right now on special projects. And for that reason, we are going

(00:26):
to enjoy another classic episode of TechStuff. This episode is called Bad Computer Bugs, and it's all about software bugs that were some of the worst to ever go out with shipped products. So let's go back and listen to this classic episode. I have to address a bit of apocryphal history, and regrettably it's a story that we've

(00:50):
repeated on TechStuff. So I'm sad to admit that I was complicit, although unknowingly, in the spread of misinformation, and that all has to do with the origin of the term bug to describe a flaw in programming. So here's the popular story, the one that we have accidentally

(01:13):
promoted on TechStuff without knowing that we were in the wrong. It goes that Grace Hopper, an early computer scientist who rose to the rank of rear admiral in the U.S. Navy, coined the phrase bug after discovering a moth gumming up Harvard's Mark II calculator, a literal bug. Generally speaking, the story tends to be

(01:36):
set in nineteen forty-five, and there is even a note in the log book that reads first actual case of bug being found that's attributed to Grace Hopper. But there are several points that are wrong in this story. First, the year: it didn't happen in nineteen forty-five. It happened on September ninth, nineteen forty-seven.

(01:57):
We know because there's a log book. The log book
that marks the incident not only has the notes, it
actually has the moth taped into the book itself. It's
taped onto the page. Second, Grace Hopper wasn't the person
to discover the moth or make that log entry. She
did tell the story about the moth several times, but

(02:20):
it wasn't in the context of finding it or logging it.
She just told the story that, yeah, we really did
have a bug in the system. And most importantly, the
word bug had already been used to describe design flaws
for decades before the Mark two was even designed. In fact,
if you look at the log book, this makes sense.

(02:41):
It says first actual case of bug being found. That sentence doesn't make sense unless you've already been using the word bug to describe a flaw. If bug weren't already established jargon, you wouldn't write first actual case of bug being found; the wording makes no sense, the context makes no sense. Sadly, there are documented quotes dating back to the nineteenth century using

(03:05):
the word bug to mean a design fault, and it could go back even further than that. So it is with much regret that I admit I have unwittingly contributed to a bit of misleading folklore making the rounds. But I'm glad I can take this opportunity to address it. All right, so let's talk about design bugs, and I'll be covering several goofs, mistakes, flubs, flaws, and outright catastrophes in this episode.

(03:30):
But one thing I'm not necessarily going to cover is software vulnerabilities that were later exploited, either by opportunistic hackers or white hats who were just trying to improve system security.
Those vulnerabilities are common in many types of software and
arise not just through mistakes but sometimes simple oversights, and

(03:51):
I think it might be more fun to look at
some real bugs, like stuff that made things go wrong,
stuff that may have rendered a program defunct or otherwise
caused headaches. Now I'm gonna make an exception to this. I'm going to start off with the Ping of Death,
and I only mention it because it has an awesome name. Now,
this flaw caused headaches back in nineteen ninety-six. It was

(04:14):
a flaw in IP fragmentation reassembly code, and it became
possible to crash lots of different types of computers using
different operating systems, although Windows machines were particularly vulnerable, and
this particular flaw would make a Windows machine revert to
the dreaded blue screen of death. And it all happened
by sending a special ping packet over the Internet. So,

(04:36):
for those of you who aren't familiar with what that is,
a ping is essentially a simple message that checks for
a connection between two computers. You send one ping from
a computer to another one and look for a response,
so that way you verify there is in fact a connection.
You can also tell other things, like how fast that connection between those two computers is. Now, in this case,

(04:58):
you would have to actually design a malformed ping request and send that to a target, and it would bring that target down.
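To make that concrete, here's a minimal sketch in Python of the arithmetic behind the flaw. It doesn't build real packets; it just shows how a final fragment's offset plus payload can exceed the 65,535-byte maximum size of an IPv4 packet, which is the bound a vulnerable reassembly buffer failed to check. The fragment numbers are invented for illustration.

```python
# Toy model of the Ping of Death arithmetic (illustrative only).
# IPv4 fragment offsets are counted in 8-byte units, and the total
# datagram may not exceed 65,535 bytes. A vulnerable reassembler
# allocated a fixed-size buffer and never checked this bound.

MAX_IPV4_BYTES = 65_535

def reassembled_size(fragment_offset_units: int, payload_bytes: int) -> int:
    """End position of a fragment within the reassembled datagram."""
    return fragment_offset_units * 8 + payload_bytes

# A hypothetical final fragment: offset 8,189 units (65,512 bytes)
# carrying a 1,480-byte payload.
end = reassembled_size(8_189, 1_480)
print(end)                      # 66,992 bytes
print(end > MAX_IPV4_BYTES)     # True: overflows a 65,535-byte buffer
```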
That's the only security vulnerability story I really wanted to focus on. The others are all just
design flaws. And let's begin with the bug that inspired
me to do this episode in the first place. That
Spotify bug I mentioned earlier. Ars Technica wrote a piece

(05:21):
on it in November two thousand and sixteen, but the
problem seems to date back at least as far as
June two thousand sixteen, and that's when a few savvy
Spotify users noticed some unusual activities on their computers. And
it took a little bit of detective work, but they
discovered that Spotify was apparently generating a huge amount of
data on a daily basis, like gigabytes of data per day.

(05:45):
And the culprit turned out to be a vacuum process for a database file containing the string mercury.db. Now, the vacuum process is the digital equivalent of vacuum sealing. It's meant to repack data so that it takes up less space on a drive. Now, this involves building a new file to maximize efficiency, which is a good thing

(06:06):
generally speaking. The problem was that Spotify's version was making it happen way too frequently, like on the order of once every few minutes, and that's not generally necessary. You don't need to rebuild a database file every few minutes to make sure it's the most efficient size it can be.
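The episode doesn't name the database engine, but a file like mercury.db behaves like SQLite, so here's a minimal sketch, assuming an SQLite-style database, of what a vacuum does and why each run rewrites the whole file. The file name and sizes here are stand-ins, not Spotify's actual data:

```python
# Sketch of a database vacuum, assuming an SQLite-style file.
# VACUUM rebuilds the entire database into a new file to reclaim
# free pages, so each run rewrites roughly the whole file to disk.
import os
import sqlite3

conn = sqlite3.connect("mercury_demo.db")  # hypothetical stand-in file
conn.execute("CREATE TABLE IF NOT EXISTS cache (k TEXT, v BLOB)")
conn.execute("INSERT INTO cache VALUES ('key', randomblob(1000000))")
conn.execute("DELETE FROM cache")          # leaves free pages behind
conn.commit()

before = os.path.getsize("mercury_demo.db")
conn.execute("VACUUM")                      # rewrites the whole file
after = os.path.getsize("mercury_demo.db")
print(before, "->", after)                  # smaller file, full rewrite
```

Run every few minutes instead of, say, once at shutdown, that full rewrite is exactly the kind of thing that multiplies into gigabytes per day.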
So each rebuild represented a relatively small amount of data,

(06:29):
but over time it added up, which meant that if you had Spotify on your computer, even if it was just running in the background, it would be generating gigabytes worth of information rewriting this file over and over.
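Some rough back-of-the-envelope math on how small rebuilds add up; both numbers here are assumptions for illustration, not figures from the episode:

```python
# How small rebuilds add up: assumed numbers for illustration.
file_size_mb = 10              # assumed size of the rewritten database file
rebuilds_per_hour = 12         # "once every few minutes" ~= every 5 minutes

gb_per_day = file_size_mb * rebuilds_per_hour * 24 / 1_000
print(gb_per_day)              # ~2.9 GB/day of writes from a 10 MB file alone
```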
Now it wasn't filling up a hard drive. It was just overwriting the same file. Now, if it had been

(06:50):
filling up a hard drive, people would have noticed much earlier, and it wouldn't have just been savvy Spotify users, because you would suddenly notice, hey, I can't save anything to my hard drive because it's filling up. Instead, again, it was just sort of writing and deleting and writing and deleting the same file over and over again. And that probably doesn't sound like a big deal, but it is a problem if you're using a solid state drive

(07:12):
or SSD. So one of the drawbacks of an SSD is that over time it loses storage capacity. Like, you can store less data on an SSD over time. Now, by over time, I generally mean over a great deal of time and a lot of different data being written to it and overwritten. Generally speaking, most

(07:33):
of us end up replacing our drives before we get to a point where the loss of capacity is a real issue. But it's similar in a way to how a battery can lose its ability to hold a full charge after you've gone through lots of charging and discharging cycles. You know how a battery won't be able to hold as much even if it says it's at

(07:54):
a hundred percent, but that hundred percent doesn't last you as long as it used to. That's because its capacity to hold a full charge has decreased over time. But let's say you've got a program that's just constantly overwriting data to your drive. You might discover that your SSD's useful lifespan has been drastically reduced.
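And here's why constant rewriting matters for an SSD specifically, again with assumed numbers rather than anything from the episode:

```python
# Back-of-the-envelope SSD wear math. All numbers are assumptions
# for illustration.
writes_per_day_gb = 100    # assumed daily rewrite volume on an affected machine
endurance_tbw = 75         # assumed endurance rating of a small consumer SSD

days_of_rated_life = endurance_tbw * 1_000 / writes_per_day_gb
print(days_of_rated_life)  # 750 days: rated endurance gone in about two years
```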

(08:15):
So as I record this episode, Spotify has already rolled out an updated version of its desktop application, and that, by the way, is the only version of Spotify that was affected. If you use web-based Spotify or mobile Spotify, you're in the clear already. If you use a desktop version, as long as you have version one point zero point four two or later, you are fine. But if you did

(08:39):
have that earlier version and you just had Spotify running in the background, chances are it was writing to
your hard drive like crazy. So what about some of
the other big bugs in computer history? Well, some of
the real doozies involve our attempts to explore the final frontier.
So we'll be talking about space a few times in
this episode, and we'll start with an early US satellite.

(09:01):
So first up is a nineteen sixty-two blunder involving the Mariner one. So some backstory on this one. We're gonna talk a lot about the Soviet Union in this episode too. It plays a couple of roles as we go on. But in this case, the then-USSR had launched Sputnik into orbit in nineteen fifty-seven, which really kicked off the space race and also was a

(09:23):
big shot in the Cold War, because the Soviet Union was essentially saying, hey, if we can launch this into space, we could also launch something at you. In response, the US had done sort of the same thing. They had launched some satellites into space, and the Mariner one was going to be a big, big feather in the cap of the US. The whole idea was to launch a probe that would be a flyby probe, and

(09:45):
it would go by Venus. So NASA, which was still fairly new in nineteen sixty-two, was taking control of this, and the budget for this particular project was eighteen point five million dollars, which, if you were to adjust for inflation, would be almost a hundred fifty million dollars today. So, a hundred fifty million dollar project to launch the

(10:08):
Mariner one and have it fly by Venus. But as I'm sure you guys have figured out by now, based upon the topic of this podcast, not all went according to plan. Not long at all after the rocket launched from the launchpad, it began to veer off course, and neither the computer controls on the rocket nor manual controls

(10:30):
back at HQ could correct for the problem. The rocket's course was such that it was going to take it over shipping lanes, which meant there could be a potential catastrophe, and so a range safety officer made the difficult call and issued the command to blow the whole thing up just shy of three hundred seconds after it launched. So what happened?

(10:52):
Why did it go off course in the first place? Well, there was a flaw in the spacecraft's guidance software which diverted the rocket, and no amount of commands from ground control could correct for it. After a lengthy investigation, NASA discovered the error was the result of a mistake transcribing handwritten notes into computer code. So someone just took

(11:15):
some handwritten notes and misinterpreted one of them, and that one mistake was enough to crash the rocket, or rather to necessitate its being destroyed. The great science fiction author Arthur C. Clarke wrote that the Mariner one was wrecked

(11:37):
by the most expensive hyphen in history, which isn't quite right, but it's pretty funny. I mean, come on, it's a humorous phrase. So the actual punctuation mark that caused the problem was not technically a hyphen. It was a superscript bar. Superscript bars, by the way, not a place where playwrights

(11:57):
hang out to get torn up. A superscript bar means a horizontal bar that sits above some other symbol. In this case, it was above a radius symbol, and that symbol along with the superscript bar described a smoothing function, which means the formula was meant to calculate smoothed values of the time derivative of a radius.
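Here's a toy illustration of what a smoothing function buys a guidance loop, using simple exponential smoothing as a stand-in; the real Mariner guidance equations were more involved, and all values below are invented:

```python
# Why smoothing matters for guidance: an illustrative toy, not the
# actual Mariner equations. We "measure" a noisy radial rate and
# compare the raw signal with an exponentially smoothed version.
import random

random.seed(1)
true_rate = 100.0                 # steady underlying value (made up)
alpha = 0.2                       # smoothing factor
smoothed = true_rate

for step in range(5):
    measured = true_rate + random.uniform(-20, 20)   # sensor noise
    smoothed = alpha * measured + (1 - alpha) * smoothed
    # A controller acting on `measured` chases every noise spike
    # (the hard wrench of the steering wheel); one acting on
    # `smoothed` reacts only to the underlying trend.
    print(f"raw={measured:7.2f}  smoothed={smoothed:7.2f}")
```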

(12:21):
Now, without the smoothing function, tiny deviations in course sent commands to the rocket's thrusters to kick in big time and overcorrect for the problem. As an analogy, imagine you're driving a vehicle and you see a pothole in the road as you're approaching it, and instead of gently steering out of the way, you wrench the wheel really hard to

(12:43):
the left or to the right in order to try and get around the pothole. That's kind of what was happening with the rocket. It didn't have the smoothing function, and so as a result, it was having these wild deviations in course. So it wasn't a hyphen that caused the problem, but it's close enough. Our next space story takes place in nineteen ninety-six with the European Space Agency's Ariane

(13:06):
five Flight five oh one rocket. Now, this rocket was to launch into space on June fourth, nineteen ninety-six, and instead the rocket disintegrated about forty seconds after taking off. So what the heck happened? Well, it largely had to do with the ESA reusing old work. This actually becomes
a theme in this episode. One of the morals of

(13:28):
this entire podcast is: if you're designing something, a successor to an earlier product, and you want to reuse
some of the features that you created in your previous product,
test the heck out of it in its new form factor,
because it could be that things that worked perfectly fine

(13:49):
in the earlier model will go awry in the new one.
That's what happened here. So, as you might guess from the name, the Ariane five marked the fifth generation of launch vehicles under that name. The Ariane four's inertial reference system would convert sixty-four-bit floating point numbers into

(14:11):
a sixteen-bit signed integer, and it worked just fine. But the Ariane five's stats were beefier than its predecessor's, with faster engines, and that was where the problem really started. The engine output meant those sixty-four-bit floating point numbers were significantly larger than the ones generated by the

(14:33):
engines on the Ariane four. They didn't anticipate this, so during the conversion process there was actually data overflow, and that overflow caused both the backup computer and the primary computer aboard the Ariane five to crash, and they
crashed in that order. The backup computer crashed first, followed
by the primary computer a couple of seconds later. The

(14:55):
whole thing took less than a minute to go from
launch to disintegration. Oops.
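Here's a minimal sketch of that failure mode, a sixty-four-bit floating point value jammed into a sixteen-bit signed integer. The real flight software was written in Ada, and the values below are invented for illustration:

```python
# Sketch of the Ariane failure mode: a 64-bit float forced into a
# 16-bit signed integer. Illustrative values only.

INT16_MIN, INT16_MAX = -32_768, 32_767

def to_int16(value: float) -> int:
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        # The flight software hit an unhandled conversion error here,
        # which shut down the inertial reference system.
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return n

print(to_int16(20_000.0))        # fine: an Ariane 4-sized reading
try:
    print(to_int16(40_000.0))    # an Ariane 5-sized reading overflows
except OverflowError as err:
    print("conversion failed:", err)
```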
Now we're gonna stick with space, but jump forward to nineteen ninety-nine and the Mars Climate Orbiter. This
was an unfortunate problem. So this particular spacecraft was meant
to study Mars's climate, atmosphere and surface changes, and it

(15:18):
was also supposed to be a kind of relay station for landers that would explore Mars's surface, but none of that would last because of some pretty significant goofs. So on September twenty-third, nineteen ninety-nine, the orbiter passed into the upper atmosphere of Mars and did so at a pretty low altitude. And

(15:40):
this is what folks in the space industry call a
bad thing. The drag on the spacecraft was significant. It
began to fall apart and it was destroyed upon entering
Mars's atmosphere. That's what happened. So the software guiding the
orbiter was to blame, and it's a dumb, dumb mistake.

(16:02):
It was supposed to make adjustments to the orbiter's flight in SI units, specifically in newton-seconds. That's what the contract between Lockheed and NASA said: newton-seconds, use SI units for all of your calculations. But the software instead made calculations in non-SI units, namely

(16:23):
pound-seconds. So Lockheed's software gave information to NASA's systems using the wrong units of measure. NASA's systems then took the information, assuming it was in the right units of measure, and executed commands based upon that. So, this is
why if you're ever in a math course and the

(16:47):
teacher makes you stop in the middle of writing a
problem on the board and says, where are your units?
This is why you have to make sure you're using the right units, because if you're stating a number and you don't associate a unit with it, someone could make an incorrect decision based on that, and it could be disastrous, as it was in the case of this orbiter.
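A minimal sketch of the orbiter's unit mismatch; the 4.45 factor is just the pound-force-to-newton conversion, and the impulse value is invented:

```python
# The Mars Climate Orbiter unit mismatch in one line of arithmetic.
# One side produced impulse in pound-force seconds; the other read
# the same number as newton-seconds. The impulse value is invented.

LBF_S_TO_N_S = 4.448_22            # 1 pound-force second in newton-seconds

reported = 50.0                    # produced in lbf*s (assumed value)
interpreted = reported             # read as N*s: the same raw number

actual_in_si = reported * LBF_S_TO_N_S   # what the burn really delivered
print(actual_in_si / interpreted)        # ~4.45: every burn off by this factor
```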

(17:09):
The thrusters fired at four point four five times the power they were supposed to, and the orbiter didn't stand a chance. And this was a pretty expensive mistake. That mission's cost came in at three hundred twenty-seven point six million dollars.
But on the bright side, with all of these stories,
at least no human lives were ever in real danger

(17:29):
as a result of the mistake. Now I've got a
lot more to say about bugs, but before I get
into that, let's take a quick break to thank our sponsor.
All right. Now let's make a switch to AT&T,

(17:50):
which is a company that had a pretty
big problem with switches once upon a time. I'm talking
about an issue that popped up on January fifteenth, nineteen ninety. That's when AT&T long distance customers discovered they were unable to make any long distance calls. Why? Why could they no longer reach anybody? Well, AT&T's long distance switches, which control that and allow for

(18:15):
the actual connections to be made, were on the fritz.
They were trying to reboot over and over again. They
were just stuck in a reboot cycle. Now, initially the
company thought it was being hacked. But like I said
at the top of the show, I'm not covering stories
about hackers here. I'm talking about big design flaws that
caused problems. So they weren't getting hacked. That's not what

(18:39):
was going on with those one hundred and fourteen long distance switches. No, there was a design problem at fault. So what had happened was AT&T had rolled out an update to the code that managed the switches, and it was meant to increase efficiency. It was meant to speed things up. But the problem was it sped things up so much that the system got caught up in itself.

(19:00):
It gets pretty technical, but I can give you kind of an overview of what the problem was. All right, so each switch had a function that allowed it to alert the next switch down the line if things were starting to get hairy. So imagine that switch number one is handling traffic, but it's getting really close to capacity. So

(19:21):
it sends a message over to switch number two and says, I can't take on any more work, because if I do, I'll be overloaded. Switch two then says, no problem, I'll take on any incoming work for you and we'll handle it from there. And if switch number two were to get into the same sort of situation, it would say the same thing to switch number three, and so on

(19:42):
and so forth. Now, eventually each switch will contact the one below it and say, hey, how are you doing there? And if the answer is okay, then everything switches back and you go back to normal operation. That's how it's supposed to work. But AT&T's updated code sped things up so much it caused some real issues,

(20:05):
and there was some poor timing, just coincidental timing, that made things worse. So switch number one starts to get overwhelmed and sends a message over to switch number two. But switch number two was just in the middle of resetting itself. So switch number two goes into reset mode, which says do not disturb, and sends a message over to switch number three. That prompted switch number three to overload

(20:28):
and put up a do not disturb sign of its own. Move that down to switch number four. This whole thing goes down the entire line of switches. They all end up getting overloaded as a result of this, and all go into reset mode and get stuck there.
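Here's a tiny toy simulation of the cascade as the episode describes it, where a message that lands on a switch mid-reset knocks that switch over too and the condition marches down the line. It's a sketch of the described behavior, not AT&T's actual switch software:

```python
# Toy model of the cascading switch failure the episode describes:
# a status message landing at a bad moment trips a switch into reset,
# and its own messages knock over the next switch in line.

switches = [f"switch-{i}" for i in range(1, 6)]
resetting = set()

def deliver_message(i: int) -> None:
    """Switch i receives a neighbor's status message at a bad moment."""
    name = switches[i]
    if name in resetting:
        return                      # already down, nothing new happens
    resetting.add(name)             # caught mid-update: trips into reset
    print(f"{name}: overloaded, entering reset (do not disturb)")
    if i + 1 < len(switches):
        deliver_message(i + 1)      # its own reset message hits the next one

deliver_message(0)                  # switch-1 gets overwhelmed first
# Every switch ends up stuck in reset, mirroring the nine-hour outage.
```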
That problem lasted for nine hours before AT&T was

(20:48):
finally able to address the message load on the entire
system and get the switches back to normal. The estimated
cost of lost revenue for that time was about sixty
million dollars in long distance calls, and there were a
lot of angry customers to boot. So to placate them, AT&T offered reduced long distance rates on Valentine's Day. It was pretty ugly, but AT&T

(21:12):
tried to handle it, at least in a way that
didn't turn it into a PR nightmare. Not so with Intel. And that brings us to the Pentium problem. I don't know if you guys remember when Pentium processors first came out, but they were a big deal. It was a redesign of the architecture of the microprocessor and it was meant to really speed things up. Well, Intel had

(21:34):
a massive nightmare in nineteen ninety-four thanks to a flaw in the entire first generation of Pentium processors. Now, when you break it all down, a CPU is all about performing mathematical operations on data, so it's kind of important that it does this correctly. Unfortunately, the flaw in the Pentium processors

(21:55):
kind of messed that up. And the issue has to do with floating point operations. So the predecessor to the Pentium, the four eighty-six, used a shift-and-subtract algorithm for floating point division, which was effective but relatively slow compared to what Intel thought they could do by totally redesigning that structure and using a lookup table approach. Now,

(22:19):
the table was supposed to have one thousand sixty-six entries programmed directly onto the logic array of the Pentium processor, but for some reason only one thousand sixty-one entries made it. Five entries went missing and essentially returned an answer of zero instead of what they were supposed to say,

(22:42):
so if a calculation accessed one of those missing cells,
it got zero, even though that's not the correct answer.
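Here's a heavily simplified sketch of that failure pattern: a lookup table with a few entries silently zeroed. The real flaw was five missing entries in the Pentium's SRT division table; everything below is invented to show the shape of the bug, not the actual algorithm:

```python
# Heavily simplified sketch of a lookup-table division bug. The real
# Pentium flaw was five missing entries in its SRT division table;
# this toy just shows how a few zeroed-out cells corrupt some results
# while leaving most lookups looking fine. All values are invented.

TABLE_SIZE = 1_066
table = {i: (i % 4) + 1 for i in range(TABLE_SIZE)}   # fake nonzero digits
for hole in (101, 347, 612, 888, 1_040):              # five invented holes
    table[hole] = 0                                   # should never be zero

def lookup(index: int) -> int:
    return table[index]

print(lookup(100))   # 1: a normal entry
print(lookup(101))   # 0: a hole returns zero instead of the intended digit
# Only divisions whose operands happen to steer the algorithm into one
# of the holes go wrong, and only in the low-order digits of the result.
```

The most famous demonstration: on a flawed chip, dividing 4,195,835 by 3,145,727 returned roughly 1.333739 instead of the correct 1.333820, an error in the fifth significant digit.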
All the first generation Pentiums went out with this error because it was so minor that it wasn't even picked up by Intel's quality control at the time. Now, calculations worked just fine up to the eighth decimal place. Beyond that things got messy. But for most folks that wasn't

(23:05):
a problem, because they weren't doing mathematical calculations that needed that level of precision. It just wasn't a thing. In fact, there was only a one in three hundred sixty billion chance that this error would cause a big enough problem to reach up to the fourth decimal place. So most calculations that were simple were bulletproof. You were fine. But

(23:27):
if you needed that precision, if you needed that really fine degree, that's when you would encounter the flaw. And that happened, because there are math professors in this world, and one of those, Thomas Nicely, discovered in October nineteen ninety-four that he was getting errors because of this issue. He needed

(23:48):
the processor to work correctly, and so he contacted Intel about the problem. And this is where we take a moment to acknowledge there's a right way and a wrong way to handle an issue that's your fault. Intel decided to go the wrong way. My opinion is, if you
make a mistake, it's usually a good idea to just

(24:08):
own up to it and try to make it better.
But Intel's response was more along the lines of, yeah,
we didn't think it was a big deal. And then Intel made other PR blunders, because people began to hear, hey, that Pentium processor in your computer that you just bought, it doesn't work properly. So people wanted to get replacements.

(24:29):
But Intel said, oh, we're only going to replace the ones where you can prove that the mistakes it makes affect you in some meaningful way. So they weren't denying that there was a problem. They were just saying, hey, unless you can prove the problem affects you, we don't care. That didn't go well. If you
create a product and you market it as the future

(24:51):
of computing, and then it's discovered there's a flaw in the design, and then you say we'll replace it, but only if you prove you deserve it, it doesn't tend to make your customer base very happy. So ultimately, Intel reversed that decision and offered to replace the processor for anyone who wanted it who had a first generation Pentium,

(25:12):
and that mistake ended up costing the company four hundred seventy-five million dollars. Yikes. All right, now we're gonna switch gears over to Microsoft. First, I think you could claim that the whole Microsoft Bob product, which was supposed to be an easy, accessible computer interface, was really just a

(25:33):
massive software bug. I mean, it introduced Comic Sans, for goodness' sakes. The cluttered organization system, the lack of meaningful security, and numerous other issues plagued that software. But we did an entire episode of TechStuff about Microsoft Bob a couple of years ago, so I'm not gonna dwell on it anymore. But if you want to hear more about it,

(25:56):
go find that episode. It was a fun one. Now, in two thousand seven, Microsoft experienced a massive headache when a bug on their servers notified thousands of Windows customers that they were filthy, dirty software pirates and they should be punished. These included people who actually had legitimate, legally purchased copies of Windows XP or Vista. So the problem here was

(26:20):
Microsoft had an initiative called Windows Genuine Advantage, and it
was a nice name for a strategy meant to curtail
operating system piracy. Essentially, it was a component in Windows
that would allow Microsoft to figure out if the copy
of Windows on any given computer was legit. In other words,

(26:41):
it was a DRM strategy. But in two thousand seven, a buggy install of software on a server misidentified thousands of legitimate, law-abiding customers as pirates for
nineteen hours. The software just laid down the law, and
so people began to receive sternly written warnings about their
choice to indulge in bad behavior. And if you were

(27:03):
a Windows Vista customer, you had it the worst, not
just because you were using Windows Vista, which I think
we all agree was not one of the bright points
in Microsoft's operating system history, but also because Microsoft had
built in the ability for Windows Genuine Advantage to switch
off certain operating system features in Windows Vista if it

(27:27):
determined that the copy someone was using was a pirated version. So it was misidentifying real versions as pirated ones, turning off features, and these are for people who had bought legitimate copies. This, by the way, is one of the big arguments people have against DRM: it has the tendency to punish legitimate customers. And you feel like you're stupid

(27:51):
for buying a copy of a piece of software rather than just stealing one that has had those features or those defenses removed. Like, why? You're creating more incentives for people to go outside and get a pirated copy. All right,
so imagine you've purchased this legitimate copy of Windows Vista.
First of all, you already feel bad. Then you're

(28:13):
told you're a thief, so you feel worse. Then someone remotely switches off several features of your operating system. That was not a great PR message, so that was a real issue. They did eventually fix it after those nineteen hours, but by then people were already very upset. Also, I don't wanna just, you know, pile lots of abuse onto Microsoft.

(28:34):
I gotta talk about Apple here too. So the company
prides itself on a high standard of quality, and in
general it's pretty good about living up to that standard
of quality, depending upon your point of view of their
various products. But that hasn't stopped a few clunkers getting
through and into the public hands. And that was the

(28:55):
case in two thousand twelve with Apple Maps. If you
owned an iPhone back in two thousand twelve when Apple
Maps came out, you may remember this problem. It was pretty well publicized. Maps were inaccurate, sometimes leaving out important details like, you know, a river or a lake between you and your destination, things that might be important if, I don't know, you don't drive an amphibious vehicle. It might not

(29:17):
have a road on there that's important. It might misidentify the location of a historical landmark. For instance, it thought the Washington Monument was across the street from where it is. But nope, it's just where we left it, despite all of Roland Emmerich's best attempts to move it or destroy it. It's still there. The real problem here was that the

(29:41):
Apple software just wasn't ready for public unveiling. It needed a lot more testing. It was trying to play catch-up to Google Maps, but Google had the advantage of working with companies that had been doing mapping software for years. Google acquired those companies and acquired the expertise of people who had been working on that software, and Apple was really just trying to create its own version

(30:05):
and get it out as fast as it could. But
it got out a little too early, and the company
spent the next several months tweaking maps and trying to
keep control of the situation. But by that time, many
of Apple's fans, even the most devoted ones, had kind
of given up and switched over to Google Maps instead. Well,
that's most of the fun stuff. I've got some really
serious bugs to cover. But before I do that, let's

(30:28):
take another quick break and thank our sponsor. Now I'm
going to transition into some serious bugs. These are ones
that either threatened the lives of people or they contributed
to people dying. The ones I've talked about up

(30:53):
to now have cost companies millions of dollars, but
no one's life was truly threatened. Unfortunately, that's not the case with all software bugs. Now, a couple of bugs had the potential to kill millions of people. One of those happened in nineteen eighty: a famous bug,

(31:14):
or at least a faulty circuit, and that was a faulty circuit in NORAD's computer system which caused it to mistakenly conclude the US was under nuclear attack from the Soviet Union. So displays on NORAD systems showed seemingly random attacks, and they didn't correspond with each other. So the display might show, hey, there are two missiles heading

(31:34):
over from the Soviet Union. No, there are two hundred. No, there are fifty. No, there are three. And it wasn't consistent, and
command posts around the US all had conflicting information, which
led leaders to conclude the whole thing was a regrettable
computer error, and they were right to do so. To
be fair, they were kind of prepared for this because

(31:56):
there was another incident that actually happened in nineteen seventy-nine that was scarier, and in that case,
someone mistakenly inserted a training scenario into the computer system
that made it seem like the Soviet Union had launched
an all out nuclear attack on the US. But that
wasn't a bug. That was a mistake on the part
of a human who had accidentally uploaded the wrong or

(32:16):
rather executed the wrong command. It didn't have anything to do with a flaw in the computer system itself. However, because that thing happened, and everybody was freaked out and then was able to determine that in fact it was a false alarm, it meant that calmer heads could prevail in the nineteen eighty incident. So the Soviets also had

(32:38):
a close call just a few years later. It was a bug in the early warning detection software that the USSR was using in the early eighties, and on September twenty-sixth, nineteen eighty-three, the Soviet Union received an alert that the US had launched a nuclear attack in the form of five nuclear warheads, technically two different attacks.

(33:00):
The first would have been a single nuclear warhead and the second was four nuclear warheads, and this was during a particularly stressful period in the history of both countries and their relationship with each other, at the height of the Cold War in nineteen eighty-three. Now, fortunately, Soviet Air

(33:20):
Defense Forces Lieutenant Colonel Stanislav Petrov suspected that this report
was an error and that there was some sort of
bug in the software or a mistake in the reporting
system that caused this. He gave a command to hold
off on any sort of retaliatory strike, which would have
initiated a full scale nuclear war had it happened. Petrov

(33:43):
was the officer in charge of a bunker that served as the command center for this early warning system, and he said afterward that his reckoning was any
real attack would consist of hundreds of warheads, not five.
No one would start an attack with just five warheads,
so it was more likely to be an error than
a genuine attack. So he gave the command to wait

(34:06):
until the reported missiles would pass into the range of radar,
which only extended as far as the horizon, so if
it had in fact been a real attack, it would
have potentially limited the Soviet Union's ability to respond. But
no missile showed up, and he was vindicated in his decision. Now,
the cause of the false alarm in this case was

(34:28):
a combination of factors that the designers didn't anticipate, which largely consisted of sunlight hitting high-altitude clouds at a particular angle from a particular perspective of the satellites. So the satellites misidentified that reflection as a warhead. Now

(34:49):
the Soviets were able to address this error in the future by adding another step in which these satellites would cross-reference data from other geostationary satellites to make certain that they were identifying actual rockets as opposed to high-altitude clouds. Now, there are several cases

(35:10):
of software bugs leading to actual deaths. For example, the Therac-25 was such a case. Now, that was a radiation therapy machine that could deliver two different modes of radiation treatment. The first was a low-powered direct electron beam and the second was a megavolt X-ray beam. Now,

(35:30):
the X-ray beam was far more intense and it required physicians to provide shielding to patients to limit exposure to the beam. But the Therac-25 had inherited its code from its predecessor, which had different hardware constraints. Now, the new machine meant that these constraints weren't there, and it created a deadly problem. If operators changed the machine's mode

(35:53):
too quickly from one to the other, it would actually
send two sets of instructions to the processor, one for
each mode of operation, and whichever set of instructions reached
the processor first, that's what the machine would switch to.
So let's say you've been operating the Therac-25 in the megavolt X-ray mode, but now you're going to

(36:15):
have a patient come in and you need to administer radiation therapy, so you want to switch it to the low-powered electron beam. You switch it too quickly, and it sends two sets of instructions to the processor, and the one that arrives first is the megavolt X-ray instruction. So instead of switching, it confirms to stay on the more intense, deadlier radiation.
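Here's a minimal sketch of that class of race condition, two mode commands in flight at once with first-arrival winning; it's a toy illustration of the failure class, not the Therac-25's actual code:

```python
# Toy model of the mode-switch race the episode describes: two
# instructions end up in flight at once, and whichever reaches the
# processor first wins. Not the Therac-25's actual code; just an
# illustration of the failure class.
from queue import Queue

processor_inbox = Queue()

def operator_switches_modes_too_fast() -> None:
    # The stale high-power command and the new low-power command both
    # get queued; under the race, the stale one can land first.
    processor_inbox.put("MEGAVOLT_XRAY")   # stale instruction wins the race
    processor_inbox.put("ELECTRON_BEAM")   # the mode the operator wanted

operator_switches_modes_too_fast()
active_mode = processor_inbox.get()        # first arrival sets the mode
print("machine mode:", active_mode)        # MEGAVOLT_XRAY: the deadly case
```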

(36:37):
The tragic news is this did happen several times. Six patients were documented as dying from complications due to radiation poisoning from Therac-25 machines between nineteen eighty-five and nineteen eighty-seven, and while the machine would send error messages when these conditions were present, the documentation for the machine

(36:58):
didn't explain what the errors meant. It didn't say, hey,
if you get this error, it means that you've switched
modes too quickly and you need to address this. So,
since operators weren't told that this was necessarily a hazardous condition,
they would just clear the error and proceed, and there
were deadly results. In a similar vein in Panama City, Panama,

(37:22):
there was an incident involving a cobalt-sixty system, actually several incidents involving this cobalt-sixty system, which was running therapy planning software made by a company called Multidata Systems International. Now, the software's purpose was to calculate the amount of radiation that cancer patients should receive in radiation therapy sessions. During these radiation therapy sessions, the therapists

(37:46):
were meant to place metal shields on the patient to protect healthy tissue from radiation damage. And the software would allow therapists to draw where those shields were on the patient, to indicate where the shields were present. But they could only draw up to four shields,

(38:07):
and the doctors in Panama wanted to use five shields
for particular therapy sessions. They were overloaded, they had a
long waiting list of patients, and they were trying to
make things more efficient, and they discovered that they could
kind of work around this limitation of four shields by
drawing a design on the computer screen as if they

(38:27):
were using just one large shield that has a hole
in the middle of it. And so what they would
do is they would arrange the five shields to essentially
be in the same sort of shape with the middle
of it being open so that they can have the
radiation therapy passed through it. Uh, But they didn't realize
that the software had a bug in it, and that

(38:48):
bug was this: if you drew the hole in one direction, you got the correct dose of radiation, but if you drew it in the other direction, so clockwise versus counterclockwise, the software would recommend a dosage twice as strong as what was needed. And the result was devastating.
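As an illustration of how drawing direction can change a calculation at all, consider the classic signed-area trap: the shoelace formula returns a positive area for one winding direction and a negative one for the other, and downstream code that trusts the sign misbehaves. This is a sketch of the failure class, not Multidata's actual dose algorithm:

```python
# Signed polygon area via the shoelace formula: a classic way that
# drawing direction can change a calculation. Illustrative only.

def shoelace_area(points: list[tuple[float, float]]) -> float:
    """Signed area: positive counterclockwise, negative clockwise."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2

square_ccw = [(0, 0), (4, 0), (4, 4), (0, 4)]
square_cw = list(reversed(square_ccw))

print(shoelace_area(square_ccw))   #  16.0
print(shoelace_area(square_cw))    # -16.0: same shape, opposite sign
# Dose logic that trusts the sign (or forgets abs()) treats these
# identical shields as different, direction-dependent geometries.
```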
Eight patients died as a result of this, and another twenty received

(39:11):
doses high enough to potentially cause health problems. Later on,
the physicians were actually arrested and brought up on murder
charges because they were supposed to double check all calculations
by hand to ensure that they were going to give
the proper dose of radiation treatment. So while the software
was calculating the incorrect dose, the physicians were responsible for

(39:34):
making sure that any dose that was calculated was in
fact the correct one, and they failed to do so,
or at least that was the charge. There are also
bugs that involved military applications that have resulted in the
loss of life. During the Persian Gulf War, an Iraqi-fired Scud missile hit a US base in Saudi Arabia and killed twenty-eight soldiers. Now, the base had

(39:57):
detected the missile and had launched and fired a Patriot
missile in return. The purpose of the Patriot missile was
to intercept and destroy incoming missiles, and the way a
Patriot missile did this was to use radar pulses to
guide trajectory calculations so that it would end up getting
close to the incoming missile. This is harder than it sounds,

(40:17):
because both missiles are moving very, very quickly, so it needs very precise information in order to adjust its trajectory properly and make sure it's on target. Now, once it gets within range, which is between five and ten meters, I think, it would then fire out a thousand pellets from the Patriot missile at high velocity with

(40:39):
the goal of causing the incoming warhead to explode prematurely.
In this case, the Patriot missile missed, and the military
investigated the issue in the wake of the loss of
life and found a problem with the software guiding the
Patriot missile. And it was a problem that the military actually kind of knew about already. So one of the processes in the Patriot's programming was to convert time into

(41:01):
floating point values for increased accuracy. But not all subroutines that depended on tracking time did this. Some of them remained in integer clock units rather than floating point values, which meant that they would get out of sync after a while. There'd be a disagreement among various subroutines as to

(41:23):
how much time had actually passed. And like I said, the military was aware of this issue and they had a workaround, which was not ideal. The workaround was you would occasionally reboot the system, which would reset the clocks and synchronize them. But over time they would fall out of sync again, because they're not tracking time the same way. And since there was no hard and fast rule as

(41:44):
to how frequently you'd reset the system, problems like this one were possible, and in fact, in this case it did happen. So prior to this particular incident, that specific Patriot system had been running for one hundred hours without a reboot, and the clock disagreement amounted to about one-third of a second. Now, that seems like it's no time at all.

(42:05):
One-third of a second is so, so short. But a Scud missile's top speed is about one point one miles per second, or one point seven kilometers per second, which means in a third of a second the missile could travel more than five hundred meters. And since the Patriot needs to be within ten meters of a target to destroy it, that resulted in a catastrophic failure.
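Here's the arithmetic worked out, using figures from the public analyses of the incident (a 24-bit approximation of one-tenth of a second that is off by roughly 0.000000095 seconds per tick). Treat it as a simplified reconstruction rather than the system's actual code:

```python
# The Patriot timing-drift arithmetic, using figures from the public
# analyses of the 1991 incident. A simplified reconstruction.

error_per_tick = 9.5e-8    # truncation error in the 24-bit constant for 0.1 s
ticks = 100 * 3600 * 10    # 100 hours of uptime, counted in tenths of a second

drift = error_per_tick * ticks
print(drift)               # ~0.34 seconds of accumulated clock disagreement

scud_speed_m_per_s = 1_676 # roughly the Scud's speed, in meters per second
print(drift * scud_speed_m_per_s)  # ~573 meters: far outside the intercept window
```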

(42:27):
So software bugs can be a matter of life or death. It's not all just, hey, this irritating thing meant people couldn't make long distance phone calls, or this issue
caused my computer to start writing massive amounts of data
to its hard drive. And this is why it's so
important to have really qualified QA personnel go through code

(42:51):
and make sure it's doing what it's supposed to do,
because the problems that can arise can be non-trivial, and in fact life-or-death situations, depending upon the application of the technology. So technology is a fascinating thing. It's
a wonderful thing. It has benefited us in ways that
I can't even begin to describe. It's just too broad

(43:11):
a topic, and it's something I've been tackling for, you know, eight years, and I haven't even gotten close to getting toward the finishing point. So I don't want to
suggest that technology is bad, but we definitely have the
need to check, double check, and triple check all this
work to make certain things are working properly before we
release them out into the wild. That particularly applies if, again,

(43:35):
you are reusing old code or old components in a
new way, because you have to make absolutely certain that
there's not going to be some unintended problem that results
when a new form factor is using old code. And
that wraps up that classic episode of TechStuff. Hope you guys enjoyed it. If you have any requests, questions, comments,

(43:57):
you can email me. The address is techstuff at howstuffworks dot com, or you can reach out on Facebook or Twitter. The handle for both of those is TechStuff HSW. Don't forget to go to our website, that's techstuffpodcast dot com. You'll find a link to every episode we've ever recorded, plus a link to our online store, where every purchase you make goes to help the show. And we greatly appreciate it, and I'll talk to you again really soon. TechStuff

(44:24):
is a production of iHeartRadio's How Stuff Works. For more podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or wherever you listen to your favorite shows.
