Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Hannah Clayton-Langton (00:05):
Hello world and welcome to the Tech Overflow podcast, where we have smart tech conversations for smart people. I'm Hannah Clayton-Langton and, after nearly seven years working in tech companies, I decided that I wanted to understand a little bit more about what was going on around me, so I enlisted my co-host, Hugh, to take me on a technical journey. Hugh, how are you? How's the jet lag?
Hugh Williams (00:24):
I am well, Hannah. It's day six, so it usually takes me nine days to get over the jet lag, but it's fabulous to be here in London with you, actually in the same room, without a 200 millisecond time delay. And for those of you who don't know me, Hugh Williams is my name. I'm a former vice president at Google. I was a vice president also at eBay and I was a senior engineer
(00:44):
at Microsoft, so my job here today is to help demystify tech topics for Hannah.
Hannah Clayton-Langton (00:50):
Perfect, okay. So I'm super excited for today's episode because it feels like a proper under-the-hood topic, which is what we're calling bugs and outages. So what happens when it goes wrong, and what sort of systems are in place so that things don't go wrong? Because, as end users, we definitely notice when things don't work, but we expect them to work all the time.
Hugh Williams (01:10):
I can feel that sort of stressful feeling that I used to get in my stomach when I was a vice president, of an outage or a bug and the CEO calling me. But I'm looking forward to the conversation.
Hannah Clayton-Langton (01:21):
Amazing. Well, we'll try not to stress you out too much. So let's start with probably the most notable outage that I reckon most of the listeners will have been aware of. It was about 11 months ago, I think, which was the CrowdStrike outage. So maybe, before we get into the sort of nuts and bolts of bug and outage management, can you just talk us through what
(01:41):
happened there, because I think that's a really good real-life example of when tech goes wrong.
Hugh Williams (01:45):
I remember when it happened and, as an engineer, I think we're a bit like a brotherhood or sisterhood at some level. We always feel for each other, and I remember thinking somebody's having a really, really bad day and gee, I'm glad that's not me. We'll probably talk about some of my stories later on, of things that have gone wrong, but I definitely felt for the
(02:05):
engineering team over there at CrowdStrike, so let's pull it apart. It'll be fun.
Hannah Clayton-Langton (02:09):
Yeah, so when an outage happens, that's basically like an engineering fault or mistake, right? Like, my understanding is, it's probably someone's deployed something new, like an update to the code, and there's been an unintended consequence of what they rolled out and it breaks something. Is that like a fair, generic assessment?
Hugh Williams (02:28):
I think that's fair. I think one thing to remember, though, is it's not always the fault of the folks that you think it is, right? So let's imagine that one of your favorite websites goes down tomorrow. Could indeed be those folks. So it could indeed be the folks that are, you know, building Instagram or whatever it is that you're using. It could also be the folks that are hosting the service that
(02:52):
that runs on. So, you know, let's imagine that Instagram runs on AWS. That's built by Amazon. It could be an AWS outage that's causing it. So not the fault of the folks at Instagram. Right, and then there's all sorts of internet infrastructure in the way, right? So, for example, there's these things called DNS servers. We'll talk about that some other time, but that's how your computer, when you're using your web browser, figures out
(03:13):
exactly where the machines are that it needs to talk to. So you're used to typing in an English thing like Instagram.com and pressing enter. That gets turned into some numbers behind the scenes, and there's these things called DNS servers that do that conversion. So if the DNS server is unavailable, your browser can't turn the words into numbers, and then you'll think that Instagram's down, but it, you know, might in that case be
(03:34):
absolutely nothing to do with the folks at Instagram. But definitely, yes, you're right. I mean, coming back to your first point, a lot of the outages are caused by folks making mistakes who are actually building the products.
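A rough sketch of the lookup Hugh is describing, using Python's standard library; the hostname is just an example, and a real browser adds caching and other layers on top of this.

```python
import socket

# Ask the operating system's DNS resolver to turn a human-readable
# hostname into the numeric IP addresses a browser actually connects to.
hostname = "instagram.com"  # example name from the conversation

try:
    addresses = socket.gethostbyname_ex(hostname)[2]  # list of IP addresses
    print(f"{hostname} resolves to: {addresses}")
except socket.gaierror as err:
    # If the DNS server is unreachable or the name can't be resolved,
    # the site will look "down" even if its own servers are perfectly fine.
    print(f"Could not resolve {hostname}: {err}")
```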
Hannah Clayton-Langton (03:45):
Yeah, well, that's what's kind of terrifying, right, because I make mistakes all the time and I don't think I've ever quite had as bad a day as whatever that engineer at CrowdStrike did. From a technical perspective, like, is it clear now what actually happened?
Hugh Williams (03:57):
I think it's reasonably clear. Like a lot of these outages and bugs, a whole bunch of things went wrong. So let's start with what actually was the thing that broke. So CrowdStrike has a product called Falcon, and Falcon is a product that's installed on users' computers mostly. It can also be installed on machines in a data center that
(04:18):
are running software that's critical to an organization, and what Falcon does is it basically inspects the internals of the computer to see if nefarious, malicious things, patterns of those things, are starting to happen.
Hannah Clayton-Langton (04:32):
Because CrowdStrike is, at its core, like a security software, security service provider? Okay, yeah.
Hugh Williams (04:37):
And this is one of the most popular products. So you can imagine, if we worked at a large company, a company that uses a lot of tech, a tech-enabled kind of company, we might say in our security department: we should get CrowdStrike Falcon installed on all of our computers, and it will monitor behaviors that are going on on all of our computers, and then, if it detects something that has the potential to be malicious, it'll take some action, right?
Hannah Clayton-Langton (04:58):
Okay, so this is like super important, fundamental software that I'm guessing a whole bunch of companies are using, based on the variety of companies that went down on that day. So it's pretty commonly used, right?
Hugh Williams (05:09):
Very popular software, and you can imagine us making the decision that we want all of our end users in our business to have this software, right, so that if they click on a malicious website, or they open an email they shouldn't open, or they try to install some software they shouldn't install, this system's sitting there deep inside their Windows machine making sure that there's some extra protection there.
(05:29):
So the first thing to know about this Falcon software and Microsoft Windows is that the Falcon software runs, really, I guess in layperson's terms, as part of Windows. It needs to run really deep inside the machine because it's got to inspect the whole machine, so it's looking for all sorts of behaviors that might be occurring within the machine. So it's not like something like Microsoft Word that you install
(05:51):
on top of the operating system, which means it runs in a very safe kind of way. This is actually something that's running deep inside Windows and actually has a lot of control over what's going on inside the computer. So that makes it very, very dangerous. So if something goes wrong in Falcon, something is going to likely go wrong inside Windows, and you, in this case, ended up
(06:11):
with the blue screen of death, right?
Hannah Clayton-Langton (06:13):
Okay, so sorry, not working. End of story.
Hugh Williams (06:18):
End of story. A file was installed deep inside this Falcon software, and this file didn't have the contents that the Falcon software expected. The CrowdStrike folks have deployed this file onto all of the computers in the world that run this software. The Falcon software has opened up the file. It's expected the file to have certain contents.
(06:38):
It didn't have those contents, and so the Falcon systems actually crashed.
Hannah Clayton-Langton (06:43):
Okay, but that file content, it wasn't malicious content, it was just, like, literally different to what the computer was expecting, and that sort of caused, like, a fault?
Hugh Williams (06:54):
I know many of our listeners will be familiar with things like Microsoft Excel or Google Sheets, or maybe even comma-delimited files. I think lots of folks import comma-delimited files into Excel and Sheets, and so the kind of thing that happened here was that there was a file, and this file was expected to have a certain number of fields, which I think was 21. But the software was only expecting 20.
(07:18):
It hadn't been updated to expect the full 21 fields, and so it opened up the file, it found it had 21 fields, the software was expecting 20, and all sorts of bad things started to happen.
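Neither host has seen the Falcon source, which is C code running inside Windows, but the shape of the failure, code that assumes one field count opening a file with another, can be sketched in a few lines of Python. Everything here, including the field counts, is illustrative.

```python
EXPECTED_FIELDS = 21  # how many fields this version of the code assumes each record has

def parse_record(line: str) -> list[str]:
    """Naive parser: trusts that the data matches the code's assumption."""
    fields = line.split(",")
    # Indexing past the end of the list is the rough equivalent of the
    # out-of-bounds read that crashes a program running deep inside the OS.
    return [fields[i] for i in range(EXPECTED_FIELDS)]

def parse_record_defensively(line: str) -> list[str] | None:
    """Defensive parser: checks the data before using it."""
    fields = line.split(",")
    if len(fields) != EXPECTED_FIELDS:
        print(f"Unexpected field count {len(fields)}; skipping this record")
        return None
    return fields

record_with_20_fields = ",".join(f"value{i}" for i in range(20))
parse_record_defensively(record_with_20_fields)  # logs the mismatch and carries on
parse_record(record_with_20_fields)              # raises IndexError, the toy version of the crash
```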
Hannah Clayton-Langton (07:28):
And so, because it's so deeply embedded, it wasn't just like 'error, please restart' or 'error, couldn't read file type'. You ended up sort of tripping the whole system, and blue screen of death, right, means you can't use your computer?
Hugh Williams (07:42):
That's right, exactly. Because if something like Word had a problem like this, Word would crash.
Hannah Clayton-Langton (07:47):
Yeah.
Hugh Williams (07:47):
And you'd say, huh, Word's crashed. I'll try starting it again. Huh, it keeps crashing. Maybe I'll try downloading a new version, or I'll wait till tomorrow, till Microsoft updates it. But because this Falcon software runs deep inside the operating system, this error actually took down the operating system, and so all these blue screens of death started happening. So CrowdStrike folks deploy this file and they basically
(08:08):
shut down every Windows machine that this software is installed on. They all get the blue screen of death. Of course, what happens after the blue screen of death is a lot of folks will try and reboot the machine. So they say, oh, reboot. But the problem was, when it booted back up, the same thing happened again.
Hannah Clayton-Langton (08:24):
And every Windows system across the world that had Falcon installed basically went black, or went blue.
Hugh Williams (08:32):
And was unusable and would not boot up again.
Hannah Clayton-Langton (08:35):
And so, in case any listeners don't know exactly the incident we're talking about, I remember it because it was like British Airways went down, like healthcare systems were going down and people were really in crisis. Like, the first thing I remember is my WhatsApp group chat lighting up and everyone saying it was this sort of huge international hack. Obviously it wasn't a hack, but that's sort of, you end up in a
(08:56):
panic state when everything starts going down around you, right?
Hugh Williams (08:58):
Yeah, one of my good friends always says you have to choose between conspiracy and incompetence. Pick incompetence every time.
Hannah Clayton-Langton (09:04):
But
conspiracy is much more
interesting.
Hugh Williams (09:06):
Conspiracy is
much more interesting.
Hannah Clayton-Langton (09:07):
Okay, and so I think I also read two facts that I found interesting on this one: 5% of all flights that day didn't go, which is a huge proportion of flights globally if you think about the economic impact, which I'm sure we'll talk about later, and it was the single biggest outage in the history of computing and IT. So really, really bad day.
Hugh Williams (09:28):
I think some countries have even tried to compute how many billions of dollars of damage this caused. I think you'd need a major consulting firm to figure it out, because of the cascading issues of all the economic damage that it did, but it's probably economically the most significant outage that there's ever been.
Hannah Clayton-Langton (09:42):
Yeah, and, as you say, it was a CrowdStrike error. But of course most of the companies affected by this basically didn't have a plan for if things went down. They sort of trusted that it would work 100% of the time.
Hugh Williams (09:57):
Yeah, which is pretty naive, right? So if we were, let's not pick on any particular airline, but if we're a major airline, you know, one of the top companies in the world, hundreds if not thousands of planes in the air, you would expect, I think, your chief information officer, or whoever runs your technology team, to probably have some processes that make you resilient against these kinds of
(10:19):
issues, right? Like, it is possible that Windows gets into a state where it continually reboots, and so you would think that they'd have some process where they could, you know, remotely re-image the machine with a safe version from last week or whatever it is, or, you know, there'd be processes in place that could actually get you back into a known state and you could recover from. But I think just about every company that went down kind of
(10:41):
pointed, quite reasonably, I guess, at CrowdStrike and said, what have these folks done to us? Some folks also pointed at Windows and said, hang on a minute, like, why is this software effectively running as part of Microsoft Windows? Like, really, you're letting that happen? So there was a bit of pointing going on, but I'm not sure quite enough companies looked at themselves and said, hang on, you know, we're
(11:01):
responsible for providing this service. You know, why aren't we resilient enough against these kinds of issues?
Hannah Clayton-Langton (11:06):
Yeah, that makes sense.
And if we take a step back to bugs more generally for a second, or unintended consequences of a rollout: at work, I often see us talking about rolling back the change. So you deploy something, it doesn't work as expected, and you can do two things. You can fix it really quickly or you can roll back, and
(11:26):
sometimes, if it's in a real, like, panic state, it's just like the quickest thing is going to be to roll back, which I think essentially is like hitting undo. Right, you just go roll back the change, get things to how they were before. But it sounds like that wouldn't have worked in this instance.
Hugh Williams (11:43):
That's right, and I think, you know, there's different levels, I guess, of how much control you have. So if we're working at a major internet company, let's go back to Instagram.
We're deploying our software on machines that we control, so we can be a little bit more free and loose, right, because if we mess something up, these are machines that we control. We can rectify whatever occurred on those machines by rolling back or fixing forward, and we'll talk about that a little bit more, I'm sure, in a moment. But remember, this is a situation where this company is
(12:07):
putting an update out there, and every Windows machine out there that's running this software is effectively sucking that update down onto that machine, and CrowdStrike doesn't have access to the machines.
Hannah Clayton-Langton (12:19):
So it's
like a one-way street.
Hugh Williams (12:20):
It's a one-way street, so you would think, in this situation, that, you know, the bar, if you like, for the quality of the updates and the care that needs to be taken needs to be very, very high, because it's a one-way street.
Hannah Clayton-Langton (12:32):
Well, that's my next question, because I did some research on this ahead of this episode and I read something that said they'd only tested the happy path. And I wanted to bring up the happy path for the non-technical listeners, because I find it to be quite a neat concept. So, from what I understand, the happy path is when everything works, so you're basically testing that the code you're shipping performs as expected in a situation where
(12:55):
it's basically encountering everything as it should be working: the happy path. And it sort of makes sense to me that you would want to test a few of the less happy paths, because, you know, things happen. And I read that that was one of the sort of diagnoses as to what went wrong: they hadn't tested this code rollout in a situation where everything around it wasn't functioning as expected. Is that right?
Hugh Williams (13:13):
That's fair, that's fair. And, you know, if I go back to the mid-2000s, when I was at Microsoft, I mean, we had software engineers and we had software engineers in test. So then there were two separate disciplines. So the software engineers built software and the software engineers in test would try to break software. It's quite different DNA, actually. I think people are born as builders or breakers, and the folks who end up in the breaking half of the house are pretty
(13:36):
special people. I remember, you know, being at Microsoft, catching up with one of the software engineers who was in test, and we were going to go down to the cafe and have lunch. True story: this person put a book on top of their keyboard, and I'm like, why'd you put a book on top of the keyboard? And they're like, I just want to see what happens if, you know, random characters get entered into this form for an hour, and see what breaks. And then off we went, and I think
(13:56):
it's a special kind of DNA, right, to sort of have that mindset of, I will just do things to try and break things. And so you've sort of got to have two halves of this story, right. You've got to have people who build, and people who build don't necessarily think as clearly about breaking as the second half, which are the people who break. You know, I grew up as a software engineer who builds things, so I wouldn't call myself an expert in breaking
(14:20):
things. But the folks who do the breaking, you know, treat this very much as a discipline. So let's imagine you and I are building a calculator. The folks who are empowered with breaking the calculator are going to do all sorts of funny things to our calculator. So the first thing they're going to do is try dividing by zero. So does dividing by zero cause the calculator to crash, or does dividing by zero cause it to come up and say, oh, undefined,
(14:42):
if you divide something by zero? So they're going to do things like that. You know, they're going to type in numbers with a decimal point but no numbers after the decimal point, and see what happens. You know, they're going to try multiplying really, really large numbers together that can't be displayed on the calculator and say, well, what happens when the number's too big? So they're going to think of all the things that are outside
(15:03):
the happy path, right, of just normally using a calculator, and they're going to build software that exercises that path. And then, when the calculator breaks, a well-run company will say, awesome, we found a bug; this isn't a bad thing, it's a good thing. And then they'll file the bug in some system and, you know, that will ultimately get rectified. But these people really are thinking about breaking things.
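A minimal sketch of that breaker mindset, written as Python unit tests against a hypothetical divide function; the function and the behaviour it is expected to have are assumptions for illustration.

```python
import unittest

def divide(a: float, b: float) -> float | str:
    """Toy calculator operation: returns 'undefined' rather than crashing on divide-by-zero."""
    if b == 0:
        return "undefined"
    return a / b

class CalculatorBreakerTests(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(divide(10, 2), 5)

    def test_divide_by_zero_does_not_crash(self):
        # The breaker's first move: feed it the input the builder forgot about.
        self.assertEqual(divide(1, 0), "undefined")

    def test_decimal_point_with_nothing_after_it(self):
        # "3." is a number with a decimal point but no digits after it.
        self.assertEqual(divide(float("3."), 1), 3.0)

    def test_numbers_too_big_to_display(self):
        # Huge inputs overflow to infinity in floating point; the program
        # should report that rather than blow up.
        self.assertEqual(divide(1e308 * 10, 1), float("inf"))

if __name__ == "__main__":
    unittest.main()
```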
Hannah Clayton-Langton (15:24):
Okay, and how, like, in a software company, because you've worked in, obviously, some quite well-known software companies, how long does that testing phase take? A couple of weeks?
Hugh Williams (15:31):
Longer. So if you sort of go back to our product management episode, where we sort of talked about waterfall and we talked about the variants of agile: if this were a waterfall company, then there is a testing phase. We're going to cost that out like we're costing out the building. So we're going to say, well, what are all the scenarios that we want to test? If we're running a more agile process, you know, we're running sort of these one or two week or four week sprints, then testing
(15:51):
sort of these one or two week orfour week sprints then testing
is going to be part of thosesprints, right?
So we're going to build somefeatures that are part of our
product and then we're going totry and break those features as
part of the product and whenthat's finished, then we'll say,
okay, crash all the bugs, andthen the software is ready to go
.
So it's just going to be partof these very, very short cycles
too.
Hannah Clayton-Langton (16:11):
Let's talk a bit more about the breakers, like, breaking software. Okay, this is sort of a coordinated function set up by a business. Have you ever seen that done in a particularly interesting or useful way?
Hugh Williams (16:24):
Yeah, a couple of stories for you. So maybe we could talk about chaos monkeys and chaos engineering, which sounds like it's a fun topic.
Hannah Clayton-Langton (16:29):
Chaos
monkeys, yeah.
Hugh Williams (16:30):
Another thing we could talk about is fuzz testing. Okay, where do you want to start?
Hannah Clayton-Langton (16:33):
Chaos
monkeys.
Hugh Williams (16:34):
Chaos monkeys. So imagine, I want you to sort of get our listeners to kind of close their eyes and just imagine. Imagine there's monkeys let loose in a data center full of computers, and these monkeys' job is to run around randomly turning off computers. So imagine that as a concept. That's actually an idea that somebody at Netflix had in the early 2010s, and so the idea was: why don't we build software
(16:57):
that randomly turns off computers, and then we'll make sure that our systems are resilient against that happening? Because if you go and look at a modern data center, whether it's a Google data center or a Microsoft data center or an Amazon data center, they're made up of very cheap computers. In the old days we used to have very reliable mainframe computers. Today we have very cheap computers that are very
(17:19):
unreliable. So somebody at Netflix said, huh, I guess we should build software that's resilient against these machines effectively being turned off. And so they wrote some software that would randomly turn off random computers at random times, and the expectation was that the engineering team built software that was resilient against that. So there was a big AWS outage, an Amazon AWS outage. I think it was in 2015.
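Netflix's real Chaos Monkey is an open-sourced tool that works against actual cloud infrastructure, but the core idea Hugh describes fits in a few lines. This Python sketch just picks a random 'server' from a list and pretends to turn it off; the server names and the terminate step are stand-ins.

```python
import random
import time

# Stand-in for the fleet of cheap, unreliable machines in a data centre.
servers = [f"server-{n:03d}" for n in range(1, 101)]

def terminate(server: str) -> None:
    # In the real tool this would be a call to the cloud provider's API
    # to stop the instance; here we just print what would have happened.
    print(f"CHAOS MONKEY: terminating {server}")

def chaos_monkey(rounds: int, probability: float = 0.1) -> None:
    """Each round, maybe kill one randomly chosen server, then wait a bit."""
    for _ in range(rounds):
        if random.random() < probability:
            terminate(random.choice(servers))
        time.sleep(1)  # the real thing runs continuously in the background

chaos_monkey(rounds=5)
```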
Hannah Clayton-Langton (17:40):
And AWS is the cloud computing provider that probably supports a lot of Netflix, yeah?
Hugh Williams (17:44):
Netflix runs on
top of that.
Yeah, when that outage happened, it was chaos.
So all over the globe, lots ofthese companies are built on top
of AWS.
Hannah Clayton-Langton (17:55):
And so
suddenly, all of your favorite
services stop running.
Most tech companies, right, are built on top of AWS.
Hugh Williams (17:58):
And guess who didn't go down? Netflix. Because they'd now had five years of history of these chaos monkeys turning off computers, and so, guess what, they're really good when data centers went down. Sure, yeah. They then had other chaos ideas. So, what happens? Why don't we go and fill up disk drives? We'll have a chaos engineering tool that goes and just randomly fills up disk drives, and then we'll see what happens.
Hannah Clayton-Langton (18:18):
What
does that mean in practice?
Hugh Williams (18:20):
That means you can't save anything to the machine. So suddenly the machine is full, has no further capacity. So now what? What do we do now that the computer is full?
Hannah Clayton-Langton (18:29):
And have other companies followed suit? Like, have they now set the standard for this chaos engineering?
Hugh Williams (18:33):
Yeah, they actually open sourced it, which means that they made all their chaos engineering tools publicly available, which is super cool. I mean, a great, great thing for a company to do. Good publicity for them, right?
Hannah Clayton-Langton (18:43):
Makes it easier for them to hire engineers. Well, their whole thing is, like, agile, and they sort of lead the way, right?
Hugh Williams (18:47):
So I guess it fits with their brand. Yeah, and they had all this stuff around how you could take unlimited leave.
Hannah Clayton-Langton (18:51):
Yeah, I've read the book No Rules Rules. Yeah, yeah, so they made it publicly available.
Hugh Williams (18:57):
It's open sourced, and then you can actually go and use these chaos engineering tools now from Netflix, and so I think that's probably lifted the resilience of the whole of the internet now. Wow. Yeah, which is amazing, which I guess everyone stands to benefit from, right? Yeah, absolutely. And I think, you know, engineers, if you go and talk to an individual engineer, they're motivated by that stuff, right? Like, they like to help other
(19:17):
engineers. You know, as a sisterhood or brotherhood, I think, between engineers and engineering leaders, you know, we're all largely doing the same thing, and I think helping each other is something that most folks are pretty interested in.
Hannah Clayton-Langton (19:28):
And so
what's fuzz testing?
Is that a similar concept?
Hugh Williams (19:31):
Yeah. So fuzz testing would have helped the folks at CrowdStrike, for sure. So fuzz testing is basically: generate lots and lots of random data and see what happens. So maybe let's put this in the context of Microsoft Word or Microsoft Excel. So imagine that on the disk drive on your computer there's all sorts of Word documents appearing that are fictional,
(19:53):
right? So they're not structured in the way that a Word document should be, so there's something broken about them. Maybe the table or the heading has a bug in it that might cause Word to crash. So instead of the file being properly formatted, it's got a formatting issue. And then, when you try and open it, what happens? Does Word gracefully deal with that, or does Word crash? And so if you generate enough of these sort of fictional, fake
(20:17):
files, you might find some issues with Word when it tries to open those files. So imagine we're now at CrowdStrike with this Falcon product. We would have been generating hundreds, thousands, tens of thousands of different files and causing Falcon to open those files, and of course we would have found the kind of issue that they actually found in production.
(20:37):
So it's really just about generating random data and having that data inputted into the systems that we're building. So, a very popular way these days of testing the kinds of issues that the CrowdStrike folks faced.
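A toy version of the fuzzing idea, assuming a hypothetical parse_config function standing in for the code under test; real fuzzers such as AFL or libFuzzer are far more sophisticated, mutating inputs based on which code paths they reach.

```python
import random
import string

def parse_config(data: bytes) -> dict:
    """Hypothetical parser standing in for the code under test (e.g. a file reader)."""
    text = data.decode("utf-8")                  # raises on invalid UTF-8
    pairs = [line.split("=", 1) for line in text.splitlines() if line]
    return {key: value for key, value in pairs}  # raises if a line has no '='

def random_bytes(max_len: int = 64) -> bytes:
    alphabet = (string.printable + "\x00\xff").encode("latin-1")
    return bytes(random.choice(alphabet) for _ in range(random.randint(0, max_len)))

crashes = 0
for i in range(10_000):
    blob = random_bytes()
    try:
        parse_config(blob)
    except Exception as err:  # any unhandled exception is a finding to file as a bug
        crashes += 1
        if crashes == 1:
            print(f"first crashing input: {blob!r} -> {type(err).__name__}")

print(f"{crashes} crashing inputs found out of 10,000")
```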
Hannah Clayton-Langton (20:49):
So chaos engineering, or chaos monkeys, and fuzz testing are both just about, like, robust testing?
Hugh Williams (20:56):
Yeah, just chaos, chaos. You know, you'll even see, in some companies... you know, I used to work on Google Maps at Google, where we had a room full of every possible piece of hardware that could run Google Maps. So every smartphone you could ever think of, smart watches, every Apple iPhone and Apple Watch, every possible Android device, of which there's, you know, a Wild West. We'd have Google
(21:19):
Maps installed on all of those, and we'd have all of those carrying out certain actions in Google Maps, and then we'd be able to understand if there's any particular issues with Google Maps on any particular devices. So I think these large-scale companies, you know, really are... you know, they've got chaos, they've got random data, they've got, you know, environments where they're constantly testing all of the possible outcomes that customers could have, and
(21:39):
that gives you a lot of, sort of, defence, if you like, against issues arising in practice.
Hannah Clayton-Langton (21:44):
Okay, I've got a few follow-up questions. So, does better code have fewer bugs? Like, is that... if you get a lot of bugs in something, is it a sign that you've done it too quickly or not robustly enough?
Hugh Williams (21:55):
I think that's generally fair. I think, you know, one of two things can be going on if you don't find many bugs. So one is the happy path, which is: wow, we're building really robust software. You know, look at us go, that's fantastic. The other thing that can be happening is we're not testing it well enough, or we've got a culture where bugs are bad. Some companies have a culture where they say, if we find too many bugs, we need to punish some people.
(22:15):
Well-run companies don't do that. Well-run companies say finding bugs is awesome. That means that we're really stress testing things in a way that we should. Our test team's working really, really well, and it's really about making sure that those get dealt with, and dealt with in a really systematic way. So I'm not sure that bug count is necessarily a good measure of quality.
Hannah Clayton-Langton (22:37):
It's an interesting balance between your risk appetite and speed, or risk appetite and innovation.
Hugh Williams (22:44):
I'd say probably a better thing to track is: do you fix the bugs within some SLA? So you're going to have some agreement in your company about how fast bugs should be repaired, and that's probably going to depend on what we call the severity of the bug. So at one end there's probably P3 or P4 bugs, which we just frankly don't care about.
Hannah Clayton-Langton (23:05):
P stands for priority?
Hugh Williams (23:09):
Priority. So really low priority bugs, where you might say, look, it'd be nice if we fixed this at some point, but this is so minor that let's just not enforce an SLA. As you kind of move up the tree, you know, a P2 bug, you might say, look, we have to fix this within a month. A P1 bug, we might say we have to fix it within a day or a week, or whatever it is. And then a P0 bug would be: we're not doing anything until
(23:30):
this bug's fixed. So all tools down, nobody's doing anything until we actually get this thing rectified. Because it's an outage, right? Yeah, or it's so significant that, you know, it's impeding our customers or our users in doing something significant.
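A small sketch of what that severity-to-SLA mapping might look like in code; the priority levels and timeframes are the ones Hugh mentions, and the rest is hypothetical.

```python
from datetime import datetime, timedelta

# P0: all tools down, fix it now.  P1: about a day.  P2: about a month.
# P3/P4: no enforced deadline.  (Timeframes are the examples from the conversation.)
SLA = {
    "P0": timedelta(hours=0),
    "P1": timedelta(days=1),
    "P2": timedelta(days=30),
    "P3": None,
    "P4": None,
}

def fix_by(priority: str, filed_at: datetime) -> datetime | None:
    """When a bug of this priority must be fixed, or None if there's no SLA."""
    window = SLA[priority]
    return filed_at + window if window is not None else None

filed = datetime(2024, 7, 19, 9, 0)  # example filing time
for priority in SLA:
    print(priority, "fix by:", fix_by(priority, filed))
```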
Hannah Clayton-Langton (23:42):
You know, it's a legal issue, or it's an embarrassment to the company, or whatever it is, right? So it sounds to me like, if I'm thinking about my flat as a metaphor, a P3 is like I've scuffed the wall. We might never fix that. And then, like, a P0 is like a pipe is actively flooding, and so we're not doing anything if the pipe's broken.
Hugh Williams (24:04):
I think that's great, okay.
Hannah Clayton-Langton (24:06):
So I'm really interested in when it all goes wrong, like for the CrowdStrike teams on that day in, like, July or August last year. Like, I know that we have on-call engineers. I've got a friend who's a surgeon, and he has the on-call phone, and if someone needs an emergency surgery in the night, can't wait till the morning, he gets called in. And I'm guessing that's pretty similar for the on-call
(24:26):
engineers, right? They're the ones that have to crisis manage.
Hugh Williams (24:29):
Yeah, that's right, and I think that health analogy is pretty good too. It's a nice analog, right? So I think if a patient comes into emergency, if this patient presents in a certain way, let's try steps one, two, three and four, and let's record what happens, and then, if we're not in a happy state after that, then...
Hannah Clayton-Langton (24:48):
If you
can't stabilize, then you need
to bring someone in.
Hugh Williams (24:50):
Yeah, then we go and find a specialist. We bring in the surgeon, you know, admit them to intensive care, whatever it is. So there's a series of activities that are going on, and those activities involve different groups of people. And so I think in tech companies, when something goes wrong, quite often the first line of defense is some type of operations team. They'll get alerted first, but they may find it themselves.
(25:13):
They might see a dip in a graph, or whatever it is, some behavior, and then they'll almost literally pull out the plastic card that says, oh, if this is occurring, then try the following steps. If that doesn't solve the issue, then they're going to wake somebody up, and usually they would wake somebody up who is related to the area where the issue is.
(25:33):
So let's imagine we're running an e-commerce site and customers can't pay for things. So that operations team is going to know that they need to talk to the payments team, and they're going to know who's on call, and they're going to wake somebody up who's on call within the payments team. Now, that person is going to be a random person from the payments team, so they might not know all the subtle details of
(25:54):
how the Mastercard payments work. They'll try a set of activities, and if that doesn't work, ultimately they're going to go find the person who's the expert and get them involved. And so, yep, there's this front line of defense, yeah, there's on-call folks, but ultimately, when things get really serious, the expert tends to get involved in the end.
Hannah Clayton-Langton (26:13):
Okay, and so if you are a software engineer, is it very typical that you'll have to opt into being on call every week or once a month, or is that something that is reserved only for people who volunteer?
Hugh Williams (26:26):
So it's not usually reserved for folks who volunteer. So usually it's something that, you know, has a rotation that goes around and around and around. But I would say that this is probably only common now in sort of your mid-tier, tech-enabled companies. And so if I was the vice president of engineering and I walked into a company in a new job and I found out that we had a pager rotation and that everybody's spending time on
a pager rotation and thateverybody's spending time on
(26:49):
call, I would say our softwareis not resilient enough.
Hannah Clayton-Langton (26:52):
That
means you're expecting something
to go wrong.
Yeah.
Hugh Williams (26:55):
So it's going to
be something about how we've
built the software.
It's not defensive enough.
It's also going to be somethingabout sort of, perhaps our
software engineering processes,where we're not building things
to a level of quality thatallows us all to sleep well at
night.
So if you go to the big techgiants you know Google, or I
used to work or Microsoft you'renot going to find that these
days.
So these days people sleep wellin general.
Hannah Clayton-Langton (27:18):
In general. Well, because I once, many years ago, went out on a date with a guy who was a software engineer at a really well-known tech company that everyone listening, including you, will have heard of. This podcast is taking an interesting turn now, but anyway, he mentioned at the beginning of this date that he was on call, and I was like, well, you can't... can you have a drink? Like, you're on call? And he laughed at me, and he just was like, I can have a drink
(27:38):
and still be on call. And I was thinking, well, if you were a doctor, you definitely couldn't have a drink and be on call. Yeah, I'm not sure. He had quite a few drinks, and I can tell you that, by the end of the night, if something had gone wrong... or maybe his shift had ended, Hugh, I'm not sure about it. This sounds like a mid-tier tech company that's not run super
(27:59):
well. Okay, well, I'll tell you after we stop recording where he was working. But that always fascinated me, because to me that's quite a big responsibility, and I presume people get... do they get paid more to be on call, or is that...?
Hugh Williams (28:12):
Certainly here in the UK, yes. And I've been working with a few companies in the UK, as you know, and folks in some of those companies actually rely on those payments to, you know, make their mortgage and those kinds of things. So there's a real culture of these extra payments and on-call rotations here in the UK. But I'd say that's an unusual thing in, you know, the large tech companies in the US.
Hannah Clayton-Langton (28:32):
Here's another scenario for you, thinking back to CrowdStrike, or maybe we can draw on some of your experiences of outages, because they will happen, that's as sure as anything. Do people pile into a room? How does it work? It's all systems go. What are we doing?
Hugh Williams (28:48):
Yep. So there's almost always going to be what they call a bridge call of some sort. So there's going to be a call that you can dial into, and there will be a set of people on that call. There's probably a communications channel that goes with that, so something like a Slack or a Teams messaging group. So there's going to be a central place where all the key people are having conversations about what they're doing to
(29:09):
rectify the situation. I've found in some of those situations that, as a leader, I've had to take charge of the call, or whatever it is, provide a little bit of structure for the call. But generally, as a leader, you don't know enough details to be able to actually resolve the situation yourself. But it's a great way to kind of listen to the team, understand where we are in the resolution of the issue, and use that
(29:30):
information and report that information out and about to the rest of the organization.
Hannah Clayton-Langton (29:34):
Yeah, because there's a lot of skills that you need in that situation, right? Like, you need someone who's good at communication and structure and who has a sense of, sort of, momentum and urgency. But they might not be the same... it'd be great if they were, but they might not be the same person who's really technically skilled to be able to assess, I don't know, the error messages or the patterns of behavior and think about what it could be that's causing the issue.
(29:55):
And it's sort of like there's this whole crisis element where, maybe, if it's really serious, everyone's in a panic mode. So you need quite a collection of skills to be good at that, right?
Hugh Williams (30:05):
Yeah, absolutely, and I think, you know, a good operations team is good at putting structure around that. You know, when I used to work in tech in the US, often we'd employ people who were ex-US Marines. Wow. So the operations team, you know, used to always call me sir all the time.
Hannah Clayton-Langton (30:19):
Because
they don't panic.
Hugh Williams (30:21):
They don't panic, and they're good at putting structure around things. They weren't the coders? Or they were the coders? No, typically the operations team. So these are the folks who are sitting looking at tons of giant screens on a wall 24 hours a day, looking for changes in patterns or changes in behaviors, and they're the folks who sort of will run these calls, wake up the right people, get the right
(30:42):
people on the call, and put structure around it. But US Marines are very popular in operations teams in large tech companies, for sure.
Hannah Clayton-Langton (30:45):
I once worked with, and I'm going to get this wrong, a guy who was an ex-US Army helicopter pilot. Shout out to Matt if you're listening. And we were working on quite a stressful deal, and he was like, I'm not being shot at, so I'm good. And yeah, he talked me through, sort of, how they managed themselves in, like, real crisis, you know, when they were out on
(31:05):
the field in Afghanistan, you know, securing targets. So I can see that they'd be pretty good to have around in a sort of fake emergency, when some code's gone down.
Hugh Williams (31:14):
And a lot of the builders, you know, the software engineers who are actually building the software, just don't have that in their DNA, right. They're sort of creative types who are a little bit artistic, a little bit scientific, sort of, you know, trying things, building things, playing with data and whatever else, you know, which is a wonderful thing, but they're not necessarily the people who can put structure around a crisis and provide updates. But, you know, if you have a major outage... I worked at eBay
(31:36):
for a number of years. I think six months into my tenure we had a nine-hour outage of the search engine at eBay. Nine hours. Disastrous, oh my God. It cost the company many, many millions of dollars.
Hannah Clayton-Langton (31:47):
So you were... what role at the time?
Hugh Williams (31:50):
So I was a vice
president of engineering and I
was in charge of search.
Hannah Clayton-Langton (31:52):
Oh dear,
you had a bad day.
Hugh Williams (31:55):
Yeah, it was one
of the worst things that could
happen six months into a job.
Hannah Clayton-Langton (31:57):
Six months in, so you were sort of accountable, but you could sort of play the card.
Hugh Williams (32:02):
I got my bonus that year, so I think they didn't hold me completely accountable, but I was probably only two or three months off being completely accountable for it. You know, in that situation, what I found myself doing was listening in to the bridge call.
Hannah Clayton-Langton (32:14):
So the bridge call is the sort of pile-on call?
Hugh Williams (32:17):
That's it, where all the key people have been woken up and they're all actively working on it. I think this thing happened on a Saturday and went through to the Sunday before it was fully rectified. It's a little bit like the CrowdStrike Falcon outage, actually, because the particular issue that happened on the machines in the data centre at eBay actually caused the machines to be so busy running the software that you couldn't
(32:39):
interact with them.
Hannah Clayton-Langton (32:40):
Wow.
Hugh Williams (32:41):
So the CPU usage went to 100%, which means that the computer's not capable of doing anything except the thing it's doing. So you can't type and have the computer recognize the keystrokes. It's so busy doing the thing it's doing. So all these computers went to 100%, which we call pegged. We say the machine's pegged, and so you couldn't interact with the machine.
Hannah Clayton-Langton (33:00):
And is that, sorry, that would be anyone that was on eBay, or that was the servers running eBay, like the computers running eBay?
Hugh Williams (33:07):
So it's the servers running eBay. You can imagine that there's many hundreds of computers that are the search system at eBay, and all of these computers became so busy that they weren't capable of doing anything except being stuck doing this erroneous thing that they were doing. So, quite a difficult situation of how do you fix a computer that won't talk to you. But in that particular situation we had the bridge call.
(33:27):
We had all the key people on the call. You know, they were all going to stay up all night and get this thing sorted out. Around every hour or so, I would either join the call or I'd get an update from the director who ran the search team for me, and he'd tell me what happened in the last hour. And then I'd very literally call every executive at eBay. So I'd call the CEO, I'd call the CFO, I'd call the head of
(33:48):
the commercial team, I'd call the PR team, the comms team. I'd give them, like, the hourly update, I'd tell them what it is we're going to do over the next hour, and then I'd say, look, I'll be in touch in an hour. And of course all these people want to know what's going on. I mean, this is a catastrophic outcome.
Hannah Clayton-Langton (34:01):
And were they, like, effing and blinding at you, like, sort this effing thing out? Or are they like, you're our only hope to get it fixed, so we better be nice?
Hugh Williams (34:16):
Look, I mean, after, you know, after several of those updates, I'm showing that we understand what the issue is, I'm showing that we have a path to recovery, I'm showing that we've got all the right people working on this, and we're going to get there. We're going to get there in the end.
Hannah Clayton-Langton (34:26):
And it's like, we understand that the computer's pegging. But, like, you can imagine, if you can't figure out what's going wrong, there's that panic period until you figure out what it is, where you're just, like, overwhelmed with error messages.
Hugh Williams (34:39):
Yeah, and this particular situation is really hard because, you know, the computer you need to fix won't talk to you. So ultimately what you've got to do is either reboot that computer, so it loses all of the things that it knows about and it's doing, and try and get it back into a state where it's operable, or you've got to create another computer that does the same thing and then take the one that's busy offline
(35:01):
and put the new one online. But it's actually a really difficult situation to sort out.
Hannah Clayton-Langton (35:04):
And once you've sorted it out, what kind of, sort of, post-incident review happens? Like, could people lose their jobs over doing something basically irresponsible or careless with the way they ship code?
Hugh Williams (35:16):
Yeah, great question. I think poorly run companies will fire somebody when they are doing their best, taking risks, trying to really get things done, and they make a mistake for the first time. Like, a poorly run company will fire somebody in that situation. They'll say, well, you know, you made a mistake, you're gone. Guess what happens then: nobody wants to build software anymore.
(35:38):
So everybody's now very, very cautious, goes very, very slowly, doesn't want to do interesting, risky things that could really change the game. They want to just keep their head down so they don't get fired. And so if you have a culture of firing people who make mistakes, you end up with a pretty slow-moving company. So I think that the art here is, first of all, you've got to have what we call blame-free postmortems. So the situation's
(36:02):
over. Great, we've got the service back up and running. Let's sit down and just have a really structured conversation about what went wrong and what are the things that we could do next time to ensure that these kinds of problems don't happen again.
Hannah Clayton-Langton (36:15):
And then
everyone learns something, I
guess right.
Hugh Williams (36:16):
We'll write it up. We'll write it up really well, in proper prose, we'll share it around, we'll talk about it, and we'll sort of celebrate the fact that we're a better company now, because we know we won't make that mistake again. Now, if the same person makes the same mistake after all that process, then I think we have to have a harder conversation. But we should just celebrate the fact that we're pushing the limits, we're being the best that we can be, and we're
(36:38):
learning and we're growing, in a well-run company.
Hannah Clayton-Langton (36:40):
Even if you cause eBay to be down for nine hours, or even if you downed half the internet with the CrowdStrike issue, you think that's still sort of like a lessons learned, as before?
Hugh Williams (36:49):
Look, I think that the CrowdStrike Falcon issue... I think there's a lot of really bad things that happened there that show that that team wasn't a well-run team. And so I think that one probably has to go a little bit further.
Hannah Clayton-Langton (37:02):
Well, and there's a whole sort of external lens here, right? If you're a public company, or you're a security company like CrowdStrike, and you down half the world, then you don't seem like that secure an option anymore, right? And I think their market cap and their share price massively dipped as a result. So, got a lot to do to clean up after those sorts of incidents, right?
Hugh Williams (37:22):
Yeah, and I think that's, you know, that's an 11 on a scale of one to 10.
Hannah Clayton-Langton (37:24):
That, you know... like, I think, yeah, well, the worst in history, maybe not the best example to use, yeah.
Hugh Williams (37:31):
Yeah, but I think, you know, there's a lot of really hard questions to ask. I mean, think of two or three questions that you could ask. You'd say, did they test it?
Like, did the engineer and the test engineer actually test this thing? Like, did they actually try deploying this file onto a machine, or a couple of machines, and start those machines up and see what happens? I don't understand how they didn't have some kind of test client, test environment, progressive rollout, you know.
(37:52):
So if this was me, I would say, well, you know, we'll deploy it to ourselves first, right? So we're CrowdStrike, we're obviously running CrowdStrike Falcon on our machines. Let's deploy it to ourselves and see what happens. And then we probably would have taken down our company, but not every company, and then we could have gone, okay, you know, big mistake made, done the postmortem, been smarter, and not taken down the whole of the internet.
Hannah Clayton-Langton (38:14):
Well, progressive rollout is something interesting that I don't think we touched on, which is essentially as it sounds, right? Like, you might start with yourselves, and then you start with a small tranche of customers to make sure that things behave as expected. Is that the same as something I've heard called canary testing? Is it called canary testing?
Hugh Williams (38:35):
Yeah, yeah, it is. I mean, let's just talk it through. eBay's got its own, what we call, environment, and there's probably lots of these environments for the engineering team. So the engineering team can stand up a mini eBay and they can just test it themselves.
Hannah Clayton-Langton (38:46):
So it's like a simulated version of your external product, where you can test things in isolation?
Hugh Williams (38:52):
Yep. So you've got your own one, completely harmless. It's probably got some slightly fictional data in it. It's probably a scale replica; it's not quite as big as the whole system. Your testing team's probably got its own environment the engineers are never allowed to touch. So they say, okay, when the engineers are done with their sort of fiddling around, they give it to us and we actually test it, and they'll test for the functionality.
(39:12):
They'll also test things like load. So they'll say, can the system handle the load that we'd normally expect?
Hannah Clayton-Langton (39:19):
The traffic load, you mean there, right? The number of users? Yep, yeah.
Hugh Williams (39:22):
So they'll do all sorts of testing, and then there's probably what's called a pre-production environment, which is an environment where you put the next version that's going to go out to customers. Like a beta, almost? Yeah, exactly, exactly. Sort of pre-release, so pre-beta, and then eventually it'll actually go out to production. And when it goes out to production, coming back to the canary idea, we're going to slowly roll it out.
(39:44):
So the first thing we'll do is we'll make it available if you know some trick, right? So we'll get it running, but you can only use this version if you maybe put some extra characters in the URL in your browser. And then we'll say, okay, let's turn it on for 1% of customers. So we'll select a random 1% of customers and they'll get a consistent experience that is this new experience.
(40:04):
If that goes okay, then we might ramp up to 2%, 5%, 20%, 50%. So if your 1% of users crash, then you don't roll out to the other 99%? Yep, you say, well, something's gone wrong here, and that's pretty unusual, right? Because we've now gone from an engineering environment, to a test environment, to a pre-prod environment, out to production. If something odd happens there, it's probably related to real
(40:26):
user interaction, or real scale with real users, or real user data, or something that's difficult to simulate, like payments, for example. It's probably going to be something odd that's happened that's only going to happen in production. But again, if you're very careful with this, you've got this sort of clear path for releasing software, then generally the problems that you have aren't catastrophic.
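One common way to implement the percentage ramp Hugh describes is to hash a stable user ID into a bucket, so each user gets a consistent experience as the percentage grows. This Python sketch is illustrative, not any particular company's system; the feature name and user IDs are made up.

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Deterministically place a user in the first `percent` of buckets for this feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000   # 10,000 buckets gives 0.01% granularity
    return bucket < percent * 100       # e.g. percent=1 -> buckets 0..99

# Ramp plan: 1%, then 2%, 5%, 20%, 50%, 100% if each step looks healthy.
for percent in [1, 2, 5, 20, 50, 100]:
    enabled = sum(in_rollout(f"user-{n}", "new-checkout", percent) for n in range(100_000))
    print(f"{percent:>3}% target -> {enabled} of 100,000 users get the new experience")
```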
Hannah Clayton-Langton (40:49):
And so if you're working in an agile development environment, you're shipping code, like, potentially every day, so that must make it quite chaotic to test simultaneously and roll out progressively many different things at once. Is that how you end up with problems where you just, you change too many factors at once?
Hugh Williams (41:07):
Yeah, I think that's it. And look, you know, I think a test team would, in an ideal scenario, like everybody to deploy, get their next release ready, and then that's in a test environment, and then that test environment's stable for a good amount of time so they can do all their best work, and then that, you know, sanely moves to pre-production and out to production. But the reality today is that, you know, really well-run
(41:30):
technology companies that are at scale are probably releasing tens, hundreds or thousands of different pieces of software a day. So the test team doesn't quite have the luxury of a stable version of Amazon or eBay or Google or whatever it is, because this thing's just moving the whole time.
Hannah Clayton-Langton (41:46):
But I guess that's another one of those, like, balancing acts, right? Because you don't want to be slow, you want to be quick and reactive and agile. I guess that's where the name comes from. It's probably worth the trade that, like, in a minority of instances, you've changed too many things at once and then you react, rather than trying to plan for perfection and then
(42:08):
basically slow yourselves down, right?
Hugh Williams (42:11):
Yeah, that's it. That's absolutely the argument for these continuous releases: that you can move really quickly. Look, and I guess the argument also would be, if you only change a small thing, the chances of it being catastrophically large or difficult to rectify are low.
Hannah Clayton-Langton (42:25):
And I presume you'd rather move forward, unless you're in, like, a panic state and the easiest thing is going to be to just click undo. Yeah.
Hugh Williams (42:32):
And I think, you know, modern software engineers today would say, I'd rather fix forward. I've got V1 out there, and then I make a change, it's now V2, and then V2 has something wrong with it. I'd rather quickly go to V3 than go all the way back to V1. But if the failure is catastrophic, so the whole site's down, I'm like, well, I don't know what to fix in order to go forward to V3. So I'm going back to V1 as fast as I can, if I possibly can.
Hannah Clayton-Langton (42:55):
Yeah, well, yeah, and I think we said earlier that CrowdStrike couldn't go back. So they had to figure out something forward. Okay, and are there different ways that different companies approach outages or bugs? Like, it sounds like, if you're super cautious, and maybe in certain industries that's the right thing, I'm thinking like banking or security, you want to be a bit more cautious, but are there
(43:17):
different ways that you can anticipate these things happening, that sort of match the industry?
Hugh Williams (43:21):
Yeah, I think you're spot on. I mean, I think you have to assess what is your regulatory environment, your legal environment, your risk appetite, your governance environment, the state of your customers, what your customers would expect. I think you have to assess all of those things. It's a bit qualitative, a little bit subjective, and then you can kind of set the dial in the right place as to how fast you want to move versus how safe you want to be, and those are
(43:44):
enemies of each other, right? So if you and I work in a tech company and we employ a chief security officer, the ideal situation for the chief security officer is we never change anything, ever. System's completely secure. We don't want all these engineers releasing software. I mean, that could cause all sorts of trouble. But that obviously is not what we actually want. And so I think, you know, security folks, sometimes legal folks, if I was going to pick on them, really want us to do
(44:06):
nothing. Engineers want to, you know, let loose: I want to build stuff, get it out as fast as I can, I hate bureaucracy, all this testing stuff's overrated, like, let's go, go, go.
And so I think, as a leader, you've got to really... you've got to really set the dial to the right point, and I think that's not just an engineering and product conversation. You know, that's a: who are our customers?
(44:27):
What's the regulatory environment? You know, banks are a great example. I mean, if you continually break things and take risks and don't follow all the regulatory guidelines, I mean, the banking authorities will actually put people in your building who watch what you are doing. So I know one of the sort of neo-banks in the UK currently has, you know, government people in their building looking at everything that they're building and making sure that they start
(44:49):
to comply, because they've just moved a little bit too free and fast and loose in an industry that, you know, I guess...
Hannah Clayton-Langton (44:55):
You've
got a lot to lose if the banks
go down.
Hugh Williams (44:57):
Got a lot to lose, and, you know, I guess it's a very, very highly regulated industry with a lot of controls, right, because there's things like, you know, money laundering, fraud, crime, all these kinds of things that need careful governance and control. And of course, you know, obviously people don't want to lose their money, and, you know, the government doesn't like it a whole lot if you're too free and loose in the banking environment.
Hannah Clayton-Langton (45:16):
Okay, and let's talk a little bit more about when the problems get through. Like, you had the eBay screw-up. Were there any other, like, big outages that happened on your watch anywhere?
Hugh Williams (45:25):
Yeah, look, I had a pretty interesting issue when I was working on Google Maps. It was pretty problematic, actually. It made it into the Guardian and a whole bunch of other newspapers. Yeah, so what happened? Let me maybe just give you a little bit of setup first, before I tell you the particular issue.
Hugh Williams (45:54):
So the setup is: there's a team in my organisation, and their job was to pick, from all the possible labels of all the possible points of interest, which labels to show on the map at any particular zoom level. Because sometimes, if I open a map, it will show you, like, certain cafes or certain restaurants or certain landmarks, but not all of them, right?
Hugh Williams (45:58):
Not all of them
right.
And if you zoom out enough, youstart seeing the labels of
states or counties.
You zoom out even further, yousee the labels of countries,
right, maybe continents, thesekinds of things.
And the more you zoom in thenyou'll start to see post office
boxes and labels for shops andall these kinds of things.
And of course you know, ifyou're trying to zoom in a long
way, you've got to really thinkabout which labels to actually
(46:19):
show.
And so some context about youas a human where have you been
before?
What are your interests?
Would help Some context around,sort of what are other people
interested in?
All these things would be veryhelpful in choosing which are
the labels that are most likelyto be useful to you as a human
right.
So there's, a team of engineersthat work on this.
It's actually a really hardproblem.
If you're ever on a plane andyou're watching the map on the
plane, it does a terrible job ofit.
(46:40):
It will serve these randomcities and random trenches in
the ocean.
Not super useful it's not reallygreat geography education it's
because their label selection ispretty poor.
Okay, Right, so it's a hardproblem.
We had a bug.
Bug was a pretty simple bug andthe issue was at certain Zoom
levels we were selecting thewrong labels.
Simple as that.
(47:00):
Right so simple bug.
You know, supposed to beshowing a certain set of labels,
we're not showing the right setof labels, and the particular
manifestation of this bug wasthat the labels West Bank and
Gaza were removed from the mapof the Middle East.
Hannah Clayton-Langton (47:16):
Okay, so
that's like a political
statement without meaning to bea political statement, and so
the Palestinian Authoritynoticed this.
Hugh Williams (47:24):
They actually jumped to a bigger conclusion, which was: Google has removed the label Palestine from the map. We'd actually never labeled Palestine.
Hannah Clayton-Langton (47:34):
Yeah.
Hugh Williams (47:38):
You know, I can't comment on those kinds of issues, but I know the US doesn't recognize Palestine, and there's a whole bunch of issues that are way above my pay grade.
Hannah Clayton-Langton (47:43):
Yeah, yeah, yeah.
Hugh Williams (47:46):
We'd actually never labeled Palestine. What we had done was accidentally, with a bug, remove the labels West Bank and Gaza, and so this caused a major international incident. We fixed the bug.
Hannah Clayton-Langton (47:58):
The labels came back, and I presume that was happening in loads of locations, not just in that bit of the Middle East, but it just so happened that where it's sensitive it sort of surfaced much more quickly than if it was just me.
Hugh Williams (48:09):
Exactly, yeah, exactly, and quite rightly. There's lots of folks who are very, very sensitive about what labels appear, and, you know, Google... certainly, when I was there, I was having lots of conversations with governments and authorities around the world about what labels were present. We often got lots of requests to take down things, to fuzz out certain images of certain installations, or whatever it is.
(48:31):
So there's a lot of sensitivity around what exactly is Google showing and not showing. And because Google Maps has over a billion monthly active users, there's a huge user base out there looking at the product and reacting to anything that the product does. So, yeah, that happened on my watch. It was an honest, simple mistake in our label selector
(48:51):
and we fixed the bug.
We rolled forward, we moved on.
Hannah Clayton-Langton (48:53):
Yeah, wow. It's crazy to think about the impact of some of this stuff, particularly when you think it's a free service on your phone, and, like, all of these processes and teams and, you know, communications with governments have to be put in place for something that I would be offended if I had to pay Google, like, pounds to download on my phone. I use it every single day.
Hugh Williams (49:14):
Yeah.
Hannah Clayton-Langton (49:15):
Okay, so
basically, screw ups happen.
Anything else you think we needto cover when it comes to
outages and bugs?
Hugh Williams (49:22):
I think we've done a pretty good job. The one thing I would say is, if I was giving some advice to the listeners about how to think about it within your companies, I'd say, look, engineering is about making the trains run on time. Right, it's a lot about process, it's a lot about structure, it's a lot about rigor, and that sounds really boring, but if you get that right, then that'll free the company up to really
(49:43):
move fast and build great software. And so I think taking these kinds of topics really seriously, treating them with the importance that they have, making it sort of part of the fabric of how the company works, will mean that ultimately you can do a lot more as an organisation. So I'd say, always invest in this stuff.
Hannah Clayton-Langton (49:59):
Well,
that has been the Tech Overflow
podcast.
I'm Hannah.
Hugh Williams (50:03):
And I'm Hugh. If you'd like to learn more about our show, you can always visit us at techoverflowpodcast.com.
Hannah Clayton-Langton (50:10):
We're on
LinkedIn, Instagram and X as
well, so Tech Overflow Podcast.
Hugh Williams (50:19):
Yeah, and we'll link in the episode show notes a whole bunch of resources that you'll find useful, as always.
Hannah Clayton-Langton (50:22):
As always. Okay, great. Well, looking forward to recording with you again, probably virtually, next time.
Hugh Williams (50:27):
Yeah, that'd be a shame. Being in person has been so, so awesome.
Hannah Clayton-Langton (50:30):
Yeah,
it's been awesome.
I need to get to Australia more.
Hugh Williams (50:32):
Yeah, you should. You should just move there and we can just take this podcast seriously.
Hannah Clayton-Langton (50:35):
Well,
I'll pick that up with my
husband.
Okay, thanks so much, Hugh.
I'll talk to you soon.
Hugh Williams (50:39):
Thanks, Hannah. Bye, take care.