Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Courtney (00:00):
Greetings, fellow incident nerds, and welcome to season two of the Void Podcast. The main new thing for this new season is we're now available in video. So if you're listening to this and prefer watching me make odd faces and nod a lot, you can find us on YouTube. The link is in the show notes.
The other new thing is we now have sponsors.
(00:22):
Don't worry, these folks help make the podcast possible, but they don't have any say over who joins us or what we talk about. So fear not. This episode's sponsor is Uptime Labs. Uptime Labs is a pioneering platform specializing in immersive incident response training.
Their solution helps technical teams build confidence and expertise through realistic simulations that mirror real
(00:44):
world outages and security incidents. While most of the investment these days in the incident space goes to technology and process, Uptime Labs focuses on sharpening the human element of incident response.
All right.
In this episode, we talked to Simon Newton, head of platforms at Canva. They just published their first public incident report.
(01:05):
It's not their first incident, but it's the first time we get to delve into the details of what happened, and we really appreciate them sharing that with us and the world. So let's get into it and talk to Simon.
Simon, thank you so much for joining me on The Void Podcast. Will you do our listeners a favor and introduce yourself and tell me where you work and what you do?
Simon Newton (01:26):
Hi. I'm very happy to be here. I'm Simon Newton, head of platforms at Canva. I've been there about three years or so, and platforms at Canva consists of a bunch of different areas. Most interesting for today, that includes the teams that run our edge and gateway, the front door into Canva, as well as the teams that manage all of our cloud
(01:49):
resources. So yeah, Canva, in case you haven't heard of it, is a visual communication company. What we do is make it really easy for people to express their creative ideas to others using visual content. We have a bunch of different products built into what we call the editor. So we have whiteboards and presentations and responsive
(02:11):
docs and websites, and drawing tools for making social media posts. So yeah, there's a huge amount you can do.
Courtney (02:20):
Yep. That's how I'm gonna make the social media posts for this podcast. So, thank you all very much at Canva. That's why I was very excited, not for you really, but when I saw this incident report. I'm in a Slack group, this resilience and software group, and we have a channel that's just called cases.
(02:41):
And honestly, I could just sit and watch that all day. Sometimes we find out because things happen to us, but I found out about this incident report, and the outage, which I did not experience personally, from the Slack group that I'm in. And I think I also noticed that it's the first one you've ever
(03:02):
done, right? I don't recall ever reading a Canva incident report before. So you've had, and in this report I think you say, we're gonna get into the incident, but you've had an internal incident process since 2017, and I'm always so curious about what it is that pushes an
(03:22):
organization over the edge, or, I mean, I don't want it to sound so negative, but what brought you to the point where you said, okay, we're gonna go to the effort of publishing our first public incident report?
Simon Newton (03:33):
Yeah. And it is a bit more effort, right? We've been publishing incident reports internally, like you said, for many years, but there is a difference between a report for internal consumption and a polished report for external consumption. So in terms of the why behind it, it's really a reflection of how we're evolving as a company. Canva got its start with a bunch of
(03:57):
individuals and small businesses. Those were typically the user groups, and they'd be creating social media content and marketing for their small businesses, that sort of material. But in the last couple of years we've really seen increasing adoption within enterprises, and enterprises bring their own new and different set of customer requirements
(04:18):
and customer expectations. And so it's partly us wanting to show our commitment to transparency to those enterprise customers, and also because we believe, and I very much believe, that being open about incident reports does benefit the broader industry. And I think it's...
Courtney (04:36):
I agree.
Simon Newton (04:37):
...important, yeah. And I love what the VOID is doing here. My take is that software engineering generally, as a field, does a fairly poor job of learning from previous failures. If you look at other engineering fields, they're often more highly regulated and there are processes around this where failures are
(04:58):
understood and investigated. But in software engineering we seem to make the same mistakes many times, right? There isn't that learning as a broader field. So yeah, I'm very keen on the VOID's efforts to see improved education around this, so that we can uplift the industry as a whole.
Courtney (05:21):
Thank you for the shameless plug. I appreciate it. I wanna ask you the same question I ask everyone at the start of this podcast, which I think is cruel, and yet I do it every time. Can you give a brief summary of what happened?
Simon Newton (05:33):
I will try and keep it brief. As I'm sure you are aware, the large and interesting incidents happen because of a confluence of factors, right? So in this particular case, there were multiple contributing factors that lined up in a way that caused this much, much larger incident. But maybe a little bit of background first.
(05:54):
So the Canva editor is a single-page app. We deploy that multiple times a day, and amongst other things, as part of that deploy we build the JavaScript assets and publish those into an S3 bucket. And so then when clients reload the page, or the editor self-updates, those clients go and download the new assets
(06:16):
via our edge provider. And once those assets are loaded and the editor is functional and running, it starts making API calls to load in all the content. Those API calls also go via our edge provider, to what we call our Canva API gateways, and then those gateways route the requests to the various services within our systems, which handle them.
(06:40):
And I guess the other key bit of information is that those gateways run as auto-scaling groups within our cloud provider, AWS. So in this incident, the first factor that contributed towards it was we did a deployment of the editor at the same time that there were network issues occurring within our edge provider.
(07:01):
Now normally, these network issues happen all the time, right? Like the internet is a messy place. This is happening day in, day out.
Courtney (07:09):
Yep.
Simon Newton (07:10):
And so normally what happens is our edge provider has automated software that can detect these issues and can mitigate them and route around them. Unfortunately, in this particular case, and again we only learned this later, they had a bit of stale manual configuration in place that had prevented that automation from running. So the first contributing factor is there's a network issue, and it
(07:32):
hasn't been mitigated automatically.
Courtney (07:37):
As often happens with those things, unfortunately. Yeah. Yep.
Simon Newton (07:42):
And so then maybe the next bit to discuss is how the edge proxies work, right? You've got a lot of clients making requests for resources, and if a resource isn't in the proxy's local cache, it has to go and fetch it from the origin server. And it's very likely that there are multiple clients requesting the same bit of content, right?
(08:03):
It's pretty inefficient to have the proxy server turn around and request it multiple times from the origin. So what you wanna do is coalesce all those requests, do a single origin fetch, and then send that content back out to all the clients. And you can probably see where this is going, but with a bunch of clients requesting the same JavaScript asset...
Courtney (08:25):
Is that the sound of a thundering herd I hear in the distance? Yeah, that's, yeah.
Simon Newton (08:31):
But the interesting bit here was, if everything behaved as it normally would, latency would go up because the origin fetches are taking a bit longer, and that's fine, it's just a bit of increased latency for the clients. But where we got really unlucky was one of those origin fetches
(08:53):
didn't just go from being measured in milliseconds to seconds. The origin fetch actually took 20 minutes to occur. And as I'm sure you know, 20 minutes in computing time is an...
Courtney (09:04):
It is like geological time. Yeah, it's geologic time, so yeah. That's rough.
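A rough sketch of the request-coalescing behaviour Simon describes, in Python. This is illustrative only, not Canva's or the edge provider's actual code; the point is that when every client shares one in-flight origin fetch and nothing bounds the wait, a single stuck fetch stalls all of them.

```python
import asyncio

# Illustrative only: a toy version of edge-proxy request coalescing. All
# clients asking for the same asset share one in-flight origin fetch.
_inflight: dict[str, asyncio.Task] = {}

async def fetch_coalesced(key, origin_fetch, timeout=None):
    task = _inflight.get(key)
    if task is None:
        # First requester triggers the single origin fetch for this key.
        task = asyncio.create_task(origin_fetch(key))
        _inflight[key] = task
        task.add_done_callback(lambda _: _inflight.pop(key, None))
    if timeout is None:
        # No bound on the wait: if the origin fetch takes 20 minutes,
        # every coalesced client waits 20 minutes with it.
        return await asyncio.shield(task)
    # With a timeout, a stuck fetch becomes a fast, visible error for each
    # waiter instead of an indefinite hang (the shared fetch itself is not
    # cancelled by any single waiter timing out).
    return await asyncio.wait_for(asyncio.shield(task), timeout)
```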
Simon Newton (09:09):
And where we got particularly unlucky again was that this one particular asset was the JavaScript that loads what we call the object panel in Canva. So this is the main place where you interact with content in your designs, right? And so without the object panel loading, the editor is essentially frozen and dead. And what it also means is that once that one bit of JavaScript
(09:31):
loads, it then triggers a bunch of API calls, because...
Courtney (09:35):
Yeah.
Simon Newton (09:36):
Yes, if you've got a fetch taking 20 minutes, you've got all these clients coalescing and waiting for it, and then suddenly it becomes available. That turns into a large number of API requests hitting the gateways.
Courtney (09:52):
Yeah. It is pretty ugly. At that point things fell over in terms of like the autoscaling and whatnot, right? At some point, it all just like collapsed on top of itself there, right?
Simon Newton (10:02):
Yeah. And so that turned into, I think it was about one and a half million API requests in a very short amount of time, and that was enough to overwhelm the gateways. The auto scalers can't react fast enough to that. The gateways ran out of memory, and then that started the cascading failure due to overload.
Courtney (10:22):
Okay. That's the shortest version of this incident, which is hard. And so I think we'll talk a little bit more about some of these details, but how did you all find out about this?
Simon Newton (10:35):
I can run you through that. So the first signs that something was wrong came about 10 minutes after the deploy, because we started noticing a drop in search traffic. And so our search on-call team was paged for a drop in traffic, I think it went down around 20% or so, and they started investigating.
(10:56):
At first they thought, oh, maybe this is something very isolated to search. They opened an incident, a sev one, which is our second highest tier of incident, and as they started communicating in the incident room, other service owners started getting involved and saying, oh, actually, we've also noticed a drop in traffic.
(11:19):
And as they were doing that, that was about the time that the fetch itself completed, and then our gateway and edge teams got paged for massive gateway failures. At that point, the incident was upgraded to a sev zero. The gateway teams were pulled in, I was pulled in, and the incident coordinator was activated.
(11:40):
And then we went from there.
Courtney (11:41):
Yeah. What does your incident response look like? Do you all have some protocols, what does that look like, and how many people ended up being involved, if you can tell me? Because it sounds like it was a doozy and a lot of people got called in. I'm always really curious how many hands are on deck.
Simon Newton (11:58):
Yeah, I would guess probably at least 20 or so. So let me describe the process. Typically, alerts will page service on-call teams, I think it's a very typical process, right? Those service teams will triage and open an incident. If it's a sev
(12:18):
zero or sev one, that'll activate the incident coordinator rotation, which is a small group of people that have a lot of training and skills around managing and coordinating these large-scale incidents. The sev zeros will page me. And then it really depends on the incident as to the structure
(12:40):
of it, right? If it's a smaller incident, you might have only a handful of people and they'll be fulfilling all the roles. If it's a large incident like this one, we'll set it up so that there are dedicated people running each particular role. I think maybe one thing that's important as well is that the highest severity incidents will
(13:02):
automatically assign a representative from our customer support organization.
Courtney (13:08):
Oh, okay.
Simon Newton (13:09):
They handle all of the user reports and look for signal there, so as not to overwhelm the responding team. And...
Courtney (13:17):
Mm-hmm.
Simon Newton (13:17):
...they feed back to user support: look, this is the state of things; if people are contacting or reaching out to Canva, this is what we should be telling them; if there's anything they can do to mitigate it from a user perspective, that as well. Yeah. And then, post-incident, like I mentioned, we'll write those incident reports.
(13:38):
And we've also just recently started using AI to try and extract common themes across all of those reports, to look for areas where we can improve.
Courtney (13:49):
I have so many questions. So the IC is getting pulled in sort of automatically if it's a sev one or sev zero. How is that team staffed? Are they also engineers on different teams? Are they kind of a dedicated SRE-esque function? And when they're not IC, what are they doing?
(14:10):
I'm kind of curious what the shape of that looks like.
Simon Newton (14:12):
Yeah. So early on, our ICs were just a bunch of very experienced, battle-hardened engineers at Canva. And then over time, of course, as we got bigger, we needed to set up a dedicated function, because those folks also have their own projects, right?
(14:32):
They have their own deliverables that they're working on. So yeah, IC is a dedicated function at Canva. When they're not on call, they're doing a bunch of other different things. Some of that is improving the incident process itself and looking for those patterns in incidents. Plus there's a bunch of work that goes into our large launch events.
(14:53):
So we do two main launches a year, and we've got one coming up now in April, so they're doing a bunch of planning for that. There's a bunch of capacity planning involved, understanding and mitigating the risks that are unique to each one of those launches. And so yeah, that's a dedicated reliability function.
Courtney (15:13):
When you have a pretty big incident like this one, do you have execs in the active incident channels? Or do you have a dedicated role who's talking to folks and kind of coordinating back and forth? What does that look like?
Simon Newton (15:29):
Yeah. So in this particular case, our founders did join the incident channel. Typically what will happen is sometimes they're there, sometimes they're busy with other things, and typically myself or one of the other senior leaders will act as that channel to the founders, to give them updates and help them understand what's going on.
Courtney (15:50):
So you mentioned you found out later. So was the Cloudflare wrinkle, and you didn't mention this explicitly, but I think this is what you're getting at with the network configuration, that their traffic was going over the public internet, not over their private backbone, a piece of it that you found out later? And so you didn't know at the time why that was contributing to the problem?
Simon Newton (16:11):
Yes. We don't have visibility into that traffic between Cloudflare and...
Courtney (16:16):
Yeah.
Simon Newton (16:17):
...AWS. So yes, that was a detail that we didn't find out until, I think, maybe one or two days after the incident, as we were working with Cloudflare to understand what happened.
Courtney (16:27):
At some point, did anyone have the intuition that this was gonna get worse in the way that it did? Like once the files were downloaded from the origin server, did you see that, hear the thundering herd coming? I'm always a little curious what the existential dread might be like in an incident, or
(16:47):
did that catch you by surprise as well?
Simon Newton (16:50):
Yeah, it all happened fairly quickly in terms of...
Courtney (16:54):
Okay.
Simon Newton (16:55):
...the unfolding incident. By the time that folks were in the room saying, oh, this is broader than search, that was about the time that the origin fetch completed, and then the gateways started falling over.
Courtney (17:10):
And so all the clients hitting the API gateway, plus you've got a known performance issue. So that all happened really quickly. Or were you only aware of those pieces after the fact as well?
Simon Newton (17:23):
Yes.
I didn't actually touch on the performance side.
Courtney (17:28):
Let's get into that then. I jumped ahead.
Simon Newton (17:32):
So the gateways are now crashing. We have a bunch of people on the call, and that's where we started breaking up into different work streams, right? There was a set of people responsible for contacting our vendors and getting them involved. That process differs across vendors, right? Some of them will be like, go to our portal and
(17:54):
submit a P zero request. Others will be like, email this address and it will page your on-call account managers. And so that process needs to be well practiced, right? And I think this is actually something that we improved afterwards as well. We have internal docs on how to escalate to each of our vendors, but we tidied up that documentation and just made it a bit clearer, because when
(18:15):
you're doing that, you are already in a high-stress environment, right? And so you don't want to...
Courtney (18:20):
Absolutely.
Simon Newton (18:21):
...have four pages of workflow to go through. It needs to be a very clear, do these four things, right?
Courtney (18:27):
Yeah.
Simon Newton (18:28):
So we had a set of people contacting vendors. We had one of the engineers who was not on call at the time, but saw the incident and jumped on, as many people at Canva do. There's a very good culture of banding together, especially for these large incidents.
(18:49):
So we had one engineer who went off and started profiling the gateway, and he was reporting back in saying, oh look, actually, the profile looks different. If I remember correctly, he'd actually done this before at other times, and so he had past profiles to compare it to.
(19:12):
And he was like, look, when I did this two weeks ago, this one looks different, right? And I'm seeing lock contention here.
have been the contributingfactor.
and so we had a bunch of peopleexploring that, that turned out
to be, a change in our metricslibrary.
(19:32):
what we were doing was changing,making some changes to the way
that metrics were collected.
and so we were integrating a,like a third party library
there.
that had done is, inadvertentlyput metric registration behind a
lock.
and so the capacity of ourgateways were reduced.
which would then also contributeto the sort of the overload
(19:53):
situation, and what we saw withthe cascading failure.
but that was all emerging inparallel to the, to the,
Courtney (20:00):
Yeah.
Simon Newton (20:01):
...activity that was happening on the call.
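A simplified illustration of the failure mode Simon describes with the metrics change. This is hypothetical Python, not Canva's gateway or the actual third-party library: if metric registration sits behind one lock and is reached on every request, request threads serialize on it and effective capacity quietly drops.

```python
import threading

class MetricsRegistry:
    """Stand-in for a third-party metrics library (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def counter(self, name):
        # Registration is guarded by a single lock. Called once at startup
        # this is harmless; called on every request it becomes a point of
        # contention that every request thread has to queue behind.
        with self._lock:
            self._counters.setdefault(name, 0)
            return name

registry = MetricsRegistry()

def handle_request(route):
    # Anti-pattern: registering/looking up the counter on the hot path
    # takes the shared lock per request and reduces throughput.
    registry.counter(f"requests.{route}")
    # ... actual request handling ...
```

The typical remedy for this shape of problem is to register metrics once at startup so the per-request path never touches the lock.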
Courtney (20:02):
So you got the performance issue, which must have been... And if I'm not wrong, there was this really juicy detail in there, and these are the kinds of things that I love it when companies share: it was a known issue and there was a fix for it in the deployment pipeline for that day. It had not gone out yet.
(20:23):
And in many organizations there could be a lot of blame and post hoc, like, I can't believe we didn't do that, we should have. What was the general sense when you all realized that that also was contributing to this incident?
Simon Newton (20:41):
Yeah, there was definitely never any blame. It was just very unfortunate that, yes, the metrics team had already identified this, they'd already fixed it, it was already merged. And, I guess, what would that have been, 12 hours later? The new gateway would've been deployed and this would have looked very
(21:02):
different.
Courtney (21:04):
yeah.
Simon Newton (21:04):
So we were just getting unlucky. We do have a project underway to move to incremental deploys. Right now I'd say most backend components are deployed on a daily cycle, and then components can individually opt into incremental deploys and deploy on merge.
(21:25):
But we do want to move more and more of our components over to that incremental model, which ideally would've caught this case.
Courtney (21:35):
And so then, last but not least, certainly, I believe, again if I've got the timeline right here, you've got that going on, and then basically the load balancers on your ECS containers just couldn't keep up. They were just getting hammered at this point as well, right? And was that also a matter of reaching out
(21:55):
and trying to figure out what was happening with AWS, or what was that last piece of the puzzle like?
Simon Newton (22:02):
Yeah, so we had a set of people escalating to Cloudflare, and we had a set of people reaching out to AWS, because at that point, this was just after that origin fetch had completed, we were concerned that there might be something going on on the AWS side. But thankfully we have, I would say,
(22:24):
reasonably good observability into all of this, right? And a cascading failure due to overload is a very distinct pattern on a graph, right? You get this sawtooth pattern, like if you're looking at...
Courtney (22:38):
Yeah.
Simon Newton (22:38):
...uptime, you get this sawtooth that's sort of offset for every instance. I've certainly seen that before, and a number of other people on the call had seen it before, so it was very clear at that point: oh no, this is an overload situation. And again, we've faced this before, right?
(22:58):
I've seen quite a few of these, and there's really only one option, which is you've got to bring the load down and get the system back into a stable state. You can try and add capacity, but the problem is, unless you can do that in an atomic flip fashion, any additional capacity that you bring up is just gonna get hammered into the ground.
(23:19):
And so you've gotta cut that load back. In this...
Courtney (23:22):
Yeah.
Simon Newton (23:23):
...case, you know, what would be really nice is if we had a sort of lever where we could just say, oh, okay, great, dial the load to 20% of demand. We didn't have that sort of control within the cloud provider's systems, so instead what we did was reach for the country-level controls, right?
(23:44):
And so we put in a block that said, for any country in the world, we want to display this status message instead of forwarding the requests onto the backend. So that was the lever that we had at the time. And...
Courtney (23:58):
Yeah.
Simon Newton (23:59):
...what we were doing was, once we got that block in place, we saw the traffic drop to zero and we saw all the gateways come up and stabilize, and then we started admitting more load. I think maybe the interesting bit here is that we added Europe back in first, because that was where the peak load was at the time, right? This happened in Australian early-to-mid-evening time, which
(24:21):
would've been European daytime. The US was mostly asleep at that point. And so we chose Europe to be admitted back first, and then rolled out to the rest of the world.
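A sketch of the kind of lever Simon describes. The real block was a rule in the edge provider's country-level controls, so this Python is purely illustrative of the idea: serve a static status message instead of forwarding requests, then re-admit countries in stages.

```python
# Purely illustrative; the real control was an edge-provider rule, not app code.
ADMITTED_COUNTRIES: set[str] = set()   # empty set means block everyone

STATUS_MESSAGE = "Canva is recovering from an incident. Please try again shortly."

def handle_at_edge(country: str, forward):
    """Either shed the request at the edge or forward it to the backend."""
    if country not in ADMITTED_COUNTRIES:
        # Backends never see this request, which is what lets the
        # overloaded gateways drain and stabilise.
        return (503, STATUS_MESSAGE)
    return forward()

# Staged recovery, roughly as described in the episode:
# 1. Block all countries until the gateways come back up and stabilise.
# 2. Re-admit Europe first (the peak-load region at the time) and watch.
# 3. Roll admission out to the rest of the world.
ADMITTED_COUNTRIES.update({"DE", "FR", "GB", "NL"})   # step 2, for example
```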
Courtney (24:33):
Yeah, I mean, that was an interesting piece of what I like to think of as hard-earned expertise, where instead of just turning everything back on again and literally recreating potentially the same problem, which I've definitely seen, incidents where they're like, oh no, and it literally happens again. The only other piece I was curious about, in terms of
(24:54):
what that incident response looks like, is: are people remote? Is everything in a Slack channel? How does that work? Is everybody in the same time zone? How complicated does that look for you all?
Simon Newton (25:04):
Yeah. Canva is a hybrid work setup, and I'd say the bulk of our engineering is in the New Zealand to west coast of Australia time zones, so yeah, this was early to mid evening for them. But yes, as part of the automation,
(25:25):
when an incident is triggered, it automatically creates a Slack room, it ties that Slack room to the main incident Slack room so people can go in there and find it, and it sets up the Zoom call as well. And so then I'd say the majority of the discussion occurs on Zoom. Maybe another interesting point is that we have deliberately set up that Zoom in a way that the Zoom chat is disabled,
(25:48):
so it funnels all of the chat through the Slack room. So...
Courtney (25:51):
Ah, okay.
Yeah.
Simon Newton (25:52):
...I think that way you don't have two different...
Courtney (25:55):
Two sources of, yeah.
Yeah.
Simon Newton (25:58):
And so typically people will be sharing links to dashboards or logs, et cetera, in the Slack channel. And then we have a way of annotating key moments within the Slack channel, in a way that makes it very easy, when you're going back later trying to write that incident report, to pull out the key events and say, okay, this is
(26:18):
what we knew at this particular time.
Courtney (26:20):
Is that a homegrown tool that you all use, or is that a third-party tool for the incident side of things?
Simon Newton (26:28):
Yeah, I think it's mostly a bunch of tools developed by that same set of incident coordinators.
Courtney (26:33):
So some companies say, you know, for sev whatever number, we're gonna always have an incident review and we're gonna try to do it within this amount of time. Do you have a dedicated incident analysis team with analysts, and do they have processes around that and
(26:54):
timeframes for when you want things to happen? Or is that more ad hoc?
Simon Newton (26:58):
Yeah, so we don't have a dedicated team of people analyzing these. The incident coordinators will work with the service teams that were contributing towards the incident, and then we have a template for the PIRs, as we call them. I don't think there is a set deadline,
(27:19):
but certainly, because I'll be involved in the sev zeros, I'll be looking for that report, and if it hasn't shown up in maybe two or three weeks, I'll be pinging people asking where it is. At that point there's no sort of system that, you know...
Courtney (27:39):
Yeah.
Yeah.
Simon Newton (27:40):
It is all linked in that automation, in that Slack room as well, as the state moves through from, I guess, triage to responding to mitigated to resolved. That's all...
Courtney (27:53):
Yeah.
Simon Newton (27:54):
...fed into that timeline.
Courtney (27:57):
Is there like a PIR meeting anyone can come to? Like how are the reports shared internally?
Simon Newton (28:04):
Yeah, so when it is published, it's typically done as the last message in that Slack room, which then gets archived. We have a weekly meeting as well where we review the sev zeros, plus the other, lesser-severity but maybe interesting, failure modes.
(28:25):
And then in our monthly engineering leads meeting we take the list of incidents and go through them briefly there. That's so that all the engineering leads have the same level of visibility, and so that we're identifying common themes.
Courtney (28:42):
Do you find that those reports get referenced in planning, or do you see them being used after the fact, you know, as sort of learning pieces, either for planning or for architecture reviews or any of those kinds of things?
Simon Newton (29:02):
Yeah, so the action items that come out of those reports get created in Jira, which will then be sitting on those teams' backlogs. And so then, next planning iteration, the team will be looking at their reliability backlog, mixing that with the other various categories of work on the backlog, and using that to
(29:24):
prioritize what work occurs.
Courtney (29:27):
And is there anything outside of action items in terms of distributing what you learned, or any other activity around the incidents like that?
Simon Newton (29:38):
I'd say no, other than the sort of internal Slack channels where they're posted...
Courtney (29:43):
Yeah.
Simon Newton (29:43):
Right? And so anyone in the company is free to go and look at those and read through the reports.
Courtney (29:48):
What was the most
surprising thing to you about
this incident?
Simon Newton (29:52):
Yeah, probably the biggest one is that our edge provider would coalesce requests indefinitely. I can understand why that was the outcome, and I can put myself in the shoes of the person writing that code: oh, if there's a fetch in flight, wait for it to complete
(30:16):
and then respond, right? But that was what allowed that thundering herd to develop. All of our internal RPC traffic has timeouts, which are propagated and respected at each hop in that service graph. In this particular case there wasn't any timeout within
(30:37):
that edge provider's layer, which makes it somewhat difficult from a client code perspective to understand: is this an error? Is this just a slow fetch, or am I part of a
(30:58):
thundering herd? In which case...
Courtney (30:59):
Hey, there's
Simon Newton (31:00):
...you would wanna do something differently, add...
Courtney (31:03):
right.
Simon Newton (31:04):
Yeah.
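For contrast, a minimal sketch of the deadline propagation Simon says Canva's internal RPC traffic has. This is illustrative Python, not their actual framework: each hop derives its timeout from one shared deadline, so a stuck dependency turns into a bounded error rather than an indefinite wait.

```python
import time

class DeadlineExceeded(Exception):
    pass

def remaining_budget(deadline: float) -> float:
    """Seconds left until the request-wide deadline."""
    return deadline - time.monotonic()

def call_downstream(do_rpc, deadline: float):
    remaining = remaining_budget(deadline)
    if remaining <= 0:
        # Budget already spent: fail fast instead of piling more work
        # onto an overloaded dependency.
        raise DeadlineExceeded("no time left for downstream call")
    # The downstream hop receives the same absolute deadline (for example
    # in a request header) and applies the same check before doing work.
    return do_rpc(timeout=remaining)

def handle_request(budget_seconds: float = 2.0):
    deadline = time.monotonic() + budget_seconds
    # Every hop derives its timeout from the same deadline, so no single
    # hop can wait forever the way the 20-minute origin fetch did.
    return call_downstream(lambda timeout: ("ok", timeout), deadline)
```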
Courtney (31:05):
Yeah, it definitely seemed like one of the themes from this incident as well, and I mean, I see this a lot and I write about it a lot, was how automation actually made it worse, right? And not because you want the automation to make things worse, but because it can't always react to these kinds of situations, and, you know,
(31:27):
its model of the world is not as flexible as ours is. And so, you know, I definitely saw action items out of that and everything in the report, but was there any reaction to that in terms of where you had these automation surprises? What was your
(31:50):
response to the automation making it harder, and how are you all adapting to that after the fact, for the future?
Simon Newton (31:58):
Yeah, so maybe a good example is, again, I didn't mention this, but part of what happened on that call was there was a person responsible for freezing the auto scaling. Because what can happen here is your traffic goes to zero, especially if that's over a sort of long
(32:20):
amount of time, and the auto scalers are like, oh, great, we can downscale all of the groups, and...
Courtney (32:26):
Yeah.
Simon Newton (32:27):
...of course they can't...
Courtney (32:28):
And like, oh no again.
Yeah.
Simon Newton (32:30):
So we've got processes in place where, for those particular types of incidents, we have tooling that we can run, again developed by those reliability folks, where we can say, hey, freeze all the auto scalers. And what we can also do is go back and say, oh, actually, reset them to where they were prior to the incident occurring, right?
(32:50):
So if any have downscaled, we can get them back. And so that's a process that we've adapted, given past experience and thinking through what bad outcomes could happen in these sorts of events, that turns the automation off for a second and lets humans take control. Yeah.
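A sketch of what "freeze the auto scalers, then put them back where they were" can look like against AWS Auto Scaling with boto3. Canva's tooling is internal, so the group names and the choice of suspended processes here are assumptions.

```python
import boto3

autoscaling = boto3.client("autoscaling")

SUSPENDED = ["AlarmNotification", "ScheduledActions"]  # assumed choice

def freeze(group_name: str):
    # Stop the group reacting to the (deliberately) near-zero traffic,
    # so it doesn't scale capacity away while load is being shed.
    autoscaling.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=SUSPENDED,
    )

def restore(group_name: str, pre_incident_desired: int):
    # Put capacity back where it was before the incident, then hand
    # control back to the automation once traffic is re-admitted.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=pre_incident_desired,
        HonorCooldown=False,
    )
    autoscaling.resume_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=SUSPENDED,
    )
```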
Courtney (33:11):
Yeah, I mean, that's definitely a theme of my work, obviously, but instead of just saying, well, we'll add more automation to make the automation better, y'all were like, well, let's give the humans better controls over what that might look like. I like seeing those kinds of adaptations.
Simon Newton (33:30):
Yeah, I'm a very big fan of modes within automated software where it just throws up its hands and says, this does not match my view, what I've been modeled for, and...
Courtney (33:44):
yeah.
Simon Newton (33:45):
Right? When I've been building systems in the past, I've built in these lockdown modes, right? Where it's, hey, the inputs don't match anymore, I'm gonna go into lockdown mode, I'm gonna flag it in my telemetry, a human needs to come and get me out of this. Rather than just continuing down a path that no one has prepared the software for, and possibly
(34:08):
making things much, much worse.
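A minimal version of the lockdown-mode idea Simon describes; hypothetical Python, not any particular Canva system. When the inputs fall outside what the automation was modeled for, it stops acting, flags itself, and waits for a human.

```python
class LockdownMode(Exception):
    """Raised (and alerted on) when inputs fall outside the modeled range."""

def emit_metric(name: str, value: int):
    print(f"METRIC {name}={value}")          # stand-in for real telemetry

def desired_capacity(observed_rps: float) -> int:
    return max(1, int(observed_rps / 100))   # toy model: ~100 rps per instance

def plan_scaling(observed_rps: float, modeled_min: float, modeled_max: float) -> int:
    if not (modeled_min <= observed_rps <= modeled_max):
        # Inputs no longer match what this automation was built for:
        # stop making changes, flag it in telemetry, and wait for a human.
        emit_metric("autoscaler.lockdown", 1)
        raise LockdownMode(
            f"{observed_rps} rps is outside the modeled range "
            f"[{modeled_min}, {modeled_max}]; refusing to act"
        )
    # Normal case: proceed with the modeled scaling decision.
    return desired_capacity(observed_rps)
```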
Courtney (34:09):
And what do you think
was the biggest learning for the
team out of this incident?
Simon Newton (34:15):
I think definitely practicing those controls on the edge provider's network. When we were putting that block in place, we added it with a very simple status message. I can't actually remember what it said, but it was definitely not a user-friendly status message, because at the...
Courtney (34:35):
All bad.
Good luck.
yeah.
Simon Newton (34:37):
It wasn't quite that bad. But because at the time...
Courtney (34:43):
I could come up with a lot of bad error messages for you if you'd like, but probably not.
Simon Newton (34:47):
Yeah. So, because at the time the priority was get the load down, rather than debate the error message or how it was gonna be visually presented, et cetera.
Courtney (35:01):
Yeah.
Simon Newton (35:02):
We've learned from that. We now have canned responses ready to go, in a much more user-friendly and visually appealing style. So yeah, that's one thing that's come out of that. Also our ability to practice using those controls and getting more familiar with them, 'cause you really don't want to be using something for the first
(35:23):
time in a high-pressure environment, right? You want to have this built up almost in your muscle memory. One of the other general principles that I try to describe when building these systems is that any sort of emergency or failure mode should be exercised day to day.
(35:43):
And that way, you're...
Courtney (35:46):
Yep.
Simon Newton (35:46):
...ready for it. You've been using it day to day, or once a week, and it's not a, oh, I've gotta do this thing for the first time in a call, and there's a documentation page that is like four pages long, right? And I've...
Courtney (36:02):
Yeah,
Simon Newton (36:02):
...got to follow that list of steps perfectly.
Courtney (36:04):
The oven's on fire and I've never used the fire extinguisher thing before and I can't get the... yeah, basically. Yep. Yep. And do you all do, yeah, do you do drills, like tabletop drills, chaos engineering type stuff, or what does that preparedness look like? How is that structured, and does that definitely happen?
Simon Newton (36:24):
I'd say there's a variety of drills that occur, right? Teams themselves will do wheel-of-misfortune style or role-playing incidents, but then there are also larger drills that we do as a company, where we say, oh, okay, yes, we're gonna do a sort of business continuity drill,
(36:45):
and teams will get involved and do those exercises.
Courtney (36:49):
Thank you so much for joining me and sharing your internal process with the world. And while I hope you don't have more sev zero incidents, I do hope that you continue to share your incident reports with us. I really appreciate all of the time and effort and cat herding that goes into that, but as you said, it really is so beneficial to everyone else in
(37:11):
the industry, so thank you for doing it.
Simon Newton (37:13):
Yeah, no worries. I just hope people can learn from it and we can all get better together.
Courtney (37:21):
Absolutely.
Okay.
Thanks so much.
Simon Newton (37:23):
Thank you very much.