Uptime Labs and the Multi-Party Dilemma (Part I)

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:05):
Greetings, fellow incident nerds, and welcome back to an unusual
version of the Void Podcast.
In some ways it'll be like all the other ones we've done.
We will talk about an incident, an incident report, and the
people that were involved in it.
It will be different in two notable ways.
First, the incident is actually a drill that is hosted on the
Uptime Labs platform.

(00:27):
You'll hear a little bit more about that as we get into it.
The short version is it's a way for people to practice incident
response.
You can do it either individually or as a team, and I
really like it as a way to not just practice incident response and
get better at that,
but I've found, watching some of the drills and talking to people

(00:47):
about it, that it also helps reveal a lot of assumptions
about how incident response happens versus the messy
realities of what actually happens.
And so that's the setup.
So that's part of what's unique.
And then the second part is that we get to see what
incident analysis really looks like on top of one of those

(01:11):
drills.
So I've invited Eric Dobbs to come and do an analysis of the
incident that Sarah and Alex do on the Uptime Labs platform.
This is not something that most people get to do.
Uh, most people don't get to watch someone run an incident
unless you're on an incident response team.
And we definitely don't get to see what really high quality

(01:34):
incident analysis looks like unless you're, like, one of the
lucky people who's actually, like, worked with Eric or is at
an organization that recognizes the value of this work.
So, these whole shenanigans were my idea.
It's not a sponsored thing for Uptime.
I just really wanted to peek behind the curtain of what world
class incident response and incident analysis looked like

(01:58):
and bring you along with me for that.
So let's get into it.

Courtney (02:02):
Thank you all so much for joining me on this very
unique episode of The Void Podcast.
There's a bunch of us, so we're gonna go through intros really
quick before we get into the meat of this.
So I would love to have Hamed, if you could, start off by
introducing yourself first.

Hamed (02:16):
Thanks.
Thanks, Courtney.
Thanks for having me.
My name is Hamed Silatani.
I'm from Uptime Labs.
For the purpose of
this conversation, I'm representing four different... For
the, for the incident review it's a bit odd, but the way it works
behind, behind the scenes in our platform, there are multiple

(02:37):
people that play different characters in the role.
And today I'm going to represent more than one person.
We'll see how it goes.

Courtney (02:45):
We'll get a window into your multiple
personalities.
So, looking forward to it.
Okay, so next up, one of our willing participants.
Alex, will you introduce yourself briefly and, and paint
the picture of the sort of scenario
we got you all into here with this one?

Alex Elman (03:06):
Thanks for having me on,
Courtney.
I'm Alex Elman.
I've been
serving in an incident response capacity for the past 14 years
at, at various tech companies.
And the dynamic that we're gonna be discussing today in this
incident, which I served as the, the deputy incident commander
on, is something that I, I've seen and, and Sarah, who you'll

(03:27):
hear from, has seen in the numerous incidents that we've been a part
of, is this dynamic called the multi-party dilemma.
And it, it shows up in incidents
between parties who don't typically work with each other
and, and might be guided by different, like, respective
missions.
These dynamics that show up in incidents could result in the

(03:50):
two parties working at cross purposes or having different
agendas.
And it has huge implications for resilience, which is why it's an
important dynamic.
It shows up in complex adaptive systems.
So outside of software, it also can show up in aviation,
military, medical domains, but specifically, showing up here
in, in a software incident.

Courtney (04:12):
We'll have Sarah introduce herself next and tell
us a little bit more about the incident and then we'll get into
the meat of the matter with our, our analyst, Eric.

Sarah Butt (04:23):
All right.
Yeah, absolutely.
Hi everyone.
My name is Sarah Butt. I'm excited to be here.
Like Alex, I have spent the majority of my career, almost 15
years, working
for tech companies, specializing in sort of all things incidents
and, um, what that means for an organization.
In this drill, I served as the incident commander and the

(04:44):
scenario that I wanna start to paint for you is the drill, um,
situation that Alex and I found ourselves in.
So we get dropped into the Uptime Labs environment, which
is
very realistic.
You're essentially in a Slack platform.
And then Alex and I had a voice bridge going as well.
And what we've been told is that we are part of an e-commerce

(05:05):
company that has had some low sales.
They've got this very important, uh, online event that is
supposed to drive a lot of revenue.
They're gonna do a 50% off sale.
And we're sitting there watching the CEO sort of joke, we can't
jinx it.
All of these sorts of things.
And suddenly we hear that
people in the EU are just not able to load the website at all,

(05:28):
and so that's the
place that we start the incident: quite frantic customer
support folks, a very hot CEO,
a lot of concern about what's going on, a lot of confusion,
frankly, a little bit of that first-15-minutes-of-an-incident
chaos.
And, uh, from there we work through various troubleshooting
pathways, which I'm not going to get into because I don't wanna
spoil Eric's thunder,

(05:49):
and ultimately are able to bring the incident to resolution.

Eric (05:53):
Thanks for inviting me.
This is, uh, uncharted territory and, and incredibly fun for me
to be participating in this.
So I'm so excited to be here.
I have been working for the past four years as an incident
analyst. I think there are 10 people in the world doing this
work in the software business.
I, I might be exaggerating, but there are very few of us.

(06:16):
It's a small club, and I am, I steered my career in this
direction on purpose.
And one of the challenges of being an incident analyst and
the, the activity of learning from incidents is that the work
is all proprietary, behind the closed doors of, of corporate
software companies and it, you know, the legal teams and the PR

(06:38):
teams don't really want to tell their customers about how the
sausage actually gets made.
In fact, all of our software companies have an
extraordinarily tangled mess inside, for all of the beautiful
UX that, that people interact with,
however frustrating those interactions can be.

(06:59):
The, the what's on the inside is more messy than we can
ever see.
One of the fantastic things about this opportunity is that
this is a drill. Nobody's company's material is on the line.
So we could maybe even show,
like, what the mess looks like and, and be able to talk more
openly in the world about the kinds of things you can learn if

(07:22):
you take a different approach to when things break.
So that's the piece I'm super excited about, not least of
which is that the, the content of this activity is this
multi-party dilemma.
That's close to my heart because I think one of the most
important features of the software business is that we,
we

(07:43):
adopted software as a service over the last decade.
So all of us have software running on somebody else's
hardware, and we are, all of us, in a multi-party dilemma in
every incident, whether we know it or not.
So this material is really important to get out in public
also.

Courtney (07:59):
And if you ever want another job as podcast host, um,
you're hired.
So, okay.
So before you start.
Part of this, part of what is so exciting to me about this is the
ability to look behind the curtain, right?
As you've said, because in what I get to do, I never get to do
that.
I, I am, you know, the, the librarian on the outside
collecting all of the reports, and so to be able to peek behind

(08:22):
and see all of this is, is a rare treat and, and the rarest
of the rare treat
part of this is to see the kind of work
an incident analyst does. As you said, it's a fairly small club,
not exclusive.
Others could do this and we hope that more people do and that
more companies do this.
But I'd love to hear a little bit, before we dive into the work

(08:43):
you've already done on this, Eric. Tell folks, what do you
do
to get ready to analyze an incident?
What are the materials you collect, what are the things
you're looking for?
What are you pulling together?
And talk to me a little bit about what that process looks
like.

Eric (09:00):
I think that's gonna be really useful context, although
I'm gonna actually deviate from your specific question and first
paint a cartoon of the typical incident in a software company.
So the, the typical scenario, this is sort of the dominant
paradigm about incidents in a software company, is they're bad
and you wanna avoid them and prevent them.
And when they happen, it's usually because somebody screwed up.

(09:23):
There's a, it's, it can be hard, and the company culture depends,
but it can be very hard to, to learn what really happened
because you have this bias, and this is sort of culture wide.
It's not even the software world alone.
When bad things happen, it's because somebody didn't do the
thing they were supposed to do.

(09:45):
And that, that's the, that's the bubble we're trying to burst in
the resilience engineering community and the, and a bunch of
sibling communities.
I'll try not to do too much of that intro.
So the conventional retro is to sort of sit around for an hour
and talk quickly through what happened.

(10:07):
And often the talking about what happened stops pretty short,
as soon as we have the first person who made a mistake, and
we'll dig into what should we have next time, because we
definitely wanna not have this experience again.
It was painful.
We lost money.
Customers are mad at us.
Uh, there's a lot of good reasons that we do it this way,
but the, the core mechanic out of a typical retro is to get the

(10:31):
code we're gonna fix so that this never happens again.
And the subtext of the whole activity is that we're really
trying to soothe the, the emotional stress that an
organization feels and does not admit: we don't really have
clear control over the systems we're running.

(10:54):
What I'm trying to do differently as an analyst is to admit that the
business we're running is a mess and that nobody's really got a
clear view of it, and that the incident itself is actually a
crack that lets me peek into how things really work.
Because most of the time we're too busy to look at how things

(11:15):
are actually working, and the, the, the sense of urgency or the
sense of regret gives us a moment of opportunity where the
business is willing to invest and we could do something
different with that time.
I'd spent a bunch of time before the interviews reviewing the
Slack transcript and a recording of their drill
to look at the words they actually said, who they were

(11:38):
talking to, what kinds of questions they were asking,
trying to get a picture in my own head of what was in their
heads at the time they were taking the actions.
'Cause at the end of the incident, I know, I know what
broke and it can be really distracting to only look at what
broke.
We, we operate with that urgency, especially during an

(12:01):
incident.
We're choosing the stuff that feels most important in the
moment.
I'm trying to get a sense of what that was.
So then I have an interview with the participants to double check
that what I read and what I think was most important
actually was, and invariably the conversation with the people who
were in the hot seat takes me down a different path than I

(12:23):
thought I was gonna be down.
I'm looking at the, the evidence and I learn all kinds of things
about how the business really works. I learn all kinds of
things about the deep expertise that everybody brings into an
incident, and that is, that expertise is gold, and incredibly
difficult to see until you actually do this kind of work.

(12:44):
So then having an hour, you know, I've got an hour of video
from the recording and some retro reflection that happened
after the drill.
The drill's about half an hour; there's, uh, a little bit up front,
you know, setup recording, and there's a, some reflection
afterwards.
It's about an hour of footage.
And I have,
I don't know, an hour and a half that Sarah and I spent, and an

(13:07):
hour that Alex and I spent talking through what happened,
and all of the Slack transcripts.
And I try to then process, process that to figure out what
is in common and what is in contrast between the testimony I
learned from Alex and Sarah about their respective views.

(13:27):
The premise of doing that is that none of us can see the
whole system.
If I can get some detail about the different views, I learn
more about the system, and if I as an analyst can synthesize, as
coherently as possible, a, a subset of all of that
information into the interesting bits of, of shared experience
(13:53):
and the interesting bits of.
experience.
when we have the retro, we have a much richer discussion about
what really went on during that incident.
But also, this is the critical bit.
This is why it's a magnifying glass and not just a thing to
avoid.
I get to find out what normal work looks like and why this was

(14:13):
different.
And finding out what normal work looks like turns out to
actually be where the real gold of the activity is.

Courtney (14:22):
Something we talk about with incident analysis
is, like, you could be one of the best technical people in your
field; doing incident analysis is a very different set of skills, a
very different lens.
Anyone who is listening to this might now be thinking, like,
sheesh, that sounds like being like a detective or an

(14:42):
anthropologist or maybe a sociologist.
And the answer is yes.
It is all those things.
And that's why this is such a treat,
you know, and how it differs from just going and talking to
people and asking what happened and writing that down and, and
moving on.
Okay, but let's get to the meat of the matter.
You have all of this information gathered.
You have a wealth of written and auditory and other data and

(15:08):
narratives and people's perspectives.
We're here now.
You're running the retro.
I'm gonna step back.
I'm gonna let you take that in the direction you would normally
take it and try not to

Eric (15:18):
so,

Courtney (15:19):
too much.
So

Eric (15:20):
Now we're pretending that we're, we have, after all, I've
done this, all this work.
I've shared the document with the people who are
participating.
Everybody was too busy to read it.
So we come into the meeting cold and I'm sharing that document as
a, as my slide deck to run the meeting.
And what I'd like to do is take just a couple of minutes to
review our goals for the retro.

(15:41):
And because the kind of retro I do is unusual in the software
business, I'm gonna take a little extra time to, to frame
how this is different from your typical retro, although I've
already done that.
This highlighted section is what I'm going to read to you.
So apologies for reading the text that you can read for
yourself, but this part's important.

(16:02):
So welcome and thank you so much for being here.
I know you all have busy schedules.
You could have chosen something else.
Let's do our best to reward everyone for having made that
choice.
The first step is to set explicit ground rules for this
retrospective.
The goal for this meeting
is for each of us to discover new insights into the complexity

(16:22):
of our continually changing business.
It is of particular importance to suspend judgment about mistakes
or errors, and instead look closely for why those actions
made sense in the context of the incident.
I am pausing for dramatic effect.
It's really important, that particular point, so let us

(16:44):
agree at the outset.
No one of us has a complete or correct understanding of how our
whole system works.
When things break, or when we make mistakes, we have the
opportunity to peek through those cracks and get a new
glimpse into what's really going on.
We gain insights that we can apply both individually and

(17:04):
collectively to improve everything we do, not just the
particular things that broke for this particular outage.
There's so much more value to be mined from this experience
than the conventional post-incident action items.
So the thing I'm gonna ask you to do that's hardest, because
you're well conditioned from your other retros, is to, to not

(17:25):
tell me how we could have fixed it or how we, how we could have
prevented it.
We wanna understand what happened first, uh, before we
get to how we fix it.
The, the next step is to review the narrative of what happened,
and I'm sort of eschewing the convention of giving you exact
timelines and I've, I've abbreviated a little bit,

(17:48):
but I have the major plot points in the correct order.
And what I really want to do, besides establishing sort of the
story arc that we followed through this incident, is to also
validate, especially with the folks who were there,
that I have not misrepresented my understanding of
how our incident proceeded.

(18:09):
There's an opportunity here for our collective understanding
to be refined, and that's part of the point. Please don't
feel like this is me lecturing you about what you went through.
We're trying to have a conversation.
The day began, as Sarah hinted in her intro, with high expectations
and an anticipation of high traffic.
There's a 50% off sale that was intended to boost sales numbers,

(18:31):
which have been down in a recent slump.
We have an immediate partial failure that raises confusion.

Sarah Butt (18:39):
Reports are flooding in.
Ah, okay.
Here we go.
Um, Bob gave me the exact user impact.
Uh, Alex, can you go ahead and grab, uh, grab that incident
management dashboard and just take a look?

Alex Elman (19:01):
Yeah.
Doing that now.

Eric (19:03):
There's a significant impact, but it seems to be
entirely within the EU during US business hours, based on our
observability tooling.
In the process of troubleshooting and trying to
understand what was going on, the responders were briefly
fixated on a hypothesis that it was a recent deployment that

(19:23):
would've triggered the incident.
Many changes had been deployed in the system.
The most recent had been five hours earlier.

Sarah Butt (19:32):
Alex, let's think.
What, what can affect all services in a region, like
network, DNS, um, something like Akamai on the front end?

Alex Elman (19:46):
Well, we're, we're in the online boutique.
We're seeing an unbranded NGINX 503 page.
So we're still getting network connectivity to the web server.
But something maybe is affecting, like there was a
change that affected NGINX and I didn't see anything in the
change log that jumped out at me.

(20:06):
I'll mention that.

Sarah Butt (20:08):
Yeah.
I'm kicking myself for not, I, I asked earlier.
I said, I don't know where we are.
Are we on AWS or similar?
I'm kicking myself for not knowing that.

Alex Elman (20:20):
Saying the website's working again.

Sarah Butt (20:23):
Yeah, I think it's something at the DC level and it
came back up.
Alex, over in biz comms, if you haven't already, can you let
them know that we seem to be currently out of impact?
We're monitoring, we're gonna be working with the data center.
There may have been a disruption at the infrastructure.
Uh, god dammit.
It's down again.

Eric (20:36):
In hindsight, there's a clue here.
I'm gonna not dwell on that for a minute because the little
dramatic tension in the narrative is useful.
In one of the most jarring moments in the incident, we discovered
that there was an interaction with our vendors and symptoms in our data
center.
And in trying to contact our vendors, we learned that we had
(20:58):
multiple support contracts.
The way we learned this was by reaching for the first one we
found
and discovering there was a four hour response time on that
support contract.

Sarah Butt (21:10):
As are you in a spot to reach out to your executive
contacts at the DC vendor to escalate?
Not waiting four hours.
If we can fail out of there.
And I didn't think of it.
Freaking Lord, help me.

Eric (21:31):
Our site is hard down, and has been for about five or 10
minutes at the point we learned this, and we cannot expect a
response from the vendor for four hours.
Needless to say, we were a little alarmed in that moment.
And it turns out, happy news.
We discovered not long after that that we had a second
support contract that, that was for the production environment

(21:53):
with a faster turnaround.
We didn't end up having to wait the four hours.
But there was definitely some immediate response to the news
of four hours and, uh, some immediate efforts in contingency
planning.
That level of work, that sort of timeline in the metaverse,
was cut short when we did get contact with the vendor who's

(22:14):
running our data center.
The next key insight in this journey that we learned is that
there was an email notification that had gone to a spam folder.

Sarah Butt (22:26):
There's an email that just came in from the data
center provider, so I guess we were on first party hardware
maybe.

Eric (22:32):
And, as I have come to learn, I think the only
person who got that email was also our CTO.
So, if others got it, maybe they didn't even know that it went to
their spam folder, but we're, you know, 15 minutes into the
incident when we discover this email that would've been really
good to know much earlier.

(22:53):
So it's in a spam folder.
It's gone to one recipient, a very busy recipient in our
business.
And that's a contributing factor to the drama of our, of our
initial diagnosis.
Once we understand that we're dealing with an air conditioning
issue, which was the content of that email, our
vendor having air conditioning problems in the data center

(23:15):
where our services are running, there's now a decision in front of us about
whether we can wait for the vendor to fix the air conditioning, or we should
engage our business continuity plan and get into a different
data center.
So that was one of the important pieces in the, the story arc.

Sarah Butt (23:33):
Danielle, can the DC vendor give us an ETA? Is there
anything else they can do to cool the room that they are not
already doing?
Are we the only tenant there?

(23:54):
I need answers to these questions.
I wanna understand because Hamed and Tanya seem to have
different levels of risk.
Tanya is the platform lead.
Hamed is on the customer support side.
Makes me nervous, but let's get those questions.
Alex, sorry, I put you in a buffer.
What was that about the BCP?

Alex Elman (24:12):
You're responsible for coordinating the flip and
communication with stakeholders.
Platform network engineers handle the DNS updates and
rerouting, and so the first thing you need to do is notify
stakeholders, the IT teams or managers, and customers.
Then we need to stop non-critical services in the
data center.
Okay.
They only have to sync, sync critical retail data,
reconfigure the IP address in the DNS console, activate

(24:35):
application systems in the new DC, and then validate that the
transactions and orders are working.

Sarah Butt (24:45):
I'm just trying to make a risk assessment because,
here, Alex, let's, let's you and I take a minute and talk through
this.
Okay.
DC has put us on an extra 10 minutes, so we're down for 25
minutes.
We're functionally down in the water right now, so we're at
basically a zero-sum game.
That being said, it sounds like we've had intermittent inability

(25:07):
to fail the DC over in scheduled BCP, uh, runs, which we
know always go better than real BCP runs.
I haven't gotten an answer.
Oh, we just got it.
Okay.
We can't shut it down partially, so we're sort of hosed there.
Alex, what's your thought?

Alex Elman (25:25):
I mean, considering the state of the site right now,
I think it's prudent to just fail over it.
It's already been well over 15 minutes.

Sarah Butt (25:35):
Okay.

Eric (25:37):
In the end, we did engage the business continuity plan.
In this case it went smoothly, contrary to the worries of the
folks who were against the business continuity plan.
We finished in that spot with some discomfort because our
business continuity plan has demonstrated some inconsistency
in the past.

(25:57):
And we don't wanna just say, saved, business continuity, move
on and, and, uh, pursue other things.

Courtney (26:04):
Sarah and Alex, going into this, how real
does all of this feel from your perspective? You know, not, not
Eric's, you know, sort of doing the retro, but like, does,
like, hearing this again, does this evoke what you felt like
during that incident and what was going through your brain and
(26:27):
your body at that time?

Sarah Butt (26:28):
Yeah, I, I mean, I think it's, it feels incredibly
realistic.
It feels realistic enough, and Eric can attest to this, that
I'm someone who gets, I get drill anxiety, but I also just
like get this adrenaline rush of really loving incidents.
And I came off the incident so, sort of, convinced that I'd run a
real incident that I did, like, an hour and a half recording to

(26:51):
Eric that night of a debrief of my own, um, and my performances
as IC of that, just to get it out of my head because even
though
my brain knew that this was a drill, like, my body still felt
like I had gone through an incident.
And, um, with my employer, we have this great debrief process
that we use to sort of help our incident commanders process

(27:13):
through and, like, regulate down a little bit and stuff so you
don't go and stare at a wall at night
and not sleep,
if you're of the high strung variety like myself.
And because I didn't do that and it felt so real, I had to, like,
mimic that, um, which gave Eric a bit more material.
But yeah, I mean, I think it felt very real.
And I, I look at the narrative even now and I'm like, but I
wanna jump in and say something, like, there is that
(27:34):
moment.
I'm like, but it, so I think it feels incredibly realistic.

Alex Elman (27:39):
For me, it didn't necessarily start out that way
because I was in a Slack workspace I wasn't familiar
with, I wasn't familiar with the infrastructure, I
wasn't familiar with things like checkout service.
But as the, almost immediately, as the, as the incident really
took shape, it evoked for me the same experience as being at a
(28:02):
company and having to respond to a, a dark corner of the system
that you're not familiar with, that has always existed but has
maybe not had problems.
And it had very much that, that same sort of character to it.
I have to quickly get up to speed on this.
I had to examine this long list of, of changes that I didn't
make.
I have to
coordinate with multiple people.

(28:22):
I have to get up to speed.
All of that felt very familiar and is the character that most
of the incidents I'm part of take on.
So that is very general.
You, you can generalize that across experiences and that's
what made it feel very real for me.

Eric (28:36):
And that, that's definitely one of the key, a key finding.
Uh, there's so much available material that we're not going to
get to in this short retro, but I'm glad that, that, I'm glad
that surfaced.
Where I would like to go next, I guess before I dip into themes,
I, I, let me just double check
that I have paraphrased the narrative sufficiently for our
(28:59):
conversation.
Do you think there are important pieces of your experience in the
incident that I have missed, or I emphasized one more than
another?
Or anything along those lines?
I'll just open it to the floor.
Anybody got stuff to add to the story I've told so far?

Alex Elman (29:21):
One,
one of the unfortunate downsides of having a really...
All of the analysis work you did, Eric, is hidden in just the
fluency of just having this, like, narrative, this understanding,
this deep understanding of what happened.
But the downside is that what gets lost in that is it can sound
like it was so obvious or easy when you describe, like, oh, and
(29:42):
then they just realize it was an issue on the, on the vendor
side.
And, and maybe my memory of this is, is not, not quite.
But I remember we were on a couple of different threads,
looking into the change log, trying to figure out, maybe
there was a deploy that seemed kind of suspect.
I don't know if it was the, the email from the vendor that we
(30:04):
found in the spam folder, or it was somebody else
recognizing it, but recognizing that the vendor might have an
issue
in that data center completely pulled us off of the threads we
were on and into this new direction.
And that was not obvious, at least to me at the time, from the
information that I had.

(30:25):
Sarah, do you, how do you

Sarah Butt (30:26):
Yeah, I actually wanna jump, I'm, it's
interesting that you and I picked up different sides of the
same thing.
'Cause I worried that in my retrospectives and such that I
had potentially biased the narrative because I was very
frustrated with my own perceived fixation on the change piece,
but I do think it's important to call out that we were actually
(30:48):
running multiple investigations in parallel at that point.
So we started, and I think the first thing that we asked about
was big infra thing.
Like, I think that the ask was network, load balancer, and
something else.
And we had Daniel off looking and I know there was a point he
had come back relatively quickly and said, like, load balancers
are fine.
Alex and I were talking about the fact that it's a, an
(31:11):
unbranded 500 NGINX page.
So, like, network connectivity, at least, like, to a certain point, is
fine.
So I don't know that we jumped immediately to change.
I think there was a lot of discussion pretty quickly about
change, and that discussion
did last for a little. Like, I wish I had pulled us out of that
(31:31):
a little sooner, but we did have, like, multiple swim lanes in
parallel on the infra side as well.
And then, yeah, I think Alex's memory of how we pivoted is, is
actually pretty correct.
And I think several people started to realize what happened
at once.
So we got the email that Tinus found. Tinus, Hamed?

(31:53):
Is it Tinus?
Am I right on my, my CTO's name?
Perfect.
One of you, one of you. What happened in my brain, I snapped
back to two prior incidents I've seen with HVAC failures in a
data center where we had to fail out, and a drill that I've run
where we did this.
And so for me, as soon as that came in, I think there was a
(32:14):
moment that, if I'm remembering it right, like, we were sort of
mid can, or I was asking questions or something and, and
I actually told Alex, like, I'm gonna have to hard pivot the
bridge right now.
And I think as that was happening,
it was either Tanya or, I think it was Tanya, someone else, like,
light bulb went on at the same time.

(32:36):
Um, so I feel like as soon as we knew that that email happened,
several of us very quickly got to the point of, like, we didn't
necessarily know that the data center was restarting the
server.
And I still have a lot of questions actually about the,
like, mechanics of were they actually restarting the server,
but we knew that there is an infra level equipment failure.

(32:58):
We're going to have to get out of there or we're going to be
sort of at the mercy of them getting H, the HVAC back online.
And in my head I'm also thinking through all of the repercussions
that can come from sudden heat failures, such as blown
equipment and ways that heat doesn't dissipate and all of
this.
And I think that was the moment that we pivoted the bridge
pretty quickly.
But we did have infra level swim lanes going prior to that point.

Eric (33:22):
that you all have included those.
I absolutely have evidence for all of that that I can bring
forward into the document and, in my haste to try to get to
the, to the major plot points,
I missed some, and so thank you very much, delighted with your
contributions to our improved narrative.

(33:44):
Hamed, are there any things that you would like to add to what
Alex or Sarah have said before I move on from the narrative?
Or any of your many roles?

Hamed (33:57):
So I'm gonna wear my Bob, the customer service, hat
now.
I think initially what's worth mentioning is, as a business, we
were completely caught off guard. My team
started receiving a lot of calls and complaints from customers. I
didn't have, initially, from internal, oh, there's something

(34:20):
going on,
which kind of
threw me off, because instead of like focusing on putting up a status
page, managing customer calls, I was just trying to
figure out, or asking, under the pressure,
Sarah and Alex: what, what is happening?
What do I tell?
What is it?

(34:40):
Do we really have an issue, do we not?
For me that was like a difficult part in the
beginning.

Eric (34:48):
Am I paraphrasing this well?
As I typed this into the doc: you, you were unable to focus on
your primary communications with customers because you were so
busy trying to understand what was happening.
You didn't even have a status you could tell them.

Hamed (35:00):
yeah.

Eric (35:01):
I.

Hamed (35:03):
Yeah.
Normally, uh, what happens is an incident is raised through
an internal mechanism, being alerting or someone noticing it.
This one was like a deluge of customer calls.
What happened?
What is going on?
What's happening to my order?
I didn't have anything internally looking wrong
immediately, so

Sarah Butt (35:24):
I think to say the quiet part out loud, I think
what Bob's trying to express is a little bit of
some emotion.
I'm not going to say if it's confusion or frustration or
what, but some emotion that I think, I mean, I shared as well,
of surprise that we didn't have a monitoring alert or anything,

(35:47):
because I think we did see, and Alex knows better, he was in the
Grafana.
I wasn't in the dashboards, but I, I think we did see like the
volume drop off and stuff.
So, I think there was, I think what I'm hearing, and I, and I
agree, was like there was surprise that this was not
detected internally and that created a little bit more of

(36:07):
the initial crush and saturation that we had on the biz comms
side because it was all coming in, like we were trying to give
information and get information via biz comms and customers,
which was a lot of this was being done by Alex.
Versus us knowing and having, you know, like we have an issue,

(36:27):
we're standing everything up.
We're already looking at it and just pushing to customers.
I think that bidirectional communication made for a pretty,
like Bob was pretty crushed at one point because we're telling
him like, do a status page.
Also, we need to know the customer experiences.

Eric (36:41):
I need to take a moment to go meta in our conversation,
step out of the retro and make an observation about the retro
we're having, having sat through a lot of retros.
One of the experiences for me as an analyst in this is that I
have never analyzed an incident where my participants already
knew so much about resilience engineering.

(37:03):
So there are things that Alex and Sarah are bringing forward that
I need a moment for the audience to recognize are deep insights
and extremely unusual.
These two are rock stars as incident commanders with a lot
of experience in the complexity of being in incidents and, and

(37:24):
they, they are speaking about this with a, a...
there's no better word for it:
this is fluency of expertise, and in the resilience
engineering space, we have a law for it.
I'm gonna sort of skip some of those details.
But when Alex was first challenging what I've left in
the narrative, his framing for it was one of the things about

(37:47):
having so much evidence, it's easy to not see blah, blah,
blah.
He is pointing at my expertise as an analyst and that I made
the synopsis look easy and obvious.
That is a perfect paraphrase of the nature of expertise, and
most folks don't know this, so it bears going meta on the
conversation to draw attention to: Alex understands a thing

(38:12):
about expertise, that the experts make it look easy.
So my expertise as an analyst:
I make it look easy, like you can just go sift through a bunch
of incidents and find the key plot points and summarize 'em in
a couple of sentences.
Turns out that ain't easy.
Alex knows it.
But the other thing about that skill, that property of

(38:33):
expertise, is that it hides stuff from you.
So the nature of an expert's work means you, you, you don't
see how hard they're working or how hard it would be for
somebody who didn't have their skills.
Yes.
Now the second key idea and the fluency that's shown up here is

(38:54):
that both of them stepped in with a story about saturation,
which is one of the themes in our incident.
This will become a segue back into the, the retro.
Understanding that the participants in an incident get
overloaded by the flood of information and the confusion

(39:18):
is a very high level awareness about incident response.
It is really easy and really common for folks to be in
overload and in a retro to totally not know that that's a
thing worth talking about.
But the fact that Hamed shows up with Bob,
and in fact actually this is the, the non-player character of

(39:41):
Bob is actually savvy already.
Because when he shows up with, Hey, I'm overloaded by customer
support, he also says, and I'm not going to be as responsive to
you as you're used to me being. That kind of signaling is deep
expertise about how during the incident, if you're overloaded,

(40:01):
knowing to say so is deep skill in the coordination.
So I can't emphasize enough:
these are unusual skills for an elite group of folks involved in
incident response.
And I, I didn't wanna let their expertise pass
without commentary.

(40:22):
What I'd like to do now is, unless there's
sort of further, we could, like, I'll get back to, you know, step
out of the meta, go back to the retro, and we can start talking
about themes.

Courtney (40:32):
The only, the only meta thing I'll add to that is
never have I recorded a podcast where I just sit here and say
nothing, because everybody answers all the questions I
would already ask.
So just, like, keep on.

Eric (40:44):
I'll dip into the themes a bit.
I, I pulled four themes out.
Because I know you're not used to being in a retro like I run,
this is an unfamiliar format.
I'll take a moment to explain what a theme is or why we care
about it.
Maybe the fastest way to explain it is a theme is something I
heard from more than one person in the conversation.

(41:07):
I saw a pattern from my own investigation of what happened
in the Slack transcript before I interviewed people, or in the
discussions I had with the participants independently
these patterns showed up.
So, for example, the sense of feeling overloaded,

(41:28):
of feeling saturated.
I asked Alex the same question you did,
Courtney, about, did this feel real?
And he started that conversation by saying, yeah, it felt real as
soon as I was saturated.
Because Alex and I know each other, we
didn't have to explain what saturation is, and we were

(41:49):
able to move through that conversation quickly.
But that experience of saturation was firsthand; he knew
he was saturated.
That's part of what made it realistic. Sarah, on several
occasions...
In fact, actually, I asked Sarah about this particular
moment in the incident when she was successfully

(42:11):
managing four or five different threads of action by four or
five different people along parallel tracks for
troubleshooting.
I was like, so if that was me, I'd have been overloaded in that
moment.
How was that for you?
And she was like, no, no, business as usual.
I do that all the time.

(42:32):
And I, I mean, not quite as blase as I just put it, but
in all seriousness, Sarah was not saturated by stuff that
absolutely would've buried me.

Sarah Butt (42:41):
Sarah also had a like, I think, I think the other
thing we have to call out here and, and I mean 'cause you are,
you are exceptionally kind, Eric.
You are.
But I think the other piece here is
I had a luxury that not every company has.
I had a deputy and not only did I have a deputy, I had an
incredible deputy that, while Alex and I have never run an

(43:02):
incident together, we have done a lot of other work together.
We've written papers together, we've traveled together, we've
worked together for a very long time.
And that allowed two things with managing saturation.
One is, you see in the beginning that I throw a lot of things at
Alex.
I'm like, Hey, I need the Grafana checked.
I need interface with biz comms.
Put Bez in a box.
He is making an entire mess of this whole thing, and he is

(43:24):
terrifying my engineers, get him out.
And I knew, because I trust Alex and how Alex has experience in
incidents, that Alex would load
shed as appropriate or deprioritize, like he would take
the patterns of managing saturation that we see in
resilience engineering.
And I trusted him to apply them.

(43:44):
So I didn't feel like I had to carefully manage his workload as
much as I did some of the NPCs because I knew of his
experience.
I think there's something to be said about that.
The other thing about
having the relationship with Alex that I do is it allowed me
to be very blunt with him, particularly on the voice
bridge.
So you, you hear me sort of fire things off at him: get me a CAN,

(44:05):
do this, put Bez in a box, and there's not a lot of like fluffy
language around it.
Would you please go do, put Bez in a box?
So you're going to need to do this. Because Alex and I operate
on this common ground that enables us to just communicate
in a way that I, I think, had we not had that combined
experience, I would've faced a lot more saturation because I

(44:25):
wouldn't have been able to really efficiently offload.

Alex Elman (44:28):
Yeah, Sarah knows what I know and don't. I know
what Sarah knows.
There's certain things that she doesn't have to explain to
me, like, can you go work on the CAN report?
She doesn't have to explain a CAN report to me.
And I was dealing with Bez in the business comms channel.
He was quite pushy, quite noisy, kind of coming down on Sarah.
And I was managing that, but I didn't know if any of that was

(44:51):
leaking through on the, the other channel, because I didn't
have time to context switch.

Sarah Butt (44:56):
Had zero idea.
Zero idea.
I thought he was totally calm.

Alex Elman (45:01):
And you mentioned that during the
incident, and when you had said that, that's when I realized,
oh, well maybe, maybe what I'm doing here in the, in the
putting Bez in the box is effective.
I just didn't know it at the time.
It was 'cause I was so saturated.

Courtney (45:15):
So we're having a bit of an incident of our own right
now in that Eric has disappeared
out of the channel, out of the recording tool that we are
currently using.
I noticed this because he was sharing a screen with us to show
the document that he uses to run the retro and like the screen
disappeared and then I see like all four of us, and I'm looking

(45:36):
while you're talking and I'm like, uh-huh.
I'm like, he's gone.
He is not in the list ofparticipants.
Eric is gone, y'all.
And I don't know what

Sarah Butt (45:45):
Do we have a BCP for, uh, for, uh, facilitating a
retro?

Courtney (45:52):
I am quite certain that someone has actually gotten
called into another incident while conducting a retro.
There is like a very non-zero chance that has happened to
somebody.

Sarah Butt (46:01):
Oh yeah.

Courtney (46:02):
But so I'm now trying to incident command this.
I'm looking in Slack, but I have nothing from him, and if he lost
his internet, then he's also not going to be able to tell me in
Slack.

Sarah Butt (46:16):
Alex, can you text him?
I don't know if I have his number.
I might.

Courtney (46:20):
I don't know if

Sarah Butt (46:21):
have his?
I would say I'm pretty sure of all the people, Alex is gonna be
the one.

Courtney (46:27):
Okay.
I am gonna hit stop really quick.
So it turned out that Eric had a power outage in Boulder,
Colorado, where he lives, lost his internet, all of that.
And we were already pushing, I don't know, I'm looking at the

(46:51):
timeline like 45 minutes, like a full episode of the podcast.
And we hadn't even scratched the surface.
So, and thankfully, I guess it was a good, it was an omen
because right after Eric's power went out, I got a migraine.
Unbeknownst to me during the recording, Sarah was also
getting a migraine, so I don't know if it's like incident PTSD

(47:14):
or what the hell was going on, but we decided to break and we
have a part two coming for you.
And you'll get to pick up where we also left off.
So see you
in the next one.
