
January 30, 2025 • 51 mins

We took a bit of a hiatus from recording last year, but we're back with an episode that I think everyone is really going to enjoy. Late last year, John Allspaw told me about this new company called Uptime Labs. They simulate software incidents, giving people a safe and constructive environment in which to experience incidents, practice what response is like, and bring what they learn back to their own organizations.

For the record, this is not a sponsored podcast. I legitimately just love what they do. And I had the sincere privilege to meet Uptime's cofounder and CEO, Hamed Silatani, at SRECon EMEA in November, where he gave a fantastic talk about some of the things they've learned about incident response from running hundreds of simulations for their customers.

They recently had their first serious outage of their own platform. And so Hamed is joined by Joe McEvitt, cofounder and director of engineering at Uptime, to discuss with me the one time that Uptime met downtime.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Intro (00:15):
This is The Void Podcast, an insider's look at software incident reports. When software falls over, as it does, it's people who put it back together again. Each episode, we hear what it was like from the perspective of the people involved.

Courtney (00:30):
We took a bit of a hiatus from recording last year, but we're back with an episode that I think everyone is really going to enjoy. Late last year, John Allspaw told me about this new company called Uptime Labs. They simulate software incidents, giving people a safe and constructive environment in which to experience incidents, practice what response is like, and bring what they learn back

(00:52):
to their own organizations.
Uh, this is not a sponsored post or podcast. I legitimately just love what they do. And I had the sincere privilege to meet Uptime's cofounder and CEO, Hamed Silatani, at SRECon EMEA in November, where he gave a fantastic talk about some of the things they've learned about incident response from running hundreds of simulations for

(01:14):
their customers. They recently had their first serious outage of their own platform. And so Hamed is joined by Joe McEvitt, cofounder and director of engineering at Uptime, to discuss with me the one time that Uptime met downtime.

Hamed (01:31):
Thanks, Courtney.
Thanks for having us. I must say, it's a great honor to be here and talking to you. For many years I've been following your work, in conferences or through videos. So being here is awesome. An introduction about me: civil engineer turned software engineer.

(01:51):
Early on in my career, I wiped out an Oracle database that listed all the products that our shop was selling. And that was my introduction to how things fail. And I realized that actually it was very stressful, but I enjoyed it, got a lot of energy.

(02:11):
So since then, most of my roles were either in incident management, incident response, or application support. That's my background, briefly. We loved it so much that I managed to convince Joe, and we set up Uptime Labs just about three years ago. We are in the business of creating incidents.

(02:32):
We create many incidents every day, most of them P1. And the idea is to create a safe space for people to experience incident response and work together. It's really stressful in real life.

(02:53):
And it's stressful in ours as well, but no one dies. No one gets fired. So it's a great place to practice.

Joe (03:04):
Yep.
And I'm Hamed's Irish counterpart. My name's Joe. I've been working with Hamed for about 15 years. I'm more of the software development background. Hamed and I worked together in the same company: I was shipping stuff, and Hamed was on the operations side fixing stuff. So that's how we

Courtney (03:22):
Perfect pairing.

Joe (03:23):
started.
Yeah, yeah, that's how we started working together. I've sort of always been obsessed with... I love software and I love technology, but I was always also equally fascinated with the people, the social side of how we work together. It always just used to bamboozle me when I'd join a post-incident review session and we'd start talking about what

(03:44):
went wrong and stuff like that. I was always fascinated and obsessed with how you can improve. And like Hamed, I just think it's bonkers that a lot of people go on call for the first time and are expected to just figure it out. Here you go. It used to be a BlackBerry; that's how I started. Right. But it's just: you'll figure this out, this complex system across multiple

(04:08):
regions and stuff like that. Other than, you know, sitting behind someone for a week. So we've always been obsessed at Uptime Labs: there must be a better way to prepare yourself to deal with incidents. So yeah, that's our backstory.

Courtney (04:23):
I love it. Bonkers is accurate. You have a lot of children. I don't have as many as you; I just learned this. But it's as bonkers as somebody just handing you a baby and leaving the hospital. You're like, what the... I don't know how to do that. So yeah, here, you're on call. Good luck. It'll be fine. So, wonderful. I am really thrilled to have you two join me because

(04:43):
this one, this is a slight variation on the podcast. You had an incident, but it's not publicly written up currently. So I was given the privilege of reading your internal writeup of it. Which is cool, 'cause I don't always get that; I got to nerd out on that. But also because you are, as you'd said, Hamed, in the business of helping people

(05:04):
learn from incidents. So the way you all wrote this up and the way you approached it was really different and refreshing. And that's definitely what I want to dig into. And so the title of this episode is When Uptime Met Downtime, which you chose, actually, Hamed. I couldn't write a better title if I wanted to. Especially because people don't have a report, an incident report, to read. I would love it if you could give us the very quick TLDR, if

(05:27):
you can, a summary of what happened, what went down at Uptime that one fateful day.

Joe (05:34):
I started, I started the chain. So we have a morning practice where we do our health checks, checking our systems, et cetera. And I noticed an internal workflow not working. So, like, a totally minor issue. And I published it into our incident chat room to say, oh, you know, I've noticed a minor incident.

(05:55):
Unknowingly, I'd triggered a set of cascading,

Courtney (05:59):
So are you saying that you're the root cause, Joe, you
started the incident?

Joe (06:03):
I was.
Yes, exactly. The root cause was me. And I was totally relaxed too. I started the day, like, I literally went on another call. I wasn't directly on support that day, but it started a cascading set of incidents, a set of issues that cascaded and cascaded. And ultimately we had a full downtime outage within probably

(06:25):
90 minutes of that initial trigger point. And in total, it took us four and a half hours to restore our platform. So yeah, that was on a lovely Monday morning as well, to top it off.
to top it off.

Hamed (06:41):
I was just going back on what Joe said. I never figured that out until today: when do we put the pin on the start point of our time-to-recovery measure?

Courtney (06:52):
It's a fun one.
Yeah.

Hamed (06:53):
Was it when you noticed it, or when the customers were actually affected? And I'm saying that, Courtney, with tongue in cheek.

Courtney (07:00):
I know you're taunting me with that one. But when I read through it, I was like, okay, they put it at a certain point. And this is something I've talked about, and other people have talked about: when you're trying to track durations of things, or any of that, like, when did it start? It's a very metaphysical problem, but also sometimes a very real problem. It's almost like you poked Schrodinger's cat by looking in there to see what was happening.

(07:22):
And, just for context for people listening, we're not going to get into the gory technical details of the incident, because that's actually not what's interesting. And I'm sure there are a few people who will be like, I really want to know. But the reason I wanted to have you folks on was to talk about the other pieces of it. And Joe, you said you'd always been fascinated by the people

(07:45):
and whatnot. But I am curious, if you can tell me: was this sort of the first really serious one for you all? Like, was this the first real customer-impacting incident, the first real go at this, for your own software?

Joe (08:01):
Yeah, it was probably three years into our journey and, you know, we'd had a few, but this was definitely the largest, most critical incident we've had. And we had a good streak going as well. So yeah, it had been a long time since we'd had something like this. Four and a half hours is definitely the longest outage we've ever had. And hopefully it'll be a while before we get another one like

(08:23):
that.

Courtney (08:25):
Yeah.
I think the thing about the duration that people don't always think about, when you think about how long an incident is, if you will: they're thinking, obviously and rightfully so, about the customer impact or the financial impact or whatever. But the longer these things go on, the more stressed out, the more tired you get. As you'd mentioned when you first ran across that yourself, Hamed,

(08:46):
there's a whole physical component to it, as well as the mental side of it. I wanted to talk a little bit more about one detail in there, in terms of that timing and that process and what that felt like. People often have neat and tidy ideas about incidents, right? A problem is identified, people investigate, you troubleshoot, you resolve it, right? This nice linear process.

(09:06):
As you all know from running these, probably more than most people know, since running incidents for people is your business, it's rarely the case. And in the writeup that you all did internally, I noticed that, as we were talking about, it was about four and a half hours end to end, but halfway through it, I believe it said a second incident was declared. Can you talk to me a little bit more about what that was?

(09:27):
Like, was that, was there confusion? Was it the same incident? Was the scope of it increasing? Because this is something I think is more common than people realize, that you have these sorts of incidents within incidents, or sometimes you have incidents or things where you can't tell yet what it is. And I love that little detail in there in the timeline, that was: oh, hey, by the way, second incident declared. I was

(09:49):
like, oh God, what was that? Tell me a bit more about what happened there.
Yeah,

Joe (10:04):
flagged it. Okay, it was an internal issue, not customer facing. Right. So off we went. And then within a couple of minutes, one of our engineers flagged it, saying, hang on a second, there's a set of change sets in production that are unexpected. And he just flagged it up internally.
Okay, he just flagged itinternally.

(10:26):
Now, we had done initial smoke tests, et cetera, that morning, so, you know, some of the core functionality was working as expected. But very quickly, behind the scenes, we started making corrections. So the user could see a change set on a project which we'd been working on for a couple of months. It was actually a large change set. As time ran on, the engineer

(10:47):
investigating realized, wow, there's actually a massive change set that has been released into our environment. And that started a cascading effect: he started to make these small changes and, slowly but surely, the cascading issue started.
And what happened was that there was a set of

(11:07):
pending changes sitting there. As soon as he fixed an issue, the new set of changes went across into our environment. Totally unexpected, right? We were totally unaware. And this is when things started to get worse. Suddenly the issues started coming in from our live customers, to the point where it got doomsday, where we had a

(11:29):
full outage, 90 minutes in. And that's really when, from a personal perspective, it was like, okay, this is way bigger and more challenging than we initially thought. And that's when we had to take a step back. So it was like a two-legged incident, if that makes sense.
Yeah.

Courtney (11:46):
It's a classic situation, where the debugging, the trying to figure it out, makes it worse, right? We've seen that one quite a bit, actually.

Hamed (11:57):
It was one of those days when you realize something and then, like a very cold stone in your stomach: this is actually worse than I thought. And then that thing gets bigger and bigger. It was definitely one of those. The incident was... so Joe was

(12:20):
dealing with it. Joe and the engineers, they were dealing with really the hard part of it, which is figuring out what it is. But I think it was even harder for me. It was a very tough day, and at many levels I felt conflicted with myself.
Bear in mind that I have an engineering background, been
(12:41):
fixing incidents and dealing with them for many years. In this business, I'm put in the position of CEO, and just managing the temptation: I want to be inside, I know what is going on, but not be there.
And then dealing with thepressure of sales calls being

(13:04):
canceled or delayed. It put me in a position to experience everything that, over the years, I'd argued management should or should not do. I faced every single one of those decisions in those 48 hours, during and afterwards, which, yeah, was

(13:27):
very tough. I can expand on it if you want, but

Courtney (13:30):
I actually would like you to, because one of the things that a group of us have been talking about in the community, the resilience engineering and learning from incidents community, is... I think even I'm guilty of this too. A lot of us for a long time abstracted away... like, we have the sharp end, as we would say. I make a little cheese wedge sign with my fingers whenever I do that, right? The people at the sharp end, where everything's happening,

(13:51):
Joe and crew, right? And we talk a lot about that and them, and I think we should; there's a lot of benefit to that. And we'd always be like, oh, the blunt end, right? Womp womp, Charlie Brown's teacher voice: the business, the management and stuff. But they have their own set of hard trade-offs and implications in these things.

(14:11):
And we don't actually talk about that a lot, but a bunch of folks have been talking about: what is it like to be empathetic to that? What does that look like? It's a part of the incident. It's a reality for the organization. And I think you're in a unique position, because you are aware of those dynamics more than probably a lot of managers and execs, maybe especially at much larger companies.

(14:34):
When should the CEO get in the incident chat room? Questions like that, things I think you're deeply aware of. So I would love to hear a bit more about your perspective on that, especially the conflict, as you say, of having gone from being a sharp end, sort of frontline, person to the person in charge. I think it would be very helpful for people to hear what those

(14:55):
trade-off decisions or considerations look like.

Hamed (15:00):
Okay, great. So you asked for it. You need to stop me whenever I go on too long. So I'll start from the morning, when the first message I got was from Joe: I suspect there is something wrong, there might be some issues, I'll let you know later on. And that week was very important for us.

(15:21):
We had three sales demos during the week, an important deliverable on one of our features that we wanted to progress and show, and one meeting, I think on the same day, on the Monday evening: a customer session I had with a senior customer. So I just wanted to be physically present and see him

(15:44):
using the product. As soon as Joe said there could be a problem, my thinking immediately switched: okay, should I keep these engagements? Should I cancel them? And I thought, let me ask Joe. Can this be fixed before afternoon?

Courtney (16:00):
The classic management question, right? You're the one asking that question.

Hamed (16:05):
But before I typed that into Slack, it was almost like my previous self held my hand: if you ask that question, is it going to help him? Is it not going to help him? Will it add stress? Either way, they're going to fix it as soon as possible.

Courtney (16:23):
Yes.

Hamed (16:24):
Literally, I had to pull my hand away from the keyboard, bite my tongue, sit down. And I was desperate to get more information. So that was another edge: do I ask how things are going? What do I do? So that was the second difficult thing, to step back. Actually, Joe, maybe I slipped once or twice.

(16:47):
Would you help me?

Courtney (16:48):
Did he give in Joe?
Did he ask?
You can tell us it's a safespace.

Joe (16:52):
Hamed's like a brother to me, so it's no filter. So, one of the things we did do was we deliberately created space. So for the investigation team that was running, we really created space for them. And then we had, which is good practice, a different communication channel for updating stakeholders. Hamed was never in the investigation war room at any

(17:15):
time. And so that helped, and we did practice that, even though it was really tempting to go in and help. We just have this philosophy of leadership. Hamed, maybe you want to elaborate on that? About how, even if they want to help, someone just being there at best makes things... Yeah. You want to talk about that?

Hamed (17:31):
That's, I think, borrowed from John Allspaw: senior management being present on the incident bridge at best is not helpful.

Courtney (17:40):
At best.

Hamed (17:41):
That I remember. But what I did on that day, which later on I think I was proud of, was: okay, I can't help and I shouldn't really get involved in fixing this, but how I can help the team is just being brave enough, picking up the phone with the customers and the people engaging with them, and letting them know it won't happen this week, and postponing those engagements.

(18:03):
And then the other thing was, I think it was on the second day, Joe, we were talking about: so the service is restored. There's a bunch of work that needs to be done to just make sure, A, we understand exactly what happened, and get into a stable state.

(18:23):
Do we prioritize that, or do we get back to that important delivery work that we promised? And at that point, there were a lot of promises being made. And I think that was another time where I seriously felt conflicted, because I so badly wanted that feature out. Then I stepped back and said, it doesn't make sense to distract the

(18:48):
team with the delivery work. Let's take as much time as it takes. We do this incident, we learn from it, the things that need to happen. Then we start this production machine again later in the week, or next week. And that was, again, a difficult call for me.
The last one was when we started to understand, okay, how this

(19:10):
chain of problems started and manifested itself. Joe talked about a change set being deployed to production when it wasn't meant to be. That change set went to production over a weekend, because one of our good engineers, being really thoughtful and helpful, wanted to deliver a piece of work.

(19:34):
And he used his weekend to complete the work and get it to production. Now, what I was thinking in my head: should I ban everyone from deploying code on the weekend? Should I tell him, have a conversation with him? That was the most natural one.

(19:54):
But then I thought, actually, what happened here is... we really learned that, or it was more a reflection on myself: if as a business we want to move fast, we need to actually encourage what he did. He wanted to take initiative, to deliver completed work, deliver changes to production.

(20:15):
The fact that it resulted in incidents, I think that's a systemic learning for us: why it resulted in that. That was another thing: okay, how do we act on the back of this? So we encouraged everyone: the work you did was great, keep doing it. Everyone else, if you want to do the same, we need to go faster, we don't want to slow that down, but we need to learn what went wrong here, or what things went wrong, and avoid that.

(20:40):
So yeah, we celebrated that person's work, which was against my natural... not natural, the first response that came to my mind. On the back of all of this: you always say that incidents are an opportunity to learn more about your systems, how they work. For me, it was also an opportunity to learn more about

(21:00):
how my brain naturally works. So it gave me that spotlight into my own personality, what the tendencies are, and how I can change the things that come to my mind.

Joe (21:14):
I'll just riff on two points from when Hamed was talking, just two things that I'll never forget. When you have incidents, right, they're literally memories. It's almost like a time machine. You just remember where you were when it happened, et cetera. And there were two moments. The first moment was, there was a point when the cascading issue was happening, and I just got the sense of losing control.

(21:37):
I had this mental model of how things would work. And, yeah, okay, so this change went in here, let's make this fix here. And suddenly, you know, cascading issues, these pending changes that I had no idea about, started coming through. And I just felt like I was losing control. That was quite scary.

Courtney (21:54):
Yeah.

Joe (21:54):
That was a quite scary sensation about losing control. You know, as an engineer, you've got this logical breakdown, and in reality, it just wasn't like that.

Courtney (22:03):
And you built the thing, right? Or you were part of building this thing. You should totally know how it works. Yeah, so it's scary. It's humbling. I think this might be the first one in the list that you have in the writeup you all did, and this is an internal thing too, which, I think it was great, because it's better than a lot of internal or external reports I've seen.

(22:25):
Better is a very normative judgment, but I'm going to run with it anyways. It had a lot more helpful and interesting information. We'll go with that instead of better. How's that?
And you had a number of surprises called out. And I love this, because to me it feels a little bit like a call back. I don't know if it was intentional or not, but it feels to me a little bit like a call back to our former colleague, Richard Cook, who used to say that all incidents

(22:47):
are fundamentally surprises. And you went through the things that were surprising about this, and one of the ones I wanted to highlight, I think it might have been the first one in the list (I reordered things as I was editing this), was a "feeling of paralysis" during the incident. And I don't know which of you this was, now that you've both told me your stories. No ability to assist meaningfully. Okay. Oh, and technical troubleshooting and resolution.

(23:09):
That was probably you, Hamed. Yes. Is that where that piece came from?

Hamed (23:14):
Yes, because I thought, I've done this all my life. I should be there. I should be part of it,

Courtney (23:20):
Okay.

Hamed (23:21):
I couldn't make any difference.

Courtney (23:23):
But Joe, you said something on, as it were, the sharp end. That's the same thing, which is feeling... you didn't say helpless. What was the word you used?

Joe (23:30):
Out of control.

Courtney (23:31):
Out of control.
Yeah.

Joe (23:33):
Out of control. So there was one point when we had all our working theories and, you know, we were working on these theories about what to do, and we were coming up with our plan. And then suddenly we started getting this new information coming through. And then we had a full outage, which was totally unexpected, right? You know, we thought we were moving towards green.

(23:55):
Instead, we were going more towards red.

Courtney (23:58):
And how could you not know that? Yeah. Why didn't you know that you were out?

Joe (24:04):
I wasn't the only one, by the way. The team, we were all standing there, just agape. Virtual, right? This is all virtual; we're a virtual company. And that idea of losing control, that mental model of the current state versus what we need to get to... I can never forget that point when suddenly I hit the web, I hit our platform, and

(24:26):
we're down, right? So we moved from a minor to a major to a critical. So that was a scary, a scary thought. And then I consciously had an out-of-body experience, but remained calm

Courtney (24:38):
Yeah.

Joe (24:39):
to regroup, and the iteration started again. We just, okay, what's going on, right? I mean, the loop starts again, right, for the

Courtney (24:45):
Yeah, and this is what I loved about this, because this is the reality of incidents that never makes it into incident reports. You might talk about what you were factually surprised about, right? That'll maybe be a surprise in an incident report, or people talk about how they got lucky or whatever. But this physical, emotional experience is so common. And so I'm just thankful to you both for writing about it and

(25:08):
then sitting here and so honestly talking to me about it.
So one of the other surprises was... and you've already largely alluded to this, so I don't know if we necessarily have to get into more details of it, Joe, but if there is more: you said "a chain reaction of seemingly unrelated issues triggered a snowball effect." And this is definitely, I think, the cascading piece of it, and you

(25:31):
were talking about that phase that people are commonly in, where you've got four operating theories or something, right? Like you've got a choose-your-own-adventure of where do we start, what do we try, because we want to have a good theory about what's happening before we do more, or it might get worse, or it might get worse even if we don't do anything. But talk to me about the sort of

(25:53):
unrelated issues triggering a snowball. And if there is more, you can add to that.

Joe (26:00):
Okay, so when we started the investigation, we had this change set. Okay, there were changes that came through that no one expected. So we sort of had like a ring fence around the technology set, right, the area in our architecture which was impacted. And suddenly, by making those small corrections, suddenly we started to see issues

(26:21):
on the other side of the architecture, right? Like the other side of the

Courtney (26:24):
Outside of what you thought the scope of control or impact was. Ugh, yeah.

Joe (26:32):
And not only that, they were really bad, right? Really bad. Okay.
So it's bad when, you know, the platform is offline, and that was like, this doesn't make sense. This is unrelated. Right. So then you have to pause, right, and take that space. Okay, let's investigate. Right. So, obviously, the obvious technique is escalating, bringing other

(26:54):
eyes to it, fresh eyes and stuff like that. So I brought in all our other folks from the team. So I'm increasing the size of the investigation team, fresh pairs of eyes. And then suddenly we start spotting more and more. Okay, well, I can see the error. I can see the problem, et cetera.

(27:15):
And now we're looking at anotherset of changes.
And then suddenly we realizethis pending change problem.
That was like a moment thatthere's basically a whole change
sets were coming through thatwere blocked.
and then when these changes wentin, suddenly the scope of change
is like tripled, right?

(27:36):
And then we started moving into... I had this really scary point where I said, let's just roll back. Can we roll back? And I could see, because we're on a video call, I could see the look in the engineers' eyes saying, we're rolling forward. So that, again, that was a...

Courtney (27:52):
Can we talk about that a little bit more, please, in as much detail as you're comfortable sharing? Because I hear all the time, ah, just, like, incident management stuff: always have a plan to roll back. Always have a plan. Sure. The best laid plans, blah, blah, blah. Was the fact that you couldn't roll back also a surprise, or was it a known thing, but you hadn't

(28:13):
had to run into it before?

Joe (28:15):
No, no, that was it.
That was a learning for me inthe incident.

Courtney (28:18):
Okay.

Joe (28:19):
So I think at a certain point I said, okay, I'm going to time box this. And, you know, once we get our heads right, it's okay, we can always revert back. We've got our exit strategy, a plan B, especially when things started going awry, right? Getting worse and worse. I was like, okay, time out, time out. Okay, let's step back. Let's just see if we can roll back the changes.

(28:39):
But the problem was, I think, when you talk about rollback: rollback's great when you have small, distinct changesets and you can reason about the change quickly and roll back, right? It's the total advantage of continuous delivery and practices like that. The situation we were in was we had a larger-than-we-wanted changeset that had, let's just say, multiple projects

(28:59):
pushed in. Right. So it was like a large change set pushed into prod. For the engineers, it was easier to reason about fixing forward than about rolling back. Right. And remember, this is all under stress, stressful conditions. Right.
Right.

Courtney (29:15):
yeah.

Joe (29:16):
So that was the decision, when the engineers weighed up the pros and cons. They said, oh, Joe, it's easier if we roll forward and break down these issues that we have, rather than trying to do a wholesale change set rollback. So that was the decision we made. But that was certainly another memory, a moment in time, when we realized we can't roll back. One of those

(29:37):
gotcha moments.

Courtney (29:38):
And probably felt even scarier.

Joe (29:41):
We're back to learning from incidents, right? Like, if you're not convinced small changesets are a good thing, when you go through a situation like this, you learn very quickly the power of incremental change.

Courtney (29:51):
Yeah, a lot of times what I see is there's some kind of production pressure or sales pressure. And I don't mean pressure in the wrong way, just the natural pressure that led to that, right? Like, why did it make sense at the time that you had a big set of changes all roll out at once? And I'm sitting here thinking, I know that Hamed had a couple of demos and a feature. Was there a bit of looking back at that as

(30:16):
well? Like, what were the sort of other forces in the system that ended you up there, in that spot?

Hamed (30:26):
Yeah, definitely. So it's very easy to justify cutting corners to get new features out, because you think about it: okay, I want to try to win new business.
Courtney (30:42):
Yeah.

Hamed (30:42):
You're a startup, every sale counts massively.
So you're constantly communicating to people that our advantage is speed, moving things out as quick as possible, and forget about the consequences, what it means.
So basically, for me, the lesson during this incident was

(31:03):
that by moving, I think, too fast, unsafely, okay, I get to enjoy the reward of getting features out early, and forget that this cost is going to come back and get us at some point.
And then when you're given the bill, you get surprised.

(31:24):
Oh, that's all.

Courtney (31:27):
Yeah, you used a really interesting word that I think we use a lot within our community of resilience engineering and learning from incidents.
I would love for you to talk a little bit more about what you mean when you say unsafely.
What was unsafe about that, from your perspective?

Hamed (31:47):
The ability to release quickly and deliver features to production needs some infrastructure in place.
You need to have some sort of capabilities in place to be able to do that.
From how the team works, the skillset you have in the team, the testing and deployment

(32:09):
infrastructure that you have in the team, the practices that you follow, like Joe touched on, the big changeset versus a small changeset.
So it's a whole combination of practices and technical advancement that needs to be in place before you can

(32:31):
achieve a certain speed.
Without that, yes, you can.
We tried.
We did.
But there's a risk to it, because, to pick one example, and that was one of the learnings from the incident that Joe picked up, we learned that we've got to be very meticulous that we don't let changes pile up.

(32:53):
Before having that mentality in the team, trying to move very fast can have consequences.
And that's just one example.
There's a lot of that.
The other thing for me was the complexity of the system.

Courtney (33:07):
I think this is fascinating, because you're a small startup, you've been around for three years, some people might be like, shruggy face, but how complex could it be, right?
So can you talk a little bit about... yeah.
Oh, you were like, how complex can it be, y'all.
Yeah.

Hamed (33:35):
It was, like, simple.
All of it could fit in my head, and fast forward two years, I'm trying to understand what happened, and just the list of systems and tools involved: our CD does this, and then we have the set of Kubernetes here, and then we do

(33:56):
this.
And I was just: wow, how did we get to this place where it's taking even a few days to understand how things are connected?
That complexity definitely didn't help during the incident.
And I think in general it doesn't help with being able to deliver code fast and safe, because these systems do

(34:19):
so many things in your delivery pipeline, and if one of them does something slightly different, picking up on and understanding that is impossible.

Joe (34:29):
Yeah, like, it's the abstractions.
With Kubernetes, AWS, all these abstractions we have, right, which are brilliant for fast adoption, you're standing on the shoulders of giants.
You do that as a startup, right?
We follow these conventions, we get to move fast, right?
But it's all well and good until something goes wrong.

(34:53):
Right?
And suddenly you're digging into each abstraction, understanding where the issue is, what's happening.
And this was a cascading thing, right?
The cascading issues we were having were at different abstractions, and it was just non-trivial to debug.
So it's fine whenever it works, but when something goes wrong, that's when you're really challenged on these abstractions.

Courtney (35:14):
Yeah, and I'll throw the old AI automation loop in that one too; I was just talking with Hamed about this.
I got the extreme privilege of being on the This Is Fine podcast with Colette and Clint, and we were talking about how complex systems are in general, and then you layer lots of automation, or lots of AI, into it.
And then, again, I think it's the same thing.

(35:36):
It's great, all fine and good, until it's not, and you can't tell why, or what's happening.
So we're making that problem worse real fast as an industry.
It's interesting to hear you all acknowledge running headlong into that as well.

Hamed (35:57):
Yes, and it brings up a question that I still haven't answered in my mind: is it worth it?
Yes, I get the benefit of the speed, but there are definitely going to be moments when things don't work and fall apart, and those are becoming harder and harder.
So how do you mitigate that?

Courtney (36:16):
You're in a unique position in that your business is this business of incidents.
You might have some pretty good customer goodwill for a while, I would imagine, for that.
Or you could just tell them that they're in a meta-incident and now they have to help solve yours.
So good luck with that.

Hamed (36:32):
I tell you, I need to be careful, probably our customers will listen to this, but smaller incidents for us pretty much go unnoticed, because there's an incident drill going on, and then we have an incident, and that drill all of a sudden becomes a meta-drill.
So people think

Courtney (36:49):
Yeah, I

Hamed (36:50):
something, but it's a little bit more challenging

Courtney (36:53):
You have a certain advantage there.
They're like, oh, the incident just got way more interesting.
You're like, you have no idea.
Okay, so we just talked a bit about some of the other themes that you all had.
The desire to move faster while maintaining or improving safety was one of them.
One of the other ones that you had that I would love to talk

(37:14):
about, because I think it's also related to this reality of complexity, even in a small startup a few years in.
One of the themes towards the end was knowledge gaps within the team regarding specific areas of the infrastructure.
And I don't know, Joe, but I would love to hear more about that, because it's a concrete example of something that a lot

(37:36):
of the time we talk about in the abstract, or in nerdy research papers or whatnot, in the resilience space.
And here it is.
So I would love to hear more of it.

Joe (37:45):
A core part of our platform is this engine, and we've got this really quite sophisticated technical architecture that we've been building layers on over the three years, right?
It's the engine of our simulation, for the incidents.
So when you're in an incident, you see Grafana, you see real stats, real metrics, you touch a website, it's really down,

(38:07):
you're getting 503s, right?
So there's all that orchestration part, it's really cool stuff, and Hamed started it, right, with his thing two years ago, and we built upon that.
But when the incident was happening, we had all these mental models, right, of how things work.
I've got the diagrams, trust me.
I used to be an architect, so I've got the as-is architecture.

(38:27):
This system calls this, you know, A calls B, right?
Yeah.
Well, actually, A calls the proxy, and then the proxy calls this abstraction, and this calls this abstraction.
And then we need to get the keys, right?
And we need to get the secrets.
And suddenly I'm like, oh my goodness, okay.
Not only that, in the incident, you have different

(38:47):
opinions as well.
So, you know, two of our senior engineers during the call were saying, well, hang on a second, if this works like this...
You could even see there wasn't a shared understanding within the incident itself.

Courtney (38:59):
so it's the mental model of engineer A, the mental model of engineer B, and maybe a Venn diagram that overlaps some of that, but not even necessarily two people working on the same product.

Joe (39:12):
Right, right.
The sort of meta thing I got from this was, again, imagine an early startup culture, right?
It's way faster to have a single person work in a space and optimize that engine, right?
To deliver fast, it's way faster.
Until something goes wrong and that person's not there.
Right.

(39:33):
Okay.
You've got this abstract knowledge all built into one person, with the bus factor.
Right.
It's way harder, and takes discipline, to work as a team on these core pieces and to knowledge-share.
You know, it happens all the time.
You give someone else the piece of work and get them to make the change in that system, and it takes longer to deliver.

(39:54):
Right.
But you're getting the payoff of knowledge.
This was a big thing we learned: for our key pieces, we need to spread the knowledge, and that's only by doing, not by having whiteboard sessions and more diagrams.
It's

Courtney (40:11):
not runbooks or documents, but hands on the systems.
Yeah.

Joe (40:26):
So yeah, that's something we were very aware of.

Hamed (40:30):
Actually, this is a good point, Courtney, touching back on your question: what do we mean by faster and safer?
And I hold my hand up, because it's very easy: oh, X did work on that part before, hasn't he?
It's much quicker for him to do this part, which we need to deliver.
So let's get X on that.
And then X becomes the expert there.
Y becomes the expert in something else.

(40:52):
And Z works on something else.
And as Joe said, until something goes wrong and these people aren't around, or even between them they can't agree on what is going on.
So that was like, okay, I've got it again, the false economy: X did it, let him do it again, or let her do it again.

Courtney (41:10):
Yeah, it was Amy, the Kafka expert, who is my scapegoat for that one, from a previous company, actual true story.
But I think, Joe, the piece that's really interesting is, because with the VOID what I'm always trying to do is connect theory to reality to practice, right?
And there's all this research out there about expertise and

(41:32):
how you only get it by doing.
And so to hear you say that sort of warms my academic heart a little bit, because that's how the brain works.
You don't get better at something by reading about it.
You might learn about it.
You might understand something about it.
But you certainly don't just watch 14 YouTube videos and then get in a car and drive.
Although I did once watch a lot of YouTube videos and teach

(41:55):
myself how to use an excavator.
I'm not recommending most people do that.
I may or may not have done some bad things, but at least it was on my own property.
So, yeah, I just want to harp on that point a little bit more, because I think it's also my pet peeve with automation and AI systems, where

(42:17):
you're abstracting away the work that a human's supposed to be doing.
And again, it's probably all good until it's wrong, because then that person has probably had no hands-on experience with that system and what it's supposed to be doing.
And so it's much harder for them to even, A, understand what it's doing, and, B, go in and try to reason about it and fix it.

(42:38):
And so hearing that in the wild, as we say, is, I think, really important for people to understand.
A runbook is not the same as having your hands on something.
And it is disciplined, you're right.
And you, Joe, to let four people on your team rotate through something or spend more time on it, would have to go to Hamed

(42:59):
and say, in order to build this degree of safety into our system, you're not going to get XYZ until two weeks from now, or whatever that looks like, right?
Culturally, that has to be acceptable in order for that to work.

Joe (43:18):
Yeah, yeah.
And you know, you see other practices, like chaos engineering, I know Courtney mentioned that before as well, and we're trying to mock that discovery exercise with the team too.
Those practices work.
And the other thing we did, this is another theme we had.
Hamed, you should definitely talk about the way you ran the PIR before

(43:39):
you go.
But one of the themes we had was simplicity.
Again, you know, we did take a step back and say it doesn't need to be that complicated.
Yeah, we are where we are now, but is there an opportunity, can we make this simpler?
Can we simplify it?
Right, and again, that's more investing in, you know, I look at

(44:01):
the positive.
You can call the glass half empty, technical debt, or the glass half full and say this is technical agility and resiliency, and just make it easier.
So maybe there's things we did in-house that now we probably could abstract, right?
Things like that, yeah.
But that idea of simplicity, like, can we make this more simple?
So, like, there's things, for example, that we maybe don't need:

(44:22):
we're a startup and we built stuff we maybe don't use now; it's time we churn and clean that up.
So again, another key theme was: let's make our architecture a bit more simple.

Courtney (44:35):
So Joe mentioned how you ran the PIR, and I didn't see this in the docs.
I see it.
I do, I lied.
I see it in the doc, in that I see a very thoughtful post-incident review, the learnings from it, in this document.
But there's not a meta "here's how we did this."
I would love to hear a little bit more about that, if you don't mind, Hamed.

Hamed (44:54):
So I'm going to share something.
It's a safe space, but it touches on the PIR as well.
It was the first time ever in my career I picked up on it, and it really had a profound impact on me, because when Joe, I think it was day two, called me to explain what we knew about it

(45:16):
and how the situation was, and that there were still some lingering problems, I read a sense of guilt.
It was almost like, "Sorry about what happened," and it was

Courtney (45:28):
Guilt from Joe?

Hamed (45:30):
From Joe, it was in his voice, and that all of a sudden hit me hard.
Never in the last 15 years had I noticed it, but then it reminded me: yes, when I was an engineering manager, when I was doing that job, explaining how incidents were progressing or what happened, I had that sense of guilt in me as well, going to the CTO, as if it

(45:54):
was like,

Courtney (45:55):
Your fault.
We

Hamed (45:55):
are at fault.
And I just picked up on it for the first time in this incident, probably because the roles were...
So at that moment, I promised myself the first thing I'm going to do in the PIR, when we meet people, is just to thank everyone for the amazing job they did, and recognize that it was a very stressful time for

(46:19):
everyone, and they all did well in it, with special thanks to Joe as well.
So that kind of sense of guilt in the incident, it was the first time I picked up on it.
Coming to the PIR, I don't think I did something extraordinary.
There were two questions we spent a lot of time on.

(46:40):
One was: what was so surprising in this incident, from everyone's perspective, because it was different for each person.
So we went around the table and everyone explained what was surprising for that person throughout the incident.
So we captured that, and some of it, actually all of it, is

(47:00):
reflected in the internal report that's there.
And then the other question was: what made it difficult to fix?

(47:33):
from this specific incident?
So I think that was the only thing that I did, unless I've done things or run it in a way that I didn't notice.
Yeah.

Joe (47:54):
retrospective.

Courtney (47:55):
Yep.

Joe (48:07):
Hamed set the scene as a blameless setup, and then that openness came.
Then, generally, there's follow-up after the post-incident review: okay, what have we learned from this?
What are we going to do?
What are we going to change?
And that's a pattern I believe any organization can copy.
And also the timing of it as well, Hamed, the cadence of the

(48:29):
timing is very important when you do the PIR.
It's about Goldilocks: it can't be too soon and it can't be too late, right, just getting it right.

Courtney (48:36):
Which is not always easy to do, for sure.
Joe, do you feel like there was a feeling, not because of organizational pressure or anything, but do you think people felt guilty, or did you feel guilty?

Joe (48:49):
Yeah, because you're very proud of what you build and ship.
You know, when you have a four-and-a-half-hour outage, as a co-founder of Uptime Labs, I see the impact, the business impact, the opportunities we missed for those four and a half hours.
There's an incredible amount of guilt.
But, generally, first of all, the leadership coming back,

(49:10):
Hamed coming back and saying, look, what did we learn?
That stays with the tribe; as people join, you don't lose that

(49:42):
knowledge.
It's really, really important.
So yeah, but it's human, right?
I think it's very human nature to feel bad.

Courtney (49:51):
You're showing up for something you care about every day.
The age-old thing that Allspaw and others would say is that people don't show up to do this work to not do well, to screw up, to make mistakes, or to cause problems.
And so even when it's not your fault, as it were, because it can't be with these kinds of systems, you feel a sense of responsibility and ownership.

(50:11):
And I think to tell people, don't feel that way, is silly.
But to help them, as you did, Hamed, work through that feeling and get to a place where you can accept what it is.
And then, there are so many metaphors of incidents and parenting.
I feel like I could have a whole other podcast about that right now, because I'm just thinking about you not being in the room, and I'm like, how many times do I just have to

(50:33):
leave my 13-year-old alone and not touch it?
Anyway, another podcast, another day.
But working through those difficult, real human emotions is another often-obscured piece of the process that's so important.
It sounds like you all did a stellar job of it.
So on that note, we have gone quite long, which is great

(50:56):
actually.
And so I want to thank you both for joining me.
I will put in links to anything else you want me to include.
I'll put a link to Uptime Labs, and I will put in a plea for you to work with yourselves and whoever else you can to write these up for us to put in the VOID someday, please.
Because yours is so full of all the things I want to

(51:19):
tell people about, and being able to show that to them is really nice.
So even if you have a United States government redacted version, all the details gone, just a lot of black boxes and then what you learned, I would still be okay with that.
So that's my unfair request.
Thank you so much for joining me.
And I hope things stay uptime, without a lot of downtime, for a

(51:39):
long time.

Hamed (51:43):
Thanks, Courtney.

Joe (51:44):
The one thing guaranteed in life is incidents, right?

Courtney (51:46):
Yeah, a hundred percent.