Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Courtney Nash (00:30):
Today I am joined by Lawrence Jones, an engineer from Incident.io. We are here to talk about an incident report that you yourself authored from the end of November last year. Walk me through the top-level summary of what you feel like happened, and we can talk through some more of the interesting things that you all experienced.
Lawrence Jones (Incident.io) (00:51):
Yeah,
sure.
Courtney Nash (00:52):
And I'll put the
link for people listening to
that incident report in the show notes.
Lawrence Jones (Incident.io) (00:56):
The incident that we wrote up, we titled it "Intermittent Downtime From Repeated Crashes," which I think encapsulates what happened over the course of this incident. It happened back in November, and I think it took place over about 32 minutes, or at least that was the period during which we experienced some downtime. Though it wasn't the whole 32 minutes; it was periods within that 32 minutes where we were seeing the app kind of
(01:19):
crash a bit.
And I think it's quite useful to explain the context of the app, because otherwise you don't really understand what "incident" means...
Courtney Nash (01:24):
What is it y'all
do?
Lawrence Jones (Incident.io) (01:25):
Yeah, exactly. So, I work at a company called Incident.io. We offer a tool to help other companies deal with incidents. So, yeah, in a kind of ironic sense, this is an incident company having an incident. So...
Courtney Nash (01:35):
Everyone has
them, right?
No one is immune.
Lawrence Jones (Incident.io) (01:40):
Exactly.
So I think what happened here was: our app, which serves the dashboard (which is where you would go to view some information about the incident), and also the API that receives webhooks from Slack, because a lot of our incident response tool is Slack-based, that was the thing that ended up failing. Obviously this means that our customers, if they were at this particular moment to try and open an incident or do things
(02:02):
with incidents, they would be unable to, or they'd see some errors, in the period during which we were down. Hopefully they would retry maybe a minute later when it came back up, and they'd get through.
Obviously it's not ideal, which is exactly why we did a public write-up and went through the whole process of how can we make sure that we learn from this incident and get the most out of the experience, so that we can do better in future and hopefully eliminate some of the sources
(02:24):
that cause this thing.
And maybe if it happens again, or something similar happens, we'll know and be able to respond a lot faster.
Courtney Nash (02:29):
There are always these little interesting details that people might choose or not choose to include. So early on, in the initial crash description, there's this little detail that makes me think that something about this incident might have been a little stressful... You wrote "20 minutes before our end-of-week team debrief," which to me just makes me feel: it's Friday, everybody's trying to wrap up.
(02:53):
And your on-call engineer was paged. It was like, app crashed, out of Heroku.
Lawrence Jones (Incident.io) (02:58):
We actually use our product to respond to incidents. So we dogfood the whole thing. Which honestly, when I first joined I thought, "that can't possibly work." But it works quite well, actually. We get a Sentry error in, and we page on any kind of new error that we've seen, just because any of those errors could mean a customer hasn't been able to create an incident, and we take that really seriously.
Our pager has a fair amount of stuff coming through.
(03:19):
Not crazy; we're very happy with the volume, but a page is not something, at least in working hours, that we would be too worried about. It's kind of business as usual. But obviously when you get the page through and you realize that you haven't got an incident in your own system because we failed to create it, that suddenly means, okay, cool, you're on a bit more of a serious incident here, because it's not just a degraded system.
(03:39):
You're seeing something that's actually impacting the way that we are creating incidents on our end, which is not good. Like you said, it was on a Friday. So we're a fairly small team, about 40, 45 people at the moment, and about 15 engineers in that. And every Friday at about 4:00 PM we host team time, where we go and discuss, you know, how's the week been, et cetera. Obviously this happens 15 minutes beforehand, when everyone
(04:00):
is winding down or, you know, maybe someone's rushing to try and get a demo ready for demo time, something
Courtney Nash (04:06):
No.
No.
Lawrence Jones (Incident.io) (04:07):
Geared up for the team time that we've got coming up. And then suddenly you realize that the app is, like, pretty badly down. Well, firstly, our on-call engineer starts going through this and starts responding. And eventually the system comes back up. So we do get an incident in on our side, which means we've now got a Slack channel and we start coordinating it in the normal way. But I don't think we realized until a few minutes in that
(04:28):
actually this is something that will be bringing us down repeatedly, because when we fixed it the first time round, it went down again a couple of minutes after. And then it gets a bit more serious, you know, and you start getting a couple more senior engineers joining in and you start building up a team around it, and then it kind of snowballs and you suddenly realize team time isn't gonna happen in the normal way. So, yeah, that was how we started, which is never
(04:49):
exactly fun, but, yeah, it happens, doesn't it?
Courtney Nash (04:53):
You mentioned
from previous incidents you'd
already learned something.
Lawrence Jones (Incident.io) (04:56):
Yeah, so as I said, we're a small team, but we've been growing quite a lot. I think over the last 12 months we've gone from maybe four engineers to the 15 that we have now. So we've got a core group of engineers here who have dealt with all sorts of different things happening over that year. And one of the things that we do have that is a benefit to us is that we run our app in a very simple way.
(05:18):
At the point of this incident at least, we just had a Go monolithic app that we run on Heroku using the Docker container runtime. So we go and build the image, and we ship that image as a web process. This serves all of our incoming API requests. It also processes a lot of async work that's coming in from pubsub. And we have a cron component, which is just
(05:41):
running regular jobs in the background. But that's basically all there is to it. It's super simple.
Obviously with simplicity you get some trade-offs. Some of the trade-offs for us are: if something goes wrong and the process were to crash, then you are gonna bring down the whole app. Now, we run several replicas, which means it's not necessarily the whole app that goes down, but if there is something that is a common crash cause to all of the different processes that
(06:01):
are running at the same time, then you're gonna find that this thing will turn off.
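To make that trade-off concrete, here is a minimal, hypothetical sketch of the single-process shape described above (not Incident.io's actual code; the handler path and function names are made up): one Go binary serving HTTP, pulling async work, and ticking cron-style jobs. Because everything shares one process, an unrecovered panic in any of those goroutines takes the whole thing down, and if every replica hits the same poison input, they all go down together.

```go
// Illustrative sketch only: one process runs web, async worker, and cron work.
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Async worker: in the real app this would be a pubsub subscriber.
	go func() {
		for {
			processNextJob() // an unrecovered panic here kills the whole process
			time.Sleep(time.Second)
		}
	}()

	// Cron component: regular background jobs on a ticker.
	go func() {
		for range time.Tick(time.Minute) {
			runScheduledJobs()
		}
	}()

	// Web component: serves dashboard/API traffic, including Slack webhooks.
	http.HandleFunc("/webhooks/slack", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func processNextJob()   { /* pull and handle one async job */ }
func runScheduledJobs() { /* run periodic background work */ }
```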
Now, coming from running systems in Kubernetes and other similar environments before, you might be used to, and I certainly was, a kind of aggressive restart strategy. You know that processes are gonna die, so immediately something will bring them back up. On Heroku, you won't necessarily get this.
(06:21):
Heroku will try and bring you up quite quickly afterwards, but if something happens again, you don't enter this kind of exponential back-off that keeps trying to bring you back up. It goes quite harsh. So I think it might be placed in what Heroku calls a cool-off period, where for up to 20 minutes it just won't do a single thing. Which of course for an app like us is just nonsense.
(06:43):
Like, it can't work for us at all; 20 minutes of downtime because Heroku is just waiting around and has kind of put your app on the naughty step is not what we need. Being told to cool off when everything goes down is not gonna work.
So I think we've learned before that if the app is continually crashing, if we're seeing something that is bringing the app down, then if we jump into the Heroku console and press
(07:03):
restart, then we can kind of jump-start that cool-off period, which will bring us back up.
Because one of the benefits of it being a Go app is it boots very, very quickly; in fact, so quickly that we occasionally run into issues on Heroku, where I think they're built for more like Ruby apps, like racing against port allocation and stuff like that. But it does mean that it comes up really, really quick. So it is quite an immediate relief to the incident, I guess,
(07:24):
which is why we pressed it initially.
We're back up.
Cool.
Let's start looking into this, and then it's only two or three minutes later that we hit the same issue again. Presumably that brings the app down, and we have to go back and start hitting this manual restart button more often than we would like.
Courtney Nash (07:39):
So it's
interesting because, you know,
one of the things that I talk alot about with the VOID is, is
how interventions from peoplebased on their intimate
knowledge of the system, youknow, are typically what we rely
on, in these kinds ofsituations.
Was this the poison pillsituation with the sort of
subsequent crashes?
How did you get there?
Lawrence Jones (Incident.io) (08:01):
So I'll say upfront, by the way, when we came around to doing this review of the incident, our internal debrief has a lot more detail that's relevant to how we're running our teams and how we want to help our teams learn as well. And the preview of that is that there was a lot of stuff that went on inside of this incident, actions people took manually, that wasn't common knowledge amongst the team. So one of the things that we have definitely resolved to do
(08:22):
off the back of this is to try and schedule some game days, which we're gonna do, I think, next month, where we'll simulate some of these issues and do a bit more official training, making sure that everyone's on the same page when it comes to what to do in these situations. So whilst I might have mentioned before that Heroku is not doing a normal exponential back-off kind of pattern, thankfully we do have something in our system that is, and that's our pubsub handler.
(08:42):
So usually when this is happening, if you can see a crash in the process, you are going to go look for the thing that caused the crash. So we were kind of sat there in logs looking: well, where is our Sentry? Where is our exception? Where's it come from? We were really struggling to get it. And I think one of the problems here is that if the app immediately crashes, then you've got Sentry events and exceptions that
(09:04):
are buffered, as well as traces and logs, that don't necessarily find their way into your observability system. So what we were having was the app would crash and then we would lose all the information that it had at that particular point, which leaves you in a state where you don't quite know what caused the crash. You just know what led up to the crash, and if it's happening very quickly after you started processing the information, that's not ideal.
The good thing was that we could kind of tell from the pattern of
(09:27):
the crashes that it was something that was going into some type of exponential back-off.
And I think, having run the system for a while now, one of the patterns that you realize is: with our async work that lives in Google pubsub, when you fail to ack a pubsub message, you go into a standard exponential back-off pattern. So the fact that we were seeing subsequent crashes come maybe two minutes and then three minutes, four minutes, that sort
(09:48):
of thing, from each one of the retries, that gives you a pretty strong gut check that this is probably coming from some pubsub messages. Which is why we went from subsequent crashes into: cool, we might not be able to tell exactly what has crashed, because, you know, the Heroku logs are being buffered, so we don't even get the goroutine traces coming out of this app when it's crashing.
We obviously don't have any of our exceptions. We don't have any of our normal monitoring, but we know this
(10:10):
thing is happening to a regular rhythm. It's probably pubsub, and then you start looking for where we've got an errant pubsub message across all of our subscriptions. So you start looking through them and going: cool, the subscriptions that I know I'm okay to clear, I'm gonna start clearing. And we were just basically going through those and trying to clear them out so that we could get rid of the bad message, which presumably was the thing that was being retried each time
(10:32):
and bringing it down.
And that's how we had a couple of our engineers allocated going through subscriptions, whilst the others were looking through recent code changes or trying to find, from what logs we did have from Heroku, whether or not they could see a pattern in the goroutine traces.
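For readers unfamiliar with the retry behaviour being described, here is a hedged sketch of the consumption pattern using the Google Cloud Pub/Sub Go client; the project, subscription, and handler names are made up, and it is not Incident.io's code. A message that is nacked, or never acked because the process crashed mid-handling, gets redelivered (with exponential backoff if the subscription has a retry policy configured), which is what can produce the two-, three-, four-minute rhythm of repeated crashes from a single poison message.

```go
// Sketch of a pubsub subscriber whose un-acked messages are redelivered.
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "example-project")
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("example-subscription")

	// Receive invokes the handler for each message. If we Nack, or crash before
	// Ack-ing, Pub/Sub redelivers the message later, so a "poison" message that
	// reliably crashes the process keeps coming back every few minutes.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		if err := handle(m.Data); err != nil {
			m.Nack() // redelivered later (backoff depends on the retry policy)
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}

func handle(data []byte) error {
	// Business logic for one async job would live here.
	return nil
}
```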
Courtney Nash (10:46):
While this is
going on, you noted that, you
know, you turned off a number ofnon-critical parts of the app,
right, that are related to sortof async work stuff.
I think it's like,"it's been awhile since you last up, you
know, do you wanna update yourincident?" was that just part of
the sort of trying to figure outwhat was going on or
Lawrence Jones (Incident.io) (11:02):
So I think I've mentioned how in this particular situation we didn't have access to a lot of the tools that we would normally have access to in an incident. So observability-wise, it was not particularly great. One of the things that you can do whenever you're in that situation, where there's essentially something bad happening and there's a lot of potential things that could cause it, is you just try and simplify.
(11:22):
And I think that was the decision that we made. We know that we have some regular work that gets scheduled via what we call the clock, which is just a thing that's piping pubsub messages in on a regular basis. That's providing functionality that actually, in terms of how our customers experience our product, is not critical. They're like nudges to remind you to take an incident role, or do something at a particular time.
(11:43):
We could disable those and you'd still be able to create an incident and drive one proactively through Slack webhooks. So that's obviously the thing that you'd prioritize if you had to pick some subsystems to keep up rather than others. And if you can get rid of that, then you suddenly remove a whole host of regular work that might be providing this event, or this job, that is causing the thing to crash.
(12:03):
It was more about: we don't know what's going on, we have limited options, so let's try and simplify the problem on the chance that it might fix it. But also we know if we remove that, we'll have less noise. So there will be simply less hay through which to find the needle.
Courtney Nash (12:19):
Yeah.
And so you, I mean, you found the needle. Was it "aha, we think we know it's this," or was it just, we turned a bunch of stuff off and finally it just stopped breaking?
Lawrence Jones (Incident.io) (12:30):
Yeah, so I think it was quite a funny one. We had a suspicion that it might be coming from something. And that was because we could see that there was an event that, just at the particular point where we'd started seeing the crashes, we could see in the Google pubsub metrics that there was an event that started failing, or was left un-
(12:51):
ack'd in the subscription.
So that was giving us a kind of, you know, "this looks bad." It was also part of a piece of the codebase that we had seen an error come from, I think the day prior, so we were already feeling a bit suspicious about this.
We didn't quite know how it would've happened, because we were under the assumption at this point that for any errors that were returned by these subscription handlers,
(13:12):
regardless of whether or not they panicked, the app should proceed and continue. As it happened, that was an incorrect assumption. At this point, you already don't have a consistent viewpoint of exactly what's causing the incident. So you start ditching some of the rules that you think you have in mind, because they're logically inconsistent with what you see actually happening in production.
Courtney Nash (13:31):
How hard is it to
get to that point where you have
to abandon an assumption, likea, a well earned assumption.
Like that's, that's a tip,that's a tough thing to do, I
think, especially in a somewhatpressured situation, right?
Lawrence Jones (Incident.io) (13:45):
Now, when I say that that assumption was logically inconsistent with what we were seeing in production, at this point you have to have a fair amount of confidence in how you think the system should work to come to the conclusion that it is logically inconsistent. Otherwise, you might think that you're missing something, or you're just interpreting the signals in the wrong way. Now, I think the thing that we had definitely concluded, given
(14:05):
what we could see from whatever Heroku logs we did manage to get out of the buffer, was that this app was crashing and we were seeing the whole thing halt. That usually only happens if there's a panic that hasn't been caught. So even though you think that you have the panic handlers all over the app, you start considering other ways that it might go wrong.
And we are aware that there are, for example, third-party libraries inside of our app.
(14:26):
Even down to the Prometheus handler, so you've got a little endpoint that's serving Prometheus metrics. There could be a bug in there that's causing a panic, and if it's not handled correctly, that could be it for the app. So there are definitely causes that could lead to this. You start breaking out of your "well, doing work inside of our normal kind of constraints should be safe." And you go: well, something has failed somewhere, some safeguard. So we have to walk that one back, at least for now.
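The safeguard being discussed, recovering from panics inside handlers so one bad job can't halt the process, looks roughly like the hedged Go sketch below. It is an illustration, not Incident.io's implementation, and the function names are made up. It also shows the caveat that matters here: recover only works in a deferred function on the goroutine that panicked, so a panic on a goroutine started elsewhere (for example inside a third-party library) still crashes the whole process. Flushing buffered telemetry in the recovery path is the kind of step that addresses the "lost Sentry events and logs" problem described earlier.

```go
// Hedged sketch of a panic-recovery wrapper for an async message handler.
package main

import (
	"fmt"
	"log"
)

// withRecover converts a panic in the wrapped handler into an error, so the
// caller can nack the message instead of letting the process die.
func withRecover(handler func([]byte) error) func([]byte) error {
	return func(data []byte) (err error) {
		defer func() {
			if r := recover(); r != nil {
				// In a real app you would also flush Sentry/traces here, since
				// buffered telemetry is otherwise lost if the process exits.
				err = fmt.Errorf("handler panicked: %v", r)
			}
		}()
		return handler(data)
	}
}

func main() {
	safe := withRecover(func(data []byte) error {
		panic("poison message") // simulated bad input
	})

	if err := safe([]byte("job")); err != nil {
		log.Printf("handled without crashing: %v", err)
	}

	// Caveat: recover only catches panics on the same goroutine. A panic in a
	// goroutine started elsewhere (e.g. inside a third-party library) is not
	// caught by this wrapper and will still halt the whole process.
}
```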
Courtney Nash (14:50):
So there are a
couple of things that you talk
about in here in terms of theregroup and I'm curious because
you did mention something, Ialso appreciate that you also
made sure people like stop toeat and drink.
Something like take a little bitof a
Lawrence Jones (Incident.io) (15:04):
Yeah,
that's a big one for me.
I've done that wrong too many times before.
Courtney Nash (15:08):
Yeah.
Yeah.
These are these important reminders that we're humans. You mentioned something further back that I made a mental note I wanted to come back to, in terms of the growth of the team and some people taking certain actions during the incident. Can you talk to me a little bit about what that looked like?
Lawrence Jones (Incident.io) (15:24):
So Pete, our CTO, he came back and he'd been watching the incident from afar. In fact, at one point there was a very funny message, being like, it'd be really nice if you could provide incident updates for this incident. Using our tool, of course. To which we have to go...
Courtney Nash (15:36):
Oh no
Lawrence Jones (Incident.io) (15:37):
Like, yes. It's a real, like, horrible message to look back on. You're like, oh, yes, no, that would be why. Going back to your point about the team growth and where we were at at this particular point: our app has actually been very robust. We have quite a lot of experience building apps like this, and we prefer simple technologies over complex ones, just literally because we want this thing to be rock solid and
(15:58):
we've dealt with the pain of those things before. So we've got a couple of burnt fingers that mean the app has been kind of blessedly stable for a while.
That meant that this was our first actual "ooooh, how are we gonna fix this?" moment, where you actually hit a production incident and you go: well, I don't quite know what the cause of this would be, and if I don't fix it, then, you know, we're gonna be down for a while and that's gonna be really
(16:19):
painful.
And honestly, that was a bit of a shock. I've been on a lot of incidents before, but it's been now maybe about a year and a half that I've spent at Incident.io. So it means I'm a year and a half away from having dealt with production incidents like that before.
And I think that was the same even for our more senior responders within the team. So, whilst we had done really quite well at dealing with the
(16:40):
problem itself, one of the first things that you forget whenever you are doing something like this, and you are involved in the nitty-gritty of fixing it, is doing all the bits around an incident. So, getting someone to do proactive communications. We kind of had that halfway there, but someone had taken the lead role and then they weren't necessarily doing a lot of the things that we would expect the lead to do, because, quite frankly, they were just super involved in the technical
(17:02):
discussions.
So I think Pete, coming in, did what you should do in that situation. He made a call that was like: look, it looks like we're out of the woods, or at least some urgency has diminished now. We've found some things out. I'm not clear on exactly what's happening, and I think we've got a lot of the team over here who weren't necessarily the most senior of the people actively responding, but
(17:22):
they definitely want to help, and they can if we need them. So let's take a moment. Take a breath. And we all just stood around, did an incident stand-up, had a couple of people catch up on comms. We reallocated the incident lead role, got everyone to relax their shoulders and get their posture back, and then just decided to split up the work that we needed to do and moved on to trying to do it afterwards.
(17:42):
So I think the regroup was exceptionally useful, and came at exactly the right time in the incident. And for me anyway, it was a gut check of, oh wow, it's been a while since I've done this. And it's really easy to forget if you're just in the weeds doing the actual fix.
Courtney Nash (17:57):
It doesn't matter
how many times you've done this
at however many different places, each local context has its own, yeah, sort of learning curve.
Lawrence Jones (Incident.io) (18:05):
I mean, it should go without saying, but it's often forgotten: every time you change up the people who are responding, it doesn't matter if they have individually got that context from other places. You don't know, until you really practice and put the time in, how this is gonna work. It's why we're scheduling game days and doing some drills and practicing, because obviously going forward this is gonna be something that's super important to us and we want to do it well.
(18:26):
Also, a bit selfishly, for our product. We think our product can help. So it would be a bit embarrassing if we weren't doing all this best practice stuff ourselves.
Courtney Nash (18:34):
How much of the
things that you did would you
say were sort of technical, and how many would you say were, like, organizational, you know, in your mind, after the fact?
Lawrence Jones (Incident.io) (18:45):
There were two angles to this incident debrief that we had. So we ran an internal incident debrief where we got several people together in a room to chat through this incident. At the time we were walking through a write-up that we'd produced that contained both technical detail and observations about how we'd worked as a company. So I think that debrief focused probably only one
(19:07):
quarter on the technical stuff.
So the public postmortem that we've released, that's primarily about the technical elements, quite frankly, because that's interesting to a public audience, and I think they explain the nature of the incident and kind of give color to what we've done to try and fix it. The internal stuff was a lot about the learnings that we got out of it as a company. So, as I said, this was one of the bigger incidents that we've
(19:29):
had in the past.
Hopefully it will be the biggest one that we have for at least a couple of months.
But one thing that it does test is, you know, are we communicating correctly with customer support individuals? We have a lot of people who are even selling the product to people, and when they're doing that, they're doing demos. There's probably something organizational there: you know, if we've got a major incident going on,
(19:51):
you don't want to be scheduling a demo call over it. That's not going to look good for us or make sense for the person on the other end. So probably proactively bail out of those and reschedule them, rather than try and make a judgment call on whether it will continue.
So there's a lot of trying to figure out who or what are the levers that we will pull organizationally next time something like this happens, to make sure that customer success is involved properly.
(20:12):
And make sure that the communications that we need to be sending out are cohesive, that they make sense, across both what we're putting on our status page and what's actually going via our customer success team. And yeah, make sure that everyone in the company is kept up to date in the right way.
This is exactly what you would expect when you are running an incident for the first time at the size of company that we are
(20:32):
at, especially given that when the last major one happened, honestly, half the people who are here now just weren't there before. So it's something that you really need to drill, and try and get everyone on the same page before these incidents happen, if you want to make sure that it goes well.
Courtney Nash (20:47):
Yeah.
I mean, the irony, right, is you get better at them if you have them. Ideally you don't wanna have them, and so if you're fortunate enough to not be having a lot of them, then you don't practice a lot unless you intentionally practice. Was there anything really surprising to you about the whole thing?
Lawrence Jones (Incident.io) (21:04):
I don't think so. So there's a lot... yeah, no, no. I think it's very interesting from the position of a startup as well, just because you think quite carefully about what is the right type of investment to make at different points. And especially coming from someone who's scaled an app like this before at a much larger company, and gone through all of that, honestly just hitting your head on the door as
(21:27):
you go through each different gate. I'm kind of waiting for different types of problems to appear as they come.
And this was one of them that we had called out as a risk, which was why we had something quite well prepared for it, and why we were able, within kind of an hour and a half of the incident closing, to make a change to our app that split the way that the workloads are
(21:47):
run in Heroku, which makes us entirely insensitive to a problem like this happening again, or at least our customers wouldn't necessarily notice it if our workers were to crash like this. And that's kind of not what you would expect from an incident like this, to have a fix in place two hours after. So we've given some thought to how this stuff might crop up, but you never really know when the thing might appear.
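One common way to get the kind of split described here, while keeping a single Go monolith, is to run the same binary under separate Heroku process types and choose which components each process starts from configuration, so a crash-looping worker no longer takes the customer-facing web process with it. The sketch below is an assumption about how such a split could look, not Incident.io's actual change; the APP_COMPONENTS variable and component functions are invented for illustration.

```go
// Hedged sketch: one binary whose started components are selected by an
// (invented) APP_COMPONENTS variable, so Heroku can run "web" and "worker"
// as separate process types that crash and restart independently.
package main

import (
	"log"
	"net/http"
	"os"
	"strings"
)

func main() {
	components := strings.Split(os.Getenv("APP_COMPONENTS"), ",")

	for _, c := range components {
		switch strings.TrimSpace(c) {
		case "worker":
			go runWorkers() // pubsub consumers; a crash here only affects worker dynos
		case "cron":
			go runCron()
		}
	}

	// The web component serves HTTP, so customer-facing traffic stays up even
	// if a separate worker process type is crash-looping.
	if contains(components, "web") {
		log.Fatal(http.ListenAndServe(":"+os.Getenv("PORT"), nil))
	}
	select {} // worker/cron-only processes just block here
}

func contains(xs []string, want string) bool {
	for _, x := range xs {
		if strings.TrimSpace(x) == want {
			return true
		}
	}
	return false
}

func runWorkers() { /* subscribe to pubsub and process async jobs */ }
func runCron()    { /* run scheduled background jobs */ }
```

On Heroku, the same image could then be declared as distinct web and worker process types, each with its own dynos and its own crash and restart behaviour.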
We were quite happy with the decisions we made leading up to
(22:08):
this, and learned a lot about how to fine-tune that barometer on technical investment going forward, over exactly what we want to do to try and make sure that we're keeping ahead of the curve. So ideally we won't need one of these to keep us on exactly the right balance there. But given everything, in retrospect, I think we are probably making quite good decisions, prioritizing that along the way.
Courtney Nash (22:30):
The other thing
that, you know, fascinated me about this is you use your own product, because it's an incident response tool. My favorite case of these kinds of incident things was when AWS went down, and then the AWS status page relied on AWS, so they couldn't update the status page. Like, they fixed that finally, I think. But what's your strategy going forward for using your own
(22:51):
product...
Lawrence Jones (Incident.io) (22:52):
Yeah, I mean...
Courtney Nash (22:52):
When your own
product goes down?
Lawrence Jones (Incident.io) (22:54):
I think honestly we will definitely continue to do this, and the benefit of us using it is, quite frankly, I mean, there's one reason that we're able to sell this thing: that's because we really do believe in it. It makes you so much better at responding to incidents. And what I've learned from how this app has behaved over the last year and a half is that when it does go kind of wonky, it's either in a very local area, or eventually it does kind
(23:17):
of catch up on itself. And, honestly, a lot of the value that you get is the creation of the incident channel, the coordination of everything, and the app only needs to be up for, you know, a brief moment to do all that stuff.
So immediately you get the incident channel, you've escalated, you've invited everyone, and that in itself is just what people call the first few minutes of an incident, right? And it manages to shorten those and get you in the right
(23:38):
place.
And from that moment on, you're able to coordinate.
It would be a similar situation too if Slack goes down. That's obviously an issue for us, because people are using Slack to drive our product and we use Slack internally to communicate. Now, we have plans around how we would communicate if that was to go down, but then also there's limited ways that we can help. At that point, we're waiting for Slack to bring things back up, and,
(23:59):
quite frankly, our customers, who are also at that time having incidents, are impaired themselves. Very few people have the backup plan or the policy to go somewhere else when Slack goes down. And if they are, then they're usually going for, I mean, not as ridiculous as this, but like collaborative Microsoft Paint in, in a...
Courtney Nash (24:14):
Yeah.
Lawrence Jones (Incident.io) (24:15):
It's not a tool that's going to be particularly familiar to them, or they're trying to communicate via, like, a Google Doc link. And at that point everyone's kind of in degraded mode and you're waiting to bring these things back up. But yeah, we have backup plans; at the moment anyway, this is working really quite well for us. And I'd rather solve the problem, if it's at all possible, by making our app incredibly robust, and investing in it like
(24:36):
that.
Courtney Nash (24:36):
So, you collect a
lot of data on your customers' incidents, and you look at a lot of incidents from that perspective. I've seen a few things you've written about that that I thought were really interesting. Is there anything from looking at your customers' incident data and incident response that's informed how you all think about incident response in general?
Lawrence Jones (Incident.io) (24:59):
We work really closely with a lot of our customers, especially our larger ones. It's genuinely fascinating to find out how different organizations are structuring their incident response. So we often learn from our larger customers about the processes that they've created, especially off the back of incidents. So things like reviewing incidents and kind of everything that happens after you've closed it.
(25:19):
So incident debriefs and retrospectives, and writing postmortem docs, and processes around following up on incident actions, and things like that.
We often look to them to give us a view: as a very large org, how are you managing this and how can your tooling help you? There's also just a ton of stuff that we're doing around insights and helping those organizations understand their own behaviors
(25:39):
when it comes to incidents.
So if you look collectively, for example, over all of our incident metrics, I think it's really fascinating that you can kind of see a hole in all of our graphs around Christmas, and it's just...
It's like, stuff like that is really interesting for someone who's interested in incidents, because you go, well,
(25:59):
actually, that's the combined impact of people putting in change freezes or people going on holiday and just simply changing things less. There's also generally all the stuff that happens around the holidays that kind of leads to this sort of thing.
But you can help a larger org, for example, identify seasonality in their trends. Or, I hope, in future we can take some of this data and, with permission from our customers obviously, understand it a bit more in terms of how the
(26:21):
industry is responding to incidents, and try and help people, for example, implement the stuff that we know empirically is working for some of our other customers. Yeah, there's a huge amount of potential in incident data. I'm super excited about it, having spent a lot of time working on our insights product. I think it's a very untapped area; when you have this type of data on how people are running their incidents, you can do lots of really fascinating things with it.
Courtney Nash (26:43):
You have this
richness of the people involved
richness of the people involved, and sort of what those communication webs look like, and kind of the patterns and ebbs and flows of those things, which I think tells you so much more about how an organization's handling their incidents than the numbers we've had so far as an industry.
Lawrence Jones (Incident.io) (26:59):
There are many things that we're looking at at the moment, like around operational readiness and helping you get a pulse on that, for example. But one thing that really lands with a lot of our customers at the moment, as a result of us having this depth of data, is being able to automatically assign a number of hours that you've invested to each incident. Now, if you've got organizations who are running incidents from small all the way up to very large incidents,
(27:22):
and you can segment that data by the users, so the people who are actually responding to the incident, and also by causes of the incidents, or services that are affected, and things like that,
you can do some really, really cool things.
We, for example, are able to, where we've had our product roadmap and we've had certain projects come out, draw lines between that and a spike in the amount of workload that we've
(27:42):
had to put into incidents, especially related to, for example, Jira. If we've just built a Jira project, you can come up with a cost at the back end of this project: how much is it going to cause in terms of operational workload off the back of this, so that you can predict and understand a bit more about how your organization is working. And that's stuff that I had never been able to do before. I mean, I think we've all built those spreadsheets at companies...
Courtney Nash (28:03):
Yeah, but you
don't believe them, like, you don't believe them, and, hand-waving...
Lawrence Jones (Incident.io) (28:08):
Yeah.
And it's not your job to spend all your hours full-time trying to figure this out. Whereas we've sat there and we've got some way of trying to detect when you're active in an incident, and done all the stuff like slicing it and making sure that we don't double-book you across the hour in multiple incidents. It comes together into something that is a lot more of a rich picture of how you are responding to incidents across your organization, and that's where I'm really excited to try
(28:30):
and do this.
'Cause I think for people in leadership positions at these companies, they're so far away from the day-to-day, but they're so very much interested in it, and it's kind of this horrible contrast where the further away you get, the more interested you are in it, the less data you get, or the less you can trust the data that you're getting out of this stuff. And I think it helps in both of those situations, right?
(28:52):
A tech lead who wants to articulate that their operational workload is increasing can be helped humongously if you can provide them with some of that data to back up their claim. And then the executive that they're speaking to, asking for more investment, can see the ROI that they get from putting in the investment, and that sort of stuff. That's what I think we should be trying to get out of our incidents, and if
(29:12):
we can, then, yeah, I'd be very happy with it.
Courtney Nash (29:15):
Oh, I was just on
a podcast talking about this.
Like how data-driven our industry is, to our advantage. And I think sometimes to our detriment.
Right.
Of
Lawrence Jones (Incident.io) (29:24):
course
Courtney Nash (29:24):
There's plenty of
things that we should do that we
just can't get the data for.
It's very costly, or time, or whatever. Or you feel like you're kind of pulling them out of thin air, or your own intuition is probably right, but they're like, well, you just made these numbers up, you know? And so, for better or for worse, when you can get your hands on those kinds of data and you're trying to make a case further
(29:44):
up, you know, to leadership further away, the blunt end versus the pointy end, that's a huge advantage to engineering teams who know, after these kinds of incidents, the kinds of investments they want to make.
Lawrence Jones (Incident.io) (29:56):
Yeah, so I think, especially around training and making sure that people are onboarded correctly, we've got a ton of plans going forward into this year. One of them is being able to model this idea of what someone needs to have done to be recently fresh for an incident. If you look at our situation here, one of the issues that we had was really: we want a corpus of people who are able to, or
(30:16):
have recently responded to, a major incident or something of that kind. Now, you can express that either as they've dealt with a real major production outage, or they've run a game day recently that helps them simulate it. Either one of them is fine.
And we are planning on helping you track that within the product and express it as well. I want to see how many people within the last 60 days have met these types of criteria. And then when you do that, hopefully over time you can see,
(30:38):
as you hire more people, that you might have lost some of your more tenured staff, so that number of people who are ready and able to respond might have decreased.
And it's the type of stuff that, I always think about interviews: if you've ever been upskilled for interviews at a company, you often have that spreadsheet for the onboarding, and it's kind of tracked very manually. I want something very similar for incidents, but something a
(31:01):
lot more automated, that can help us keep a pulse on how many people are actually ready and onboarded, ready to respond to this stuff.
It's super important that you know that the one person who has recently responded to Postgres incidents has left the company. That shouldn't be a thing that you should have to guess. So.
Courtney Nash (31:15):
Or, well, or find out the hard way.
Lawrence Jones (Incident.io) (31:18):
Or
find out the hard way.
Yeah.
You do one or the other, I guess.
Exactly.
Courtney Nash (31:23):
Is there anything
else you want to wrap up with or
tell folks about Incident.io?
Lawrence Jones (Incident.io) (31:28):
If
you just head over to
Incident.io, then you can book a demo, and we'll do our best to make sure that our services are running happily when you try and take the demo... yeah.
Courtney Nash (31:37):
Perfect!