
October 20, 2022 · 31 mins

If you or anyone you know has listened to Spotify, you're likely familiar with their year-end Wrapped tradition. You get a viral, shareable little summary of your favorite songs, albums, and artists from the year. In this episode, I chat with Clint Byrum, an engineer whose team helps keep Spotify for Artists running, which in turn keeps, well, Spotify running.

Each year, the team looks back at the incidents they've had in their own form of Wrapped. They tested hypotheses with incident data they'd collected, found some interesting results and patterns, and helped push their team and the larger organization to better understand what they can learn from incidents and how they can make their systems better support artists on the platform.

We discussed:

  • Metrics, both good and bad
  • Moving away from MTTR after they found it to be unreliable
  • How incident analysis is akin to archeology
  • Getting managers/executives interested in incident reviews
  • The value of studying near misses along with actual incidents

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Courtney Nash (00:22):
Welcome to the VOID podcast. I'm your host, Courtney Nash. If you or anyone you know really has listened to Spotify, you're likely familiar with their year-end Wrapped tradition. You get a viral, shareable little summary of your favorite songs, albums, and artists from the year. Or if you're like me, and sometimes you let your kids play DJ in the car on road trips, the algorithm can seem very off from

(00:43):
your own personal taste in music. But I can confirm we're all huge fans of Lizzo, so we've got that in common at least. Today, we're chatting with Clint Byrum, an engineer whose team helps keep Spotify for Artists running, which in turn keeps, well, Spotify running. No pressure. Right? Each year, the team looks back at the incidents they've had in their own form of Wrapped. They tested hypotheses with incident data that they've

(01:04):
collected, found some interesting results and patterns, and helped push their team and larger organization to better understand what they can learn from incidents and how they can make their systems better support artists on their platform. Let's get into it. I am so excited to be joined on today's VOID podcast by Clint Byrum, Staff Engineer at Spotify.

(01:24):
Thank you so much for joining me, Clint.

Clint Byrum (01:27):
Oh, it's great to be here.
Thanks for having me

Courtney Nash (01:29):
So we are going to be chatting today about a multi-incident review that you published back in May earlier this year. I'll put the link to that in the notes for the podcast. Your team decided to look back at a bunch of incidents from 2021.

(01:50):
And so I was hoping you could talk about what led the team to wanting to do a more comprehensive view of that set of incidents from last year.

Clint Byrum (02:00):
Yeah. My focus and my team's focus is on the reliability of Spotify for Artists. So most people are familiar with the consumer app that plays music and podcasts, and does those sorts of things for you. But Spotify for Artists is sort of that back office application that, when you want to manage your presence on the application, or if you want to publish music or some other

(02:22):
things where you're more in a work mode rather than listening-to-music mode, that's where you come to us. And so my team, which, I'm proud to say, we got to call ourselves R2D2 'cause we're the droids that do the reliability work. The company has a very rich platform that's centralized for the consumer application.

(02:43):
And most of it works for the back office app too, but you can imagine we're like lower scale, but higher consequence. And so some of our practices and testing focuses are different, and my squad sort of helps everybody around (we call it the music mission) do those things in a reliable way and take care of their uptime and do better incident

(03:04):
reviews and things like that.

Courtney Nash (03:07):
So what led to that team then looking at incidents, in order to try to have a better experience for the users of that particular platform?

Clint Byrum (03:20):
Yeah, you might be familiar. Spotify has this tradition of "Wrapped" where if you're a user of Spotify, once a year, we send you a cool little packet of, like, this is what you listened to last year, and this is, you know, your favorite artist, and you're most like these people over here. It's kind of a tradition at the end of the year that Spotify looks back at the previous year. And so we started that a couple years ago, before I was actually

(03:42):
even on the team. It was just a working group of some reliability-focused engineers who just stopped and looked back at all the incidents that people declared for a year and tried to learn from them. We've gotten better. At the beginning, it was just sort of looking at them and making sure that we understood, like, how many there were and what our mean time to recovery was,

(04:03):
if we could discern that. You know, basic discerning things. But the last few years we've been trying to do real science. So we've sort of made hypotheses and said, we think that this was true last year, or we think that these two aspects played against each other, and then we design little pieces of data that we're looking for in each incident. And we go looking for them, and when we find them, we catalog it

(04:23):
and see if we can learn something from the aggregate, rather than just looking at individual incidents.

Courtney Nash (04:29):
So my next question was specifically going to be about hypotheses. You mentioned having some; can you say more about what some of the hypotheses your team generated were, and how you came up with them?

Clint Byrum (04:44):
Yeah. So in the public post, you're seeing sort of the best ones, the things where we think we can learn the most from, and they're relevant to maybe not everybody at Spotify. So we had a hypothesis. Our team maintains this web-oriented synthetic testing suite; I call it sort of like droids making droids. We help the teams that are responsible for features write

(05:08):
tests that drive a web browser and just go click around on the live site and make sure that still works. And we had a hypothesis that when you don't have a test for your feature, it would take much longer to detect an outage. That seems obvious, but, well, we sort of were wondering, like, is this true? Is our value proposition actually real? We thought it would be, you know,

(05:31):
pretty easy to prove. And it was. For instance, you'll see in the post, we found that essentially, when you have a test and the feature is able to be tested in this way, it's about 10 times faster that you find out, because you're not waiting for a person to notice, or metrics. Sometimes metrics don't tell the story until it's really, really broken.

(05:51):
And when you don't, you're waiting for all those other things, the dominoes, to fall. And actually quite often it's just support that's gonna find out, because it's not like a metric or a scale problem for us. A lot of times it's just features interacting poorly.

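For readers unfamiliar with the kind of synthetic test Clint describes, here is a minimal sketch of one, using Playwright in Python. The URL, route, and selector are hypothetical stand-ins; Spotify's actual suite and tooling are not public.

```python
# A minimal sketch of a synthetic web test like the ones described above.
# The URL, route, and selector are hypothetical; the real Spotify for
# Artists suite and its tooling are not public.
from playwright.sync_api import sync_playwright

BASE_URL = "https://artists.example.com"  # hypothetical stand-in


def check_team_page_loads() -> None:
    """Drive a real browser against the live site and raise if the feature
    is broken, so detection doesn't wait on users or support tickets."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(f"{BASE_URL}/team", timeout=15_000)
        # Assert on something a real user would need to see.
        page.wait_for_selector("text=Team members", timeout=10_000)
        browser.close()


if __name__ == "__main__":
    # In practice this would run on a schedule and page someone on failure.
    check_team_page_loads()
```

Run continuously, a failing check like this is the kind of signal that makes detection roughly an order of magnitude faster than waiting for a person, a lagging metric, or a support ticket.
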
Courtney Nash (06:05):
I guess better support than on Twitter. But yeah, better, even more upstream from that. Right. Actually it's interesting, cuz Twitter, we do have metrics that are, you know, sort of like when people are talking about Spotify or Spotify for Artists on Twitter. And so those are, you know, something that we use as an early signal. Sometimes even the best monitoring fails. Yep.

Clint Byrum (06:24):
It's interesting because we've gone on a journey. You know, this was done in December of 2021. We published in May. Since then, we've even watched your talk at SREcon where you talked about mean time to recovery, and we've sort of let go of our dependence on metrics like that, but we were looking for, how can we get better fidelity?

(06:47):
As we looked at the sample, a very small sample of the data, we were finding it wildly inaccurate, and we actually proved that one out as well. People don't enter start and end times. So that's the kind of stuff we were looking for.

Courtney Nash (06:58):
You can imagine I was very honed in on the TTR stuff in the report. For anyone who hasn't listened to me rant about that before, I'll put some links to that in here, and I'll link to the talk that Clint is referencing. One of the things that we've seen in the VOID, obviously, is that those kinds of duration data are all over the place. And so I'd noticed that you said that, and I'm super glad that you

(07:19):
all are moving away from what some incident-nerd-type folks like us often like to call "shallow metrics" (hat tip to John Allspaw for that term). But I've always harbored a hypothesis, I guess we'll say, since we're in hypothesis land, that time to detect data might actually be more useful, and that's kind of what you're

(07:39):
talking about here, right? I harp on mean time to, you know, sort of remediate or whatever you wanna call it, because it's not a normal distribution of data. And that's probably what you found, which is why yours were all over the place. But I wonder if the detection times are more bell-curved. Right? I wonder if that slice of the... because it's really, and

(08:02):
especially when you can have synthetic tests and those kinds of things, you are in that world where your data are a normal distribution and you could find improvements in them, and you could say, oh, a 10% difference in the average is meaningful. Versus the variability that you see in the actual aftermath of an incident happening. I thought that was really still kind of what was lurking in your data there. So I thought that was a really interesting piece of it.

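A toy example of the statistical point being made here, with made-up numbers rather than Spotify data: a long-tailed set of recovery times drags the mean far from the median, while a roughly bell-shaped set of detection times is summarized well by either.

```python
# Toy illustration of the point above, using made-up numbers (not Spotify
# data): heavily skewed durations make the mean misleading, while a roughly
# bell-shaped detection-time sample is summarized well by it.
import statistics

# Hypothetical time-to-recovery samples in minutes: mostly small, a few huge.
ttr = [12, 18, 25, 30, 41, 55, 70, 95, 480, 1440]
# Hypothetical time-to-detect samples in minutes: clustered around a center.
ttd = [4, 5, 5, 6, 6, 7, 7, 8, 9, 10]

for name, sample in [("time to recovery", ttr), ("time to detect", ttd)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    stdev = statistics.stdev(sample)
    print(f"{name}: mean={mean:.1f} median={median:.1f} stdev={stdev:.1f}")

# For the skewed TTR sample the mean sits far above the median, so a "10%
# improvement in MTTR" mostly reflects whether a long-tail incident landed
# in the window; for the tighter TTD sample, mean and median nearly agree.
```
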
Clint Byrum (08:23):
I think you're onto something. And I see that as well in the conversations that have gone on after this post; people sort of notice it and will pick it up and come and reach out. There's a lot of curiosity about that, and a lot of people pointing out that really what you did was show a normal distribution in time to detect, as you said. And also our data scientist was just super excited to get

(08:45):
into some of the weeds in the data, and wanted us to do some more collection, because he was able to show that we really had proved this. We looked at some of the other variables that we had reliably collected and we found that, no, this is something that's really tugging it in that one direction in a really hard way, and I think you're right. One of the things that I got from not just what you're saying, but John Allspaw and our learning from incidents

(09:08):
community, is that these aggregates, they're easy to read, but they're not so easy to discern real value from. So it's often easy to, like, look at the dashboard and say, oh, mean time to recovery, it's in good shape, or it's not so great. Need to, you know, get the hammer out.

Courtney Nash (09:26):
Pay no attention to the man behind the curtain,
right?

Clint Byrum (09:30):
Right. But when you start to break them down into time to detect and time to respond, and sort of also, like, there's this idea of, when is it over? That's another really interesting question that we got into while we were doing this study: what are we going to decide, is an incident over, are we recovered? Because oftentimes you're mitigated, but you still have a

(09:51):
degraded system that is either costing a lot to run or may be at a high risk to, you know, fall over again. So all those questions, like having smaller metrics and timelines to look at, actually allows us to learn more. Whereas the aggregate wasn't really telling us anything. And I wanna add one more thing. I don't remember who said this.

(10:14):
Somebody said it in the Slack; I think you might be on there too. Looking at mean time to recovery and trying to learn anything just from that is like looking at the presents under the Christmas tree and trying to guess what's in them.

Courtney Nash (10:26):
Yeah.

Clint Byrum (10:27):
like there's

Courtney Nash (10:27):
I used to, yeah. I used to put rocks in big boxes for my brother at Christmas, cuz I was a jerk.

Clint Byrum (10:35):
Right. Right. Or even like in A Christmas Story, right? The BB gun is hidden behind the piano to throw him off. And there's a little bit of that, where, like, we opened the presents and we looked in and actually saw what was there. We saw a very different story than what we were guessing we would see.

Courtney Nash (10:51):
There was one other little note in that section. You know, you mentioned people weren't recording these things, and you said something like, "this isn't a failing of operators, but a failing of the system." Can you talk a little bit more about what you meant by that?

Clint Byrum (11:04):
Yeah, I'm a big believer in that axiom, that people don't fail the system and the cognitive work; the system fails them. And in this particular case, we were asking people when they were... so we used JIRA for cataloging incidents and managing them (which is a whole can of worms that I'd rather not get into), but it does have a form that we ask people to fill

(11:26):
out when they're done, when they're sort of marking it as moving into the post-incident phase. And it has a start and end time. It doesn't ask you many questions. It sort of asks you for, you know, some tags and a short description, and then when did it start and when did it end? And it has defaults, and those defaults are very popular.

(11:48):
So

Courtney Nash (11:50):
As so many defaults are, right?

Clint Byrum (11:53):
Right, right. Because that's one less thing to go out and dig through. We have a correcting practice for it, which is that we do ask people in that post-incident phase to build a timeline so that we can have a post-incident review and talk about what actually happened. And those are often very accurate. So when we actually went and read through the timelines, you could go through and find the start and the end.

(12:16):
So when we corrected those, that's one of the things we were looking at: how far did we have to correct them? And in fact, that was meaningless. What we found was they're just always those default values that kind of come from when you opened it, or when you touched the incident in a certain way.

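A sketch of the kind of data-quality check Clint is describing, flagging records whose start and end times look like accepted form defaults so analysts know to reconstruct them from the timeline. The field names, the default heuristic, and the sample rows are hypothetical.

```python
# Sketch of the check described above, with hypothetical field names and
# defaults; the real JIRA form and its default values differ.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    key: str
    opened_at: datetime        # when the ticket was created
    reported_start: datetime   # what the closing form recorded
    reported_end: datetime


def uses_default_times(inc: IncidentRecord) -> bool:
    """Flag records whose start/end look like accepted form defaults
    (start equals ticket creation, or the window is zero or negative)."""
    return inc.reported_start == inc.opened_at or inc.reported_end <= inc.reported_start


# Tiny made-up sample standing in for a pull from the incident tracker.
t0 = datetime(2021, 6, 1, 9, 0)
incidents = [
    IncidentRecord("INC-1", t0, t0, t0),  # defaults accepted as-is
    IncidentRecord("INC-2", t0, t0 - timedelta(minutes=40), t0 + timedelta(minutes=15)),
]

suspect = [inc.key for inc in incidents if uses_default_times(inc)]
print(f"{len(suspect)} incident(s) need times reconstructed from timelines: {suspect}")
```
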
Courtney Nash (12:31):
Yeah. And I mean, you can place yourself in the mindset of a human who has just finished managing an incident and, yeah, probably doesn't wanna have to do that much anymore at that point.

Clint Byrum (12:43):
Yeah, I called it "paperwork" in the post and I stand by that. Not just start and end time, but even just getting to that form, which is where we move to the post-incident phase. People often take days to get back to that, because it is a stressful event, whether you were able to mitigate and go back to sleep and maybe handle it the next business day, or however you got to that point, where oftentimes people just

(13:07):
completely move on. It's, you know, it didn't have a big impact, maybe there's not a lot of people with questions. And we actually have a nag bot that comes and tells people if they haven't closed an incident for a while; it'll come back and poke them. And that's usually when they're filling out that form.

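A minimal sketch of a nag bot like the one Clint mentions: scan open incidents, find the ones that never reached the post-incident phase, and nudge their owners. The statuses, the threshold, and the notification path are assumptions; the real bot isn't described in detail.

```python
# Minimal sketch of a "nag bot" like the one mentioned above. Statuses,
# the reminder threshold, and the delivery channel are assumptions.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    key: str
    owner: str
    status: str          # e.g. "mitigated", "post-incident", "closed"
    last_updated: datetime


NAG_AFTER = timedelta(days=3)


def needs_nudge(inc: Incident, now: datetime) -> bool:
    return inc.status not in ("post-incident", "closed") and now - inc.last_updated > NAG_AFTER


def run_nag_bot(incidents: list[Incident], now: datetime) -> None:
    for inc in incidents:
        if needs_nudge(inc, now):
            # A real bot would post to Slack or comment on the ticket here.
            print(f"Reminder to {inc.owner}: {inc.key} is still waiting on its post-incident paperwork")


if __name__ == "__main__":
    now = datetime(2021, 7, 1)
    run_nag_bot([Incident("INC-7", "alice", "mitigated", now - timedelta(days=5))], now)
```
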
Courtney Nash (13:22):
That's so funny. You know, we don't see a lot of these kinds of multi-incident reviews out in the world. I feel like Honeycomb is a company I've seen some from, and a few other places have done it. So there's not a lot of precedent for how one does such things. And I'm sort of curious what your methodology was for how you approached that.

Clint Byrum (13:42):
Yeah, I agree. It was something that surprised me when I arrived. It had been done two years prior, when I got to Spotify, and that torch had been handed from this reliability working group to this full-time squad. We liked what we saw, and so we decided to repeat it. But it was interesting, because most of the people from the working group had sort of

(14:03):
moved on and weren't available to us. So we had to look at the results and work backwards to the methodology. So we invented some of it ourselves. Our main centering principle was that we are gonna time box the amount of time that we spend with each incident, and that we're not going to change them in any way. So we may make notes that we couldn't find something or we

(14:28):
don't have much confidence. That's actually a score that we added, but we're just gonna go back and look. We really wanted to change them, by the way; we could see glaring errors in the paperwork, but we just said, that's not what we're doing. We're just looking at them. So we set a time box. We looked at the number of incidents and decided, the two of us who were doing the analysis at the time: we have 16 hours total

(14:48):
for this; divide by the number of incidents, add a little slack time, and say, okay, once you've reached that, if you haven't gotten enough confidence in what you're looking for, then you mark it as a low-confidence incident and we move on. So we just dumped it all in a spreadsheet and assigned them to one analyst or another. Sometimes we did them in pairs, and we started just moving

(15:10):
through what we thought would be interesting things to look at. So it changes with each study. We were looking at things like, how complex did this seem, or how many people got involved with it? The first time we did it, we tried to do time to recovery and we found that we were spending too much time reading timelines. Also whether or not there was a post incident review; that was

(15:31):
actually just like on a whim. We just thought, well, I see a few where it hadn't happened. We didn't realize just how many incidents ended up not getting that time spent. And that was something that really troubled us the first year. So we did a lot of work the next year as a squad to advocate for those reviews. And then the rate did go up. It's still not where we want it to be; I'd like it to be a hundred percent. But that was the idea: go through each one, look at, you

(15:55):
know, these aspects that you're trying to find based on the hypotheses. I forgot to mention that first we make hypotheses. Then we pick metrics, then we look for them. And at the end we'd spend a few times having a third person come and spot check just to make sure we were applying it. And then from there, we did some data analysis, worked with our data scientists to make sure we're not inventing statistics,

(16:16):
because none of us are statisticians. And we totally did that... at first. But luckily we have data scientists around who are very happy to, like, get into it, and were like, "Yeah, you can't really say that." And then produce a report with some findings.

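The time box Clint describes is simple arithmetic: total analyst time, minus some slack, divided across the year's incidents. A sketch with hypothetical numbers:

```python
# Back-of-the-envelope version of the time box described above: take the total
# analyst time available, hold back a little slack, and divide the rest across
# the year's incidents. The specific numbers here are hypothetical.
analyst_hours_total = 2 * 16          # e.g. two analysts with 16 hours each
slack_hours = 4                       # buffer for incidents that run long
incident_count = 56                   # hypothetical size of the year's set

per_incident_minutes = (analyst_hours_total - slack_hours) * 60 / incident_count
print(f"Time box: about {per_incident_minutes:.0f} minutes per incident")
# Once the box is hit without enough evidence, the incident is marked
# low-confidence and the analysts move on; records are never edited.
```
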
Courtney Nash (16:30):
What were you looking at? What materials were you looking at? Just the post-incident reviews? Were you, I mean... Grafana? I don't know what y'all use... outputs. Like, what gave you the sort of data for those hypotheses, other than... I'm guessing a lot of timelines.

Clint Byrum (16:47):
That's a great question. So there's a single JIRA project for all incidents at Spotify. And then we narrow that down to the ones that have been tagged as related to our product, which is actually pretty reliable, just because it's one of the ways we communicate: if you declare an incident against Spotify for Artists, then some stakeholders get notified. So it's pretty important that it happens. So we first just pulled out the raw data from JIRA, and

(17:11):
that has a description. Some teams will actually manage the incident directly in JIRA, so they'll enter comments along the way. What's really common, though, is they'll just drop a Slack link in there. And so in the time box we had, I think, like, this last year, we had 30 minutes to look at each incident, which is actually

(17:32):
pretty generous. Whatever we can find related to the incident in 30 minutes. So we start at the JIRA ticket, but you'll find there's a post incident review document; sometimes they're recorded, so we'll try to watch a few minutes of it. You'll find Slack conversations. If there's really not a lot, we would do a little bit of searching, but you'd usually hit the time box pretty quickly if that was the case.

(17:52):
And then we had like a one-to-five scale in confidence. And so, pretty much anything you see mentioned in that study, we threw out everything under four. So fours and fives were like, we're really confident that we captured the data. There's actually not a lot of ones and twos, and there's just a few threes. Most people put enough in there that we could make some calls

(18:13):
about them.

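A sketch of that filtering step: each reviewed incident carries a one-to-five confidence score, and only the fours and fives feed the aggregate numbers. Field names and example rows are hypothetical.

```python
# Sketch of the filtering step described above: every reviewed incident gets a
# 1-5 confidence score, and only the 4s and 5s feed the aggregate analysis.
# Field names and the example rows are hypothetical.
from dataclasses import dataclass
from statistics import median


@dataclass
class ReviewedIncident:
    key: str
    confidence: int            # 1-5, analyst's confidence in the captured data
    had_synthetic_test: bool
    minutes_to_detect: float


reviews = [
    ReviewedIncident("INC-3", 5, True, 4),
    ReviewedIncident("INC-9", 4, False, 55),
    ReviewedIncident("INC-12", 2, False, 130),   # too little evidence; excluded
    ReviewedIncident("INC-17", 4, True, 6),
]

usable = [r for r in reviews if r.confidence >= 4]

with_test = [r.minutes_to_detect for r in usable if r.had_synthetic_test]
without_test = [r.minutes_to_detect for r in usable if not r.had_synthetic_test]
print(f"median detection, with test: {median(with_test):.0f} min; "
      f"without: {median(without_test):.0f} min")
```
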
Courtney Nash (18:13):
This is a feeling I've had about people who do your kind of work, and other people who are getting hired even as incident analysts: you're like complex system archeologists. I mean, you're just going back and trying to look at tunnels and scratchings on the wall and sort of figure out really what happened. When you would go in, like you'd mentioned that there would be

(18:34):
like a Slack link. Were you lucky enough that there would be like an incident channel, or does a channel get declared for incidents and, like, everything's in one place? Or what did that look like for you all?

Clint Byrum (18:45):
Would that they were always true. Yes, sometimes there's a specific channel created for an incident. Generally, the more severe or broader-impact incidents get those channels created. It wasn't always the case. I would say actually 2021 is when we started doing it more as a matter of course, but there's no sort of manual that says,

(19:07):
"Make a channel and number it this way." It's sort of a tribal knowledge that works out well for everyone. A lot of times, teams that maintain systems, or on-call rotations that share systems (sometimes that happens), will have a single channel where, like, all of the Slack alerts from PagerDuty or from Grafana or

(19:30):
whatever system they're using to alert them will sort of coalesce, and then they open threads against them and discuss. So sometimes you just get a thread and sometimes you get a whole channel. It just depends on how sort of broad or narrow it was and how well managed the incident was. At times you get very, very little, but the incident review is so good that you're able to just read that document.

(19:52):
One of the magical and frustrating things about Spotify is that, well, we have a lot of autonomy; how you do your job is your choice as a squad. So a small group of 10 people make that choice. And as a result, when you try to look at things across any broad swath of the company, even just across those that work on

(20:12):
Spotify for Artists, expecting things to be uniform is probably a mistake.

Courtney Nash (20:18):
Well, I think expecting things to be uniform in general is probably a mistake?

Clint Byrum (20:22):
Probably yes.

Courtney Nash (20:23):
In any system, honestly? But yeah, I mean, that sounds like that would be, you know, a bit more challenging. And so you didn't interview people yourselves, then; that may have happened in the post-incident reviews, but not in the work you all did.

Clint Byrum (20:37):
Yeah, not as part of this, although, because we do some of that at the company, we try to go deeper on certain incidents. Those are the fives on that confidence score, where we have a really well-written narrative report, where maybe a few analysts have gone in and gone over things with the fine-tooth comb and had a really effective post-incident review.

(20:58):
So out of the hundred or so that we looked at, I think we had five or six of those, and those are so easy to read that you can get that done in five minutes. I wish we had the time to do that on every incident, but it's a heavyweight process. It makes for a really fantastic result for the archeologists.

Courtney Nash (21:16):
So you had your hypotheses, but while you were doing this, did it reveal any other sort of patterns or repeating themes that you weren't expecting... or?

Clint Byrum (21:26):
I wish that we had a little more time to do these and to look for exactly that. While we're going through them, we sort of by osmosis noticed some stuff, and that is another sort of unstated reason to do this. Sure, it was great to have statistics and to be able to make some really confident calls about how incidents are

(21:48):
managed at Spotify. But from the perspective of a squad that's trying to help people be better at being reliable and resilient as they're building their systems, we certainly got a good broad picture of what's going well and what's not. I think if we actually sat down and didn't do the statistical approach and just tried to find themes in them with the same

(22:10):
amount of time, we probably would have an equally interesting, although very different, report to write.

Courtney Nash (22:18):
Interesting. How was the report received internally? I'm curious; some places it's like, well, people read it and they said it was great.

Clint Byrum (22:28):
Yeah. It's been interesting. Our leadership really loves this thing, so they have come to expect it. In the winter they're sort of like, oh, are you working on a study this year? Like, you know, what are you taking a look at? Did you ask this question this time? So they usually kind of remember when they get back from a winter

(22:48):
break that we'll probably be publishing it. And that's very gratifying, that they're paying attention. I think it also just shows that we're giving them sort of things to keep an eye on from a reliability perspective. Engineers also tend to see it as sort of an interesting, curious moment to reflect on where they sit in it.

(23:08):
So I've talked to a few engineers who, you know, had a few incidents through the year, and they're curious to see; like, they go look at the spreadsheet and they wanna see how their incident was rated.

Courtney Nash (23:20):
Oh,

Clint Byrum (23:20):
Like, yeah. So they're digging around in the ones that they remember, and they're sort of seeing, like, oh, that, yeah, I don't know. Is that really complex? Did we get a lot of, you know, that sort of engineering. Also, a lot of what happens is people will go back and fill in a lot of details, because we'll mark their incident as low confidence where, like, we couldn't find an incident review

(23:41):
document, or we didn't see the Slack channel. We don't go back and revise the report, but it's always interesting to get a ping, like, "Hey, I saw that, you know, this was a really big incident and I forgot to put the link here. So here it is in case you're curious." So we get all kinds of engagement from it. I think there are a few hundred engineers who work on Spotify

(24:01):
for Artists, and I think we saw that about 60 of them read the report. So I think that's a pretty good rate of at least opening.

Courtney Nash (24:11):
Yeah. Well, by marketing standards, that's a smashing open rate. But engineers are reading that maybe a little bit out of terror, I suppose, but hopefully not. It seems like they see it largely in a positive way. It's like, yeah, you guys need a yearbook: the Spotify 2021 incident yearbook.

Clint Byrum (24:28):
I like that.
Yeah.

Courtney Nash (24:30):
"Most likely to never occur again" or so we
think...

Clint Byrum (24:34):
Oh, I'm totally stealing that. It's gonna be in this one. Like I said, we wanna make it work like Wrapped. So Wrapped is this very visceral visual experience, audio experience, where, you know, we kind of own that dance on social media for music. And we're always like, oh, we're gonna make an Incident Wrapped. And usually it's just three cheeky slides in some sort of

(24:55):
all hands that drive people to the report. But that seems to get the job done.

Courtney Nash (24:59):
So it was just two of you then doing the meta-analysis. And were either of you involved in any of those incidents, or no? Did you go back and see things that you had been a part of?

Clint Byrum (25:14):
Yeah, for sure. We get involved in incidents. We do maintain that testing framework, and that's often broken itself. You know, there are false positives. We get involved when that's happening. Or in some cases, like, one interesting aspect is that we have a web test that will just sort of log in and add somebody to a team and remove them to make sure that that works.

(25:34):
Well, that is an auditable event. We wanna know when that happens because it has security ramifications. And so our test suite filled up the audit log database. So we got involved with that one, for instance; that was a lot of fun. But what actually was really fun about that is the result wasn't that, you know, sort of, we had to

(25:54):
stop doing it so often. The team that was maintaining it realized that actually the purging process that had been designed in just wasn't working, and no other user on the system had ever done as many things as the synthetic test users. So they were able to actually fix something. I love incidents like that: how deep can we go in finding things

(26:15):
that aren't what we think they are?

Courtney Nash (26:17):
Yeah, work as imagined versus work as done, or systems as we think they work versus how they actually work. What was it like to go back and look at some of those? Did the memory fade? Was, you know, looking back at it at all interesting for you, just personally, from having been involved in some of those?

Clint Byrum (26:36):
I love looking back. Time, you know, really does sort of set the memories into your sort of neural network of concepts, right? So there's some sort of proof that people remember things best when they've had a little time to forget them and then are reminded of them. And so I certainly do remember, like, that one, for instance. It

(26:56):
happened. It was over fairly quickly, and honestly wasn't all that complex. But then going back and reading the post-incident review, I remembered the conversation, and I talked to one of the other engineers again, and they were like, "Oh yeah, we did do that work. And here, look, this is how, you know, we found this too." And so going back and revisiting them definitely was a

(27:16):
good experience for me personally. And I know from other engineers, who've said, similarly to the leaders, like, when is that coming out? They've had the same thing. They're usually looking for a specific one that, you know, maybe had an emotional impact for them or that they just are curious about. But yeah, it's a chance to look back, and I think that's really valuable.

Courtney Nash (27:38):
It all is so much time to do, but you mentioned at the very beginning of the post that we trade productivity now for better productivity later. Right. And I think a lot of organizations struggle with justifying that kind of work. You know, it's not always easy to draw the direct line from that fuzzy front end of analyzing these

(27:59):
things to, like, really clear outcomes, every time.

Clint Byrum (28:02):
Yeah. I say that out loud quite a bit, actually, whenever we're discussing incidents in general: this is intended; these failures are things that we want you to have as an engineer. We have an error budget, and that doesn't mean that it's, oh, if something goes wrong, you can spend up to this amount. We actually want people to push up to the edge and spend that

(28:24):
error budget. If they're not, that's also a signal they're not taking enough risks, or that they're spending too much time on system health, which will blow some people's minds to say it that way. But absolutely, I mean, very much, when I say that, we are spending that incident time to learn about the system in ways that would just not really be possible if we

(28:46):
simulated it, or if we tried to think through and plan for everything; it would be too costly. It is okay to have a few minutes where the users are inconvenienced. We're sorry, users. Because we are building so much more value when we're actually spending the time after that to learn. How's it gonna break again?

(29:06):
Where is it weak? Maybe it needs even more work, or less if it's doing well. So, really important.

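For context, the error budget Clint refers to is the standard SLO arithmetic: a reliability target over a window implies a budget of allowed bad minutes, and consistently underspending it can itself be a signal. The target and window below are hypothetical, not Spotify's.

```python
# The standard error-budget arithmetic behind the point above: a reliability
# target implies a budget of allowed bad minutes per window, and spending
# well under it can be a signal too. Target and window here are hypothetical.
WINDOW_DAYS = 28
SLO_TARGET = 0.999          # 99.9% availability over the window

window_minutes = WINDOW_DAYS * 24 * 60
budget_minutes = window_minutes * (1 - SLO_TARGET)

downtime_minutes = 12       # hypothetical downtime observed so far this window
remaining = budget_minutes - downtime_minutes
print(f"Error budget: {budget_minutes:.0f} min; spent {downtime_minutes}, "
      f"remaining {remaining:.0f} ({remaining / budget_minutes:.0%})")
# Consistently leaving most of the budget unspent can mean the team is being
# too conservative, not just that the system is healthy.
```
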
Courtney Nash (29:13):
What are your plans for this year for your post? Are you thinking about it yet? Are you gonna do anything differently? What are you looking out towards for Spotify Incident Wrapped 2022?

Clint Byrum (29:25):
Yeah. At this point, it's become a tradition. So I think we will definitely look back at the year of incidents again. We had a plan to sort of do more work before we get to the end of the year, so that we could maybe spend the time box going a little more in depth. So spend a little bit all through the year, rather than all of it at the end; that has proven difficult to find

(29:48):
time to do. We also thought about maybe crowdsourcing some of this and asking people to fill in some more forms. We realized they already aren't doing the paperwork we're asking them to do yet. So until we come up with an incentive for that, we've kind of put it on the shelf, although we still would like to do it. I think this year we're gonna focus a little bit less on these metrics and looking at, you know, trying to, like, find aggregate

(30:11):
clues here. And we're really trying to find the incentives that drive people through the process, to get that value of the post-incident review and, during it, the learnings of fixing things when they're really badly broken. Because one signal that we also got, that's maybe not in the post very much, is that there are a number of incidents that would be very useful and valuable to learn

(30:33):
from, that are never declared because they're near misses or they didn't rise to the level where they needed to be communicated about. And we really, really would like people to be able to declare those, or record those somewhere, and just give us that quick learning. That's proven very difficult because our process is heavyweight. It needs to be, but we, you know, we'd like to figure out a way to

(30:55):
do that. So we're gonna look into, you know, how we can maybe find the magic time where it's still happening, even despite the heaviness of the process, and do more of that.

Courtney Nash (31:06):
That's very exciting to hear. I am a big fan of near misses, but I also fully understand how that is one of the hardest sources of information to incentivize and collect. So I wish you well, and I hope I get to hear about some of that in 2023.

Clint Byrum (31:26):
Is that already almost coming?
Oh my gosh.

Courtney Nash (31:28):
I mean, if you're gonna write it up, it'll come out next year. So I'm not saying it's anywhere near news. No, so no pressure. Well, thank you so much for joining me today, Clint. It's been really a treat to have you on the podcast.

Clint Byrum (31:39):
Yeah, it's been great.
Thank you.