
December 1, 2021 44 mins


At the end of January 2021, a group of Reddit users organized what's called a "short squeeze." They intended to wreak havoc on hedge funds that were shorting the stock of a struggling brick-and-mortar game retailer called GameStop. They were coordinating to buy more stock in the company and drive its price further up.

In large part, they were successful, at least for a little while. One hedge fund lost somewhere around $2 billion and one Reddit user purportedly made off with around $13 million. Things managed to get even weirder from there, when online trading company Robinhood restricted trading for GameStop shares and sent the stock plummeting: it lost three-fourths of its value in just over an hour. But that's less relevant to this episode.

What matters is that while all this was happening, traffic to a very specific page on Reddit, a subreddit called r/wallstreetbets, went to the moon. Long after the dust had settled, and the team had a chance to recover and reflect, some of the engineers wrote up an anthology of reports based on the numerous incidents they had that week. We talk to Courtney Wang, Garrett Hoffman, and Fran Garcia about those incidents, and their write-ups, in this episode.

A few of the things we discussed include:

  • The precarious dynamic where business successes (traffic surges based on cultural whims) are hard to predict, and can hit their systems in wild and surprising ways.
  • How incidents like these have multiple contributing factors, not all of which are purely technical
  • How much they learned about their company's processes, assumptions, organizational boundaries, and other "non-technical" factors
  • How people are the source of resilience in these complex sociotechnical systems
  • Creating psychologically safe environments for people who respond to incidents
  • Their motivation for investing so much time and energy into analyzing, writing, and publishing these incident reviews
  • What studying near misses illuminated for them about how their systems work




Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Courtney Nash (00:21):
At the end of January, 2021, a group of Reddit
users organized what's called a "short squeeze." They intended
to wreak havoc on hedge funds that were shorting the stock of
a struggling brick and mortar game retailer called GameStop.
They were coordinating to buy more stock in the company and
drive its price further up.
In large part, they were successful, at least for a little

(00:44):
while.
One hedge fund lost somewhere around $2 billion and one Reddit
user purportedly made off with around $13 million. Things managed
to get even weirder from there,
when online trading company Robinhood restricted trading for
GameStop shares and sent the stock plummeting, losing three
fourths of its value in just over an hour.

(01:05):
But that's less relevant to this episode.
What matters is that while all this was happening, traffic to a
very specific page on Reddit, called a subreddit, r/wallstreetbets,
went to the moon. Long after the dust had settled,
and the team had a chance to recover and reflect,
some of the engineers wrote up an anthology of reports based on

(01:28):
the numerous incidents they had that week.
Fran, Courtney, and Garrett from Reddit.
So what happened?
What was your experience when all of a sudden all of these
people were swarming Reddit to talk about this situation?
Courtney, why don't you go ahead and kick it off?

Courtney Wang (01:47):
A confession:
until the first incident happened,
I was not following very much of the crazy hype.
We have an incident room channel at Reddit.
I will just go in and, you know, see what happens.
And we were seeing a lot of traffic and somebody just
mentioned, "Oh, this must be r/wallstreetbets."

(02:10):
And I was kind of like, huh?
What is that?
And this is already like Wednesday or Tuesday.
I think it already started on the weekend,
right?
And we hadn't seen real impact until the week started.
So that was really my first introduction to it.
And really through the first day, until I could sign off and
actually go onto Reddit when I wasn't managing an incident, I

(02:30):
still had relatively no idea what was going on.

Courtney Nash (02:34):
So you all, or some of you at least, were
learning what this social phenomenon was while fighting
fires on the infrastructure side of the house.

Fran Garcia (02:45):
Yeah.
I remember in my case, one of the first inklings I had
of what was going on was when we were dealing with, you know,
traffic surges.
And we were looking at what was going on and someone mentioned,
"Oh yeah, these traffic surges are happening because the markets
are opening."
That was kind of like a very quick course into what was
going on and why the markets were suddenly, for a week or

(03:07):
two, very, very important to my job.
There were all of a sudden times I had to set an alarm
because there's going to be a high traffic event.
Then the market is going to close.
A lot of people are going to go into one particular subreddit,
and they're going to share with each other how much money they
made or they lost.
And that was something I was not prepared for.
That was a lot of learning that we had to do on the fly.

Courtney Nash (03:29):
Garrett, was that the same experience for you as
well?

Garrett Hoffman (03:32):
Yeah.
So I was a little bit aware that something was going on
with GameStop.
Prior to Reddit, I worked at a social media company focused
solely on the retail trading market.
So I have former colleagues and friends really close to
the markets and close to that space.
I've actually seen crazy market-driven stuff like this at that

(03:55):
job, but I didn't really understand the full extent of
what was happening until we started seeing these incidents.
We're looking at data pipelines and being like, "Why are we
getting so much hotspotting here?"
And it was really only, I think, when we saw the impact it was
having on our systems that we really fully grasped the

(04:15):
magnitude of the situation.

Courtney Nash (04:18):
It wasn't just like, "Oh crap,
our stuff's on fire," but also what is going on in the world at the
same time, which I think is in some ways maybe one of the more
unique aspects of this whole series of incidents.
Before we get into all the individual ones, what led you
all to decide to write this anthology? Which, by the way,

(04:39):
I've never heard of an "incident anthology" before.
And I really loved that notion because you're like
anthropologists, to a certain degree, of your own organization.
What prompted you all to go into this level of effort and
detail, to not just analyze this but to write these up and
publish them so that other people could read them?

Courtney Wang (04:58):
I just really love telling stories, outside
my work life as well.
I think storytelling is one of the fundamental ways that humans
interact, learn from each other, and support each other.
I think there are so many cool incident stories out there that
aren't being told.
My motivation as I was going through this process, which was

(05:20):
really the span of a couple of months of doing longer
post-mortem interviews than we normally do, coming and knocking
on Fran's door and Garrett's door and saying, "Please do
this!", was kind of to set an example and to say, "Hey, this
is one way that incident stories can be written, can be told."
I really was hoping, and I still am hoping, that folks at other

(05:43):
places will come across this set and say, "Oh, you know, I could
write a story like this."
And actually one of the main inspirations for me was
Slack's January outage write-up.
I think if I hadn't read that and seen something presented so
nicely, I might not have had as much motivation to do this one.

Courtney Nash (06:01):
That's a great one.
Laura Nolan is the person behind that one.
I agree.
It's great.
And when you see these other examples of other people doing
it, it's so exciting to me to hear someone else go, "Oh,
we could do that." Not just because it's fun, but, I mean, what's the
value you see in that?

Courtney Wang (06:20):
For me, the value was hopefully learning.
It was hopefully to tell the story, or stories, in a
compelling way that external folks could learn from. But also this
was such a unique experience for all of us, and one at a time of
very rapid growth for Reddit, that it seemed really important
for us to capture a lot of the intricacies around us dealing

(06:43):
with 13 or 14 different incidents.
It felt very important to capture that.

Courtney Nash (06:48):
Oh, wow.
It was hard to tell from the write-up that you're talking
about 13 or 14.
I think that wasn't even apparent to me from rereading
these so many times.

Courtney Wang (06:58):
We had, yeah, so there were
a bunch that we did not cover. We
picked and chose certain things to talk about.
And similarly there were a lot of other scaling wins and
operational near misses that I alluded to that aren't fully
captured in this anthology because of time.

Courtney Nash (07:17):
There's a lot of pressure to move on, right?
To go onto the next thing.
And why did you choose these from the set then?

Courtney Wang (07:24):
For these ones, I think they were especially
compelling because there were several outside forces.
These were ways that subreddit interactions mess with our
systems, or maybe "mess with" is the wrong word, but interacted
with our internal systems. They were patterns of behavior that
we had really never seen, so things like modmail spam, where
a subreddit goes private and people start spamming modmail.

(07:46):
That was interesting.
Very cool.
And the ones that I also wanted to
highlight, in retrospect, what made them interesting to me
was that they were unpredictable.
I'm going to say they were unpredictable.
I think some people might read the incidents and say, "Oh, you
know, you could have done this or that."
And I'm going to say no, that's hindsight bias.

(08:08):
There's no way, you know, we could have predicted this kind
of rapid growth so quickly. Even if we had provisioned, you know,
there are so many other factors in play.
There are examples of ways that things will just fail, and that's
why I also chose Fran's and Garrett's near misses, because
they were anticipatory to an extent. They

(08:28):
were investments into things that were like, maybe they will
happen, right?
Maybe we will need this in the future.
Hey, we did need this in the future.
And so that's kind of the reasoning behind the specific
ones that I pulled out.

Courtney Nash (08:43):
Garrett,
had you done this before? Had you invested this much time to
analyze and write up an incident when Courtney came knocking at
your door?
And then from there, what motivated you to actually say
yes
and go ahead and do it?

Garrett Hoffman (08:57):
In my career, I've done the traditional
post-mortem write-up, but I think that's more procedural
rather than this more narrative approach to reflecting on
your work.
I'm a big proponent of writing stuff down, whether it's before
you design something or after you've built it and you're
reflecting on it.

(09:18):
I think that gives you the mechanism to really think
critically and clearly, and be able to think about your systems
in a way that's coherent enough to explain them to an outside
audience who doesn't have full context.
And I think there's value you get out of doing that

(09:39):
internally as well, because tons of people come in and out of
your org and not everyone can have full context about
everything.
So to the extent that you're able to get this stuff down and
share it, it's great in the broader ecosystem of the
technology landscape, but it's also a little bit self-serving
and helpful for us internally as well.

(10:00):
As far as saying yes to Courtney, I've done a little bit
of writing before, but the thing that was so intriguing to me
about this series of blog posts is that it takes technical
writing and flips it on its head, because the standard
engineering blog post is kind of, we had this problem.

(10:22):
We came up with this solution, and here's your new solution
tied up, packaged nicely with a tight bow.
But we all know that that's not the reality of most of your
day-to-day work environment.
So to be part of sharing stories that are opening up the curtain
and giving people a view of the raw, unfiltered situation of

(10:45):
what went down,
it was just a really, really cool idea to me and different
from most of the kind of blogging I've had experience
with in the past.

Courtney Nash (10:56):
There's a really great researcher in the UK named
Steven Shorrock, who writes and researches a lot about work as
imagined versus work as actually done,
which I think is exactly what you're hitting at here.
I'll put a link with some more details about that in the
show notes, if people are curious about it.
You alluded to this also, Courtney, in the meta post that
talks about the things you're going to talk about, of how

(11:18):
much it exposed about how your company works.
Which I think is one of those really important things about
these kinds of analyses: you're not assuming you're going
to find a technical "trigger," and then you're going to solve
that and be done.
When you really dig into these things, and especially in the
near misses that we'll talk about later, you discovered a
bunch of things about how your company works, how your

(11:39):
processes work, how teams communicate.
And I really appreciated that you all called that out in the
beginning of it.
So Fran, along those lines, what's been your experience in
this kind of post-incident analysis and write-up?

Fran Garcia (11:54):
In my previous life, before joining Reddit, I
worked for a monitoring company, so we tried to take that kind of
thing very seriously because you have highly technical customers
that really need a very highly available system.
So whenever something goes wrong, you will try to spend as
much time as possible giving a very detailed postmortem, a very
detailed write-up of what happened.

(12:14):
I would definitely spend a lot of time doing that kind of thing.
So when Courtney approached us saying, you know,
we want to write these blog posts,
it was kind of a feeling like I'm always happy to do more
writing.
The framing of this series of posts is completely different: it's,
we're going to let you see what was going on and what was
happening with our systems.
And I feel like that's a much more interesting way of

(12:35):
approaching it.
In terms of telling these kinds of stories, it serves a purpose
that is not really served by anything else that we do, even
postmortems or even documentation, because this is
basically memory, right?
We have memory as people, and we share that memory across people
by telling stories. We need to do that as engineers, and the

(12:59):
only way that we can actually do it with people across
different organizations is by writing it down and having that
collective memory grow.
So the more we can do that and the more we can encourage people
to do that, the better, I feel.
This is something
everybody should be doing.
The reason I hadn't even tried to do it before Courtney
approached me is something that I think happens to many of us,
which is we'll have a story or two inside of us, but we'll

(13:24):
think, is someone really going to be interested in this?
Is it not really incredibly obvious in hindsight? And you
know...
But because Courtney is a big ball of enthusiasm,
he will come to you and he will say,
"No! This is awesome.
You need to write it because everybody needs to read it." And
that's, I think that was the key that, you know, when that to

(13:45):
keep dialogue, people needed to share this.

Courtney Nash (13:49):
There's so much shared experience that's not
public.
And in particular, there's all this human stuff behind these
incidents, which is one of my next questions.
This one, well, it was not one; these dozen sound like a
doozy. That must have been quite a week.
Can you take me back a little bit to what the team felt like?

(14:13):
What were you all experiencing? Was everybody super stressed
out?
What did it feel like behind the technical detective
work?

Fran Garcia (14:24):
I can tell you, for example, not necessarily that
week, but the week after is a blur to me.
I don't remember much of it.
I think there was this general understanding that that week
was going to be a cool-down week.
There was a lot of cleanup that needed to happen,
and we did a lot of cleanup.
Courtney himself went on a rampage of writing postmortems,
getting a lot of things ready from that documentation point of

(14:46):
view.
It was a week of holding the fort, cooling down, and getting
to a physical, psychological, emotional state where we could
continue.
That particular week where everything happened, I think my
experience is a little bit different from a lot of people's
because I am based in Europe.
So my hours are shifted a little bit, which means that the

(15:08):
beginning of my day was a lot more me trying to hold the fort
in isolation, in anticipation of the markets opening.
Is someone going to do something weird somewhere? You never know.
So that was usually the first half of my days. There's
definitely a lot of pressure; you know, you need to make sure
that you're holding the fort.
You need to make sure that you're keeping an eye on
everything.

(15:28):
And then everybody starts coming online and that's when all that
cooperation starts happening. You need to have all these teams
talking to each other and all this cooperation happening.
I don't want to say it was surprising, but
it was very nice to see how naturally all that cooperation
happened.
It was all very natural, and people know who they need to

(15:50):
talk to.
Our community team knows how to reach out to us and say,
you know, maybe people are complaining about this.
Do we need to talk to any of our communities
about any of this?
So it was very good to see while that was
happening, but I think you only get to appreciate that after the
fact when you sit down.

(16:10):
Just looking at your Slack logs, you say, "Oh look, everybody's
cooperating.
That's very nice."
But at the time it's just a big blur.

Courtney Nash (16:19):
I recall the specific mention in one of the
write-ups about the relationship with the community team and how
well that went.
Was that the result of explicit investment in that relationship?
Was it more organic?
Do you have established processes and runbooks and all
that kind of stuff?
How did that come to be something that worked so well?

Fran Garcia (16:42):
I wouldn't say we have a process that's as clearly
defined as that.
It's something that I think grew very naturally
because it's something that we have to take advantage of very,
very frequently, because by the actual nature of how Reddit
works, there will always be that one subreddit that's doing
something weird.

(17:03):
That's tickling the database in a strange way.
Right?
So we need to be able to contact someone from the community team
and talk to them and see what our options are.
So that relationship needs to be there because there are always
times where we very quickly might need to reach out to them
or they very quickly might need to reach out to us.
So I think that kind of grows very naturally, because at some

(17:25):
point it's not just that you have infrastructure engineers working
on the servers and that's it.
It's all part of a whole.

Courtney Nash (17:32):
You're a victim of your own success in that
regard, right?
Having subreddits sort of explode is a good thing, right?
It's a feature, not a bug, but you feel the consequences of
that in interesting ways, and you can't resolve those just
technologically.

Fran Garcia (17:49):
I think that's one of the things that make Reddit
so unique.
Not only as a platform but as a whole, every subreddit
is its own ecosystem. It will have its own behavior patterns.
It will have its own group of people that use it.
So you can't make assumptions based on the behavior
of all the others.

(18:10):
You need to be ready for anything at any given time,
which is... fun.

Courtney Wang (18:17):
The community aspect of specifically these
incidents and the interactions that we had was something I
really wanted to share internally with new
infrastructure folks, anyone who's on call essentially, to
open their eyes more to this group of people that works
generally behind the scenes.
Our community team is an incredible team and it's a huge

(18:37):
source of resilience for us
in handling things that we can't predict, because like Fran said,
there's no way we are going to predict what every subreddit is
going to do.
And there's so much technology we can build towards
guardrailing or preventing some major technology things.
But that shouldn't mean we ignore this group of people

(18:57):
and processes that are built in.
And a lot of it was so organic, and that stood out to me.
It was a lot of back-channeling and conversations. We
don't really have... at the time,
we didn't have a wiki on how to do this.
And since then I will say we've improved that to be less
organic and to be more explicit on how we do these sorts of
interactions.
And that was one of, I think, the huge organizational wins

(19:21):
of all of the things that happened: we now, on the
technology side, are much more ingrained with our community
team.

Courtney Nash (19:31):
I like the theme across all of these in terms of
where the people are the sources of resilience for you
all.
To that end, I want to dive into the open systems post, which I
believe you wrote, Courtney, and you talk about the various teams
and responders. Can you give me a sense of how many people were
involved over the course of managing all of these incidents

(19:52):
that week?

Courtney Wang (19:55):
That is a great question.
And there are so many questions in the list of
sample questions you gave me that I was like, I wish I'd
asked this during the review process.
Why didn't I?
It's at least 60 or 70 individual people across all of
our teams.
Every single team at Reddit contributed in some way to

(20:18):
facilitating our response.
I think in individual incidents, there was a solid
group of eight or 10 that were in most of them.
Seeing the same 10 names come up consistently was also, to me, a
very interesting kind of red flag, right?
An indication that, oh, you know, why was it just these 10 people?

(20:40):
And to your question earlier about how everyone was
feeling:
that's a really good question that we didn't capture.
And we can't now because it was so far behind.
Actually, before this, I thought that this event happened
last year.
I thought there's a whole year in between, and I was like, five
months ago?
Wait, no way.
So my brain is already so foggy about that time.

(21:02):
And it's a really good question and one that we
aren't able to answer. I'd like to go back, actually, and
get an actual number. These are all people that jumped into
Slack.
And there are also a bunch of people behind the scenes,
probably doing a lot of work that isn't captured, even in my
review.

Courtney Nash (21:17):
There's usually marketing people or PR
people when things reach this level.
You've got the executive team involved.
It really hits all of these nooks and crannies of an
organization.
So I thought it was really great to see the ones
that you did identify.
Wow.
60 to 70.
I mean, even if you're just ballparking that and you're
close, that's kind of amazing.

(21:38):
And the names coming up again and again is an
interesting one.
There's a whole other podcast, probably, about burnout and
incident response.
But the sort of flip side, you know, of humans helping is
expertise. You kept seeing those same names.
I'm going to spitball here:
some of those names kept showing up because they were the
folks who probably had the deepest expertise about how

(22:00):
those systems work.
And you can't just clone that.
And so then when you need that expertise, it's a heavy
burden to a certain degree.
I don't know if that's the right word, but you rely on
those people even more.
And that's a factor, I think, when you have this set of
incidents recurring over and over again.
And maybe that's why it was a bit of a blur?

Courtney Wang (22:19):
I think that's a value of writing
up incidents.
In a lot of ways it helps new people and existing folks
learn more about what they need to learn about,
so the same eight or nine or 10 people aren't always on the
thing.
More people should be reading incident reviews,
internally and also externally, to better understand what you

(22:40):
don't know.

Fran Garcia (22:41):
I think there's a lot of value in telling these
stories in this way, because there are
only two ways.
Well, there's only really one way you can get people to learn
all these things: they need to be there and they need to go
through it.
Now, that's not a way that scales particularly well, for

(23:03):
many reasons.
And sometimes it might not even be the most psychologically safe
way of doing that.
One of the other ways that you can do it is with that kind of
training where you say, I'm going to get
you to go through the story of how I saw it, and you can see it
as close as you can to seeing it through my eyes.
So it's not enough for me to tell you, "Don't do that thing

(23:25):
with the database because the database doesn't like that,"
right?
You need to tell, "Here I was on a dark stormy night and the
database was doing the thing, and these are the things I was
seeing, and these are the things I was doing." And that's a low
barrier to help people grow into understanding what that
looks like and how they can get to that place.

Courtney Nash (23:46):
I have so many questions, and I want to make
sure I get to as many of the reports as I can.
So I'm going to kind of spelunk into the open systems one.
You had covered some surprising details related to
the Crowd Control-related database issues, but then after
you all get through that, you have this follow-on
secondary effect with the modmail flood that you

(24:08):
alluded to, Courtney, and it highlighted how unforeseen user
features and behaviors can impact your infrastructure.
You talk about that being surprising, and I was hoping you
could take me back to what that was like for the team.
What was surprising about what was happening? How did you begin
to tease that apart to try to understand?

Courtney Wang (24:28):
Those were very interesting ones.
So they were surprising because they didn't follow patterns that
we had seen previously.
To take you into our mindset before the series of incidents
opened up how unpredictable Reddit is:
I'm going to make the assumption that a lot of us
thought that Reddit was relatively predictable.

(24:49):
We have a couple of events,
like the Super Bowl, for example, or election days, that
for many years we were kind of like, these are the days that
things are going to happen.
We're going to see just these kinds of traffic surges,
and we will prescale or add more resources accordingly.
And there might've been things that existed that maybe

(25:09):
individuals knew about, but it wasn't shared broadly.
So it was surprising mostly because none of us had ever seen
the combination of effects that we were looking at on dashboards
together before.
So two entirely disparate systems, modmail and
all of our other infrastructure, separately going down.

(25:30):
Okay, they must not be connected.
Actually, yesterday I was talking with somebody; we
were looking at some 500 errors, and they said, "Oh, these 500
errors are this status code.
I'm not looking at this other system." And he said that
sentence specifically: "I'm not even looking at this because
there's no way they can be related."
It turns out they were related.
And so I think the surprising part is just how we

(25:52):
didn't know what we didn't know.
And the process of teasing that out was extremely difficult and
caused a lot of reflection.
And it was extremely difficult because, when we build stuff, it's
interesting that a lot of the tools that we build and the
dashboards that we build and the fixes and mitigations
that we build are for things that we've seen before.
And that's kind of wild.

(26:13):
When you think about how many things we don't
know.
It's hubris to think that we can predict all of those things.
Like Garrett's recently consumed work, that's an
example.
And the reason I highlighted that is that's an example of a
project that kind of scales infinitely, in that it is
a good solution to a lot of problems, including ones that we

(26:36):
might not know about.
And I think highlighting that as an example of a near-miss
project was really powerful because we're not solving for a
specific case.
We're solving for the general case of "traffic might increase
in a lot of different ways, and we can solve it this way." And
similarly with the autoscaler, right?

Courtney Nash (26:52):
I would love to talk about Garrett's write-up, but
before I do that, I just wanted to call out how invaluable I
think it is to study these kinds of near misses. And Fran, you
alluded to this early on: it could be framed very
differently.
It could be framed like "here's how Reddit's awesome, and this
is why our system is this way, and, like, you should go do the
same." Which, spoiler alert, people probably shouldn't

(27:13):
because their business model looks nothing like yours. But
the framing of it is
not that this was a giant success, but that it was a near miss.
And I love that framing because I think you all have
teased out and understood just as much about how your business
and your systems and your social organizations work from

(27:34):
studying the near misses. So, it sounds like maybe a conscious
choice on your part, Courtney, but Garrett, talk me through
what you were thinking about in terms of writing this one up.
Were you thinking about it as a near miss when you first did it?

Garrett Hoffman (27:47):
Yeah, I think so, because it's a Friday night,
it's late, and you're having a database be at 95% capacity out
of nowhere,
based on, like Courtney said, unpredictable patterns
putting additional load on this database.
And we're on Slack, we're writing up docs.

(28:11):
How are we going to scale this up?
Do we have to have any downtime?
You know, what's the mitigation plan?
How long is this going to take?
And you're writing this up and you're going through
these steps and it all goes according to plan. We can scale
it up easily, we can scale it up online with no downtime.
And I think in that exact moment, me and Courtney

(28:34):
Slacked each other, and we were like, "Could you imagine if this
happened nine months ago, before we redesigned the system?"
I do not think what just happened would have happened.
I think we weren't that conscious of it in the
moment of responding to this incident, and I don't

(28:54):
think we immediately said, "Hey, this is something we need to
write about or talk about," but I think that little conversation
between me and Courtney resonated enough that he
was like, "No, this is worth talking about," because you
need to talk about this to remind yourself why it's so
important to do that redesign and maybe push off that

(29:17):
one new feature by a quarter to just prepare for this exact
situation.

Courtney Nash (29:23):
So you alluded to that trade-off of, should we
spend X amount of time on feature Y or should we invest in
this thing that may or may not pay off one day?
What was that decision-making process like at Reddit?
Was that an easy sell? Were there a lot of people that had
to be involved? Hard trade-offs, engineering managers? What did

(29:44):
that look like internally to make that decision?

Garrett Hoffman (29:47):
Courtney and Fran might definitely have a bit
more to add because they have a bit more time at Reddit than I
do, but especially in my, about a year and a half here, more and
more focus has been given to foundational work and
quality.
So I think it's, it's becoming easier than I imagine it may
have been in the past when Reddit was kind of in a smaller,

(30:11):
super, super high growth phase.
Not that we're not in that anymore, but I think we're at
the point where we're kind of finding the right balance of
that high growth, new feature development along with the
foundational work.
It's largely more of a trade-off in resourcing.
I think Reddit's notoriously operated very light, relative to

(30:32):
the scale that it, that it serves.
And so I think those decisions mostly come down to, when you're
strapped for resources, what do you do?
And there's merit in both approaches: what's the point of
preparing for more users when you aren't building the features
that you need to get those users?
There are certain systems that, when it becomes super apparent

(30:55):
of the limitations, it becomes easier.
In this system, we were fortunate enough to have hit
those limitations in like a more gradual fashion rather than
just being slammed at one time.

Courtney Nash (31:09):
So we're talking about two "What Worked" posts,
just for context for folks, and all of these will be included in
the show notes.
One was around the autoscaler and one was around recently
consumed.
I believe it was you, Fran, for autoscaler.
I'm going to read what you wrote.
"The code was originally written in response to a scaling event
years ago that had been largely unchanged.
Unfortunately, this means that it wasn't an easy, easy code

(31:31):
base to understand if you didn't have the right context.
And the prospect of making changes to it was intimidating,
which only exacerbated the issue." And I'm sure most
engineers can agree.
Like that's the spooky shit.
Don't touch it.
But take me back to when that code was written, if you can,
and the event that precipitated it. You said it was a scaling event,

(31:52):
years ago. Something precipitated your willingness to
wade into this. Why was it intimidating for the people who
were dealing with this code later, downstream, and why did
nobody want to touch it?

Fran Garcia (32:03):
I actually just checked this right now.
The code was created like eight and a half years ago, which is
more than half the life of Reddit.
So that should give us an idea.
Like it basically remained fairly unchanged since then.
And I think that's in itself a testament to it being a solid
piece of code, right?
It did its job, and mostly didn't complain.

(32:25):
It was, it was written by, by, by Jason Harvey,
one of our engineers here at Reddit. He's still around.
He wrote it, the legend says, while he had
the flu.
So I, I equate that to Michael Jordan's flu game, but I don't,
I don't know if pizza was involved or anything, but like
the point is that it was created, it works reasonably
well.
And I feel like, to touch on what Garrett was talking about,

(32:48):
there are two types of software projects in terms of the value
that they're bringing and the kind of investment that they
get.
So there are the projects that are delivering a lot of value, and
those are going to get a lot of investment.
Right?
Because it's, it's very obvious that there's a reward
there.
There are the other ones that are on the opposite end of the
spectrum, which is like, everything's terrible,
everything's broken and you need to do a lot of investment just

(33:10):
to keep things running.
And then there's everything else that kind of falls into the
middle, where they're mostly working, they mostly don't
complain.
You can make it better, but it's not necessarily obvious, you know,
how, why.
A lot of the time those kind of fall through the cracks and
investment doesn't happen.
And the autoscaler was kind of falling into that category.
So if you want to test it,

(33:33):
how do you test the thing that requires so many moving parts,
right?
You need all these auto scaling groups, all these large pools of
servers that are serving, you know, varying amounts of
traffic. You need all of those things.
So we didn't really have that infrastructure.
We didn't really invest in that infrastructure to test that.
So if you wanted to make any change to that particular piece

(33:56):
of code, you need to have a very deep, very deep awareness of:
How do load balancers work?
How are requests distributed to all those load balancers?
What happens if you screw up with your code, how do you
revert it very quickly, what can break?
So the end result is maybe you don't want to change it.

(34:19):
And there's a very small number of people (like I was checking
the commit logs), a very small number of people that have
contributed to it over the years.
And most of the changes were out of necessity.
One of the things that we tried to do to fix this was, okay, so
can we refactor this in a way that if people from other teams
tell me, "can I see how the autoscaling decisions for my

(34:44):
server pool are made?
Can I see that, can I modify that?" people should be able to
come to me from any team that has no infrastructure background
whatsoever.
I should be able to point them to the code and say, "Yeah, just
modify this or look at this and it's all you need." And that's,
that's basically the problem that we set out to solve because

(35:04):
nothing was broken.
But we weren't giving people the flexibility to do those things,
to make those changes, which is what worked in this particular
case, right?
In this particular case, during the wallstreetbets
shenanigans,
what we had was a case where we thought we would fine tune the
autoscaler.
But we needed to have a way to do it easily.

(35:25):
We needed to have a way to have maybe different people
contribute to that and be able to chime in and say, I think you
can make this particular change, or I think you can make that
particular change.
It's going to work better.
Maybe in the previous version of the autoscaler, that would be
more difficult to achieve.
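
To make concrete the idea of autoscaling decisions that someone from any team could read and tune, here is a minimal sketch. It is not Reddit's autoscaler: the `PoolPolicy` type, the `desired_servers` function, the `"api"` pool name, and every number are hypothetical, and the proportional rule is just one common way to size a pool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolPolicy:
    """Hypothetical per-pool settings that a service team could read and tune."""
    target_utilization: float  # desired busy fraction per server, e.g. 0.5
    min_servers: int
    max_servers: int


def desired_servers(current_servers: int, utilization: float, policy: PoolPolicy) -> int:
    """Return the server count that would bring utilization back to the target."""
    if current_servers <= 0:
        return policy.min_servers
    # Proportional rule: total busy capacity stays the same, spread at the target level.
    wanted = math.ceil(current_servers * utilization / policy.target_utilization)
    # Clamp so a bad metric cannot scale the pool to zero or without bound.
    return max(policy.min_servers, min(policy.max_servers, wanted))


# One entry per server pool; the name and numbers are assumptions for illustration.
POLICIES = {
    "api": PoolPolicy(target_utilization=0.5, min_servers=10, max_servers=200),
}

if __name__ == "__main__":
    print(desired_servers(40, 0.75, POLICIES["api"]))
```

Keeping the whole decision in one small pure function and one table of policies is the design choice being illustrated: a team can change its own row, or test the function with made-up numbers, without needing real auto scaling groups or load balancers.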

Courtney Nash (35:43):
I want to get through one more of the cases
really quickly, if we can.
The other one that you wrote up, Courtney, was the "more data, more
problems" incident, and early on, you mentioned a quote
unquote known pattern.
I love known patterns, right?
Because I mean, these are, when we talk about expertise, there's
heuristics, there's all this acquired knowledge.

(36:04):
"Oh, this looks like that." Sometimes that can lead you down
a garden path.
And you're like, "no, those two... that can't be connected.
It looks like this." Right?
How did the team come to acquire their knowledge of that
particular known pattern?
I think you said the four tables were a pattern that pointed to a
hot post on Reddit.
What did that knowledge acquisition process look like
for you all?

Courtney Wang (36:25):
This is another question that I wish I had asked
during the review with the responders who came up with
that, and the people who responded to those particular
incidents. The three main engineers, I'm going to shout
out Jason Harvey, re-hearmirrors and Brian Simpson, the
three of them collectively are the most senior infrastructure
engineers who worked on the Infra team at the time.
So, it must be like 8, 9, 10 years individually.

(36:50):
So like 30-ish years of collective experience doing this
sort of work.
And so my bet is that they figured out that pattern and
just had it ingrained in them from years of interacting with
those data models and with seeing how they broke.
And that was just time that they could use to understand those

(37:11):
intricacies.

Courtney Nash (37:13):
And then there was a solution.
It was, it was a form of an availability versus consistency
sacrifice decision you have to make here.
So it was sort of, do you turn the feature off and preserve the
core experience? Like, who was involved in making that trade-off
and that decision, what factors were considered, how do
you reach that conclusion?

Courtney Wang (37:33):
I do know who was involved because this was
something that I saw.
And I looked, I went back and checked Slack logs and looked it
up.
And so at the time it was actually Fran who was leading.
I would like him to chime in a little on what he was thinking
about when he said, "pull the trigger on this." But the
engineers, one of the engineers basically just said, "I think if

(37:54):
we just turn off this particular set of features, it will work."
I didn't ask, you know, what went into that decision making.
In general, I can speak now to kind of some of our internal
general processes around that when we're thinking about
turning features off.
A big part is like, how easy is it to, to disable the feature?
These are questions that we ask during incidents, when somebody,
an engineer for a project, comes in and says, "Hey, I can

(38:14):
turn this off if we need to." And that actually happened
during this incident as well: we had a couple other engineers
chime in and say, "oh, you know, these services interact with
these data stores.
We can turn off X, Y, and Z feature." And that was really
cool to see. Now looking back, we never really took advantage of
that, and we never needed to.
However, we decide: okay, how easy is it to turn it off?

(38:34):
How many, what is the actual impact on the core Reddit
experience?
Is it a lower level feature?
In this case, being able to save and hide links was kind
of a, okay, many people use this, how necessary is it to the
core?
And that is a tough business decision
that at the time we, as the infrastructure group, made.

(38:55):
We were given the agency implicitly, I think, at the time
to make those decisions.
And now looking at them, I was like, that's a really hard
decision to make.
I don't know that every engineer, if you don't have, you know, 10
years of experience here, will be able to make that decision on
your own.
And that's why in the months since we've started a more
robust incident command process, and we have these sorts of
flows.
A lot of that I think came out of those learnings.

(39:16):
And at the end of the day, our core goal is to get read
availability up.
And so anything where we can get users to actually just be able to
look at content again, as opposed to interacting,
that's one of the main levers that we turn. But I would
like to ask Fran, in your mind, when Brian said, "Hey, I can turn
this off" and you said, "go for it," what went through your

(39:37):
head?

Fran Garcia (39:38):
Brian said it was good to do it.
No, I mean, I think to add a little bit more context to it,
this is something that is very integral to how things work at
Reddit.
There's always a hot key or there's always a hot something,
due to the way that Reddit works.
So there'll always be one subreddit that's being more

(40:00):
active than anything else, one user that's more active than
anything else, one thread that's more active than anything else.
So that kind of pattern happens very, very frequently.
So this is something that I think we're getting to the point
where we are very good at identifying those and
saying, "okay,
let's shed that load as best as we can," and that also works with

(40:21):
features, right?
So let's degrade as gracefully as we can.
And sometimes
we don't do a good enough job of preparing those
features to have a simple toggle.
There are many, many incidents where you will see someone very
quickly making a pull request, which is basically saying if
you're going to read that particular key, don't do it.
It's good but it works.

(40:42):
And that informs decisions later on to say, okay, so we need a
toggle here to allow us to shed load here.
We need to be able to be in a position to say, we need to stop
all these kinds of reads from the database at the drop of a
hat.
So every time that happens, it informs a situation where
hopefully we can be in a better position in the future. That's

(41:02):
something that I feel we do a lot. And the fact that our
incident responders feel empowered to make that kind of
decision,
it's really, really important, because I've always been part of
companies where this kind of issue probably involved a
conference call with at least 10 different people.
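
The "if you're going to read that particular key, don't do it" pull request, and the simple toggle it later turns into, can be sketched roughly as follows. This is an illustration under stated assumptions, not Reddit's code: the `LoadShedder` class, its method names, and the feature and key names in the usage lines are all hypothetical, and a real system would likely load these switches from shared configuration rather than hold them in memory.

```python
from typing import Any, Callable


class LoadShedder:
    """Hypothetical in-memory kill switches for features and individual hot keys."""

    def __init__(self) -> None:
        self.disabled_features = set()
        self.blocked_keys = set()

    def disable_feature(self, feature: str) -> None:
        """Turn off every read made on behalf of one feature."""
        self.disabled_features.add(feature)

    def block_key(self, key: str) -> None:
        """Stop reads of one hot key, whichever feature asks for it."""
        self.blocked_keys.add(key)

    def read(self, feature: str, key: str,
             fetch: Callable[[str], Any], fallback: Any = None) -> Any:
        """Call fetch(key) unless the feature or key is shed; otherwise return fallback."""
        if feature in self.disabled_features or key in self.blocked_keys:
            return fallback  # degrade gracefully instead of touching the data store
        return fetch(key)


if __name__ == "__main__":
    shedder = LoadShedder()
    lookup = lambda key: "row for " + key  # stand-in for a real database read
    shedder.disable_feature("saved_links")
    print(shedder.read("saved_links", "user:1", lookup, fallback=[]))
    print(shedder.read("comments", "post:42", lookup))
```

Routing reads through one small gate like this is what lets a responder shed load with a one-line change, at the cost of the feature returning an empty or stale result while the switch is on.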

Courtney Nash (41:22):
Maybe a war room, which

Fran Garcia (41:26):
Oh no!
At the very least.

Courtney Nash (41:28):
I have PTSD from previous jobs with that one.

Fran Garcia (41:31):
As Courtney described, the strategy was basically you
have Brian, and Brian says, "I can do it!" And then whatever
engineer is there is like, "Yeah, do it."
And that's, that's basically all you need.
Feeling empowered to be able to do that is very, very important.

Courtney Nash (41:46):
I would love to close out.
Courtney, you mentioned that you are developing a broader sort
of incident response plan...
team?

Courtney Wang (41:57):
Yeah, it's probably a good topic for a
future blog post.
Coming out of the incident, the wallstreetbets stuff,
actually the first initiative I took on was, hey, we're starting
to see a lot of coordination needed among a bunch of teams.
And this can't be the same three or four people.
We need to train more people to, to find these connections and to
make them.
And so we now have an incident commander rotation internally

(42:18):
comprised of individuals that we would like to expand to the
broader org, which is the idea that anyone can become an
Incident Commander.
It's not a technical, it's more of a non-technical role if
anything.
And these folks are there to facilitate communication,
coordination.
The biggest value I see is safety: help incident responders
feel safe, especially as we are, as an org, scaling very rapidly.

(42:38):
And our on-call model is service owners are in charge of their
services and, and of being on call, but sometimes services talk
to each other and other services and other teams.
And sometimes, you know, you're on call and if you are paged, we
want to make sure that there is a safety net for people to come
in.
And so we've been running this program for a few months now and
starting to collect some data on how useful it is, right?

(42:59):
How much safer do people feel?
How much quicker do we resolve incidents?
The stated goal is these people are there to help reduce
severity and duration of incidents and, and help folks
find what they don't know.
I think if we had the, the program in place for
r/wallstreetbets, I'm not sure that things would have
necessarily resolved more quickly, but I do think
psychologically folks would have felt more comfortable and cared

(43:22):
for in a very stressful time.

Courtney Nash (43:25):
Well, thank you all.
First of all, for writing all of these up. It is an incredible
investment of time, but it seems pretty clear to me the payoff
in terms of what you've learned has been well worth it.
I hope you will continue to do more of that because you're
going to have more incidents.
I wish I could tell you you're not... Fran's shaking his head.

Courtney Wang (43:43):
We have had our last incident, officially. Just
going to announce that.

Courtney Nash (43:47):
Oh, you're done!

Fran Garcia (43:49):
We had a meeting.
We all agreed that that was it.

Courtney Nash (43:52):
Cool.
Good.
I'm glad to hear that.
I'm sure everyone else will hop on that train too.