
November 1, 2021 • 31 mins

"We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be."

In early 2021, observability company Honeycomb dealt with a series of outages related to a Kafka architecture migration, culminating in a 12-hour incident, an extremely long outage for the company. In this episode, we chat with two of the engineers involved in these incidents, Liz Fong-Jones and Fred Hebert, about the backstory summarized in the meta-analysis they published in May.

We cover a wide range of topics beyond the specific technical details of the incident (which we also discuss), including:

  • Complex socio-technical systems and the kinds of failures that can happen in them (they're always surprises)
  • Transparency and the benefits of companies sharing these outage reports
  • Safety margins, performance envelopes, and the role of expertise in developing a sense for them
  • Honeycomb's incident response philosophy and process
  • The cognitive costs of responding to incidents
  • What we can (and can't) learn from incident reports

Resources mentioned in the episode:


Published in partnership with Indeed.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Courtney Nash (00:31):
I'm your host, Courtney Nash, and welcome to the inaugural episode of The VOID Podcast. Today I'm joined by Liz Fong-Jones and Fred Hebert of Honeycomb. We are going to be talking about a Kafka-related multi-incident report that they recently published. And I'll start by asking Liz: what motivated you all to write this meta report in the first place?

Liz Fong-Jones (00:56):
We are a company that is very transparent, and we try to be candid both about our engineering successes and our engineering failures. It's part of our commitment to really foster this ecosystem in which people openly discuss incidents, what they learned, and how they debugged them, because that is our bread and butter. Our bread and butter is helping you debug incidents better. Sometimes that's with Honeycomb the tool itself, and other times it's with the lessons that people learn from our own explosions.

Fred Hebert (01:25):
From my perspective, there's a different alignment with that, and this sort of idea that there's a trust relationship between us and our customers and users. We have the huge pleasure of having a technical audience and set of customers. If we see ourselves as going down or being less reliable, it's a way to retain trust to be very transparent about the process, and hopefully make it something where the customers themselves appreciate being able to understand what went on. And to me, the ideal thing would be that they're almost happy when there's an incident at some point, because they know there's going to be something interesting at the end of it.

Liz Fong-Jones (02:03):
Honeycomb at this point doesn't tend to go down because of the easy things; Honeycomb tends to go down because of the hard things, and therefore they're always noteworthy to our customers. It's also an interesting signal for people who are choosing whether to build or buy an observability big data solution, whether they should really be building it themselves. And certainly, seeing the ways in which our system has failed is a lesson to other people: "you probably don't want to do this at home."

Courtney Nash (02:29):
Can you say a little bit more about why it is that your failure modes or incidents are not the garden variety? Because people might think, "Oh, Honeycomb is so sophisticated, they shouldn't have really crazy outages or really big problems."

Liz Fong-Jones (02:46):
The simple answer is that we have processes in place. We have automation in place that ensures that the typical ways a system might fail at first glance, pushing a bad deploy, having a customer overwhelm us with some kind of traffic, those kinds of failure modes we can deal with automatically, or deal with in a matter of seconds to minutes. So any kind of longer-lasting performance impact or outage is something that is hard to predict in advance, and that normal automation and tooling is not going to do a good job of dealing with, because it's the "if it were simple, we would have done it already" kind of thing. Whereas we know the situation that a lot of companies in the software industry find themselves in is that they have a lot of technical debt. They are not necessarily doing the simple things, and they need to do the simple things first. In our case, we've been proactive in paying down the technical debt, and that's why we're in this situation where most of our failures are interesting failures.

Fred Hebert (03:49):
The complexity of a system increases, and it always has to, just to be able to scale up. Usually the easy, early failure modes are at some point stamped out rather quickly, and then the only things you're left with are the very, very surprising stuff that nobody on the team could predict at the time. Those are what's left: fuzzy, surprising, interesting interactions causing these outages. So it has to do with the experience of the team, the engineering team and the team in general, and what can surprise them. And for Honeycomb, our engineering team is pretty damn solid, and so we're left with some of these issues. This incident specifically is interesting because the trigger for it is what could be considered just a typo or a miscommunication, the kind of stuff that you usually don't want to see in a big incident like that. But for us, it was important to mention that this was the starting point of the thing to some extent, or the direct trigger of it.

Liz Fong-Jones (04:49):
The other interesting thing about this incident is that we did it in the course of trying to improve the system and make it more reliable, which was particularly ironic.

Fred Hebert (05:00):
So every couple of years, or at a given time, we need to scale up the Kafka cluster that's at the center of the ingestion pipeline we have for Honeycomb. And every time, there's also a bit of a re-evaluation that comes through: what are the safety margins that we keep there, which could be the retention buffer in Kafka that we have, how many hours, how much space? How do we minimize the cost while keeping it safe, and all of that?

By the end of last summer, in 2020, Liz and a few other engineers, before my time joining the company, had started this project of looking into the next generation, the next iteration, of the Kafka cluster. And starting around December/January, until the spate of incidents that was February to March, I think, the changes were being put in place to change the version of Kafka that we run, to go for something that Confluent does that handles tiered storage, where instead of storing everything locally on the device, it sends a bunch of it to S3 and keeps a smaller local buffer.

For us, the major scaling point was always the disk size and disk usage on the instances that we have in our cluster; we have something like 38 instances. Moving to that would let us do the same amount of throughput, with more storage and better safety, on something like six instances, which could be cheaper to run. And we would move away from something where the disk is the deciding factor to something where we can scale based on CPU, based on RAM, and disk is no longer that factor.

So we were going to update the Kafka software, change the options that were in there, adopt the Confluent stuff, and try to improve this tech at the same time, because it's a big, complex, critical thing. All the small updates, you wait to do them until the big bang deploy, because it's scary to touch it all the time.

The spate of outages came from all these rotations and changes that we were slowly rolling out over the course of multiple weeks while regaining control of the cluster. In some cases it had to do with rotten tools; it had to do with confusing parts of our deployment mechanisms and stuff like that, which is not necessarily in the public reports, because it would not be useful for all our customers to know about that. And then we had the biggest issue, which was a 12-hour outage: the Kafka cluster going down and out of capacity because we picked the wrong instance type for production.
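To make the "disk is the deciding factor" reasoning concrete, here is a minimal back-of-the-envelope sketch in Python. The instance counts, disk sizes, ingest rate, and replication factor are hypothetical placeholders rather than Honeycomb's actual figures; the point is only to show how local disk, write rate, and replication determine how many hours of retention buffer a cluster can hold, and why offloading older segments to S3 with tiered storage moves the scaling constraint to CPU and RAM instead.

```python
# Back-of-the-envelope Kafka retention math (all numbers invented for
# illustration; they are not Honeycomb's real figures).

def retention_hours(total_disk_tb, ingest_mb_per_s, replication_factor,
                    fill_fraction=0.8):
    """Hours of log a cluster can hold on local disk.

    fill_fraction: how full we are willing to let the disks get; the rest
    is slack (part of the safety margin) for surprises.
    """
    usable_bytes = total_disk_tb * 1e12 * fill_fraction
    write_rate = ingest_mb_per_s * 1e6 * replication_factor  # bytes/s incl. replicas
    return usable_bytes / write_rate / 3600

# Local-disk-only cluster: every extra hour of retention buffer costs disk
# on every broker, so disk is the dimension you end up scaling on.
local_only = retention_hours(total_disk_tb=38 * 1.9, ingest_mb_per_s=400,
                             replication_factor=3)
print(f"local-only cluster: ~{local_only:.0f} h of retention")

# Tiered storage: only a small local "hotset" has to fit on the brokers'
# disks, while older segments live in S3, so total retention is bounded by
# object storage and the (fewer) brokers are sized for CPU and RAM instead.
hotset = retention_hours(total_disk_tb=6 * 1.9, ingest_mb_per_s=400,
                         replication_factor=3)
print(f"tiered cluster: ~{hotset:.1f} h held locally, older data in S3")
```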

Liz Fong-Jones (07:26):
The other interesting bit about this, of course, is that we had successfully deployed all of these changes in a pre-production environment over the course of October, November, and December, and none of these issues surfaced in the smaller dogfood cluster. They only really surfaced once we started doing that production migration. So we talk a lot about this idea of "test in production, test in production!" This doesn't mean that we don't test in pre-production, but this set of outages really exemplifies that no pre-production environment can be a faithful reproduction of your production environment.

Courtney Nash (07:58):
The very things that you just described there, Fred, could be considered technical, but there's a whole bunch of sociotechnical things in there, right? Like decisions and forces and pressures, and fear of big bang changes, and then you batch everything up. Those are very much socio-technical decisions. I don't see this language a lot in incident reports, and I would love it if you could talk a little bit more about what you mean by "safety margin." Are those codified? Are those feelings? What is a safety margin at Honeycomb?

Liz Fong-Jones (08:27):
This is why I'm so glad that we brought Fred on board. Fred joined the company very, very recently, and this is Fred's language, right? Socio-technical systems, safety margins.

Fred Hebert (08:38):
Right, right. So in this case, it's a thing I noticed in one of the incidents. In fact, it was one of the near incidents that we had; it wasn't an incident. And it was the most glaring example of that I have seen here at a company. We had this bit where we were deploying the system, I can't even remember the exact cause of it, but at some point the disk on the Kafka instances started filling up, and so we were reaching like 90%, 95%, 99%. It was not the first near outage that we had. Liz was on the call and had the idea of saying, "Run this command. It's going to drop the retention by this amount of time and this size." At 99-point-something percent of disk usage on the Kafka cluster, we managed to essentially cut back on the usage, drop the retention, and avoid a disaster of the entire cluster going down because all the disks were full, which would have been a nightmare.
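The "drop the retention" move maps onto Kafka's per-topic retention.ms setting: lowering it lets brokers delete the oldest log segments and reclaim disk immediately, at the price of the replay buffer discussed next. The sketch below is a hypothetical illustration of that trade-off, not the actual command or numbers from the incident; the ingest rate, replication factor, and retention values are invented, and the kafka-configs invocation in the comment is a generic example with placeholder names.

```python
# Hypothetical illustration of the trade-off described above: shrinking
# Kafka's per-topic retention frees local disk right away, but also shrinks
# the replay window that was serving as a safety margin. All numbers are
# invented for the example.

def retention_tradeoff(ingest_tb_per_hour, replication_factor,
                       retention_h_before, retention_h_after):
    """Return (approx. cluster disk freed in TB, replay hours given up)."""
    hours_dropped = retention_h_before - retention_h_after
    freed_tb = ingest_tb_per_hour * replication_factor * hours_dropped
    return freed_tb, hours_dropped

freed, lost = retention_tradeoff(ingest_tb_per_hour=1.0, replication_factor=3,
                                 retention_h_before=20, retention_h_after=3)
print(f"frees ~{freed:.0f} TB of cluster disk, gives up {lost} h of replay buffer")

# Operationally this is a topic-level override of retention.ms, applied with
# standard Kafka tooling (placeholder names; not the exact command from the
# incident):
#
#   kafka-configs --bootstrap-server <broker:9092> --alter \
#     --entity-type topics --entity-name <ingest-topic> \
#     --add-config retention.ms=10800000   # 3 hours
```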

Liz Fong-Jones (09:34):
It's trading one margin, one mechanism of safety, for another, right? We traded off having the disk buffer, but in exchange we lost the ability to go back in time, right? To replay more than a few hours of data, whereas previously we had 20 hours of data on disk.

Fred Hebert (09:49):
Right. For me, that's the concept of a safety margin. We have that thing where ideally we have 24 to 48 hours of buffer, so that if the core time series storage has an issue where it corrupts data, then we have the ability to take a bit of time to fix the bug and replay the data and lose none of the customer information. For me, that's a safety margin, because that 24 hours is the time you have to detect an issue, fix it, roll it out, and replay without data loss. And so when we came to this near incident where almost all the disk was gone and we were facing a huge availability issue, that trade-off was made between the disk storage and that margin of buffer: given we're at 99% right now, 95%, the chances are much, much more likely that we're going to go down hard on missing disk than that we're going to corrupt data right now. So you take that extra capacity on the disk, in this case quite literally, and you give it to something else; in this case, it was to buy us two or three hours so that we can understand what's going on, fix the underlying issue, and then go from there. And so rather than having an incident that goes from zero to a hundred extremely quickly, we have three hours to deal with it.

This speaks to the expertise of the people operating the system, and that they understand these kinds of safety margins and measures that are put in place in some areas. Those are essentially anti-optimizations, right? The optimization would have been to say, "We don't need that buffer. We only have one hour and that's about it," and then you move on. But there are inefficiencies, sort of pockets of capacity, that we leave throughout the system to let us cope with these surprises. And when something unexpected happens, people who have a good understanding of where all that stuff is located are able to tweak the controls and change the buttons and turn some demand down to give the capacity to some other purpose. That's one of the reasons why, when we talk about sociotechnical systems, the social aspect is so important. Everything that's codified and immutable is tweaked and adjusted by the people working in the system, and I try to be extremely aware of all of that taking place all the time.

Courtney Nash (11:56):
There's one phrase in the report that caught my attention, and it's what you're describing here, and I just want to ask a little bit more about it. I'm going to read the phrase out. You wrote: "We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be." I'm really curious who was involved in those conversations. You say "we"; was that you, Liz, one other person, three other people? Take me back into those conversations a little bit, and how you reached that point of not feeling confident. What did that look like, to get to that point?

Liz Fong-Jones (12:29):
I think going through the cast of characters is really interesting, because I think that's how we wound up in this situation, right? So originally, two years ago, we had one Kafka expert on the team, and then I started doing some Kafka work, and then the Kafka expert left the company. So it was just me for a little while. And then the platform engineering manager made the decision of, okay, we think we're going to try this new tiered storage thing from Confluent; let's sign the contract, let's figure out the migration. And we thought we had accomplished all of it. And then we had one engineer from our team, Martin, sign up to finish the migration: it was already running in dogfood, make it right in prod. And then, when we started having a bunch of incidents, that's when it took Fred, Martin, the platform engineering manager, and me, all four of us, sitting down together to figure out where we go from there.

Fred Hebert (13:20):
Yeah, and we had extra help from other people just putting their heads together: Ian, Dean, who are on the team, if ever they're listening to this. At some point the incidents last a long time, and people lend a hand and help each other with that. But there was this transfer of knowledge, and sometimes pieces fall on the ground, and that's where some of the surprises come from.

But the lack of confidence, in one aspect, is just this idea that people are tired of the incidents, and you kind of see that easily: the fact that you no longer trust what you know, that the pre-prod environments are not reliable, that we don't know what's going to happen every time we touch it, this feeling that something explodes. And for me, it's not something for which you can have quantitative metrics, right? It's something you have qualitative metrics for, which is: do you feel nervous around that? Is this something that makes you afraid? Getting a feeling for the feelings that people have towards the system is how you kind of figure that one out. Someone who feels extremely confident is not necessarily going to take extra precautions for the sake of it, but everyone was sort of walking on eggshells around that part of the system, and feeling, or self-imposing, pressure about the uptime, because people take pride in the work they do.

Courtney Nash (14:33):
The feelings part is really interesting, right? And how people build up expertise in these kinds of systems and start to understand those boundaries much more, sometimes very intuitively, right? In the writeup, you mentioned some near misses, and that those ended up being good training for the bigger incident, that big 12-hour incident that eventually happened. But most people don't say, "Yay! We had an incident, but we learned something from it." Can you talk a bit about some of those near misses, and that aspect of them being training for the bigger incident?

Fred Hebert (15:04):
The probably obvious one is the 99% disk usage one, which, you know, Liz sort of saved the day on by chiming in and coming through. But we had other ones that had to do with a bad deployment, where the deploy script we thought we had didn't work exactly the way we thought it did and caused this sort of cascading failure, or near failure. In this case, I recall people were on a call, specifically Martin and Dean, and then a few other onlookers. I came in and watched, but Liz made a point of remaining back on chat and monitoring the update/status situation, and this was a great move, because there's also a sort of semi-explicit policy to avoid heroes at the company, one person concentrating all the knowledge. In the first one, where we had the 99% near outage, Liz was there to explain to Martin and Dean the sort of commands to be running. And when the other incident happened, Liz stayed back and let them do their thing, while still keeping an eye from afar, without coming in and saying, "I know how to fix this one; I'm going to fix it for you." For me, that was a sort of transfer of knowledge that happened over the course of multiple incidents, and it was only this visible because they were so close together.

Courtney Nash (16:17):
Liz, I think that's a really interesting and very obviously conscious choice about the role that you played, allowing the people who run the systems to understand the properties of their systems. Can you talk a little bit about what incident response looks like at Honeycomb, how that's structured, and how you all tend to handle it in general?

Liz Fong-Jones (16:41):
Yeah, incident response at Honeycomb is particularly interesting, because we expect all engineers to be responsible for the code that they push: if something that you pushed breaks production, you are the immediate first responder. Now, that being said, we also do have people who are on call at all times. There are typically two to three people on call for Honeycomb at any given time, so those are kind of the first people who will jump in and help, and also the first people to get alerts. And then after that, it's: are we declaring an incident? Who's going to become the incident commander? And then, how do we delegate and assign the work so that people are not stepping on each other's toes?

At the time that both of these incidents broke out, I was neither the person who proximately pushed a change nor a person who was on call, and therefore it was a matter of coordinating and saying, "Hey, can I be helpful here? What role would you like me to play?" In the first one, I said that this looks like it is on the brink of going really, really bad; here's a suggestion, would you like to do that? And the team decided to go with it. It's not like I stepped in and ran the command silently and no one knew what happened. What's interesting is that for the second one, I was actually not available that day. I was not available to hop on call; I was really tired, in fact, from the work in the previous incident. So part of me staying out of it was not my choice, and part of it was, "the team hasn't asked for my help, so I'm just going to keep an eye on things and just make sure that nothing is about to go off the rails." And other than that, we have some safety room in our systems for people to try to solve the problem in whatever way they deem acceptable, even if it's not necessarily the most efficient one.

Courtney Nash (18:14):
You mentioned something really important in the report in a couple of places, which is also not something I see in almost any other incident write-up, and you've alluded to it, Liz. "Cognitive costs" is the phrase you use, Fred: when people experience these kinds of cascading or repeating or related or close-together incidents, people get tired. These are stressful things; like you said, Fred, people are deeply invested and take pride in their work. And it has all of these other kinds of costs that aren't just technical. Liz, you've alluded to that for yourself personally, but Fred, maybe you could talk a little bit more, too, about broadly what those cognitive costs looked like for the team over the course of this set of incidents.

Fred Hebert (18:53):
For me, the basic perception there comes from the learning-from-incidents community and this idea of blamelessness, and not necessarily what I like to call shallow blamelessness, where you say "we don't blame someone," and what it means is that we don't do retribution against anyone, but we still assume that all the mistakes come from someone fucking up at some point. You have to take the perspective that people come here to do a good job. They were making decisions that were locally rational based on what they were perceiving. And so a lot of these investigations, for me, are oriented around the idea of: what were the signals people were looking at? What made this look reasonable? What was the situation they were in? Because that's what influences how they interpret things as they happen.

So the cognitive cost, for me, in some cases is this idea that when you have to make these decisions: how many things do you need to keep track of? How many signals are coming through? What is noise? What's your capacity to deal with them? Are you busy doing something else? Are you tired? Because it all has an impact on the quality of the work we do and the kinds of decisions we make, and on whether we can keep track of all of that. So for me, the cognitive cost is that burden of keeping everything, and all the interactions, in your mind.

Liz Fong-Jones (20:03):
And that also goes to the question of how many people are on call for Honeycomb, because originally the answer was one, and then it grew to two, and then it grew to three, because the surface of Honeycomb increased such that you could no longer have one engineer remember how all of the pieces of the front end work, how all the pieces of the backend work, how all the pieces of our integrations work. That's why we divided and conquered the problem space at Honeycomb.

Fred Hebert (20:26):
Everyone has their own mental model of how the system works, and a mental model is never up to date; it's never perfect. It's based on your experiences, and the mental model is how you make predictions: "I'm seeing this happen, and by my understanding, this could be the cause of this, or this could be the sort of relationship with the other components." The cognitive burden is also the capacity of tracking that mental model, how complex it needs to be to make good predictions. It doesn't necessarily need to be very complex, but the moment it becomes outdated, the signals you see are no longer interpreted to mean what is actually going to happen in the system. And this is normal, right? The idea is that all of this, this sort of drift, is normal. So for me, that all speaks to the cognitive burden there: everything is too complex to understand, there are many things going on, people are tired. What were they seeing? How were they interpreting it? And rather than asking "How could we have prevented this incident from happening?", the question becomes: how do we change the conditions so that next time something as surprising happens, we either come at it with a different preparation, or the signals are made more legible to the people operating the system?

Liz Fong-Jones (21:33):
I love what Fred said there about the difference between the sophistication of the model versus the freshness of the model. I think that that is something we talk about all the time when we think about Honeycomb's product philosophy and design, which is that Honeycomb shouldn't get in your way, Honeycomb shouldn't make decisions for you. You need to be in the driver's seat; you need to be developing that mental model yourself. Otherwise, if we take agency away from you as an operator, then you are going to do a worse job of operating the system over time, because you don't have exposure to the context and the signals to know when your mental model is out of date.

Courtney Nash (22:10):
This conversation brings to mind two pieces of scholarly work that we don't have a lot of time to get into, but I'll drop some resources for listeners in the list of resources for this podcast. One of which is Laura Maguire, she's a researcher at Jeli now, whose work was on managing the hidden costs of coordination, which is what you were sort of talking about, Fred. It's not just the individual cognitive load and cognitive costs; when the surface area of your systems becomes more complex, you need more people involved in incidents, and you need all those people and their mental models, and you have to now deal with all of that. So it's this whole other system on top of the system. And then the other piece is Richard Cook, a researcher who's spent a lot of time on complexity and safety and all these kinds of systems, who talks about above-the-line and below-the-line thinking. I'll drop some resources related to that for anyone who wants to dig into it a little bit more, because that speaks to the notion of what we think the system looks like, and how humans step in and fill gaps and make things work when things go wrong.

There's one more thing I really wanted to get to about this. Well, there are two. One was something you alluded to, Fred; you said, "Oh, well, here are some details that our audience doesn't need." What I thought was really interesting about the sort of meta writeup you did is that there are two versions of it, by the way, for folks: if you spelunk the one that I'll link to, there's a link to a much longer version that has a lot more of the engineering details and that engineering background. But that's still two degrees removed from what might be known internally at Honeycomb, obviously. I would love to get your perspective on how you write these and who you're writing them for. I think we all know that what's available in public write-ups is not the whole story, obviously. I think yours is much more of the story than almost anyone else ever gives us, but I'd love to get a little bit of your take on how you all approach that.

Fred Hebert (23:54):
Yeah. Initially this was a joint report on multiple incidents, and the reason for that is, first, economical: we could have done four or five incident reports on near misses and everything like that, but everyone felt that would be taking a lot of time. So the task I took on as site reliability engineer was to make this one overview of what the decisions were that were made throughout the project, the kinds of things that were happening, figure out what the surprises were that we had, and make a sort of inventory of the lessons learned. In that case, being a bit of a retrospective, it's a bit like being a project historian: I go and dig into the chat logs and the older documents, try to rebuild that context, see what happened, and build it up that way.

And in that way, the internal report is very much for the internal audience: we felt that the deployment system worked this way, here's how it works instead, here are the fixes that we did about this one, and this sort of typology of surprises that we might see. The things about the incident itself that you see in the public report were all already in the private one, and picking the audience then is a question of "What are the things they might be able to learn from that?" I tend to go very, very much in depth. I think right now the full report that was made public is 13 pages; the internal one was something like 26 or 27 pages. And for me, that's kind of my personal challenge: I write and I talk a lot, and so it needs to be trimmed down, because there's this balance between how much information you want to put in there and how much attention people have to give it to get the stuff that's really important out of it.

Liz Fong-Jones (25:36):
The question basically is: how much of this is going to be similar to what someone else might see, right? Our build pipeline is bespoke to us; our deployment tooling is bespoke to us. It doesn't make sense to talk about the details of how our mental model of that is built up, because no one else has a mental model of it. Whereas everyone in the world who operates a streaming data solution understands that they have a Kafka, and they need to hear about how the Kafka works.

Fred Hebert (25:59):
And the interesting exercise here is that the full public report is 13 pages, but then there's the blog post, which is not even a third of that; it's just under 2,000 words. And so this one is even more boiled down, which is: what is the most interesting thing about this spate of incidents that people who have like 15 minutes are going to get out of it? It's an interesting exercise, because it forces you to figure out, okay, what's really the core thing I would like someone to remember from this? In our case, it was this idea of the shifting dynamic envelope of performance that lets you predict how a thing behaves. For the full report, there's the interesting stuff about the bugs we've seen, the issues we had with some of the Confluent stuff, with some of the processors that we have, the EBS drive issues that we encountered. In the internal one, there's this focus, or this kind of approach, of: here's how we can build or improve our own tooling. It really depends on the audience, right? Not all the same facts are relevant to the same people, depending on where they're interpreting from.

Courtney Nash (26:58):
The last thing I wanted to discuss is this culture of sharing. I know you've mentioned this in the context of your customers and wanting your customers to understand what happens when you have incidents, but maybe you could talk a little bit, Fred, about your perspective on the importance of these kinds of reports for the software and technology industry as a whole. I'd like to know: should everybody do this? What do you think?

Fred Hebert (27:23):
I think more of everybody should do this. For Honeycomb, we have this interesting fact that it can kind of line up with some of our technical marketing, so it's easier for us to do than in a lot of places. But I think there's a lot of value there. The tech industry, in my mind, is really keen on commoditization and externalization, whether it is of components or of expertise. So everyone uses these frameworks, assuming that people get the knowledge of how to operate them in the wild, in their free time, at previous employers, and then just go around and bring that to the table when they do. This is deeply entrenched in the tech industry. And so for me, part of it is that the way tech industry workers have gotten around that is to have these parallel systems where they do share the knowledge: the types of conferences that we have, the blog posts, and whatnot. So having these reports, for me, is that sort of idea that we're benefiting from that commoditization, and it should be normal to give back and share some of that knowledge with everybody else. Because, you know, we haven't had to write Kafka, we're using Amazon for a lot of components, and we're making drastic savings on a lot of open source projects that people are usually working on in their free time. It's only fairness to return the knowledge to other people as well.

Courtney Nash (28:43):
So there's the fairness aspect, which I think is incredibly important, and a rising tide lifts all boats. Software is running so much of our world now, and some of that is incredibly safety-critical: healthcare systems, financial systems, voting systems, God help us all. Much like other industries, I think notably the airline industry, which took on this mantra of sharing this information, of being transparent, and not just transparent for transparency's sake, but because it was going to increase the safety profile of everyone by doing so.

Liz Fong-Jones (29:21):
It definitely, though, was a place where they had to offer certain protections, right? Like, when you report an aviation safety incident to the NASA system, you are protected from action by the FAA. I think that that's a huge thing in getting people to self-report, and that's kind of where Honeycomb can go first, because it's a competitive advantage to us to self-report, and we do have a culture where engineers speak up freely, even if we're going to publish details afterwards, because they feel safe from retaliation, they feel safe from blame. And that's not necessarily the case everywhere; that's something that we, as an industry, are going to have to work on.

Fred Hebert (29:57):
There was recently an article about a big company having an incident, and the entire thing is blaming the one worker for not respecting procedure. And there's this super interesting paper that I know you can give a reference to in the show notes: "Those found responsible have been sacked: some observations on the usefulness of error," by Richard Cook. It mentions that, as an organizational defense, the idea of error, and human error specifically, is a kind of lightning rod that directs all the harmful stuff away from the organizational structure and into an individual. And so the organization is able to sort of ignore all the changes it would have to make in terms of operating pressures, and just say, "Oh, this was a one-off, and next time we're going to respect the procedure harder." Having the ability for us to also put reports out there in the wild that act as good examples of what we think is a humane, respectful, and helpful report can help counteract these, I would say, directly bad ones that smaller companies or people might otherwise emulate just because this is what the big companies do. So there's this importance of putting it out there and giving a positive example in doing that.

Courtney Nash (31:04):
I think you've done exactly that. It's a long road to get companies, especially much larger ones in highly regulated and all those kinds of environments, to move in this kind of a direction. I want to thank you both for having that approach, for sharing it with us, and for joining me today. Thank you both so much.

Fred Hebert (31:21):
Yeah. See you at the next incident, because they're going to keep happening.