Mastering SRE: Insights in Scale and at Capacity with Aimee Knight

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:07):
So, Warren, what's been your experience with SRE?

Speaker 2 (00:10):
That's quite the way of just jumping into the episode,
and I was so unprepared because I'm not even the
guest for today's episode, so I don't have
an answer to that question.

Speaker 1 (00:21):
Honestly, yeah, for sure. Jillian, what about your thoughts
on SRE? You're a contractor. You just light stuff on
fire, and then if they want it put out, you
bill them for it, right?

Speaker 3 (00:33):
I mean, really, I just need to make sure
data doesn't get deleted, because that
is a problem. Okay? And that's really all
that I worry about.

Speaker 1 (00:42):
It's security reliability, because you can't be exploited if
there's no data left.

Speaker 3 (00:47):
I mean, that's true.

Speaker 2 (00:49):
I will say that there is a book
out there published by one of the hyperscalers, I believe,
that attempts to define it in their own terms. But
I feel like, like a lot of things they do,
it's used as an introduction to the topic rather than
used by experts. So I'm really interested to hear what

(01:10):
our guests today will be sharing with us.

Speaker 1 (01:14):
Well, yeah, which is a great segue, because clearly we
need some expert opinions on SRE, and so we've got
Aimee Knight with us. Aimee, welcome to the show.

Speaker 4 (01:26):
Happy to be here.

Speaker 1 (01:27):
I'm excited to have you here. So tell us a
little bit about your background and how you got to SRE.

Speaker 5 (01:36):
Yeah.

Speaker 6 (01:37):
I love SRE.

Speaker 4 (01:40):
It's a little bit of a love-hate relationship. It
can be very stressful at times, but I also love it.
I'll get into my
path into SRE, but it really suits
my engineering brain perfectly, I feel like. And the high level

(02:00):
there is: I come from a non-traditional background, so self-taught,
did the whole boot camp thing twelve years ago now,
which is insane to me.

Speaker 5 (02:11):
I'm really old, or feel really old.

Speaker 4 (02:16):
Like, I saw someone on Twitter the other day talk
about, what did you use before Git? And I
was like, oh my gosh, I'm really old now, because
at my first job we used Mercurial, which I guess is
not it, but at the same time like it.

Speaker 6 (02:29):
Yeah.

Speaker 4 (02:30):
So, anyways, backing up: I used to
do JavaScript Jabber, so I kind of started off in
JavaScript land, did Node.

Speaker 6 (02:43):
And all the front end.

Speaker 4 (02:47):
I kind of focused heavily on front end for a
little while, did Node as well, and then I think
it was like twenty eighteen, twenty nineteen, I went to
work at NPM, which was like my dream job at
the time, until they got acquired. And, you know, it's
just a little bit of a bummer because it was
very short-lived. But when I was there, we did

(03:09):
cross-functional teams, and my manager at the time was like,
you seem really interested in this platform, SRE,
infrastructure work. We can probably hire for a web developer
more easily than we can hire for that role. Do
you want to go over to that team? And I
was like, yes, sign me up. Like,
you know, if I have the opportunity to learn a

(03:32):
new skill on the job, like, yes. So I did
SRE at NPM for a while, and then I did
most of my SRE work at Paramount Global, which is
like the streaming application, and did two Super Bowls there. Well,

(03:52):
the last one, last year's Super Bowl. And so that's kind
of my journey into SRE. And why I say
I feel like SRE is really my
perfect niche in engineering is because, coming
from this non-traditional background, I've always been
the programmer who is like, sure, that new tech is

(04:15):
really fun and interesting, but shouldn't that be something
we're working on on the weekends, on our own time?
Like, isn't our job... I hate saying this because
people hate me, but at the end of the day,
I feel very fortunate to have
a career in tech, and in the back of my mind,
I'm always like, my job, whether I like it or not,

(04:36):
like, if I'm working hours, my job is to make
this company profitable. And while I love programming for fun
on my own, I have a responsibility to the
business to make the business profitable, keep it alive.

Speaker 6 (04:50):
Working at a lot of startups, so I

Speaker 4 (04:52):
understand this, and that's where SRE really fits in, because
in a lot of roles, yes, you're accountable to
the business, but I feel like in an SRE role,
through the SLAs (we can get into SLAs, SLIs, SLOs),
you are accountable to the business stakeholders,
and that is your overarching goal. So basically, in a nutshell,

(05:15):
SRE, site reliability engineering. I also call it, like, production
readiness engineering. It is your task to keep the application
running optimally for as much uptime as possible.
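
A minimal sketch of the arithmetic behind the terms mentioned here, not something described in the episode: an SLI is a measured value (such as the share of successful requests), an SLO is the target for it, and the gap between the SLO and 100 percent is the downtime the team can afford. The function names, the 99.9 percent target, and the 30-day window are illustrative assumptions.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a window.

    The 30-day default window is an assumption for illustration.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def measured_availability(good_requests: int, total_requests: int) -> float:
    """The SLI: percentage of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * good_requests / total_requests


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO over 30 days.
    print(round(allowed_downtime_minutes(99.9), 1), "minutes of downtime budget")
    print(measured_availability(999_412, 1_000_000), "percent measured availability")
```

Comparing the measured SLI against the SLO is what tells the team whether it is still inside the commitment written into an SLA.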

Speaker 2 (05:31):
So what I heard was that you at NPM had
so much work and were doing such a great job
that they're like, here's more work. Be responsible for the
success of the business and not just solving tickets and
pushing out features.

Speaker 4 (05:47):
No, but honestly, truly, my manager there, he was
absolutely amazing, as were a lot of the people that
I worked with, absolutely brilliant, talented people. And I don't know,
I'm like a distributed systems nerd. On the weekends, I
like to watch YouTube videos of distributed systems interviews. Like,

(06:08):
I don't know why, this is very interesting to me,
but I've always like just gravitated towards that sort of thing.

Speaker 2 (06:16):
So you don't have to shame yourself for that. I
think lots of us on the weekend, Yeah, we're watching
interviews and other tech things. Yeah, for sure.

Speaker 4 (06:25):
Whenever there's a big outage at a big
tech company, I'm always reading the postmortems and
trying to talk about them with
my husband, and he tries
to act interested, but he's just like,
I love how nerdy you are. Like, okay, I'll just
talk to myself then.

Speaker 1 (06:45):
I think that's a super interesting
point, where you talked about how you were working
on JavaScript development and then you went over to SRE.
And I think that's a really important skill to have for
that because, as the SRE, it's not just
about the infrastructure or the Datadog metrics or whatever,

(07:07):
like, a lot of times you've got to dig into
the code and discover what's actually going on here.
And so how do you feel like those web
development skills set you up for success as an SRE?

Speaker 4 (07:23):
Yeah, that's a really good question, and it's a
common question I get from people who are also interested
in SRE. And I also feel like, you know,
backing up a little bit on SRE, I know Warren mentioned
the Google SRE book, but I also feel like
this focus on SRE really came about via microservices. Buzzwords.

(07:46):
I hate saying them, but.

Speaker 1 (07:48):
It is what it is.

Speaker 4 (07:51):
But so to kind of answer your question there, I
feel like the teams that I've been on, not necessarily NPM
but Paramount specifically, had a lot of systems engineers.
The team was very systems-engineer heavy, so a
lot of these people had worked in data centers,

(08:12):
and those are all obviously very valuable skills as an SRE,
but where they were really lacking was the programming
side of things, the application development side of things. And I
feel like to have a really good SRE team, you
do need a mixture of both, and you can really
learn from each other. So in the SRE, like,

(08:32):
in SRE roles that I've been in, I am doing
some programming. I mean, you need to be able
to debug the program. You're also doing some programming in,
you're doing, like, scripting. A lot of SRE, I don't
think people realize, is a lot of caching at
the edge, and so you're doing a lot of custom

(08:53):
logic that gets deployed to your CDNs and things like that.
So it might not be as intensive as building
out a feature and stuff like that, but you still need
those programming concepts to write that kind of code, and
then also to work with the developers, because, you know,
for better or worse, there is, like, that's the

(09:14):
telltale thing that people say about SRE,
that push and pull with the engineering department,
because they have a vested interest in getting their stuff
out to production, and then SRE has a vested interest
in keeping the system healthy and functioning.

Speaker 2 (09:28):
Yeah, for sure. I mean, you did mention that it
could be different in different places. So from your experience,
what has being an SRE really meant? Is there
a standard day? And I know that's a terrible question
to ask, you know, any sort of engineering technical person,
but, you know, some expectations around what the role
is that you were doing both at NPM and Paramount.

Speaker 6 (09:48):
Yeah.

Speaker 4 (09:49):
So NPM, obviously, it was a relatively
small team. We had the SRE team and,
like I said, it was
cross-functional. So we were building out features, and a
team was a cross-functional team where we had
a platform engineer, we had an SRE infrastructure engineer, we had
an application developer, all working together to get something

(10:13):
out to production. And on the SRE side of things,
like I was saying, I talk about
SRE in terms of production readiness, so making sure
that this feature is ready for production: if there are
dashboards that need to be made, making sure that
those are in place so that we can properly monitor
when it goes to production; making sure that you have

(10:38):
the proper resources for the machines that it's going to
be running on, things like that. At Paramount, you know,
this is a much larger... NPM has a very large scale,
but it was extremely bursty. And when I say
extremely bursty, the traffic would spike, but it would

(11:00):
be very short-lived. At Paramount, similar but not quite
to that extent. Bursty, as in you're going
to see a high rush of traffic when a large
event kicks off. It'll stay relatively high, but it kind
of starts to taper off as the events go
down and things like that. But Paramount SRE, like a

(11:24):
larger SRE team, is usually segmented into different departments. So
you have an infrastructure-focused department, you would have
a DevSecOps one, because there's a lot of security
that goes into SRE, and you would

Speaker 6 (11:39):
Have like an observability monitoring team.

Speaker 4 (11:41):
I feel like observability and monitoring is where you hear
a lot of like focus on SRE for sure, but
it is definitely like segmented into different teams, and it
is very different at different places, so.

Speaker 2 (11:53):
You spend all your time looking at graphs.

Speaker 4 (11:54):
That's what I... I was more on the infrastructure
side of things. But yes, we worked heavily with
the observability team to make sure that the dashboards are
in place and they have the data that you need
and everything's set up properly.

Speaker 2 (12:09):
That sounds like the go-between: you get the
application metrics into some place that a different team owns,
and then when that team exposes a problem, you then have
to ferry that back to the application team to tell
them about the issue that they caused in production.

Speaker 5 (12:25):
Yeah, and that's.

Speaker 4 (12:26):
Where like what Forrest was saying, it's complicated what.

Speaker 3 (12:32):
So? Uh?

Speaker 2 (12:33):
Will and I had a little joke going on before
the stream started, and we forgot to end the joke. And
the joke basically ended where I decided to
change his moniker in the stream so that,
Aimee, only you can see it,
no one else. None of the audience will be able to see that. Will.

Speaker 5 (12:54):
It's Will, I'm so sorry.

Speaker 1 (12:56):
No, it's hilarious. It's great because well played Warren, well played.

Speaker 6 (13:04):
You look like a Forrest.

Speaker 1 (13:05):
But well, that is how the joke started, is because
like Warren is asking me something and I was like, dude,
I don't know. I just feel like Forrest Gump because
I just wander around places in my life and great
shit happens. And then people were like, how did you
do that? And like I don't know. I was just
standing there.

Speaker 4 (13:21):
Okay, Will, Will, I'm so sorry. I feel you got
me good, thank you. And now I've lost my train
of thought.

Speaker 6 (13:30):
A little bit.

Speaker 4 (13:30):
But I think what I was about to say
is, there's a fine line, because obviously it doesn't
make sense for the SRE to be the person that
debugs the code. But in a perfect scenario, ultimately,
the sort of thing that probably should happen is SRE is...

(13:51):
Usually, sometimes you have, like, a NOC team who's kind
of the first line of defense, that does, like,
a smoke test: okay, we got an alert, we
got paged.

Speaker 6 (14:00):
Is the application? Can I reach it?

Speaker 4 (14:03):
Like?

Speaker 5 (14:03):
Can I just go to my browser and reach it?

Speaker 6 (14:06):
Like is it up?

Speaker 5 (14:07):
Is it hard down?

Speaker 6 (14:08):
Or is it is latency?

Speaker 4 (14:10):
You know, starting to go down, that sort of thing.
And the NOC team is usually the first line of
defense there. If the NOC team deems it, like, okay,
something is awry here, then they'll usually

Speaker 6 (14:22):
page the SRE team.
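
A rough sketch of the first-line smoke test described above, under stated assumptions: the caller has already probed the application and recorded a status code and a latency. The names `Health`, `classify_probe`, and `should_page_sre`, and the one-second threshold, are hypothetical and not from the episode.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    UP = "up"
    DEGRADED = "degraded"    # reachable, but latency is too high
    HARD_DOWN = "hard down"  # unreachable or returning server errors


def classify_probe(
    status_code: Optional[int],
    latency_ms: Optional[float],
    slow_threshold_ms: float = 1000.0,  # assumed threshold, tune per service
) -> Health:
    """Answer the smoke-test questions: can I reach it, is it hard down, is it slow?"""
    # No response at all, or a 5xx response, counts as hard down.
    if status_code is None or latency_ms is None or status_code >= 500:
        return Health.HARD_DOWN
    if latency_ms > slow_threshold_ms:
        return Health.DEGRADED
    return Health.UP


def should_page_sre(result: Health) -> bool:
    """Escalate whenever the probe shows anything other than a healthy service."""
    return result is not Health.UP


if __name__ == "__main__":
    for probe in [(200, 120.0), (200, 2500.0), (None, None)]:
        result = classify_probe(*probe)
        print(probe, result.value, "page SRE" if should_page_sre(result) else "no page")
```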

Speaker 4 (14:24):
So the SRE team, usually you'll have playbooks
in place to do a sanity check, like
jump on one of the Kubernetes pods, you know, kind
of see what's going on. Are these pods erroring? Something
like that. But ultimately what should happen is there should be,
in my opinion anyways, a close relationship between SRE
and the development team, so that it's clear we have

(14:48):
an environment in this microservices system where we can
do a safe rollback. And so what typically would
happen, at least in my opinion, in my role, is
I would deem, like, okay, and I have a rough understanding
of what the feature is, what's the impact, because
there could be a business impact of rolling back. Ultimately

(15:10):
we want to make that... When I say business impact,
like, you have some sort of feature for Black
Friday, and it's peak traffic, and this feature is
driving revenue, but it's also
causing issues for customers. So you need to kind of
make the decision, and this is where the stress gets
high, of what's the financial trade-off ultimately here

(15:34):
to this business person. Honestly,
it will bubble all the way up through the
organization to the CEO of the company, and you
need to make the decision of what's the
lesser of the financial impacts here. But, long-winded
answer: SRE should have confidence from the engineering team
that it's safe to roll back and do, like, an

(15:55):
instant rollback. And it's also then helpful as an
SRE if, let's say, a lot of times you
will get paged in the middle of the night, and
it's a large organization, and sometimes you have great
engineers on your team who you can escalate to, and
they usually have a paging rotation also. But a

(16:19):
lot of times, I'm not gonna lie, I was on
an SRE team and it's the middle of the night
and, you know, I'm looking at this pod
that's crashing and I can see that it's actually
an error in the code, and I page the
engineering team and

Speaker 6 (16:37):
what is it? Crickets.

Speaker 3 (16:40):
Just like a group project. Like, it's the same. Actually,
I'll page again in twenty minutes, and crickets.

Speaker 4 (16:46):
So I've been in a state where like I am
just rolling pods all night. When I say rolling pods,
like restarting pods over and over and over again until
someone wakes up in the morning.

Speaker 2 (16:58):
Wait, so when you say NOC, like network operations center?

Speaker 1 (17:01):
Is that?

Speaker 2 (17:02):
Is that mostly to manage the physical hardware.

Speaker 4 (17:07):
It's hard for me to say. I didn't work... They
were kind of like our first line of defense at
some of my jobs, and I worked super closely with them.

Speaker 6 (17:15):
But yeah, those are.

Speaker 4 (17:16):
Typically the people that were, like, the hardware... they
were working in the data center, and, you know, the
organization obviously sees value in kind of transitioning them
to different roles and things like that.

Speaker 2 (17:27):
So yeah, so you then have to look at a
lot of application-specific metrics, so more on the software
validation side, and then be that ferry, that bridge,
back from, okay, there's a problem at the hardware level
that's getting escalated, to having to find the application team.
And of course they have no on-call rotation ever,
because they're not accountable for keeping the systems up and

(17:49):
running right.

Speaker 1 (17:50):
They're just there to ship code.

Speaker 4 (17:53):
Yeah, just ship it and hand it over, you know.
But that's where, you know, a close relationship helps also. And like
I keep saying, I keep going back to this
production readiness. It should be that before the
application team hands over something, they
work with the SRE team in advance to make sure

(18:13):
that, if it's a mission-critical service in your microservice
architecture, the dashboards are in place and you
understand the dependencies. When I say dependencies too,
a lot of things come up in SRE.
And that's why I like this. As I say, I
nerd out on distributed systems stuff. You can see one

(18:37):
application starting to show some latency, but the actual impact
is more so like a service downstream, and there are
different ways of like understanding obviously, like if this part
of the system is having this percentage of latency, then

(19:00):
I need to understand that if these downstream services are getting
hit by this other service X number of times, they're
going to start having this kind of latency. So
you really have to understand the entire system and
understand those dependencies. That's where
SRE really comes in, because when these application

(19:21):
teams are kind of siloed working on their specific service,
the SRE team is more responsible for understanding the
trace of a request through the system and understanding those
downstream effects.
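
To make the point about downstream dependencies concrete, here is a small sketch that propagates request volume through a call graph, assuming the graph is acyclic and that each service makes a fixed number of calls to its dependencies per request. The service names and multipliers are hypothetical, not taken from the episode.

```python
from collections import defaultdict
from typing import Dict


def downstream_calls(
    calls: Dict[str, Dict[str, int]], entry: str, requests: int
) -> Dict[str, int]:
    """Total calls each service receives when `entry` handles `requests` requests.

    calls[a][b] is how many times service a calls service b per request it
    handles. Assumes the call graph has no cycles.
    """
    totals: Dict[str, int] = defaultdict(int)

    def visit(service: str, load: int) -> None:
        totals[service] += load
        for callee, per_request in calls.get(service, {}).items():
            visit(callee, load * per_request)

    visit(entry, requests)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical topology: the API fans out to two services that share a database.
    graph = {
        "api": {"catalog": 2, "auth": 1},
        "catalog": {"db": 3},
        "auth": {"db": 1},
    }
    print(downstream_calls(graph, "api", 100))
```

In this made-up topology, 100 requests at the API become 700 calls to the shared database, which is why a small latency problem upstream can show up as a much larger one downstream.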

Speaker 2 (19:34):
Well, I mean, you just got into some really deep
topics there. There's a huge complexity with queuing
theory and window-based, you know, backoffs there, as
well as the systems effect, actual systems thinking,
of how one thing triggers another thing, which triggers another thing.
And I don't remember ever seeing a boot camp that
would have gotten anyone prepared for handling this. And you

(19:57):
got through this, and now, as I understand it, you
were still being put on the spot to actually make
a business-impactful decision for the organization.

Speaker 4 (20:06):
Yes, that's where it gets very stressful. And that's where
I feel like in this kind of role,
you really need someone who cares, because otherwise
the company's revenue is on the line, whether that be,
like I was saying, an example like

(20:26):
the Black Friday issue, where you have to make
the call of, is this shopping cart feature more important
than the fact that X number of other users can't
do this other thing that they need to do, versus
even understanding the SLAs: is there a

(20:47):
financial obligation that I need to pay my customers if
X number of things start to go down?

Speaker 2 (20:53):
So did you have some sort of dashboard that
listed out every single feature and how much it
was supposed to make for the company, and how
much the shopping... I mean, I feel like NPM and
Paramount are a little bit weird because it's not really
an end-user shopping cart scenario. I mean, I don't
really understand Paramount's, you know, end-user pricing model. I
know there's, like, Plus, right? But other than that...
So I feel like it's even harder to really understand

(21:15):
what the long term impact is for each one of
those things that's breaking.

Speaker 4 (21:19):
You know, it's very different obviously at different places.
So I keep going back to the shopping cart example.
I worked at different startups in the retail industry,
although I wasn't on the SRE team there specifically.
I was more in an architectural role, so kind
of tasked with understanding that and working with the DevOps team.
But yeah, in a perfect world, you would have

(21:43):
dashboards to help you make those decisions, but you don't have
that, so you have to make the best
call based on your understanding and, you know, escalate to
the right people if you can. And that's where we
should really get into, you know, maybe shift
the conversation to strategies besides just reactive strategies,

(22:08):
because we've talked about a lot of reactive
approaches to things. But part of what SRE is tasked
with is, how do we make this system more
resilient by nature of things we can do?

Speaker 2 (22:21):
And I feel like you're in the perfect position to
think about and sort of ideate on that. But
don't you end up with a lot of pushback,
because that's sort of a shift-left approach in
the organization? Like, oh, you know, application product teams,
this thing that you're doing has
a high likelihood of a production impact,
maybe you should do something different. How do you get
that actually across to those teams so they can start

(22:43):
making those actual changes because it's like a mindset or
culture shift for them.

Speaker 4 (22:48):
Kind of where I'm going with this is more so
on the redundancy side of things, like the SRE team
making the decision, and you have to work with
the engineering management leadership to understand, you know,
if we're running in a Kubernetes environment, making that...
do we do multi-cloud? Do we...

(23:09):
obviously we do multi-region. Those sorts of decisions. And
then that's where I get into, I don't think a
lot of people necessarily realize, like I was saying, how
valuable content delivery is in this, because just by
nature of having as much as you can cached and

(23:30):
distributed before the request even gets to,
whether it's your physical data center or a cloud,
how can we offload some of that traffic? Because
I feel like a lot of these systems are just going
to fall over without that. And I know we mentioned it
at the beginning of the episode, and maybe I'll

(23:51):
touch on it more at the end. But one
thing that's been very interesting for me to sit
back and watch is, like, gen AI.

Speaker 6 (24:00):
Saying that word... it is

Speaker 5 (24:03):
what it is. It's, like, twofold.

Speaker 4 (24:07):
First off, my heart goes out to
SREs who work in environments where they have
developers that are just, with this whole vibe coding thing...
I don't think people realize how much these LLMs
are hallucinating and producing content that is just garbage,
(24:28):
absolute garbage soup.

Speaker 1 (24:31):
Pour one out for the SREs supporting vibe coding, is
what you're saying.

Speaker 2 (24:35):
Yeah, we did have some previous guests on the show
who were, like, swearing to us that their LLMs didn't
ever hallucinate.

Speaker 4 (24:43):
I literally was... I won't say the
LLM because I'm going to get myself in trouble, but
I was going back and forth with
one of them earlier this week, writing stuff, and
I asked it to revise it, and it started
putting all these statistics in, and I was like... I

Speaker 5 (25:01):
Was using a product that allowed me to.

Speaker 4 (25:07):
incorporate a lot of, basically like RAG,
so I could incorporate a bunch of external sources. And
I was like, there's no way those statistics are in
those sources, because I know the data that I added.
And I asked it, where did you get these statistics?
And it's like, you're right, I made them up. I'm sorry.

Speaker 2 (25:29):
At least you've got a conclusive answer there, because we're
seeing this a lot more with SEO'd articles today, where
not only are they being generated by LLMs, but the
links that they go to do not back up whatever
is in the article. And so even if a
human is writing it, it's just some fluff piece. And
if you go there, it's like, oh, sixty percent of
attacks are from this particular source, like email is the
(25:51):
source of all attacks in security. And you go and
you click on the link, and it's just some completely
made-up garbage that's also not researched, by some other
company that has nothing to do with the topic whatsoever. So
can you blame the probabilistic, you know, statistical engine that's
just predicting the next word, that is also going to
pull out links that have nothing to do with the
topic whatsoever?

Speaker 6 (26:11):
Yes, yeah, you.

Speaker 5 (26:12):
Can't blame it, but it's like it's on us to.

Speaker 4 (26:16):
You know, vet it. But what I wanted to touch on that.

Speaker 5 (26:19):
Very briefly was just basically, like I read.

Speaker 4 (26:21):
something this morning, like the percentage... there's a direct
correlation between these companies that are allowing more
AI-assisted developer stuff and outages. So my heart...
I'd have to... I think I screenshotted it on LinkedIn,

(26:42):
but there was like it was a lot, and it's
just in the back of my mind.

Speaker 6 (26:46):
So there's that side of things.

Speaker 4 (26:47):
And then there's also just, you know,
OpenAI and all these companies pushing out... it's
like a race to get stuff out to production faster.
And I think last year, just the
number of outages, every time I would go to
ChatGPT, I think for almost a week, I
was like, half the time it's down.

Speaker 2 (27:08):
I think the jury is really out on the numbers.
I think we've concluded two things, and whether or not
people are willing to listen to those is a separate topic.
You are trading quality for speed. That is
the trade-off. And so, you know, I think there
are a bunch of good
articles recently by people who work at some of these
(27:29):
(you say you don't like the word gen AI; I
really hate the word agentic, that's mine) companies, that
they themselves are not using their own LLM to produce
the code, right? And so, you know, I see these
CEOs stand up and be like, oh yeah,
at our company, twenty-five percent of all the source
code we write is by, you know, some sort of
(27:49):
LLM. And I'm like, what are you
going to do about that risk? You know, you're gonna
decrease that in the future?

Speaker 4 (27:56):
Yeah, I should say, like I'm not opposed to these tools,
but they need to be used responsibly and for better
or worse, like humans, Like it's our nature. Like that's
what like the famous like programming quote, like we're lazy,
Like a lazy person makes a good programmer because they
want to automate things. So like there's a lot of
value in these tools, but you hands down need to

(28:19):
use them responsibly and be able to use them
with actual knowledge behind it. I mentor a lot
of developers and, you know, they're just wanting
to get stuff done. And I'll see this code
that they wrote, and I'm like, I know
where you got that from.

Speaker 6 (28:38):
That's not right.

Speaker 4 (28:41):
I don't know why it's telling you to like install
Python packages in.

Speaker 6 (28:47):
your Node application, but this is not right.

Speaker 4 (28:52):
Yeah anyways, Yeah, but.

Speaker 1 (28:54):
On a side note, I really admire that you got
those Python packages to work inside of Node.js.

Speaker 2 (29:01):
Well, it just comes down to the foreign function interface, and,
you know, you just shell out to that, no
big deal, and your application, you know, just has multiple programming languages.
You know, what's the big deal? It's
all assembly under the

Speaker 6 (29:11):
hood, exactly. It's all ones and zeros, doesn't matter.

Speaker 2 (29:15):
Right for now, for now.

Speaker 1 (29:20):
So I want to come back to talking about playbooks,
because you mentioned those and I got excited. But I
think an important thing to hang the playbook conversation
on is talking about being an SRE at scale,
because being an SRE at a small company versus
being an SRE at an enterprise, I think, are quite
(29:44):
possibly two completely different career fields. And so how
did you address the scale of applications you
were dealing with?

Speaker 4 (29:56):
That's a good question too. I think it really... so,
talking about this microservice system, where a
lot of people usually are at right now, typically everything
is a trade-off, and that's a hand-wavy answer,
but it's true. And so when we go back
to the production readiness stuff, I think it's

(30:18):
helpful to understand a process of
kind of rating these microservices. It's just like
tier one, tier two, tier three. Like, an A
system is probably going to be a tier one system,
whereas if you have a recommendations engine that's running

(30:39):
things in the background, that's more of like a tier
three system because critical functionality still works. Check out is
going to be obviously a tier one system something like that.
But there's a lot of things to consider, Like I
feel like, especially like for people that are getting into

(30:59):
this kind of role, it's important to understand like the
traffic patterns, Like if you're working in like a small startup,
you're accountable to your customers, like maybe just during like
business hours, like a banking application or something like that
is not going to be as critical as an application

(31:20):
that's used internationally. Things like that, those are like does
that kind of answer your question?
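
The tiering idea above can be sketched as a simple lookup. The two services come from the examples in the conversation (checkout as tier one, a recommendations engine as tier three); the response policy attached to each tier is an illustrative assumption, not something stated in the episode.

```python
from enum import IntEnum
from typing import Dict, List


class Tier(IntEnum):
    ONE = 1    # critical functionality, e.g. checkout
    TWO = 2
    THREE = 3  # background features, e.g. recommendations


SERVICE_TIERS: Dict[str, Tier] = {
    "checkout": Tier.ONE,
    "recommendations": Tier.THREE,
}

# Assumed policy for illustration only.
RESPONSE_POLICY: Dict[Tier, str] = {
    Tier.ONE: "page on-call immediately",
    Tier.TWO: "page during business hours",
    Tier.THREE: "open a ticket",
}


def response_for(service: str) -> str:
    """Look up how an alert for this service should be handled."""
    return RESPONSE_POLICY[SERVICE_TIERS[service]]


def triage_order(failing: List[str]) -> List[str]:
    """Sort failing services so the most critical tier is handled first."""
    return sorted(failing, key=lambda name: SERVICE_TIERS[name])


if __name__ == "__main__":
    print(triage_order(["recommendations", "checkout"]))
    print(response_for("checkout"))
```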

Speaker 6 (31:29):
Like things to consider, Yeah, for sure.

Speaker 1 (31:33):
And then tying it back to playbooks, like there's a
certain level of scale where you just can't be familiar
with all of these systems and you have to rely
on playbooks. So what are some of the strategies that
you've seen to making sure that you have an accurate,
up to date playbook when you need it.

Speaker 6 (31:53):
Yeah, up to date.

Speaker 2 (31:56):
Everything is legacy as soon as you write it. Like
as soon as you write something down, it is out
of date. That code is already wrong as soon as
it's released.

Speaker 4 (32:04):
Yeah, so I'm going to go back to
building out an SRE team: you need
technical skills obviously, but you really need people who take
ownership and care. We're talking about the playbooks.
I've definitely been at companies where I will get... I'm

(32:27):
brand new.

Speaker 6 (32:28):
I'll get paged.

Speaker 4 (32:29):
I go to the playbook and I see that like
the playbooks can get it hasn't been updated in five years,
and I'm like, you know, shoot, crap.

Speaker 6 (32:41):
Like what do I do? Like I'm getting paged.

Speaker 4 (32:44):
Usually, like when you're paged, it will
the playbook and it's not there. So, you know, the
best way that I've seen people tackle this is like ownership.
Someone needs to own these playbooks. Somebody, if a, if
a system is like a tier one or a tier
two system, like this is part of this, like, production

(33:06):
readiness checklist. Like before, before the SRE is okay with
deploying this, and like a lot of times, like,
the SRE team is, they're not really the
gatekeeper because a lot of these companies, like the developers,
have the ability to ship themselves, but there are obviously
consequences if they ship something and it crashes and and

(33:30):
that doesn't completely come that that is like a shared ownership.
But having there needs to be ownership of these playbooks,
and that's usually done by the SRE team. Typically,
you know, you will get the leadership knows when something
is very critical, and good leadership will make sure that

(33:53):
these teams are communicating on like a regular cadence. I'm
big on like these regular cadences, not necessarily because meetings
kind of suck, but they are important for communication.
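
One way to picture playbook ownership as part of a production readiness checklist is a small check that flags tier one and tier two playbooks with no owner or no recent review. This is only a sketch: the `Playbook` fields and the review windows in `MAX_AGE_DAYS` are hypothetical values chosen for illustration, not numbers from the conversation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Playbook:
    service: str
    tier: int              # 1, 2, or 3
    owner: Optional[str]   # team or person accountable for the playbook
    last_reviewed: date


# Hypothetical review windows in days; tier 3 is not gated in this sketch.
MAX_AGE_DAYS = {1: 90, 2: 180}


def readiness_problems(playbook: Playbook, today: date) -> List[str]:
    """Return the reasons a playbook fails the readiness check, if any."""
    problems: List[str] = []
    max_age = MAX_AGE_DAYS.get(playbook.tier)
    if max_age is None:
        return problems  # lower tiers are not checked here
    if not playbook.owner:
        problems.append("no owner")
    age = (today - playbook.last_reviewed).days
    if age > max_age:
        problems.append(f"stale: last reviewed {age} days ago (limit {max_age})")
    return problems


if __name__ == "__main__":
    old = Playbook("checkout", tier=1, owner=None, last_reviewed=date(2020, 1, 1))
    for problem in readiness_problems(old, today=date(2025, 1, 1)):
        print(f"{old.service}: {problem}")
```

A check like this could run on the same regular cadence as the team meetings mentioned here, so a five-year-old playbook is caught before someone gets paged into it.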

Speaker 2 (34:06):
I like that you put the responsibility of the playbooks
on the SRE team. There's actually a debate that
my CEO and myself have been talking about a bunch
where if it's an artifact that gets thrown over the wall,
then it's not going, like, you can't, it doesn't contain
knowledge. In a way it contains, they're usually actionable things,
and so since the knowledge isn't there, you won't know

(34:29):
how to adapt to a playbook when you get to
a new scenario that's not encountered before. Right, And most
of the problems that we find in systems that we
automate are only problems that we haven't seen before.

Speaker 4 (34:42):
And I will say too so when I say like
having accountability and ownership, like that's where for better or worse,
everybody hates on call schedules, but those on call schedules
do serve as kind of like a backpropagation of like
if somebody is on if you're on an on call
rotation and you're getting paged frequently, like you have a

(35:04):
vested interest in stopping that because if it gets out
of hand, it's just a huge snowball of you know, obviously,
like you you're starting to like your sleep is interrupted.
So and for better or worse, like people will be like, oh,
you know, you had a rough night, take it easy
during the day.

Speaker 6 (35:22):
Yeah right, that's not.

Speaker 4 (35:26):
You're still getting up and doing your usual hours, but
you're just like I'm dead. So there's a vested interest
in in like sharing that ownership of the on call rotation.
And if you're getting paged constantly, like you're going to
want to make sure that your teammates have the playbooks
in place. And you're like, I just had a huge
outage and these are the steps that I took to

(35:47):
remediate this these are the things that were helpful. So
after my on call rotation, I'm going to go in
and update the playbook and make sure that the steps
that I took are there so that maybe there's like
more junior members on the team they also know and
things like that, because usually too and like these kind
of environments that you'll have like multiple tiers there, so
there's a knock. But then even within s RE, like

(36:08):
I think it's helpful to have like multiple tiers of
SRE support, so you have like more senior-level SREs
and then more junior-level SREs, and like maybe, maybe
during the day you would page a more junior SRE or
something like that, whereas a senior is around to step
in right away, but you want to have stuff in
place for it makes sense.

Speaker 2 (36:27):
What I'm hearing is you're a huge proponent of these
incident management tools that are generating playbooks dynamically on the
fly using some sort of LLM please a lot of so.

Speaker 6 (36:38):
Much reviewing them.

Speaker 4 (36:39):
That's fine by me, but like, yeah, these LLMs are helpful,
but they are they are not the end all, Like
that's not the end.

Speaker 5 (36:47):
Of the road.

Speaker 1 (36:48):
You just gave him my next startup idea, Warren, I
mean you have to compete with some of our past guests,
I think, who are already promising dynamic
runbooks based off, I mean, there's a thing I think
Sentry does, and I think there's a couple other tools
now that they have your source code or at least
you know, and then they have the stack trace from
the error and the actual error message, and they use

(37:09):
that to automatically propose a pull request into your Git
repository that supposedly fixes the problem so that someone can
review and approve it.

Speaker 2 (37:16):
And you've got to know that at three am, you
are just going to click approve.

Speaker 6 (37:23):
Yeah, yeah, that's yeah.

Speaker 4 (37:25):
All those tools, Like I said, I think they're helpful,
but they need to be used responsibly and with somebody
who is actually like a human being with background that
says yeah your name.

Speaker 1 (37:34):
Yeah. I think we're still discovering, like what the rules
of responsibility are. Like I feel like we're learning quickly,
but we're still learning, you know, some of those being
like you have to give it very discrete tasks, you know,
ask it to do one thing so that the result
that comes back to you is something you can actually
parse in your head. And you know, don't give it

(37:57):
something that you don't know how to do, because then
you have no way to check its work. Or another
one that I use that I really like is treat
it like your own personal intern or junior developer. You know,
you wouldn't like hire an intern and then say, you know, hey,
go architect this four-tiered microservice for us. It'll

(38:20):
be due by Friday.

Speaker 2 (38:22):
Oops. I mean you're onto something there, Will. I will
say that maybe it's just forcing us to finally think
about concretely how humans should actually work together effectively. And
I you know, in the last you know what is
the year twenty twenty five, we haven't figured out how
to do that correctly, and now we're throwing in a
tool into the mix which actually requires us to do

(38:44):
that work. I'm going to guess it's going to be
like another two hundred years before we got we got that,
you know, actually figured out how to optimally communicate one
human being.

Speaker 3 (38:54):
I'm telling you, I have a teenager, and like I
would just like to point out that we have been
peopling for millions of years and we still do not
have a good way to make the teenagers not really
really dumb Okay, like just it's never gonna happen. It's
never gonna happen.

Speaker 2 (39:11):
Millions is a bit long. I'll give you like four
hundred thousand or so.

Speaker 3 (39:16):
But I.

Speaker 2 (39:18):
Do definitely accept that argument that you know, but you know,
acceleration of evolution, so maybe maybe we'll get there faster
this time.

Speaker 1 (39:26):
Right, Moore's law for humans, that's a thing, right, Yeah, Yeah, I.

Speaker 2 (39:32):
Really appreciate Moore's law. What I don't appreciate is people
saying that Moore's law applies to things that it doesn't
apply to. I mean, I totally agree that the principle
by what is it correlation doesn't necessarily imply causation. Although
you know, you look at lots of things and they
do double at you know, some rate. That's but people

(39:52):
get Moore's law wrong a lot. They apply it to things
that don't make sense. It is literally just the size
of the transistors will reduce by half every eighteen months.
That's that's it. That's the whole law. It doesn't apply
like statistically to anything else. Now we do see doubling
and size reduction and other things definitely, like dies on
a chip and et cetera. But you know, there's like
an interesting thing. Does it apply to these new plastic

(40:14):
chips that I saw China producing now that don't use
silicon I don't remember the material and that are supposedly
like half or a tenth of the size, you know,
does Moore's law apply there too?

Speaker 1 (40:26):
I have no idea.

Speaker 2 (40:26):
I was just trying to make a joke, okay for us.

Speaker 3 (40:32):
I wanted to backtrack just like a little bit and
point out, Aimee, that I really like your kind of
go-getter-ness, I guess, with understanding actually what the business needs.
Because I work with a lot of people and especially students,
and they just have no idea, like no idea like
what it is, like what is supposed to be the
end result of this kind of like process that they're doing.

(40:53):
And I think it's very important. I mean it's important
from like work, but I think it's also very important
from a career perspective as well, Like would you like
to maybe get new jobs? Like you should? You should
understand what's going on at least a little bit.

Speaker 5 (41:06):
Yeah, Like I said, I don't know.

Speaker 4 (41:09):
I just like still to this day feel very fortunate
to be able to do something that I pretty much
enjoy every single day. And I just I don't know,
maybe I just just like too harsh growing up or something.
But I'm just like I feel like if I'm getting
a salary, then I should be. There is value in

(41:32):
pushing back sometimes when you're like senior enough to have
a level of expertise, like in a professional and respectful way.
But yes, like, at the end of the day, our job,
whether we like it or not, is to make this
company profitable, whether that be features, making sure that users
have confidence in the service.

Speaker 6 (41:53):
Yeah, sorry, stuff like that, They're.

Speaker 2 (41:55):
Going to say, despite leaderships and a strategy to the contrary,
I am, I am sort of curious if you have
any you know, personal strategies for gleaning better insight into
how the business should work or how a feature that is,
say technical, transforms into value for the business. Are you

(42:15):
working with product managers or with someone on the marketing
and sales team? Are you grabbing information from the internet?
Like how are you actually figuring this out?

Speaker 5 (42:26):
Like what is and isn't valuable for the business? I
guess it's for.

Speaker 4 (42:32):
Me personally over my career. It probably depends on the
domain I'm working in. So I've worked in like very
like domains of like the construction industry, Like I don't
have expertise in what these people want obviously, Like I
like my personality, I want to understand the domain that

(42:52):
I'm working in so that I can try to better
understand these types of things. But in that sort of scenario,
like I am going to lean more on the product
people now if it's like a dev focused company. You know,
my own thoughts obviously, like doing this a while, like
having friends that I work with and like hearing their
stories, that goes into like what is and isn't valuable. Like

(43:18):
I'm trying to think of like different scenarios like that
I've been in. There was definitely one later on in
my career where there was like this is like an
SRE-type thing, so like the security team was extremely
focused on like getting this tool out because they were

(43:41):
getting paged based on things that were happening. But the
tool that they had vetted, while it seemed good, it
was a very like rush job to get it out
in time. And so like on my side of things,

(44:02):
like there was a lot of code that needed to
be deployed to get form the infant being a little vague,
so I'll try if I can. That's fair, I can clarify,
I will, but at the end of the sorry it.

Speaker 2 (44:16):
Works, no, but it works because the more vague you are,
the more trauma that listeners will have because it will
relate to their situation. Like I have like three different
traumas right now, just based off of what you said.
You know, security teams pushing stuff out, you know, not
knowing what to do, you know, not well vetted software
that now everyone is dependent on, like you know, not

(44:37):
a short number of times.

Speaker 4 (44:39):
Yeah, So like and I have, like in this particular scenario,
like a really close relationship, like really got along well
with the team that wanted to push this out. On
the SRE side of things, I keep saying, you have
to understand like the tier one, Tier two, Tier three,
and this was getting deployed to like tier one, So
this is like getting.

Speaker 6 (44:57):
Deployed at the edge. So if this is wrong, like.

Speaker 4 (45:02):
There's no like the quest the request doesn't go through,
like it doesn't even it's not even like a latency issue,
like it's it's dead. Yeah, it's I don't know. I
see Warren like shaking his head.

Speaker 2 (45:13):
So I was gonna say past trauma, like actually Will
and I did an episode last year
about how to handle scaling around the holidays, because you
know it's a time, especially for companies that do
e-commerce et cetera, anything with orders, there is a huge
spike and how to be prepared for that. And it's
sort of interesting that if you break down the systems

(45:34):
thinking model you brought it up, there is this perverse
incentive where if you have a deadline, that you are
more likely to rush to meet that deadline, which means
that you cut more corners, which means the thing is
more risky. And so if you create a deadline or
a code freeze to prevent risky code from getting into production,
you are actually encouraging the exact thing that you are

(45:55):
trying to prevent.

Speaker 4 (45:56):
So and like also on that end, and this is
where like it, I guess it does like you start
to like traumatic events and you that is usually the case.
I've also seen the case of we do a code
freeze and a system doesn't get touched for six months,

(46:16):
maybe even a year, and then you decide that you
do want to deploy like a new feature, and boom
it's down and hard down because nobody has touched it
in a year. And so like this certain situation that
I ran into months we had like a reindexing job

(46:38):
and there was so much data to reindex that what
that the machines that it had been running on could
no longer support the reindex process. So we're like screwed
at this point. So yes, so you have to wait,
and there's trade-offs in all regards, and as an SRE,

(47:00):
like it's just really your job to try to understand
like not even like what's the perfect scenario, but like
what's the least of like the evils here?

Speaker 2 (47:10):
He dos actually is a really interesting guide for resilience engineering,
and one of the things that they talk a lot
about is what happens when there is like one or
two points of failure, what actual new systems start coming
into play, What does the request volume start to look
like versus the nominal state? And you mentioned caches
at the edge, and I think one of the most

(47:31):
common scenarios is, why do you add a cache? Well,
because you're trying to reduce the load on your system,
which means that you create the ability to create higher
load on your system to hit the cache, and as
soon as the cache, you know, drops or gets busted
because you're using something like Valkey, because no one's using
Redis anymore, of course, which isn't designed to be persistent,
it's going to drop at some point, which means you're

(47:53):
going to get a lot of load to your system
without going through the cache and it's going to cause
a hard failure at that point, and how are you
going to recover? And so it sounds great to add
a cache in, but then you're gonna have to deal
with this quite critical scenario where you are just instead
of being able to handle any number of requests, you're going
to be able to handle nothing.
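
The cache scenario above comes down to simple arithmetic, sketched here under assumed numbers. The function names, the 10,000 requests per second, the 95 percent hit ratio, and the 1,000 requests per second origin capacity are all hypothetical; the point is only that a backend sized for the cache-miss traffic sees the full load the moment the cache drops.

```python
def origin_load(total_rps: float, cache_hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the origin."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_rps * (1.0 - cache_hit_ratio)


def overload_factor(total_rps: float, cache_hit_ratio: float,
                    origin_capacity_rps: float) -> float:
    """How many times over (or under) its capacity the origin is running."""
    if origin_capacity_rps <= 0:
        raise ValueError("origin_capacity_rps must be positive")
    return origin_load(total_rps, cache_hit_ratio) / origin_capacity_rps


if __name__ == "__main__":
    total_rps = 10_000.0       # assumed traffic at the edge
    origin_capacity = 1_000.0  # assumed capacity of the backend

    warm = overload_factor(total_rps, 0.95, origin_capacity)
    cold = overload_factor(total_rps, 0.0, origin_capacity)
    print(f"cache warm: origin at {warm:.1f}x capacity")
    print(f"cache lost: origin at {cold:.1f}x capacity")
```

Under these assumptions the origin runs at about half of its capacity while the cache is warm and at ten times its capacity once the cache is gone, which is the hard failure described here.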

Speaker 4 (48:09):
Yeah, back to like what you're saying the distributed system stuff,
Like I in my mind you brought up like single
point of failure and it's just like the nature of
how things work, like you're always going to have a
single point of failure. And it's not so much like
how do we get rid of this single point of failure,
because I don't even think that's possible, because you're going

(48:31):
to have like traffic management at some point, Like what
if the traffic management fails? It's more so in my mind,
like what is what system seems to be the most
resilient and that's where we want to place this single
point of failure. So like usually it is like the distribution,
Like if you have like a cache distribution, that is

(48:52):
usually the in my opinion, like your wisest single point
of failure.

Speaker 2 (48:57):
This is an interesting way of phrasing and I haven't
heard that perspective before.

Speaker 1 (49:01):
What's your on call rotation look like?

Speaker 4 (49:07):
So it's super different based on the company. I am
a big fan of companies who are hopefully you work
at a company that's large enough that they have engineers
across the globe and so your on call rotation can
fall within the hours that allow you to uh work

(49:27):
optimally and not be up all night. But like I said,
I also firmly believe like everybody should share in the
on call rotation because it shares ownership and if things
start to go awry, then you have a vested interest
in making things better. But the on call rotation like
is it? It is what it is, and you know

(49:48):
it can be a little rough at the beginning, but
it's a great way to learn honestly, like just dive
in because there's nobody knows everything in the system. Like
you're going to hit something that probably somebody hasn't
hit before.

Speaker 2 (50:01):
When you say everyone, like, what's the size of the
org uh that you're thinking about there?

Speaker 5 (50:07):
You know.

Speaker 4 (50:08):
I've been in teams where, like my on-call rotation honestly
was like every other week, which is a little rough.
I've been on teams where it's I want to say,
like maybe once a month or something like that. I
feel like that's pretty reasonable. I think it's helpful to have,
like I said, like the tiers of support also to
like hopefully it doesn't get to this point, but it's

(50:32):
definitely a real thing. I've seen, like with myself and others,
where you're on an issue for twenty four or forty
hours straight and you need to hand that off because
after a certain amount of time, like you're just, you're
brain dead, or like you need to bathe, right, I
mean I guess like if I think of like military

(50:53):
or something like, you can go a long time without bathing.

Speaker 6 (50:55):
So maybe that's not the best example. Like you need to.

Speaker 1 (51:00):
Sleep, especially in a remote workforce culture.

Speaker 4 (51:04):
Yeah, like you need sleep to be able to like
debug things, but there should usually I feel like in
a proper on call rotation, like a handoff phase, if
it's been like more than twenty four hours something.

Speaker 2 (51:16):
Like that, yeah, I will. I think follow the sun
implies that you, like you hand it off to someone
who's still awake at that time and not you. You
just stay awake until the sun comes up.

Speaker 4 (51:27):
That that, in my experience, it typically works the best.
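
A follow-the-sun rotation with a handoff rule can be sketched in a few lines. The region names, the UTC coverage windows, and the `MAX_HOURS_ON_INCIDENT` constant are hypothetical; only the idea of handing off after more than twenty four hours comes from the conversation.

```python
from typing import Dict, Tuple

# Hypothetical regions and their UTC coverage windows as [start_hour, end_hour).
COVERAGE_UTC: Dict[str, Tuple[int, int]] = {
    "apac": (0, 8),
    "emea": (8, 16),
    "amer": (16, 24),
}

# Hand off once someone has been on the same incident longer than this.
MAX_HOURS_ON_INCIDENT = 24


def region_on_call(utc_hour: int) -> str:
    """Return the region whose working hours cover the given UTC hour."""
    if not 0 <= utc_hour <= 23:
        raise ValueError("utc_hour must be in 0..23")
    for region, (start, end) in COVERAGE_UTC.items():
        if start <= utc_hour < end:
            return region
    raise LookupError("no region covers this hour")


def should_hand_off(hours_on_incident: float) -> bool:
    """True when the responder has exceeded the handoff limit."""
    return hours_on_incident > MAX_HOURS_ON_INCIDENT


if __name__ == "__main__":
    print(region_on_call(3), region_on_call(12), region_on_call(20))
    print(should_hand_off(30))
```

The windows here tile the whole day, so every hour has exactly one region on call; a real schedule would also need overlap for the handoff conversation itself.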

Speaker 2 (51:32):
So this is this may be a little bit of
a tangent, but since you're from the JavaScript Jabber podcast,
I have to ask your preference: Node.js, Deno, or Bun.

Speaker 4 (51:43):
I'm old and boring, so I like Node still. No,
I've talked to like the Deno team and like I
I don't have anything against that.

Speaker 6 (51:52):
I just I'm old and boring.

Speaker 2 (51:54):
And I shouldn't say io.js.

Speaker 3 (51:57):
I still like Perl, so like, I agree.

Speaker 1 (52:00):
I think it's just fitting that she said I like Perl,
and the rest of the audio cut out.

Speaker 2 (52:04):
After that. The sad part is like it's still recording
for her, so we'll find out when the episode is released
what she actually said.

Speaker 3 (52:11):
It's only for this podcast, you guys that I have
so many technical problems, Like I'm fine in all of
my other meetings.

Speaker 4 (52:16):
It's just this'll be doing something right now.

Speaker 2 (52:21):
I believe you. Jillian.

Speaker 3 (52:22):
That's good.

Speaker 1 (52:23):
I have no comment.

Speaker 3 (52:26):
That's probably for the best quite possibly.

Speaker 2 (52:28):
How do you deal with the complexity of having to
share so much knowledge with so many people? So the
more people that are working in solution in an area
and a team or means there's more things that are
going to be created, which means there's more things everyone
will have to be aware of to potentially deal with
when an on call incident comes up. Is there a

(52:50):
good metric or litmus test to figure out what the
maximum amount of stuff is that a team can actually manage.

Speaker 4 (52:57):
Yeah, honestly, that is a very good question, and I
feel like I don't have a good answer, but now
that's something I want to go see if I can
dig into more. But I also say, like how we
have like these different like tiers of support. It's hard
if you get into like a more senior role because

(53:18):
stuff is going to bubble up to you, but it's
usually helpful. Like in a large team, there's usually even
like SLAs to the developer team. So even like a
lower environment is going to like be held accountable to
the business to keep a lower environment up in.

Speaker 6 (53:35):
A certain amount of time.

Speaker 4 (53:38):
Like sort of related to your question, like having those
types of people like work in those environments and that
helps distribute knowledge of like how these different systems work
without being tasked with like the actual production environment, and
then if they do well in those lower environments, and
then they can kind of move on to a production environment,

(53:58):
and that helps spread the knowledge.

Speaker 2 (54:00):
Pray that's prey. I mean, I think there's a there's
a corollary here where I often have to advise some
companies on which is we have like five different mobile
apps because you know, there's many different providers out there,
and then there's a website and also web native version
for your mobile app that isn't like an app version
but is responsive, like, and then there's the back end that

(54:22):
goes with it. And the infrastructure is that all one team,
you know, so everyone is cross functional and understand what's
going on. Or do you have like ten teams that
each have the exact same functionality. And I feel like
the answer keeps on being well, they're both wrong and
there's no there's no good solution there, and you just
got to have to deal with the consequences.

Speaker 6 (54:40):
It's awesome.

Speaker 4 (54:40):
I mean, this is why I did startups for a
long time. But as I've gotten like further into my career,
I feel like it's hard to make an impact unless
you've been at the company for really for a good
amount of time because it the larger the code base
and like the domain than and you really do need

(55:01):
that time to understand it, whereas like a startup usually
like a smaller application, and you can understand it more.
But like a lot of these like larger companies, they
have like these huge systems and it takes time to
understand it.

Speaker 2 (55:16):
That's a really interesting challenge. Then we see in the
tech industry the average tenure for engineers reducing over time. Yeah,
does that mean that we're actually more and more incapable
of building larger systems because people aren't staying at the
company to be able to diagnose and build more resilient systems.
Are we still attempting to do that and the systems
we're building just aren't resilient anymore? Or is this just

(55:39):
like you see whatever problems are close to you, and
larger companies with bigger solutions are doing a good job
at retaining experienced staff that have experience in their own
solution in order to help solve those challenges.

Speaker 4 (55:54):
It's it's very tricky because yeah, there's like a trade
off both ways, Like you stay a company long enough,
like yes, you might kind of have some like siloed
information and start to not know how other people are
solving problems, and then that kind of can produce like
some blind spots in your knowledge. But at the same time,

(56:15):
like if you don't stay at a place long enough,
it can be hard to see like how this huge
system interacts and.

Speaker 6 (56:24):
Be able to.

Speaker 4 (56:27):
Really solve like these harder problems because it takes a
while to like get the whole domain, the whole system
in your head. It's a there's definitely like like a
vested interest I feel like in these companies and like retaining.

Speaker 2 (56:39):
People, well, yeah, for sure. I mean there's the benefits
to them and not to the individual and a lot
of times, I mean, I guess, you know, maybe a
different way of phrasing this is if we look at
Dunbar's number and an org is going to be at max
like one hundred and fifty people, how many years of
experience working in that company do we project someone would
a new hire would have to work before they or

(57:00):
at the point where they are delivering or understanding the
system sufficiently that one hundred and fifty person org has
created any thoughts on that.

Speaker 4 (57:08):
That's an interesting question too, like how it kind of
like scales based on the size and the system and
how long you'd have.

Speaker 2 (57:16):
To be there, but a personal preference, you know, like
oh yeah, once I'm there, you know, six months, I
feel like I'm now somewhat accountable for every system or
a year or two years or something like that.

Speaker 4 (57:27):
It probably this doesn't answer your question, unfortunately, but I
would say like one of the things that could be
helpful is like if you start to like segment your
career working in a similar domain, so some of that
knowledge is transferable, like whether you're talking about like the

(57:48):
hyperscaler that you're working on, or if you're more focused
on like CDN distribution, and maybe that's valuable to other companies,
whereas like if it's a you know, if it's a
paylalled type of application, that maybe your CDN experience is
not as beneficial there.

Speaker 1 (58:07):
I think also when you say one hundred and fifty
person org, are you referring to like the entire company or
just engineering.

Speaker 2 (58:13):
I mean even a subset of engineering potentially, like I
just like at Dunbar's number, like your teams are going
to be like two pizza teams for instance, that are
cross functional. But then there are different types of teams
and different features within the product or product suite that
you're delivering, and that can only get up to one
hundred and fifty. Above that you need to create a
second org and have them work on a completely different
feature set with a completely different focus. You know, you

(58:35):
think of like how in a hyperscaler, I'm going
to use Google Workspace as an example, like email, Gmail
and drive, right, I can imagine those being separate, completely
separate orgs for instance, But I would not expect one
org structure where it's fundamentally accountable for every single product
in Google Workspace. There's just too much stuff there. So
that's the part for me that makes the most sense,

(58:57):
that you can't have more than one hundred and fifty
people working on the same product because there's only so
much functional work there, and at a larger size you
have a communication breakdown. You can't pass the information effectively.
So that's just sort of the metrics on the math
that seem to work out.

Speaker 4 (59:12):
Too many cooks in the kitchen, and like I think
for better where some people have like a vested interest
in like trying to offer value, but unfortunately some there's
the values not there.

Speaker 2 (59:22):
Yeah, I mean even to your point of like, well,
let's have an on call rotation in everyone and say
the team in the ORG could participate in right, one
hundred and fifty people means you're on call for your
product once every three years, right, So that doesn't work out, right,
So you know, if your org is fifty people, then you
get to the, okay, once a year, I'm on call. Well,
that maybe works out potentially if you have a low

(59:44):
number of bugs. You deal with real time applications that
can never really be down, highly resilient, and you run
you're probably running fire drills during during working hours as
if there were an incident to know how to respond
and not just hopefully waiting and never seeing a problem.
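
The on-call arithmetic above can be checked with a short calculation. The assumptions are mine, since the speakers do not state them: one person is on call at a time, shifts last one week, and everyone in the org takes a turn. The function name `years_between_shifts` is hypothetical.

```python
WEEKS_PER_YEAR = 52


def years_between_shifts(team_size: int, shift_weeks: int = 1) -> float:
    """Years from the start of one of your shifts to the start of your next.

    Assumes a single on-call slot shared equally by the whole team.
    """
    if team_size < 1 or shift_weeks < 1:
        raise ValueError("team_size and shift_weeks must be at least 1")
    return team_size * shift_weeks / WEEKS_PER_YEAR


if __name__ == "__main__":
    for size in (2, 4, 50, 150):
        print(f"{size:>3} people: on call every {years_between_shifts(size):.2f} years")
```

Under these assumptions a 150-person rotation comes around roughly every three years and a 50-person rotation roughly once a year, in line with the figures quoted here, while a two-person rotation is the every-other-week schedule mentioned earlier in the conversation.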

Speaker 4 (01:00:01):
I kind of like, I feel like computer systems like
mirror a lot of like real life. So whether it
be like, people who don't understand SRE, they're kind
of like, what in the world do you do? I'm like,
I'm basically like an emergency room doctor, and like somebody
comes into the emergency room and my responsibility is like
keep them alive until the right person can come in,

(01:00:23):
and you know, get them into like critical condition, not.

Speaker 2 (01:00:27):
Even not even the surge end, but you know, high level.
I understand how how patients work and the sets of
problems that can happen, and I can direct people on
the right places.

Speaker 1 (01:00:37):
Yeah, the wet stuff on the inside and this dry
stuff on the outside.

Speaker 5 (01:00:40):
I love it.

Speaker 2 (01:00:44):
You said, you said nuclear physics. Well I don't think
you said E M T right right right, Yeah, true, true, true.

Speaker 1 (01:00:52):
But I do spend not to just qualifies my opinion
by any means, but I do spend a lot of
time out in the wilderness, and like the wilderness, the
wilderness first aid stuff I've been exposed to is pretty
much that keep the wet stuff on the inside, dry
stuff on the outside.

Speaker 2 (01:01:12):
Okay, so now I'm imagining Bear Grylls. I don't out
back are you know, frequently known on the internet memes
for drinking his own You're unnecessarily.

Speaker 1 (01:01:27):
But the likes he got, that's the metric.

Speaker 2 (01:01:31):
You gotta wonder what he was really going for. Was
he going for, you know, teaching survival skills or was it,
you know, just to become a paid influencer.

Speaker 1 (01:01:39):
I'm pretty sure he's got a mortgage to pay.

Speaker 6 (01:01:43):
Yeah, so.

Speaker 1 (01:01:50):
Someone who's interested in pursuing a career in SRE.
Give us like your top rundown pros and cons and
things you'd wish you things you wish you had known
before ending up here.

Speaker 5 (01:02:05):
There's a lot of different paths that you can take
into SRE.

Speaker 6 (01:02:09):
One thing I.

Speaker 4 (01:02:09):
Didn't really touch on, and I know we're close on time,
so I won't go into it too much is like
a performance engineer. That's very important for an SRE
team to do performance testing, all these kinds of things.
That's a huge part of SRE, is like resource allocation,
and you need performance testing in place to better make
those decisions. So performance engineering, like I can say, like

(01:02:32):
from my background, like coming in from application engineering, there's
a lot of value there. 'Cause you understand like how
these micro, how these microservice systems

Speaker 6 (01:02:40):
Like work together.

Speaker 4 (01:02:42):
And then like the systems in Ino your background has
like an ability to debug at like the network level
really well and things like that, and all of those
things come in and offer value to an SRE team.
I guess things I wish I would have known kind
of like we talked about the beginning of the episode,
I I'm like very much like go go go, and

(01:03:07):
like early in my career, everybody was like, you're gonna burn.

Speaker 6 (01:03:09):
Out, like slow down. I was like impossible.

Speaker 4 (01:03:15):
And I still say, like I didn't burn out because
I still absolutely love it. But I really did not
realize that like it actually did start to like the
lack of sleep just did start to kind of like
take a toll on my actual health a little bit.
So like lesson learned to myself is like, be be

(01:03:38):
cognizant of like your personality and like your you know what,
some of your weaknesses might be there, because like for me,
I absolutely love what I do and I want to
be able to do this for a very long time.
So I kind of came to this realization that I
was like whoa, like if you don't put it, and management,
like my manager was fine, like he would encourage me

(01:03:58):
to do more of it, not like my director like
same thing, like dude, like you you need to take
some time off. So like my advice would be like,
if you're more on that side of the spectrum, like
know yourself and allow yourself that time because it will
make you better in the long run.

Speaker 6 (01:04:19):
Other things.

Speaker 4 (01:04:22):
I would just say, like if it sounds interesting to you,
like don't shy away from it because it is I
feel like a huge there's like a huge surface of
areas you can tackle. But if I'm always like of
the mindset like anybody can learn anything obviously, like some

(01:04:43):
people are like more technically talented, they come in with
more experience.

Speaker 6 (01:04:46):
Those sorts of things.

Speaker 4 (01:04:46):
But if it's interesting to you, like and you have
the drive like.

Speaker 6 (01:04:49):
Go for it.

Speaker 4 (01:04:50):
Those would be my biggest like two takeaways and.

Speaker 2 (01:04:53):
Through the power of LLMs now, you too,
everyone can be an expert in whatever topic you know
you fancy. I mean listening to you talking to me
about this, it really makes me think. Like you say,
you really have to care a lot, But caring a
lot means understanding the business and gives you an edge
over everyone else who may not care as much. They

(01:05:14):
won't give them a drive to really dive in and
be able to make the right decision. But then there's
the question that comes up of like if you care
too much, you're liable to burn out from doing that,
and you really need to know yourself. And that's a
tall order I think for a lot of people.

Speaker 4 (01:05:30):
Yeah, Like it was very it was a challenge for
me to be like you can't be the hero twenty
four to seven or like you are not going to
be able to like sustain this, but there's.

Speaker 2 (01:05:44):
Just one more fire over there that you know how
to deal with.

Speaker 4 (01:05:48):
Oh lord, Yeah, I mean I guess I would say too,
Like I'm probably at a place now where the thing
that I did really enjoy is like I did get
to serve in more of like an engineering manager role
on the SRE team towards the end of my last role.
And you know, everybody is like driven by different things,
but you can kind of see people in the org
that have a lot of potential and starting to kind

(01:06:11):
of like mentor them and hand things off to them
and things like that can.

Speaker 6 (01:06:14):
Be very rewarding and.

Speaker 4 (01:06:17):
Really like the best thing you can do is this
sounds like so hand-wavy, but like force multiplier. Like,
you know, whether you're an actual manager or in a
senior role or anything like that, like what can you
do to help the company and the teammates like fall
into the pit of success?

Speaker 1 (01:06:36):
I like that analogy important leadership skill.

Speaker 2 (01:06:41):
Learn to delegate, Right, you have to.

Speaker 4 (01:06:43):
Be okay at a certain point in your career with
like not being the hero, because if you're in that
kind of role, like sometimes your value is more so
in enabling others and building out parts of the system
that like help others fall into the pit of success.
And if you kind of like are one of those
people that feels like they always need like the like recognition, like,

(01:07:05):
then like leadership type roles might not be the path
for you because a lot of the recognition is going
to be like hidden behind the scenes of enabling others.

Speaker 5 (01:07:15):
Yeah, but it's warm, fuzzy. You like the girl and he.

Speaker 6 (01:07:17):
Likes it, so.

Speaker 1 (01:07:20):
I agree with you. But it's a it's a completely
different high probably not the best analogy in the world,
but that's the one that just sits in my head.

Speaker 4 (01:07:28):
It's more so like I would I guess I would
almost equate it to like delayed gratification, Like you're not
going to get the like attagirl in the moment,
but it might come like two years later when somebody
that you managed comes back to you and be like, man,
like you taught me this thing and I used it
today and thank you, and you're.

Speaker 6 (01:07:47):
Like really good.

Speaker 2 (01:07:50):
Right. It's interesting that you say that, though, because that
actually means that some of the most enjoyment you get
out of it if that's the right word. Is actually
seeing others execute effectively and seeing back on what you've
taught them, which indirectly has no impact on the business
in a way, especially if they leave and go somewhere else.

(01:08:12):
So you know, there is this weird paradox there.

Speaker 6 (01:08:16):
Yeah.

Speaker 5 (01:08:16):
Yeah, yeah, it's a little of both.

Speaker 6 (01:08:18):
Yeah.

Speaker 4 (01:08:18):
For sure, it's helping the business, and hopefully you help
other people on the team that help the business or
just go on and do good things themselves.

Speaker 2 (01:08:26):
I don't know if this is relevant, but this keeps
coming to my mind in this conversation. There's this SMBC
Saturday Morning Breakfast Cereal webcomic issue where Superman's like, how
can I deliver the most value possible? And you know,
the mayor of the city is like, well, you know
what actually saving people? Like you're pretty slow at that.
If we look at it, you know what would really

(01:08:47):
work is if we just had more energy and to
make this, you know, relevant for twenty twenty five, to
obviously power all of those LLMs to help people. And
so can you just turn this one giant crank really
fast as you can. That's just your whole life just there,
And like decades later, he's still turning the crank and like,
that's not a very rewarding experience, but that's what gives

(01:09:09):
the most value to, uh, civilization. And you know, one day,
the years later, the mayor wakes up and says, you
know what, actually, we don't need you anymore. We automated
your job away with nuclear fusion, and now we generate
unlimited energy all the time, basically without doing anything.
You can just retire.

Speaker 4 (01:09:30):
Well, hopefully that doesn't happen. You kind of, you're always
wanting to invent new ways to make yourself valuable.

Speaker 6 (01:09:36):
That's part of being in tech.

Speaker 1 (01:09:38):
I don't see it's running out anytime soon, which is
good because I've got I've got a mortgage to cover, So.

Speaker 2 (01:09:46):
I don't know. I hate to have the dissenting opinion here.
I feel like there is just people making the same mistakes.

Speaker 4 (01:09:52):
Yeah that's true too, Yeah, the same, the same underlying mistakes.

Speaker 6 (01:09:57):
Yes, it's history repeating it.

Speaker 1 (01:09:58):
Yes, so Warren, if you could just compile a list
of those.

Speaker 2 (01:10:02):
I tried, so I have tried to, you know, write
down so what happens frequently. I always thought I'd be
doing engineering, But what I do is I write blog
posts and I record podcast episodes, and I get triggered
sometimes while I'm doing other things in my life, talking
with my colleagues and whatnot, in the communities that I'm in,
so much so that I feel compelled to have to
write something. And so there is a list of things

(01:10:24):
out there of problems that I've seen and my recommendations
for it. But some of them are like really nuanced thing,
Like there's one particular little thing in security that everyone
seems to run into, and then there's like a forty
minute conference talk on it or a blog post on
it that I'm sure no one has ever read.

Speaker 1 (01:10:44):
Well, it sounds like a really good point. Put all
those words together into one coherent sentence in the right order. Right,
That sounds like a really good time to move on to picks,
and maybe you can point someone to some of those resources.

Speaker 2 (01:10:57):
Uh, I'm not going to plug my blog as my pick,
but I'm happy to put a link for this episode
for anyone that wants it. Actually, I'll find the Saturday
Morning Breakfast Cereal episode too and put that. But my
pick actually is going to be based off our conversation.
There is a book by Peter Senge called The Fifth Discipline,

(01:11:17):
and it's all about systems thinking. And there's this one
great example in the book about it's a canonical thing
that I believe it's taught at I think it's Harvard
Business School, but also at MIT in the economics theory.
And that's basically, there is a brewery that makes a
beer and they make enough for a distributor to sell

(01:11:39):
the beer to retailers for people to buy it, and
every retailer sells like your local store sells two cases
every week. And then in some random week, a new
pop song comes out and everyone loves the beer, and
because the band mentions the beer in the song, and
from that point on, every single retailer just increases the

(01:12:00):
number of cases of beer they sell by one, to three.
And this causes some ridiculous effects to happen throughout the
whole ecosystem. The distributors will fall over, the brewery will
get some massive orders, and basically what ends up happening
is lots of people will be fired for failing to
deliver effectively. And it's just such a small change to happen,

(01:12:21):
and you can see huge impacts, and I feel like
it's a really interesting case study of what happens over
and over again in lots of organizations, and relates
a lot to what Aimee was talking about in today's episode.
So it's relevant.

Speaker 1 (01:12:34):
That's pretty cool. That's called the Fifth Dimension, Discipline, The Fifth Discipline, gotcha,
by Peter Senge, yeah? Right on, Jillian, what about you?
What'd you bring for a pick?

Speaker 3 (01:12:45):
I don't know. I was going to pick. I'm giving
a talk next week on HPC, And now it occurs
to me that I don't know if that's going to
be like recorded and put any place, So that's maybe
not my pick. I don't know, like Horizon Zero Dawn,
the video game I've been playing, and maybe the talk.
Maybe the talk too. If it's actually published and recorded,

(01:13:06):
we'll see.

Speaker 2 (01:13:06):
You can always record it yourself on your laptop and
then you'll have the transcript, you can publish the slides,
and then you can you know, release it to all
your faithful audience there.

Speaker 3 (01:13:16):
Yeah, that's true. I mean I could just help somebody
record it on my phone, right, Like, I don't know.

Speaker 1 (01:13:21):
Okay, so we're gonna count on seeing your talk.

Speaker 3 (01:13:24):
Then, yeah, sure, that's the point. It's supposed to be
self promotion, like, right, that is what I do.

Speaker 1 (01:13:31):
Aimee, I know you've done this before. What do you
have for a pick this week?

Speaker 4 (01:13:35):
See I was debating if I should pick something technical
or non technical. I guess I will try to break
it up a little bit. I'm going to pick a
band that I am like absolutely obsessed with and I
can't shut up about it, And because other people won't
listen to me anymore about it, this is a great
opportunity to share it. My husband is really tired of

(01:13:57):
me talking about them and blasting them, so, and that
is Sleep Token. I don't know if anybody's ever
heard of it or heard of them, but I will
drop a link. I first heard about them like maybe
like two or three years ago, and they have a
bunch of new songs out this year, and I am, yeah,
it's when I've had like a long, stressful day at work.

(01:14:18):
I don't know why, like I listen to them and
it's like peaceful to me to fall asleep to. So whatever,
maybe other people will like it too.

Speaker 2 (01:14:25):
What's the genre?

Speaker 6 (01:14:27):
That's the thing?

Speaker 4 (01:14:28):
So they are like they're all over the place, like
I guess traditionally more like metal. They're almost like a
little bit like EDM, a little bit metal, like honestly,
like a tiny bit hip hop something like they're just
like you listen to their song and you're like, oh,
this is like really relaxing, and then they start screaming

(01:14:49):
and then, and then it's like back to like a
hip-hop kind of vibe, and I'm just like, what's happening.

Speaker 2 (01:14:56):
It's just like you're gonna say, like, yeah, it's like
it's like Haydn or Dominick the Christmas Donkey. I just
use that music to fall asleep at night. I'm just
like really like, not my personal choice, but uh, you know,
if it works for you, that's pretty interesting.

Speaker 4 (01:15:10):
I like metal, so why I absolutely love them, So yeah,
that's that's going to be my pick.

Speaker 1 (01:15:17):
That's awesome. I was just as you were saying it
and you're like, oh, it's so just what I want
to listen to after a long, stressful day. I was like, oh,
this would be so great if it's a metal band,
and then you're like, surprise, guess what.

Speaker 4 (01:15:27):
I mean, they're not metal in the sense of
like, like I really like this band Motionless in White.
If people have heard of them, they're not metal in
that sense. They're like a little more chill than that.
But yeah, right on, I'll drop a link before.

Speaker 6 (01:15:44):
We hang up.

Speaker 1 (01:15:45):
Cool. Cool, all right, So my pick I switched it
since we started recording because it's now a relevant topic.
On Netflix, there is a series called The Bear Grylls
Celebrity Hunt, and so it's all of these celebrities who

(01:16:07):
are taken to Costa Rica and then they compete, and
the losers of the competition have to go into this
area that's like walled off and they have to survive
in there for one hour while Bear Grylls is hunting them.
So they have an hour to either escape the bear
pit or get caught by Bear Grylls. And it was

(01:16:30):
actually just super fun to watch. And I brought it
up at work saying, Hey, for our next off site,
we should go to the bear pit and have Bear
Grylls hunt us down.

Speaker 2 (01:16:40):
Where Bear Grylls will actually be well buttoned.

Speaker 1 (01:16:44):
I'd be cool with that.

Speaker 2 (01:16:45):
Yeah, I feel like that may create some sort of
psychological safety issues for coming back to work later.

Speaker 1 (01:16:52):
Yeah, there's probably a downside to that, but let's not
focus on that.

Speaker 5 (01:16:56):
Makes me think.

Speaker 4 (01:16:57):
My husband, Like, at first I thought I was like,
what are you watching? But he watches this like it's
probably like a famous YouTuber people probably know. I can't
think of him, but like he's always out in the
wilderness like doing things, and my husband loves watching him.
And initially I was like, this is crazy, but now
I kind of like it too.

Speaker 5 (01:17:14):
All right, he's usually in Alaska.

Speaker 1 (01:17:17):
There have been a lot of those episodes where my
kids were watching something like that's stupid. Why are you
watching wasting your time on that? You know, like two
hours later, I'm still sitting next to him watching it.
But anyway, Aimee, thanks for being on the show. This
was a fun chat. Thank you for your insights and
your expertise and sharing your thoughts on SRE. Warren, Jillian,
thank you both for jumping in and co-hosting today.

(01:17:39):
It's good to see you both, and for all our listeners,
thanks for listening.
