Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:04):
You're listening to the Platform Engineering Podcast, your expert guide to the fascinating world of platform engineering. Each episode brings you in-depth interviews with industry experts and professionals who break down the intricacies of platform architecture, cloud operations and DevOps practices. From tool reviews to valuable lessons from real-world projects, to insights about
(00:27):
the best approaches and strategies, you can count on this show to provide you with expert knowledge that will truly elevate your own journey in the world of platform engineering.
Welcome back to the Platform Engineering Podcast. I'm your host, Cory O'Daniel. Today I'm joined by Tony Meehan, co-founder and CTO of Prequel. Tony's background's wild. He spent a decade at NSA working on vulnerabilities, so probably knows
(00:50):
a tad bit more about security than I do. Led engineering at Endgame and Elastic, and now he's on a mission to make software more reliable by shifting how we think about detection and failure. He's also one of those folks who genuinely loves bugs. Don't we all? I gotta hear your weirdest production bug story at some point in time, but we're gonna get into all that today. Tony, welcome
(01:11):
to the show. Can you tell us a little bit about you and how you got into the space?
Yeah, Cory, I'm excited to be here in the dude shed with you out in LA. I mean, where to begin? I just... yeah, a little obsessed with bugs. Definitely kind of picked up the itch at the NSA looking for bugs for 10 years. I don't know, I think finding bugs is, it's
(01:32):
like the best way to learn how software works and have a better understanding of systems, and it kind of scratches an OCD itch. You know, a little obsessed, but just my entire career, that's sort of been the theme. And so, you know, after the NSA, I ended up joining a startup called Endgame, where we were building an endpoint security product. And so I got to do more of that, and ended
(01:52):
up joining Elastic a couple of years later, which was super fun. I loved working there. But yeah, I think I've always sort of had this nagging obsession with finding bugs and really wanted to find ways to... you know, my background's cybersecurity, but, you know, building software products, you would always have, you know, an outage, an incident. And it was exactly the same experience looking for a vulnerability as it was finding
(02:16):
out the root cause of some outage or some big bug. So definitely scratched that itch as well. And we were just excited about taking some of the lessons that we learned in the security community in how to build a community, have people work together through tools, and take that to the space of reliability problems. And that's kind of how we ended up here. So, anyway, it's awesome
(02:36):
to be with you. I love listening to your podcast and yeah, great to talk to you today.
Oh, I appreciate that. Thank you, I appreciate that. That's cool. So it's funny thinking about bugs and failures. I feel like as a software developer, we're writing software, we're opening a PR, my team's looking at this PR, going to merge it. There's so much headiness around that PR, right? There's the person who's not
(02:59):
familiar with the feature that you're working on, reviewing the code, trying to make sure that there's no failures, bugs, weird quirks that are getting introduced. But then there's how that code can interact with the rest of the system once it's merged. The idea of trying to find these failures across the entire code base, I feel like, is an overwhelming idea for an individual
(03:20):
software developer, even a team working on this. When you start thinking about failures not just as incidents to clean up after, but as problems that you are trying to detect before they happen, how are you thinking through these problems? And where does this integrate with my tooling as a software developer? Where do I put this in? Is it in my build locally? Am I putting this in CI/CD?
(03:40):
Is it analyzing just my PR or the whole code base, kind of reflecting on what's getting merged together? Would love to learn a little bit more about that.
Yeah, good question. A couple things maybe to talk about there. The first is, yeah, it's, you know, putting up a PR, there's abstraction upon abstraction upon abstraction. You're new to a code base, it's a big team, there's just... there's a lot of complexity
(04:02):
and a lot of hidden complexity that's abstracted away from people. And so I mean, failures and bugs and unexpected interactions are inevitable. And one of the things that... it is overwhelming. It's an overwhelming experience. And as you build up even more experience as a software developer, these things are still going to happen.

Actually, I just saw someone tweet this today.
(04:23):
They were talking about how they learned something the hard way. And I think that's a bummer. I think learning something the hard way is a bummer because someone else somewhere has run into the same problem before. Almost always. You know, of course you can have bespoke bugs in an application you're writing, but at the same time there's common developer anti-patterns that you can introduce into whatever you're writing. So I think
(04:45):
the starting point for us, like a good anchoring point, is, like, you know, contrast it with the security community: anytime there's a new problem, you have this massive community of threat researchers that are posting their blogs, where at the end of it they'll talk about, here's how you go and find this problem. But, you know, in reliability and software bugs, it's like you're on your own. Like, good luck, you have to go do the
(05:06):
investigation. And then after, like, hours or days of looking into it, you know, sometimes you'll stumble across, like, "Oh my gosh, someone else has run into this exact same problem, and I'm the one that learned it the hard way. And here's how I could have detected it, here's how I could have mitigated it." And I think the thing that we get really excited about is, like, "How do we just start there?" Like, let's start with a detection. How do you leverage
(05:29):
community knowledge so that other people that have learned it the hard way can give you the benefit of that experience, so you don't have to. That's, like, kind of the starting point. And then,
(05:52):
and we can get into this, but kind of how our approach works is we work with this community of problem detection engineers. We have our own reliability research team where we write rules that kind of codify this knowledge of failure - like misconfigurations, known issues in open source software, or developer patterns, like I said - and couple that detection with a mitigation in the form of a rule, and then actually
(06:14):
run that on the data where the data sits. So instead of sending data, you know, continuously somewhere else - that can get expensive or unpredictable - we actually bring the rules to the data,
(06:36):
and the detections that the community is finding. And so that way, if some new problem comes up in, you know, Apache Kafka, or you're using something else, you know, like an ORM for SQL, you get the benefit of it.
Yeah, and I feel like for any engineering team, that's great, right? I mean, I've seen ORMs where they were SQL-injection proof, but then there's, like, this one corner case where it's like, you sure as heck can SQL inject that part, right? And it's hard.
(06:57):
Like you, as a developer, you're trusting that this library is tested. Right? And it may or may not be a CVE, but SQL injection is going to happen sometime, right? And to find that is hard. So as a developer, just working on your product for a customer, that is a tough problem space to think about. But I feel like for platform teams, where we're integrating with an unknown number
(07:22):
of plugins, I feel like that is even more of a... I've got to tie into Terraform and whatever Terraform modules these people are remotely referencing. I've got to tie into, you know, this configuration to hit a web server to push some metrics. Like, there's a lot more integrations that we face as platform engineers. I feel like this fits great in the general
(07:43):
engineering population, but for platform engineers, I feel like this is pretty critical in making sure that the systems that we're building aren't introducing things like server-side request forgeries and whatnot.
Yeah, yeah. I mean, look, we started off this journey really focused on developer bugs and applications, but as we worked more with customers - we're still doing that - what we have learned
(08:05):
is you're exactly right. All of these interactions between different systems - Argo, Terraform, Vault - and then all of the sort of infrastructure components just to, you know, connect applications, the message queues and just all of these different things, just these interactions - there's just always something going wrong. There's always, like, this chaos that's happening. Sometimes you're
(08:26):
able to recover from it, and sometimes it's just sort of this lingering thing that eventually is going to become some customer issue. And I think, as we saw that - and then we would go do research on these problems and see that there were many people that had already run into them, talked about how they fixed it or mitigated it, and it's some open source GitHub issue or whatever it was - we got excited about the idea of how do
(08:49):
we actually codify that knowledge in the form of intelligence. We call them Common Reliability Enumerations. So it's a rule that would allow you to basically automatically know about this thing instead of discovering it later. So that's kind of where the idea came from. You know, you asked something earlier about an interesting production bug. Actually, there's an origin story there too. When I was at Endgame in 2017,
(09:12):
I think... I actually ended up blogging about it, because again, it's like... it was definitely a high-tension moment, because it ended up being like a six-day partial outage for a couple of customers at Endgame.
Oh no. Yeah, that's rough.
We would have these message backups and they just wouldn't get processed. We were using NATS, which is great. We actually still use NATS now, but we were super early adopters. We ended up
(09:35):
spending several days debugging the problem and discovered there was a deadlock in the client library that we were using. And about four or five weeks before that there was a GitHub issue where a bunch of people in the community had discovered this problem and, like, "Here's how you can fix it." And so we ended up, five or six days later, finding the exact same issue. And it was like this bittersweet moment of, like, "Okay, we've tested this, we
(09:57):
know this is a problem, this fixes it." But also, "Oh no, what do we do if this happens again? How would we know about something like this next time?" So I think that kernel, that experience, the most interesting production outage, I think blossomed into what we're doing today. It just, in a good way, haunted us.
Yeah, I feel like I had that very similar thing happen to me recently,
(10:19):
and I was just banging my face against a wall for hours. It was a minor update, I think it was, like, the Erlang OTP version that we use, and it changed something in how OpenSSL worked. And all of a sudden... it's like a minor update, our entire test suite passed... as soon as it hit production, no email notifications
(10:41):
went out - just ceased. And so we immediately rolled back. But then it was like, we need to do this upgrade for some other stuff, but there's something about this that just breaks the way authentication works with SMTP. And, like, it's just all of our... all of our stuff, like, everything's just worked for... the code base, it's like this part that hasn't been touched in three years... just ceased to work after, like, a minor... a minor version upgrade too. And it's just
(11:03):
like, what has happened? We couldn't figure it out. And, like, we couldn't get that, like, magic incantation of Google to, like, surface it. And it was literally just, like, searching for hours - cutting this, cutting that, like, trying to figure out exactly what it was. And then we got a search term that, like, hit somebody, on this, like, very specific library, that was like, "There's this weird scenario that I'm in that it's not working." And it
(11:24):
was just, like, 38 conversations down on GitHub was the answer, and it was just like, "Okay, the answer is bump to the next minor version up."
Yeah, yeah, yeah. And that's actually, that's kind of the funny thing, is that a lot of the time the answer is like, "Yeah, you gotta upgrade to this newer version that just came out, you know, a month ago, that fixes your specific problem you're having right now." I love those explorations and investigations to
(11:47):
figure out, like, "All right, we don't know why this is broken, but we gotta figure it out." And then that moment where you finally do figure it out, it's like, "Oh, my gosh, this is awesome. We figured it out. This is great." It's like a nice endorphin release. It's a fun experience.
It is, it is. Okay, so is there something... is there a rule that Prequel has today that, if you... that you're like, "Okay, I
(12:07):
know this one's a problem."? Is there one that will give people listening right now anxiety, that they're like, "Oh, my God, I didn't think about that."
Oh, what an interesting question. If you've been around building software applications... even for a short amount of time, but long enough... like, you probably already have this anxiety, like, you already know where the bodies are buried. Like, "Oh, man, this
(12:28):
is... this is not going to be good." I think the thing that's really interesting is when you can do sort of this... We've built this distributed matching engine that allows you to do things like sequences of events - A followed by B followed by C - and to do correlations on those things, like, "Hey, on the same IP address or host name", and then with negative conditions too. So, like, false
(12:51):
positives is a thing that you have to kind of pay attention to, because if you're telling someone there's a problem and it's not a problem - you do that long enough, they're going to ignore it. And then there's a problem and it gets ignored. So I say all of that because some of the rules that we have, they'll look for if you run containers inside of certain cloud environments
(13:11):
that have a cgroup configuration where, when child processes crash - like an OOM crash - it prevents the main container from crashing. So you never know this is happening. It's like a silent OOM. You know, when you see, like, nginx start having worker processes silently OOM because it's trying to process too many ingress objects at the same time, that then
(13:31):
produces these 500s for your customers. Like, kind of stringing all these things together, I think... that's one of the things that we've seen a couple times, where people thought things were going fine and then we would sort of piece together, like, "Hey, we see this problem happening in nginx, coupled with some stuff that we're seeing in Kubernetes events, as well as this application" - like, putting those three things together. I think
(13:52):
people didn't even know there was a problem with nginx because, again, the container was just running. So yeah, I think it probably depends on what technology you're using. Like, there's also problems with RabbitMQ that don't even produce metrics to trigger alarms. There's some example of this for every technology. There's probably too many to go through. But yeah, that's a good question.
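To make that concrete, here is a rough sketch of how a sequence-style detection like the nginx example above might look as a rule. The key names and log patterns are illustrative guesses based on this conversation, not the published CRE schema (the real examples live at github.com/prequel-dev/cre):

```yaml
# Hypothetical CRE-style rule; field names and log patterns are illustrative.
# Idea: nginx workers silently OOM-killed, then 5xx responses on the same host,
# with a negative condition to cut false positives.
rules:
  - cre:
      id: CRE-EXAMPLE-NGINX-SILENT-OOM   # made-up identifier
      severity: high
      title: nginx worker processes silently OOM while serving traffic
      mitigation: Raise the container memory limit or reduce ingress objects per reload.
    rule:
      sequence:                  # A followed by B, in order...
        window: 5m               # ...within a five-minute window
        correlations:
          - hostname             # all events must share the same host
        order:
          - event: kernel_log
            match: "Out of memory: Killed process"
          - event: nginx_access_log
            match: '" 5\d\d '    # any 5xx status code
        negate:                  # skip hosts where the failure was already visible
          - event: kubernetes_event
            match: "Back-off restarting failed container"
```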
(14:13):
Yeah, it's a good segue too, because everybody that's now panicking about silent out-of-memory issues with nginx and RabbitMQ right now needs to go check out at least two open source libraries, right? At least two.
Yep, that's right. That's right.
So Prequel's open source. You have already open sourced them. So I want to get into what they are, and then, like, what brought you
(14:34):
all to open source them, being that, you know, they were previously closed source. So it's CRE and Preq?

Yep, that's right. GitHub.com/prequel-dev, and then CRE and Preq. And so CRE is where the community is working together to publish these CREs - these rules that describe problems and mitigations and how to detect them. So it's kind of like marrying
(14:57):
that knowledge in a way that makes it shareable and automatically updatable. And then Preq is how you actually use those to go and detect the problems in your environment. And so that tool, Preq, runs on Mac, Windows, Linux. Runs in Kubernetes. We have lots of exciting things planned for it. You basically take those rules and run them on your data. And the way you can plug it
(15:17):
in is it can run standalone, or you can run it as a kubectl plugin, or you can run it inside of your Kubernetes cluster as a CronJob.
Oh cool.
There's lots of different ways to consume it. And yeah, I'm happy to get into the motivations for doing the open source, but those are the two tools that we just launched a couple of weeks ago.
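For the Kubernetes option, the wiring is a few lines of manifest. The sketch below is hand-written for illustration - the image name, flags and schedule are assumptions, not the project's documented values:

```yaml
# Hypothetical CronJob that runs preq against cluster data once an hour.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: preq-scan
spec:
  schedule: "0 * * * *"            # hourly
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: preq
              image: example.com/prequel-dev/preq:latest  # placeholder image
              args: ["--rules", "/rules"]                 # placeholder flags
```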
Yeah, yeah, let's talk about the tools a bit and then we can get into motivations. I love talking about... especially, like, given all the big license changes recently... I love seeing
(15:39):
companies still open sourcing products and, like, kind of what drives them to do it.
Yeah, it matters.
It does, it does. So Preq, you can run it locally too. So I can bring it into, like, a pre-commit and, like, start to see this stuff before I even open a PR. I feel like that's one of the things that's, like, disheartening as a developer doing TDD: I sit there, I write these tests, I write this code, I get it working perfect, I push it up to git, and then all of a sudden Dependabot's
(16:03):
like, "You're a fool, you did that wrong." And then I have to go redo something, right? And so being able to run it locally and bring it right into a pre-commit, before I even build locally, and have it baked in.
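If you want to try that pre-commit idea, the glue is small. A minimal sketch, assuming the preq binary is already installed locally - no official pre-commit hook is implied here, this just shells out to it as a local hook:

```yaml
# Hypothetical .pre-commit-config.yaml entry that runs preq before each commit.
repos:
  - repo: local
    hooks:
      - id: preq
        name: preq reliability checks
        entry: preq                # assumes preq is on your PATH
        language: system
        pass_filenames: false      # run once per commit, not once per file
```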
Yeah, and you don't even have to contribute back either. I mean, we want people to contribute back. But there are a lot of people that are writing rules today that are very unique and specific
(16:23):
to what they're doing, and yet they still benefit from updates from the community whenever there are new rules published, you know, every couple of days.
So let's talk about... so CRE, like, that's the root of it. That's where the rules are, then. And Preq's the tool that you run. So let's maybe talk about, like, CRE a bit. So, like, what are these rules? Like, are they language specific? Are they, like, protocol specific? Like, what level of knowledge do you have to have to
(16:43):
like, start working on and developing these types of rules?

The most important part about the Common Reliability Enumeration schema is that it's a schema - it's really just a set of fields. If you know YAML, you know CRE. So you know how to write a CRE.
Oh yeah, we all know YAML.
Yeah, exactly. You know, it's sort of... there's that famous XKCD
(17:04):
comic about, like, the query language to solve all query languages - this is the last one. You know, it's not a new language, it's just YAML. And very simply, it's just describing a problem: its severity, its impact, how easy or hard it is to mitigate. What is the cause of this problem? What is the impact
(17:24):
other people in the community have seen - like, when I saw this problem, this is what would happen. And also, like, if there's a mitigation. So, like you said, the comment that was buried 30 comments deep in a GitHub issue - like, here's how to fix it - like, how do you surface that up to the top? Yeah. And then coupling that information - what is this problem, how do you fix it - with the actual way to find it. So that way, when
(17:47):
you find it, it's immediately like the Google search with that term that, you know, you finally found has already been done for you. Like, it's right there. You just go do that thing, and then there are references to all of those results. So I mean, at a fundamental level, that's the idea. And the way the language actually works - well, it's YAML. But the way the description of the problem works is, like I said, it's a sequence of events. You could also
(18:10):
do a set, so order doesn't matter. But you're describing these conditions that must be true or not true within a window of time, with correlations that can help you find that problem. And for preq, the open source tool, the data sources that you can run the rules on are things like standard-in log data, configuration
(18:33):
data, and then the enterprise commercial version has a much richer set of data sources that you can run it on, like process events, Kubernetes events, time series data, lots of other data that you might be interested in looking at.
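Pulling those fields together, a CRE might read roughly like the sketch below. The exact keys are approximated from this conversation rather than copied from the published schema, so treat it as a shape, not a spec:

```yaml
# Approximate shape of a CRE, reconstructed from the description above.
cre:
  id: CRE-2025-0001                # made-up identifier
  severity: critical
  title: NATS client deadlock stalls message processing
  cause: Deadlock in an early version of the NATS client library.
  impact: Messages back up and are never processed; partial outage.
  mitigation: Upgrade the client library to a release containing the fix.
  references:
    - https://github.com/...       # the GitHub issue where the community discussed it
  rule:
    set:                           # a "set": order doesn't matter, unlike a sequence
      window: 10m
      match:
        - "slow consumer detected"
        - "no responders available"
```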
Very cool. So, like, as far as, like, SRE and, like, Kubernetes, you're going to just attach it to all the events that are happening, have that, like, kind of running on those events as they're coming through.
Yeah, exactly. Yeah. Yep. Like, if you're, you know, if you want
(18:55):
to know, hey, do I have a deployment with too many replicas scheduled on the same node or in the same cloud region - and I want to know about it, because there's a risk of an outage taking down my service - like, you can do things like that.
Oh, that's cool. So it's not just, like, CVEs and, like, oh, I found the SSRF. It's like, yo, this, this right here is
(19:15):
going to absolutely ruin your day.
Yes, yes.
When us-east-1 goes down again.
Yeah, exactly. It really is about trying to prevent people from finding out the hard way. We want to take advantage of that one person, that first time, that found out the hard way. Let's make that the last person that had to find out the hard way. And then take this knowledge and spread it out and use it in an automated
(19:36):
fashion, so that whenever that happens to someone else, it's detected and it's mitigatable, like, immediately.
Yeah. Oh my gosh, I wish I knew about this weeks ago.
Yeah, well, I mean, look, it's a new idea, it's a new approach. You know, we were doing this in security for a long time, but again, in reliability, when there's a problem, you just, you
(19:56):
go to your dashboards - you probably have, you know, tens or hundreds - and you're just sort of looking around for a while, and then you narrow in, and then you do exactly what you just said. You're googling, you're asking people. You know, it's a long, drawn-out process to, kind of like Neo from The Matrix, learn what's happening here. What do I need to learn about right
(20:16):
now, like, what's happening right now? So yeah, I think we're excited about, you know, instead of starting with an investigation, how do you start with the detection.
Ops teams, you're probably used to doing all the heavy lifting when it comes to infrastructure as code: wrangling root modules, CI/CD scripts and Terraform, just to keep things moving along. What if your developers could just diagram what they want,
(20:36):
and you still got all the control and visibility you need? That's exactly what Massdriver does. Ops teams, upload your trusted infrastructure-as-code modules to our registry. Your developers? They don't have to touch Terraform, build root modules or even copy a single line of CI/CD scripts. They just diagram their cloud infrastructure. Massdriver pulls the modules and
(20:57):
deploys exactly what's on their canvas. The result: it's still managed as code, but with complete audit trails, rollbacks, preview environments and cost controls. You'll see exactly who's using what, where, and what resources they're producing, all without the chaos. Stop doing twice the work. Start making infrastructure as code simpler with Massdriver. Learn more at
(21:18):
massdriver.cloud.
Let's say something's detected. Let's say that I have this running, maybe, so I can run it in a GitHub Action, yeah?

Yep.

So that will just, like, in my build, I'll just see that boom, that workflow fails and here's the issue. And then there's, like, a link to what the resolution is? Or does the tool actually suggest,
(21:39):
like, suggest the change?
So the preq tool itself - maybe a couple things here on where you can run it. People are running it in CI jobs, they're running it in Jenkins builds. But a lot of people are actually getting a lot of advantage, or a lot of benefit, from running it in production. So they'll run it in production, QA and Jenkins. So they try to find issues early, but sometimes things still slip through. And then you
(21:59):
can also run it as, like, you know, a build job, a CI job. And then when a problem is detected, you are presented with the CRE schema and the rule and the mitigation, the references. But you also have an opportunity to automate it with a runbook. So you can do, you know, things like create a JIRA ticket, send a Slack notification, or you can even execute, like, an arbitrary
(22:20):
binary or shell script, given the input of what was found, and take some specific action. And then there are even rules that you can specify in those automated runbooks, those automated actions. So for this CRE, if this happens, I want you to do these three things in order.
Nice.
It's sort of all about, like, hey, we do want a human to be able to, like, make a judgment on this call. But you could also automate
(22:41):
it, you know, if you feel very comfortable with that automation.
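As a sketch of what one of those automated runbooks could look like - the keys and action names here are invented for illustration; the point is the shape (a detection in, ordered actions out), not the exact syntax:

```yaml
# Hypothetical runbook attached to a CRE: when the rule fires, run these
# actions in order. Action names and templating are illustrative.
runbook:
  cre: CRE-2025-0001               # the rule this runbook responds to
  actions:
    - type: jira                   # 1. open a ticket for visibility
      project: OPS
      summary: "{{ cre.title }} on {{ event.hostname }}"
    - type: slack                  # 2. notify the on-call channel
      channel: "#oncall"
    - type: exec                   # 3. optionally run an arbitrary script
      command: ./scripts/mitigate.sh
      args: ["{{ event.hostname }}"]
```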
That's a cool integration, because, I mean, as I'm hearing this, I'm like, this is so cool. But I'm, like, immediately afraid to, like, throw it in the code base, because you're like, man, what if this just, like, stops all my builds? Because we've, you know, we've got these things that we just don't know about. But to be able to tie it into a runbook where it's like, hey, I want to know and I want a ticket opened.
(23:02):
Exactly.
I just don't want to halt the build, right? I would love to see warnings. But that is, that is pretty rad, right? And now I feel like that's... that's actually a bit of a boon, because now I feel like... this is one of those things that's so hard. I love this. This is one of those things that sucks so hard: communicating debt to, like, your project managers, product owners, et cetera, right?

And so it's like, oh, hey, person that's probably not looking
(23:24):
at Actions - there's a wall of stuff that's wrong here. And they're like, I don't understand any of this. But they do understand tickets, right? And if your runbooks start opening up a bunch of security vulnerability tickets, like, that could be the thing that helps you push through, like, this true SRE idea of, like, error budgets. Like, we have problems and it's hard for us to communicate it to the rest of the org. But now it's just like, look, this is
(23:47):
stuff that we need to focus on, and it's tickets, and somebody's gotta schedule it or close it right now, right? It's so much easier than showing somebody a wall of a failed build and being like, I told you, we've got some security things we need to deal with. Like, that is tight.
We actually do have several customers that have specifically said giving our product managers and leadership visibility
(24:09):
into, like, the daily chaos that we have to wrangle and fix is actually providing almost as much value as actually preempting - like, fixing those issues early, before customers are impacted. Yeah, because, yeah, it is a lot of sort of unseen work that platform engineering teams are having to deal with on a daily basis. And I think giving light and visibility to that is very helpful.
(24:33):
Yeah, that is tight. I remember there was a company I worked for, maybe eight or nine years ago, and we had this debt problem where it was like, the product had existed for, like, 10 years. It was revenue positive from, like, day one. So this company just went and went and went and went and went. And there was just so much debt, like, just left by the wayside. And it was just... everything was always in the pursuit of revenue. And this company
(24:54):
was very good at making money, but, like, the code base was just torturous to work with. And we had a really hard time communicating the debt, like, to the team. And we actually built some tooling internally to surface it. And so, like, we had these, like, comments that you could put in that was like, hey, this is impacted by another piece of debt. And so you could put in, like, the ticket number, and it would actually build a dashboard, like, relating back all
(25:16):
the tickets that were, like, slow and, like, off from their estimates, that were tagged with debt. And it would be like, you'd go look at a ticket that was like, hey, this is the debt, and you'd see, like, 48 PRs reference it. And so it was very visible. And what happened was the product managers started seeing how this debt was impacting the features that they were trying to
(25:36):
get out. And now all of a sudden, that empowered us to start prioritizing debt. It was easy to communicate. I feel like this is great for security teams and SREs that are like, we have problems. We know there's problems. And, yeah, that is a really cool integration.
And I think it's also... I think the thing that gets us excited is how do you take that one- or two-person team, small team
(26:00):
somewhere.
They're all small teams.
Yeah, yeah, exactly. And just, like, how can we all benefit from one another's collective knowledge? Like, how do we do that? How do we enable that? Yeah. And that, to us, is very exciting, because the exact same approach worked in security 20 years
(26:20):
ago, because it was the same story then. It was a small team of security people. And once people found a way to share, kind of instantly, when there was a new problem, the game kind of changed. And I think there's an opportunity in reliability to do the exact same thing.
That is pretty neat. To go back into CRE really quick - the rules are all open source, as a community. Everybody's putting those rules back in there. What quality gates are there
(26:43):
to make sure that people aren't, like, kind of poisoning the well?
That's a very good question. And this is perhaps where, you know, the NSA background is helpful, because...

He'll send people after you.

No, no, no, no. That's actually not what I meant.

No, it is what I meant. That's what I meant. It is what I meant.
Whenever you're looking into a problem, it's really important to
(27:03):
have a reproduction. You need a way to reproduce the problem so that you can validate, you know, what the problem is. You can see it.

Yeah.

And you can validate that you can detect it and even fix it. And so one of the kind of core rules that we have for any submission for CRE is that you have to be able to demonstrate the reproduction. So we're not discouraging people from using, you
(27:26):
know, AI to, you know, help articulate some of the words for your title and that sort of thing. But at the end of the day, without a reproduction, where you can prove the problem's happening and prove that the rule works, we can't accept the submission. And I think that's probably one of the most important quality gates for accepting rules in a community: demonstrating
(27:48):
that you have the reproduction. And it's not just a video - you've got to actually have a shared repository somewhere where there's an actual reproduction scenario that anyone can run. It's the scientific method. You gotta be able to allow the community to replicate the test you did.
That's pretty cool. I mean, it's rigorous, but, I mean, that's how you stop people from just populating it with junk, turning
(28:11):
it into adware for their security firm, right? That is very cool. And I feel like, so, bringing up LLMs there - like, being able to use an AI to punch up the titles and whatnot - how do you see the world of LLMs fitting into this? I feel like - tell me to bleep any of this if I have to - but I feel like a partnership between you all and either, like, GitLab or GitHub, where it's
(28:33):
like, they have issues, they have just walls of comments. I feel like there's just... there's so many of these CREs out there that people have found that are just, like, lost, 38 comments deep in GitHub.

That's exactly right, yes.
I mean, first of all, I think having a reproduction, it actually does require a lot of work. And the good news is that there are
(28:55):
many projects out there, like Istio and others, that have troubleshooting guides. Through their experience of people finding problems and doing the reproductions - because they ran into the problem - they've, like, taken all that knowledge and put it in these guides. And you can actually write rules from those things fairly quickly, which is kind of cool, but it's still a human doing it. So I
(29:17):
think there's a couple of things that we get really excited about with AI. The first is actually using it in the pipeline of reproductions. So I don't know if you've ever used OpenAI's Codex, but it's actually pretty cool. Like, you could watch it check out a Docker container, download your GitHub repository, you give it a task to say, "Hey, I want you to go increase my test coverage to 50%." And it'll go and do that, and it'll actually
(29:40):
test it and then put up a PR, and you can actually watch it do its work. And I think one of the things that we've been really excited about is leveraging models in a very similar fashion, but for the reproduction. And so I think that's one way that we get... we're excited about the future of scaling a process like this with AI. And that's sort of an important thing that we thought about with things like LLMs. I think another piece that's really
(30:02):
important is: the schema is nice - it marries this mitigation, the impact, the references with how to detect it - but sometimes nothing beats a really good story, especially whenever it's concise. And I think LLMs actually do a really good job of summarizing content, especially maybe complex content that's like, "Hey, first
(30:22):
this thing happened over here, then this thing happened over there." So another thing that we've been doing with LLMs is, when a CRE is detecting a problem, we'll actually take an LLM and say, "Okay, give us, like, the couple of sentences that describe the problem, and walk us through it step by step, just a couple sentences at a time, and use the actual context of the rule." And the cool benefit of this is that you're not taking gigs of RabbitMQ data
(30:48):
and putting it into an AI model and telling it, like, "Hey, tell me what happened." You actually have this intermediate step that's reduced your token count. It's more focused. And so the actual content you're sending to the LLM is much less, and it's cheaper and it scales better. You know, your CFO might be happier. So I think that's, like, the second thing that we get
(31:08):
excited about. And then the third piece is just in rule creation itself. Just, like, imagine when you're doing development in Cursor - like, the same exact experience applies to writing a CRE.
I don't know that it makes a CFO happier. I think you can only make a CFO less mad. I don't know that you can make them... I've never met one that you could make happier. You can make them less frustrated.

Yeah, yeah, yeah. Okay. Fair, fair, fair, fair.
(31:30):
If you have a good CFO, congratulations. Sorry.
That's fine.
I'd be curious - like, you said something there, like, about the Istio team. Like, it seems especially these teams that are managing extremely popular open source libraries, they actually have a wealth of this information, maybe codified back here [signals to his
(31:51):
head with his hands] or in their git repos. Is there, like, a means of... almost like a framework of how all these open source projects can get this stuff back in?
Actually, that's an excellent question. One of the things that we started working on in the last couple of weeks is partnering with open source projects. Because again, you're building up
(32:11):
this wealth of knowledge of known issues. And it's not just the open source maintainers. You know, a lot of these open source projects have commercial companies behind them, with customer success teams that have scripts that they run for all of their known issues. And so they're developing all of this stuff themselves. Just imagine a world where you can take all of that knowledge and share it, and put it in, like, a repeatable way that's detectable.
(32:35):
That gets really exciting to us, because it just kind of speeds up all of those teams and makes that knowledge something that can be automated by a machine and then leveraged by AI.
Yeah. And it's like, you know, if you are a for-profit company that has an open source tool like that, to share your private knowledge is good for you, because it's going to increase your open source
(32:56):
adoption, which is going to increase your pipeline for your enterprise product. Right?
Yeah. And you asked earlier, sort of, maybe, what's the motivation behind launching an open source project? I mean, I think there's a couple things there. I mean, I was at Elastic for four years and... 2019 to... maybe it was five. It was a while - it was long enough
(33:16):
to kind of see firsthand how open source communities just make the ecosystem better. That was really exciting to me. Elastic's open source community is amazing. I think when you're building a community and leveraging knowledge, it's really important to put the mission first. The mission is what matters. We want
(33:38):
a world to exist where learning it the hard way doesn't ever happen to anyone else - it just happens once. And so I think, in order to make that true, there shouldn't be a paywall between you and that objective. So the open source aspect of it, I think, is just really important from a mission perspective - like, how you actually achieve this goal. So that's why we went
(33:59):
Apache 2, that's why we launched those two projects with that license. And I think it's going to pay off. In the long run, we want the world to be a better place, and I think open source is an important... I mean, look, every commercial product that's ever been created uses open source. That's just a statement of fact. So yeah, I think that gets us excited. The mission-first
(34:21):
focus with open source - that's the way to do it.
Yeah, I think this is a product, and I think this is a space, that kind of raises the tide - like, it lifts all boats. Because I think the reality is there's so many teams that are using these tools that aren't security experts. And it's, like, it's funny, like, when I talk to friends that are outside of engineering and
(34:42):
software and cloud, they're like, what do you mean you're all not security experts? I'm like, dude, nobody... there would be zero software on the planet if every software developer was a security expert. We would be in 1992 still, right? And so the reality is everything that we do is impacted by the security constraints and experience of other companies.
(35:03):
Yep, yep, exactly.
And their outages, right? So it's like, it is hard. And it's like, you know, just knowing, in the time that I develop outside of CEOing, like, I have had exactly one of these. And it's like, we would have... I think we would have launched this feature we were working on, like, three days faster if this CRE hadn't cropped up
(35:24):
on us. Right?
Yeah, exactly. Just to make sure it's crystal clear - I do this because we have a background in security - CREs and Preq are actually not for... it's not security. It's specifically only reliability. There are lots of cool tools out there, like Snyk and others - like, you've actually had some conversations with folks at Snyk before. They're doing a great job handling and detecting vulnerabilities.
(35:48):
And we've actually kind of abandoned that world, because it is so big. And we get really excited about trying to take those lessons, those same principles, but to an entirely different space, at least to us, which is reliability problems. Just normal, plain old, you know, interesting software bugs. The sort of overlooked, I feel like, but very important, because whenever
(36:08):
you have an outage, it's typically because of a bug and not because of a vulnerability.
Even more important, because, I mean, if you look at the surveys, like, year over year, from Stack Overflow, State of CD - like, the amount of people with cloud operations experience is going down relative to the number of software engineers that we have, because we're just producing them out of boot camps - which is
(36:31):
great. We need more software developers - sorry, guy from Claude that disagrees with me, but we need more of them. But we also need more operations experience, right? And, like, that SRE-ness is, like... a lot of people's SRE, their reliability, is directly, or I guess inversely, tied to their cloud costs. How do they solve problems? They just overprovision. You want to start getting your cloud costs under control? It's not buying a cloud cost tool,
(36:53):
it's investing in SRE.
Yeah, right.
Being able to have more reliable systems with less compute is how you save money. Not by just buying a tool that's like, "Hey, this Aurora is expensive." It's like, "Yes, I know this Aurora is expensive. I have 85 gigs of RAM in it because I want to make sure it doesn't go down." It's like, get somebody that knows how to run the thing.
Actually, it's funny you say that. One of the biggest values that
(37:20):
we've seen customers get from taking this approach has been in reducing their cloud costs. Because you're right - in the past, when there have been issues and problems, you kind of have a couple of levers you can pull. One is, okay, let's go take some people off some high-priority feature and go investigate this problem and fix it. Another one is, add more replicas, scale it up, and hope
(37:43):
it happens less. And definitely, people are pulling that lever all the time, because it's fast, but in the long run it does end up costing you a lot more money.
Yeah, it was funny, I was just talking to somebody the other day - like, one of the things I love seeing in Terraform, one of the things that I try to do, is I try to express the configurations in the
(38:04):
developer's language. So, like, rather than say, "Hey, developer (who probably has no experience with AWS instances), which instance type do you want? Do you want an R6 extra-extra-large?", it's like, "I want one that's not going to wake me up at 2am." That's what I want as a developer, right? I like to present my Terraform in, like, very much the developer's language. So it's like, "Hey, how much growth
(38:24):
rate are you expecting on this database?", and then calculate, like, the instance type behind the scenes for them.

And so I feel like, you know, a lot of times you'll go into organizations where they haven't had somebody with SRE or operations experience, and you look at an Aurora instance and it's got 15 replicas, and you're like, "Why does it have 15 replicas?" And people are like, "Don't
(38:44):
know." Like, "Why are they R6 extra larges?", and people are like, "We... that's just, that's a...". It's like, this whole thing is expensive and we don't know why. And just being able to understand why, and understand, like, the reliability of it, is, I think, something that many organizations are missing. And I feel like a lot of them see that symptom of high cost, and they try to treat cost rather than taking a more professional
(39:07):
approach to reliability.
Yep, totally. What's exciting about technologies like Terraform and Docker is it's almost like this manifest approach of describing what you want. For Terraform it was infrastructure, for Docker, sort of, like, software orchestration. But the same doesn't really exist for how to detect problems - you know, what you're going to have to monitor for. And I think there's a real opportunity with CREs
(39:29):
to do the same thing that Terraform and Docker did for their respective spaces. It's like, instead of coming up with a detection only after the problem, how about we actually say, let's subscribe to the types of detections and monitors that we want to have in place first - originally, when we're actually constructing and building this project, this software.
Well, I know we're getting close to time. I have a few more
(39:51):
questions that are a bit more rapid fire I'd love to ask you. How do you feel about that?

Let's go.

So, first one, what is the weirdest or most memorable bug that you've actually chased down? If you can talk about it.
Oh, gosh, there's many. You know, I mentioned this problem that we had in 2017 with our NATS client and the deadlock. I actually
(40:15):
think that one still sticks out in my mind as one of the most interesting ones. You know, putting aside for a moment the pressure of having, like, a six-day partial outage for some customers - that was not fun. But we just had this mystery of, like, why is this happening? We can't explain it. And we ended up going into a lot of... I actually ended up blogging about it, like, on a Medium article in 2017. The NATS team, they're great. They
(40:39):
ended up, like, retweeting it, and we talked about it. But I think the reason why that one was so rewarding is because, sort of, the start of the journey is like, there's no obvious answer to why this is happening. We have no idea what's going on here. And we had to write some extra tools to do some additional introspection into the messages to kind of really hone in on this deadlock.
(41:01):
And then once we had the theory, like, we had to test the hypothesis. And so actually testing that out, and, like, seeing the reproduction, and then seeing the reproduction go away with the fix, was just like... I love a story that begins with, I have no idea how this ends. Like, no clue. Like, I was just... no, no idea, but we're gonna have to figure it out. And
(41:22):
then when you finally get that answer, especially after it's hard and, like, grueling, and you're learning something new - you're learning new skill sets to actually solve the problem - to me, that's, like, a really rewarding experience. And it's honestly why I get so excited about building a community around problem detection, because there are many people out there that have gone
(41:46):
through the exact same work. And wouldn't it be great if you could benefit from that? I think that's the thing that gets me really excited about it.
Yeah, I think the thing that's so cool with, like, this whole idea is, like, there's so many software projects that I've seen where they have, like, a whole section of their test suite around making sure regressions don't get reintroduced. And it's just, like, that's treating a symptom of a problem, not addressing the problem itself. Okay, so assume there is no CRE for
(42:10):
something you've just discovered. With your years of expertise in hunting bugs, like, what are some tips and tricks you've learned - if there isn't a CRE for this - for how to figure out what is causing this issue?
What a good question. Maybe I would think about... So Brendan Gregg is someone that you should know. Go check out his blog. He talks a little bit about... he calls it the USE method. There's
(42:32):
sort of a process that he describes for how to go and find problems through data that you're collecting. And I think that's a good read. So go check that out.
Okay, we'll put that in the show notes.
Yeah, it's good. There's sort of, like, a general approach, and then there are, like, specific skills to build that fit into this approach. So when a problem happens, you basically are a detective. You're
(42:54):
like a homicide detective. A murder has happened, and you have to start putting together a timeline. Like, what has happened? When did it first start happening? You're going to have witnesses you have to go interview. Those witnesses could be logs, it could be humans, it could be TCP dumps from Wireshark. You know, whatever it is, you've got to go and collect a bunch of
(43:16):
evidence. And that evidence might be lying to you. It might be, like, a red herring. It might take you down a rabbit hole.

A river full of red herrings.

It doesn't matter. And I think the other thing is, like, if you start making these assumptions - "Oh, I bet I know, it's this. Or it's probably that." - you're wrong. You're probably wrong. And so again, a detective, a good detective, is going to just follow
(43:38):
the evidence. He or she's going to look at the timeline and pull all this stuff together. And as you're putting the timeline together, different theories can start emerging. Like, "Okay, well, if I connect these dots in this order, it could be this", or "If I connect these dots, it could be that." So then you start thinking about, what is the most likely hypothesis here that
(44:01):
would explain how these things are happening? And you also see holes, like, "Oh, we don't know what's going on here. If the hypothesis is this is what's happening, well, we are missing a piece of evidence that would actually make this more likely. So let's go interview that witness." Like, we forgot to interview that, you know, whatever that is. So I think that, like, that general approach is really important when it comes to hunting down a bug. You're
(44:24):
a detective, you're putting together a timeline, you need to go interview witnesses, and you can't make any assumptions. And once you've collected enough data, you look at it, you see what theories emerge from it. If you have a theory but the data doesn't support it, you either need to go find more data or eliminate the theory, and then you start testing it. I don't know. That's a really good question. This is what I love. That's what I love - that is, like, it, right there.
(44:45):
That's funny. I'm like, I wish I could go back in time and just, like, vet this against, like, every time it's happened. I am definitely a jump-to-my-gut, like, "Oh, I got this." And I don't know how many times... this problem that I ran into the other day, I thought I knew exactly what it was. I drove about eight hours of effort into, like, "I know what this is", and I was absolutely wrong.
(45:06):
And it was just, like, I got to that point where I'm like, "I better go look for some data."

Yeah.

Like, I better go... And it was just like, dude, I spent so much time like, "Oh, I know exactly, exactly why this is happening."
It's so funny you say that.
Nowhere close.
I do it too. Honestly, I do it too. It's, "I've seen this story before. I bet it's this." It's, like, almost always when I see a
(45:27):
problem come up, that's, like, my first instinct. And I have to fight it a little bit, though, like, "Okay, but I could be wrong. I could be wrong." All that means is that you've just sort of had the experience build up over years, where you can accelerate that timeline and witness investigation process, and come up with the theory fairly quickly. But as long as you're still validating it and, like, going to look - I think that's, you know,
(45:50):
still the right approach.
It's been really great having you on the show. This has been super fun. So we'll put the link to the open source projects in the show notes, but where can people find you on social? On LinkedIn, X, Bluesky?
Prequel.dev is the website. All of our socials are on there. You know, we're on the Blueskys and the LinkedIns, but I would just
(46:12):
send people to prequel.dev. Check out our blog. Our reliability research team is always putting out new content. Like, you asked that question earlier, "What's the most interesting rule?" Every couple of weeks we're putting up a new story that ends with a rule at the bottom of it. So I would just go check out our blog. Definitely go check us out on GitHub, throw us a star. But better yet, try us out. We're looking to grow the community
(46:34):
and excited about building this future together with everyone in it.
And again, just a reminder for everybody: it sounds like it's easy to bring this in. You don't have to get it into your Kubernetes cluster - you can bring it down locally on your MacBook, Linux, whatever. Try it out, move it to CI, put it in production when you're ready.

Yes, exactly. You can use this soup to nuts without ever talking to us.

Hey, is there a thing an ops person loves more?
(46:57):
Exactly.
Awesome. Well, it was great having you on the show, and thanks so much for the time.
Thank you for listening to this episode of the Platform Engineering Podcast. Have a topic you would love to learn more about? Let us know at cory@massdriver.cloud. That's C-O-R-Y at massdriver
(47:23):
dot cloud. Catch you on the next one.