
January 10, 2025 64 mins

Kubernetes isn’t just a platform—it’s a revolution. On this episode of Fork Around and Find Out, Justin and Autumn sit down with Kubernetes co-creator Brian Grant to explore the origins of this game-changing technology. From Google’s internal tooling to the cloud-native juggernaut it is today, Brian takes us behind the scenes of Kubernetes’ evolution, including its roots in Borg and the creation of CNCF.

Brian also opens up about his fascinating career, from debugging GPUs at PeakStream to improving Google’s threading systems. Along the way, he shares his candid thoughts on Terraform, GitOps, and the future of infrastructure management. We’re talking insider stories, tech critiques, and the cyclical nature of trends like AI—all packed into one unmissable episode.

Brian is a visionary who’s shaped the cloud-native ecosystem as we know it. We can’t wait for you to hear his story and insights!


Show Highlights

(0:00) Intro

(0:31) Tremolo Security sponsor read

(2:42) Brian’s background

(6:20) What it’s like working on something great but it not being the right time

(9:17) How Brian’s work from the 2000s is still important today

(11:16) Why Brian said ‘yes’ to Google after previously turning them down

(12:59) The history of the FDIV bug

(16:49) What Brian was doing when his old company was bought by Google

(20:51) How Brian’s education helped him get started down this path

(23:47) Brian’s jump from Borg to Kubernetes

(32:27) The effect Kubernetes has had on the landscape of infrastructure and applications

(35:48) Tremolo Security sponsor read

(36:47) Times Brian has been frustrated at how people use Kubernetes

(41:05) The patterns Brian notices thanks to his years in the tech industry

(48:04) What Brian expects to see next from the new wave of Terraform-like tools and Kubernetes-based controllers

(54:58) Reflecting on Brian’s serendipitous journey through the tech world

(1:02:18) Where you can find more from Brian

About Brian Grant

Brian Grant is the CTO and co-founder of ConfigHub, pioneering a new approach to provisioning, deploying, and operating cloud applications and infrastructure. As the original lead architect of Kubernetes, Brian created its declarative configuration model (KRM) and tools like kubectl apply and kustomize. With over 30 years in high-performance and distributed computing, he’s held pivotal roles, including tech lead for Google’s Borg platform and founder of the Omega R&D project. A Kubernetes Steering Committee and CNCF Technical Oversight Committee member, Brian also boasts 15+ patents and a Ph.D. in Computer Science, shaping the future of cloud and computing innovation.

Links Referenced

Sponsor

Tremolo: http://fafo.fm/tremolo

Sponsor the FAFO Podcast!

http://fafo.fm/sponsor


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
From my perspective, one of the things that it did is it created
an infrastructure ecosystem that was broader than any single cloud.
Welcome to Fork Around and Find Out the podcast about
building, running, and maintaining software and systems.

(00:31):
Managing role-based access control for Kubernetes isn’t the
easiest thing in the world, especially as you have more clusters,
and more users, and more services that want to use Kubernetes.
OpenUnison helps solve those problems by bringing
single-sign on to your Kubernetes clusters.
This extends Active Directory, Okta, Azure AD and other sources as

(00:53):
your centralized user management for your Kubernetes access control.
You can forget managing all those YAML files to give someone access
to the cluster, and centrally manage all of their access in one place.
This extends to services inside the cluster
like Grafana, Argo CD and Argo Workflows.
OpenUnison is a great open-source project, but relying
on open-source without any support for something as

(01:15):
critical as access management may not be the best option.
Tremolo Security offers support for OpenUnison
and other features around identity and security.
Tremolo provides open-source and commercial support for OpenUnison
in all of your Kubernetes clusters, whether in the cloud or on-prem.
So, check out Tremolo Security for your single sign-on needs in Kubernetes.

(01:36):
You can find them at fafo.fm/tremolo.
That’s T-R-E-M-O-L-O.
Welcome to Fork Around and Find Out.
On this episode today, we are reaching a quorum.
We have three of us.
We have Brian Grant.

(01:56):
Thank you so much for coming on the show.
Hi.
Thanks for inviting me.
It’s so weird that you didn’t say, ‘ship
it.’ Like, it’s just, like, we have made—
We have forked.
We forked the—
Oh—
—podcast.
—we did fork.
Oh, my God.
[laugh] . This is brand new open-source.
It’s a brand new fork.
And again, we’re reaching, I don’t know, the [unintelligible] consensus here,
Autumn and I have elected Brian as the leader of this episode [laugh] . And so—

(02:20):
Poor Brian’s like, “I just got here.
How did they, like, put me onto this new responsibility?” Like—
Look Brian is responsible for that joke, right?
I’m putting all the—
Oh, you should have seen the dad jokes that happened before you got here, okay?
We were trying to figure out the YouTube Short thing, and I was
like, “We don’t even have any shorts.” And Justin’s like, “It’s cold.
I have pants on.” And I was like, “Oh, my God.”

(02:42):
For people not familiar, Brian is one of the three founders,
or people that started Kubernetes within Google, and—
Well, actually we had roughly five.
So, there was Joe Beda and Brendan Burns on the cloud side,
and Tim Hockin and I on the internal infrastructure side.
Does that mean that we, like—that people

(03:03):
blame you when they scream at Kubernetes?
Oh, I am to blame for a lot of things in Kubernetes.
We could talk about that.
Probably more things than any other single person, maybe.
Maybe Tim now because I’ve been away for a while.
Yeah, were you one of the first people that wrote it in Java?
Was that the first—
No, that was the prototype.

(03:23):
Brendan—
Brendan wrote the first one in Java?
That sounded scary.
[Beda] was very early, and worked on the initial implementation.
Also Ville Aikas, who’s at Chainguard now.
But he didn’t really carry over once we started building it for real.
He did something else.
And then there was Craig McLuckie on our product side.
But when we started, we weren’t in our location.

(03:45):
We didn’t have a manager, you know?
We just started working on it.
And you were coming from internal infrastructure.
You were doing Borg, and Omega, and everything
else inside of Google and kind of shifted—
What’s Borg and Omega?
You got to, tell—
That’s what—I was just going to go into that.
That was great.
Okay good.
Because I feel like sometimes when, like, you work in a certain type of
software, we forget that everybody doesn’t know, like, necessarily what that is.

(04:07):
So, tell us all the things.
Yeah.
So Google, many large companies and even a lot of small
companies, built all of its own infrastructure tooling.
So, it had an internal container platform called Borg, which was actually the
reason… the motivation for adding cgroups to the kernel, to the Linux kernel.
So, that was just added and rolling out around the time I joined Google in 2007.

(04:31):
So, that was just like brand new in Borg.
And the reason Google created Borg was they had two previous
systems, Babysitter—which Borg ran on when I started on
the Borg project—and Work Queue, which ran MapReduce jobs.
So, they had one, kind of, batch queuing system and one service running system.
And what they found was that they didn’t have

(04:54):
enough resources for all the batch workloads.
There were a lot of idle resources in the serving workloads,
especially during certain parts of day, in certain regions.
So, they wanted to make a system that could run both
types of workloads to achieve better resource utilization.
Resources were scarce for years and years and years,

(05:14):
you know, even though they had a vast fleet of machines.
It’s funny to think you’re like, “Oh yeah, no, we have hundreds
of thousands of nodes, and we have not enough resources.” [laugh]
. Yeah.
Yeah because the main services were run all the time, they were
adding new services, they were moving services that were previously
not on Borg onto Borg, they were moving acquisitions onto Borg.
There just weren’t enough resources.

(05:36):
Borg project kicked off around the beginning of 2004, so before
I joined, around the time that Tim joined Google, I think.
And—
Yeah, Tim just hit 20 years.
That’s amazing.
Yeah.
Yeah, yeah.
So, I was only there 17 years.
[laugh] . Slacker.
Come on, Brian.
You said, “Only,” like, it was not that long [laugh]
. But you joined directly to the Borg team?

(05:58):
No.
Actually that was one of the teams that—so I came into an acquisition before.
I did something that was not of interest at the time, which,
if you read my LinkedIn page, you may know what that is, but
I did high performance computing on GPUs way, way too early…
[laugh]
. Like 2005.
Nobody cared.

(06:18):
That was the problem [laugh]
. What’s it like working on something and knowing that
it’s going to be great, but it not being the right time?
Like, is that so frustrating?
Because, like—
Yeah, I’ve done that a few times.
You said a few times [laugh] . Not once,
but he’s like, I’ve been in that struggle.
The startup was PeakStream, and the challenge was that there

(06:39):
weren’t people who needed extremely high performance computing
in something that was not, like, a Cray supercomputer.
Because I did supercomputing in the ’90s and worked for
a national lab and things like that as well, and you
know, they had their own kind of big, metal machines.
But there were people who needed high performance
computing, but not of that scale or cost.
The people who did were willing and able—and able is

(07:02):
an important part—to actually hire experts to squeeze
every last cycle of whatever chip they were using.
There are also a bunch of other challenges, like—I
mean, at the time we—it was before Nvidia launched CUDA.
So, Nvidia was working on CUDA.
Our founder was from Nvidia.
They called it GPGPU back in those days, General Purpose Computing on GPUs.

(07:24):
So, that was starting to attract interest, somebody wrote a book.
PeakStream was one of the companies, kind of,
starting in that area, one of the earliest.
And back in those days, the chips, I mean, they weren’t designed for it.
They were designed for graphics, right, so they
didn’t really have a normal computing model.
They didn’t do IEEE floating point; they did

(07:45):
something that was sort of floating point-ish.
And they didn’t do integer computation
either because the shaders didn’t need it.
So, they didn’t do 32-bit integer computation.
They did some simple computations.
Like, indexing into memory, normally, the way that works in a CPU is the
memory unit has an adder that takes an integer memory address of whatever

(08:07):
the word size is on the machine, like, 64 bits these days, on those chips,
and it does an add of an index, an add and a shift to scale the index.
So, if you’re loading something that’s four bytes or eight bytes or
one byte, you know, it does the shift appropriately and adds the index.
So, that’s an integer computation.
You get the memory location that you want.
These chips didn’t do that.

(08:28):
They actually indexed into memory using floating point.
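For readers following along, here is a minimal sketch of the integer address arithmetic being described: the element address is the base plus the index shifted to scale by the element size. This is ordinary Go written for illustration, not GPU or PeakStream code, and the numbers are made up.

```go
// Sketch of how a CPU memory unit indexes into an array:
// address = base + (index << shift), where shift scales the index
// by the element size (1, 4, 8 bytes, etc.). All integer math,
// which is exactly what those early GPUs lacked.
package main

import "fmt"

func elementAddress(base, index, elemSize uint64) uint64 {
	// Compute log2(elemSize) as the shift amount (power-of-two sizes).
	shift := uint64(0)
	for s := elemSize; s > 1; s >>= 1 {
		shift++
	}
	return base + (index << shift) // an add and a shift, no floating point
}

func main() {
	// e.g., the 5th 8-byte word starting at address 0x1000
	fmt.Printf("0x%x\n", elementAddress(0x1000, 5, 8)) // prints 0x1028
}
```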
And I make—can I just say, like, already, off the bat, we’re less than
ten minutes into this episode, and we’ve already explained [laugh]
, like, deep chip, like, addition, for how these things are working.
And also, like, we got our first, like, ‘um, actually,’ which
I feel like I need, like, a sound bite for when you were
correcting me on the founding of Kubernetes, which was amazing.

(08:50):
I love this already.
This is going to be such a good show.
Well, also that, but like, for context, when you’re
talking about this chip work, how long ago was that, Brian?
This was 2005.
And think about how relevant this is today, and how
much people want to get all they can out of chips.
I looked at the APIs for XLA and some of the recent machine-learning GPU

(09:10):
interfaces, and they’re very, very eerily similar to what we did [crosstalk]
. That’s what I’m saying.
Like, how crazy—y’all didn’t see Brian’s face when we said, “How many
times have you worked on stuff that, you know, was too early,” but
talk about how relevant that is into what people are doing right now.
Yeah.
Well, the thing I did before PeakStream was another interesting

(09:31):
hardware-software thing, which was a company called Transmeta.
Do you sleep, Brian?
Have you slept in the last 20 years?
Like, what [laugh] —wait, how long have you been
in tech, total because you’ve done some things.
We’re only ten minutes in.
More than 30 years.
You’re the whole dotcom bubble.
That’s awesome.
Well, I went to grad school during the dotcom bubble, mostly, in Seattle.

(09:53):
So like, a lot of students were dropping out to go
to Amazon and things like that, in the mid-’90s.
One of the first Web crawlers was designed by
another student at University of Washington.
So yeah, I was watching that, and I don’t know, I didn’t really feel the pull
of startups at that time, but when I did approach finishing my PhD, I considered

(10:14):
both industry research and startups; I didn’t really think about academia.
And the startups did appeal to me more.
I did turn down a startup you may have heard
of—which is Google—when it was 80 people.
80 people [laugh]
? You turned down Google at 80 people?
That’s a startup.
You don’t want to go there.
It’s who knows what the future looks like for that.

(10:36):
It didn’t work well for me compared to the other search
engines, so I was like, I don’t want to move to California.
Alta Vista was awesome.
Yeah, I know.
Alta Vista was awesome.
And I searched for, like, I need a preschool for my daughter.
Like, how do I search for that?
And I just—it was just awful.
So, I was not impressed by that.
And I also was interested in the technical area I was working, which
was dynamic compilers, which is, you know what Transmeta was all about.

(11:01):
PeakStream also used that, a dynamic compiler.

(11:03):
So, I built three dynamic compilers in my career: one in grad school; I
worked on one at Transmeta, as well as a static compiler; and PeakStream.
Because everybody does that on a Tuesday.
Like [laugh]
. [laugh]
. That is awesome.
So, what finally led you back to Google to say yes the second time?
I was acquired.
Oh, right.
You’re right, PeakStream was acquired.

(11:23):
PeakStream was acquired.
You were like, “I didn’t even have a choice.”
So like, now they’re like, “We still want Brian.
We’re going to buy the whole company to get you in here.” [laugh]
. They were going to get you at, like, some point.
I did the pitch, so you know, it’s not completely involuntary.
The other potential acquirer was… well, I won’t say
that, but we did have other potential acquirers.
But you know, we hadn’t found product-market fit because

(11:44):
customers like high-performance trading or seismic analysis,
you know, these kinds of customers could hire the high performance
computing engineers to actually build what they needed.
So, in addition to all the exotic hardware bugs, which I could talk about
for a long time if we wanted to do that because that was super fun, but like,

(12:04):
the 1U and 2U server boxes would put in the cards in an orientation such
that the fans on the GPUs would get in the way, and even if they widened the
space between the slots, it would then blow into the motherboard and melt it.
Sure.
So, like, this is a thing that happened.

(12:26):
[laugh] . A small problem.
What do you mean, just melting [laugh]
? Yeah.
There’s a lot of heat.
It would melt solder, it would melt plastic.
Well, you’re probably at, like, 300 watts, 400 watts of GPU, even more.
Like, that heat got to go somewhere.
Yeah, and the quality was also a problem because, for graphics,
they’re like, just most of the pixels need to be right.

(12:47):
If one pixel doesn’t compute the right value, make it zero, and
it will be black, and nobody will notice in 1/24th of a second.
So, their bar for correctness was not the same as Intel.
Like, after the FDIV bug, Intel was just,
like, super paranoid about correctness.
And so, [crosstalk]
— Oh man, I was just reading about that bug.
That was so big.
I completely forgot about that.

(13:09):
Can we give the listeners some context?
What was the FDIV bug, guys?
The FDIV bug: there was a bug in the floating point division unit where it
sometimes would give the wrong result, and that was not considered acceptable.
The computer didn’t math, and this was a problem.
And it was in the chip itself.
Actually, there was a Bluesky thread.
I will find it, and we will put it in the [show notes] . Because it

(13:29):
was an amazing—they had, like, decapped the chip and were looking
at the trace, like, here’s where the bug is, physically on the chip.
Y’all are missing Justin’s very excited face—
I love it.
—because his face, it’s like Christmas morning, and it’s crazy.
Yeah, so that was—at Transmeta, we had a lot of that
because the industry was undergoing a lot of change.
First of all, we were changing everything.
We had a new approach.

(13:50):
We were doing dynamic binary translation in software from x86 to a custom VLIW.
Like, not emulation layer?
Like…
In software.
Okay.
It was an emulation in the software, in a hidden
virtual machine that the end-user could not access.
What could go wrong?
Actually, all that worked, awesome.
[laugh] . The hardware was the problem.

(14:10):
Well, the industry was transitioning from 130 nanometer to
90 nanometer, where the leakage characteristics just changed
dramatically, and from aluminum wires to copper wires.
And we changed our fab to TSMC, a little fab that nobody had ever heard of.
And month after month, we were looking at these photos
of, from an electron scanning microscope, saying, you
know, this is the reason the chips don’t work this month.

(14:33):
There’s a thing called the vias.
So, the chips are multiple layers, alternating silicon
and metal, and the metal is the wire layers that connect
all the gates together, all the transistors together.
The metal layers all need to be connected because the
electricity comes in on the pins on one surface of the
chip and needs to flow through all the metal on the chip.
So, there’s a thing called vias, which are holes in the chip that

(14:56):
the metal needs to drip down through as part of the process of
manufacturing these things, at microscopic, like, atomic-level scales.
So, there’s all kinds of things in the viscosity of the metal, where, if
it’s not exactly right, it won’t go through the hole because it’s so small.
So, if you can imagine, like, raindrops collecting on a sheet of
plastic, or something like that, and not falling off, kind of like that.

(15:18):
So, we would see these pictures of, oh, this via
didn’t go through, that via didn’t go through.
Oh, this one actually went through, and splattered
across, and shorted a bunch of wires together.
So, we had a bunch of photos like that for, I forget
how many months, like, six months or something.
It was a long time for somebody trying to get a product out.
Yeah.
So, that was exciting.
Then once we got the chips back for the 90 nanometer generation, which was the

(15:39):
second generation chip design—and I just started, fortuitously, the week that
project kicked off, so I was there from the beginning on that chip generation.
The software was all new, the static compiler was new, the dynamic
compiler was new, the boards were new, the chips were new, like,
the fab was new, the process was new, like, everything was new.
So, of course, nothing worked, right?

(16:00):
Yeah.
I was going to say this then trying to figure out what’s wrong is just—
Yeah.
So, we had a 24-hour bring-up rotation, so there’s always people
in the lab trying to figure out what’s wrong and working around it.
So, my part came eventually, after the hardcore bring-up
lab, where it’s like, well, we don’t have a clock signal.
Why don’t we have a clock signal?
Well, the phase-locked loop has a problem.

(16:21):
Well, what can we do to electrically make the phase-locked loop work?
Once I got to the point where they could kind of run, I got a board on my
desk with a socket I could just open and close, and there were balls on the
chips, rather than pins, so I could actually just get a tray of chips and slap
one in and close the socket and turn it on and try to debug what was going on.
Because different chips had different characteristics.

(16:43):
Probabilistically, there’s a distribution.
If this is not interesting, by the way, you can stop me any time.
No—
No, it’s so interesting.
I did not expect this to go this direction, and I absolutely love it.
But also, we have so much other stuff I want to talk about.
This is, like, 20 years ago, and at some point Google bought the company.
Why did Google buy it and what were you doing when you joined?
Because you said you weren’t on the Borg team originally.
Honestly, we had the same investors as Google, Kleiner and

(17:06):
Sequoia, and actually, when we started PeakStream, I worked
in the back of the Sequoia office for a few months before we
found our office, and got a company name, and things like that.
I was the third engineer there, but not a co-founder.
Effectively, there was some hope that maybe the technology could be useful.
And actually, my investigation into the data centers in Borg

(17:28):
was one of the things that convinced me it was going to be quite
challenging, but also we didn’t find a customer within Google
for that, for dense floating point computation at that time.
Like, the computations were more sparse for
the types of things they were doing back then.
So yeah, we spent a few months talking to lots of people
in the company and tried to find something useful, but

(17:49):
then said, well, it wasn’t going to be actually useful.
So, then I pivoted the team that was brought over to focus on something that was
a problem, which was Google’s—about half of its code was C++ and half was Java.
And this, 17 years ago, it was just at the beginning
of the NPTL, the new POSIX threading library in Linux.

(18:12):
Before that, there was this thing called Linux
Threads that was terrible and not really usable.
So, when Google started—and there was no C++ threading standard, right, so
you had to write your own threading primitives, effectively, to do stuff.
And you know how memory safe C is, right?
So, Google had developed all of its own threading primitives.

(18:32):
They were pretty low level though.
And the first engineer hired into Google decided, well, we’re
scraping the entire web; we need throughput, which was true.
And the chips at the time were the Pentium 4, which was what Transmeta
was competing against, which—well, anyway, I won’t go back into CPUs,
but determination was made that the most efficient thing to do would

(18:54):
be to write a single-threaded event loop and run everything that way.
And that was true at the time, but very shortly
after 1998, when Google started, multi-core happened.
Chips changed everything.
Now, multi-threading is good for CPU utilization and latency.
Java had a very strong threading model from very early on, so all the

(19:18):
Java code was actually in pretty good shape, but the C code was not.
There’s a lot of single-threaded code in
Google, so I started an initiative to fix that.
So, an opportunity was the Borg team.
Other opportunity was, make everything on Borg run better.
So, I ended up doing the latter.
I started a bunch of projects to help the new POSIX threading

(19:42):
library roll out across the fleet, to develop some new, easier-to-use
threading primitives, and to develop documentation.
I mean, back in those days in Google, it’s like, engineering was maybe 10,000.
Biggest I’d ever worked for at the time.
You know, I’d done two startups before that.
So, I thought, “Oh, man, Google’s so huge.” Little did
I know 17 years later, it would be 20 times bigger.

(20:02):
It’s so crazy that you were almost the 80th employee now that,
like, a thousand that probably seemed so big in context, but—
Ten thousand.
Oh, ten thousand.
But then, like, it’s just massive now.
Hundreds of thousands, yeah.
It was big, but in those days, I could do
things like having a company-wide tech talk.
So, I did.
I started an initiative called the multi-core initiative to

(20:24):
actually promote threading in C++ and to make it work better.
So, we built a multi-threaded HTTP server, which is still in use, and
some threading primitives, and worked on documentation and thread profiling
tools, some annotations in the compiler that would, kind of similar to
annotations in Java, where you could identify areas that are supposed to be

(20:47):
locked and ensure that the [mutex] was used properly, and things like that.
This is a random question, but what was your PhD in?
Because how did you get started in, like, GPU and, like, these very in-depth—
My background was systems.
So, as an undergrad, I worked on networking
and operating systems and supercomputing.
And I also started grad school working in the supercomputing area.

(21:11):
I did three summers at Lawrence Livermore National Lab.
And I worked on the climate model, and some group communication
primitives, and porting to MPI, which was brand new back in those days.
I actually went to one of the MPI spec development meetings.

(21:25):
That’s the Message Passing Interface: MPI.
So, I transitioned into compilers because there was an interesting
project doing runtime partial evaluation, and that’s where
you take some runtime values in the program and use that to
generate specialized code that just works for those values.

(21:46):
You know, so in some cases, you could get kind of
dramatic speed-ups from the code from doing that.
So—
You’re doing, like, dynamic tuning for the code?
Or was it like—
It’s not tuning; it’s compiling.
So, you know, if you have some computation that uses some value like an
integer, there are standard compiler—if that’s a constant, like, five,

(22:07):
there are standard compiler optimizations, like, constant folding that
will take that value and pre-compute any values that can be pre-computed.
If that value is input at runtime, you don’t know what
the value is, then you just have to generate all the code.
And if there are conditionals based on that code, you have to
generate branches and evaluate those conditionals, et cetera.
If there were certain values, even data structures that were known

(22:30):
to be constant, you could potentially do some pretty impressive
optimizations, like unrolling loops, which allows you to pre-compute
even more things and reduce the amount of code dramatically.
So, the biggest speed-ups we would get
were, like, 10x speed-ups from doing that.
So, for example, if you had an interpreter, an interpreter would
normally have an execution loop where it would read some operation

(22:52):
to interpret, you know, dispatch to something to evaluate it.
Like, if it’s an add, it will go add, and return back.
So, if you actually gave the interpreter a
program as a constant value, what could you do?
Well, effectively, you can compile it into
the code instead of interpreting it, right?
So, that was sort of the most impressive use case.

(23:14):
Not super realistic, but there were some cases like that could be done.
So, to do this, what you have to do is analyze where all the values
flowed to that you wanted to take advantage of, and split the

(23:26):
program into two: one piece would be a compiler that would do all
the pre-computation, and the other piece would be the code that
would be emitted once you had the values that were pre-computed.
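As a rough illustration of that idea, written in Go rather than the actual research system: one stage consumes the values that are known and returns residual code with that work already folded away. The tiny "program" and its two-operation language below are invented for the example.

```go
// Runtime specialization sketch: interpret() is the generic path that
// re-dispatches on every call; specialize() plays the role of the generated
// compiler stage, folding the known op list into residual arithmetic.
package main

import "fmt"

// interpret walks the op list on every call: the unspecialized path.
func interpret(ops []byte, x int) int {
	for _, op := range ops {
		switch op {
		case '+':
			x += 1
		case '*':
			x *= 2
		}
	}
	return x
}

// specialize runs once with the program known, and returns residual code
// for just that program: no dispatch loop left, only folded arithmetic.
func specialize(ops []byte) func(int) int {
	add, mul := 0, 1
	for _, op := range ops {
		switch op {
		case '+':
			add++
		case '*':
			add *= 2
			mul *= 2
		}
	}
	return func(x int) int { return x*mul + add }
}

func main() {
	prog := []byte("+*+")
	fast := specialize(prog)
	fmt.Println(interpret(prog, 3), fast(3)) // both print 9
}
```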
And over and over again, it seems—I mean, like, your graduate studies,
your doctorate, all this stuff, like, you’ve been doing this
optimization over and over again, and now you’re at Google,
you’re doing this C++ multi-core, we got to do this thing.

(23:47):
Let’s fast forward to, like, where do you go from, you’re
doing Borg stuff to, like, this Kubernetes thing comes out?
Like, this is, like, hey, we want to do something else
that’s going to—we want to open-source it, we want to do
something generalized from the stuff we were doing internally.
Okay, yeah.
I mean, after about a year-and-a-half, I got about as far as I could on
making things multi-threaded, so I transitioned to the Borg team, 2009.

(24:09):
In short order—Borg was about five years old at that point in time—it
was clear that it was being used in ways it wasn’t really designed for.
So, I came up with an idea for how to rearchitect it.
I started the project called Omega, which was an R&D project to rearchitect it.
Worked on that for a while.
After a couple of years, cloud became a priority.

(24:31):
I mean, before that, it was not really a priority for the whole company.
You know, the Cloud team was a pretty small team in Seattle,
and most of Google was down in Mountain View in the Bay Area.
Google had App Engine for a couple of years, but kind of core cloud product that
people would think about would be the IaaS product, the Google Compute Engine.

(24:52):
And my understanding was, App Engine was basically, like,
a customer front end to run jobs on Borg directly, right?
It was so restricted that it didn’t have a layer in between.
Did have a layer.
Okay, s—
It had a pretty elaborate layer, actually.
And Cloud Run shares some DNA with that.
But the restrictions behind App Engine were because of that platform
layer in between, because it was just, like, you have to architect

(25:13):
your application in a specific way to make it run here, and we
will take care of all the infrastructure side of it for you.
Yeah, a big part of that is—well, there’s multiple parts of it.
One is sandboxing, so you can actually run stuff
multi-tenant before we had hardware virtualization primitives
that could be used for sandboxing, like, in gVisor.
I mean, eventually it moved to gVisor.
But before there was gVisor, there was like

(25:36):
a Ptrace sandbox or something like that.
But all the networking stuff in Google is different and exotic.
You know, like things don’t communicate by HTTP.
You know, they had some RPC system that sort of
predated use of HTTP as a standard networking layer.
None of your normal naming DNS, service discovery,
proxying, load balancing, none of that stuff works, right,

(25:59):
because they have all their own internal stuff for that.
Yeah.
Yeah, all the internet things are just like, “Ah, that’s not for us.”
And the compute layer, you know, they did want the sandboxing to
be really strong, so, yeah there are a bunch of reasons for the
restrictions, but they’re, you know, based on the technology at the time.
And it’s before Docker containers, things like that.
Right.
Yeah, so you have this research project, basically internal,

(26:21):
like, to rearchitect Borg into this Omega thing, and then what happens there?
Like, where’s that transition?
I mean, it turned out to be not worth it and somewhat infeasible to roll it out.
We did partially roll out pieces of it, internally, and
it kind of made things more complex operationally during
the transition, and it just didn’t provide enough value.

(26:43):
The install base was really big, and it was growing faster
than we could write new code, and that was a time when the
Borg ecosystem was just exploding internally, and new things
were being added at all the layers at a very rapid pace.
And changing the user interface, which was actually one of the most

(27:03):
problematic parts of it, was also pretty much a non-starter because
there were, like, zillions and zillions of lines of configuration
files and about a thousand clients calling the APIs directly.
So, it was just too much.
So, some of the ideas were folded back into Borg, like labels
and Watch, which, if you know Kubernetes, may sound familiar.
And other parts were turned down, but

(27:24):
Kubernetes—you know, as cloud became important.
GCE, Google Compute Engine, GA’ed, at the end of 2013.
And that was also when Joe Beda kind of discovered Docker, and said,
“Hey, look at this Docker thing.” Management directors and above
were kind of trying to figure out, how can we apply our internal
infrastructure expertise to cloud, now that it’s becoming a priority?

(27:44):
So, I shifted off in that direction, and we started exploring, well, there’s
this group put together by a couple of directors called the Unified Compute
Working Group, and actually the original motivation, nominally, was to
produce a cloud platform that Google could actually use itself someday.
Because App Engine was considered too restrictive, and

(28:08):
Google Compute Engine was VMs, and Google had never used VMs.
Like, it just skipped that.
It used containers, more or less, processes, Unix processes from
the beginning, so there was no way they were going to use VMs.
They’re, like, way too inefficient, they’re too opaque,
they’re hard to manage how to be container-based.
So, you know, some of the original things were, yeah, it should be like, Borg.

(28:29):
And I’m like, wait, wait, wait, wait, wait.
We just spent years trying to [unintelligible] Borg.
Let’s not do it just like Borg.
Kubernetes actually ended up being open-source Omega, more
or less, based on a lot of the architectural ideas, and
some specific features, even, like, scheduling features.
So, some of the more unusual terminology was just lifted whole cloth from
Omega, like taints and tolerations, for example, as just one example.
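For anyone unfamiliar with those terms, here is a small sketch using the real k8s.io/api types: a taint on a node repels pods, and a pod opts back in with a matching toleration. The key and value below are made up for illustration.

```go
// Taints and tolerations in Kubernetes API terms: the node declares a taint,
// and only pods carrying a matching toleration may schedule onto it.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	node := corev1.Node{
		Spec: corev1.NodeSpec{
			Taints: []corev1.Taint{
				// Hypothetical key/value: reserve this node for GPU workloads.
				{Key: "dedicated", Value: "gpu", Effect: corev1.TaintEffectNoSchedule},
			},
		},
	}
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			Tolerations: []corev1.Toleration{
				// The pod tolerates that exact taint, so it remains schedulable there.
				{Key: "dedicated", Operator: corev1.TolerationOpEqual, Value: "gpu",
					Effect: corev1.TaintEffectNoSchedule},
			},
		},
	}
	fmt.Println(len(node.Spec.Taints), len(pod.Spec.Tolerations))
}
```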

(28:54):
So, there were a bunch of things from Omega we just simplified.
Wasn’t the pod aspects in Omega?
Like, the grouping?
It was, yeah.
So, the pod was one thing I felt was really important,
and I tried to introduce it into Borg around 2012.
That was super hard to introduce at the
core layer, since the ecosystem was so big.
But Borg’s model was, it had a concept called an alloc, which was an

(29:15):
array of resources across machines, and the idea was that, you know,
it’s kind of your own virtual cluster that you can schedule into.
But nobody, almost nobody, used it that way.
What teams did was they had a set of processes they wanted to deploy
together, usually an application and a bunch of side cars for logging

(29:37):
and other things, and they wanted those things deployed in sets.
So, you know, I talked to the SREs, and they said, “Ah, we just
want this.” That led to the concept, which, at the time, was
called Scheduling Unit in Omega, and for the experiment in Borg.
And that was just what in Borg was a set of tasks from jobs.

(29:58):
And tasks weren’t even a first class resource in Borg.
Jobs were the first-class resource, and jobs were arrays of tasks.
So, you had
this weird, challenging model where you had an array
of resources across machines, and you had multiple
arrays of tasks that you wanted to pack into those.
So, if you needed to horizontally scale, you needed to

(30:20):
grow your alloc first, and then grow your jobs after.
And if you wanted to scale down, you had to do it in the opposite order.
And technically, you could do it in either order, and things
would just go pending and not schedule into the allocs, but
that created a lot of confusion, so people tried to avoid that.
But the pod or scheduling unit primitive was
just a lot easier for how people were using it.

(30:42):
I have this set of things.
I want those deployed together, just as if they were on one machine.
Just do that.
If you want to scale, that’s a unit to scale by.
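A minimal sketch of that scheduling-unit idea in today's terms, using the real k8s.io/api types: an application container plus a logging sidecar declared as one pod, scheduled onto the same node and scaled as one unit. The names and images are hypothetical.

```go
// The pod as a scheduling unit: two containers that always travel together.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "frontend"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "app", Image: "example.com/frontend:1.0"},     // hypothetical app image
				{Name: "log-saver", Image: "example.com/logger:1.0"}, // logging sidecar
			},
		},
	}
	// Both containers land on the same node and are replicated as one unit.
	fmt.Printf("pod %q has %d containers\n", pod.Name, len(pod.Spec.Containers))
}
```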
I remember, like, in like, Mesos time, like, it was always
like, oh, well, don’t try to schedule things together.
Just write better code.
And I’m like, that’s not how the real world works.
Like, that’s [laugh]
— Yeah, we had a bunch of cases, like, we have very complex
[fat] clients for interacting with storage systems that

(31:04):
were just super challenging to rewrite in all the languages.
And Google restricted the languages you could use in production.
For a long time, it was C++ and Java.
Python was added, but it wasn’t as widely
used, not for serving workloads anyway.
It was used more for tooling.
Eventually Go came around, but you know, that was decades later.

(31:27):
But rewriting the Colossus client to interact with the
distributed file system, for example, you know, if that’s tens of
thousands of lines of code, you don’t want to do that multiple times.
So, how those things evolved over the years, I mean, eventually there was a
way that was created for running those things without the normal sidecar model.

(31:47):
They would all do it in the same container, effectively.
But there were a bunch of reasons to have sidecars.
And for anyone that wants to read more about this—I mean, we’ll link your
blog posts in the [show notes] —“The Technical History of Kubernetes” is a
collection of a lot of your old Twitter threads that gather a lot of these
pieces together, which was a great combination of them all in one spot.

(32:09):
“The Road to 1.0” post also has kind of a different perspective.
It was more once the Kubernetes project started,
how did it evolve for the first couple of years.
Now, I’m going to make a jump here.
Kubernetes is 10, going to be 11 years old.
I mean, for me, it is more than 11 years old.
Yeah,
exactly.
You’ve been on it for a while.
But like, as far as, like, an open-source, the official, you know,

(32:32):
stance of it, hit ten in 2024, what has this shift for what Google
was trying to do with Kubernetes initially, and the open-sourcing
of everyone else also using this in various places, what has
that done to the landscape of infrastructure and applications?
From my perspective, one of the things that it did is it created

(32:52):
an infrastructure ecosystem that was broader than any single cloud.
Because at the time we started Kubernetes, there was
the AWS ecosystem, and that was pretty much it, right?
Like, obviously Google had before GCE was GA’ed, it had
pretty negligible usage on measurable market share, I think.
And that was the time that the Kubernetes project started.

(33:14):
Even Azure wasn’t really very, very present.
And even now, ecosystem-wise, I look at Infrastructure as
Code tools, for example, there are a bunch that work for
AWS only, and there aren’t very many things that work only
on the other clouds in the open-source ecosystem, at least.
But Kubernetes sort of created its own island, where

(33:36):
you could have this rich ecosystem that works pretty
much anywhere, it works on-prem, it works on any cloud.
People have differing opinions on whether it’s a good thing or a bad
thing, but I view it as mostly a good thing that you have a large
ecosystem of tools that work everywhere, and that was not the case before.
And especially for the people who were on-prem,

(33:56):
the thing that was available before was Mesos and OpenStack.
And Mesos, in my opinion, it’s kind of overly complicated.
The scheduling model just didn’t work at a theoretical
level, and the open-source ecosystem was not as strong.
Like, a lot of the big users just built their own frameworks and

(34:19):
then open-sourced them, and that’s sort of death to the ecosystem.
But, you know, even those who did, the tooling was not
compatible across frameworks, so it’s just super fragmented.
So, it didn’t really have the potential to grow
this sort of ecosystem that Kubernetes did.
And then when we created the CNCF, you know, taking inspiration

(34:40):
from what happened in the JavaScript area, where there was
the Node.js Foundation, and I forget what the foundation
was before they unified, but there was another foundation.
And a couple of things like Express went into the Node.js Foundation,
but most other projects were not accepted into that foundation, so they
had to find a home in some other foundation, and that was really awkward.

(35:03):
So, one thing I wanted to do with CNCF was ensure
there was a home for all those other projects.
Before CNCF was really ready, Kubernetes project itself kind of
became an umbrella project and took on a bunch of those projects.
Like Kubespray, for example, for setting up Kubernetes clusters with Ansible.
But, you know, as soon as we created the—initially, it was

(35:24):
called Inception, I think, but then, you know, it became the
sandbox, and then kind of the doors really opened to all those projects.
So, I think that’s been very positive for
experimentation and developing of new things.
You know, it does give you a paradox of choice, it makes things a
little bit hard for figuring out what you should actually use versus
what’s available, but overall, I see it as a very healthy development.

(35:51):
Running Kubernetes at scale is challenging.
Running Kubernetes at scale securely is even more challenging.
Are you struggling with access management and user management?
Access management and user management are some of the most
important tools that we have today to be able to secure
your Kubernetes cluster and protect your infrastructure.
Using Tremolo Security with OpenUnison is the easiest way, whether it be

(36:12):
on-prem or in the cloud, to simplify access management to your clusters.
It provides single sign-on and helps you with its robust security
features to secure your clusters and automate your workflows.
So, check out Tremolo Security for your single sign-on needs in Kubernetes.
You can find them at fafo.fm

(36:33):
slash Tremolo.
That's T-R-E-M-O-L-O.
I feel like you guys did a great job with almost unifying a lot of things and
just kind of having, I don’t know—were you and has anybody ever done anything

(36:57):
with Kubernetes that you were just, like, almost offended by that it’s so—
[laugh]
. [laugh]
. This is your baby.
You’ve seen it go from so many—I have, like,
three questions, but I want to start here [laugh]
. Well certainly, there were a lot of things I was very—that
I didn’t really imagine that I was very happy about.
Retail edge was one of these scenarios where I

(37:17):
wanted to make sure Kubernetes could scale down.
Borg, I think the minimum footprint is, like, 300 machines
or something at the time I worked on it, so there’s no
way it could scale down to something you could just run.
And Mesos kind of had that problem, too.
It had a lot of components, multiple stateful components.
Cloud Foundry required a bunch of components.
So, I wanted it to be
able to scale down to one node, so it just

(37:39):
has one stateful component, which is etcd.
It doesn’t have, like, a separate message bus, you know,
although that was a design that could be considered.
But the reason was for doing kind of local development,
like, Minikube or Kind type things, mostly.
Retail edge was sort of really fun in that, you know, it’s
like, in every Target store, it’s been on spacecraft, and ships,
and all kinds of other places I never really imagined.

(38:02):
In terms of offended, you know, I remember one time—
Like, have they ever made it, like, overly complicated when
you were trying to make it simple or just something that
you’re just like, “Dude, I was trying so hard to prevent this.”
There is.
I mean, early on, I was very concerned about fragmentation,
which is why I helped create the conformance program.
So, all the attempts to sort of fork it and do something a
little bit different, and there were some cases like that where

(38:25):
some people said, oh, I just want to run the pod executor.
I just want to run Kubelet, but I need to make changes.
No, no, no.
You actually need to make sure that the API works.
When Kubernetes was sort of young and
vulnerable, I think that was a big concern I had.
Or other cases, like, the Virtual Kubelets,
you know, I didn’t want to fork the ecosystems.

(38:48):
Like, oh, only certain things work with Virtual
Kubelets, or only certain things work with Windows.
So, on Virtual Kubelet, I kind of started sketching a bar
for what I thought compatibility would need to require.
Minimum Kubelet [laugh]
. I honestly think that your work in that aspect really shows, though because
even when people say that Kubernetes is difficult, there’s a reason why so many

(39:10):
people use it because it really does have that whole ecosystem that is really,
kind of—I think open-source can be so political, and the fact that there’s
so many different projects, but they all kind of align is really impressive.
Were you involved in the naming because, like,
Kubernetes naming, like, just cracks me up.

(39:30):
No.
Honestly, the naming was outsourced.
There’s, like, a search for potential names
and a trademark search, and things like that.
That aspect is pretty boring.
Lawyers got involved.
And [laugh]
— Yes.
[laugh]
. You know, it couldn't be named what the code name
was, so, you know, that was never a contender.
But did you have any influence on the fact that it’s Greek, right?

(39:51):
Like, all the different—
I mean, it did start a trend.
Istio, for example, for a while, everything was getting a Greek name.
I now work at a startup that has a Greek name.
This is
how this works [laugh]
. That’s what I’m saying.
Like, I feel like, just the continuity of the naming started, kind of, a lot
of the way that people start choosing to name their open-source projects.
And kind of, you almost make sure you could relate the fact

(40:13):
that these projects were related by their naming, you know?
I thought that was cool.
It seems like Kubernetes was the first to really do that.
Docker did it as well.
There were a bunch of shipping analogies
and… and Helm sort of followed that pattern.
I mean, themes were big for any technology.
Like, config management, you had Puppet, and you had Chef, and you had
all these, like, words that, like, oh, it has to be the cookbook and the—

(40:34):
In my opinion, Salt took it to an extreme.
Kubernetes had so many though.
And the fact—
Salt with the pillars and everything else, yeah, you’re right.
[laugh] . That’s true.
Okay, so with your experience, right, you’ve gone through the
chips, you’ve gone through supercomputing early, you were in the,
you know, C and Java, and now, with people wanting to rewrite
everything—you saw when they wanted to rewrite everything in Java,

(40:57):
right, now, everybody wants to rewrite everything in Rust, right?
You saw supercomputing before it was cool, and now everything is chip boom, AI.
Are there patterns that you see that, like, either you’re
excited about or alarmed about, or is it weird seeing
it go from where you started with all these things?
And it’s kind of like the same but different?
The kind-of-same-but-different aspect is, you know, I think what keeps

(41:19):
software engineers employed, so I can’t argue too much with that,
but redoing the same things over and over in slightly different and
hopefully better ways is, I think, something that will continue to exist.
Like, now, everything with AI, right?
So, it’s very reminiscent of the dotcom bubble in that
sense, where everything’s like a retail store, but dotcom.

(41:42):
Mostly, there were a few big winners there, like, you know,
Amazon, eBay, but you know, most of the companies did not succeed.
A lot of the kind of existing companies got their act together and
put together a web storefront, right, and now that’s easier than ever.
So, I think AI will kind of be similar where, you know, there’s a

(42:03):
bunch of startups that are experimenting in cases where they are
sort of doing something that people already do, but just with AI.
Sprinkle little AI on it.
Yep [laugh]
. That will probably end up being a product feature.
In the positive case for them, it will end up being
an acquisition that makes it into an existing product.
It is super challenging for big companies to innovate,

(42:23):
certainly a challenge that Google has, I think.
Honestly, Google always had it.
So, if you think about what are the big products at Google, a lot
of them are acquisitions, even things you think of, like, Google is
all about ads, I mean, most of that technology is acquisitions.
Yeah, DoubleClick, and—yeah.
I always find it interesting where it’s like, it’s not that you can’t innovate
at a large company, it’s that it’s really hard to get that to actually have

(42:46):
impact because I know so many cool, innovative internal projects that have
been at all these big companies, but the only way they get it to be an impact
at the company is they have to leave, go make a startup, and they get bought
by the company, and [laugh] now they have a say of like, oh, now it’s the
innovative thing that I was doing here ten years ago, but you didn’t believe me.
That’s also how we reward certain innovation.

(43:07):
Like, people are always trying to figure
out the projects that go in their promo doc.
And if you don’t reward a certain type of innovation, you’re—
Yeah.
—almost strangling it.
That system is very rigged for a certain type of innovation.
And it’ll be, like, the dumbest projects that they waste the
stupidest amount of money on, and it has absolutely no value,
and then—when people talk about empire-building, you know what I

(43:29):
mean?—and then somebody actually built something that’s helpful and
cool, and they have to go [laugh] [unintelligible] and come back.
I mean, like, even look at Meta.
Like, it most—look at all the acquisitions they’ve done.
I mean, a lot of times, when these things start, like PeakStream,
for example, it’s not clear that something is going to be—whether
it’s going to succeed, whether it’s going to be important.

(43:49):
It’s a risk, right?
Like Nvidia played a really long bet on compute on GPUs.
ATI, at the time, decided not to do that,
and they ended up getting acquired by AMD.
And AMD doubled down on graphics.
And they actually won all the consoles, laptops, mobile
phone deals, like, all of them away from Nvidia at that time.
For a long time, basically the national labs were the customers of that

(44:11):
stuff, but now, it’s everybody, so the long bet has really paid off.
But that really requires a lot of faith, I think.
It’s crazy how—you know how, at one point, Apple invested
in—wait, was it Windows invested in Apple, right?
And then how AMD was doing better than Nvidia at one point, you know?
Like, just the way that the—just, it’s so hard

(44:31):
to know what is going to work out, you know?
Like, look at where we’re talking about the dotcom, and remember when
we had Rich on Ship It, and he was talking about how huge WebMD was—
WebMD, yeah.
—right?
And then we were just talking about Amazon versus eBay.
Who even buys stuff on eBay, anymore [laugh]
? I just bought stuff on eBay.
What are you talking about [laugh]
? You and, like, five other people [laugh] . You know what I mean?

(44:52):
Like, Yahoo was so big, and now nobody uses that, and it’s just crazy.
And I feel like I haven’t even been involved in tech that long,
and I can’t even imagine the things you’ve saw in 30 years, Brian.
Like, you’ve seen it go—
So, as far as doing things too early, multiple times, Transmeta’s chips
were low power, general purpose computing chips, and they went into

(45:15):
devices like ultra-light laptops, tablets, wearables, smartphones, in 2000.
The year 2000.
No way.
Did you bet on anything or really believe in anything, and then nobody
thought it was cool, and then now you’re like, see, [laugh] like, I told you.
Well, so in Transmeta, yeah.
And I really liked what Transmeta was doing.

(45:36):
And that was kind of my dream job because in school I had
electrical engineering classes, and computing classes, and
things like that, but I started programming when I was ten.
The first computer was a kit computer that my
dad built, a 6502-based KIM-1 kit computer.
And it had no persistent memory, no persistent
disks, nothing, and no ROM with firmware.

(45:58):
So, every time you turn on the power, it’s a clean slate.
There’s nothing.
There’s no assembler.
There’s nothing.
It just had an LED display and a hex keypad.
So, I would have to type in the program from
scratch every time you turn on the power.
And back in those days, Byte magazine would have
6502 assembly programs, and I would have to manually—
Flip them all [laugh]
? —manually assemble them, and type in the

(46:20):
hexadecimal machine code and then run it.
But anyway, when we got an Apple II, we’d turn
on the power, and there would be a prompt, right?
There would be a program running, and that was just so amazing for me.
So, you know, Transmeta, I really learned, from the time you
turn on the power, what happens, how does the computer work?
Like, I worked on the code that decompressed

(46:42):
the firmware out of the ROM, for example.
I worked on frequency-voltage scaling.
I worked on the static compiler.
So, we had software TLB handlers that ran through my static compiler.
Like I dealt with things all at, like, this crazy, super low level.
If the instructions didn’t get scheduled right, you had a problem: the chip had no interlocks.

(47:02):
What an interlock does is, if you have one instruction that writes the
register, and another instruction that reads from that register, an
interlock will stall the CPU pipeline until that register value is written.
There’s like a scoreboard that keeps track of these things.
Transmeta chips, in order to be low power, is trying
to cut circuit count, so it didn’t have interlocks.

(47:25):
That leads me right into, like, the last thing I want to talk about here
because we have this—Kubernetes thing exists, we have this extensible API that
you helped make conformant so it is consistent for everyone in whatever
environment they’re in, and one of the ways that we’ve been seeing with that
is this notion of using that API and this notion of control loops to do more
infrastructure management, things like Config Connector at GCP, ACK at AWS.

(47:51):
And they’re reimplementing some of that, like you mentioned
in, like, very cloud specific, like, this is my cloud
implementation of this thing because I know the APIs.
And in most cases, those are now generated from the APIs, right?
Like, we’re not manually writing this stuff out again.
Like, with Terraform, we had to do a lot of manual stuff to make providers work.
And there’s this new wave of Terraform-like things that are happening,

(48:14):
which is also, again, you started taking a risk there and looking
into this more, and what do you see coming in that area next?
Well, for the Kubernetes-based controllers, and in general, what I’ve
seen, I came up with the idea for what became Config Connector around
the end of 2017, when Kubernetes initially had third-party resources,

(48:35):
and then that was redesigned to Custom Resource Definitions, CRDs.
CRDs were in beta for a really long time.
It had a lot of features that were hard to get to the GA level.
But it was starting to become popular at that time.
People were writing controllers to manage,
like, S3 buckets and individual cloud resources.
I saw it as a way to solve a couple of problems for Google.

(48:57):
And Google had a Deployment Manager product that had a
bunch of technical and non-technical challenges at the time.
Kubernetes and Terraform started at the same
time, so Terraform is still pretty early in 2017.
You know, Ansible was way more used at that time than Terraform.
We did have a team that had started to maintain Terraform,

(49:18):
and it had a semi… I would say, semi-automatic, ability
to generate the Terraform providers from the APIs.
And that still remains true.
It’s still semi-automatic; it’s not fully automatic.
And I actually wrote a blog post about some of the
challenges with APIs that make it hard to automate.
And I don’t think Google’s APIs are the only ones that have these issues.
Kubernetes was growing a lot by the end of 2017.

(49:42):
I think that’s when AWS launched EKS, and VMware, and, you know, pretty
much everybody, even Mesosphere, had, like, a Kubernetes product.
So, it seemed like with a Kubernetes-centric universe, maybe
it would be something you would want to do, and it would
provide that more consistent API that you couldn’t get from
the providers, so something you could build tooling against.

(50:04):
You know, there are some big Google Cloud customers that adopted
it, but overall, not remotely as many as have adopted Terraform.
And it’s much less popular, especially for—even amongst GKE customers, it’s
not nearly as popular, and most of those platform teams know Terraform.
And they’re used to Terraform, so they manage infrastructure with Terraform.

(50:26):
I think the one potential sweet spot for it is for resources
that application developers would need to interact with, like a
database, or a Redis instance, or a message queue, or something like
that, from the cloud provider, where you could, in theory, provision
it using the same sort of tooling that you use to deploy your app.
Although, you know, these days—people used to love Kubernetes
in the early days, that was always very gratifying.

(50:47):
Some users would say, you know, it changed their lives and things like that.
These days, with the larger number of people using
it, you get some people who don’t love it as much.
You know, anything widely used has that.
Terraform has that, too.
Helm has that.
But yeah, it just hasn’t really materialized, people managing resources there.
Crossplane is probably the most prevalent way, although, you

(51:09):
know, not on GCP, because GCP customers want to use something
that’s supported and that GCP endorses, and things like that.
So, ACK and the Azure Service Operator—I’d be interested
to know how many users there are, but I’m just going
by, kind of, social media posts and things like that.
I feel like it kind of came out of this notion, especially in, like,
the serverless worlds, where once you deploy a Lambda function, you’re

(51:32):
like, oh, I need my queuing system, and my S3 bucket, and my database,
and I want them all deployed from the same CloudFormation stack.
And people were like, oh, I could replicate the same thing with containers,
and get that same sort of feeling of, I don’t care about the infrastructure,
but someone has to care about how that infrastructure got there, and who
runs those controllers, and how they’re authenticated, and where they go.

(51:53):
And usually that used to be a service of something like CloudFormation,
and now it’s something that, oh, the platform team has to run 87 different
controllers for every different connection that we want [laugh] to put in there.
Right, yeah.
And upgrading controllers and CRDs is still pretty challenging.
I actually wrote a blog post about using KRM, the Kubernetes
Resource Model, for provisioning cloud infrastructure as well.

(52:15):
There are a bunch of challenges with using the Kubernetes tooling.
Like, a lot of the cloud APIs are designed so that you call one,
it gets provisioned, some IP addresses are allocated or something.
You get that back in a result.
That may take 20 minutes, it may take a long time, then you need
to take those values and pass them as inputs to another call.
And that requires orchestration at a level that—you know, in

(52:39):
Kubernetes, everything—the controllers are all designed, so
you just apply everything and the controllers sort it out.
And if you don’t design your infrastructure controllers to do the same
thing, the Kubernetes controller functionality doesn’t actually work.
So, like, if you deploy a set of resources with Helm, and you can’t actually

(52:59):
provision one thing until the other thing is already provisioned, and your
controller doesn’t do the waiting, Helm’s not going to do the waiting.
Like, you’re just hosed.
So, you could actually do that, you know, if you design the
controllers to work like the built-in controllers in Kubernetes.
That’s a lot more work because the APIs don’t work that way.
If you wrap the Terraform providers, they don’t work that way, right?

(53:21):
So, that’s another big layer that you would have to build in your, sort of,
meta controller over the underlying controllers to actually make that work.
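To illustrate the pattern being described—controllers that do the waiting themselves so you can just apply everything and let them sort it out—here is a minimal, hypothetical sketch in Go. It is not Config Connector, ACK, or any real controller SDK; the Network and DatabaseSpec types and the getNetwork helper are made up for the example:

```go
// Sketch of a reconcile step that tolerates out-of-order provisioning: if the
// dependency isn't ready yet, it requeues rather than failing, and once the
// dependency's outputs exist it passes them as inputs to the next call.
package main

import (
	"fmt"
	"time"
)

type Network struct {
	Ready     bool
	SubnetIDs []string
}

type DatabaseSpec struct {
	NetworkName string
}

type Result struct {
	RequeueAfter time.Duration // zero means "done, don't requeue"
}

// reconcileDatabase provisions a database only once its network dependency has
// been provisioned; otherwise it requeues and tries again on a later pass.
func reconcileDatabase(spec DatabaseSpec, getNetwork func(string) (*Network, error)) (Result, error) {
	net, err := getNetwork(spec.NetworkName)
	if err != nil || !net.Ready || len(net.SubnetIDs) == 0 {
		// Dependency not provisioned yet: don't error out, just retry later.
		return Result{RequeueAfter: 30 * time.Second}, nil
	}
	// Dependency outputs (e.g. subnet IDs) become inputs to the next call—the
	// orchestration step that a plain apply or Helm release won't do for you.
	fmt.Printf("creating database in subnets %v\n", net.SubnetIDs)
	return Result{}, nil
}

func main() {
	// First pass: the network isn't ready yet, so the controller requeues.
	pending := func(string) (*Network, error) { return &Network{Ready: false}, nil }
	fmt.Println(reconcileDatabase(DatabaseSpec{NetworkName: "prod"}, pending))

	// Later pass: the network controller has finished, so the database proceeds.
	ready := func(string) (*Network, error) {
		return &Network{Ready: true, SubnetIDs: []string{"subnet-a", "subnet-b"}}, nil
	}
	fmt.Println(reconcileDatabase(DatabaseSpec{NetworkName: "prod"}, ready))
}
```

The built-in Kubernetes controllers behave roughly this way, which is why you can apply a whole set of manifests at once; the point above is that if your infrastructure controller just wraps an API or a Terraform provider that expects strict ordering, nothing in the stack does this waiting for you.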
And you know, there ends up being this demand, for the people who do adopt
it, to have every infrastructure resource they want to use covered by it.
So, all the work just goes into that, and the work
doesn’t go into, like, fixing the usability problems.

(53:42):
So, I think Crossplane has at least a partial solution to that,
but you have to do it in their composition layer, so the user of
Crossplane has to specify those dependencies, at least in some cases.
That just makes it feel more like Terraform again.
Yeah.
You’re basically just making a new module, right?
It’s just, like, a module in a different form.
Honestly, I don’t think it’s going to be dramatically

(54:03):
more popular than it is right now to do it that way, ever.
There just aren’t enough benefits.
There are some benefits, but they are kind of killed by how people use it.
So, for example, the composition layer in Crossplane is effectively a
templating layer, so now you can’t just go change the managed resources

(54:24):
directly because it will create drift with the composition layer.
And if you need to template the composition resources using Helm,
now you’re storing it in a Git repo in some templated form,
Go template format or whatever, and that’s hard to change and
hard to write, right, so you can’t build tooling on top of that.
The big benefit of using KRM could be that you could actually

(54:46):
build controllers or tooling that actually just automates
the generation and editing of those resources for you.
The way people use it, they pretty much destroy
that potential benefit of using a control plane.
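As a rough illustration of the tooling that becomes possible when the resource stays plain, structured data rather than a template, here is a minimal Go sketch; the DatabaseInstance kind, the example.dev API group, and the field names are made up, and it uses gopkg.in/yaml.v3 just for parsing:

```go
// Because a KRM-style resource is plain structured data, a program can read
// it, edit a field, and write it back—no template syntax in the way.
package main

import (
	"fmt"
	"log"

	"gopkg.in/yaml.v3"
)

const manifest = `
apiVersion: example.dev/v1
kind: DatabaseInstance
metadata:
  name: orders-db
spec:
  tier: db-small
  region: us-central1
`

func main() {
	var obj map[string]interface{}
	if err := yaml.Unmarshal([]byte(manifest), &obj); err != nil {
		log.Fatal(err)
	}

	// Tooling can mutate the configuration directly.
	spec := obj["spec"].(map[string]interface{})
	spec["tier"] = "db-large"

	out, err := yaml.Marshal(obj)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(out))
}
```

Once the source of truth is wrapped in Helm templates or a composition layer, an edit like this has to happen in the template instead, which is exactly the lost benefit being described.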
I have a question.
So, you know how you said that chip job was your dream job at the time, right?
What’s it like having a career as long as yours, and doing the

(55:08):
things that you’ve done, do you just keep getting the next dream job?
And what was your favorite out of those dream jobs, you know?
Yeah, it was pretty serendipitous.
I wish I could say, like, I really planned my career, but I really didn’t.
I loved all the jobs.
I loved Transmeta and PeakStream.
They were amazing and awesome.
I learned a lot, and it was very exciting for a while.

(55:31):
And then, you know, Google, working on Borg, and especially
Kubernetes—you know, Kubernetes, and CNCF, is definitely the most
industry impact anybody could pretty much ask for.
So, for the next thing I’m planning to do, that was definitely a
consideration when I spent six to nine months deciding what I wanted to do.
The opportunity to have industry impact again, will it be as big as

(55:52):
Kubernetes, mmm, maybe not, but it could become—you know, has that potential.
We just went down this whole deep path of—if anyone doesn’t know what
Crossplane is, and doesn’t know what Kubernetes is, and doesn’t know
what ACK and Config Connector are, I’m sorry we didn’t explain that very well.
But basically, these are all—the book I wrote at the time, we call
it Infrastructure as Software, where it’s basically like Terraform

(56:13):
in a for loop that keeps applying something or driving to a state.
And what you were describing was all of the pains that I’ve
lived over the last decade of trying to template Helm and
all these other things of, like, oh well, you know what?
Like, at some point templates aren’t good enough, for all the reasons
you just—like, the configuration drifts, and the ability to do
complex things, and all that stuff just becomes really difficult.

(56:34):
But as a user, I just want the template.
I just want the, give me some sane defaults, and I
just give you a little more data for what I want.
But, in my head—and my last question here is—what I’m kind of curious
about is how what you were describing, all of these problems, relates
to something like System Initiative, where System Initiative took a
different approach: it’s

(56:55):
not Terraform, it’s a direct model to database, sort of—the UI, the
GUI on top of it is a representation of the actual infrastructure,
based on the actual API calls and what’s actually in the database.
And being able to modify those things directly
is one of its strong points, from what I’ve seen.
Is that what you’ve seen as well?
Is that something that you think is the actual ultimate goal?

(57:18):
Well, I definitely think that Infrastructure as Code
as we know it has reached a dead end, more or less.
I think in my entire career of more than 30 years, what we’re
doing today feels very similar to what I did in the late-’80s.
It’s, you have some build-like process that generates some stuff
that you apply to some system, and the actual details of the

(57:39):
syntax, and the tools, and whatever has changed a little bit,
but it feels pretty much the same as what I did in college.
So, I understand the reasons for how we got there.
It’s pretty expedient.
I don’t mean it in a disparaging way.
I actually mean it in a very complimentary way, but
Infrastructure as Code tools were easy to build.
They really hit a sweet spot in terms of making it easy.

(58:01):
For example, Terraform, the orchestration it does is pretty simple, the
compilation it does is pretty simple, the model is pretty straightforward.
The providers are pretty easy to write.
They don’t ask too much of the provider author.
And even for using it, it feels like scripting.
If you need to provision a few resources, you can write some Terraform.
Once you learn the language, it works—except for some baffling decisions, like

(58:25):
deleting stuff by default—in a mostly predictable way, right?
So, it’s pretty expedient, you know; it’s a pretty useful tool.
It got pretty far.
But at scale and for some people, it’s not that easy to use.
And actually adding that kind of scripting layer on top of the
APIs, much like Crossplane and the other [unintelligible]-based

(58:47):
tools where people are, you know, using Helm on top of it,
compositions on top, kind of takes away the power of the APIs.
So, APIs as the source of truth is what enables interoperable
ecosystems of clients and tools to interact with those APIs, right?
You publish an API, and you can build a GUI on top, and a CLI on

(59:09):
top, and automation tools on top, terminal consoles, and all kinds of
cool things, ChatOps, whatever, like, you can build all that on top.
And if you wrap it and say, “No, no, no, you have to go out
to Terraform, and check it into Git, and get it reviewed,”
and you’re saying, “No, you can’t do that anymore,” right?
And I think that’s a huge limitation to
what we can do with Infrastructure as Code.

(59:30):
And it’s not just Terraform.
That’s just the most popular one.
The same is true of Pulumi, and anything else out there.
And that was just some, like, deep-seated, GitOps-is-not-the-answer-you’re-looking-for
sort of vibes there [laugh].
Yeah, I think GitOps—I have a couple of blog posts about GitOps.
GitOps, I think, solved certain problems.

(59:50):
The core benefit that I see from GitOps—I mean, for retail edge,
it has a networking benefit, so there’s, like, a specialized
benefit there, and if you have a large number of targets, you need
something that retries better than a pipeline, and stuff like that.
But what GitOps does is it creates a one-to-one binding between the resources

(01:00:11):
that are provisioned or created in Kubernetes, if you’re talking about
GitOps for Kubernetes, and the source of truth for that configuration, right?
So, I think there’s value in that, especially in the world where
you’re saying you have to go change that configuration to do anything.
The unidirectionality of it, where if you want to make a change, you
have to change your configuration generator, program, or template, or

(01:00:34):
you have to change the input variables, you have to check that into Git,
go through your CI pipeline to deploy it, and that is… very restrictive.
It’s very slow.
It creates a lot of toil.
Why do I have to go edit Infrastructure as Code by hand, right?
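For anyone who has not run GitOps, a minimal sketch of the loop being described might look like the following. It is purely illustrative—not how Argo CD, Flux, or any real tool is implemented—but it shows the binding: whatever is committed in the repo is the desired state, and the sync loop drives the live system to match it, which is also why out-of-band edits get reverted and every change has to flow through Git first.

```go
// Toy GitOps-style sync loop: desired state = manifests as committed in Git,
// live state = what's actually applied; the loop makes live match desired.
package main

import "fmt"

// apply is a stand-in for "make the live system match this manifest".
func apply(live map[string]string, name, manifest string) {
	if live[name] != manifest {
		fmt.Printf("syncing %s from Git\n", name)
		live[name] = manifest
	}
}

func main() {
	// Source of truth: rendered manifests as committed in Git (hypothetical).
	gitCheckout := map[string]string{
		"deployment/web": "image: web:v2",
		"service/web":    "port: 80",
	}

	// Live state, including a manual edit someone made out of band.
	live := map[string]string{
		"deployment/web": "image: web:v1-hotfix",
	}

	// Every resource is driven back to what Git says, so the out-of-band
	// hotfix is overwritten—the unidirectional flow in action.
	for name, manifest := range gitCheckout {
		apply(live, name, manifest)
	}
	fmt.Println(live)
}
```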
So, different people are exploring different solutions for

(01:00:56):
not writing the Infrastructure as Code by hand, like, you have
the Infrastructure from Code tools that are generating it.
I don’t really think that’s the answer.
You have System Initiative, which is kind of interesting, although
kind of challenging to sort of understand exactly what it is.
But I do think it’s good that folks are exploring alternatives.

(01:01:16):
I don’t think just kind of building more generation layers that
still have the same overall properties, like the unidirectional flow,
are going to provide dramatic benefits over what we’re doing now.
Like, people are, of course, trying to use AI to
generate Terraform and other Infrastructure as Code.
Like, I’ve tried… doing that.
It works kind of okay for CloudFormation.

(01:01:38):
Works less okay for Terraform, in my experience.
That could be another whole podcast.
But I don’t think that ultimately changes sort
of the overall math in the equation, right?
Like, you still have to have humans that understand it, that
can review it, and make sure it’s correct and not hallucinated.
And—
You need some experts that have more context than the system itself.

(01:01:58):
Like, so there’s someone outside of the system
that knows whether this is safe or the right way to do it.
Which, everybody’s plan to automate it is going
to make it harder for those humans to have that.
Yeah.
Then you also have to deal with configuration drift, and you
know, all the other problems that are kind of independent of
the configure—Infrastructure as Code tool that you’re using.
Brian, this has been awesome.
Thank you so much for coming on the show.

(01:02:18):
Where should people find you online if they want to reach out, if they want
to ask you more questions, if they want to, I don’t know, like, get in touch?
Yeah, I a—thanks for having me on.
I’m BGrant0607—it’s a trivia question what the numbers
stand for—on LinkedIn, Twitter, BlueSky, Medium.
And I mean, I’m also on Hachyderm and some other

(01:02:42):
Mastodon things, but that seems a lot more fragmented.
And also still on Kubernetes Slack and CNCF Slack, as just Brian Grant, I think.
Well, thanks again so much.
Anyone that has questions or wants to reach out, we actually
don’t have a Slack instance for Fork Around and Find Out.
We’re not doing any, sort of like, real-time chat for this.
BlueSky is like—social media is kind of where I’m trying

(01:03:04):
to gravitate towards for these sorts of conversations,
if you have other feedback or want to reach out.
I don’t want to check another chat system and log into another system for it.
Like, I’m already there.
Autumn and I are both there.
We have the Fork Around and Find Out BlueSky handle which will be posting
these episodes, so feel free to leave comments and send us messages on there.
And yeah, we will talk to you all again soon.

(01:03:39):
Thank you for listening to this episode of Fork Around and Find Out.
If you like this show, please consider sharing it with
a friend, a coworker, a family member, or even an enemy.
However we get the word out about this show
helps it to become sustainable for the long-term.
If you want to sponsor this show, please go to fafo.fm/sponsor, and reach out
to us there about what you’re interested in sponsoring, and how we can help.

(01:04:03):
We hope your systems stay available and your pagers stay quiet.
We’ll see you again next time.