
October 9, 2025 46 mins

Bret is joined by Philip Andrews and Dan Muret of Cast AI to discuss pod live migration between nodes in a Kubernetes cluster.

🙌 My next course is coming soon! I've opened the waitlist for those wanting to go deep in GitHub Actions for DevOps and AI automation in 2025. I'm so thrilled to announce this course. The waitlist allows you to quickly sign up for some content updates, discounts, and more as I finish building the course. https://learn.bretfisher.com/waitlist 🍾


Cast AI dynamically moves your pod to a different node without downtime or data loss. It copies your running pod's data, memory, IP address, and TCP connections from one node to another in real time.

In this episode, we nerd out over how Cast AI's live migration works under the hood and its use cases, including hardware and OS maintenance on a node. I've got a feeling Cast AI has a winning feature on their hands.

★Show Links★
Cast AI website
Cast AI YouTube Channel

Check out the video podcast version here: https://youtu.be/yINNWxRywv4

You can also support my free material by subscribing to my YouTube channel and my weekly newsletter at bret.news!

Grab the best coupons for my Docker and Kubernetes courses.
Join my cloud native DevOps community on Discord.
Grab some merch at Bret's Loot Box
Homepage bretfisher.com

  • (00:00) - Introduction
  • (02:21) - Cast AI Elevator Pitch
  • (06:57) - Stateful Workloads
  • (10:03) - Bin Packing in Live Migration
  • (13:35) - Stateful vs Stateless
  • (15:43) - Networking and Storage Considerations
  • (23:03) - Future Developments and Use Cases
  • (25:43) - ML Workloads
  • (28:25) - Live Migration of Spot Instances
  • (31:01) - Live Migration Process Explained
  • (39:02) - Challenges and Engineering Behind Live Migration
  • (43:56) - Getting Started with Cast AI

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:04):
Kubernetes Pod Live Migration.
That's what Cast AI calls it when they dynamically move your pod to a different node without downtime or data loss.
They've built a Kubernetes controller that works with CSI storage and CNI network plugins to copy your running pod data, the memory, the IP address, and the TCP connections from one node to another in real time. Welcome

(00:28):
to DevOps and Docker Talk, and I'm your solo host today, Bret Fisher.
This YouTube live stream was a fun one because I got to nerd out with engineers Philip Andrews and Dan Muret from Cast AI on Kubernetes pod live migrations and how they work under the hood.
We talk about use cases for this feature, including hardware or OS maintenance on a node.

(00:49):
Maybe for right sizing or bin packing your pods for cost savings, or moving off of a spot instance that's about to shut down.
Or really anytime you need to move a daemon set that would cause an outage if the pod had to restart or be redeployed.
I don't know of any other turnkey way to do this on Kubernetes today, but

(01:10):
I've got a feeling that Cast AI has got a winning feature on their hands, and I'm glad that we got to dig into it.
Over 20 years ago, the virtual machine vendors built this live migration feature into their products, and finally, in 2025, we're now able to do that in Kubernetes.
Let's get into the episode.

(01:31):
Welcome to the show.
All right, both of these gentlemen are from Cast AI.
Philip is the Global Field CTO at Cast AI.
What exactly does a Global Field CTO do?
Handling a lot of our large customers, a lot of the strategic partnerships, usually new technologies. So working with a lot of our, you know, customers on showing them new technologies, helping in proof of concepts with new technologies.

(01:51):
It's been kind of a cool role, basically on the technical, customer-facing side.
I get to do a lot with our largest customers in solving some of the hardest problems.
Nice.
When you don't know what the title is, it makes it sound like you just live in an airplane.
It sounds impressive.
And we've got Dan Muret, or Muret.
I really don't know my French, so that's probably a horrible pronunciation.

(02:12):
Dan is here; he's a senior sales engineer with Cast AI, or one of them.
I'm gonna make you the senior sales engineer so that you sound very elite.
Uh, welcome, Dan.
Who could tell me the elevator pitch for Cast AI? Because I've known about you all for years.
I've visited your booth at KubeCon at least a half a dozen times over the years.
What's the main benefit of Cast AI?

(02:33):
Sure, I can grab that one.
When Cast AI was founded, it was to solve a problem that our founders had in their last startup years ago, which was that every month their AWS bill was going up 10%, regardless of what they did the month before to try to mitigate and manage costs across the infrastructure.
They sold that startup to Oracle, and when they finished that Oracle time, they went and figured out how to solve this problem.

(02:54):
They said, well, Kubernetes is gonna be the future platform.
We're gonna make our bet on Kubernetes.
And the only way to solve this problem is through automation.
Because doing things manually every month, one, it's tedious and takes up a lot of time.
And two, it's just not helpful, right?
You save a little bit, but, you know, it's two steps back for every step forward you make.
With Cast AI, it's fully automation first, right?

(03:15):
We made the effort that everything was going to be automated from the start.
When it comes to node autoscaling, node selection, node right-sizing, workload right-sizing, everything to do with Kubernetes, everything we implement is automated.
And that's where the live migration piece came in: being able to automatically move applications around within the cluster without having downtime.
And that's where application performance automation comes

(03:37):
in, moving from this application performance monitoring mindset.
Datadog has made a lot of money on that.
I dunno if I'm allowed to say that, but they've done very well and have a fantastic platform.
We love Datadog, but you get data overload.
You end up with metric overload, and actioning on those is very hard.
Where we need to go from here, especially with the AI mindset that

(03:59):
we're moving into, is automation of that application performance.
And that's what Cast AI is leading the way in.
Nice. When you all reached out and I learned about the fact that you now have live migration, it took me back to almost 25 years ago, when that was first invented for VMs. At the time it felt like magic.
It did not seem real.
We all had to try it to believe it, because it seemed impossible.

(04:23):
To move from one host to another, maintain the IP, maintain the TCP connections... surely I'm gonna freeze up and it's gonna be like a frozen screen.
We all just assumed that.
And maybe at first it was a little hiccupy, I think, if I remember correctly, like 2003, 2005.
It was one of those where it wasn't quite live; there were very short gaps.
Then eventually it got good enough that it was live.

(04:44):
I was running data centers for local governments at the time, so I was very interested in this, 'cause we were running both ESX and Hyper-V.
So I was heavily invested in that feature and functionality.
So when I saw that you were doing it in Kubernetes, my first thought was, why did this take so long?
Why don't we have this yet on everything?
Because it's clearly possible, it's technically possible.

(05:05):
Obviously, it's not super easy, and it requires a lot of low level tooling that has to understand networking and memory and, you know, disk writes and all that kind of stuff.
So I'm excited for us to get into exactly how this operates for a Kubernetes admin.
And I really feel like this show's gonna be great for anyone learning Kubernetes, or Kubernetes admins, because we get to talk about the stateful set and daemon set problem

(05:29):
of, we've got stateful work. Everyone's got, I mean, almost everyone I know has stateful workloads in Kubernetes.
I would say, I don't know about you all's experience, but to me it's an exception when everything is stateless in Kubernetes nowadays.
Do you find that to be the case?
We talk to customers all the time, right.
And yeah, it used to be, just a couple years ago, right, it was a lot more stateless.

(05:49):
Web servers, whatever.
Now we're definitely seeing a shift to more stateful workloads, whether it's legacy applications being forced into Kubernetes as part of a modernization project or whatever.
We're seeing a lot more stateful workloads in Kubernetes for sure.
Particularly amongst the Fortune 100s and Fortune 500s, right?
'Cause you've got this modernization, and I put it in quotes, where

(06:10):
modernization means taking some crusty old 15-year-old application, containerizing it, shoving it in Kubernetes, and calling it cloud native: the flawed approach of lift and shift.
And you end up with a lot of applications that are in Kubernetes.
They're listed as a deployment, but you can't restart them without your customer having a significant outage.
It goes against everything Kubernetes was built on.

(06:32):
But that's the world we live in today.
When we first launched live migration and I posted about it on LinkedIn, some of the first questions I got were, why is this even needed?
If you're doing Kubernetes correctly, live migration shouldn't even be a real thing.
Yes, but 95% of the customers I deal with don't do Kubernetes the right way.
Well, yeah, I honestly think we could argue that's a great point.

(06:53):
When Docker and Kubernetes were both created, it was all stateless, 'cause it's easy, you know, move everything around.
It's wonderful.
But I mean, if there's anything consistent about this channel over the almost decade it has existed, it's that it's all containers.
Like I don't care what the tool is, we're doing it in containers.
The large success of containers is because we could put everything in them.

(07:14):
So many evolutions, or attempted evolutions, in tech have been, well, you're gonna have to rewrite: you're serverless, you're gonna have to write functions now, or you're gonna have to rewrite in this language, or whatever.
And I think that's the secret sauce of containers: we could literally shove everything in them.
It's also the negative.
And so there's a thing that, I don't know if I learned it from a

(07:37):
therapist or whatever, but often our weaknesses are just overdone strengths.
And I feel like the strength of containers is that you can do everything with them.
You can put every known app on the planet in them.
They will eventually work if you figure it out.
The overdone weakness is that we're putting everything in there, which makes managing these infrastructures very hard.
You have to assume everything's fragile until you are sure

(07:58):
that it's truly stateless.
Even stateless: people say stateless, and what they really mean is it doesn't care about disk, but it definitely cares about connections, which, when we're trying to talk about stateless, is not technically accurate.
Like when we say stateless, we should probably mean it also doesn't care about connections, at least once the connections are drained.
It's an interesting dilemma we all have in infrastructure: we have the

(08:19):
power to be able to move everything and do everything, but also everything we're running is super fragile at the same time, so how do we even manage that?
We've encountered a lot of teams that swore up and down they were stateless, right up until you started bin packing their cluster.
They said, wait, wait, wait.
Why are we having all these restarted pods?
We're like, because we're bin packing and we're moving things, and we're getting better optimization.
They're like, but my container restarted.

(08:41):
Well, yes, that's what containers do in Kubernetes.
Right.
And for those that are maybe just getting into Kubernetes, or haven't dealt with large enterprise forever-workloads where they just can't be touched: I've had 30 years in tech of don't touch that server, don't touch that workload.
It's fragile, it's precious, but it's also probably on the oldest hardware and the least maintained.
So one of the performance measures that any significant size

(09:05):
Kubernetes team is dealing with is the cost of infrastructure.
And then we keep getting told, I think this was just in the last year at KubeCon, that even on top of Kubernetes, we're still only averaging like 10% CPU utilization across nodes.
We still are struggling with the same infrastructure problems that we were dealing with for the last 30 years, even before VMs, before virtualization.

(09:28):
That was the same problem we had then, because everybody would want their own server, and they always had to plan for the worst, busiest day of the year.
So they would buy huge servers, put 'em in, and they'd sit idle almost all the time because they barely got 5% utilization.
So I can see where one of the core premises of something like an application performance tool is that we're gonna save tons of money by bin packing.

(09:52):
Can you explain the bin packing process?
What does that look like?
So one of the big things with Kubernetes is the scheduler will typically round-robin assign pods to nodes.
If you have 10 nodes in a cluster, your pods will more or less get evenly distributed to those nodes in the cluster.
You can manage that with certain scheduler hints and suggestions to steer that towards, you know, most

(10:15):
utilized, least utilized, et cetera.
But at the end of the day, you're gonna have spread out workloads across your nodes.
Bin packing is basically the defragmentation of Kubernetes, right? Back when, you know, Bret, you were first starting out, when I was first starting out, you could actually defragment a hard drive, and you got to move the little Tetris blocks around the screen, in those days.
Being able to do that in a Kubernetes cluster can mean massive, massive

(10:35):
savings on the actual utilization of that cluster, because now you free up a bunch of nodes in the cluster that are no longer necessary.
You can delete those off, and when you need them, you just add them back.
That's the joy of being in a cloud environment: you can use the least amount of resources when you don't need 'em, so, for instance, at your off-busy hours, your nighttime hours.
And then when you start needing 'em again, you spin 'em up, you add

(10:57):
more, you scale up during the day, and being able to do that process over and over again every day is how you can optimize your cloud resources.
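For readers who want the "Tetris" analogy made concrete, here's a toy first-fit-decreasing bin packer in Go. It's a minimal sketch, not Cast AI's algorithm (their bin packer also weighs memory, affinity rules, and live simulation of the cluster); all the numbers are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Sizes are in milliCPU, the unit Kubernetes uses for CPU requests.
type node struct {
	capacity int
	used     int
}

// firstFitDecreasing packs pods onto as few nodes as possible: sort pods
// largest-first, then place each one on the first node with room left.
func firstFitDecreasing(pods []int, capacity int) []node {
	sort.Sort(sort.Reverse(sort.IntSlice(pods)))
	var nodes []node
	for _, p := range pods {
		placed := false
		for i := range nodes {
			if nodes[i].capacity-nodes[i].used >= p {
				nodes[i].used += p
				placed = true
				break
			}
		}
		if !placed {
			nodes = append(nodes, node{capacity: capacity, used: p})
		}
	}
	return nodes
}

func main() {
	// Ten pods that a round-robin scheduler might have spread over ten nodes.
	pods := []int{3500, 2000, 1500, 1200, 900, 700, 500, 400, 250, 250}
	nodes := firstFitDecreasing(pods, 4000) // 4-vCPU nodes
	fmt.Printf("packed onto %d nodes:\n", len(nodes))
	for i, n := range nodes {
		fmt.Printf("  node %d: %d/%d milliCPU used\n", i+1, n.used, n.capacity)
	}
}
```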
What we see is that people have so many stateful workloads, whether it's stateful with real state, or stateful in the sense of this is a really poorly architected application, or stateful in the sense of this application takes 15 minutes to start up, it's monolithic, and I can only

(11:20):
run one copy of it, so I can't move it.
All of those things mean you can't bin pack a cluster, right?
You can't move those things around.
So what ends up happening is people just end up with these stateful workloads scattered throughout all 10 nodes.
And even if the 10 nodes are only 60% utilized, you can't get rid of any of them, because it'll cause some kind of service interruption.

(11:43):
And that's where live migration allows you to move those stateful, sensitive workloads.
So now those 10 nodes can go down to six or seven nodes without having a service interruption, even if there's less than ideal workloads scattered throughout the cluster.
Stateful versus stateless: where's the scenario where we need

(12:05):
a live pod migration? To those that are perfect in all their software, and they control all the software that runs on Kubernetes (I don't know who those people are, but let's just say they exist), then this isn't needed.
Every database has a replica or database mirror, so you can always take a node down.
Every pod has proper shutdown handling for ensuring that connections

(12:29):
are properly moved to a new pod.
By the way, I used to do a whole conference talk on TCP packets and resetting the connection to make sure it moves properly through the load balancer to the next one, and having a long shutdown time so that you can drain connections.
That world of shutting down a pod is so much more complicated than anyone gives it credit for.
Everyone treats it like it's casual and easy, and it's just not, if

(12:50):
you're dealing with hundreds of thousands or millions of connections.
There is a lot of nuance and detail to this.
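For anyone newer to this, the draining Bret describes usually boils down to: catch SIGTERM, let the load balancer stop sending you new traffic, then finish in-flight requests before exiting. A minimal Go sketch; the sleep and timeout values here are illustrative, not from the episode.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello\n"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go srv.ListenAndServe()

	// Kubernetes sends SIGTERM when the pod is evicted or the node drains.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Give the endpoint controller and load balancer time to stop routing
	// new connections to this pod before we refuse them.
	time.Sleep(10 * time.Second)

	// Finish in-flight requests, bounded by the pod's grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```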
And I often end up with teams where they implement Kubernetes.
It's a sort of predictable pattern, right?
They implement Kubernetes, move workloads to it.
They think Kubernetes gives their workloads magic, and then they just start moving things around, and they realize, when their customers complain, that the rules of TCP/IP, load balancers, and connection state,

(13:14):
like all these rules, still apply.
You have to understand those lower levels, and obviously disk, and writing to disk, and logs for databases, and all that stuff.
That's still there too, I think.
I think the networking is where I see a lot of junior engineers hand waving over it, because quite honestly, the cloud has made a lot of the networking problems go away.
So we don't have to have Cisco certifications just to run servers anymore.

(13:36):
We used to, but now we can get away with it until a certain point in the career or complexity level.
And then suddenly you're having to really understand the difference between TCP and UDP, and how session state, long polling, or web sockets, how all these things affect

(13:57):
whether you're going to break customers when you decide to restart that pod or redeploy a new pod.
I love that stuff, because it's super technical and you can get really into the weeds of it, and I wouldn't call it a solved problem for everyone.
My understanding of something like a live migration is, it takes most of those concerns... it doesn't make them irrelevant, but it does deal with those concerns.
Am I right? In terms of networking, we're talking about live migrations having

(14:18):
to be concerned with IPs and connection state, stuff like that.
Yeah, so with being able to do the networking move of things, to your point, reestablishing those sessions: one of the big things we see is long running jobs.
If you've got a job that's running for eight hours and it gets interrupted at six, you've lost that job.
Even if you try to move it from one to the other, there's some checkpointing involved, a

(14:41):
lot of times, like on a Spark workload, the driver will just kill the pod and restart it if it senses any kind of interruption in the networking.
So the networking's super important there, being able to maintain that.
Long running sessions, web sockets: to your point, we've actually tested this extensively with web sockets.
Web sockets stay connected.
And we're still in those early vMotion days.

(15:03):
There is a slight pause when we move things.
It took vMotion multiple years before they got it really ironed out.
We're probably moving faster than they did, 'cause we have a lot of the experience of what they went through, and the research that's happened since then.
So I think we're moving pretty fast on shortening that time window.
But what we found is you queue up all the traffic, and once the pod is live on

(15:23):
the new node, that traffic is replayed and all the messages come through.
So even on something like a web socket, you don't actually lose messages.
They're just held up for a few seconds.
And that's extremely important for maintaining that connection state, like you were mentioning.
One of our customers that we're working with on this heavily, they run Spark streaming jobs.
So they're 24/7, 365, pulling off a queue, running data transformations

(15:46):
and detections, and then pushing somewhere else for alerting mechanisms.
If they have a pod go down, it takes about two minutes to get that process restarted and pull in all the data that they need again.
That's two minutes of backlog.
They have super tight SLAs.
They have a five minute SLA from message creation to the end of the run through the entire detection pipeline.

(16:07):
So if you've got a two minute delay on that shard in your Kafka topic, that's a huge chunk of that five minutes that you just ate up, before you even count the rest of the pipeline.
It's very easy to start missing SLAs there.
You can't take maintenance windows if you're 24/7, 365 and you're doing security processing.
You can't be like, well, security's gonna be offline for 10 minutes

(16:27):
while we move our pods around.
That's just not acceptable in that world.
So keeping that connectivity, keeping the connection state, being able to keep everything intact, keeping the Kafka connection, keeping the Spark driver connection, is all super important for being able to move that entire TCP/IP stack over from one node to another during that migration process.

(16:48):
Yeah, and I mean, we're really talking about a lot of the different kinds of problems that come with shifting workloads.
Like walking into an environment and sort of being your own wrecking ball, your own chaos monkey, and saying, I'm gonna go over here and push the power button on this node, or I'm gonna properly shut down this node: do you have everything set up correctly so

(17:09):
that connections are properly drained? That is such a moving target, especially because every time we've had these processes, I've had clients where we go through this exercise of, we're going to do maintenance on a node, and we're even gonna plan for it, and then we do it, and then we fix all the issues of the pods and the shutdown timing and the Argo CD deployment settings

(17:31):
that we need to massage and perfect.
And then, you know, six months later, if we do it again, the same thing happens, because now there's new workloads that weren't perfected and weren't well tested.
If I can make a career out of actually being a pod migration guru, that sounds like my kind of dream job, where we crash and break everything, and then we track all of the potential issues of that, and

(17:53):
we are like a tiger team that goes pod by pod and certifies: yep, this pod can now move safely without risk, because we've got everything dialed in.
We've got all the right settings.
I feel like that's a workshop opportunity.
Maybe sell something on that, because there are so many levels of complexity we haven't even talked about, like database logs and database mirroring.
You can't really spin up a new node of a database and let it sit there

(18:16):
idle as a pod while you're waiting for the old one to shut down; they can't access the same files, blah, blah, blah.
It just depends on the workload, and on how complex this all gets.
But I'm assuming also that when we talk about something like live migrations, we're not just concerned with networking.
We're also somehow shifting storage.
I'm guessing there's certain limitations to that, where you're not replicating the backend volumes.

(18:38):
You're, I guess, just using something like iSCSI reconnects, or... how does that work?
We haven't really gotten into the solution, but I know you're only on certain clouds right now, and I'm assuming that's partly due to the technical limitations of their infrastructure.
Right, exactly.
Each cloud has different kinds of quirks around how they function, and what the different technologies look like around them.

(18:58):
Somebody had asked about being able to move, you know, larger systems, and what the limits around it are. It depends on the use case, right?
If you're talking spot instances, being able to move from one spot instance to another spot instance in a two minute interruption window on AWS, it depends on how much data.
If you're trying to move 120 gigs of data, physics is working against you.

(19:19):
You don't have enough time in that two minute window to get enough through the pipe over to the new system.
Now if you're talking small pods, if you're talking less than 32 gig nodes, you can move that fast enough.
64 gigs, maybe you're on the edge.
Depending on how much other network traffic is tying up the bandwidth, 64 gigs is getting on the edge of what you can move in a two minute window.
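The back-of-envelope math tracks. Assuming an uncontended 10 Gbps instance link (a common bandwidth class, chosen here purely for illustration), you move about 1.25 GB per second:

```go
package main

import "fmt"

func main() {
	// Rough transfer times on an uncontended 10 Gbps link (~1.25 GB/s).
	// Real migrations share that link with workload traffic, so the
	// usable window is smaller than the two-minute spot notice.
	const gbPerSec = 1.25
	for _, gb := range []float64{32, 64, 120} {
		fmt.Printf("%3.0f GB -> ~%3.0f s of a 120 s window\n", gb, gb/gbPerSec)
	}
}
```

That prints roughly 26 s for 32 GB, 51 s for 64 GB, and 96 s for 120 GB, which matches the speakers' intuition: 32 is comfortable, 64 is on the edge, and 120 leaves no margin once anything else is using the pipe.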

(19:40):
That other example I was talking about, those long running Spark streaming jobs: if they're running on demand, live migration is still a massive benefit, because now you can do server upgrades without taking an outage.
You can create a new node running your new patched version of Kubernetes, running your new patched OS, and migrate the pod from one to the other.
Your time to replicate is less important.

(20:03):
Even if it takes you three minutes, four minutes, or five minutes to replicate the memory from one box to the other, who cares?
It's not gonna be paused for that long, because what we're doing is delta replication.
You replicate a big chunk, and then a smaller chunk, and then a smaller chunk, until the chunk is small enough that you can do it in a pause window.
And so when you're moving a huge service from one to the other,

(20:24):
same thing.
If you're talking NVMe local storage, we've got another customer we're working with, and it's a different set of problems.
They have a terabyte of NVMe that they use as local ephemeral disk on every node, and that needs to be replicated from node to node.
When they do node upgrades, it takes about 20 minutes to replicate all of that from one node to another, even on high throughput disks, on high throughput nodes.

(20:45):
But if it's happening in the background, while everything else is humming along nicely, who cares?
Keep replicating it over.
You keep going down to deltas, and then once your deltas get small enough, you pause for six to 10 seconds, depending on how big the service is.
And then you slide it over.
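That iterative pre-copy loop is the same idea the early VM live migration papers described. Here's a minimal sketch in Go; all the rates and sizes are invented for illustration, and the loop only converges while the link is faster than the workload dirties its state:

```go
package main

import "fmt"

// iterativePreCopy models delta replication: copy everything once, then
// repeatedly copy only what the workload dirtied during the previous pass.
// When the remaining delta can be sent within the allowed pause, freeze
// the pod and ship that final delta.
func iterativePreCopy(totalGB, dirtyGBps, linkGBps, maxPauseSec float64) {
	delta := totalGB
	for pass := 1; ; pass++ {
		sec := delta / linkGBps
		if sec <= maxPauseSec {
			fmt.Printf("pass %d: final %.2f GB delta -> %.1f s pause, cut over\n", pass, delta, sec)
			return
		}
		fmt.Printf("pass %d: copy %.2f GB in %.1f s while the pod keeps running\n", pass, delta, sec)
		delta = dirtyGBps * sec // whatever got dirtied becomes the next delta
	}
}

func main() {
	// Illustrative numbers only: 64 GB of state, a 0.2 GB/s dirty rate,
	// a 1.25 GB/s link, and a 2 second acceptable pause window.
	iterativePreCopy(64, 0.2, 1.25, 2)
}
```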
A lot of these things are being solved.
We're actively reducing these pause times by being able to do more prep

(21:06):
behind the scenes, being able to do more processing behind the scenes.
Everything is operating as a containerd plugin.
I saw somebody asked about on-prem: we will be supporting on-prem, and we will be supporting other solutions.
The big catch there is everybody has some different flavor of networking, and different flavors of things behind the scenes.
The actual live migration piece right now could apply to any Kubernetes anywhere.

(21:31):
It's the IP stack that gets a little trickier, because you've got Cilium running places, you've got Calico running places, you've got the AWS VPC CNI running places; everybody has different networking flavors.
So being able to maintain network connections when you do the migration is largely the more difficult part of the whole process.
Hmm.
Being able to move the pod isn't that bad. So if you've got workloads where

(21:54):
you can reestablish connections and the connection resetting is not a big deal, but you don't wanna have to restart the pod, that's fairly straightforward.
We could pretty much do that today across any containerd compatible Kubernetes.
It's specifically the networking that causes a lot more hardship, because everybody has a different flavor of networking.
For AWS, we were able to fork the open source AWS VPC CNI and

(22:17):
create our own flavor of it that now handles the networking piece.
So we're using the open source AWS CNI code, we've modified it, and now it works just fine for our purposes.
We're doing something similar on GCP.
GKE, if I recall, is using Cilium under the hood for their networking, so we're gonna be building a similar plugin for their Cilium side.

(22:38):
Yeah, and the nice thing is, I guess, if you build it for Cilium, would it work universally across any Cilium deployment?
In theory, I mean, I'm just thinking of the most popular CNIs, and if you check those off the list, it suddenly gives you, you know, a lot more reach than having to go cloud by cloud or OS by OS, you know?
Exactly.
Our first iteration of this, back in January, February, the first version

(23:01):
that we demoed, was actually Calico.
A lot of people were like, I don't wanna have to rip out my cluster and rebuild it with Calico as the CNI.
We were able to figure out a way to work with the VPC CNI as a backing basis there.
So Calico's pretty much already built.
We've got the AWS CNI now built, Cilium is our next target, and I saw somebody asked about Azure. Azure is probably gonna be early 2026.

(23:25):
We'll be EKS, GKE, and then we'll work on AKS, and then we'll work on on-prem solutions after that.
So on-prem will probably be sometime in 2026.
Yeah, I can remember, going back to the two thousands, when we went from delayed migrations, or paused migrations, to live migrations.
I can remember reading the technical papers coming out of VMware and Microsoft, and they were talking about the idea of these deltas, continually

(23:48):
repeating the delta process until you get down to zero, or until you can fit it in a packet, and then that's the final packet, kind of thing.
I don't know why I remember that all these years later, but I do remember that I thought that was some pretty cool science, like some pretty cool physics across the wire, because back then we were lucky if our servers had one gigabit, never mind 200 gig workloads or anything like that.
This actually led me, during my research, and we could talk about

(24:10):
the idea that there have been attempts in Linux over the years to try to solve this universally.
I did some research before the show and saw some projects around ML workloads in particular. A lot of engineers, whether it's platform engineering or just the ML engineers themselves, are interested in this because of the problems of

(24:32):
large ML or AI workloads today, where you can't interrupt them; if you interrupt them, you have to basically start over.
It's sort of a precious workload while it's running, and it might be running a long time.
Do you have AI and ML workload customers?
Are they maybe part of the first movers to move onto something like this?
I'm basing it on the KubeCon talks and things that I've seen out there.

(24:53):
Large scale data analytics is definitely one of the big players here.
A lot of it's Spark driven data analytics that we're seeing, because of exactly that problem.
A lot of these jobs will be running for 8, 10, 12, 14, 16 hours, and running those on demand at the scale that they're running them at is extraordinarily expensive.
So the big ask is, how do we get those workloads onto spot instances, where,

(25:18):
when we get the interruption notice, we can fall back to some type of reserved capacity and then fail back to spot?
So basically the goal is to move to this new concept where in your Kubernetes cluster you have some swap space, whether that's excess spot capacity, two or three extra nodes of spot capacity, or a couple of nodes of on-demand capacity, where if you get a node interruption, you can quickly swap

(25:42):
into those nodes, and then once you stand your new spot instance back up, you can swap back to that spot instance.
That's where we're headed.
That's what Q4 is gonna be working on this year: being able to automate that entire process so you can float back and forth between reserved capacity and spot capacity, to really save on those data analytics jobs, those large ML jobs.
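For context on that interruption notice: on EC2, a scheduled spot reclaim appears in instance metadata roughly two minutes ahead of time. A minimal Go poller, assuming IMDSv1 is enabled (production code would fetch an IMDSv2 token first, and tools like AWS's node termination handler already handle this):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// The spot interruption notice appears at this instance metadata path
// about two minutes before the node is reclaimed.
const spotActionURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	for {
		resp, err := client.Get(spotActionURL)
		if err == nil {
			// 404 means no interruption is scheduled; 200 returns a JSON
			// body with the action and time, e.g. {"action":"terminate",...}.
			if resp.StatusCode == http.StatusOK {
				fmt.Println("interruption notice received: trigger the migration")
				resp.Body.Close()
				return
			}
			resp.Body.Close()
		} else {
			log.Printf("metadata query failed: %v", err)
		}
		time.Sleep(5 * time.Second)
	}
}
```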

(26:04):
We're not to the GPU side of things yet.
I'd love to get us to where we could migrate GPU workloads, 'cause that's where the next big bottleneck is gonna be.
The hooks aren't there yet in the Nvidia tool sets, the CUDA tool sets, in a lot of places, to be able to get what we need for the data.
We're figuring our way around that.
They tend to be much larger, so the time taken to move them is very expensive.

(26:28):
It might be 20 minutes to be able to move a job from one to the other, 'cause it took 20 minutes to get it started up in the first place, just due to the size of the models and how much data you have to replicate.
We're starting to put some POC work into the GPU side of things, while we're continuing on full steam with building out the expansion of the feature set for the CPU and memory based workloads.
Alright.

(26:48):
Dan, I was curious what you've seen on the implementation side of this.
When we talk about the need to live migrate a pod, whether it's for maintenance, then it almost feels like the next level is the idea of spot instances. I love that idea of my infrastructure dynamically failing and my applications can handle it.
Is there a maturity level where you see people start out? It's hard

(27:11):
for me to imagine, like, on day one someone's like, yeah, let's just put it all on spot instances and YOLO it.
We don't care.
It's all good.
Live migration will solve it all.
Because obviously there are physics limits to the amount of data we can transmit over the wire.
I'm imagining this scenario where you're accrediting certain workloads, like, this replica set is good for spot because it's low data;

(27:33):
we don't need to transfer a hundred gigs of data during a two minute outage, or a two minute notice of outage.
Do you see that as a maturity scale, where you have to...
Yeah.
I mean, it is absolutely a maturity scale, right?
Kind of going back to the references we talked about, the early days of VMware: nobody started doing vMotion in production. Everyone started with, oh, we've got this five second interruption, development and test

(27:55):
boxes can handle that all day long.
So it's the same concept, really, that we're living in now.
We're going through that same evolution.
I agree with Phil.
I think we're doing it much faster than VMware did in 2002, 2003.
I was around when that happened as well, so I remember racking and stacking all those boxes.
But yeah, it's very much the same thing.
Container live migration is brand new; we've just been GA for a month with it.

(28:17):
So we've had conversations at trade shows and with customers, and there's a lot of excitement around it.
I think we're still trying to figure out where it fits, what the exact workloads are that it makes the most sense to do this in.
And yeah, I think it's going to be a process of adoption.
There's definitely a lot of use cases.
I think spot is a very interesting use case, especially with the large data models

(28:37):
and things that we're processing today.
I'm working with a customer now that's doing a lot of video processing in Kubernetes, and that's a very, you know, CPU and memory intensive job.
I mean, we're talking a cluster that scales up to 6,500 CPUs while they're processing.
We're really trying to figure out where it makes the most sense to apply this type of technology.

(28:58):
No one wants to have that kind of dynamic scale and then have to pay for reserved instances for all of that, like, worst case scenario.
That sounds like a billing nightmare.
And you don't want a job that runs for, you know, hours, that costs you tons of money, to fail 80% through and have to restart it.
I mean, that's just not efficient.
So, yeah, I think the ability to really move this and allow those

(29:20):
workloads to finish is gonna be huge for the market.
Alright, so we have been talking a lot about the problem and some of the solution.
We do have some slides that give visualizations for those on YouTube.
This will turn into a podcast, so audio listeners, we will give you the alt text version of it while we're talking about it.
But, Philip, what exactly is happening in the process of a live migration? How does it kick off?

(29:42):
What's really going on in the background when it starts?
Absolutely.
And we do have some better demos in other places, I think on the website.
Basically, what we have is a live migration controller looking across all the workloads and nodes that are live migration enabled.
You don't necessarily have to turn this on for everything.
You've got all your stateless workloads; you don't need to live migrate stateless workloads, just treat them as normal.

(30:03):
You've got your stateful workloads that you do want to use this for, so you could set up a specific, you know, node group for that.
That's gonna allow you to select what you actually want to do live migration for.
You could use it for everything, but it just eats up more network bandwidth if you're using it for the stuff that already tolerates being moved.
That controller's gonna be looking for different signals within the cluster of when something needs to be live migrated.

(30:25):
Spot instance interruption is a good one.
Being able to do bin packing: evicting a node from the cluster because it's underutilized, and then migrating those workloads to another node in the cluster.
What we call rebalancing: basically, rebuilding the cluster with a new set of nodes.
And that could be because you're doing a node upgrade, you're doing a Kubernetes upgrade, you're doing an OS upgrade, or you're just trying to

(30:45):
get a more efficient set of nodes.
All of those are good reasons that you would want to do your live migration.
So what's gonna happen in that process is the two daemon sets, on the source node and the destination node, are gonna start talking to each other.
They're going to look at the pods on the source node and start synchronizing them over to the destination node.
So behind the scenes, all of that memory is being copied over,

(31:08):
any disk state is being copied over, any TCP/IP connection statuses are being copied over, and you're doing all that prep work behind the scenes.
If you have ephemeral storage on the node, that'll start getting copied.
Obviously, how long it takes is gonna depend on how much there is.
Once the two nodes have identical copies of the data, that's when

(31:29):
the live migration controller will say it's time to cut over.
It will cut the connections from one, pause it and put it into a paused state in containerd, then it will unpause on the new node.
It'll come up with a new name.
Right now we call 'em clone one, clone two.
We just add clone to 'em, so you can tell which was the before and which was the after.
When that clone one unpauses, traffic will be going to it.

(31:51):
It'll have the same exact IP address that it had while it was on the previous node.
All the traffic continues on to that node.
It picks up exactly where it left off, and the old pod disappears, right?
The old pod gets shut down and torn down.
If you have something like a PVC attached, so if you've got an EBS PVC attached, there is a longer pause, because you have to do a detach and reattach.

(32:13):
With the API calls, it works.
It just takes a little bit longer for that pause state.
That's the downfall of having to work with APIs: it takes time to do an unbind and rebind to the new node.
But it works today.
If you're using NFS, where you can do a multi-attach, then it's instant; it doesn't actually add any delay.
It's just that NFS is a slower storage technology.
So does that sort of make sense from a high level?
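Pulling those steps together, here's a compilable sketch of that ordering in Go. Every type, name, and number here is a stand-in invented for illustration; Cast AI's actual controller and containerd plugin are not public APIs.

```go
// All types and method bodies here are illustrative stand-ins.
package main

import "fmt"

type state struct{ deltaGB float64 }

type node struct{ name string }

// stagePausedClone creates the destination pod paused: same IP, new name
// with a clone suffix, image pulled, but not yet routable.
func (n node) stagePausedClone(pod string) string {
	clone := pod + "-clone-1"
	fmt.Printf("[%s] staged paused clone %s\n", n.name, clone)
	return clone
}

// syncDelta copies what changed since the last pass (memory, ephemeral
// disk, TCP state) and returns the new, smaller remaining delta.
func (n node) syncDelta(s state) state {
	fmt.Printf("[%s] synced %.2f GB delta while the pod keeps running\n", n.name, s.deltaGB)
	return state{deltaGB: s.deltaGB / 8} // toy convergence
}

func main() {
	src, dst := node{"node-a"}, node{"node-b"}
	clone := dst.stagePausedClone("orders-api-0")

	// Pre-copy until the remaining delta fits the pause budget.
	s := state{deltaGB: 32}
	for s.deltaGB > 0.5 {
		s = src.syncDelta(s)
	}

	// Cut over: pause the source in containerd, ship the final delta,
	// queue inbound traffic, unpause the clone (it keeps the original
	// pod IP), replay the held packets, then tear down the old pod.
	fmt.Println("pause source pod in containerd")
	fmt.Printf("ship final %.2f GB delta; hold inbound traffic\n", s.deltaGB)
	fmt.Printf("unpause %s with the original IP; replay held packets\n", clone)
	fmt.Println("tear down source pod (detach/reattach PVC here if one is bound)")
}
```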

(32:35):
Yeah.
When we talk about Cast AI as a solution, does it do live migrations based on certain criteria?
Is it making decisions around, if you say I want to bin pack all the time, is it just doing live migrations on your behalf in the background?
Or is this something where you're largely doing it with humans clicking buttons and controlling the chaos?

(32:56):
No, this goes back to what we had talked about at the beginning, where automation is key.
When a node is underutilized, our bin packer, which is probably the most sophisticated on the market, analyzes and runs live tests on every node in the cluster of whether that node can be deleted, whether that node doesn't need to be there anymore.
And it'll simulate all the pods being redistributed throughout the cluster.

(33:19):
If the answer is, we don't need this node, it will automatically kick off a live migration of all the pods on that node. Once it's empty, it'll just get garbage collected.
Once it's gone, all your pods are running on the new nodes.
Everything's moved seamlessly.
You haven't seen any interruption.
The cluster keeps continuing as normal.
Most of our customers do scheduled rebalances, so those just run in the background.

(33:40):
It's evaluating how efficiently designed the nodes in the cluster are.
If the nodes in the cluster are not as efficient as they could be, and different shapes and sizes would be better for that setup at that point in the day, based on the mixture of workloads there, it'll do a blue-green deployment: set up new nodes, live migrate the workloads to those new nodes, and tear down the old ones.

(34:02):
So everything that we're talking about here can either be scheduled or it's automatic.
It's running every few minutes on a cycle.
But yeah, it's entirely seamless to the users.
Nice.
So in the technical details, we're moving the IP address. I think you had a diagram showing the pod on the nodes. When we get down to the nitty gritty of Kubernetes level stuff: the pod is recreated, but pod names have to be unique, and you

(34:26):
can't have the IP on both nodes at once.
And then there's the difference between TCP and UDP and other, you know, IP protocols, and there's a lot of little devils in the details that I'm super interested in.
We won't have time to go into all of it, but I do remember you showed the replica, the pod that you're creating. Step one is we create a pod and download an image, right?

(34:47):
This is all still going through containerd.
So there's not, like, voodoo magic happening in the background outside of the purview of containerd.
Maybe you can talk about that for a minute.
Right, exactly.
So by changing the pod name, you now have a placeholder for your new pod information to go into.
And it does maintain the same IP address when it moves over from one to the other.
So to your point, that's when that switch has to kick in, where the

(35:11):
old pod definition disappears and the new pod definition appears in your control plane, with the API calls up to the kube API.
That cutover is extremely important, because you can't have the same pod living in the same place twice.
That's why we do have to change the name when we switch it over.
There's certain services that cause some tricks, because they have an

(35:31):
operator structure where they expect there to be a certain pod name.
So when you move it and add the clone suffix to it... we're working on finding workarounds to that in certain areas.
That is a little bit tricky on certain workloads, because you can't have the same pod existing with the same name in two different spots.
They have to be unique.
But yeah, definitely, the IP is the same, but the pod name is gonna have a clone dash

(35:52):
one or something like that on it.
Yep.
Yeah, so it starts with your pod, and then there's an event that happens outside of the pod that is talking into containerd, and you're...
It's a second pod.
Correct.
It's adding that placeholder.
And because the placeholder pod is actually still in a paused state, it can have the same IP address.

(36:13):
It's not actually routing traffic to it, 'cause it's not an active pod yet.
It'll be in a staged state.
So you stage it up with all the information, but it has to be named differently.
And then when you do the cutover, that's when you switch it from being inactive to active, and switch the old pod to be inactive.
And that's the final stage, when the clone pod becomes the primary.
And because it's maintained the exact IP address within the

(36:36):
system, it's not losing any traffic.
So the networking system within Kubernetes routes it to the new node, and the routing tables are updated, and the pod goes to the new destination.
Yeah, that does sound like the hard part: the old pod is shut down so the IP can be released.
I assume the IP can't be taken over while that old pod is still active.

(36:57):
It's one of these things where I understand it at the theory level, but I have no idea how containerd and kube-proxy and all these different things that are binding to a virtual interface work together, and the order of the things that have to happen in the exact right sequence in order for you to first assign that IP to the new node and then also replay all the packets.

(37:19):
It does seem like a very discrete order of things that have to happen.
It has to go in a certain order, or they all just fail, it feels like.
So that's the part that took us about a year to figure out.
There had been a lot of studies and some research papers around the moving of memory and the snapshotting of different workloads.
That part was a little bit more straightforward, because it was really out of the vMotion playbook days, from early on.

(37:41):
There were also some college studies around using CRIU to replicate and migrate containers.
None of them had been able to solve the IP side of things, the connectivity side of things.
That's what Cast AI was able to solve for.
And it took a lot of research, took a lot of in-depth work.
We started on this early in 2024, with a team of about five engineers:

(38:02):
deep kernel level Linux engineers, Kubernetes engineers, people very familiar with the code; they've contributed to the Kubernetes open source project.
It was 10 months before we had a demo, and that was using Calico.
Before we could demo, we had to have a custom AMI at that point in time in AWS, because everything was kernel level, at the AMI level.

(38:22):
We knew that was not feasible going forward to production, but that was the first demoable version.
Like anything else, there's a lot of warts and vaporware in the first version.
Since then we were able to move the logic up to a containerd plugin, which makes it a lot more portable.
Now it can be applied to different clouds.
It's much less invasive.
You don't need a specific AMI under the hood anymore, and we were able to move it to the AWS VPC CNI,

(38:47):
so you don't need the custom Calico CNI.
All of those were iterative steps to build this and make it more production viable and adoptable by the industry.
Now it's a matter of, we've got kind of two forks going on.
One is continuing to build out additional platforms: figuring out Cilium, figuring out the Azure CNI.
The other is performance tuning the existing migrations: reducing time to

(39:09):
migrate, being able to reduce the size of the deltas down further and further, so we can migrate faster and faster.
So we've got those two tracks right now.
The team's up to, I think, 10 or 12 engineers working on those two paths, and this is probably one of our most heavily invested areas in the company, being able to further this technology, 'cause we see how much value it brings.

(39:29):
Yeah, I imagine it won't be very long... you know, this technology is pretty advanced, but others will probably eventually attempt it, if it's truly the thing that we're all looking for.
And it sounds like it is; it feels like the kind of tooling where it's a hard problem to solve, and we'll maybe see other people attempt to do it.
I mean, the research I had to do for the show,

(39:50):
'cause I was very curious.
I was like, what's the history of all this?
And someone mentioned CRIU, which I believe you're using at least some of.
That's a project that's been around for quite some time, over a decade.
And it's not a new idea, but like a lot of these other technologies, the devil's in the details.
We never really had an ability to capture and understand what a binary's

(40:10):
true dependencies were, whether it's disk or networking things, until we had containers.
You mentioned on here, it's in LXC, it's in Docker, it's in Podman; this tool is actually used widely.
It's just maybe not well known to us end users, because it's packaged as a part of other tooling.
And I can sort of see a world where,
if this becomes more widespread, you're gonna end up with haves and have nots,

(40:34):
where my solution doesn't have live migration, or my solution does have live migration.
At some point, maybe it's ubiquitous, and you're building functionality on top of it, like your automation that truly adds value around spot instances, where my company maybe has never done spot instances because it was too risky for us and we didn't have the tooling to take advantage of it without risking downtime.
I definitely have a couple of clients that I've worked with over the last

(40:56):
couple of years that are like that, where they're a hundred percent reserved instances, because they want the cheapest, but they also need to guarantee uptime.
And they can't do that at a level that live migrations would provide.
So they have to pay that extra surcharge for avoiding ephemeral instances and stuff like that.
To me, it gives me comfort that the technology stack is part open source, part community driven.

(41:18):
There's also the product and private IP side of this as well, but it's not like you're reinventing the Linux kernel.
It wouldn't have been that long ago where you had to actually throw in a kernel module that would only work on certain operating system distributions of Linux, and you would have to deploy a custom ISO. That wasn't that far in the past.
But now that we've got all these modern things, I don't know if eBPF is involved

(41:41):
in this at all, but we've got more modern abstractions, and it feels like you can just plug and play as long as you've got the right networking components.
From an engineering perspective, that's pretty awesome, because it allows you to build stuff on a stack like this.
The team's not on the call, but to the team that's developing this: good job, bravo. That's some great engineering.
Obviously, anytime something is a year long effort to crack a nut like this... I

(42:02):
feel sad for the people that had a six month... like, no one's gonna see this feature for a year, and I'm gonna work all year on it, and I hope someone likes it.
So from a software development perspective, that's the hard part.
That's the true engineering.
Yeah, absolutely.
We could talk about this forever, but people have their jobs to do.
Okay.
How do people get started? Do they just go sign up for Cast AI, and is this

(42:22):
a feature out of the box that they can implement in their clusters?
Yep, absolutely.
It's in the UI now, so if people want to sign up and onboard, we do recommend having somebody on our sales engineering team work with folks.
So reach out to us; we'll also reach out when people sign up.
It's all straightforward.
There's no caveats.
It's Helm charts to do the install, and then you set up the autoscaler.

(42:42):
We will be adding support for Karpenter.
Today it's using our autoscaler, but we will support Karpenter around the end of Q4 or early Q1.
Yeah, that's great.
My usual co-host is with AWS, so they would greatly appreciate that.
I know that Karpenter's been out a little over a year.
We've had a surprising number of people on our Discord server.
For those of you watching, there's a Discord server you can join.

(43:06):
There's a lot of people on Discord talking about using Karpenter.
I'm really impressed with the uptake on that project.
And in case you're wondering what Karpenter is: it's with a K, and it's for Kubernetes. You can look on this YouTube channel later, because we did a show on it and had people talking about it on the show and demoing it whenever it was released.
I think that was 2024.
I can't remember exactly.
Alright, so everyone knows how to get started.

(43:27):
Everyone now knows that they wish they had live migrations, and they currently don't, unless they're a Cast AI customer.
Where can we find you on the internet?
Where can people learn more about what you're doing?
Are you gonna be at conferences soon?
I'm assuming Cast is probably gonna have a booth at KubeCon again.
They always seem to have a booth there.
We've got a big booth at KubeCon this year.
I think we've got a 20 by 20.
We're gonna be doing demos and presentations in the booth.

(43:48):
This is gonna be a big part of that.
We'll also be at re:Invent in Vegas in early December; I guess it's the first week of December.
So I'll be at both of those events.
I'm also really active on LinkedIn, so if anybody wants to reach out to me on LinkedIn, or if you wanna set up a session just to go into more detail, feel free to ping me.
I post a lot of Kubernetes content in general: best practices, things that we see in the industry from a Kubernetes evolution side of things,

(44:10):
and also, obviously, a bunch of Cast stuff.
So, you know, feel free to follow or connect. Happy to share more information.
Awesome.
Well, I'm looking forward to hearing about that continual proliferation of all things live migration on every possible setup.
Someday it'll be on FreeBSD with some esoteric Kubernetes variant.
It's pretty cool to see the evolution of this.
Well, thank you both for being here, Philip and Dan.

(44:31):
See you all later.
Ciao.
Thanks for watching, and I'll see you in the next episode.