Durable Execution for Real‑World Failures with Temporal’s Cornelia Davis

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
I'm Cory O'Daniel. This week, Cornelia Davis from Temporal.IO joins
me to talk about durable execution, failure handling, retries,
entity workflows, and why platform engineering may have been
a distributed systems problem from the start. Let's get into it.
My guest today has been working in the platform space since
before we had a word for it. She spent seven years at Pivotal

(00:21):
as the VP of Technology, where she helped shape Cloud Foundry, a
developer platform that in two thousand thirteen was already doing
things we consider table stakes - container-based abstractions,
separation between Platform and App Teams, and Golden Paths.
Now a principal technologist at Temporal, she's also the author
of "Cloud Native Patterns" and she's spent more than three decades
helping developers build resilient distributed systems. Cornelia
Davis, welcome to the platform engineering pod.

(00:43):
Oh, it's so great to be here. Thank you so much, Cory.
I was very excited for today's show. I loved Pivotal. I loved everything
Pivotal did. So you have absolutely influenced a ton of my
work. So very excited to have you on today. You've been front row
to so much of platform engineering from Cloud Foundry, GitOps,

(01:04):
Kubernetes, and now you're working on durable execution at Temporal.
When I think about our space and DevOps and kind of the entire
gamut of how our industry's changed over the past twenty-six
years or so, I feel like there's so many problems that still
persist for different teams. They're just stuck at different places.

(01:24):
But what are the things that you still consistently see teams
getting stuck on?
One of the things that has happened in the platform space that has
been such a great thing is that we went from this system
administration mindset with ClickOps into a developer mindset. So
that's a boon. That is like, this is awesome, we are now programming
our systems instead of clicking on

(01:56):
UIs. We are treating it... what I still see is that there are software
engineering patterns and solutions that we've applied over in the
application space that we still haven't brought over into the
platform space. So for example, you just mentioned durable
execution. We're going to talk a lot

(02:25):
about that. It's just this very real awareness and realization that
what we do in the platform space is a distributed systems problem.
Everything that we're orchestrating is a distributed systems
problem. We are orchestrating things up in AWS, we're
orchestrating... the knowledge that we have around distributed
systems that we've applied in the application space, I haven't seen
us apply that a whole lot in the platform space yet...

(02:57):
Yeah, yeah, and that one's exciting to me too because I'm an
Erlang developer. I love myself a good distributed system.
But like, what would you say are like the core principles? Because
I feel like at the same time it can be hard because a lot of the
people that you see in the platform space are coming from the
Ops side, right? They have experience writing Terraform, they

(03:17):
have experience writing Bash, but they may not be like a formal
"application developer". I'm throwing air quotes on there for
folks listening to the pod, right? And so they may not have worked
in building distributed systems, they might have not worked
on building the checkout API. The kind of things that application
developers do day in, day out. And many teams, when they're getting
started in their platform journey, it's that Ops team, it's

(03:40):
that DevOps team that has kind of taken those first steps. And so
like, what are some of the things that you see that they're
most often missing from distributed systems that would be
the best boons for them?
So I think that one of the things is that a big part of...
I mean it's said that sixty to eighty percent of the code that is
built for applications and distributed systems is failure handling

(04:04):
code. And that is something that I don't see us doing on the
platform side as much because what we do is we build these automations.
We even build like... I love the transition. I, like you... I'm not
Erlang, but I'm a functional programmer at heart and I love declarative
systems. That's why Cloud Foundry. And Cloud Foundry was before

(04:28):
Kubernetes, so it was one of these... it was maybe around the
same time that Terraform was coming on the scene... and so this
is my love letter to Terraform. Terraform did something
amazing. It was this declarative system. It was like,
"Okay, we're going to go out and take a look at what the current
state is. We're going to compare the state and then we're
going to update it." Love all of that, but things break. And so

(04:51):
this system where we're making calls to AWS and Cloudflare and some
APIs in our own internal data centers, those things break. And
so what we end up is... we end up with systems that are inconsistent,
we end up with orphans, all of those types of things. And then we
try to apply, like, if you will, maybe brute force hammers to

(05:13):
trying to fix those things. "Oh, I'll have a cron job that goes
and looks for orphans every once in a while," that type of thing.
But the reality is that there's patterns that could keep
you from getting into a state where you've got orphans. I think
that we tend to be reactive. And so it's largely around that failure
handling is that... yeah, we can put together orchestrations,

(05:36):
we can do declarative configurations, but then realizing
that to stitch it all together... and by the way, in the
platform space, nothing runs in a half a second, in five hundred
milliseconds. Everything runs for not just minutes, but hours,
days, weeks and months. And so the longer that something runs, the

(05:57):
more likely something's going to fail. And so it's that failure
handling, that implicit inherent failure handling that has
to be there for everything. And it's hard, by the way, because
you don't want to spend sixty to eighty percent of your cycles
working on making the system as durable as you can, because you've

(06:18):
got a whole team of developers in your organization that are like,
"When are you going to give me the next feature I'm waiting for?
I need a database." And you're like, "Well, that's a simple request." Right?
But even like Redis... I don't know why, every time I start up a
Redis on any cloud, it's like it's memory on a wire but it takes
an hour and a half to boot up the Redis cluster. It's just like,

(06:39):
"What?" It's like things will go wrong, like subnets will disappear,
EIPs will disappear. Like there's... but it is funny because
one of these things, because again, like, I feel like for many
developers that are used to having most of their infrastructure
set up for them, they may not realize the nuance in getting all
this stuff together. To them getting a database is, "Why isn't

(07:01):
it as easy as docker pull postgres? That's how it worked for
me locally." Most of what we do in the cloud, you said it, it's
not twenty seconds, it is an hour and a half. And it's like, "Oh,
you're out of quota." Oh, shit, that's not even a real failure.
I mean, it is a real failure, but you know what I'm saying? It's
like, that's one we could easily get around. So we mentioned

(07:22):
durable execution. How would you define that for folks that aren't
familiar with it?
Yeah. So durable execution is a term. It's gaining a little bit
of popularity. But the way that I would describe it, because
most of your listeners probably aren't super familiar with
it, is that it's... and I'm a nerd, so I'm going to explain it

(07:43):
from a nerdy perspective... it's a programming model. It's a
programming model that allows you to write your code as if those
failures didn't exist. You see the potential huge value here in
the platform space is that the platform space, because you say they
come from a systems administration background often,

(08:06):
and they have gone through tremendous pains to clean up after
failures. They do program as if failures don't exist, and then
they compensate for that. Well, what this is, is it's a programming
model that allows you to program as if process boundaries
don't exist, as if failures don't exist. And the system, the

(08:31):
durable execution platform will actually put all those compensations
in place for you. So for example, it will do automatic retries
for you. And by the way, those retries... now I know you're probably
thinking, "Retries? Well, everybody does retries." The thing
about durable execution is that the retries themselves are durable.

(08:53):
And what I mean by that is that even if you're in the middle
of retrying something, you've got something rate limited, or
you've got some network issue and you're in the middle of retrying
and then some other failure happens and your retry logic itself
is in process and goes away. Oopsie. But by durable retry, I mean

(09:16):
that if the system that's doing the retrying, so you're at
step three and you're doing the retrying, and now your orchestration
itself goes down. When your orchestration comes back up, we remember
exactly that you were on retry number three and we continue on with
that. So durable retries, for example, is one of the examples...

(09:37):
or state management, like managing state... all of that stuff
is what we call durability. So you get to program as if failures
don't exist. But the runtime behaviors... and by the way, a lot
of that is distributed systems, and we'll dig into that,
I'm sure, in just a moment... the runtime behaviors make the system
resilient to those failures. It's not that the failures don't

(10:00):
exist. It's like a waterproof watch. Doesn't mean it won't get
wet, it means that if it gets wet, it'll...
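The "durable retry" idea described above can be pictured with a small sketch. This is plain Python, not Temporal's API: the `DurableRetrier` class, the JSON state file, and the `OrchestratorCrash` exception are all hypothetical names made up for illustration. It only assumes that the attempt counter is written somewhere that outlives the process, so a restarted orchestrator continues counting rather than starting over.

```python
import json
import os
import tempfile


class DurableRetrier:
    """Hypothetical helper: retries a step and persists the attempt count to disk."""

    def __init__(self, state_path, max_attempts=5):
        self.state_path = state_path
        self.max_attempts = max_attempts

    def _load_attempt(self):
        # Resume from the recorded attempt if an earlier process left state behind.
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                return json.load(f)["attempt"]
        return 0

    def _save_attempt(self, attempt):
        with open(self.state_path, "w") as f:
            json.dump({"attempt": attempt}, f)

    def run(self, step):
        attempt = self._load_attempt()
        while attempt < self.max_attempts:
            attempt += 1
            self._save_attempt(attempt)  # record first, so a crash keeps the count
            try:
                result = step(attempt)
            except ConnectionError:
                continue  # transient failure: try again
            os.remove(self.state_path)  # finished: nothing left to remember
            return result
        raise RuntimeError(f"gave up after {attempt} attempts")


class OrchestratorCrash(Exception):
    """Stands in for the orchestrating process dying in the middle of retrying."""


seen = []


def flaky_step(attempt):
    seen.append(attempt)
    if attempt < 3:
        raise ConnectionError("rate limited")
    if attempt == 3:
        raise OrchestratorCrash()
    return "provisioned"


state_path = os.path.join(tempfile.mkdtemp(), "retry_state.json")
try:
    DurableRetrier(state_path).run(flaky_step)
except OrchestratorCrash:
    pass  # the first "process" is gone; only the state file survives

# A fresh retrier, standing in for the restarted orchestrator, picks up the count.
result = DurableRetrier(state_path).run(flaky_step)
```

In this sketch the second `DurableRetrier` never repeats attempts one through three, which is the property being described. A real durable execution platform persists far more than a counter.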
...example. Yeah, this watch is definitely going to get wet... it's
waterproof. So let me ask a question here. So as a developer, whether
on a platform team or app developer working with durable execution,
how much... what I may consider my app today versus does it start to
look a bit more like a series of Lambdas? So does this look like a
bunch of series of smaller functions that are getting executed in a
graph or a DAG or something like that? Or do I...

(10:46):
From a theoretical perspective, durable execution does not imply a
graph model or imply not a graph model. You can do durability with
either of those programming models. The one that we do... and I
work, as you said in the intro, I work for Temporal, which by the
way, is an open source, hundred percent open source - it's not open
core, it's

(11:16):
not like you can run some stuff on the open source, it's 100%
open source. That's what I'm going to be talking about today is
the open source, super cool technology... Our approach there

(11:40):
is that we don't want to introduce a DAG, we don't want to
introduce a DSL, that you have to program a different programming
model. We believe that the most natural programming models are
the ones that you're already familiar with. They are the languages

(12:01):
that everybody uses. It's Python, it is TypeScript, it is Java,
it is .NET. We support seven different languages. There's even
a Swift SDK. Whatever programming models that you're familiar
with, you can continue to use those. Getting back to your question,

(12:22):
what the code looks like is... Now,
there's two fundamental abstractions that we use. One is
called an activity. And an activity is basically a unit of work,
but it's a unit of work where there's a possibility of failure.

(12:43):
Then the other abstraction is what we call a workflow, which stitches
together all of those units of work in the flow that you want. The
activity... decorators on those, which tells our SDK, "Oh, hang on,
pay attention to this." When this function is... work

(13:04):
where there could potentially be failure. You program that. You
basically say, "Okay, here's my..." basically activities, or decorators
on... basically acts as, if you will, a bit of a proxy around that
function. It turns it into a distributed system, which maybe we
should talk about next.
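The two abstractions described above, an activity as a decorated unit of work and a workflow that stitches activities together, can be sketched roughly as follows. This is an illustration in plain Python and not the Temporal SDK: the `activity` decorator, `ACTIVITY_REGISTRY`, and `provision_environment` are hypothetical names, and the proxy here only retries in-process rather than going through a task queue.

```python
import functools

# Hypothetical registry, standing in for whatever an SDK would track.
ACTIVITY_REGISTRY = {}


def activity(max_attempts=3):
    """Mark a function as a unit of work that may fail, wrapped in a retrying proxy."""

    def decorate(func):
        @functools.wraps(func)
        def proxy(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as err:  # a real policy would be more selective
                    last_error = err
            raise last_error

        ACTIVITY_REGISTRY[func.__name__] = proxy
        return proxy

    return decorate


calls = {"network": 0}


@activity(max_attempts=3)
def create_network():
    calls["network"] += 1
    if calls["network"] == 1:
        raise TimeoutError("cloud API timed out")
    return "net-1"


@activity()
def create_database(network_id):
    return f"db-in-{network_id}"


def provision_environment():
    """The 'workflow': ordinary code that calls activities in the order you want."""
    network_id = create_network()
    return create_database(network_id)


outcome = provision_environment()
```

The workflow function contains no error handling of its own. The failure on the first network call is absorbed by the proxy that the decorator put around the function.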

(13:24):
I like that, I like that it's annotations. I feel like... coming
from an Erlang background, there's this philosophy called Let
It Crash. Have you heard the Let It Crash philosophy?
Yep.
For folks that aren't familiar with
it, the idea is you write code for a happy path and you let processes
fail fast for adherence. Right? And there's this supervision
system in Erlang that... you have to program it, you don't get
(13:48):
it for free, you have to go do the error handling work. And it will
restart the system to a known good state. It will bring things
back. It's one of those things that I feel like it's so novel and
such a beautiful idea and it is absolutely so easy to screw up.
It is hard to let things crash. It is actually very... I mean,

(14:10):
it's very easy to let things crash, but the actual recovery of
it does require... it requires a lot of adhering to. And understanding
what is a good crash versus a bad crash. Like a user putting in
extremely bizarre data - that is a good crash. You don't want the
entire system to fall apart, you want to tell the user they did

(14:31):
something wrong. But... is hard to reason about. And the idea that
you can just decorate code that you know how to think about in the
good way and then have... common cases. So with those decorators,
is it you're calling out what the failure modes are for each one
that you know as the author of the code? Or is it

(14:52):
like kind of a magical decorator where it's like, "Oh, I will
probe Python and..."?
It's more the former. Basically, you put a decorator on
the function and you say, "This is a function. This is what
we call an activity." And then you can set your retry policies so

(15:13):
you can do things like decide whether you want to do exponential
backoffs. You can also, as a part of that, identify which failures
are retryable and which ones are not. So for example, if you are
making a call out, I think you just had a very similar example,
you're making a call out to AWS to provision something and you

(15:35):
know what, the credentials that you're using to try to provision
that are failing. You're not going to retry that because the credentials
are still going to be failing, right? So that is what we call a
non-retryable error. And you as a developer basically get to decide
for this function which types of failures are application failures,

(15:55):
i.e. we're just going to continue processing those as application
failures, and then everything else we'll just assume is a retryable
error. And so you basically get to program those policies. And
of course you can program timeouts as well, because in distributed
systems timeouts are a major thing that you need to deal with.
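As a rough illustration of the policy being described, exponential backoff plus a list of failures that should never be retried, here is a minimal sketch in plain Python. `SketchRetryPolicy` and `InvalidCredentialsError` are hypothetical names and the default numbers are arbitrary; this is not Temporal's own retry policy class, and timeouts are left out to keep it short.

```python
from dataclasses import dataclass


class InvalidCredentialsError(Exception):
    """Hypothetical error: retrying cannot help until the credentials change."""


@dataclass(frozen=True)
class SketchRetryPolicy:
    initial_interval: float = 1.0     # seconds to wait before the first retry
    backoff_coefficient: float = 2.0  # multiplier applied after each failure
    maximum_interval: float = 60.0    # cap on the wait between attempts
    maximum_attempts: int = 5
    non_retryable: tuple = (InvalidCredentialsError,)

    def delay_before_retry(self, failed_attempts):
        """Wait time after `failed_attempts` failures (1-based), capped at the maximum."""
        delay = self.initial_interval * self.backoff_coefficient ** (failed_attempts - 1)
        return min(delay, self.maximum_interval)

    def should_retry(self, error, failed_attempts):
        # Listed error types are treated as application failures and never retried.
        if isinstance(error, self.non_retryable):
            return False
        # Everything else is assumed retryable until the attempts run out.
        return failed_attempts < self.maximum_attempts


policy = SketchRetryPolicy()
delays = [policy.delay_before_retry(n) for n in range(1, 9)]
```

The default of treating unknown errors as retryable follows the description above, where only the failures a developer names are handled as application failures.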
Oh yeah. Oh yeah. So for something like... if you're wrapping

(16:17):
some sort of cloud IaC tool - like Pulumi or Terraform or Helm
or Ansible, whatever - and going to the AWS case around the
credentials, it's like I can see a handful of different error
modes. The credentials are just wrong - that one's just completely
not retryable. There's the quota failure. And then there's the

(16:40):
IAM failure where like half the build or half the provision worked,
but when it got over to making a subnet, the role that you have
doesn't have the ability to make subnets. You can make this and
you can make that, but not a subnet. So in scenarios like that
where it's like... it all ties back to kind of authorization, like
sort of, I guess, right? So how do you decorate for that scenario?

(17:05):
Where it's like it could be the inputs to it... like the same
function... the inputs to it, this credential coming in could result
in like one of three error modes. How do you signal to the,
I guess the durable execution engine, which one of those error
modes it fell into? Because the quota one, it's not retryable
but it also is? You know what I mean. Or like the IAM role - I
forgot to put the IAM permission on there, I want the same
execution, I want the same role to come through again, I just need
to make sure I give this role the ability to create that resource
type.

(17:39):
Yeah. The first thing is that the durable execution layer doesn't
actually know the details of your application logic. So you just
described a really nuanced application specific thing. The way
that I've created my unit of work is it bundles up a number of

(17:59):
different things... which by the way I'm going to go off on a
little bit of a tangent here. One of the things that we've seen
is obviously in the platform space, everybody uses Terraform.
We've seen customers that are using Terraform and they're creating
these Terraform configurations that are very composite. And it's
like, "Do this, do this, do this and do this." And that's part

(18:19):
of the reason why you end up in this very complex nuanced scenario
that you described. Which is like, "Okay, this composite object, part
of it worked, part of it didn't. How do we deal with all of
that?" One of the things that we're seeing is that people are starting
to, because they have durable execution, they're able to break
up their Terraform resources into smaller units, because the reason

(18:41):
that they had them in a composite unit was so that they wouldn't
have to deal with the potential failures between those
different components. But when you have a different solution like
durable execution, that's orchestrating those lower level units,
now you can break your monolithic Terraform configurations
into smaller pieces and use durable execution for that. So it

(19:02):
kind of simplifies the scenario that you talked about because
part of the nuance of what you described was because you had a multitude
of different things. It wasn't just one resource. It was like, "Okay,
I got through part of it, but I didn't get through the rest of
it because of this nuanced error." So back to your original
question - "How do you deal with that?" Well, you would have

(19:22):
to still deal with that. That would still have to be part of the
return codes of your activity. And then you would have to create,
of course, a mapping... and so you would decide like, I'm going
to return things out of here that indicate retryable versus
non-retryable. And that would dovetail with your policy. I want

(19:46):
to make one other comment though, because I love your quota
example. Because what sometimes people would have naturally
thought of that... and you're already picking up on this really
interesting thing, which is that quotas, you might say, "Well,
that's not retryable, because I'm just going to go back and ask
and the quota is still going to be not satisfied," or "It's going

(20:11):
to take me... I have to file a request to up my quota, so I need
to stop retrying this until my quota goes... like, how do I deal
with all that? I don't want to retry it every five seconds because
now I've got a human in the loop process to increase my quota."
The interesting thing is when you start working with durable execution,

(20:33):
you start thinking a little bit harder about retryable versus
not. I would suggest that the quota error is a retryable error
because you can actually side effect the system, where now this
is going to continue when the quota is updated. We'll get a little
bit further into the durable execution... I'll come back to this

(20:54):
example when we get a little further into the technology.
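The mapping mentioned above, from what an activity returns to "retryable versus non-retryable", might look something like the sketch below. The error codes and the `Disposition` categories are hypothetical, chosen to mirror the three cases from the question (bad credentials, quota, missing IAM permission). Treating the missing-permission case as "wait for a human" is an assumption made for this example, not something stated in the conversation.

```python
from enum import Enum


class Disposition(Enum):
    RETRY = "retry"          # transient: let the retry policy handle it
    WAIT_FOR_HUMAN = "wait"  # retryable, but only after someone changes something
    FAIL = "fail"            # non-retryable: surface as an application failure


# Hypothetical error codes an activity might get back from a provisioning call.
ERROR_DISPOSITIONS = {
    "Throttled": Disposition.RETRY,
    "QuotaExceeded": Disposition.WAIT_FOR_HUMAN,
    "MissingPermission": Disposition.WAIT_FOR_HUMAN,  # assumption, see lead-in
    "InvalidCredentials": Disposition.FAIL,
}


def classify(error_code):
    """Unknown codes default to RETRY: anything not called out is assumed retryable."""
    return ERROR_DISPOSITIONS.get(error_code, Disposition.RETRY)


quota_case = classify("QuotaExceeded")
credentials_case = classify("InvalidCredentials")
unknown_case = classify("SomethingNew")
```

Keeping the table next to the activity keeps the application-specific knowledge in application code, since the durable execution layer does not know those details.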
Ops teams, you're probably used to doing all the heavy lifting
when it comes to infrastructure as code: wrangling
root modules, CI/CD scripts and Terraform, just to keep things
moving along. What if your developers could just diagram what
they want and you still got all the control and visibility you
need? That's exactly what Massdriver does. Ops teams upload
your trusted infrastructure as code modules to our registry. Your

(21:17):
developers, they don't have to touch Terraform, build root modules,
or even copy a single line of CI/CD scripts. They just diagram

(21:38):
their cloud infrastructure. Massdriver pulls the modules and
deploys exactly what's on their canvas. The result? It's still
managed as code, but with complete audit trails, rollbacks,
preview environments and cost controls. You'll see exactly who's
using what, where and what resources they're producing, all
without the chaos. Stop doing twice the work. Start making Infrastructure
as Code simpler with Massdriver. Learn more at Massdriver.cloud.
So in that scenario, can you pause execution and then resume execution?

(22:03):
So this sounds like also if you had a system where you're like...
in an evented system and you've got something that's hitting
a dead-letter queue all of a sudden because there's something
just... there's an error that you were just not expecting and the
code's just wrong. It's like that thing that hits the dead-letter
queue, you have to have some other code that's going to process

(22:23):
that dead letter. This event happened, we still need to deal with
it. But in this scenario, I can pause execution, we fix the code,
ship it, and then resume execution of this event that was
failing previously. And it's just like, okay, now we just have
the handling of this. Maybe a field was spelled wrong or something
like that, using the same kind of model.

(22:44):
Absolutely perfect, you have the essence of durable execution.
I've broken shit. I've broken shit before, I've been around.
Yep. So there's two things. I'll get back to the pause in just
a moment because you also used another magic word, which is eventing,
and I hadn't described that yet. So one of the things that we

(23:05):
do... I mentioned that you have these units of work, they're
activities, and then you have a workflow that stitches them together,
right? So we talked about the retry around the activity. But the
other thing that I want to point out that's very, very important
is that when I have this workflow that's orchestrating these
activities, every single one of those calls from the workflow

(23:26):
into an activity and the return happens via message queue.
Happens via task queue. No longer does this... remember I said
the programming model says that you can program as if failures
don't exist? Another way of putting that is you can program as
if everything is running in the same process. Like a function
call, you don't have to worry about that because it's running in

(23:48):
the same process. Of course you can have out of memory errors
and things like that, but the higher level programming languages
have done a pretty good job not letting you shoot yourself in
the foot by not having the right pointer type of a thing. So
a function call is a pretty safe thing. You don't really have
to wrap every single function call with a whole bunch of error

(24:08):
handling code. So what the SDKs do in the durable execution
case is they intercept and they're handling some retry. But
in fact that retry itself is being handled over a task queue.
And so all of this is happening with an eventing system.
So now in this quota example that you talked about, now I want

(24:30):
to get to your dead-letter queue, because that is a perfect
example, because this is the way that we have programmed these
distributed systems in the past. We have an eventing system
and at some point when we hit some failure scenario, we don't let
it sit in the queue anymore, we send it to a different queue,
which is the dead-letter queue, which says, "Hang on, I'm

(24:53):
stuck here, I can't do...." And then that's typically where application
engineers... or if you're applying this in the platform space,
platform engineers... have to occasionally go, they have to write
automation that goes across that dead-letter queue, gives you
some observability, and you've got to handle these things. And that's
where a lot of orphans end up, right? Like orphaned infrastructure

(25:13):
ends up somehow manifesting itself into the dead-letter queue.
With durable execution, another way of expressing it is that
once you have started a process, so once you've started one
of these workflows with durable execution, it will live until
it either completes or you decide to terminate it. What that

(25:36):
means is that if something's going wrong, like I hit this quota
problem, I don't have to actually send things to a dead-letter
queue. I can basically say, "You know what, I'm stuck. I got
an error. It's a quota error. And so now I am going to put this
flow into a wait state."
Very cool.

(25:57):
You can basically have some logic in the workflow that says,
"When I hit a quota error, I'm just going to go into a wait state
and I'm going to wait for something." Typically you're waiting
for some state in the application to change. You might
have a flag in the running code, in the running workflow, it's
a local state variable that says, "Hey, waiting for human input."

(26:23):
You basically go into a wait state. And the magic is that in a
durable execution platform, it basically says, "Alright, I'm going
to offload, I'm not going to consume any resources with this."
By the way, there are some platforms out there that talk about
durability, but if you read the fine print, it says, "While you're
waiting, this is your cost," because it's still consuming

(26:49):
resources. In the Temporal durable execution platform, it literally
consumes zero compute while it's in that wait state. And it can
wait for a minute, a day, a week, a year. That process lives
forever until it finishes. We need to talk about what we call entity
workflows in just a minute. What we do is we go into a wait state.

(27:13):
It's not that we go into a dead-letter queue. We just say, "Hey,
this workflow is paused." Now somebody goes and updates the quota.
In updating the quota, you've got some code that updates the quota.
It also flips that bit in the workflow. It says, "Hey, human responded."

(27:35):
Now the workflow says, "Oh cool, I'll pick up where I left off."
Which is an important part of durable execution. It knows exactly
where you were, picks up where you left off and says, "Okay, human
came back with some input. Let me continue this retry." And this
time it retries and the quota has been updated and you can go on.

(27:58):
No special logic, it's just the workflow continues. That's what
durable execution is.
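The wait-state flow just described (hit a quota error, park, get a "human responded" signal, continue) can be mimicked in a few lines of `asyncio`. This is only a sketch in plain Python: `ProvisionWorkflow`, `QuotaExceeded`, and `signal_quota_raised` are hypothetical names, and unlike a durable execution platform, the state here lives in memory and is lost if the process dies.

```python
import asyncio


class QuotaExceeded(Exception):
    """Hypothetical error raised by the provisioning step."""


class ProvisionWorkflow:
    def __init__(self, provision):
        self.provision = provision
        self.quota_raised = asyncio.Event()  # the "human responded" flag
        self.log = []

    def signal_quota_raised(self):
        """Called by whatever code updates the quota."""
        self.quota_raised.set()

    async def run(self):
        while True:
            try:
                result = self.provision()
                self.log.append("provisioned")
                return result
            except QuotaExceeded:
                self.log.append("waiting for quota")
                await self.quota_raised.wait()  # parked: no polling, no busy loop
                self.quota_raised.clear()


async def scenario():
    quota = {"limit": 0}

    def provision():
        if quota["limit"] < 1:
            raise QuotaExceeded()
        return "db-1"

    workflow = ProvisionWorkflow(provision)
    task = asyncio.create_task(workflow.run())
    await asyncio.sleep(0)            # let the workflow run until it parks
    parked_log = list(workflow.log)   # snapshot taken while it is waiting

    quota["limit"] = 1                # someone updates the quota...
    workflow.signal_quota_raised()    # ...and flips the bit in the workflow
    return await task, parked_log, workflow.log


result, parked_log, final_log = asyncio.run(scenario())
```

There is no dead-letter queue and no separate recovery path in this sketch. The same loop that failed simply continues once the flag is set.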
Very cool. In that Terraform or OpenTofu example...
Sorry, I've been saying a lot of Terraform. Sorry folks, don't
come at me... in that scenario, it would start to reapply,
right? So it skipped through most of it because that's okay. So
it's not like actually somehow pausing the terraform binary and
resuming that. Okay, very cool.

(28:23):
That's correct. Yeah. We have a lot of people who are using Temporal
to orchestrate their Terraform, to manage their state
files, like I said, to break down their monolithic configurations
into smaller pieces so that they can be a little bit more fine
tuned with it. And it's really very, very cool because you realize
when you're doing things like retries, you need to have things

(28:44):
like idempotence and most of the resources in OpenTofu are idempotent.
So it's actually quite a nice match made in heaven between Terraform
and durable execution.
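Why retries need idempotence can be shown with a toy example. The sketch below uses a plain dictionary as a stand-in for a cloud account; `ensure_bucket` and `create_bucket` are hypothetical functions, not calls from any real provider SDK.

```python
def ensure_bucket(cloud, name):
    """Idempotent: running it twice leaves the same state as running it once."""
    if name not in cloud:
        cloud[name] = {"kind": "bucket"}
    return cloud[name]


def create_bucket(cloud, name):
    """Not idempotent: a retry after a lost response fails on the second call."""
    if name in cloud:
        raise ValueError(f"{name} already exists")
    cloud[name] = {"kind": "bucket"}
    return cloud[name]


cloud = {}
ensure_bucket(cloud, "logs")
ensure_bucket(cloud, "logs")  # a retry is harmless
resource_count = len(cloud)

try:
    create_bucket(cloud, "logs")  # a retry of the non-idempotent version
    retry_failed = False
except ValueError:
    retry_failed = True
```

A declarative "make it so" operation behaves like `ensure_bucket`, which is why it pairs well with automatic retries.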
In the preshow you said that you guys weren't originally designed
for infrastructure orchestration. It wasn't really designed
for people on the opposite side of the house, but that's where

(29:04):
you're starting to see a lot of people using the product today.
What do you think is the most attractive thing for Operations,
DevOps, budding platform teams about this execution model for managing
things like Ansible and Terraform and whatnot?
Yeah. And so I'll say a little bit more about that just for a little
context for your listeners. So Temporal's been around for like six

(29:28):
and a half years and when we had a conference a couple of years
ago... I wasn't there yet, I've only been here for about a year
and a half... but without doing any kind of enablement in the
platform space, without doing any kind of go to market or anything
like that, like literally over half of our user stories that came
to that conference were platform engineering, were infrastructure

(29:50):
orchestration... by infrastructure I also mean, you know,
any kind of user onboarding. It's not necessarily compute storage
and network, but it might be provisioning a user or provisioning
them into some kind of a SaaS system. So very platform engineering
use cases. And I think that one of the main reasons that it took
hold in that space was first of all, a lot of what we've been

(30:13):
talking about was the need to have a tool set and also not have
to learn about actor systems and event driven systems and event
sourcing systems and all of that stuff to be able to get your
job done. So the programming model is really great, but I think
that one of the main reasons is because of the long running nature.

(30:34):
Durable execution... yes, it's used in even money transfers. It's
used a ton in that scenario because it's great to have durability
when you're doing money transfer across different systems. But those
transactions are relatively short in timeline and people have built
the Rube Goldberg

(30:58):
machines and spent the sixty to eighty percent of... Otherwise, if
you can't have a resilient financial system, it wouldn't be a system
worth using at all. But it's the long running nature,
I think that was one of the biggest things. So that scenario

(31:22):
that... process when you're ready to go. I think the long running
nature is one of the things that has really... The programming
model and the long running nature of the workflows are the two
things, I think.
...reason about, takes up a ton of time when it's all together, but
also it's just fundamentally at odds with use, you know what I'm
saying? It's like giving somebody a Terraform module and you're
like, "This makes a VPC, some subnets, a database, a Kubernetes
cluster and you can put your app in it." And it's just like, "That's
like two thousand parameters I have to think about." Right? Versus

(32:23):
like, "Okay, I'm going to think about a network, or maybe the
network's already been thought about

(32:46):
and it's provisioned, so that's great..."
Yep, yep, totally a hundred percent on that. There's another
thing. So I mentioned long running and earlier I kind of hinted

(33:08):
at like we should talk about entity workflows.
I wanted that... I was trying to remember, I was digging
in my brain. I'm like, "There's another workflow of things
she'd mentioned I wanted to talk about." That was it. Yes.
And this is a perfect spot to talk about it. So we said
that the things that the platform engineer is dealing with,
these things live for a long time. And it's not only the provisioning

(33:33):
process. We know, especially with the infrastructure as code and
some of these declarative systems, there's definitely this
notion of like, okay, it's not that we do all the orchestration
and then we're done. We recognize that things change, so
that environment will change. I've provisioned the environment,
but now I need to provision more. I need to scale capacity or

(33:57):
I need to add some other component into the architecture.
Those types of things. I need to cycle credentials, all of those
types of things. And this is where entity workflows come in. I
actually prefer a different term. It's a term, I think that's
going to resonate with more of your audience, which is the notion

(34:18):
of a digital twin. So, I've got a thing, it is an application
team has come along and they've said, "I need an environment.
I need an environment provisioned. I actually need Dev,
Staging, and Prod. And they all need to have this database, this
message queue... the Temporal... It needs to have all

(34:42):
of these different components that needs to be tied into the IAM
system in the following way." Then we go ahead and the infrastructure
as code, the orchestration logic, it provisions all of that.
The notion of a digital twin is that you always have a logical
and digital analog to this very real thing that is out there.

(35:05):
The very real thing, of course here in this case is abstract, it's
infrastructure and all of those things. But the cool thing
is now you have a digital twin. And that digital twin is basically
sitting there saying, "Okay, I have a representation of what that
infrastructure is. And I'm also the place where you can interact
instead of interacting directly with the physical thing.

(35:27):
You're interacting with me as the digital twin." And so the entity
workflow... remember I was describing how these workflows can
live forever... you can basically design a workflow so that
it provisions everything and then it goes into a wait state. Now
you can send signals into that workflow and say, "Hey, I need you

(35:48):
to scale capacity," or "Hey, I'm adding additional users to this
project." It basically allows you to not create a brand new orchestration
to make changes to an existing thing. It says, "Alright, here's
the orchestration that is going to make changes to the existing
thing." So it's a great place for audit. It is a great place for

(36:12):
observability. What are the things that happened? That's what
we mean by an entity workflow. I think of it as a digital twin for
the infrastructure that you're orchestrating and that's a super
powerful thing in the platform space.
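
A minimal sketch of the entity workflow idea described above: provision once, go into a wait state, then apply signals to the same long-lived thing while keeping one consolidated history. This is plain Python with asyncio, not the Temporal SDK, and it is not durable across process restarts. The class name `EnvironmentTwin` and the signal names `scale`, `add_user` and `teardown` are hypothetical and chosen only for illustration.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class EnvironmentTwin:
    """Hypothetical digital twin for one provisioned environment."""
    name: str
    capacity: int = 1
    users: list = field(default_factory=list)
    history: list = field(default_factory=list)  # audit trail of what happened
    _signals: asyncio.Queue = field(default_factory=asyncio.Queue)

    async def signal(self, kind, payload=None):
        # Callers talk to the twin, not to the infrastructure directly.
        await self._signals.put((kind, payload))

    async def run(self):
        # Provision once, then wait for signals for as long as the
        # environment lives.
        self.history.append(("provisioned", self.capacity))
        while True:
            kind, payload = await self._signals.get()
            self.history.append((kind, payload))
            if kind == "scale":
                self.capacity = payload
            elif kind == "add_user":
                self.users.append(payload)
            elif kind == "teardown":
                return  # the orchestration ends when the environment does


async def demo():
    twin = EnvironmentTwin(name="staging")
    task = asyncio.create_task(twin.run())
    await twin.signal("scale", 3)
    await twin.signal("add_user", "alice")
    await twin.signal("teardown")
    await task
    return twin


twin = asyncio.run(demo())
print(twin.capacity, twin.users)
for event in twin.history:
    print(event)
```

Every change goes through the same object, so the `history` list plays the role of the single place for audit and observability mentioned in the conversation.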
Yes. That effectively gives you close proximity or parity effectively
between your environments. Right? Like I have prod, that is

(36:34):
my physical thing in this case. And then it's like, "Hey, I
need a preview environment every time somebody opens a PR for
this app, I want to clone Prod, stand it up, make sure that
like QA can QA the entire thing, it works as intended and then
get rid of that twin so I'm not spending money on it."
Yep, yep, that's right.
Very cool. And so that is... entity workflows in Temporal is how

(36:57):
you model that today?
Yeah, exactly. So they're these long running things that basically
are just mirrors of the real infrastructure. So you've got the
orchestration that lives as long as the infrastructure does.
It's not that the orchestration completes and then
something else has to come in and affect that infrastructure, it's
that the orchestration lives as long as the environment lives

(37:22):
and you can continue to interact with it through that one
thing. So that means it consolidates all of that orchestration,
all that history.
Yeah. And that's frustrating to model in a CI/CD pipeline. It's
very frustrating. You can do it, but is there a better use of
your time? There probably is.

(37:44):
Yep. Rather do a little innovation.
So Temporal has been around for way beyond the current
AI wave, but you're also seeing it used heavily in AI systems today.
There's the platform engineering shaped angle in there
around sandbox orchestration and managing environments and agents

(38:04):
are executing... executing agents is like... it is also very
much like an infrastructure orchestration problem, right? Got
tons of run. What is it about durable execution that AI teams are
reaching for? And how can it... or can it help some of the,

(38:27):
I guess, nondeterminism that we receive from these AI systems?
Yep. One of the things I like to say is that your LLMs are
non-deterministic enough. Everything that you wrap around the
LLM, let's make that as deterministic as possible.
Yes, please.
Temporal has been around for six and a half years, so well before
the GenAI craze. But the uptake of

(38:53):
Temporal in AI companies all the way to the biggest AI companies
out there... OpenAI is very public about their use of Temporal
and they use it in a ton of places... is severalfold. Number one,
all of the GenAI based applications, whether they're agentic or
fixed flow, are distributed systems. Especially with agents,
they're starting to run for longer and longer periods of time.
And very similar to the platform engineering space, the longer
that something runs, the higher the chances of there being some
kind of a problem that you need to compensate for. And so LLMs
are an external call generally. I mean, even if you're running
a local LLM, it's not running in process, so it's always an
external call. If you're running the ones out on the frontier
models, you're going to get rate limited. So that whole thing
that we talked about earlier with quotas, the analog

(40:13):
over in the AI space is rate-limited LLMs. You're going to be
orchestrating all sorts of things - interacting with databases,
file systems, all of those things - all external calls where
things could go wrong, where there's some element of
nondeterminism that...

(40:35):
No, you certainly do not.
They're... durable execution, we've recorded this state and all... Durable
execution is basically an event-sourced system for any of your
listeners who are familiar with event sourcing. Everybody's
familiar. Nobody's built it because it's fricking hard. But that's
essentially what it is. It's an event-sourced system. What we can

(40:58):
do is we can say, "Yep, let's go through, where were we? Oh yeah,
we already did that. We recorded that. Yep, yep, yep, yep.
Got all those LLM outputs." So you're not re-executing the LLM,
which has two problems. Number one, it burns a lot of tokens, lots
of money, but also you're going to get back different results
if you were to rerun that LLM call. Which is crazy because now

(41:21):
how are you supposed to even identify whether your system's running
properly? There's that whole thing, it's a distributed systems
problem, but then it has this additional thing of, "How do you
actually do reasonable development in a system that is inherently
non-deterministic?" And durable execution absolutely helps you manage

(41:43):
that thing. There's another element that I want to talk about,
and I want to go back to the programming model. We all know that
more and more code is being written by AI agents... coding agents.
They're getting really good. The LLMs together with the harnesses
that people are building, they're all getting really good at

(42:05):
writing business logic. It's pretty darn good. I usually have
to go back and ask it to get rid of some of the fluff. Like, "Do
you really need this?" And those types of things. But it's pretty
darn good at the business logic. It's not so good at the whole
event driven resilience and all of that stuff.
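
The replay behaviour described earlier in this answer (recorded LLM outputs are read back from history instead of being re-executed) can be sketched in a few lines. This is a toy illustration in plain Python, not Temporal's implementation or API. `DurableRun`, `fake_llm` and the step names are hypothetical, and the JSON string stands in for whatever durable store a real system would use.

```python
import json


class DurableRun:
    """Toy event-sourced runner: each step result is recorded once,
    and on replay the recorded result is returned instead of re-running."""

    def __init__(self, history=None):
        self.history = list(history or [])  # ordered record of completed steps
        self._cursor = 0

    def step(self, name, fn, *args):
        if self._cursor < len(self.history):
            recorded = self.history[self._cursor]
            if recorded["step"] != name:
                raise RuntimeError("code no longer matches recorded history")
            self._cursor += 1
            return recorded["result"]  # replay: no external call is made
        result = fn(*args)  # first execution: make the real call
        self.history.append({"step": name, "result": result})
        self._cursor += 1
        return result


calls = {"llm": 0}


def fake_llm(prompt):
    # Stand-in for a non-deterministic, paid external call.
    calls["llm"] += 1
    return f"answer #{calls['llm']} to {prompt!r}"


def workflow(run, crash_after_first=False):
    plan = run.step("plan", fake_llm, "make a plan")
    if crash_after_first:
        raise ConnectionError("simulated worker crash")
    return run.step("summarize", fake_llm, plan)


# First attempt crashes after the first LLM call has been recorded.
first = DurableRun()
try:
    workflow(first, crash_after_first=True)
except ConnectionError:
    pass
saved = json.dumps(first.history)  # pretend this was persisted

# Second attempt replays the recorded step, then continues with new work.
resumed = DurableRun(json.loads(saved))
result = workflow(resumed)
print(calls["llm"], result)
```

Across both attempts the fake LLM is called twice in total, once per step, so the crash costs no extra tokens and the resumed run sees the same first output as the original one.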
No way.

(42:26):
What if you... like humans aren't particularly good, spending
sixty to eighty percent of their time and it's toil and all that stuff.
What if we don't burden the coding agents with understanding
distributed systems plumbing? That is just under the covers and
that's where we're seeing... We had a customer, in fact they spoke
about it at our conference last week, where they had scheduled

(42:49):
six months for a migration from some of their legacy applications
that they had on some legacy workflow system. They had scheduled
six months for the migration. They did it in three weeks.
Heck yeah, they did.
Because they used coding agents that used a Temporal skill

(43:13):
that gave it the knowledge of what these Temporal abstractions
are. Workflows, activities, described the types of things. And
so it was able to write the business logic. It didn't have to
write any of the plumbing code that it used to have. They were able
to... like developer productivity, time to market, huge.
Yeah. Are those Temporal skills also open source and can be

(43:35):
used with the open source Temporal?
Absolutely. Yeah, they are in our repo.
We'll include some of those in the show notes. Yeah, awesome.
Well, I know we're coming up on time. This is a super fun conversation.
I love learning about what you guys are doing over there. Where
can people find out more about Temporal, the open source project,
the company behind it and where can they find you online?
Yeah, so you can find me on LinkedIn. That's the social media

(43:57):
platform that I use these days.
My favorite.
Yep, same. Cornelia Davis, Temporal. You'll find me that way.
In terms of finding Temporal, you can certainly go to Temporal.io.
So we are a business, we do have a SaaS offering of Temporal.
So Temporal... I talked a lot about the SDKs, but there's a service
(44:18):
element... there's a backing service to that. We have a SaaS offering
of that and it runs globally, lots of different regions. We just
last week announced that we've been achieving six nines. Yes, I
said six nines of availability on our Temporal Cloud product.
Nice.

(44:38):
Yeah, it's insane. It's insane. Our SLA is either 3 or 4,
but we've been achieving 6. It's just freaking awesome. You can
start at Temporal.io. But we are first and foremost an open source
company so you can find your way to all of the open source stuff
there as well. The GitHub organization is "temporalio". We

(44:59):
also have a "temporal-community" GitHub organization
where you can find a whole bunch of goodness. You'll find the
skills out there. We'll put... like you said... put that in the
show notes. If you're on a Mac, you can brew install Temporal,
a local dev server. So you don't... it's brain dead simple.
So lots of stuff.
I like it. I like it. And then "Cloud Native Patterns". You can

(45:20):
find it on Amazon... anywhere? It's published by Manning, right?
It is published by Manning. And I'll just take a moment to like
celebrate with you a little bit. I just recently got, you know,
my quarterly royalty statement and I have ticked over 10,000 copies.
Heck yeah, nice. Is that New York Times bestseller yet? Do they

(45:43):
put tech books on bestsellers or...?
We don't write technical books to get rich.
Awesome. Well, it's so awesome to have you on the show. Thanks so
much. And check out Temporal.
