
November 16, 2025 (51 mins)
Summary 
In this episode Preeti Somal, EVP of Engineering at Temporal, talks about the durable execution model and how it reshapes the way teams build reliable, stateful systems for data and AI. She explores Temporal’s code‑first programming model—workflows, activities, task queues, and replay—and how it eliminates hand‑rolled retry, checkpoint, and error‑handling scaffolding while letting data remain where it lives. Preeti shares real-world patterns for replacing DAG-first orchestration, integrating application and data teams through signals and Nexus for cross-boundary calls, and using Temporal to coordinate long-running, human-in-the-loop, and agentic AI workflows with full observability and auditability. She also discusses heuristics for choosing Temporal alongside (or instead of) traditional orchestrators, managing scale without moving large datasets, and lessons from running durable execution as a cloud service.

Announcements 
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data teams everywhere face the same problem: they're forcing ML models, streaming data, and real-time processing through orchestration tools built for simple ETL. The result? Inflexible infrastructure that can't adapt to different workloads. That's why Cash App and Cisco rely on Prefect. Cash App's fraud detection team got what they needed - flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows. Each model runs on the right infrastructure, whether that's high-memory machines or distributed compute. Orchestration is the foundation that determines whether your data team ships or struggles. ETL, ML model training, AI Engineering, Streaming - Prefect runs it all from ingestion to activation in one platform. Whoop and 1Password also trust Prefect for their data operations. If these industry leaders use Prefect for critical workflows, see what it can do for you at dataengineeringpodcast.com/prefect.
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • Composable data infrastructure is great, until you spend all of your time gluing it together. Bruin is an open source framework, driven from the command line, that makes integration a breeze. Write Python and SQL to handle the business logic, and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement. Bruin allows you to build end-to-end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster. Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform. Go to dataengineeringpodcast.com/bruin today to get started. And for dbt Cloud customers, they'll give you $1,000 credit to migrate to Bruin Cloud.
  • Your host is Tobias Macey and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures

Interview
 
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what durable execution is and how it impacts system architecture?
  • With the strong focus on state maintenance and high reliability, what are some of the most impactful ways that data teams are incorporating tools like Temporal into their work?
  • One of the core primitives in Temporal is a "workflow". How does that compare to similar primitives in common data orchestration systems such as Airflow, Dagster, Prefect, etc.?
     
    • What are the heuristics that you recommend when deciding which tool to use for a given task, particularly in data/pipeline oriented projects?
  • Even if a team is using a more data-focused orchestration engine, what are some of the ways that Temporal can be applied to handle the processing logic of the actual data?
  • AI applications are also very dependent on reliable data to be effective in production contexts. What are some of the design patterns where durable execution can be integrated into RAG/agent applications?
  • What are so

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Tobias Macey (00:11):
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL.
The result, inflexible infrastructure that can't adapt to different workloads.
That's why Cash App and Cisco rely on Prefect.

(00:33):
Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows.
Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles.
ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform.

(01:01):
WHOOP and 1Password also trust Prefect for their data operations.
If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect.
Composable data infrastructure is great until you spend all of your time gluing it back together.
Bruin is an open source framework driven from the command line that makes integration a breeze. Write Python and SQL to handle the business logic and let Bruin handle the heavy

(01:27):
lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster.
Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin

(01:49):
today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud.

Tobias Macey (01:57):
Your host is Tobias Macey, and today I'm interviewing Preeti Somal about how to incorporate durable execution and state management into AI application architectures. So, Preeti, can you start by introducing yourself?

Preeti Somal:
Hi, everyone.
Glad to be here chatting with you today. My name is Preeti Somal, and
I run engineering
at Temporal

(02:18):
Technologies.
We are the pioneer behind durable execution.
Prior to this, my background has been a lot of enterprise software, most recently at HashiCorp.

Tobias Macey (02:30):
And do you remember how you first got started working in the area of data?

Preeti Somal (02:34):
Yeah. I think my first,
sort of exposure to data was actually at Yahoo. I was at Yahoo for four years, and that's when I learned all about the wonderful world of Hadoop
and data.
I was more on the systems management side, and
one of our sort of dreams was to get all of the monitoring data into the Hadoop clusters

(02:56):
and kinda see what magic comes out of that.

Tobias Macey (03:00):
And now you're working at Temporal, which is definitely at the forefront of this space of durable execution
and
fail proof application design.
And I'm wondering if you can just give a bit of an overview about what that term means when somebody says durable execution and some of the ways that it changes the way that people think about their overall system architecture.

Preeti Somal (03:22):
Absolutely. So the way we like to explain this is what if
your application
crashes
and the crash is inconsequential?
And that's exactly what durable execution gives you. The goal here is to offload
the developer,
the engineer from all of the heavy lifting that goes into building

(03:45):
reliable, resilient applications
and take all of that work and deliver it through a platform.
And that platform is Temporal, and that's the core of the durable execution
value that we are delivering.

Tobias Macey (04:01):
And so in terms of that resiliency
to errors and failure,
obviously, not every error is one that you can automatically
recover from, but many of them are just due to transient issues, whether those are brief network outages or an application node going down to get restarted because it's doing a rollout of a new version,

(04:24):
or maybe there is just a configuration error that needs to be addressed, and then you can retry it. And I'm just wondering if you can talk to some of the ways that
as you are building on top of something like Temporal, it forces you to think about what are those different failure modes and what are the appropriate behaviors.

Preeti Somal (04:42):
Yes. Absolutely. So I think the main thing to talk through here is Temporal
is a programming model, and our approach has been to really build
idiomatic
SDK support. So we have
support for Temporal in multiple languages.
And as developers kinda sit down and understand Temporal, the two key elements are

(05:08):
to think about how your application
is structured
vis a vis a concept called a workflow.
And as part of that model, what you do is you put your error prone
pieces of the application into something we call an activity.
And so that starts sort of creating this separation of concern between

(05:28):
the logic of a multistep process and
encapsulating
kind of a step that could be error prone into an activity. And from there, you can just set policies around how many times you wanna retry,
exponential
back off, you know, all of the sort of decorators and the power around

(05:49):
what you want to do with respect to handling of that error. But you don't actually need to build that logic. So one thing we find is in any application code, you know, roughly 60%
of the code is around
the
scaffolding
around the error-handling pieces of it, and that just goes away with Temporal.

(06:13):
And so you get much better readability.
You get the ability to focus on just writing your business logic,
and Temporal will handle everything else for you.
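As an illustration of the separation described here, below is a minimal plain-Python sketch. It is not the actual Temporal SDK; `RetryPolicy` and `run_activity` are invented stand-ins. The idea is that the error-prone step is isolated as an "activity", and a declarative retry policy replaces the hand-written retry/backoff scaffolding that would otherwise crowd the business logic.

```python
import time

class RetryPolicy:
    """Declarative retry configuration, in the spirit of Temporal's
    activity retry options (illustrative names, not the real SDK)."""
    def __init__(self, max_attempts=3, initial_interval=0.01, backoff=2.0):
        self.max_attempts = max_attempts
        self.initial_interval = initial_interval
        self.backoff = backoff

def run_activity(fn, policy, *args):
    """Run an error-prone step under a retry policy with exponential backoff."""
    delay = policy.initial_interval
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == policy.max_attempts:
                raise
            time.sleep(delay)
            delay *= policy.backoff

# The "business logic" stays free of error handling:
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:               # fail twice, then succeed
        raise ConnectionError("transient network blip")
    return f"payload from {url}"

result = run_activity(flaky_fetch, RetryPolicy(max_attempts=5), "https://example.com")
```

In the real platform, the policy is declared when the workflow schedules the activity, and the platform (not in-process code) drives the retries, which is what makes a worker crash inconsequential.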

Tobias Macey (06:24):
And with that focus on reliability
and high uptime
and the separation of concerns from the
reliability and state management versus the business logic, I'm curious how you're seeing that impact the way that data teams are thinking about the use of Temporal within their work, whether that is building different ETL pipelines

(06:47):
or
managing state storage for some sort of data oriented application
or various other use cases.

Preeti Somal (06:55):
Absolutely. And so,
really, when you talk about
sort of the data aspect of this, you know, a pipeline
and managing state,
what we are seeing is that all of these sort of tasks are
multistep
processes,
and there is a lot of coordination
and state management that goes into it. As well as, you know, if an error happens, do you restart from the start? And that could be really expensive, especially if you're doing some training models

(07:28):
and task management
on really expensive GPUs, etcetera.
The overall sort of job of the engineer becomes much simpler because
they're able to sort of logically
think about that pipeline
in terms of the steps that that pipeline has
and then build those steps out using Temporal

(07:50):
without needing to worry about state management
or queues or checkpointing
or, you know, any of the sort of complexities
that aren't related to the task at hand
just sort of goes away.
And so what we are seeing is, especially
given
all of the the sort of data hunger for AI applications,

(08:13):
kinda the number of customers
running their data pipelines on Temporal
has actually kind of really exponentially
grown.
And then the other piece of it is they're able to go faster. So one of the interesting sort of dimensions here to talk a bit about is often

(08:33):
in software engineering,
developer productivity and reliability
are considered to be at odds with each other.
And we really think that's a false dichotomy.
With Temporal, we believe,
and we have customers
sort of attesting to this, that
you can increase the developer productivity

(08:53):
while
bringing that reliability
and scale as well.

Tobias Macey (08:58):
And then digging a bit more into some of those foundational
primitives
of Temporal and the overall space of durable execution,
you mentioned checkpointing,
which is something that has very specific meanings in different contexts where if you're doing machine learning, you'll typically checkpoint at a particular completion of an epoch of a training round. In systems such as Flink, there are checkpoints as far as once you're done processing a particular window. When you're dealing with something such as Spark Streaming, it has the idea of microbatches where maybe you're going to checkpoint at each completion of a batch. And I'm wondering, as you

(09:36):
work with data teams who are thinking about where and how to incorporate
something like Temporal,
how does that change the ways that they think about the foundational primitives in the tools that they're relying on and maybe some of the ways that they can reduce some of that reliance and,
in some cases, maybe even move to a simpler tool because Temporal handles some of that heavy lifting?

Preeti Somal (10:01):
Yeah. And that's a great question because I think what this question teases out is a lot of the tools you mentioned were built
specifically
for the data
space. And as you know, Temporal
is a general purpose durable execution platform.
And while we are getting a ton of usage in data, we are not

(10:23):
out of the box. The abstractions
are not extremely opinionated
about how the data engineers should be structuring
their workflows. Right? So a checkpoint
is something that actually isn't even a term that shows up in our abstractions.
But depending on how

(10:44):
the team at hand is thinking about
their needs, they can
implement that using the signals and the activities and the tasks and kind of their set of abstractions that the Temporal programming model
delivers.
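A toy sketch of the point above: checkpoint-like resumption can be assembled from generic step abstractions by durably recording which steps completed, so a rerun skips finished work. This stands in for what Temporal derives from its server-side event history; the names here are illustrative, not the SDK's.

```python
# Durable progress record (stands in for server-side workflow history).
completed = set()
log = []

def step(name, fn):
    """Run a step exactly once; a replayed run skips completed steps."""
    if name in completed:
        return
    fn()
    completed.add(name)

def pipeline():
    step("extract", lambda: log.append("extract"))
    step("transform", lambda: log.append("transform"))
    step("load", lambda: log.append("load"))

pipeline()   # first run executes all three steps
pipeline()   # a "restart" re-runs the workflow function, but no step repeats
```

The design point is that "checkpoint" never has to appear as a named primitive: idempotent resumption falls out of tracking step completion, which is exactly the bookkeeping the durable execution platform takes off the engineer's plate.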

Tobias Macey (11:00):
And one of the higher level primitives in Temporal is the concept of a workflow,
which is a sequence of tasks to complete. I know it also has a concept of activities,
particularly when you're dealing with the data ecosystem. Workflows, again, have a very specific meaning, which is generally incorporated into the idea of

(11:21):
a data orchestration engine that handles the sequence of steps in a particular directed acyclic graph or DAG. And I'm wondering as people are starting to adopt Temporal
maybe for application use cases, how maybe that
shifts
the ways that data teams are thinking about their usage of orchestration engines, the role of the orchestration engine in the management of workflows and state related to those workflows. And, also, as application teams and data teams maybe are building on that same substrate,

(11:54):
brings
those use cases closer together.

Preeti Somal (11:58):
Yeah. And, you know, really great question. Again, I think that that really highlights the power of the platform. So,
clearly,
Temporal is code first,
and a lot of the tooling that exists in the data-specific space is oriented around DAGs. Right? And so we believe, and what we're seeing is, that the code-first approach

(12:21):
lets you reason
with the logic
in a much more compelling way and provides the flexibility
and scale
that you need as your pipelines get more and more complicated.
We are seeing in fact, we had a talk at our conference,
where, you know, this particular customer actually

(12:41):
as they moved from in their case, they were using Airflow. As they moved from Airflow to Temporal,
the first phase of that was they actually just sort of built
a DAG
to temporal workflow
mapping to get their teams sort of up and running with Temporal
and get familiar with Temporal. And then as they learned more,

(13:04):
they started taking out sort of the DAG pieces and being code first. So I think that is a super compelling
sort of approach here. And then I think the second piece, and this is where we see a lot of
the use cases that illustrate
both the AI application
and data sort of domains coming together as well as the sort of more real time components as well is being able to send,

(13:33):
signals, for instance, from your data pipeline processing to your application.
And I don't know if you've ever looked at Nexus, but one of the main sort of advances here we are seeing with durable execution is as
teams are building
their workflows,
the data team

(13:54):
might wanna have a way to
invoke something in the application tier or the other way around.
And Nexus is a,
sort of, an extension of Temporal
that allows you to make these calls across boundaries
in a secure way. So

(14:14):
we really believe that and we're seeing patterns where
building both the data pipelines and the application code on the same platform
allows a lot more richness,
in the kinda end application that the consumer is seeing.
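The signal pattern described above can be sketched in a few lines of plain Python. This is a toy model, not Temporal's signal or Nexus API: a data-pipeline workflow owned by one team notifies an application workflow owned by another when a fresh batch lands, without either side sharing internals.

```python
from collections import deque

class AppWorkflow:
    """Application-team workflow: reacts to notifications from elsewhere."""
    def __init__(self):
        self.inbox = deque()
        self.processed = []

    def signal_new_batch(self, batch_id):
        # The only surface the data team touches: a narrow, explicit signal.
        self.inbox.append(batch_id)

    def drain(self):
        while self.inbox:
            self.processed.append(f"served {self.inbox.popleft()}")

def data_pipeline(app, batches):
    """Data-team workflow: prep happens close to the data, then a
    cross-boundary notification is sent (a signal, in Temporal terms)."""
    for b in batches:
        app.signal_new_batch(b)

app = AppWorkflow()
data_pipeline(app, ["2025-11-15", "2025-11-16"])
app.drain()
```

In the real platform the two workflows would run in separate services or namespaces, with the signal (or a Nexus call) carrying the notification across that boundary securely rather than through a shared in-process object.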

Tobias Macey (14:33):
That's also interesting as we start to dive into some of the areas of
AI engineering and AI systems design because I think that's another factor that is pushing these teams closer together where, for a long time, application teams would focus on their user facing applications.
Data teams would handle their exhaust and try to turn it into insights to the organization that would maybe then get piped back into the application.

(15:00):
And machine learning and AI teams would focus on using those data assets to turn them into machine learning models either for business optimization
or for user facing features.
And with the introduction
of generative AI systems,
it forces all of those teams to work much closer together because the cycle time is much faster and the

(15:23):
degree of experimentation
can happen much quicker. And I'm just wondering how maybe that substrate of Temporal being able to work across those boundaries
also factors into the necessities
of the organizational realities as we're bringing AI more into the inner loop of the business?

Preeti Somal (15:42):
Yeah. Absolutely. I think, you know, first and foremost, I think it is allowing
kind of that common durable execution platform to be used across multiple
domains, which, as you were pointing out, were historically pretty segregated.
And,
you know, just even kind of the feedback loops there were weeks and months as opposed to the need now around being as close to real time as possible. And especially with the AI use cases, you know, what's definitely happening is the pattern around what are some of the common

(16:19):
sort of data prep elements that exist
that are needed, and those coming in place, and then
the app teams
sort of iterating much more quickly
and driving feedback into kind of that data prep layer. And being able to do that in a way where you can
actually break down the silos and have sort of, you know, an RPC level for the lack of a better word, like, you know, reach across what were historically hugely firewalled

(16:50):
and ring fenced domains.
That, we're seeing, is really just expediting
the time to deliver these capabilities.

Tobias Macey (16:58):
And
when data teams are faced with the technical decisions of how to implement a particular
use case, particularly if they're dealing with a step based workflow,
what are some of the heuristics that you're seeing them use
when they are trying to decide, okay. Well, I have my orchestration engine, whether that's Airflow, Dagster, Prefect, what have you. And I have

(17:23):
Temporal because the application team is using it, or maybe they're starting to use it for some specific data use cases. I'm just wondering how they figure out that decision point of what are the features that they need and what are the tools that they're going to use for them, and then in particular, what are some of the hybrid opportunities for being able to integrate Temporal into that orchestration engine?

Preeti Somal (17:46):
Yeah. So what we are seeing is a couple of patterns. One pattern is, you know, as Temporal is getting more and more adopted,
the amount
of community and developer love that we are seeing for Temporal,
we have kind of these champions
for Temporal within organizations.

(18:06):
So the first pattern we're seeing is just a sheer grassroots
adoption pattern where you've got an engineer that had a great experience using Temporal
and is completely
a fan of durable execution,
and they're going and helping other teams understand
how to think about Temporal. Right? The second pattern we're seeing is

(18:29):
within
some organizations
where there are sort of the senior kinda architects, principal level engineers that are thinking ahead
in terms of the dependencies
and the kind of the hybrid nature of these applications.
They are the ones who are bringing Temporal and Nexus
into the organization.

(18:51):
And then I would say the final pattern that we are seeing quite a bit of is
data teams that,
you know, started with Airflow, for instance, and it's just not scaling for them. You know, the schedules aren't running
as expected.
They're missing the reliability
and scale,
and they're looking for a solution that is, like, a proven solution at scale.

(19:16):
So a lot of our sort of customers that are in the camp of migrating from
one of the existing
sort of products to Temporal,
you know, a lot of them, the common theme there is
the scale and reliability
that they need.

Tobias Macey (19:32):
And moving into the AI use cases, as data teams are starting to
come to grips with what are the actual requirements
for the purpose of the AI application, or they're trying to feed the appropriate data to the teams who are building maybe an agentic use case.
What are some of the ways that Temporal

(19:55):
simplifies
the workflow
of doing that iteration or decomposing
the state requirements
for the AI application
and, just some of that
interface between the data preparation and data curation stage and the actual activation stage in the context of an AI system?

Preeti Somal (20:16):
Yeah. So,
one pattern here that we're seeing is the sort of incremental
sort of data prep and signaling.
We also have some use cases where the data prep needs sort of a human in the loop type sort of thing.
We have a customer that we were talking with recently where they are actually

(20:39):
wanting to have what they call
accountability
markers
in the
in kind of the final stage of the data prep before
that gets surfaced to the application.
And that marker could be, again, either a human or a,
sort of a validation
system of some kind. So it's, you know, what we're seeing is that there's sort of a multistage

(21:05):
complex
flow here,
that brings in these requirements
around the sort of accuracy and trust
elements as well that are really easy to implement
with Temporal,
again, because it is a general purpose
sort of durable execution platform with a very powerful programming,

(21:27):
sort of model around it. One other sort of use case we haven't talked about yet is we're also starting to get used in just the pure
task management
and scheduling
of the sort of data prep side of the house as well,
and signaling that to the

(21:48):
application
around
a new batch of data that's just come in is another really interesting example. So one of the case studies we've published is
with a customer
that uses us for medical records transcription
and kinda real time sort of visit summary preparation
as well.
And you can imagine, you know, there's there's a number of pieces in there that also relate to compliance.

(22:12):
And so I think the core thing here that I guess I'm trying to articulate is the complexity
and the requirements
as you look at how data feeds these applications
is growing
pretty large because
these domains
are sort of coming closer together.
And that's where needing a platform that can help you build that simply

(22:37):
is really compelling.
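The "accountability marker" idea above can be sketched as a workflow that parks itself until a human (or a validation system) signs off on the batch. This is a toy state machine, not Temporal's API; in the real platform the workflow would durably await a signal rather than be re-invoked by the caller.

```python
class PrepWorkflow:
    """Final data-prep stage gated by a review signal (illustrative)."""
    def __init__(self, batch):
        self.batch = batch
        self.approved = None     # None = awaiting review
        self.published = False

    def signal_review(self, approved: bool):
        # The accountability marker: a human or validation system responds.
        self.approved = approved

    def run(self):
        if self.approved is None:
            return "waiting_for_review"   # durably parked, not busy-polling
        if self.approved:
            self.published = True
            return "published"
        return "rejected"

wf = PrepWorkflow("batch-42")
first = wf.run()          # no review yet, so the batch is held back
wf.signal_review(True)    # reviewer signs off
second = wf.run()         # batch surfaces to the application
```

Because the pending-review state is durable, the pause can last minutes or weeks without any process needing to stay alive, which is what makes multistage human-in-the-loop flows practical.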

Tobias Macey (22:41):
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy?
DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems.
Whether you're moving from Oracle to Snowflake,

(23:03):
migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price.
Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold
to book a demo and see how they turn months long migration nightmares into week long success stories.

(23:25):
Another
aspect
of the system design and architecture that I'm curious about when you're building with Temporal
is we've been talking about Temporal as a means of state management
and durable execution,
whereas data engineering as a discipline
is entirely concerned with very stateful

(23:47):
assets.
And a lot of times, the
scale of that state is the core of the problem where you need to deal with terabytes, exabytes, petabytes of data
writ large
and
Temporal
being a primarily database
backed system, I imagine, is more focused on state with a much smaller scale in the order of bytes, kilobytes, megabytes.

(24:14):
And I'm curious
how
that
factors into the ways that people are using Temporal's state management
in juxtaposition
with the much larger scale of state management that some of these
broad data systems require to be able to operate on?

Preeti Somal (24:32):
Yeah. That's a great question. The
main thing here to note is that, the way the Temporal model works, the state that Temporal is managing
is the state of
where your workflow
and activities
are. So
the beauty, the elegance of our model is that you as the engineer

(24:56):
are running kind of what we call workers.
Your code runs in your environment,
and you don't have to ship all the data over to us. The data kind of resides,
you know, where it needs to reside.
Your workers, the code that gets built using the Temporal SDK,
can run in your environment. In fact, we want it to run in your environment so that it can sit as close to the data as you need it to. What we are managing is the orchestration

(25:25):
of the tasks and the activities that the workers are running. And this is a really key point because of a number of reasons.
One, it really enables a super elegant security model
where
on the Temporal side,
we don't see your data.
We are just seeing any sort of input, output parameters to your workflows

(25:49):
or pointers to s three buckets
or, you know, whatever
you need in terms of the workflow execution
context. And that too is encrypted by you, and you are the only one that has the keys to that. So what you're passing to us is very small and, as far as we're concerned, is garbage.
And then the second main reason this is really critical

(26:12):
is that
this is exactly
why Temporal
can scale
and doesn't hit the limits that we see other systems hitting because
what we manage is purely kind of the task and the workflow execution state, not sort of your business or your data application,
sort of specific things. Right?
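The "pointers, not data" model described here can be sketched as follows: the orchestrator's history only ever holds a small opaque payload (say, an object-store URI plus metadata), while the dataset itself stays in the customer's environment and is dereferenced by the worker. All names are illustrative stand-ins, not Temporal's API.

```python
import json

# The dataset lives "with you", e.g. in your object store, never with the server.
DATASTORE = {"s3://bucket/2025-11-16.parquet": list(range(100_000))}

def start_workflow(server_history, payload):
    """All the server ever retains: a small serialized payload."""
    server_history.append(json.dumps(payload))

def worker_activity(payload):
    """The worker, running in your infrastructure, dereferences the pointer."""
    rows = DATASTORE[payload["uri"]]
    return len(rows)

history = []
payload = {"uri": "s3://bucket/2025-11-16.parquet", "rows": 100_000}
start_workflow(history, payload)
count = worker_activity(json.loads(history[0]))
```

The payload is tiny regardless of how large the dataset grows, which is the scaling point made above; in the real platform that payload can additionally be encrypted with customer-held keys before the server ever sees it.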

Tobias Macey (26:35):
And because of the fact that you are using that worker to manage execution, it allows you to use
your other tools that are accustomed to doing that heavy lifting.
And I'm wondering then what are some of the key pieces of
state or information
that is useful to

(26:55):
maintain in Temporal
for being able to recover from failure or, just some of the ways that people are
using that statefulness,
maybe going back to our earlier conversation about things like checkpointing in Flink to be able to handle that without necessarily
reaching for some of those heavier weight, more complicated engines because you're able to use Temporal for those, you know, maybe smaller state

(27:21):
for those executions?

Preeti Somal (27:23):
Yeah. Absolutely. So the core of how temporal works is the worker kinda runs in your infrastructure.
And on the Temporal server side,
we have the notion of a task queue. And the worker is essentially long polling the Temporal server around
give me my next task, give me my next task. Right? And the give me my next task takes with it, you know, whatever bare minimum sort of context parameters you need to run that. And so what Temporal on the server side is doing is maintaining that state around

(27:58):
where the worker is, what tasks it has executed,
what's the next one to be dispatched, and so on. And so if a crash happens,
what we can do is we can pick up from exactly the last task that was dispatched,
and we call this capability
replay. And so it is literally not replaying the entire sequence of events. It is essentially just picking up from where it fell over and then running

(28:28):
through the next set of things. And because it has enough of the context around what was executed
and what's next
and what the input and output were,
we can just pick up and sort of tell your worker, here's where you were, get going. I know it's a hugely simplified
sort of description, but I'm hoping that helps with that question.
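The task-queue and replay loop described above can be modeled in miniature. This is a deliberately simplified toy, matching the "hugely simplified description" in the answer: the server tracks which tasks completed, a worker crashes mid-run, and its replacement resumes from exactly where things fell over rather than re-running finished work.

```python
class Server:
    """Toy stand-in for the server side: owns the task list and progress."""
    def __init__(self, tasks):
        self.tasks = tasks
        self.completed = []

    def next_task(self):
        # What a worker's long poll would return: the next undone task.
        i = len(self.completed)
        return self.tasks[i] if i < len(self.tasks) else None

    def ack(self, task, output):
        self.completed.append((task, output))

def run_worker(server, fail_after=None):
    """Worker loop; optionally simulates a crash after N completed tasks."""
    done = 0
    while (task := server.next_task()) is not None:
        if fail_after is not None and done == fail_after:
            raise RuntimeError("worker crashed")   # simulated crash
        server.ack(task, f"{task}-ok")
        done += 1

server = Server(["ingest", "transform", "publish"])
try:
    run_worker(server, fail_after=1)   # crashes after completing "ingest"
except RuntimeError:
    pass
run_worker(server)                     # replacement resumes; "ingest" is not re-run
```

Because progress lives with the server rather than the worker process, the crash really is inconsequential: no task runs twice and none is lost.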

Tobias Macey (28:49):
Yeah. That's definitely useful. And then moving more into that AI system design,
digging a bit deeper into that
where one of the newer requirements
that a lot of data teams are maybe less familiar with is the introduction of things such as vector databases,
the corpus management
for knowledge bases that the LLMs rely on,

(29:13):
the requirement
to maintain
freshness of that information to ensure that you're not feeding old data or bad information
to the consumers of that AI system?
And what are some of the ways that having that durable execution
pattern
simplifies or enables those teams to be able to
build and experiment and maintain that corpus?

Preeti Somal (29:36):
Yeah. I think, again, you know, at the end of the day, I think the main thing durable execution brings
here is the rigor
around thinking about
the state
and separating
out the sort of the steps of the workflow
and activities. Right? And so you would, you know, you'd still run your VectorDB

(29:58):
in your own sort of infrastructure, your own accounts,
and, you know, you can have, like, an activity
wrapper that can invoke that and get the context from there and do the checks on freshness, etcetera.
One thing we haven't done yet, and this is definitely a topic of conversation internally, is this question of, you know, Temporal is a general purpose durable execution platform.

(30:25):
Should we be thinking about
building
data specific abstractions
that really take some of these patterns we're seeing and help developers
do that more easily?
And, honestly, you know, for us, this is a big discussion
because we have sort of stayed a little agnostic
around being very opinionated about things, and we feel like that really helps with a lot of the developer

(30:52):
sort of empowerment and creativity and control over the way they wanna implement their use case. But, you know, there's definitely a question on the table here for us around these abstractions,
and should we be thinking about building some abstractions that make that data prep pattern a little bit more in-product, out of the box, versus maybe a best practice or a sample or a demo.
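The activity-wrapper pattern Preeti describes, retrieval plus a freshness check in front of the model, might look roughly like this in plain Python. This is a stubbed sketch: the store shape, field names, and `MAX_AGE` threshold are illustrative assumptions, not Temporal APIs.

```python
# Sketch of an activity-style wrapper around a vector store lookup:
# fetch matching documents, then drop anything older than a freshness
# cutoff so stale context never reaches the LLM. The "store" here is a
# stubbed list standing in for a real vector database.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # illustrative freshness cutoff

def retrieve_context(query, store, now=None):
    """Activity-style wrapper: fetch matches, filter out stale documents."""
    now = now or datetime.now(timezone.utc)
    matches = [doc for doc in store if query.lower() in doc["text"].lower()]
    fresh = [doc for doc in matches if now - doc["indexed_at"] <= MAX_AGE]
    return [doc["text"] for doc in fresh]

now = datetime(2025, 11, 16, tzinfo=timezone.utc)
store = [
    {"text": "Pricing policy v2", "indexed_at": now - timedelta(days=3)},
    {"text": "Pricing policy v1", "indexed_at": now - timedelta(days=200)},
]
print(retrieve_context("pricing", store, now=now))  # ['Pricing policy v2']
```

Wrapping this in a real Temporal activity would add durability: retries on vector-store outages happen in the platform rather than in this function.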

Tobias Macey (31:20):
And when I was preparing for this conversation, I was reading through some of the blog posts on the Temporal
blog about some of the ways that using Temporal as some of the state management for agentic applications
reduces some of the complexity of
actually building those systems because it can be the system of record for the various conversational flows for the executions

(31:47):
and tool uses of the models, and then, in the event of a failure of an agent to be able to successfully execute a tool call, you have that complete state, whereas

(32:11):
a number of the agentic frameworks
want to
be the owner of that information, or maybe it is by default, all going to be in process or in memory. And I'm curious how, with the introduction of durable execution,
those frameworks
can be used more effectively, or just maybe some of the ways that the introduction of this pattern is helping to shift the ways that people are thinking about the design of those types of systems?

Preeti Somal (32:39):
Yeah. Absolutely. You know, to someone who is spending their day and night thinking about durable execution, the fact that it's all in memory and a crash might mean that the user has to start all over again. Like, that sends shivers down my spine for sure. So, you know, what we're seeing is that the frameworks

(33:01):
don't have durability
in place, and and this is where we're doing integration.
So we have a first party integration with the OpenAI agent SDK, for instance, that brings durability
into the picture. We are also
honestly seeing,
you know, customers
building agents
without needing to use frameworks.

(33:23):
Right? And so, you know, customers are building these agents just on top of the durable execution abstractions and foundation. In particular, you know, the interesting thing is we are still early, in the sense that we believe these agents are going to need to be even longer lived. For instance, there's no reason why an agent I use has to be an interactive agent. You know? What I wanna be able to do is give the agent some work, go away, and come back after whatever amount of time and get my results. And so the agent patterns

(33:59):
are very, very much kind of the asynchronous,
long running,
durable execution patterns that Temporal has been solving for a very long time. And we're starting to see that value creation now coming through. So, you know, for the set of developers that do wanna use frameworks,

(34:19):
we are integrating with frameworks and, you know, we can kind of bring durable execution to them. But we're also seeing the usage of Temporal in just building agents. And the piece that you were referring to, which I haven't talked much about: a really compelling part of Temporal is that you can go into Temporal and you can look at the execution

(34:42):
of your workflow, and you can see exactly, you know, what services were called, what LLM calls were made. And, you know, that visual sort of observability
piece
is a big part of the value. And you can also export this history,
and we have customers that are using that for their audit and compliance needs as well.

Tobias Macey (35:03):
One of the other patterns that I've seen
for a similar use case is to
proxy through
an LLM gateway
to be the system of record for all of the interactions
between your application
and the LLM API.
TensorZero is one in particular that I'm familiar with that actually uses a ClickHouse database as that state store and will then use that information as a means to execute reinforcement learning to do fine-tuning of the model that you're using and improve efficiency and effectiveness.

(35:38):
And I'm curious
how
Temporal, or teams that are using Temporal, are maybe doing similar use cases, or what you see as some of the trade-offs between those two approaches of either using Temporal as that state store versus proxying through the LLM gateway and using something like a dedicated database as that state store? Yeah.

Preeti Somal (35:59):
I think it depends on what you're trying to achieve here. I think that use case around the TensorZero piece, you know, you could build that pretty easily on Temporal. And the value of building that on Temporal would mainly be getting the resiliency and the error handling, you know, all of these pieces

(36:20):
that durable execution brings forward.
How you make the decision on whether you use what we would call a more opinionated offering versus a general-purpose platform,
I think, is really dependent on what is the end goal that you're trying to achieve.
But we do we do see this pattern

(36:41):
where customers are using Temporal to call the LLM.
And, again, the benefit really is that the actual call gets done
from your worker that runs in your environment. And so if you've got the complexity of use cases around private hosted models

(37:01):
or data privacy pieces, etcetera, the Temporal model lets you have full control on the output of what the LLM is bringing: where you store it, how you compare it and check it and validate it, and so on.
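The control described here, validating and retrying the model's output inside your own worker, can be sketched like this in plain Python. `call_model` is a hypothetical stand-in for a private hosted model; in Temporal this logic would typically live in an activity, where the retry policy becomes declarative rather than hand-rolled.

```python
# Sketch of worker-side validation of LLM output: the raw response is
# checked (here, parsed as JSON) before anything downstream sees it,
# and bad output triggers a retry instead of propagating.
import json

def call_with_validation(call_model, prompt, max_attempts=3):
    """Invoke the model and retry until the output parses as JSON."""
    last_err = None
    for attempt in range(max_attempts):
        raw = call_model(prompt)
        try:
            return json.loads(raw)   # validate before returning
        except json.JSONDecodeError as err:
            last_err = err           # malformed output: try again
    raise RuntimeError(f"model never returned valid JSON: {last_err}")

# Simulated model that fails once, then returns valid JSON.
outputs = iter(["not json", '{"answer": 42}'])
result = call_with_validation(lambda p: next(outputs), "question")
print(result)  # {'answer': 42}
```

Because the call runs from your own worker, the raw output never has to leave your environment before it is checked, which is the data-privacy point made above.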

Tobias Macey (37:18):
I think one of the key aspects
of Temporal as the state store is that it is also
a programmatic substrate
versus something that is
maybe
more opinionated or constrained
in terms of what it is expecting to do.
And so it gives you broader flexibility in terms of how to take advantage of that state without having to necessarily

(37:42):
do additional integration work to reuse that so you can actually use the data in situ rather than having to do an extraction,
transformation,
either reimport or do a more roundabout means of using that. And I'm curious how that changes the ways that teams are
selecting
the supplemental tools that they're relying on once they do start using temporal.

Preeti Somal (38:06):
Yeah. It's a great point, because I think the core message here is Temporal is a code-first, you know, developer tool. Right? And so, I was talking about the programming languages: our goal is to meet the developer where they are, in the language of their choice. And so the value here would be that you would be able to fit your Temporal usage into your existing

(38:33):
engineering practices,
your
CI/CD pipelines, how you do testing. You know, we wanna be able to fit into your processes without inducing more overhead for you. And I think that is a big part of the decision-making criteria that goes in here, because we aren't going to come in and mandate the use of a specific programming language, and we're gonna give you the power of the SDK.

(39:00):
Now the flip of that conversation always is: well, is an opinionated system faster to use, or is that better for me? And in a lot of cases, it might be, and that's totally fine. You know, for us, again, what we are building is a platform
where we believe that the developers
can use the power of the platform

(39:22):
as they see appropriate
and, you know, fit in with how they are building their software today.

Tobias Macey (39:29):
And for teams who are starting to adopt Temporal, or who are figuring out how best to design around it, what are some of the key primitives or key design patterns that you see teams maybe either struggling to understand, or maybe they are not using Temporal

(39:50):
in the most idiomatic manner? And just some of the useful references or pieces of advice that you have for teams who are starting to tackle that design phase of: okay, Temporal seems great, durable execution seems like it would be really useful, but what do I actually do to get started?

Preeti Somal (40:07):
Yeah. So this is a really interesting topic, because the first reaction that people have is, like, it's almost magic, that kind of a reaction. Right? And what we see is that once someone understands Temporal, they cannot look back. They are essentially

(40:29):
fully on board with the programming model. But it requires them to have an open mind, because as an engineer, you've been trained for years to be thinking about all of the error scenarios, and you've got all of this code that is gonna deal with all the reliability pieces. And then someone comes and tells you: oh, you don't need any of that anymore. You know? Of course, there is some element of, you know, skepticism

(40:54):
and sort of unlearning that needs to happen here. Our recommendation always is to just get your hands dirty: try it, run some samples, you know, read the blogs. But, inherently, you've gotta try it. And once you understand
the power of the platform,
then your design decisions
become hugely simplified.

(41:16):
One other thing that we hear a bunch about, especially as we're talking to more of, like, the VP of engineering, the CIO-type audience: you know, one of their first questions is, okay, what do you replace? And what they're trying to do is pattern match. Does this mean I can get rid of my queue here, or I can do this or that? And, you know, really, "Temporal is yes, and more" is kind of the answer. Right? You have to really start thinking about the initial set of use cases

(41:45):
and learn. And then from there, what we find is that, you know, engineers are applying Temporal across all these domains that we didn't quite imagine they would think about use cases for.

Tobias Macey (42:00):
And as teams are coming up to speed with Temporal, or maybe they're evolving in terms of their sophistication of its use and the level of integration into their systems, what are some of the most interesting or innovative or unexpected ways that you've seen Temporal and the durable execution pattern used in these data and AI workloads?

Preeti Somal (42:20):
So one of the ones that I find fascinating and interesting is we had a use case come up where someone was intentionally killing their workers because they wanted to optimize the usage and the cost of the workers. So they would just go in and kill the worker, knowing that Temporal could pick up wherever

(42:42):
that worker left off. So I thought that was fascinating. The other interesting thing that I'm seeing, and, you know, I have a tremendous amount of respect for folks in the data field, is just the sheer volume
of the
data processing that's happening
and how having

(43:03):
a production-ready, at-scale orchestration and durable execution platform is condensing the time it takes. So we had a customer tell us that the pipelines they were running took, like, eighteen hours to run. They've been able to condense that down to five minutes, because Temporal has forced them to think about what all the various steps in this process are and how they can

(43:29):
sequence these steps so that they can actually go faster.
And if something fails, they don't need to start all over again.
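The speedup pattern described here, decomposing one long serial pipeline into explicit steps so independent ones can run concurrently, can be sketched as follows. The step names and the extract function are purely illustrative, not the customer's code.

```python
# Sketch of the decomposition win: once a pipeline is broken into
# explicit, independent steps, they can fan out across workers instead
# of running in one long serial pass, and a failure only re-runs the
# failed step rather than the whole pipeline.
from concurrent.futures import ThreadPoolExecutor

def extract(source):
    # Stand-in for a long-running, I/O-bound extract step.
    return f"rows from {source}"

sources = ["orders", "users", "events"]

# Serial: total wall time is the sum of every step's duration.
serial = [extract(s) for s in sources]

# Concurrent: independent extracts run at the same time; map preserves order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(extract, sources))

print(parallel)  # ['rows from orders', 'rows from users', 'rows from events']
```

With durable execution on top, each step's result is also recorded, so a crash mid-pipeline resumes from the last completed step instead of hour zero.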

Tobias Macey (43:37):
And as you have been working in this space and
working with the company and the community around Temporal
and
understanding
more deeply
the capabilities that it provides
and the use cases that it can be applied to, what are some of the most interesting or unexpected or challenging lessons that you've learned

(43:57):
personally?

Preeti Somal (43:58):
Wow. So I think the main lesson for me is, if you really think about it, what durable execution and Temporal do is put the onus of reliability on the Temporal server. And, of course, our business model is one where we run Temporal

(44:19):
Cloud, and our responsibility there is really high, because, you know, the core of the promise we're making is: we will handle reliability for you. And the way we do that is by putting that problem in the Temporal server. And so, of course, the Temporal server has to be

(44:40):
incredibly reliable.
And I think that is really the main one. It's not surprising when you think about it, but living that out with a cloud delivery model, you know, especially with some of the outages that we've been seeing as well, and making sure that we live up to our promise there: that's something that we think about constantly.

Tobias Macey (45:04):
And for people who are designing
systems, thinking about how best to manage
the reliability
of their data workflows or their AI applications,
what are the situations where you would advise against the use of Temporal or durable execution as a pattern?

Preeti Somal (45:22):
I think situations where, you know, I know you used the word designing them for reliability, so it makes it harder for me to answer that. I would say situations where, you know, a crash happens and you really don't care about it: you may not need Temporal at that point. But I think anytime where you need to scale and be reliable,

(45:47):
we do believe that, you know, you should be looking at Temporal and the value it brings there.

Tobias Macey (45:54):
And I guess put another way, what are the situations where the incorporation
of Temporal
adds excessive complexity
or unnecessary
coordination
to an application design?

Preeti Somal (46:07):
Again, I think if you're doing early prototyping and you have not established your business value yet, or, you know, you've just got some toy agents and you're trying to figure out what's really gonna be the core IP: you know, maybe you don't need Temporal there.

(46:28):
I think, you know, this notion of the complexity piece is something that we are working towards doing some more DevRel education on, because we really do believe that this complexity thing is a false

(46:48):
argument here, because the whole premise around Temporal is to make your life easier. And so, you know, the question really becomes: where is that complexity coming from? Is it learning Temporal? Is it running Temporal? And we really believe that, with all the progress we've made, it should be a non-argument.

(47:09):
But, of course, I work here, and, you know, I'm willing to be convinced otherwise and continue to work towards improving.

Tobias Macey (47:18):
And you mentioned already that you are trying to combat the tendency to form strong opinions or build excessive integrations into Temporal for specific use cases. But I'm wondering, what are some of the things you have planned for the future of Temporal and the durable execution pattern, in particular with an eye towards how it can be used in these data and AI systems?

Preeti Somal (47:42):
Yeah. So I think the main focus for us is to continue to improve the onboarding experience onto Temporal and, really, the durable execution constructs that have any rough edges: how do we make sure that we are continuing to sand them down? And all of this is really

(48:06):
tying to our core focus on the developer.
And so what you will see from us is continuing
engagement
on what the developer pain points and scenarios are. And I'll give you a concrete example. For instance,
versioning of workflows
and how do you deploy
new versions, how do you sort of incrementally

(48:29):
roll out traffic
across these versions. Maybe there was a long-running workflow on a previous version, and you wanna make sure that you can execute multiple versions and wait till that workflow is completed. We're doing a lot of product work around simplifying
that whole space for the developer community.

(48:49):
So that's the kind of thing that you'll see from us: just listening to our community and the developers and working hard to improve the overall experience.

Tobias Macey (49:00):
Are there any other aspects of durable
execution as a pattern,
Temporal as an implementation
of that, and the application
of those capabilities to data and AI systems that we didn't discuss yet that you would like to cover before we close out the show?

Preeti Somal (49:17):
I think really quickly: we did touch on Nexus a little bit during the conversation, but, you know, if there are folks listening to this that haven't checked out Nexus, I definitely recommend taking a look there. Nexus is also in open source, and this is how we believe organizations

(49:37):
that have teams working on different parts of the stack can actually build applications
that can call across boundaries.
So that would be my only callout here.

Tobias Macey (49:51):
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Temporal team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Preeti Somal (50:08):
Gosh. Not living in the space, I'm not sure I have a great answer for you. You know, I think one of the things that does come to my mind is just around effective use of resources, you know, whether those are costly GPUs, etcetera. Just,

(50:29):
you know, more tooling around helping manage those seems like a pattern. And we're seeing people use Temporal to solve that, and maybe someday there will be more opinionated tooling around that.

Tobias Macey (50:42):
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experiences
on the ways that Temporal and durable execution can be used to simplify
the design and implementation
of these data intensive systems and ways to reduce the complexity
of managing

(51:02):
both the business logic and the resilience and reliability.
So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day.

Preeti Somal (51:11):
Thank you so much. It was a pleasure to chat with you today.

Tobias Macey (51:16):
Thank you for listening. Don't forget to check out our other shows. The Data Engineering podcast covers the latest on modern data management, and podcast.init
covers the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts at AI engineering podcast dot com with your story.