
June 17, 2025 61 mins
Summary
In this episode of the Data Engineering Podcast we welcome back Nick Schrock, CTO and founder of Dagster Labs, to discuss the evolving landscape of data engineering in the age of AI. As AI begins to impact data platforms and the role of data engineers, Nick shares his insights on how it will ultimately enhance productivity and expand software engineering's scope. He delves into the current state of AI adoption, the importance of maintaining core data engineering principles, and the need for human oversight when leveraging AI tools effectively. Nick also introduces Dagster's new components feature, designed to modularize and standardize data transformation processes, making it easier for teams to collaborate and integrate AI into their workflows. Join in to explore the future of data engineering, the potential for AI to abstract away complexity, and the importance of open standards in preventing walled gardens in the tech industry.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial. 
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome — also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business — automatically detecting anomalies before your CEO does. It’s 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what “done” looks like — so you can stop fighting over column names, and start trusting your data again. Whether you’re a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and, in rare cases, a newfound love of data. Sign up at dataengineeringpodcast.com/soda for a chance to win a $1,000-plus custom mechanical keyboard and to follow Soda's launch week, which starts on June 9.

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
Poor quality data keeps you from building best in class AI solutions. It costs you money and wastes precious engineering hours. There is a better way.
Coresignal's multi-source, enriched, cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files.

(00:35):
Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries.
Go to dataengineeringpodcast.com/coresignal
and try Coresignal's self-service platform for free today.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust?
Are broken pipelines and silent schema changes wreaking havoc on your analytics?

(00:58):
You may be experiencing symptoms of undiagnosed
data quality syndrome, also known as UDQS.
Ask your data team about Soda.
With Soda metrics observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1,100,000,000

(01:21):
rows in just sixty four seconds.
And with collaborative data contracts, engineers and business can finally agree on what done looks like so you can stop fighting over column names and start trusting your data again.
Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you.
Side effects of implementing Soda may include increased trust in your metrics, reduced late night Slack emergencies, spontaneous high fives across departments, fewer meetings and less back and forth with business stakeholders, and in rare cases, a newfound love of data.

(01:50):
Sign up today to get a chance to win a $1,000-plus custom mechanical keyboard.
Visit dataengineeringpodcast.com/soda
to sign up and follow Soda's launch week, which starts on June 9.
Your host is Tobias Macey, and today I'm welcoming back Nick Schrock to talk about lowering the barrier to entry for data platform consumers and the impact that the current era of AI has had on data engineering.

(02:14):
And so, Nick, for folks who haven't heard you before in your numerous past appearances, can you give a quick introduction?
Yeah. Tobias, thanks for having me back. It's always a pleasure. So my name is Nick Schrock. I'm the CTO and founder of Dagster Labs. We're the company behind Dagster, which is a data orchestration framework.
And, yeah, kinda my background is I cut my teeth at Facebook engineering.

(02:36):
And the project I was most known for prior to Dagster was GraphQL, which I initially created, and we
open sourced it. Well, first, we built it inside Facebook, and then we open sourced it. And it became a broadly used technology. But then I moved on to data engineering and data platforms, and
I've been working on Dagster for quite a while now.

(02:58):
And so we don't need to get too much into the usual flow because we've covered that in past episodes. But I think that given the
current state of the industry and all of the hype around AI, I'm wondering if you can just start by
giving your summary of the impact that the
overall adoption and growth of AI and automation and agents has had on data platforms and data teams?

(03:26):
Yeah. That's an interesting
question.
I guess, just to set a framework for this, I think there are varying degrees
of how people view AI.
I like to joke around and call it, like there's, like, AI boomers
and AI doomers.
And AI boomers are like, oh, we've seen this before, blah blah blah. And AI doomers

(03:48):
are like, this is the end of all
human labor.
It's going to zero. We should just bomb the data centers, literally.
And I would call myself squarely in the middle. I think it's gonna be incredibly disruptive. I think it's gonna be
as important as, say, the transition from the industrial

(04:08):
to the information age or the, pre industrial to industrial age.
But I think that it will be a massive
productivity boost to lots of people, including engineers. And
far from software engineers going away, I actually think it will expand the number of people writing software and will make them more leveraged. So, you know, in terms of its impact on software engineering in this industry, I'm very far

(04:33):
from being a doomer. I think it will be a renaissance in software engineering, and it's super exciting, but it will fundamentally change that. In terms of its impact on the data platform space specifically,
I think in reality in the day to day lives of practitioners
working in data platforms, it's kind of like there's been an earthquake.

(04:55):
There's a tsunami out there, but it hasn't really hit shore yet.
And what I mean by that is that I think lots of people are using AI tools
to write software
in an accelerated manner.
I think lots of people are starting to work
on AI projects at their various organizations,

(05:16):
you know, especially
use cases involving structured data.
And I think some of their
tools outside of their code editor do have AI features,
but I don't think it has
fundamentally
changed their world
yet. So I think everyone's kind of
waiting for it

(05:37):
and, you know, figuring out how to adjust to this new future.
But I think we're at, like, the beginning of the inflection point, so it's kind of this
odd state for most people where
their day to day hasn't changed that much, but they know it's going to in some time horizon. I don't know if that resonates with you. But

(05:57):
Yeah. It absolutely does. And I was actually,
recently giving a presentation
for another company to their engineering team
because they wanted to get my thoughts on the future of data engineering, the impact of AI.
And
I think that there's definitely a lot
of new work to be done,
but the

(06:18):
fundamentals of the work don't change. There's definitely
a lot of
change in terms of the specifics of the tooling, but in principle, the job of data engineers, and the way that I distilled it for the presentation, was that the role of data engineers is to turn raw information into knowledge and enable the business to make use of that knowledge to

(06:40):
either make better products, power the applications that they run, make better decisions, etcetera.
And using data to power these different AI systems
maybe takes a different shape because you're bringing in a vector index versus just a star schema, or
maybe you're able to unlock more value out of the unstructured data that you've been storing since Hadoop hit the scene.

(07:04):
But,
ultimately,
the core fundamentals of the job are the same where you find data that can be used for something. You run it through some sort of transformation. You get it into a manner where you imbue context and semantics into the data beyond just the raw bits, and and then you feed that into some downstream system, whether that's business intelligence or an LLM or web dashboard or whatever might be the case. So the work is fundamentally the same. It's just the shape of it is evolving a little bit, and the speed is probably increasing.

(07:37):
And I think that there there's another interesting aspect of it where the way that I build the distinction is that you're either building for AI where you're using data to feed it into an LLM or you're building with AI where you're actually using the LLM to generate that transformation and help you iterate faster and find anomalies, etcetera.
Yeah. That's right. That makes sense. You know? An analogy I like to use is that when accountants first saw the spreadsheet, they were probably like, it's over. Right? And when people saw calculators, they were like, oh, no one's gonna need to know math anymore. That's fundamentally not true. You need to know the principles in order to evaluate

(08:15):
and use the tools.
I do think that
you know, data engineering especially is so critical, and evaluating correctness requires so much business context
and localized context that it will, again, fundamentally change the practice, but there will need to be a human who has a deep understanding of technical systems and business context in order to make these things work.

(08:40):
I think too that's interesting because
with the injection
of these LLMs and AIs and Copilots
into things like software engineering or even some of the, like, the Microsoft Copilot getting embedded into various office suites and Gemini trying to hook into all of the Google products. It highlights and accentuates

(09:02):
the work that data engineers have been doing in the background for so long because everybody's trying to use these AIs. And, ultimately, it works in some cases, but as soon as you try to go outside of the bounds of what the system was specifically built for, for instance, if you're a software engineer and you're iterating with Copilot in your editor, and then all of a sudden you say, oh, I actually need to build something that touches on some other data system that isn't directly embedded in my application.

(09:30):
All of a sudden, you have a problem because the LLM doesn't know anything about it, and then you have to do your own little bit of data engineering to be able to pull the information into that context to be able to enable the AI to do what you want it to do. So Right. It's just kind of bringing more of that work into everybody else's day to day that up until then was just, oh, hey. I need this data, so I'm gonna go throw it over to the data team and ask them to do it for me.

(09:53):
Yep. Makes sense.
And so
digging now into what you're building at Dagster,
again, we've talked about the fundamentals of it and some of its evolution in previous episodes.
But for people who maybe listened to the last episode, which was, I think, maybe a year or so ago, I'm wondering if you can give a bit of an overview of what has changed, some of the new stresses that

(10:18):
the evolution of the data ecosystem and the impact of AI have had on the ways that you think about data orchestration and the role of Dagster in the overall stack.
Yeah. So what's happened in the last year, I'll do it from a very, like, Dagster-centric way, since that's the universe that I'm in. What we see

(10:41):
is
people
doing much more advanced things with the orchestration system,
wanting to get much deeper observability
into their systems. And then also,
you know, we also see
more and more
centralized data platform teams and data engineers who are building frameworks for less technical stakeholders.

(11:06):
And that fundamentally changes their job
from building data pipelines directly
to dynamically building data pipelines based on what some other stakeholder wants them to do. And that kinda, like, led to the product developments that we're doing today.
You know, I think the other thing that we've seen

(11:28):
is that, in a predictable fashion, what I'll call the data hyperscalers,
Snowflake and Databricks, are beginning
to attempt to consolidate as much as possible and build tools in every single vertical. And so that is quite interesting.
The counterforce
to that
is the full embrace of open table formats,

(11:50):
which is a big trend and sort of the standardization
of the term lakehouse
to describe
data stacks that are built over these open table formats. So I think that's a huge megatrend as well that we're seeing. Iceberg is kind of similar
to AI in that it's kind of this, like, tsunami that is coming. But, you know, adoption is still, I would call it, modest. But it will come, and it's pretty exciting.

(12:20):
And
in terms of Dagster itself, I know that in one of your recent releases, you introduced this concept of components where you're focusing on trying to modularize and standardize
the different
elements of
the transformation flow, allow for people to
be able to have reusable

(12:42):
and more quickly
instantiated
data assets based on particular
concepts and guardrails,
which is a fairly notable change to the way that the framework has worked up until now where if you wanted to do anything, you needed to dig into some Python code, figure out how it all wires together.

(13:02):
And I'm wondering how that has changed the ways that teams who are using Dagster, either up until now or in particular people who are newly onboarding onto Dagster,
how that changes the overall collaboration patterns
for people who are consuming data or working with data. I'm thinking in terms of data analysts, analytics engineers, but also application engineers

(13:23):
and how that changes the work to be done for these data platform teams and people who are closer into
the infrastructure and the technical details of the data pipelines.
Yeah. Lot to unpack there. The,
you know, the components feature
has been in preview for a couple months, and we'll be releasing it as a release candidate in July. So we've been working with select design partners

(13:47):
to work on it. I guess I'll start with the trends we saw among both our own users and also data platform teams that were using other orchestrators. And we kinda, like, saw a bunch of patterns and converged on a single project, which we think addresses a bunch of the issues. I guess I mentioned it in the last question, but, like, tons of people are dynamically building data pipelines,

(14:11):
meaning that they're not directly just authoring tasks. They're not just directly building operators in Airflow. They're not just writing the asset functions in Dagster. They are working at a higher level of abstraction and rolling their own systems to programmatically generate those things based on higher level APIs that they present to their users. Okay. There's that. Many of them who are doing that

(14:32):
are doing that with a config-driven or some sort of front end. Right? YAML, JSON, even, you know, persisting it in a database, you know, all sorts of stuff. With that generation, they also programmatically generate metadata and apply policy across their data platform.
And, you know, I think the other thing is that the data orchestrators

(14:52):
are all introducing
more concepts.
So tasks,
assets, we have asset checks.
There's metadata concepts, sensors. You know, there's like a whole bunch of individualized abstractions.
Usually, when someone's interacting with the orchestrator, they often are integrating with a specific technology,

(15:13):
and they don't wanna think in terms of those lower level things.
For example,
Dagster, when it integrates with dbt,
ingests the entire model graph and surfaces each one as a software-defined asset, which is kinda what makes our dbt integration the best in the business. It's code that generates those things. It programmatically
scrapes the model graph in dbt and code-generates stuff. And the job to be done for the orchestrator is, like, integrating with the project. And then lastly (so there's a bunch of stuff, I know), they wanna be able to bring in more stakeholders with friendlier interfaces and ideally have AI-friendly codegen targets. So we saw all those trends,

(15:53):
programmatic generation of pipelines, config driven pipelines,
the desire for a high-level abstraction, AI-native codegen, and components is what came out of that, which
is kind of the resulting project. And I think the way to think about it generally
is that it provides an integrated way with the framework to, in a principled way,

(16:16):
programmatically
generate
definitions. Right? And the killer use case for that typically
is a YAML front end that you can present to your stakeholders.
But I also wanna emphasize, there are lots of people who don't wanna program in YAML, and trust me, I understand completely.
I like types and Turing-complete languages and all sorts of stuff. So it's not just YAML. It's also a lightweight Python API on top of that. But in effect, you kind of separate

(16:46):
metadata
from the underlying
complicated code. And that metadata can be expressed
in YAML, right, or in very lightweight Python. Right? At the end, it's like Pydantic models. So you can program against that if you prefer it. But in that way, we have a native way to programmatically
build definitions in the framework,

(17:07):
and it changes things for the Pythonistas
out there. It makes it so you defer definition generation
until after the Python import process is complete.
And I cannot emphasize enough how important that is for building reliable systems that dynamically generate these things. If you're using Airflow
or Dagster today, when you programmatically generate the definitions, it's happening at Python import time. And if you're talking to databases or doing something computationally expensive

(17:34):
or if you wanna unit test that thing, you do not want that to happen, actually.
So kind of like the core thing here is that components is a composability abstraction that allows you to dynamically, and in a deferred way, load up definitions.
And by definitions, it ends up being the structure of the data pipelines. But the killer use case (I was kind of talking in highfalutin terms there), I think the killer use case

(17:57):
that people
understand, and what's meeting them where they are,
is providing an integrated,
tool rich,
self documenting YAML DSL
with a pluggable back end for your users.
And it really is a lovely interface
between data platform engineers and their stakeholders.
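
As a rough illustration of the pattern Nick describes, declarative metadata kept separate from the complicated code, with definitions built programmatically from it, here is a minimal sketch in plain Dagster and Pydantic. The spec fields and the ingest step are hypothetical; only `asset`, `AssetsDefinition`, and `Definitions` are standard Dagster APIs, and the real components system adds deferred loading, YAML parsing, and tooling on top of this idea.

```python
# Minimal sketch: declarative metadata (a Pydantic model) separated from code,
# with Dagster definitions generated programmatically from it. Field names and
# the ingest logic are illustrative, not part of Dagster's components API.
from pydantic import BaseModel
from dagster import AssetsDefinition, Definitions, asset


class IngestSpec(BaseModel):
    """The lightweight metadata a stakeholder edits (could equally live in YAML)."""

    table: str
    source: str
    group: str = "ingest"


def build_ingest_asset(spec: IngestSpec) -> AssetsDefinition:
    """Turn one declarative spec into a Dagster asset definition."""

    @asset(name=spec.table, group_name=spec.group)
    def _ingest() -> None:
        # Hypothetical: call whatever loader the platform team has standardized on.
        print(f"loading {spec.source} into {spec.table}")

    return _ingest


specs = [
    IngestSpec(table="orders", source="postgres://app/orders"),
    IngestSpec(table="customers", source="postgres://app/customers"),
]

defs = Definitions(assets=[build_ingest_asset(s) for s in specs])
```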

(18:20):
So
as you're describing that and the philosophy around it, it puts me in mind of a lot of things, like the separation of concerns that Kubernetes is focused on, where you have the infrastructure and compute layer that the DevOps and the infrastructure engineers are responsible for and then the API and the user-space layer that people who are building applications that they want to deploy are integrating with. And so it gives a

(18:46):
shared infrastructure
with a clear delineation
between responsibilities
for people to be able to build on top of, which has then enabled a massive ecosystem of
other capabilities built across both of those dividing lines.
Another thing that comes to mind is the very declarative infrastructure-as-code

(19:08):
community that has built up around things like Terraform and Pulumi, where you have cloud providers. All the cloud providers, as you're deploying these resources, have state that needs to be maintained and tracked.
And, similarly, in data engineering, you're building
complicated
resources that interact with each other in dynamic ways that are all dependent on state that you need to be able to understand and maintain and operate across. So I think that

(19:33):
what you're building with components
brings a lot of those same ideas into the space of actually building these data pipelines
where you can have that interface boundary between the platform team and the consumers thereof,
in a similar way as what we've done in the kind of cloud native ecosystem.
I couldn't have said it better myself, Tobias. That was great. No. Our data engineer who runs our own platform,

(19:58):
which is fairly substantial,
actually, now that we're a more mature SaaS business,
the way he expressed it, he's like,
finally, I have a front end for the data platform,
which I think is kind of, he's saying what you were saying in a much more simplified term, where it's like, I manage all this state, but there's just this clear abstraction layer and there's a front end for it. And even when you're working by yourself,

(20:24):
it's very useful to have this
abstraction
so you can kind of switch your brain when you're working on stuff. And then there's also, like, a very concrete
advantage to slapping a bunch of metadata in, like, a separate spot, that either is Python with, like, no dependencies, or YAML, which can be loaded dynamically, which is that you can, like, do syntax checks very, very quickly.

(20:49):
So it can speed developer loops and CI a lot. If you're moving a lot of activity into, like, the YAML or metadata space, the feedback loop's super fast too. So there's kind of interesting product implications as well.
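
A sketch of what that fast feedback loop can look like: validating a stakeholder-edited YAML file against a schema without importing any heavy pipeline code. The `TableSpec` fields and the file path are made up for illustration, and PyYAML plus Pydantic are assumed; this is not the components CLI, just the general idea of a cheap schema check you could run in CI.

```python
# Sketch of a cheap "syntax check" for a YAML front end, suitable for a CI step.
# The schema is hypothetical; the point is that validating YAML against a
# Pydantic model needs no database connections or heavy imports.
import sys

import yaml
from pydantic import BaseModel, ValidationError


class TableSpec(BaseModel):
    table: str
    source: str
    schedule: str = "@daily"


def check(path: str) -> int:
    with open(path) as f:
        raw = yaml.safe_load(f)
    try:
        TableSpec.model_validate(raw)
    except ValidationError as err:
        print(f"{path} failed validation:\n{err}")
        return 1
    print(f"{path} OK")
    return 0


if __name__ == "__main__":
    sys.exit(check(sys.argv[1]))
```
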
And then the topic that we, I think, have to touch on, because every time somebody says, oh, I've got this great abstraction layer, it's gonna make everything easier, you don't have to worry about it, which puts me in mind of a lot of the various cycles of low code or no code tooling

(21:16):
is that everybody says, great. It'll be so much easier for you to build these things. I've worried about all the complex stuff for you,
and that works well to begin with. And then you have to start doing things that are specific to your problem domain, addressing edge cases, and then you start bumping up against the
capabilities of the system, and you end up having to just drop down to a lower and more complex level to be able to actually get the work done. And so I'm wondering how you've thought about that balance of making things very easy to use, opinionated,

(21:46):
constrained versus
maintaining the flexibility and adaptability that's necessary for such a complex domain.
Yeah.
And I think this comes from having a lot of experience.
Where things go wrong is
where people think
they can eliminate more complexity than they actually can with a framework.

(22:10):
And it imposes too many constraints
on itself,
and it's not sufficiently
customizable.
The reality
is that
every business
is complicated
and specific, and everyone is in their own context.
And so you can't know the complexity of all those things.

(22:32):
What you can do is provide tools and abstractions and infrastructure
so that
platform engineers and engineers in general
can subdivide
that complexity
into consumable,
understandable
parts.
And then the other thing a system can do
is provide
cross cutting complexity reduction that is domain neutral. So I think if you understand

(22:58):
those two things
and do that, you get this right balance
of having
a thing that actually reduces the essential complexity of a program
as well as allows you to scale the program well.
So,
for example,
in, like, components,
we built all this tooling around the YAML front end. So there's, like, really nice

(23:21):
error messages.
You know, you can run it in CI. There's a CLI interface to it, all this stuff. That is just complexity reduction that cuts across
all domains. Right? It's basically useful for anyone who's using that technology.
Then there's the other stuff about, like, how do you make it so that, you know, people can
kind of
take the complexity of their world

(23:42):
and put it into a containable chunk.
And that's why this is a Python native system with porous borders between YAML and Python,
where
the data platform engineers are empowered to build custom components,
right, that have a structured front end. They can package up this complexity, have it be self documenting, have a nice YAML front end for it. But we're not pretending,

(24:05):
like the job of building that custom component is not going to be difficult.
And it's not gonna be difficult because we're making it difficult. It's difficult because
it is difficult.
There are just problems in your world that we cannot know and that are complicated. And you are a smart person, and we wanna get out of your way when you're doing that. But we want you to be able to capture that, like, complexity

(24:30):
in a nice consumable chunk and present it to your stakeholders.
So I think it's just like this knowing
what you can assume
control over and support and still providing
that flexibility
to the user of the framework so that it's adaptable to their own needs.
I think another interesting element of where we are in the industry, both data and otherwise,

(24:56):
is that the introduction
of generative AI and the capabilities that that engenders
has brought
the use of data
more fully into the space of application engineering,
where up until now, the application
had its own data that it cared about and that it maintained that data engineers would then

(25:18):
extract and rip out of context and then have to rebuild that context for various business use cases.
And now we've come full circle where the data across the organization
is now getting fed back into the application context via these LLMs and things like rag systems and fine tuning and needing to be able to do things like manage semantic memory for the LLMs, etcetera.

(25:41):
And so that means that application engineers need to be more aware of the data that exists within the organization to be able to power those use cases and more
empowered to be able to actually operate on that data to address the needs of the application
in the ways that these LLMs are using that context. And I'm wondering
how you're seeing that and the work that you're doing with components play out in terms of bringing application engineers

(26:07):
more into the space of operating across organizational
data and the interaction patterns that they have with data platform teams, data analysts, business stakeholders, etcetera.
The one way I think about it is that in the AI era, and this was becoming more and more true as more data platform
assets were being incorporated into production app logic, but this is just gonna supercharge that, the phrase is, I think, data engineering is becoming software engineering, and at the same time, software engineering is also becoming

(26:40):
partially data engineering. Because you need to do some data engineering on your application data in order to correctly feed it into things that feed back into your application. So I think there's two things in Dagster that help with that. One
is that we have developed this protocol
called Pipes,
which allows you to invoke
Dagster-native compute

(27:01):
in external programming languages in a super lightweight way.
So we have Pipes clients for
TypeScript,
Rust,
Java, I think a couple other languages. I know some users have done, like, some in C#. But, effectively, that allows a user
to
write code in their native language, and then we provide lightweight APIs to stream back metadata

(27:27):
back to us. And we also
launch that process.
Well, we actually don't launch it. We, in a pluggable way, can inject context into that process so that they can get, like, what partition is being materialized or any other sort of config.
So we kind of have a back end protocol,
so you can write data processing logic

(27:47):
that needs to be in the data platform in the language of your choice. And then second of all,
components
is sort of the front end for that, meaning that when you're in the orchestrator
land
and you need to
connect
the business logic you wrote in some other programming language to where it needs to execute in the orchestrator and the metadata around that, you can use components

(28:09):
in order to kind of set that up.
And the goal of components is so that someone who's sort of, like, external to the data platform can sort of wander in
and do what they need to do without learning a complicated framework. They just kinda, like, you know, see where their teammate
put a similar thing. Maybe they copy and paste the file or they know the scaffolding command that scaffolds up the same thing, and then they can just, like, edit some configuration.

(28:37):
There's a type ahead in the editor, and there's embedded documentation.
They can verify it, and then they can go on their merry way. It's kinda two-sided. We want Dagster to be a multilingual ecosystem,
be able to have a Dagster-native experience while having a very lightweight mechanism for doing that in other programming languages,
and then have a very easy way for a stakeholder to come in

(29:00):
and
incorporate and integrate their compute into the data platform.
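
For a concrete sense of the Pipes protocol from the external process's side, here is a minimal Python sketch using the `dagster-pipes` package (the TypeScript, Rust, and Java clients speak the same protocol). The work being done is a placeholder, and treat the details as a sketch rather than a full integration; `open_dagster_pipes` and `report_asset_materialization` are the core of the Pipes API as documented.

```python
# Sketch of an external script participating in Dagster Pipes. The row count is
# a placeholder for real work; the orchestrator side would launch this script
# with a Pipes client and receive the reported metadata (and can inject context
# such as which partition is being materialized).
from dagster_pipes import open_dagster_pipes

with open_dagster_pipes() as context:
    rows_written = 42  # stand-in for the real data processing this script does

    # Stream structured metadata back to Dagster for the asset being materialized.
    context.report_asset_materialization(metadata={"rows_written": rows_written})
```
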
One of the interesting things that I'm dealing with in my own usage of Dagster right now is that we have built up a set of pipelines,
asset definitions, etcetera,
that are running in production. They all do what we need them to do. But as the

(29:21):
use of AI
moves more into that application layer, the application engineers need to be able to operate on data and be able to fetch and transform data
in their own work. And so
I'm stuck with figuring out, okay, well, how do I onboard people into the data platform more easily? And so one of the
objectives

(29:41):
is to do what you said and turn the existing
repo of Dagster code into more of a set of platform
capabilities
that maybe get published as Python packages or what have you or these components
so that the application engineers can actually write their
asset logic
next to the code that they care about and, you know, their Django app or whatever it might be

(30:06):
so that there are no repository boundaries that they have to cross to be able to get their work done so that they don't have to have any
boundaries in terms of a hand off to another teammate just to be able to close the loop on the thing that they care about. And so I'm wondering how you're seeing this introduction of components and the
constructs that you've built up in Dagster up till now

(30:28):
enable
situations like that where you have that core capability of one team manages the
data ingest, manages
the definition of these assets, and then being able to
hand off those assets to another team, particularly when you're in the case of, oh, I'm running on my laptop, and I need to be able to make sure that all the pipeline does what I want it to do so that I can make sure that my feature works on this data the way that I presume without having to replicate all of the data across multiple different environment boundaries.

(30:58):
Yeah. There's a lot in that. You know, I think at a very basic
level,
there's embedded documentation capabilities
in Dagster
where,
you know, you can have long-form descriptions that are Markdown formatted that then appear
on kind of the home page
for the asset in Dagster, and a team can use that to establish a norm. They're like, hey, if you visit this home page, there's like a little code snippet to know how to read it or something, or a link to the right tool to read it.
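
A small sketch of the kind of embedded documentation Nick mentions: a Markdown-formatted description attached to an asset so it shows up on that asset's page in the Dagster UI. The table name and the read snippet are invented; `description` on `@asset` is a real parameter.

```python
# Sketch: long-form documentation attached to an asset so a teammate landing on
# the asset's page knows how to read the underlying table. The table and SQL
# snippet are illustrative.
from dagster import asset


@asset(
    description="""\
**orders_cleaned**: deduplicated orders, refreshed daily.

Read it with:

    SELECT * FROM analytics.orders_cleaned
""",
)
def orders_cleaned() -> None:
    ...
```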

(31:30):
Dagster itself doesn't kind of enforce
where or how you store anything. That's one of the other, you know, it's an example of a problem where we're like, hey, the world is complicated. We'll provide some built-in integrations for stuff, but in the end, probably, you have to control what's going on. I guess you kinda spoke to two things, to repeat what you were saying. One is kinda like, I am

(31:54):
an application engineer,
and I wanna access the underlying dataset,
like, literally the table in Snowflake or something. Right?
In that way,
we are much more of just a nexus of metadata and documentation
that you can point your users to. And it can make it very smooth for you, the platform engineer, to add information to that because you can just, like, add stuff to your source,

(32:20):
add stuff into your repo, and then it gets exposed in a very accessible tool.
So that's cool.
Then there's the notion of the
application engineer actually interacting
with
the Dagster platform and adding stuff to it, maybe in a different repo. And right now, certainly, you know, in that scenario that I talked about, the way that we would envision it is that even with components, they would still have to go into a repo and submit a PR and go through a process.

(32:48):
What's on our road map, however,
is in-app editing of these component
YAML specifications
or the front end, if you will. And we did a hackathon where we prototyped it, and it's, like, super exciting. Because, in the end,
from the user's perspective, it feels like in-app editing. But in the background, it's actually submitting a PR on your behalf and triggering CI.

(33:15):
You know? But from the user's perspective, it will be, like, green, and then they can just hit save, pretty much, and it'll submit the change to the platform.
And we think that is
super exciting. The other thing that this enables users to do, and we might go in this direction as well, because people do it already,
is they have

(33:36):
config systems that they expose
to users in their native repo, and then they set it up so that Dagster programmatically
fetches
those configs from elsewhere and dynamically constructs the pipelines. That's, like, another
approach that I think is, like, more advanced, but other teams are doing it today.

(33:56):
And, likely, we will support that as well in the future.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.
Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

(34:22):
And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold
today for the details.
And then the other interesting
work that you've done is alongside the work of components, you've introduced this new DG CLI.

(34:46):
You've introduced new ways of thinking about the structuring of your Dagster projects,
and I'm wondering
how that has
changed
the
patterns that you've seen teams building with Dagster
effectuate
and how that simplifies the work of being able to bootstrap new capabilities or new data assets within the overall platform

(35:10):
implementation? So I think what dg really provides,
it's just the shorthand for the CLI,
is an opinionated
project layout
sort of inspired
by the
Ruby on Rails
style. You scaffold things. You don't have to manually import them. You can enforce conventions because you can customize the scaffolding.

(35:31):
So it creates the config file associated with the integration in the same exact spot in every single
place where you instantiate it, and it has some schema prefilled or something. It sounds simple, but it actually, like, really
simplifies
dealing with it, it just reduces decision fatigue a lot. You know, concretely for Dagster

(35:52):
users,
the user friction we wanted to solve with this as well
is what
I call the import circus where, you know, if you have like, you split all your stuff among a bunch of modules.
In order to construct the Dagster Definitions object at the root of your project,
you often have to do a bajillion imports, or then we have these facilities which kind of, like, dynamically load symbols from another Python module.

(36:18):
And that was just a pain in the butt, and it made it really hard to reorganize a project. And we instead just manage that for you. So with the new project layout, it's just way more elegant, both because there's less code importing stuff around, but, very importantly, it makes it dramatically
easier to reorganize a project. Because, like, as you onboard new teams and stuff, you just wanna be able to move stuff around. And if you can make that seamless, that is great. It also makes it so that changes are far more localized

(36:51):
because only the file that moves gets changed, not, like, all the bajillion places that import it. That's like a trivial way of describing it, but it has a bunch of side effects. One of the hard things about building one of these projects is, like, how do I organize it? Do I subdivide the repo between teams?
Do I do it by the Dagster abstraction,

(37:11):
or do I do it by some other
dimension? Like, it's a multidimensional
problem. What we found
is that it's much, much more elegant actually
to
subdivide the project
at the technology level.
Meaning that you have your dbt stuff here,
your Sling stuff here, your Fivetran stuff here, your Kubernetes

(37:33):
ad hoc jobs in this folder
because it allows you
to localize
all the technology
specific complexity
in a specific subfolder
and reuse it there. And then the people who are dealing with other parts of the platform don't see it or think about it or anything.

(37:54):
And, typically,
when you're working
and when you're writing code in
the data platform context, you're usually doing it in the context of a single technology.
Right? You're, like, going in and, like, I'm changing the dbt models or I'm changing this ingest.
And the only time when the cross technology stuff matters a lot is when you're doing integration testing, reviewing it in the UI.

(38:16):
So
you organize the code
by technology, but then you allow other cross cutting views in the UI or, like, in the output of the CLI tool. So I think that was an interesting insight. That is not obvious
when you kind of first start building one of these
platforms.
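
To make the organize-by-technology idea concrete, a hypothetical layout might look like the tree below: each integration's complexity stays in its own folder, and cross-cutting views come from the UI or CLI rather than from the directory structure. The folder names are illustrative, not a prescribed dg layout.

```
my_platform/
  defs/
    dbt/          # dbt project integration and model-level config
    sling/        # ingest syncs and their connection config
    fivetran/     # connector definitions
    k8s_jobs/     # ad hoc Kubernetes jobs
    shared/       # resources used across technologies
```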

(38:37):
And
to that point of
slicing along the technology boundaries, I think that also lines up fairly well with the data asset constructs because any individual asset is largely going to be owned within a technology boundary. So a dbt model,
a table that is generated from an Airbyte or a Fivetran ingest,

(39:00):
the
S3 object that gets generated from some external process.
One of the other things that I've been building a lot around is the different resources, custom IO managers,
and I'm wondering how that factors into the ways that
the DG and the project scaffolding

(39:20):
thinks about
the breakdown of the system where maybe I have an IO manager specifically for handling
file based objects
in either object storage or local disk,
or I have a resource that has a base module that is an OAuth client, but then different implementations of that for different APIs and just how that gets used across these different

(39:44):
submodules within the DG project scaffolds.
Yeah. I think the nice thing is, right now, if you don't use
the project layout
or you don't use these kind of more advanced APIs, which are a bit hidden,
then you generally end up with a global dictionary of resources at the top of your project, and we wanted to get rid of that. So what dg allows you to do, which aligns very much with this organize-by-technology thing, is place the resources that are, like, relevant

(40:16):
to the other things in that directory right next to it.
And then you have one spot
in your project that's like, okay, here's all the stuff that deals with this technology. And that's been, like, a big
cognitive load
reduction
as well. But it also allows you to put it, because it's hierarchical, at the right spot in the hierarchy. Because maybe you have some advanced resource that, like, talks to two technologies or something. Well, you put it at the parent folder, because it makes sense for it to be there.
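
A sketch of colocating a resource with the assets that use it, rather than keeping one global resources dictionary at the project root. `ConfigurableResource` and resource injection by parameter name are standard Dagster; the OAuth-ish client and the CRM endpoint are hypothetical.

```python
# Sketch: a resource defined next to the assets that use it. The API details
# are made up; only ConfigurableResource, asset, and Definitions are real APIs.
from dagster import ConfigurableResource, Definitions, asset


class CrmApiResource(ConfigurableResource):
    """Hypothetical OAuth-backed API client, colocated with the CRM assets."""

    client_id: str
    client_secret: str

    def fetch(self, endpoint: str) -> dict:
        # Real token exchange and HTTP calls would live here.
        return {"endpoint": endpoint, "rows": []}


@asset
def crm_contacts(crm_api: CrmApiResource) -> None:
    # The resource is injected by parameter name and type.
    crm_api.fetch("/contacts")


defs = Definitions(
    assets=[crm_contacts],
    resources={"crm_api": CrmApiResource(client_id="...", client_secret="...")},
)
```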

(40:49):
And then continuing on from the earlier conversation of the impact of AI and the work that you're doing with components and DG to
bring more
opinionated constructs and bring more scoping to the problem
is the impact that
the generative AI capabilities have had on the actual creation and maintenance of code and systems.

(41:14):
And I'm
wondering how the work that you're doing in Dagster is
designed to align with
the needs
of these AI systems for managing context, managing
input, and enabling data engineers and application engineers to generate more of this
automatically

(41:34):
without necessarily having to have as deep domain knowledge of either Dagster or the
specifics of the underlying technologies.
Yeah. 100%. I mean, we could probably talk for a few hours on this, so I'll try to keep it brief.
But I guess, first of all, to set context on this, I have a pretty specific view

(41:56):
of
how
you make a framework
optimized for AI codegen.
Because I think without proper constraints,
AI is a hallucinating
demon
that is a technical debt super spreader. It can be a serious problem, and it's very easy when you're doing AI codegen

(42:18):
to end up in a spot where you do not know
how it's working.
If there's a bug, the AI can't debug it, you can't debug it, and you're just in this unstable place, and you basically have to throw it away and start over. So I actually think it's very important to structure these systems
to allow for the code to be ephemeral

(42:39):
and disposable
if things go wrong. It's one of the reasons why components was designed that way is that there's, like, only one spot in the project that gets edited. And then if something goes wrong, you'd be like, okay. And because the cost of creation is so low, like, regenerating something isn't that big a deal. So I think these frameworks and AIs will have this reflexive relationship as they go along, and frameworks will have to be LLM native.

(43:05):
You know, I wrote this article last summer, actually, which influenced a bunch of my thinking
or kind of explained the thinking that led to components. And I call it, there's this rise of what I'll call medium code.
Meaning, it's not low code, it's not, like, no-code click stuff, but it's not full software engineering.
It is you're writing a Turing complete language or a complex declarative language like SQL, but you're doing it in a highly constrained way.

(43:32):
And you're usually doing it in some coarse grained
container
of code.
In dbt, that's a model. In Jupyter notebooks, that's a cell. And it limits the amount of context a human has to have in order to work properly.
But it turns out one of the most interesting things about AI is that what's good for the human is good for the machine and vice versa. Like, you want obvious APIs.

(43:56):
You want to limit the amount of context someone has to hold in their head, or literally the number of tokens that are in context, in order to do stuff well. But the key thing as well is that the code that gets generated
needs to be precisely interpretable
by both a machine and a human. Because if you have to debug stuff or you have to bring someone in to help debug stuff, it needs to be something with deterministic behavior that is understandable.

(44:23):
So I won't go through the entire article
because that's a whole thing. But, basically, I think that the right AI codegen targets have coarse-grained containers of code. They code to some high level framework or DSL that's precisely interpretable, but still part of the software development life cycle because that is absolutely essential because you need to have guardrails to check to make sure the generated code is correct. You need guardrails for the AI slop. So that's how I kinda consider how you need to think about how to design frameworks

(44:56):
for the AI native era.
And the components are designed with all this in mind, high level framework with built in documentation
that's customizable
by the user. I think documentation
will be viewed
increasingly
more as a store of context.
So documentation
needs to change,

(45:16):
meaning that the purpose of it is to provide context to the LLM. And so that's one of the reasons why we, like, focus so much on built in docs
for components and allowing
for custom component authors to inject the domain-specific context right there in your code and be able to allow the LLM to scrape it.

(45:37):
Yeah. And then high quality error messages, critical to provide feedback.
But I think it's, like, just very important.
The concept of building a good AI native framework,
the goal is to dramatically
accelerate
the work of the software engineer,
not to abstract them away. And I don't design like that just because I like software engineers, and I don't want them to go away. That is true. So maybe there's some subconscious, you know, psychology going on there. But

(46:06):
more importantly, I think in essence, it is correct.
The reason why AI is so exciting is that, if done right, we can abstract away
so much of the drudgery,
so much of the toil of software engineering, and focus much more exclusively
on what you uniquely have judgment on. So to me, the most exciting thing about AI

(46:28):
is the ability
to abstract away
enormous
swaths of incidental complexity
that we didn't think was possible.
But
in order to do that
in a way that's effective and safe
and actually leads to a higher-quality system,
the framework designers

(46:49):
absolutely
need to optimize for it.
And, also,
the other element of working with these AIs effectively
is that it removes us from, to your point, the drudgery of dealing with boilerplate,
dealing with very narrowly scoped problems,
and moves us up to thinking about what is the actual system level requirement and how do I get there so that the LLM can focus on those very narrow domains to be able to stick them together in a way that is composable.

(47:20):
Yeah. Exactly. It's all very exciting. You know? I hope that people can approach it from an abundance mindset rather than a fear-based mindset. But I think it's more that it's a radical change, even if it's gonna be for the better. And that is
stressful and anxiety inducing, and I totally get that. Like, in my own development, I am probably not as AI native as, like, I need to be. You know? And, you know, at this point, I'm an old man, so I need to really work on maintaining that brain plasticity to learn new stuff. So I get it, but,

(47:53):
it's also very exciting.
I take umbrage at that because I think we're the same age.
At this point in my life, I wake up very early, and my son always asks me, Dad, are you up because you're an old man? I'm like, yes. That's my mom. My son is six. He's very charming.
And so as you have been

(48:15):
exposing these new capabilities, working with some of the early adopters for the DG, the scaffolding,
the components, interfaces, what are some of the most interesting or innovative or unexpected ways that you've seen those capabilities applied?
Yeah. Like I said, we're going with a fairly limited set of design partners. But even among that set, there's been a bunch of great stuff happening. One of our users is

(48:40):
onboarding
his
Databricks-using
stakeholders, data scientists who work in
hosted notebooks in the Databricks environment. And
he, you know, the first thing he did was write his own custom component, and there's a REST API for running one of these things, right, so you wrap a custom component around that. You can basically tell your data scientist, like, hey, if you wanna schedule it within the context of the data platform, copy and paste the notebook URL, put it in this YAML file, like, fill out these things. You're good to go. But then he realized the power

(49:14):
of the customizable config system, and he started kind of putting all the DevOps stuff in there too. So configuring memory,
how the integration with Datadog should work in the context of this thing, like, what metadata to put everywhere and stuff. So it ended up being this, like, kind of single
spot
where this intrepid data scientist integrating his notebook into the production data platform can control a bunch of different parameters of stuff. So I thought that was really cool. Another user

(49:47):
went kinda, I would call it, hog wild on the number of custom components he built, and he was wrapping all sorts of crazy legacy systems, Talend and I think even Informatica,
all this stuff. And so that was really heartening, that someone was able to kinda churn out so many custom components this early in the life cycle of the system. And then another
one that comes to mind is that the moment that one of our users saw the new project layout and its kind of hierarchical

(50:14):
nature, he also
saw this capability we have, where, kind of, like, in the
YAML file at any point in the hierarchy, you can kind of post-process
all the definitions that were above it to, like, apply a common tag or apply the same metadata. And that allowed a really nice separation of responsibility

(50:34):
where the data platform engineer could programmatically
apply governance information across the entire
system very smoothly because it kind of changes the way that you can abstract things. Because, basically, you tell your stakeholder, just put this tag on this asset. Okay? And then later down the line, the platform engineer can process that tag and then decide how to interpret it and, like, produce all sorts of derivative metadata and, like, who's the owner and what team is it on and, like, this piece of metadata indicates this policy, and I'm gonna write some other piece of code that queries that and makes decisions on it. So, like, you know, you show the capability,

(51:11):
and then instantly,
the user is like, wow. I can use this to programmatically control all my governance in a smooth way. Like, that was really cool to see.
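
A rough sketch of that tag-then-post-process pattern, using plain Dagster asset specs rather than the components YAML hierarchy. The domain-to-owner mapping and the asset names are invented, and `tags` and `owners` on `AssetSpec` assume a reasonably recent Dagster version; the real workflow Nick describes would hang this logic off the YAML front end instead.

```python
# Sketch: stakeholders declare a simple domain tag; the platform team
# post-processes every spec to derive governance metadata (here, owners).
from dagster import AssetSpec, Definitions

DOMAIN_OWNERS = {"finance": "team:fin-data", "growth": "team:growth-eng"}

# Stakeholder side: just declare which domain the asset belongs to.
raw_specs = [
    AssetSpec("revenue_daily", tags={"domain": "finance"}),
    AssetSpec("signup_funnel", tags={"domain": "growth"}),
]

# Platform side: derive owners (or any other policy) from the tag.
governed_specs = [
    AssetSpec(
        spec.key,
        tags=spec.tags,
        owners=[DOMAIN_OWNERS[spec.tags["domain"]]],
    )
    for spec in raw_specs
]

defs = Definitions(assets=governed_specs)
```
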
And then a natural outgrowth too of these components
is that it gives you a
reproducible and packageable
abstraction
that you could feasibly build a community library around of here are all the different components that people have published for their use cases so that somebody who is new to Dagster can come in and just pick and choose, carte blanche, whatever it is that they want to be able to LEGO brick their system together to get up and running.

(51:49):
That is the idea, and we didn't even talk about this beforehand, but you just you're setting me up perfectly. You know, one of the things we're building into this is the integrations marketplace
that we wanna be able to kind of index integrations
from both our own monorepo,
the community repo, as well as internally built components. So at, you know, an at-scale customer, we want the centralized data platform team to kind of be able to publish components

(52:16):
into a searchable index that has all sorts of metadata and, like, embedded docs and, like, copy-and-pasteable code snippets for, like, how to install this thing. So, yeah, we really wanna kick off an ecosystem effect for these things.
And as you have been
building these component capabilities,
working with teams, helping them come to grips with how to

(52:39):
accelerate their work with AI, how to build for AI with Dagster, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
That's a good question.
I think
it is
very
important
to balance
innovation
with change management,

(52:59):
and that can be very challenging. You have to be very careful
to introduce stuff incrementally,
to allow incremental process that provides value at every step along the way. Like, when you're adding new capabilities and you're asking users to do any sort of code change, you have to provide value instantaneously
every step along the way and then communicate about it properly. And that can be a challenge.

(53:25):
So there's that. And then the desire
to incorporate
stakeholders
and just how universal
the need is to incorporate people into the data engineering process
continues to kind of astound, actually. You know? Just data is so instrumental
to so many people's jobs, and that person who it's instrumental to is, like, the only person who understands it. So just allowing them to participate directly in the process

(53:55):
reduces so much context switching
and painful collaborative processes that it ends up that there's a huge organic demand for that sort of thing. So I think that's also been kind of a lesson learned here.
And so as teams are trying

(54:15):
to tackle their data challenges, they're ramping up on AI use cases, what are the cases where Dagster is the wrong choice?
Well, we're fundamentally
a batch processing
system, you know, that can get to semi real time. But if you need microsecond latency,
like, we're not the right tool for that. Highly

(54:36):
dynamic,
you know, kind
of computations that require
loops and can't be structured
into a DAG structure.
Systems like Temporal are more appropriate. They make different trade-offs in terms of, you know, we provide much more structure and constraints, built-in lineage, all this stuff. And they have a much more complex state machine, but it is more generic,

(54:57):
more imperative, and more flexible. So there is that use case as well. So those are the two that come to mind.
And as you continue
to build and iterate on Dagster, on components,
as you continue
to keep pace with the demands of the AI age that we're in, what are some of the things you have planned for the near to medium term of Dagster that you're excited to dig into?

(55:23):
Well, you know, like I said, obviously, I'm working with components, and I'm very excited to continue the journey of expanding that ecosystem.
And this in-app editing is gonna be amazing. It's in-app editing that I can get behind, because it still ends up checking the stuff into source control and running tests on it. That provides a nice layering so that it can make stakeholders happy and it can make the engineers happy. Most importantly, it can get good outcomes. So we have an entire long roadmap on the components front for that. In terms of other things I'm super excited about, you know, Dagster is in a unique spot because

(55:59):
we have metadata on the integrations. We have metadata on the actual definitions defined in code. We have operational metadata. We have all sorts of very interesting metadata. And I think we can evolve Dagster not just into, like, a useful operational tool, but into a generalized

(56:20):
context store for all sorts of tools, including our own, across the platform. When we were discussing this components stuff, it was like, okay, yes, we're gonna design an abstraction that's good for native AI codegen, but why else do we have the right to win here? And we have the right to win, in our opinion, because we have a unique view of all the context in the system, across every tool, across the way that you define your pipelines in code, all sorts of stuff. So I'm very excited to

(56:49):
kind of, you know, not just have AI accelerate the authoring of data pipelines, but have Dagster's contextual information power AI use cases of all sorts. And I think we're in a great place to do that. I kinda jokingly refer to it as the mother of all MCP servers, because we can, like, aggregate

(57:09):
the MCPs of all our integrations, all sorts of interesting stuff, and ingest tons of information. Like, in our dbt integration, we ingest the full model code. Right? So we can provide that context directly in the same API where we can provide, like, the Python definitions of completely different technologies, as well as information about when this last failed, as well as information about, like, hey, what things are upstream of this? So we have a great opportunity to be a really compelling

(57:37):
context store for AI tools operating on the data platform and across the business.
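A purely hypothetical sketch of what that context-store payload could look like for a single asset; none of these class or field names are real Dagster or MCP APIs. It only illustrates how a code definition, lineage, and operational history could be joined into one queryable surface.

```python
# Hypothetical illustration only -- not a real Dagster or MCP API.
# Sketches the "context store" idea: one queryable payload that joins the
# code definition, lineage, and operational history for a given asset.
from dataclasses import dataclass


@dataclass
class AssetContext:
    asset_key: str
    definition_source: str      # e.g. the full dbt model SQL or Python body
    upstream_assets: list[str]  # lineage: what feeds this asset
    last_run_status: str        # e.g. "SUCCESS", "FAILURE"


def build_context(asset_key: str) -> AssetContext:
    """Assemble everything an AI tool would want to know about one asset.

    In a real system each field would come from a different subsystem
    (the code repo, the run database, the lineage graph); here they are
    stubbed so the shape of the payload is visible.
    """
    return AssetContext(
        asset_key=asset_key,
        definition_source="select * from {{ ref('raw_orders') }}",
        upstream_assets=["raw_orders"],
        last_run_status="SUCCESS",
    )


# An AI assistant could then be handed this single payload instead of
# querying the repo, the scheduler, and the catalog separately.
print(build_context("stg_orders"))
```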
Are there any other aspects of the age of AI, the impact on engineering practices, the work that you're doing at Dagster, or any other related topics that we didn't discuss yet that you'd like to cover before we close out the show?
Well, there are any number of topics I have opinions on. But I feel like we've been talking for a while, and I don't wanna overwhelm anyone. So I think we can

(58:06):
wrap it up there. I guess, you know, things have changed, but it is very, very exciting. You know, I'm often a skeptic of such things, but these AIs are really doing incredible things I wouldn't have believed were possible even a few years ago. So it's pretty cool to see.
Yeah. I was skeptical at the outset as well, but I have been

(58:28):
reasonably impressed in recent months with some of the capabilities and the ways that it can accelerate the work to be done. So definitely.
Claude is really good at codegen now, and I find that ChatGPT, with the o3 model, is really incredible for doing kind of research on the Internet. And, like, I also love now how it shows you what it's doing. Like, hey, I'm fetching from here, I'm fetching from there. Just that transparency actually builds a lot of trust, so you can kind of tell that it's doing stuff correctly. So, yeah, the use cases are pretty incredible.

(59:02):
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your contact information to the show notes. And for the final question, I'd like to get your perspective on what you currently see as being the biggest gap in the tooling or technology that's available for data management today.
Ugh. You really had to ask me that. Well, it's hard coming on as a vendor and answering that, because, of course, the thing I'm working on

(59:25):
is the most important thing. I guess, to me, the missing thing is that all of the hyperscalers and all the data hyperscalers are trying to build walled gardens. And we're maybe going back to a world where there are Databricks developers and Snowflake developers. And more importantly, there are Databricks companies and Snowflake companies and AWS companies and GCP companies. And I think that is a bad state of the world for the engineers

(59:53):
because you want people to be able to move flexibly between different companies and have skills be portable, and we should really still be striving for a world of open standards. So, you know, I don't know if Dagster is the right tool. Well, I mean, I think we can play a part, but I think there are other parts of the ecosystem that need to step up too. But I hope we live in a world that's less vertically integrated and more horizontally

(01:00:15):
integrated.
And anyone who can help out with that by building standards, building open source tools, making that story better, that is great. Because I think, like, a world of five walled gardens is kind of a sad one.
Yeah. Absolutely.
Well, thank you very much for taking the time today to join me as usual, and for all the great work that you're doing on Dagster and making it easier for folks to adapt to the changing ecosystem. I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It means a lot. Appreciate you having me on the

(01:00:54):
show.
Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

(01:01:20):
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.