Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Tobias Macey (00:11):
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL.
The result: inflexible infrastructure that can't adapt to different workloads.
That's why Cash App and Cisco rely on Prefect.
(00:33):
Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows.
Each model runs on the right infrastructure, whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles.
ETL, ML model training, AI engineering, streaming, Prefect runs it all from ingestion to activation in one platform.
(01:01):
Whoop and 1Password also trust Prefect for their data operations.
If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect.
You're a developer who wants to innovate. Instead, you're stuck fixing bottlenecks and fighting legacy code. MongoDB can help. It's a flexible, unified platform that's built for developers by developers.
(01:24):
MongoDB is ACID compliant and enterprise ready, with the capabilities you need to ship AI apps fast. That's why so many of the Fortune 500 trust MongoDB with their most critical workloads.
Ready to think outside rows and columns? Start building at mongodb.com/build today.
Your host is Tobias Macey, and today I'm going to talk about some of the shifts that we've been seeing in recent years as a result of the
(01:50):
growth of AI and AI engineering practices.
I started this podcast back in 2017, which was right at the point that the term data engineering was really becoming widely used and widely adopted. It got attached to a job title, and the responsibilities for that job title were
(02:12):
generally pretty well understood.
And from what I've seen, the backdrop to the creation of data engineering as a specific discipline, and as an outgrowth of things like data warehousing, business intelligence engineering, and DBAs, was that the early to mid two thousands
(02:33):
through until the mid twenty tens was when we really had the massive growth in popularity and hype around the idea of data science and data scientists, how that was the hot new job of the twenty-first century, and the thing that everybody wanted to do. And so there was a massive hiring spree at large companies amid the growth of the Internet era, for people who wanted to be able to
(03:00):
collect and make sense of all of the data that was available to be had because of the growth of Internet usage, and to turn that raw information into valuable and actionable insights and assets that they could then generate new revenue streams from.
And
(03:21):
a lot of the people who got hired for those roles of data scientist and machine learning engineer didn't actually have the data that they needed when they started in those positions. And so rather than coming into a company, building a model, building a forecast, making predictions,
(03:41):
and automatically being able to drive a huge amount of revenue for the company, they instead had to do all of the hard work and heavy lifting of finding the data that existed, understanding its context and how it fit together, and collecting and cleaning it before they could even do the work that they were hired to do. And so
(04:03):
that was what really led to the growth of data engineer as a job title: it was a means of differentiating those two areas of work and capitalizing on the investments in these data scientists, and the statistical, modeling, and machine learning skills that they had, by bringing in people who understood how to work with data, how to make it reliable, how to clean it, and how to present it to the teams that then wanted to take action on it.
(04:33):
And the data engineers also pretty quickly started to take over a lot of the business intelligence work as well, with caveats. And that, I think, was also what led to analytics engineer as a role, where you didn't necessarily want to have your data engineers spending all of their time building reports and working with the
(04:57):
business stakeholders, because then they weren't doing the job that they were hired to do. And so analytics engineers came in, and we started to have this fracturing of the titles and the different business roles. And so we kept hiring more people who were working with data in different avenues, and there were fairly clear delineations between
(05:18):
who was doing what at what points in the overall data life cycle.
Around when I started this podcast, it was also the tail end of the Hadoop era, where the whole idea was: hey, we'll just collect all of the data, we'll throw it into these massive data lakes, we'll do some fancy MapReduce on it, and eventually we'll get some value out of that data. And so a lot of companies invested a substantial amount into those infrastructures and into that architecture,
(05:45):
and it didn't ultimately pan out for a number of them. And so technologies such as Redshift were at the forefront, but since then Snowflake, ClickHouse, and a lot of these cloud data warehouses and columnar engines have come in to give us the ability to
(06:07):
collect a lot of the data, but keep it structured and operate on it. And a lot of the work that data engineers would do came to be filling the data warehouse with all of that structured and semi structured data, making it reliable and repeatable, making sure that we had ways of bringing that data in and doing some activity on it, and then providing clean interfaces
(06:31):
for downstream consumers of that data, whether that was the business intelligence teams or the machine learning engineers.
The late twenty tens were when we really started to have that uptick in deep learning, which was what then led to the idea of MLOps and operationalizing these machine learning workflows, because it became
(06:52):
more practical and achievable to build useful models of the data without necessarily having to do as much upfront work and experimentation. Obviously, that was still a very core piece of the workflow, but you didn't necessarily have to do every little bit of fine tuning of the features that you were feeding in, because the deep learning algorithms could
(07:19):
pick out some of those patterns for you. And so we had the growth of machine learning engineer and MLOps as a role and as a set of technologies. Many of those were an outgrowth of the existing data engineering technologies and orchestration workflows, but also many of them
(07:39):
were net new and built specifically for those deep learning and ML use cases.
Now, over the past two years in particular, with the growth of generative AI and large language models, a lot of those distinctions and differentiations
(08:01):
have really been blurred. About two years ago, I started the AI Engineering Podcast because of the growth in this new style of work and set of requirements, and I've largely kept them as distinct shows. But the more time goes by and the more that adoption of these capabilities and technologies
(08:24):
grows, the harder it is to really differentiate: is this a data engineering topic, or is this an AI engineering topic? Because in many cases it's both, where the data that we need is something that you have to prepare for use by the AI models, and many of the AI models can be used in preparation
(08:46):
of the data, particularly when we talk about things like unstructured and even semi structured data. For a long time, PDFs, large free text documents, audio, and even things like images and video were stored, and you could do some
(09:08):
metadata extraction and gain some insight about them, but the actual content of that information was largely put to the side and not used in the day to day of data engineering workflows. Now that we have these models that are capable of processing larger chunks of that raw data and being able to extract
(09:31):
meaning, semantics, and pieces of detail from it, we're at a point where we now have to start thinking about: okay, what are the pieces of unstructured data that are going to be useful? What are the common factors that I want to extract out of them? How do I turn this into something that can be stored in a data warehouse or adjacent to a data warehouse?
(09:56):
So that's one of the major ways that data engineering has been shifting: we need to bring some of these language models and other probabilistic technologies into what has been a fairly deterministic workflow.
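To make that shift concrete, here is a minimal sketch of what an LLM-assisted extraction step can look like, assuming the OpenAI Python client; the model name, prompt, and field names are illustrative assumptions, not a prescribed approach.

```python
# A hedged sketch: pull structured fields out of free-text customer feedback
# so the results can land in a warehouse table. Assumes the OpenAI Python
# client; the schema, prompt, and model name here are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = """Extract the following fields from the customer feedback
below and respond with JSON only: sentiment (positive/neutral/negative),
product_area, and summary (one sentence).

Feedback: {text}"""

def extract_fields(feedback_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for whichever model you use
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(text=feedback_text)}],
    )
    # The model is probabilistic, so validate the shape of its output
    # before loading it into the warehouse rather than trusting it blindly.
    record = json.loads(response.choices[0].message.content)
    if not {"sentiment", "product_area", "summary"} <= record.keys():
        raise ValueError(f"unexpected extraction shape: {record}")
    return record
```

The validation step at the end is the important part: a probabilistic component embedded in a deterministic pipeline needs an explicit contract check before its output flows downstream.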
Composable data infrastructure is great until you spend all of your time gluing it back together.
(10:20):
Bruin is an open source framework driven from the command line that makes integration a breeze.
Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster.
(10:41):
Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin
today to get started. And for dbt Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud.
Another aspect
(11:02):
of the ways that these language models and generative systems have changed the work of data engineers is that we have new types of data assets that we're responsible for, with vector databases and vectors being the biggest example. Before, we would have tabular data; you would maybe
(11:22):
do some feature engineering on that and then send it to the training, and maybe the hydration, of a machine learning model to create some prediction or perform some action. But now we need to take structured, semi structured, and unstructured data, turn it into vector embeddings, and then store those in what is a relatively new technology for many, in the form of these vector databases, so that the models can retrieve that data
(11:52):
at inference time and do it quickly and effectively.
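As a rough illustration of that new asset type, here is a sketch of the embed-and-retrieve shape of the work, assuming the OpenAI embeddings API and using a plain in-memory list as a stand-in for a real vector database; the chunks and model name are illustrative.

```python
# A minimal sketch of the embed-and-store step. The in-memory list stands in
# for a real vector database; the model name and chunks are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    response = client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return [item.embedding for item in response.data]

# "Upsert": in production this write goes to a vector database, but a list
# of (chunk, vector) pairs shows the shape of the asset we now maintain.
chunks = ["refund policy text...", "shipping times text...", "warranty text..."]
index = list(zip(chunks, embed(chunks)))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank stored chunks by cosine similarity against the query vector.
    q = np.array(embed([query])[0])
    def score(vec: list[float]) -> float:
        v = np.array(vec)
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(index, key=lambda pair: score(pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```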
And so that still requires a lot of the skills that we have as data engineers, but the requirements around the ways that we're structuring that data are changing. Another important aspect of what's really changing is that,
(12:15):
unless you were working in an organization that relies heavily on real time streaming data, the SLA, or service level agreement, around that data and its timeliness has changed a lot, and the uptime expectations in particular have changed a lot. Where before, if you were
(12:37):
building something for a data warehouse that was feeding into a business intelligence system, there's a pretty high probability that if it goes down for fifteen minutes or an hour in the middle of the night while you reload the data warehouse, update data, or fix some bugs, it's not gonna be that big of a deal. But if you are
(12:58):
running a vector store that is powering a customer facing LLM doing inference and providing an interactive use case, if that same downtime happens, it's a much bigger deal. And so then you get into some of the operational characteristics of
(13:19):
the systems that we're building, where it's not just a one way flow of information; we need to be able to actually start generating new information and new insights from those interactions as well, which is where things such as memory stores come in. The use of language models in particular, and ideas such as retrieval augmented generation, have also driven a renewed interest in graph technologies, because of being able to build knowledge graphs and semantic graphs
(13:49):
of the context.
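To ground the retrieval augmented generation idea, here is a hedged sketch of the inference-time flow; the retriever is a stub standing in for the vector store or knowledge graph, and the model name and prompt are assumptions.

```python
# A minimal RAG sketch: fetch relevant context at inference time, then ask
# the model to ground its answer in it. Assumes the OpenAI Python client.
from openai import OpenAI

client = OpenAI()

def retrieve_context(question: str) -> list[str]:
    # Stand-in: a real system would query the vector database or knowledge
    # graph here, under the tighter uptime requirements discussed above.
    return ["Our refund window is 30 days from delivery."]

def answer(question: str) -> str:
    context = "\n".join(retrieve_context(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```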
And so at the core, the purpose of the work that we do as data engineers hasn't substantially shifted, but the shape that it has taken has. Really, at the base level, the purpose of data engineering is to turn raw information
(14:10):
into useful knowledge.
But the context in which that knowledge is being exposed and utilized has changed.
It has also started to blur the lines of the responsibilities, where you don't necessarily have a data engineer who hands off to an analytics engineer,
(14:30):
or a data engineer who hands off to a machine learning engineer
who then maybe hands their model off to an application engineer to incorporate it.
Those teams all have to work much more closely together to be able to build useful products that can go from raw data to useful inference
(14:51):
as quickly as possible, because the capabilities of these models keep changing, the use cases for these models keep changing, and the pace at which we need to deliver is changing, because these AI models also have a substantial impact on our ability to be productive, deliver quickly, and iterate quickly.
(15:14):
In particular, because of their ability to generate code as well as new insights, and because they shorten the cycle from a business stakeholder or a customer having an idea or asking a question to getting a useful answer. Before, if you didn't already have a report to answer a given question, even if you had the raw data somewhere,
(15:38):
that business stakeholder would have to file a ticket or ask an analytics engineer, saying, hey, I really wanna understand what types of feedback I'm getting from my customers. Maybe that even required building a new model to parse that natural language data and extract the sentiment of the feedback, where now a lot of that natural language processing can happen with these language models. And so it's really just a matter of: let me take all of this unstructured
(16:07):
feedback from customers, whether that's from things like Zendesk or audio recordings, run it through, and actually get an answer pretty quickly. And it might not even require the involvement of a business analyst to provide that response, because the language model can take on some of that workflow. And so,
(16:32):
as data engineers, we're also seeing a lot of pressure to onboard new datasets faster and make them accessible to these models at a higher rate and in higher volumes, because the ability to unlock value from them has dramatically shifted.
(16:53):
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy?
DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems.
Whether you're moving from Oracle to Snowflake,
(17:15):
migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price.
Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold
to book a demo and see how they turn months long migration nightmares into week long success stories.
(17:36):
Another way that data engineering and AI engineering are starting to blur the lines and become unclear is that a lot of the same terminology is being used, with orchestration being the biggest example. Data orchestration has typically meant things like Dagster, Airflow, and Prefect, where I have my ETL script, I know that each step needs to consume this data and produce this data,
(18:04):
and then you run it over and over again. But now orchestration is also being used in the context of these agentic AI systems, where I need to orchestrate the execution of that agentic loop, I need to provide data to the agent for it to be able to make decisions, and I need to provide ways for the agent to access data, which also
(18:28):
dramatically increases the responsibilities around governance, security, and access controls. And so having those two different styles of orchestration can make it challenging to understand which type of orchestration you need. Where does that orchestration live? Do I just use my existing ETL orchestrator
(18:49):
for some of these agentic use cases? Does it have the ability to operate at those speeds? Do I maybe use something like a Dagster or a Prefect to execute another orchestrator and maintain its loop as an extension? What are the different styles of loops that these agents need? Do they all have to be running in a tight loop in a single process? Or is this something
(19:15):
where an LLM gets called, it produces some output that gets stored as an artifact somewhere, and another step picks it up and runs it through a different LLM with a different set of instructions? So there are really a lot of unknown patterns, and a new evolution of capabilities, ways of thinking about these things, and ways of decomposing these workflows that
(19:37):
some people have experience with, but many of us don't. And so there's a real lack of established practices, which makes this an even more challenging time to be an engineer in this space. But it also provides a lot of opportunity to contribute to the discovery and
(19:58):
creation of those useful patterns, and to educating each other on how to understand what those responsibilities are, how we handle those interfaces, and how we manage those handoffs between the different stages.
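As one hedged illustration of the artifact-passing pattern described a moment ago, here is a sketch that decomposes two LLM calls with different instructions into orchestrated steps. Prefect appears only because it was mentioned above; the same shape works in Dagster or Airflow, and the helper function and prompts are hypothetical.

```python
# A sketch of LLM steps as orchestrated tasks: one call drafts an output,
# it is handed off as an artifact, and a second call with different
# instructions picks it up. The decomposition is illustrative, not prescribed.
from openai import OpenAI
from prefect import flow, task

client = OpenAI()

def call_model(instructions: str, content: str) -> str:
    # Hypothetical helper wrapping a single chat completion call.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions},
                  {"role": "user", "content": content}],
    )
    return response.choices[0].message.content

@task
def draft_answer(question: str) -> str:
    # First LLM call: produce a draft that the orchestrator tracks.
    return call_model("You draft answers to customer questions.", question)

@task
def review_answer(draft: str) -> str:
    # Second LLM call with different instructions picks up the artifact.
    return call_model("You review and tighten draft answers.", draft)

@flow
def agentic_pipeline(question: str) -> str:
    return review_answer(draft_answer(question))
```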
Maybe one of the
biggest
(20:18):
new requirements, especially for data engineers, is the fact that the way we test these different workflows has to change. We have things like data quality monitoring, unit tests, and integration tests to make sure that if the logic breaks, or if some new
(20:39):
data comes in that breaks our prior assumptions, we would react to it and fix it. But as the models start to become more of the core execution, either in terms of the actual pipelines themselves or in terms of the
(21:00):
serving of what the data is feeding into, everybody in the whole workflow needs to be able to include experimentation and evaluation as the means of building confidence, verifying changes, and maintaining functionality.
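To make the evaluation idea concrete, here is a minimal sketch of an eval harness that can run on every change to a prompt, a model, or the data; the step under test, the cases, and the pass threshold are all hypothetical.

```python
# A minimal evaluation harness sketch: a handful of cases with expected
# outcomes, scored as a pass rate and gated on a threshold rather than
# exact matches, since model behavior shifts across versions and data.

def classify_sentiment(text: str) -> str:
    # Stand-in for the model-backed step under test; in practice this
    # would call an LLM or a fine-tuned classifier.
    return "negative" if "broken" in text else "positive"

EVAL_CASES = [
    ("The new dashboard is fantastic", "positive"),
    ("My order arrived broken", "negative"),
]

def run_evals() -> float:
    passed = sum(
        classify_sentiment(text) == expected for text, expected in EVAL_CASES
    )
    rate = passed / len(EVAL_CASES)
    print(f"eval pass rate: {rate:.0%}")
    return rate

if __name__ == "__main__":
    # Gate changes on an agreed threshold so the pipeline can keep
    # evolving without fear of silent regressions.
    assert run_evals() >= 0.9
```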
And so many
machine learning teams and AI teams
(21:23):
have that experimentation practice, and there are some sets of tooling available to manage it, but the scope and scale of it has definitely dramatically expanded. And so there's a lot of new discovery that has to happen to make sure that we can effectively
(21:45):
operationalize these experimentation workflows and make sure that they can be executed rapidly, so that we can continue to build and evolve our systems without fear of breakage. Because as new models come out, the behaviors change. As new data comes online, the behaviors change. We need to be able to understand
(22:08):
the effect that different instructions will have on different models, or even on the same model. So there's a huge number of variables being introduced that didn't exist as prevalently as they do now, where even just going from deterministic software, where maybe I'm building a web application or a mobile application,
(22:29):
to data engineering, the dimensionality of the complexity goes up an order of magnitude. We're now at a stage where we're going up another order of magnitude of complexity, from data engineering to AI engineering and building these AI systems. And so we have a lot of the capabilities and the understanding to be able to do that, but we really need to be investing in that evaluation
(22:53):
flow as a means of confidence building, in order to get that flywheel in motion and maintain momentum, and to keep up with the pace of change, because it's just going to keep speeding up, or at least stay where it is now. It's never going to go back to what it was five years ago.
(23:13):
And so, as data practitioners, we really need to be thinking about: what is the set of skills that I have, and how do I apply it to this expanded set of responsibilities and complexity? How do I collaborate more closely with machine learning teams and with software engineers? How do I use these
(23:34):
models to be able to do my own work more effectively? And how do I gain some level of familiarity with what these models can and can't do, so that I can be most effective? And so,
(23:55):
going forward with the podcast, I'm going to be juggling that question of what those boundaries are: when does it make sense to discuss something in a data engineering context versus an AI context? And so, as you continue to listen, I'm always open to feedback on what questions you have as somebody who is working in this space, and on how I can best
(24:22):
surface and explore some of the changing responsibilities, the changing technologies, and those blurring boundaries. So I'm going to keep digging into this. I'm working in this space every day as well, so I appreciate the complexity and the uncertainty
(24:42):
that we are all facing. But it's also a very exciting time to be working in this space, because of all of these new capabilities, and because, with these models, we can move beyond just being plumbers of data to operating at a higher level of abstraction and capability, since many of these models can handle some of the rote and menial work that we've been stuck with for a long time. So,
(25:09):
going forward, I'm very excited for the technologies that we have built and rely on to continue to evolve, adapt, and bring in new capabilities powered by these AI systems, as well as to extend the governance
(25:30):
policies and controls, so that we can ensure that we're doing this in a safe and deliberate manner.
And I appreciate you taking the time to listen.
I really think that, going to my usual question of the biggest gap in the tooling or technology for data management today, it is that set of
(25:50):
patterns and practices for how and where to bring AI to bear, and for how the ways that we think about data structures and delivery need to evolve to accommodate these new patterns of access.
So with that, thank you for taking the time to listen, and I hope you enjoy the rest of your day.
(26:19):
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
(26:43):
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.