Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.
DataFold's AI powered migration agent changes all that.
Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.
(00:35):
And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafolds
today for the details.
Your host is Tobias Macey, and today, I'd like to welcome back Gleb Mezhanskiy, where we're going to talk about the work of data engineering to build AI, and AI to build better data engineering, and all of the things that come out of that idea. So, Gleb, for folks who haven't heard any of your past appearances, if you could just give a quick introduction.
(01:06):
Yeah. Thanks for having me again, Tobias. Always fun to be on the podcast.
I'm Gleb. I am CEO and cofounder of Datafold.
We work on automating data engineering workflows
now also with AI.
Prior to starting DataFold, I was a data engineer, data scientist, data product manager, and
I got a chance to build three data platforms pretty much from scratch
(01:29):
at three very different companies,
including Autodesk and Lyft, where I was one of the first
founding data engineers and got to build a lot of pipelines and infrastructure and also break a lot of pipelines and infrastructure. And
I've always been fascinated by how important
(01:50):
data engineering is to the business,
in that it unlocks the delivery of the actual applications that are data driven, be that dashboards
or machine learning models or now, increasingly, also
AI applications.
And at the same time, as a data engineer, I have always been very frustrated with how manual, error prone, tedious, and toilsome my personal workflow was
(02:15):
and pretty much started Datafold to solve that problem and remove all the manual work from the
data engineering workflow so that we can ship high quality data faster and help all the wonderful businesses that are trying to leverage data
actually
do it. So excited to chat.
In the context of data engineering,
(02:36):
AI,
obviously, there's a lot of hype that's being thrown around about, oh, you just rub some AI on it. It'll be magical, and your problems are solved. You don't need to work anymore. It's going to replace all of your junior engineers or whatever the current marketing spin is for it. And it's undeniable
that large language models, generative AI, the current era that we're in, has a lot of potential. There are a lot of useful applications of it, but
(03:03):
the work to actually realize those capabilities
is often a little bit opaque or
misunderstood
or confusing.
And so
there are definitely a lot of
opportunities for being able to bring large language models or other generative AI technologies
into the context of data engineering work or development environments.
(03:25):
But the work of actually getting it to the point where it is more help than hindrance
is often
where things start to fall apart. And I'm wondering if you can just start from the work that you're doing and the experience you've had of actually incorporating
LLMs into some of your product, some of the
lessons learned about what are some of those impedance mismatches, what are some of those stumbling blocks that you're going to run into
(03:51):
on the path of saying, I've got a model. I've got a problem. Let's put them together.
Yeah. Absolutely. And I think that's a spot-on observation, Tobias, in terms of there being a lot of noise and hype around AI everywhere. But, yeah, we don't have a really clear idea
and consensus
on how it actually
impacts data engineering.
(04:12):
And maybe before we dive into, like, okay, what is actually working, it's worth kind of disambiguating and cutting through the noise a little bit. And
I've been thinking about this recently, and I think there are probably two main
things that everyone gets a bit confused about.
One is
the confusion of software engineering and data engineering.
(04:33):
Software engineering and data engineering are very related. And in many ways, they are similar.
In data engineering, we ultimately
also write code that
produces some outcome. But unlike software engineering, typically, we're not really building a deterministic application that performs a certain function. We write code that processes
(04:55):
large amounts of data.
And, usually, that data is highly imperfect. And so
we're dealing not just with,
code. We're dealing also with extremely complex,
extremely
noisy inputs and a lot of the times also
unpredictable
outputs. And that makes the workflow quite different.
(05:15):
And I think one important distinction is
when we see lots of different tools and advancements in tools that are affecting
software engineers and impacting their workflows for the better like, one example is,
I think, over the past year, we've seen
amazing
improvement of the
(05:35):
kind of Copilot
type of support within a software engineering workflow through various tools. We at Datafold, for example, use the Cursor IDE a lot, and we really like how it seamlessly plugs in and enables our engineers working on the application code to just be more productive,
spend less time on,
a lot of, like, boilerplate, toilsome tasks.
(05:57):
And
it's really exciting how those tools affect the software engineering workflow. There's also a huge part of the software engineering space right now that is devoted
to agents. So, for example, with Cursor,
the idea is that you
plug it into the IDE
at a few
(06:18):
touch points for the developer, like code completion, and then it kind of sits in the system and helps you mock up and refactor the code. And it's very seamless, but it's still kind of part of the core workflow for the human. And then there's a second school of thought where there's an agent that takes a task that can be very loosely defined and then basically builds an app from scratch, or takes a Jira or Linear ticket and does the work from scratch. And that's also very exciting. I would say, in our experience testing multiple tools, the results there are far less impressive, and the actual
(06:50):
impact on the business for us in terms of software engineering has been far less impressive than with the more, like, IDE-native
enhancement.
But all of that is to say that
while those tools are
really impactful for software engineers and there's a lot happening also in other parts of the workflow,
we've seen very
limited impact of those particular tools on the data engineer's workflow.
(07:14):
And the primary reason is that although we're also writing code as data engineers,
the
tools that are built for software engineers, they lack very important context about the data.
And it is kind of a simple idea and a simple statement, but what's underneath is actually quite a bit of complexity.
Because if you think about what a data engineer needs to do in order to
(07:38):
do their job, they have to understand not just the code base, but they also have to have a really good grasp on the underlying data that their code base is processing,
which is actually a very hard task by itself starting from understanding
what data you have in the first place,
how the data was computed,
where it's coming from,
who is consuming it, what are the relationships
(08:00):
between all the datasets.
And absent that context, the tools that you may have supporting your workflow, yes, they can help you generate the code, but
the impact of that would be quite limited relative to,
how complex your workflow is. And I think that means that for data engineers, we need to see a specialized class of tools that would be dedicated
(08:24):
to improving data engineers' workflow and would excel at doing that by having the context that
is critical for a data engineer to do their job. That's, I think, one aspect of the confusion: all the advances in software engineering tools are exciting and inspiring, but it doesn't mean that the data engineer's workflow is now
(08:45):
impacted as significantly as the software engineer's workflow.
I think the other type of confusion that I'm seeing is
a lot of talk about AI in the data space.
And all the vendors you see out there are,
I think, smartly positioning themselves as really
relevant and essential to
(09:07):
the fundamental
tectonic shift we've now seen in technology, meaning they try to position themselves as relevant in the world where LLMs are really
providing a big opportunity for businesses to improve and grow and automate a lot of
business processes.
But if you double-click into what exactly everyone is saying, it's pretty much: we're going to help you, the data team, the data engineer, ship AI to your business and to your stakeholders. Like, we are the best,
(09:36):
you know, workflow engine
so that you can get data delivered for AI, or we are the best data quality vendor that will help you ensure the quality of the data that goes into AI, or we have the most integrations with all the
vector databases that are important for AI.
And
(09:57):
kind of the
message that you're getting from all of this, and by no means is this unimportant, this is definitely important and relevant.
But what's interesting about this is we're saying, essentially,
data engineer, you have so many things to do, and now you also have to ship AI. We're gonna help you ship AI. It's so important that you ship data for AI applications.
(10:19):
We are the best tool to help you ship AI.
But it almost sounds like this is data engineers in the service of AI.
And I think what's really interesting to explore and to unpack and what I would personally love for myself as a data engineer is kind of reversing that question and asking the question
of, okay. So we have now this fundamental shift in technology,
(10:43):
amazing capabilities
by LLMs.
How does it actually help me in my workflow?
So what does AI for the data engineer
look like? And I think we need much more of that discussion, because I think that if we make the people who are actually working on all these important problems more productive with the help of AI, then they will for sure do amazing things with data. And I think that's a really exciting opportunity to explore.
(11:10):
One of the
first and
most vocal applications of AI in that context of helping the data engineers
by maybe taking some of the burden off them that I've seen is the idea of
talk to your data warehouse in English or text to SQL or whatever formulation it ends up taking where rather than saying, oh, now you need to build your complicated star or snowflake schema
(11:35):
and then build all of the different dashboards and visualizations for your business intelligence.
You just put an AI on top of it, and then your data consumers just talk to the AI and say, hey. What was my net promoter score last quarter, or what's my year over year revenue growth, or how much growth can I expect in the next quarter based on current sales?
(11:56):
And it's going to just automatically generate the relevant queries. It's going to generate the
visualizations for them, and you, as a data engineer or as an analytics engineer, don't need to worry about it anymore.
And
from the description, it sounds amazing. It's like, great. K. Job done. I don't need to worry about that toilsome work. I do all of the interesting work of getting the data to where it needs to be, and then the AI does the rest. But
(12:21):
then you still have to deal with issues of making sure that you have the appropriate semantic maps so that the AI understands what the question actually means in the context of the data that you have, which is the hardest problem in data anyway, no matter what. So the AI doesn't actually solve anything for you. It just maybe exacerbates the problem, because somebody asks the AI the question, the AI gives an answer,
(12:42):
but it's answering it based on a misunderstanding
of the data that you have. And
so you still have those issues of hallucination,
incorrect data, or variance in the way that the data is being interpreted. And I'm wondering what you have seen
as far as the
actual practical applications of the AI
(13:02):
being that simplifying interface
versus the amount of effort that's needed to be able to actually make that useful.
Yeah. I think text to SQL is the holy grail of
the data space. I would say for as long as I've worked in the space,
for over a decade,
you know, people have really tried to solve this problem multiple times. And,
(13:25):
obviously, now in hindsight, it's obvious that pre LLM,
all of those,
approaches using traditional NLP
were doomed.
And
now that we have LLMs, it seems like, okay, finally, we can actually solve this problem.
And I'm very optimistic that
it indeed will help make data way more accessible, and I think it eventually will have tremendous impact on how humans interact with data and how data is leveraged. But I think that the how
(13:57):
and how it happens and how it's applied is also very important
because
I don't think that the fundamental problem is that people cannot write SQL.
SQL is actually not that hard to write and to master.
I I think the fundamental issue is that if we think about the life cycle of data in the organization,
(14:18):
it's very important to understand that the raw data that gets collected from, you know, all the business systems and all the events and logs and everything we have in a data lake is pretty much unusable. And it's unusable both by machines and AI and by people if we just try to, you know, throw a bunch of queries at it to try to answer really key business questions.
(14:41):
And in order for the data to become usable,
we need what is currently the job of a data engineer
of structuring,
filtering, merging, aggregating this data, curating it, and creating a really structured representation of what is our business
and what
are all the entities in the business that we care about, like customers, products, orders.
(15:04):
So that then this data can be fed into all the applications. Right? Business intelligence, machine learning, AI.
And I don't think that text to SQL replaces that, because if we just do that on top of the raw data, we basically get garbage in, garbage out. I do think that in certain
applications of that,
(15:25):
we can actually get very good results even today
if we put that level of a system on top of highly curated,
semantically
structured
datasets. Right? So if we have a number of tables that are well defined that describe how our business works,
having a text to SQL interface
could be actually extremely powerful because we know that
(15:48):
the questions that are asked and will be translated into code will be answered with data which has already been prepared
and structured. And so it's actually quite easy for the system to be able to make sense of it.
But I don't think we are there where just, like, you don't need the data team. Let's just ask a question. Almost guaranteed that the answer will be
wrong. So, in that regard, data engineering and data engineers
(16:12):
are definitely not going to lose their jobs because now it's easy to generate SQL from text.
And in the context even of that text to SQL
use case, what I've been hearing a lot is that
it's not even very good at that. One, because LLMs are bad at math and SQL is just a manifestation of relational algebra, thereby math.
(16:33):
But that if you bring
a knowledge graph into the system where the AI is using the knowledge graph to understand what are the relations between all the different entities from which it then generates the queries, it actually does a much better job.
But, again, you'd have to build the knowledge graph first. And I think maybe that's one of the places where
bringing AI earlier in the cycle is actually potentially useful, where you can
(16:57):
use the AI to do some of that
root work of saying, here are all the different representations that I have of this entity or this concept across my different data sources.
Give me a first pass of what a unified model looks like to be able to represent that, given all of the data that I have about it and all the ways that it's being represented. And I'm wondering what you've seen in that context of
(17:22):
bringing the AI into that data modeling,
data curation workflow
of
it's not the
end user interacting with it. It's the data engineer
using the AI as their copilot, if you will, or as their assistant to be able to do some of that tedious work that would otherwise be okay. Well, I've got 15 different spreadsheets. I need to visually look across them and try and figure out the similarities and differences, etcetera.
(17:50):
Yeah. That's a good point, Tobias. I have two thoughts there.
On how
the AI plugs in to actually make text to SQL
work: yes, you absolutely need that kind of semantic graph of
what datasets you have, how they are related,
what all the metrics are, how those metrics are computed.
(18:11):
And
in that regard,
what's really interesting is the metrics layer
that was, at some point, a really
hot idea in the modern data stack, probably, you know, three to five years ago.
And then everyone was really disappointed with how little impact it actually made on the data team's productivity and just overall on the data stack.
(18:35):
It's almost like now it's the metrics layer's
time. Because if you take the metrics layer,
which gives you a really structured representation of the core entities and the metrics,
putting text to SQL on top of it is almost, like, the most impactful thing that you can do, because then you have a structured representation of your
data model,
(18:55):
which allows AI to be very, very effective at answering questions while operating on a structured graph.
And so I think we'll see
really exciting
applications coming out of the hybrid of that kind of fundamental
metrics layer semantic graph and text to SQL.
You know, we're already seeing the early impacts of that. But I think over the next two years, it probably would become a really popular way to
(19:23):
open up data for the ultimate stakeholders
instead of
classical BI of, like, drag-and-drop
interfaces and kind of passively consumed dashboards.
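(To make that concrete, here is a minimal, hypothetical sketch of the idea described here: grounding text to SQL in a curated semantic layer rather than raw tables. The semantic_layer dictionary, the call_llm helper, and the table and metric names are illustrative assumptions, not Datafold's implementation or any specific vendor's API.)

```python
# Hypothetical sketch: text-to-SQL grounded in a curated semantic layer.
# call_llm() stands in for whatever LLM client you use; it is not a real API.

semantic_layer = {
    "tables": {
        "dim_customers": ["customer_id", "signup_date", "region"],
        "fct_orders": ["order_id", "customer_id", "order_date", "revenue"],
    },
    "relationships": [
        "fct_orders.customer_id -> dim_customers.customer_id",
    ],
    "metrics": {
        "revenue": "SUM(fct_orders.revenue)",
        "orders": "COUNT(DISTINCT fct_orders.order_id)",
    },
}

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM completion call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def text_to_sql(question: str) -> str:
    # The key idea: the model only sees curated entities, joins, and metric
    # definitions, so its generated SQL stays on the rails of the data model.
    prompt = (
        "You write SQL against the following curated data model.\n"
        f"Tables and columns: {semantic_layer['tables']}\n"
        f"Join relationships: {semantic_layer['relationships']}\n"
        f"Metric definitions: {semantic_layer['metrics']}\n"
        "Use only these tables, joins, and metric formulas.\n"
        f"Question: {question}\nSQL:"
    )
    return call_llm(prompt)
```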
But then the second point which you made is, basically, can AI actually help us get to that structured representation? And I think absolutely,
for the data engineer's workflow. So not for, I would say, a business stakeholder or someone who is a data consumer, but for the data producer,
(19:50):
I think that leveraging
LLMs to help you build
data models
and especially build them faster, in the sense of understanding all the semantic relationships,
not just writing code, is a very promising
area. And that comes back to my point about
how software tools are limited
(20:11):
in their help for, you know, data engineers. Right? I can write SQL, but if my tool does not understand
what the relationships are
between the datasets,
then it can't even help me write joins properly.
And one of the interesting things we've done at DataFold
was actually build a system that
essentially infers
(20:31):
an entity relationship diagram
from
the
raw data that you have combined with all the ad hoc SQL queries that have been written by people. So, previously, that would be a very hard problem to solve. But with the help of LLMs, we can actually have a really good shot
at understanding
(20:52):
what all the entities are that your
business has in your data lake and how they're related. And that's almost like a probabilistic graph, because people can be writing joins correctly or incorrectly, and you have noisy data. And sometimes
keys that you think are, like, primary keys or foreign keys are not perfect.
But if you have a large enough dataset of
queries that were run against your warehouse, you can actually have a really good shot at understanding what the semantic
(21:19):
graph looks like. And the context in which we actually did this was to help
data teams
build testing environments for their data. But the the implications
of having that knowledge is actually
very powerful. Right? So, to your point, we can use those tools to help write SQL.
So I'm very bullish on the ability to help data engineers
(21:42):
build pipelines
by creating a semantic graph
without the need for curation. Because previously,
that problem was almost pushed to people
with all the kind of data governance tools. The idea was, let's have data stewards define all the canonical datasets and all the relationships. And, obviously, doing this with just people power is completely non-scalable.
(22:02):
So now we're finally at the point where we can automate that kind of semantic
data mining,
with LLMs.
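(Here is a small, hypothetical sketch of the kind of inference described here: mining join conditions out of a warehouse query log to build a probabilistic relationship graph. The regex, the sample queries, and the scoring are simplified assumptions, not a description of Datafold's actual system; a real system would use a proper SQL parser, resolve table aliases, and profile the data itself.)

```python
import re
from collections import Counter

# Hypothetical sketch: infer likely table relationships from ad hoc SQL queries
# by counting equality join conditions seen in a query log.

JOIN_PATTERN = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)

def infer_relationships(query_log: list[str]) -> list[tuple[str, str, float]]:
    edge_counts: Counter = Counter()
    for sql in query_log:
        for t1, c1, t2, c2 in JOIN_PATTERN.findall(sql):
            # Normalize direction so (a.id = b.id) and (b.id = a.id) match.
            edge = tuple(sorted([f"{t1}.{c1}", f"{t2}.{c2}"]))
            edge_counts[edge] += 1
    total = sum(edge_counts.values()) or 1
    # Frequency acts as a crude confidence score for each inferred edge.
    return [(a, b, count / total) for (a, b), count in edge_counts.most_common()]

queries = [
    "SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id",
    "SELECT customers.region FROM customers JOIN orders ON customers.id = orders.customer_id",
]
print(infer_relationships(queries))
```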
That brings us back around to another point that I wanted to dig into further
in the context of how to actually
integrate the LLMs into these different use cases and workflows.
You brought up the example of Cursor as an IDE that was built specifically with LLM use cases in mind,
(22:28):
juxtaposed with something like VS Code or Vim or Emacs where the LLM is a bolt-on. It's something that you're trying to retrofit into the experience.
And it can be useful, but it requires a lot more effort to be able to actually set it up, configure it, make it aware of the
code base that you're trying to operate on, etcetera,
(22:49):
versus the prepackaged
product.
And we're seeing that same type of thing in the context of data where you mentioned there are all these different vendors of, oh, hey. We're gonna make it super easy for you to make your data ready for AI or use this AI on your data. But most teams already have some sort of system in place, and they just wanna be able to retrofit the LLM into it to be able to start getting some of those gains, with the eventual goal of having the LLM maybe be a core portion of their data system, their data product. And I'm wondering, in that process of bringing in an LLM, retrofitting it onto an existing system, whether that be your code editor, your deployment environment,
(23:29):
your data warehouse, what have you.
What are some of those impedance mismatches or some of the issues in conceptual understanding about how to bring the appropriate, I'm gonna use the word knowledge,
even though it's a bit of a misnomer,
into the operating memory of the LLM so that it can actually do the thing that you're trying to tell it to do? Yeah. That's a great question, Tobias. I think that to answer this, we kinda need to go back to
(23:55):
what the jobs to be done are for a data engineer,
and what the data engineer's workflow actually looks like. And if we were to visualize it, it actually looks quite similar
to the software engineering workflow in just the types of tasks
that a data engineer does day to day to do their work. And by the way, we're saying data engineer as sort of like a blanket label, but I don't necessarily mean just
(24:21):
people who have data engineering in the title because
all roles that are working with data, including data scientists, analysts,
analytics engineers, and even, in many cases, software engineers,
a lot of them actually do data engineering in terms of building pipelines and developing pipelines as part of their job. It's just that data engineers probably do this, you know, the bulk of their time. And if I'm a data analyst or data scientist, I would be doing this maybe 40%
(24:48):
of the time in my week. And so if we think about what I need to do to, let's say, ship a new
data model like a table or extend
an existing data model, you know, refactor definitions or add new types of information into an existing model,
it starts with planning. Right? So I'm doing planning.
I'm trying to find the data that I
(25:10):
need for my work. And
a lot of the times,
a lot of information can be
sourced from documentation,
from a data catalog. I think right now, the data catalog, giving you the sense of, like, what datasets I have and what's the profile of those datasets,
has been largely solved. There are great tools. You know, some are open source. Some are vendors. But overall,
(25:33):
understanding what datasets you have now is way easier than it was five years ago. You also probably are consulting
your tribal knowledge, and you go to Slack and you do, like, a search for certain definitions. And that's also now largely solved with a lot of the enterprise search tools. And then you go into writing code.
And writing code, I think this is also an important misconception. Like, if you are not really, you know, doing this for a living, you think that people spend most of their time actually writing SQL
(26:02):
in terms of, like, writing SQL for production.
And in my experience,
actual
writing of the SQL
or other types of code is maybe,
like, 10 to 15%
of my time,
whereas all the operational tasks around
testing it,
talking to people to get context, doing code reviews,
(26:24):
shipping it to production,
monitoring it, remediating issues, talking to more people
is where the bulk of the work is happening.
And if that's true, then that means that probably as we talk about automation,
these operational workflows are where the bulk of the lift
coming from LLMs can actually happen. And so for actually writing code as a data engineer, I would still recommend probably using the best-in-class software tools these days, like Cursor. Even though it's not aware of the data, it will probably still help you write a lot of boilerplate
(26:58):
and will speed up your workflow somewhat. Or you can use other IDEs with Copilot, like VS Code plus Copilot. I think those tools will just help you speed up the writing of the code itself.
But back to the operational workflows that I think take the majority of the time within any kind of cycle of shipping something. When it comes to
(27:21):
what happens after you wrote the code, right,
typically,
if
you have people who care about the quality of the data, it means that you have to do a fair amount
of testing of your work.
And testing is both about
making sure that my code is correct. Right? Does it conform to the expectations?
Does it produce the data that I expect? But it's also about understanding potential breakages.
(27:45):
Data systems are historically fragile in the sense that you have layers and layers of dependencies
that are often opaque because,
I can be changing some definition of what an active user is somewhere in the pipeline. But then I can be completely oblivious of the fact that 10 jobs
(28:06):
down the road, someone builds a machine learning model that consumes that definition and
tries to automate certain decisions, for example, spend, and is manipulating that metric. And so if I'm not aware of those downstream dependencies, I could actually be causing a massive business disruption just by the sheer fact of changing it. And so
the testing that involves not just understanding how the data behaves, but also how the data is consumed and what the larger business implications are of making any kind of modification to the code is where a ton of time is spent in data engineering. And so what's interesting is that this is the use case where, historically, we at Datafold spent a lot of time thinking, even pre-AI. And before LLMs were a thing, what we did there was come up with a concept of data diffing. And the idea is
(28:54):
everyone can see a code diff. Right? My code looked like this before I made a change. Now
it's a different, you know, it's a different set of characters
that the code looks like. And diffing the code is something that is, like, embedded in GitHub. Right? You can see the diff. But the very hard question is understanding how the data changes based on the change in the code, because
(29:15):
that is not obvious. That happens,
like, once you actually run the code against the database. And so data diff allows you to see the impact of a code change on the data. And that by itself was quite impactful, and we've seen a lot of teams
adopt that, you know, large
enterprise
teams, fast-moving, you know, software startup teams. But we were not fully satisfied with
(29:36):
the degree of automation
that feature alone produced because people are still required to, like, sift through all the data diffs and explore
them for multiple tables and
see how
the downstream impacts propagate through lineage.
And it felt like, okay. Now at least we can give people all the information, but they still have to sift through a lot of it, and some of the important details can be missed. And the big unlock that LLMs bring to this particular workflow is, once LLMs became pretty good at comprehending the code and actually semantically understanding the code, which pretty much happened over
(30:14):
2024
with the latest generation of foundational,
you know, large language models, we were able to
do two things. One,
take a lot of information and condense it into, like, three bullet points,
kind of like an executive summary. And those bullet points are
essentially helping the data engineer understand at a high level what the most important impacts are that I need to worry about for any given change, and for a code reviewer to understand the same. And that just helps people get on the same page very quickly and saves them a lot of time that otherwise could be spent in meetings, going back and forth, you know, putting comments on a code change. And the second unlock that we've seen is the opportunity to drill down
(30:57):
and explore
all the impacts and do the testing by, essentially,
chatting with your pull request, chatting with your code. And that comes in the form of a chat interface where you're basically speaking to an agent
that has a full context of your code, full context of the data change, data diff, and also full context of your lineage
(31:17):
so that it can actually understand
how every line of code that was modified is affecting the data and what that means for the business.
And you can ask questions, and it produces the
answers way faster than you would by essentially looking at all the different, you know, code changes and data diffs. And that ended up saving a lot of time for data teams. And
(31:40):
now that I'm describing this,
you kind of feel that it sounds almost like having a buddy that just, like, helps you think through the code, almost like having a code reviewer, except with AI. With an LLM,
this is a buddy that's always available to you twenty-four seven and probably makes fewer mistakes because it has all the context and can sift through a lot of information really quickly. So that's an example of how an LLM could be applied to an operational use case that historically has been really time consuming, and take a lot of manual work out of that context.
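(As a rough illustration of that "chat with your pull request" idea, here is a hypothetical sketch of feeding a code diff, a data diff summary, and downstream lineage into a single LLM prompt to get an executive summary of a change. The call_llm helper and the input shapes are assumptions for illustration, not Datafold's product.)

```python
# Hypothetical sketch: summarize the impact of a pull request for a reviewer
# by combining code diff, data diff, and lineage context in one LLM prompt.

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real LLM client (OpenAI, Anthropic, etc.).
    return "LLM summary placeholder"

def summarize_change(code_diff: str, data_diff: dict, downstream: list[str]) -> str:
    prompt = (
        "You are reviewing a data pipeline change.\n"
        f"Code diff:\n{code_diff}\n\n"
        f"Data diff (metric deltas between main and dev branch): {data_diff}\n\n"
        f"Downstream assets that consume this table: {downstream}\n\n"
        "Summarize, in three bullet points, the most important impacts a "
        "reviewer should worry about, and flag anything that could break "
        "downstream dashboards or models."
    )
    return call_llm(prompt)

# Example inputs a system like this might assemble automatically.
summary = summarize_change(
    code_diff="- WHERE status = 'complete'\n+ WHERE status IN ('complete', 'refunded')",
    data_diff={"orders_per_day": "+5.4%", "rows_changed": 1240},
    downstream=["dashboard.weekly_revenue", "ml.churn_model_features"],
)
print(summary)
```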
(32:12):
And I really wanna dig into that one word that you said probably at least a half dozen times, if not, maybe a couple of dozen was that context, where that, I think, is the key piece that is
so critical and also probably the most difficult portion of making AI useful is context. What context does it need? How do you get that context to it? How do you model that context? How do you keep it up to date? And so I think that really is where the difference comes in between the cursor example that we touched on earlier versus the
(32:43):
retrofitting onto
Emacs or whatever your tool or workflow of choice is: how do you actually get the context to the place that it needs to be. And so you just discussed the use case that you have of being able to use the LLM in that use case of interpreting the various data diffs, understanding what the actual ramifications are
of this change. And I'm wondering if you can just talk through some of the lessons learned about how you actually
(33:08):
populate and maintain that context
and how you're able to
instruct the LLM how to take advantage of the context that you've given it? That's a great question, Tobias. And I think what's interesting
is that
at face value, it seems like you wanna throw
all the information you have at the LLM. Right? Just, like, tell it everything and then let it figure things out.
(33:32):
And in fact, it is obviously not as easy as that. It's actually counterproductive
to oversupply
the LLM with context, in part because
the context window of large language models is limited.
And the trade off there is,
one, you just, like, can't physically fit everything. And, two, even if you were dealing with a model that actually is designed to have a very large context window, if
(33:58):
you overuse it and supply too much information,
the LLM just gets lost. It also
starts
being far less effective in understanding what's actually important versus not, and the overall effectiveness of your system goes down.
So back to your question of, like, what is the actual information that is important to provide as context to the LLM? It really depends on what is the workflow
(34:21):
that we're talking about. In the context of a code review and testing,
where we are trying to fundamentally answer the question of, a, if we change the code,
was the change
correct relative to what we tried to do, what the task was,
or did we not conform to the business requirement?
(34:44):
The second question is,
did we follow the best processes
such as, you know, code guidelines and performance guidelines or not? And the third question is, okay. Let's say we conform to the business requirements.
We did a good job at following our coding best practices.
But we may still cause a business disruption
(35:04):
just by making a change that can
be a surprise either for a human consumer of data downstream or could throw off a machine learning model that was trained based on the different
distribution of data. Right? And so these are the three fundamental questions that we try to answer. And by the way, even without AI, that's what a good code review done by humans would ultimately accomplish.
(35:27):
So what is the context that is important for the LLM to have here? First, obviously, it is the code diff. Right? So we already know what the original code was and what the new code is. And
feeding that into the LLM is really important so that it can understand, okay, what are the actual changes in the code itself, in the logic. And I won't go into the details here because, obviously, the code base can be very large. Sometimes your PR can touch a lot of code, so you have to be quite strategic in terms of how you feed that in on the technical side. But conceptually, that's what we have to provide as an input, number one. The second important input is the data diff. Right? It's understanding
(36:06):
if I have a kind of main branch
version of the code,
understanding
what data it produces
and what are the metrics showing. Right? And then if I have a new version of the code, let's call it a developer branch,
what data it produces and what is the difference in the output?
Let's say,
with my main branch code, I see that I have 37
(36:28):
orders on Monday. But with the new version of the code, I see that I have 39.
And so that already tells me, okay, this is the important impact on the output data and on the metrics. And that's important both at the value level, understanding how the individual cells, rows, and columns are changing, but it's also important to do roll-ups and understand what the impact on metrics is.
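(Here is a tiny sketch of that data diff idea at the metric roll-up level: run the same aggregation against the main-branch and dev-branch versions of a table and report the deltas. The table names and the run_query helper are hypothetical placeholders; a real tool also diffs at the row and column level.)

```python
# Hypothetical sketch: metric-level data diff between two versions of a model.
# run_query() stands in for your warehouse client (Snowflake, BigQuery, etc.).

def run_query(sql: str) -> dict:
    """Placeholder that would execute SQL and return a single row as a dict."""
    raise NotImplementedError

def metric_diff(prod_table: str, dev_table: str) -> dict:
    metrics_sql = "SELECT COUNT(*) AS row_count, SUM(revenue) AS revenue FROM {table}"
    prod = run_query(metrics_sql.format(table=prod_table))
    dev = run_query(metrics_sql.format(table=dev_table))
    # Report the delta per metric; e.g. 37 orders on main vs 39 on the dev branch.
    return {name: dev[name] - prod[name] for name in prod}

# Usage: metric_diff("analytics.orders", "analytics_dev.orders")
```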
(36:50):
And coupling that context with the code diff allows us to understand how changes in the code affect the actual data output. And the third really important aspect is the lineage. So lineage is fundamentally understanding
how the data flows
throughout your system,
how it's computed, how it's aggregated, and how it's consumed.
(37:10):
And
the lineage is a graph, and there are kind of two directions of exploration. One of them is upstream, which helps us understand
how the data got to the point where you're looking at it. Right? So, for example, if I'm looking at the number of orders and I'm changing a formula,
where does the information about orders come from in the first place? And that is important because that can tell us a lot about how a given metric is computed and what the source of truth is. Are we getting it from Salesforce? Are we getting it from our internal system? And then the downstream lineage is also important because it tells us how the data gets consumed, and that is absolutely essential information that can help us understand
(37:50):
what downstream systems and metrics will be affected. And the lineage graph in itself can be very complex,
and building it is actually a tough problem because you have to essentially scrape all of your
data platform information, all the queries, all the BI tools
to understand how data flows, how it's consumed and produced. But let's say you have this lineage graph. It's actually also a lot of information by itself. And so
(38:14):
to properly supply that lineage information into
an LLM's context, you actually kind of need
your system to be able to explore the lineage graph on its own to see, like, okay, if the developer made a change here, what are the important
downstream implications of that? So now we're talking about kind of the system to be able to
(38:35):
kind of traverse that and do analysis on its own for the context. I would say these are the three most important types of context. And then the fourth one is kind of optional. If your team has any kind of best practices,
SQL linting rules,
documentation rules, you can also provide them as context, and then your kind of AI code reviewer assistant can
(38:56):
help you reason about, well, did you conform or not? And if not, make suggestions about what to correct. Eventually, probably going in and correcting your code itself. I think that's ultimately where this is going. But, again, it pretty much would be operating on the same set of input context.
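(To illustrate the lineage-exploration piece, here is a minimal sketch of walking a downstream lineage graph to collect the assets a change could affect, which is the kind of context a reviewer agent would need. The graph structure and asset names are made up for the example.)

```python
from collections import deque

# Hypothetical sketch: breadth-first traversal of a downstream lineage graph
# to find every asset that could be affected by a change to one table.

lineage = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["analytics.orders_daily", "analytics.customer_ltv"],
    "analytics.orders_daily": ["dashboard.weekly_revenue"],
    "analytics.customer_ltv": ["ml.churn_model_features"],
}

def downstream_assets(changed: str, graph: dict) -> list:
    seen, queue, affected = {changed}, deque([changed]), []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected

# Changing staging.orders affects dashboards and the ML feature table downstream.
print(downstream_assets("staging.orders", lineage))
```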
Another interesting element of bringing LLMs
into the context of the data engineering workflow and use case,
(39:20):
one is the privacy aspect, which is a whole other conversation. I don't wanna get too deep into that quagmire.
But, also,
when you're working as a data engineer,
one of the things you need to be thinking about is what is my data platform? What are the tools that I rely on? What are the ways that they link together? And if you're going
to rely on an LLM or generative AI as part of that tool chain, how does that fit into that platform?
(39:45):
What is some of the scaffolding? What are some of the workflows? What are some of the custom development that you need to do where a lot of
the first-pass and naive use cases for generative AI and LLMs are, oh, well, just go and open up the ChatGPT UI or just go run LM Studio or use Claude or what have you. But if you want to get into anything sophisticated where you're actually relying on this as a component of your workflow, you want to make sure that it's customized, that you own it in some fashion.
(40:14):
And so that is likely going to require doing some custom development using something like a LangChain or a LangGraph or
CrewAI or whatever, where you're actually building additional scaffolding logic around just that kernel of the LLM.
And I'm curious how you're seeing some of the needs and use cases of
incorporating
(40:34):
the LLM more closely into that actual
core capabilities
of the data platform through that effort of customization
and, software engineering.
That's a great point, Tobias. I think that the models themselves
are
getting rapidly
commoditized
in the sense that their capabilities,
(40:55):
you know, the foundational
large language models,
their
interfaces
are very similar.
Their capabilities
are similar. We're seeing a lot of race between the companies
training those models in terms of beating each other in benchmarks.
Looks like the whole industry is converging
on adding more reasoning, and then the ways that this is happening is also
(41:20):
converging on the same experience, and the difference is, like, who is doing this better? Right? Who is beating the benchmarks? Who provides the best,
the cheaper
inference, the faster inference,
more intelligence for the same price? And to that end, I don't think that
differentiation
or the effectiveness of whatever automation you're trying to bring really depends on the choice of the model. Maybe for certain narrow applications,
(41:46):
actually, maybe choosing a more specialized model and/or fine-tuning a model would be more applicable. But still, I don't think the model is really where the magic happens these days.
The model is important for the magic, but it's not something that actually allows you to build a really effective application
by just, you know, choosing something better than what's available to everyone else. The actual
(42:09):
magic and the value add and the automation
happens
in how you leverage that model in your workflow. So all the orchestration
in terms of how do you prompt the model, what kind of context do you provide,
how do you tune the prompt, how do you tune the inputs,
how do you
evaluate the performance of the model
in production,
(42:30):
how do
you make various LLM-based actors that may be playing different roles
interact with each other. That is where
the hard work
is happening, and that is where I think
the actual
value and impact is created. And that's where all the complexity
is. So I think you don't have to be, you know, a PhD
(42:52):
and really understand
how the models are trained. Although, I would say just like in computer science, it's obviously very helpful to understand how these models are trained and their architectures and their trade offs. But you don't have to be good at, you know, training those models in order to effectively leverage them. But to leverage them, you have to do a lot of work to effectively plug them in the workflows. And I think that the applications and companies and teams that are thinking about what is the workflow, what is the ideal user interface,
(43:21):
what all the information is that we can gather to make the LLM do a better job, and then are able to rapidly iterate will ultimately create the most impact with LLMs.
And so on that note, in your experience of working with the LLMs, working with other data teams, and keeping apprised of the evolution of the space, what are some of the most interesting or innovative or unexpected ways that you've seen teams
(43:45):
bring LLMs into that inner loop of building and maintaining and evolving their data systems?
I think the realization that is obvious in hindsight, but
not necessarily obvious when you're just starting,
is that
no one really knows how to
ship LLM AI based applications.
(44:06):
There are obviously, you know, guides and tutorials
and
still, like, there's a lot you can learn from looking at what people are doing,
but
the field is evolving so fast that
nothing replaces
fast experimentation
and just building things.
It's not that you can
(44:26):
just hire someone who worked on building an LLM based
application,
like, six months ago, a year ago, and all of a sudden, you, you know, gain a lot of advantage
as you would with many other technologies. Like, you know, if we were,
I guess, working in a space of video streaming, it will be very beneficial
(44:47):
to have
extensive experience with working with video streaming and codecs. And with
LLMs,
one,
no one really knows exactly how they work or, like, how they behave. Right? Even the companies that are shipping them are discovering
more and more novel ways of leveraging them more effectively
(45:07):
every week.
And for the teams that
are leveraging LLMs, like Datafold,
the thing that we found
matters the most is the ability to,
a, just stay on top of the field and understand what the
most exciting things are that people are doing, how they relate to our field, and how we can borrow some of those ideas.
(45:31):
But most importantly
is rapid experimentation
with some sort of methodology that allows you to try new things, measure results
quickly,
and then being able to scrap your approach that you thought was great and just go with a different one. Because a lot of times when a new model is released,
you have to kind of adjust a lot of things. You have to adjust the prompts. You have to
(45:54):
even rearchitect some of the flows that you build.
And that is both
difficult but also incredibly exciting because the pace of innovation
and what is possible to solve
is evolving extremely fast. I would say the fastest of any previous technological
wave of disruption that we've seen.
(46:16):
In your experience and in your work of investing in this space, figuring out how best to apply LLMs to the problems facing data engineers and how to incorporate that into your products, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
Yeah. I I think that the
the interesting realization
(46:39):
was that specifically for data engineering domain again,
if you just take the problem at face value, you think, well, let's just build a copilot or an agent that would kind of try to automate the data engineer away.
And I don't think we have the tech ready for an agent to just, like, really take a task and run with it yet. I don't think it's been solved in software space. I think it's, in some ways, even harder to solve in data space. We'll eventually get there. I don't think we are there yet.
(47:09):
I also don't think that the biggest impact you can make on the data engineering workflow, again, is, like, having a
copilot because
that's not where
data engineers spend most of their time in terms of, like, writing production code. It's all operational tasks. And
there are certain kinds of problems
in the data engineering space where
(47:31):
it's not even a day-to-day thing where, you know, you help, you save, like, an hour, two hours, three hours.
But there are certain types of workflows
where
to complete a task,
a team needs to spend, like, ten thousand hours.
And a good example of such a project would be a data platform migration where, for example, you have
(47:54):
millions of lines of code on legacy database.
You have to move them over
to a new modern data warehouse.
You have to refactor them, optimize them,
repackage them into a new kind of framework. Right? You may be moving from, like, stored procedures
on Oracle
to DBT plus
(48:14):
Databricks.
And
doing that requires
a certain number of hours for every object. And because you're dealing with a large database, at the enterprise level that sums up to an enormous amount of work.
And, historically, these projects would last years and be done by, a lot of times, outsourced talent from, you know, consultants or SIs.
(48:36):
And
for a data engineer, that's, like, probably one of the most miserable projects to do. I've led a project like that at Lyft, and it's been an absolute grind where you're not shipping new things. You're not shipping AI. You're not even shipping data pipelines. You're just, like, solving technical debt for years.
And what's interesting is that those types
of projects and workflows
(48:58):
are actually,
I would say, where
AI and LLMs can make today
the most impact
because we can take a task.
We can reverse engineer it.
We know exactly what the target is: you know, you move the code, you do all of these things with the code, and, ultimately, the data has to be the same. Right? You're moving
(49:20):
you're going through multiple complex steps, but what's important for the business is once you move from, let's say, you know, Teradata to Snowflake,
your
output
is the same because, otherwise, the business wouldn't accept it. And that allows us to, a, leverage LLMs for a lot of the tasks that are historically manual,
but also have a really clear objective function for the LLMs,
(49:43):
like, diffing the output of the legacy system against the modern system and using it as a constraint.
And if you put those two things together, you have a very powerful system that is, a,
extremely flexible and scalable thanks to LLMs,
but also
can be
constrained to a very objective definition of what's good.
(50:04):
You know, unlike a lot of this text to SQL generation that cannot be constrained to the definition of what's good. Because,
like, how do you know? But by the end of a migration, you do know.
And
that allows
AI
to make a tremendous impact on the productivity of a data team by essentially taking a project that would last, like, four years,
(50:24):
cost millions of dollars, and go over budget,
and constraining that into
weeks
and, you know, just a fraction of the price. I think that is where
we can
see real impact of AI that's, like, useful. It's working.
And we also see the parallels in the software space as well. A lot of the, like, really thoughtful enterprise applications of AI are actually taking these legacy code bases and, you know, helping teams maintain them
(50:50):
or migrate them.
And I think that
there are more opportunities like that in the data engineering space where we'll see AI make tremendous impacts.
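(A rough sketch of that translate-and-validate loop: convert each legacy object with an LLM, then use a data diff between the legacy output and the new output as the objective function, retrying until they match. The translate_sql, run_legacy, run_modern, and outputs_match helpers are hypothetical stand-ins, not any vendor's migration agent.)

```python
# Hypothetical sketch: LLM-driven code translation constrained by a data diff.
# Each helper below is a placeholder for real infrastructure.

def translate_sql(legacy_sql: str, feedback: str = "") -> str:
    """Ask an LLM to rewrite legacy SQL (e.g. Teradata) for the new warehouse."""
    raise NotImplementedError

def run_legacy(sql: str):
    """Execute on the legacy system and return the result set."""
    raise NotImplementedError

def run_modern(sql: str):
    """Execute on the modern warehouse and return the result set."""
    raise NotImplementedError

def outputs_match(a, b) -> bool:
    """Data diff: the migrated query must reproduce the legacy output."""
    return a == b

def migrate_object(legacy_sql: str, max_attempts: int = 3) -> str:
    expected = run_legacy(legacy_sql)
    feedback = ""
    for _ in range(max_attempts):
        candidate = translate_sql(legacy_sql, feedback)
        if outputs_match(expected, run_modern(candidate)):
            return candidate
        # Feed the mismatch back to the LLM and try again.
        feedback = "previous attempt produced different output; fix the logic"
    raise RuntimeError("translation did not converge; needs human review")
```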
And as you continue
to keep in touch with the evolution in the space, work with data teams,
evaluate
(51:11):
what are the cases where LLMs are beneficial versus you're better off going with good old human
ingenuity.
What are some of the things you're keeping a particularly close eye on or any projects or context you're excited to explore?
In terms of where I think that LLMs would really make a huge impact on the workflow?
(51:33):
Just LLMs in general, how to apply them to data engineering problems, how to incorporate them more closely and with less legwork into the actual problem solving apparatus of an organization.
Yeah.
So I think that
on multiple levels, there's a lot of exciting things. Like, for example,
(51:53):
being able to prompt an LLM
from SQL as a function call
that's available these days in modern data platforms
is incredibly
impactful. Right? Because in many instances, we're dealing with extremely massive data.
And instead of having to write, like, complex case-when statements
(52:13):
and regexes and, like, UDFs
to be able
to clean the data, to classify things, and to just untangle the mess,
we can now apply LLMs from within SQL, from within the query to solve that problem.
And that is
incredibly impactful
for a whole variety of different applications. So I'm very excited about all these capabilities that are now, you know, brought by the major data platforms like, you know, Snowflake, Databricks,
(52:41):
BigQuery.
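(For illustration, here is a hedged sketch of what calling an LLM from inside SQL can look like. The exact function name differs by platform, for example Snowflake's SNOWFLAKE.CORTEX.COMPLETE or Databricks' ai_query, and the connection setup below is a generic placeholder, so treat the syntax and model name as assumptions to check against your warehouse's documentation.)

```python
# Hypothetical sketch: classifying messy text with an LLM function inside SQL,
# instead of maintaining piles of regexes and CASE WHEN statements.
# get_connection() is a placeholder for your warehouse client.

def get_connection():
    raise NotImplementedError

CLASSIFY_TICKETS_SQL = """
SELECT
    ticket_id,
    -- Platform-specific LLM function; e.g. SNOWFLAKE.CORTEX.COMPLETE on
    -- Snowflake or ai_query on Databricks. Verify the exact name, signature,
    -- and available models for your platform before running this.
    SNOWFLAKE.CORTEX.COMPLETE(
        'llama3-8b',
        'Classify this support ticket as billing, bug, or feature request: '
        || ticket_text
    ) AS ticket_category
FROM support_tickets
"""

def classify_tickets():
    with get_connection() as conn:
        return conn.cursor().execute(CLASSIFY_TICKETS_SQL).fetchall()
```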
I think if we go into the workflow itself, like,
what does a data engineer do and how do we make that work better, I think there's a ton of opportunity to
further automate a lot of tasks. I think a big one is data observability and monitoring.
I honestly think that data observability in its current state is a dead end in terms of, like, let's cover all data with alerts and monitors and then
(53:08):
be the first to know about any anomalies.
It's useful, but then it quickly leads to a lot of noise,
alert fatigue, and ultimately
could even be kind of net negative on the workflow
of a data engineer.
I think that this is a type of workflow where
putting an AI to
investigate those alerts,
(53:30):
do the root cause analysis, and potentially
remediation
is where I see a lot of
opportunity for
saving a ton of time for the data team while also improving the SLAs
and the overall
quality of the output of a data engineering team. And that's something that we are really excited about.
(53:51):
Something we're working on at Datafold, and we are excited about it coming later this year.
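(As a sketch of that triage idea, here is a hypothetical loop that takes an anomaly alert, gathers upstream lineage and recent code changes as context, and asks an LLM for a root-cause hypothesis before a human ever looks at it. All of the helpers and data shapes are illustrative assumptions, not a description of any shipping product.)

```python
# Hypothetical sketch: LLM-assisted triage of a data quality alert.

def call_llm(prompt: str) -> str:
    # Placeholder for any LLM completion API.
    return "root-cause hypothesis placeholder"

def triage_alert(alert: dict, upstream_tables: list, recent_changes: list) -> str:
    prompt = (
        f"A monitor fired: {alert['description']} on table {alert['table']}.\n"
        f"Upstream tables feeding it: {upstream_tables}\n"
        f"Code changes merged in the last 24 hours: {recent_changes}\n"
        "Rank the most likely root causes, say which change or upstream table "
        "to inspect first, and state whether this looks like a false alarm."
    )
    return call_llm(prompt)

# Example of the kind of context an observability system might hand over.
hypothesis = triage_alert(
    alert={"table": "analytics.orders_daily", "description": "row count dropped 40%"},
    upstream_tables=["staging.orders", "raw.orders"],
    recent_changes=["PR: tighten status filter in staging.orders"],
)
print(hypothesis)
```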
Are there any other aspects of this overall space of using LLMs
to improve the lives of data engineers
and the work that data engineers can do to improve the effectiveness of those LLMs that we didn't discuss yet that you'd like to cover before we close out the show?
(54:12):
I think that, you know, we talked a lot about kind of
the workflow improvements.
I think that, overall,
my recommendation to data engineers
today
would be to
learn how to ship LLM applications. It's not that hard.
Frameworks like LangChain
make it very easy to compose multiple blocks together and ship something that works. Whether or not you end up using
(54:37):
LangChain or another framework in production,
and whether your, you know, team allows that, doesn't really matter, but it's really, really, really useful to
try and build and learn all the components.
And
it's just like software engineering. You know?
Learning how to code opens up so many opportunities for you to solve problems. Right? You see a problem and you're like, I can write a Python script for that. And I think that with LLMs,
(55:05):
it's almost like a new skill that both software engineers and data engineers need to learn where
you see a problem and you think that, okay. I actually think I can
split the problem
into three tasks that I can give to an LLM. Like, one would be extraction. Another could be, like, reasoning and classification.
And now
it just solves the problem.
(55:26):
But really learning how to build and trying helps you build that intuition. And so my recommendation for all data engineers listening to this is
try to build your own application that solves either a business problem or helps you in your own workflow
because knowing how to build with LLMs just gives you tremendous superpowers and will definitely be helpful in your career
(55:49):
in the coming years.
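(In the spirit of that advice, here is a toy example of decomposing a problem into a couple of LLM steps, extraction followed by classification, without committing to any particular framework. The call_llm helper is again a placeholder for whichever client or framework you choose.)

```python
# Hypothetical sketch: chaining two LLM calls, extract then classify,
# the kind of small composition LangChain and similar frameworks formalize.

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real client (OpenAI, Anthropic, a LangChain chain, ...).
    return "LLM output placeholder"

def extract_vendor(invoice_text: str) -> str:
    return call_llm(f"Extract only the vendor name from this invoice:\n{invoice_text}")

def classify_spend(vendor: str) -> str:
    return call_llm(
        f"Classify spend with vendor '{vendor}' as one of: software, travel, office, other."
    )

def process_invoice(invoice_text: str) -> dict:
    vendor = extract_vendor(invoice_text)
    return {"vendor": vendor, "category": classify_spend(vendor)}

print(process_invoice("Invoice #1042 from Acme Cloud Hosting, total $1,200"))
```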
I definitely would like to reinforce that statement because
despite
the AI maximalists, the AI skeptics,
no matter what you think about it, LLMs aren't going anywhere. They're going to continue
to grow in their usage and their capabilities, so it's worth
understanding how to use them and investing in that skill because it is going to be one of those
(56:13):
core tools in your toolbox for many years to come. And so for anybody who wants
to get in touch with you and follow along with the work that you are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get what your current perspective is on the biggest gap in the tooling or technology for data management today.
(56:34):
I think that there's a lot of kind of skepticism
and some bitterness around
how the modern data stack kind of failed us, in the sense that we were so excited five years ago that the modern data stack would make things so great,
and we're kind of disappointed.
And
I think that I'm an optimist here. I think that modern data stack in the sense of infrastructure
(56:56):
and getting a lot of the
fundamental challenges out of the way, like running
queries
and getting data in and out of different databases and visualizing the query outputs and having amazing
notebooks.
All of that
that we now take for granted is actually
so great relative to where we were, you know, five, seven, eight, ten years ago.
(57:20):
I don't think it's enough. So I think that,
I am with the data practitioners who are, like, well,
it's 2025.
We have all these amazing models.
Why is it still so hard to ship data?
Absolutely with you. And I think what I'm excited about is now that we have this really great foundation with
modern data stack in the sense of infrastructure,
(57:41):
I'm excited
about, one,
getting everyone on modern data stack to the point of migrations. Right? Let's get everyone on modern infrastructure so that they can ship faster.
Obviously, a problem that I'm really passionate about solving and working on.
Second, once you are on the modern data infrastructure,
how to keep modernizing
(58:03):
your team's workflows so that
data engineers are spending more and more time on solving hard problems and thinking and planning
on the valuable activities that are really worth their time and less and less on operational toil that just
is burnout inducing and holds everyone back. So
I'm excited about the modern data stack renaissance, thanks to the fundamental capabilities of large language models.
(58:29):
Absolutely. Well, thank you very much for taking the time today to join me and sharing your thoughts and experiences around building with LLMs to improve the capabilities of data engineers. It's definitely an area that we all need to be keeping track of and investing some time into. So I appreciate the insights that you've been able to share, and I hope you enjoy the rest of your day.
Thank you so much, Tobias.
(58:58):
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
(59:22):
with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and
coworkers.