Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Tobias Macey (00:11):
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
Data teams everywhere face the same problem. They're forcing ML models, streaming data, and real time processing through orchestration tools built for simple ETL.
The result, inflexible infrastructure that can't adapt to different workloads.
That's why Cash App and Cisco rely on Prefect.
(00:33):
Cash App's fraud detection team got what they needed, flexible compute options, isolated environments for custom packages, and seamless data exchange between workflows.
Each model runs on the right infrastructure,
whether that's high memory machines or distributed compute.
Orchestration is the foundation that determines whether your data team ships or struggles.
ETL, ML model training, AI engineering,
(00:56):
streaming, Prefect runs it all from ingestion to activation in one platform.
WHOOP and 1Password also trust Prefect for their data operations.
If these industry leaders use Prefect for critical workloads, see what it can do for you at dataengineeringpodcast.com/prefect.
Composable data infrastructure is great until you spend all of your time gluing it back together.
(01:19):
Bruin is an open source framework driven from the command line that makes integration a breeze.
Write Python and SQL to handle the business logic and let Bruin handle the heavy lifting of data movement, lineage tracking, data quality monitoring, and governance enforcement.
Bruin allows you to build end to end data workflows using AI, has connectors for hundreds of platforms, and helps data teams deliver faster.
(01:40):
Teams that use Bruin need less engineering effort to process data and benefit from a fully integrated data platform.
Go to dataengineeringpodcast.com/bruin
today to get started. And for DBT Cloud customers, they'll give you a thousand dollar credit to migrate to Bruin Cloud.
Your host is Tobias Macey, and today I'm interviewing Omri Lifshitz and Ido Bronstein about the challenges of keeping up with the demand for data when supporting AI systems.
(02:05):
So, Omri, can you start by introducing yourself?
Omri Lifshitz (02:08):
Hi. Glad to be here. I'm Omri, the cofounder and CTO of Upriver.
Tobias Macey (02:12):
And, Ido, how about yourself?
Ido Bronstein (02:14):
Yeah. So hello.
I'm Ido, and I'm the CEO and cofounder of Upriver.
Tobias Macey (02:20):
And, Omri, do you remember how you first got started working in data?
Omri Lifshitz (02:24):
Yeah. So my journey started about
fifteen years back.
It started in the military. I was working in cybersecurity,
and I was fortunate enough to work across the entire value chain of cybersecurity operations
from doing really low level things, reverse engineering, knowing how we're able to get things running wherever we need them, all the way to building the data pipelines, collecting data from these cybersecurity operations.
(02:48):
And a big part of what we had to do is essentially
making sure that we're able to bring the right data at the right time, make this usable to all of these people on top of our data platform. So there, we had to deal daily with challenges of how do you maintain huge scale data pipelines and make this accessible to intelligence officers.
Tobias Macey (03:07):
And, Ido, do you remember how you got started in data?
Ido Bronstein (03:10):
Yeah. So in a similar way, I worked in the intelligence unit on cybersecurity operations and then on the data infrastructure. I was lucky to lead our internal data platform, the platform that collects all the different data sources that we gather, human intelligence, imagery, and, like, the variety of data that we have there, and to be in charge of all the layers of the stack, from the infrastructure to the
(03:34):
data pipeline management and orchestration and then to the application. And our main goal was to bring data, highly reliable and at a high pace, to the intelligence officers in a way that they could use. And all of those experiences are what led us to start Upriver.
Tobias Macey (03:55):
And so in terms of
the overall space of data and the growing demands of AI systems,
obviously, there is a lot of additional complexity that's getting layered on top of the inherent complexity of dealing with data systems that we've been struggling with for several decades now. But as we
(04:15):
add AI
to the set of consumers for these various
data platforms and data streams,
what are some of the ways that you're seeing that introduce a gap either in terms of capabilities
or,
structure or just some of the points of friction that we're dealing with as we try to feed this
(04:36):
new set of requirements
into these new consuming systems that are now increasingly
dealing with a broader range of consumers.
Ido Bronstein (04:46):
Amazing. From a high level perspective, AI really accelerates the demand for data in organizations. And you can see it from both sides of the pipeline.
First of all, AI enables us to extract data from any digital asset: images, PDFs, conversations.
(05:07):
And today, businesses really understand that they can actually use data in every piece of their business. So you have an enormous amount of data that the organization wants to use. And it also helps the organization understand that now is the time to use this data because, otherwise, it will fall behind. And on the other side, it helps us to make this data
(05:30):
much more accessible
because now any person in the organization can take a CSV and put it in ChatGPT, ask questions, and get, like, amazing analysis. I'm sure that you have had the chance to, like, try doing some finance or marketing analysis using ChatGPT. Its answers are
(05:51):
very sophisticated,
and really help you to understand how to use this data. But the middle layer that helps you to take this vast amount of data that you have in the sources that AI lets you collect, and get it to the point that this data is really usable for the AI, this is still a bottleneck as I see it today in the market.
(06:13):
And when we are talking with CIOs, CDOs,
this is something that is not solved yet by AI.
Tobias Macey (06:21):
So as you pointed out, the introduction of AI
adds new capabilities
to the processing and production of data because of the fact that we can bring in more of these unstructured assets that have typically been very difficult to
operationalize.
And so that's one side of the equation. The other side of the equation being things like RAG pipelines or the whole chat with your documents
(06:45):
capability. But as we move more into agentic systems where we need to be able to do things like
manage memory state,
provide
up to date information
to those agents to make sure that the decisions that they're making or the information that they're providing is actually accurate given the context of the problem that they're trying to solve
(07:07):
for. That's two different sides of the coin where AI can help us accelerate our production of data assets, but it also means that we have a higher demand for those data assets. And so if you're not using AI, in that production pipeline,
there's a good chance that you're just going to be drowning in requests or
that you're going to be building agents that are constantly failing to provide useful capabilities because they don't have the context that they need. And I'm just wondering how you're seeing teams
(07:35):
deal with that challenge. And in particular, when you're talking about using AI in the production
phase, how do you prevent that from just causing costs to run rampant?
Ido Bronstein (07:46):
Yeah. So I think that you were on point in how you, like, interpreted what I said. And the way that I see it, these, like, two parts of the pipeline are disconnected, and this is not a new problem. Like, the ability to make sense of, like, even structured data, it still requires, like, manually curating the data,
(08:10):
put the right context,
join between different sources, understand what is true, what is not correct data.
And only after you
did this process,
curated the data, someone can really use it. And this time it is for agents or RAG systems or, like, very complex LLMs. Someone needs
(08:30):
to successfully
connect the, like, hundreds of data assets that we collect to the semantics of the business, to the right standards for the business. So when you connect this to agents, they will work with the right context and in a smart and correct way. And this is exactly the work of, like, managing the data. This is what we've done in, like, the last twenty years, just focusing on structured data. Of course, you have
(09:00):
the equivalent for unstructured data or any other use of data that you want to do. And today, like, this is the point where the agents break, and what causes them not to move from the POC stage to the production stage, because the data in production is a mess in its raw phase.
(09:22):
And without working on ordering this mess in production, the agent won't work. And by now everyone knows the phrase garbage in, garbage out. And production data in its raw form, and I think this is correct for most organizations, is garbage. Someone needs to curate the data and make sense out of it.
Tobias Macey (09:42):
And the other problem with
using
AI in that production phase is that it can be very straightforward in some cases
to build a proof of concept and say, hey. I threw AI at my data pipeline, and now I've got this text document giving me structured output. But as with any AI driven system, there is a lot of potential for things to go wrong as you try to scale that and operationalize it and actually start to depend on it
(10:11):
for feeding downstream systems. Because
as you introduce
error rates and you feed them together, those error rates compound. So where maybe at the introduction of AI in doing document processing, you're okay with a 2% error rate of maybe it misinterpreted
some of the phrasing of a document that you're processing. But then, as you do more analysis, particularly in an agentic context,
(10:37):
of that data, that 2% compounds to 5% or 10%, and then all of a sudden, you're playing the worst game of telephone ever, and you're getting complete garbage output to your end users, who are then starting to lose faith in your capability to give them the data that they need. And I'm wondering how you're seeing some of the
skills gap manifest in terms of people who are very adept at building production
(11:01):
data pipelines,
but they don't necessarily understand
the
requirements of building production
AI systems where maybe they need to manage things like evaluation or model selection,
or maybe they need to do fine tuning to be able to increase the efficiency of the model that they're using and drive down cost
(11:21):
for something that requires high uptime usage. And I'm just curious how that's manifesting in teams who are just being told, hey, here, I need to be able to process this data now because we have AI, and it can do it. But there are a lot more
supporting systems that need to be put in place as well.
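To make the compounding concrete, here is a minimal Python sketch of the arithmetic being described; the stage names and rates are illustrative assumptions, not figures from the conversation:

    # Roughly independent per-stage error rates compound across a pipeline.
    stage_error_rates = [0.02, 0.03, 0.05]  # e.g. extraction, enrichment, agentic analysis

    survives_all_stages = 1.0
    for rate in stage_error_rates:
        survives_all_stages *= (1.0 - rate)  # chance a record passes this stage cleanly

    print(f"end-to-end error rate: {1.0 - survives_all_stages:.1%}")  # ~9.7%

Even modest per-stage error rates quickly add up to an end-to-end rate that downstream consumers will notice.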
Omri Lifshitz (11:39):
Yeah. So I think you nailed it, like, with the difference between doing a POC and saying, okay, I can now process tons of PDFs, versus actually making this production ready. And, like, that's a big thing that teams need to overcome, and one of the things we've needed to deal with when we've built our system, which is based on AI as well. Like, how do you actually make sure that you know how to do these evaluations, how to make sure you're building out the right process? And I think one critical thing that has stayed the same is that good engineering is still good engineering. Like, how do you build out a system that knows how to deal with faults? How do you build out the right process to make sure these things are done correctly?
(12:17):
And I think there is still a gap, and even the market is still trying to understand exactly what the right way of doing this is. Doing these evals, using agents and LLMs to check agents. There are a lot of things coming into the system that we now see, like, in the industry. But I think in general, a lot of teams are moving to this phase where they're starting to understand it requires a different mindset. You need to understand how you productionize these things and not just have POCs on top of them.
Tobias Macey (12:46):
And so
beyond the challenges of using AI and productionizing
that to manage your data feeds, there's also the bigger question and something that I think is broader beyond just being able to process unstructured data assets
is how do you determine what is the data that is actually going to be useful for an AI agent to perform a given task? And I'm wondering how teams are dealing with that side of the equation as well of identifying
(13:13):
what are the data assets,
one, that they have if they don't already have a decent catalog of it, but, also, how do they ensure that they're wrapping it in the appropriate semantics for the agent to be able to understand
where, when, and how to actually apply it to the problems that they're given?
Omri Lifshitz (13:30):
So that's a great question, and I think this is exactly like one of the premises that we had when starting Upriver. The ability to serve: if you want to get these models working correctly, you need to give them the right context and the right data. And, oftentimes, organizations,
they might have a data catalog. Usually, that's not updated correctly, I think. Like, a lot of companies have this catalog
(13:53):
phobia or fatigue by now. They don't know exactly which assets they have. They don't know what the semantics of the data exactly are when they're trying to use this, especially now pushing this to an LLM to actually take on these tasks. So I think one of the key things that we've done in Upriver, and this now goes to how we've also built our system to help do things, is you need to do three different things in order to actually be able to use models effectively and agents correctly. One is collecting the right data. The second, and that's a critical piece, is curating this into something that actually
(14:26):
encapsulates and captures the ontology and semantics of what you actually have in your system. And the third is how you serve this to models. Correct? So that way you can actually make this usable. And I think if you skip any of these steps in the way, you're just going to get something that doesn't
meet your expectations,
and then you're probably gonna be disappointed with the results you're getting. Because just writing an SQL query is quite easy. If you know exactly how to structure it, what you wanna get from the data, that's quite easy to do. Being able to do these fully agentic flows, where we're saying, I wanna build a pipeline that now allows me to check so and so, requires you to understand exactly what you have, what this means, and how this relates between the different entities in your system, and then how you serve this to the model in the right way. And these are the three components that are critical for actually being able to use AI here and making data available to AI.
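As a rough illustration of that collect, curate, serve split, here is a hypothetical Python sketch; the function and class names are invented for this example and do not describe Upriver's actual implementation:

    from dataclasses import dataclass, field

    @dataclass
    class CuratedAsset:
        name: str
        description: str                     # business meaning, not just a column list
        columns: dict[str, str] = field(default_factory=dict)
        relationships: list[str] = field(default_factory=list)  # links to other entities

    def collect(sources: list[str]) -> list[dict]:
        # Step 1: pull raw tables, files, or streams from each source system.
        return [{"source": s, "rows": []} for s in sources]

    def curate(raw: list[dict]) -> list[CuratedAsset]:
        # Step 2: attach ontology and semantics so a model sees meaning, not just schema.
        return [CuratedAsset(name=r["source"], description=f"Curated view of {r['source']}")
                for r in raw]

    def serve(assets: list[CuratedAsset], task: str) -> str:
        # Step 3: format (and in practice, rank and filter) only the context relevant to the task.
        return f"Task: {task}\n" + "\n".join(f"{a.name}: {a.description}" for a in assets)

Skipping any one of the three steps tends to show up downstream as the disappointing results Omri describes.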
Tobias Macey (15:15):
The other interesting piece is that, generally speaking, as you provide more semantics and more context to the assets that you're producing, it's also beneficial to humans.
The main difference
being
that AIs can process much faster and at a broader scale than an individual human in terms of doing the discovery and doing the interpretation of which data assets to apply for certain use cases? And as we do open up the doors for these AI to do a broader analysis and broader consumption of those data assets,
(15:48):
how does that shift maybe the
visibility or highlight the gaps in the quality or reliability of data assets that have maybe been not necessarily neglected, but at least not used as actively by humans who have a much narrower focus on, oh, I'm going to check this dashboard, or I use this data feed for performing this particular task. And we say, now we can actually take advantage of the broader set of data assets that we've been collecting and maybe not paying as close attention to.
Ido Bronstein (16:19):
So I think that you phrased it exactly as I see it. Like, agents and AI need the same things that humans need when they're accessing the data. They need the data to be with no mistakes. They need to have the right semantics with a connection to the business context.
(16:39):
And without this, agents, AI, and humans cannot really use the data correctly. The thing that changes with AI is, first of all, as you said, it can process much, much more data. And the second thing, it doesn't have, like, external context. It has only the context that it sees when it accesses the data. It doesn't know, like, what
(17:02):
some other person told it in the kitchen over a coffee when it came to do the job. It just uses the context that someone put into the data.
So
in a sense, the problem
of how we manage the data, how we create the right context, cataloging, semantics, create, like, high quality datasets
(17:24):
does not really change, but we need to do it at scale and fast. The importance of doing so just accelerates. Now you need to have proper semantics and proper standards for all your data, not only for the tables that the analysts use right now. And I think in that sense,
(17:47):
the importance of
managing the data and being able to structure it correctly
is really more important than ever before.
Tobias Macey (17:58):
Another interesting aspect of bringing AI to bear is that particularly when we're talking about a data warehouse context, there are established patterns for the structural semantics of the data, whether that's a star schema or a data vault or whichever
tribe you are a member of. And I'm curious how
(18:20):
using AI as the access layer versus
human analyst who is handcrafting these SQL queries or a BI tool that is using these dimensional structures to do a visual navigation,
are those still beneficial? Are those still the best way for us to be thinking about structuring the overall data assets, or do
(18:42):
the semantics and capabilities
of these AI systems and AI agents change the ways that we need to be thinking about the foundational structure and the foundational
semantics of the data assets that we're producing?
Ido Bronstein (18:56):
Great question. So first of all, I think that nobody has figured it out yet, what the right architecture is for doing an effective data warehouse for AI. I'm sure that there are things that will be preserved. For example, you need to create some kind of intermediate tables in order to be able to create this data efficiently.
(19:19):
And I'm sure that the ability to curate the data like bronze, silver, gold, in a sense, will stay because we want to clean the data and then put the right semantic.
But whether star schema exactly is the correct architecture for AI, I'm not sure.
I think that
(19:39):
it's something that we see every organization solve differently today, especially trying to stitch AI onto its current architecture, and it works. Like, you don't need to change all your architecture in order for AI to be able to process data. You just need to
make the data reliable,
high quality with the right
(20:01):
semantics, and we will see how it will evolve.
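For listeners less familiar with the bronze, silver, gold framing, here is a toy Python sketch of the idea; the records and rules are made up purely for illustration:

    # Bronze: raw landing zone, kept exactly as ingested.
    bronze = [
        {"user": " Alice ", "amount": "42.5", "ts": "2024-01-03"},
        {"user": "bob", "amount": "oops", "ts": "2024-01-04"},
    ]

    # Silver: cleaned and typed; bad records dropped or quarantined.
    silver = []
    for row in bronze:
        try:
            silver.append({"user": row["user"].strip().lower(),
                           "amount": float(row["amount"]),
                           "ts": row["ts"]})
        except ValueError:
            pass  # in practice, route to a quarantine table for review

    # Gold: business-level semantics, e.g. spend per user, ready for analysts or agents.
    gold: dict[str, float] = {}
    for row in silver:
        gold[row["user"]] = gold.get(row["user"], 0.0) + row["amount"]

    print(gold)  # {'alice': 42.5}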
Omri Lifshitz (20:06):
Yeah. And I will add to that one of the things that you did talk about, like, the tribes you have, each one believing they have the right schema and the right way to structure it, snowflake schema, star schema, and so on. I think one of the interesting things we saw while working with companies
is that the models are very good at capturing the way in which you've structured your data and working on top of that once you've put a good data model in place. So the need to do it and to define what your data model is and how you want it to look, that's still something we'll need to do. I think the difference between whether we're doing it this way or that way, the discrepancy will probably become,
(20:44):
much smaller over time because the models are able to understand it. So there it's kind of a preference. And as you just said, we're still not sure what the best schema will be and what the best data model will be going forward. But we do see that even when you're using different things, the LLMs are able to capture the essence of the data from it.
Tobias Macey (21:03):
Another interesting aspect of the fact that we do have these AI systems
consuming the data and we need more data to be able to make them as useful as possible is the challenges of also not wanting to flood their context window. Because if you give it too much data, then it goes from being your smart sidekick to your dumb sidekick.
(21:25):
And the contrary is also true where it could be your dumb sidekick if you don't give it enough data, and I'm wondering how you're seeing teams think about that balance as well of either not feeding it too much data because it's not able to differentiate the useful stuff from the not useful stuff or making sure that you're
not starving it of data and just figuring out what that balance is. And, obviously, I'm sure the answer is it depends, but I'm wondering how you're seeing teams go through that discovery process and then maintain that balance as their systems evolve and as their data feeds evolve.
Omri Lifshitz (21:58):
Yeah. So I think one of the critical things here, and I talked a bit about this earlier, is the fact that you have to both curate the context for the kind of tasks you wanna do. And then there is a critical aspect of how you serve this context out. So how do you make sure that you're delivering the right context at the right time? And maybe sometimes that means you wanna,
minimize your context, like, the context you already have, summarize that, and then move it to a kind of sub agent architecture,
(22:24):
which focuses on something specific. So there are a lot of things you need to deal with in order to actually solve the issue. And that means, how do I map the context that I wanna bring together for a certain task? Once I have that, how do I wanna continue maintaining this over a conversation and a task for an agent? There are a lot of different nuances there. And, again, I didn't wanna give the answer you said, but it depends exactly what you wanna do. But the curation and the serving aspect of this context, I think, are two of the most critical things we're seeing today with how people are engaging with agents and LLMs, and this will be something going forward as well.
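A toy sketch of that summarize-then-delegate pattern, with call_llm standing in for whichever model API is actually in use; nothing here reflects a specific product:

    MAX_CONTEXT_CHARS = 4000  # crude budget; real systems count tokens

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder for a real model call

    def summarize(context: str) -> str:
        return call_llm("Summarize the facts below in at most 20 bullets:\n" + context)

    def run_subagent(task: str, context: str) -> str:
        # Serve a curated, size-bounded slice of context to a focused sub agent.
        if len(context) > MAX_CONTEXT_CHARS:
            context = summarize(context)
        return call_llm(f"Task: {task}\nRelevant context:\n{context}\nAnswer concisely.")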
Tobias Macey (23:01):
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy?
DataFold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure a perfect parity between your old and new systems.
Whether you're moving from Oracle to Snowflake, migrating stored procedures to DBT, or handling complex multisystem migrations, they deliver production ready code with a guaranteed time line and fixed price.
(23:32):
Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold
to book a demo and see how they turn months long migration nightmares into week long success stories.
And the other piece of it too is that because we have access to AI and we can start pulling in these unstructured data assets,
(23:53):
it also
necessitates that we process those so that we can have as much data as we can to feed into these systems.
And then, also,
it opens up the possibility because of the potential for increased developer productivity to start ingesting external and third party data sources and not only relying on
(24:13):
organizational internal data. And I'm curious how you're seeing teams tackle that as well of even seeing what data is available outside and then evaluating and identifying which assets are going to be useful for their particular problem space.
Ido Bronstein (24:28):
Yeah. So I think that's, like, a two sided sword. On one hand, teams have, like, access to much more data than they had before. Like, scraping the web is easier, and connecting to a new system and creating the connector is much easier.
(24:48):
So they have much more data. And then they need to understand how to make use of this data. And this has two main problems. First of all, they need to understand how to connect the data that they gather to their existing enterprise data, first party data. This is a problem that takes the data
(25:12):
engineering team a lot of time. And secondly, they need to understand how to make all of this flow all the time in a very reliable way. And this is also hard because, as you know, the output of the LLM is not always deterministic. And if you're scraping the web, the website can change and it can affect the data. So after you already succeed in understanding how to use this data, making this reliable is the second challenge. But the amazing thing is that it opens
(25:42):
up the amount of use cases to much more. Right now, you can do fraud detection with external data and not just based on the activity of the payments that you see in your system. You can do targeted marketing using, like, reviews on Facebook or other social networks, and we see all these use cases. So the possibilities
(26:06):
to do things with data just increase significantly. It's good, and it adds other challenges for the data engineering team, because the bottleneck is not the collection of the data. It's the ability to process and really use it.
Tobias Macey (26:21):
And beyond just the data and circling back around to the skills gap for teams who are responsible for providing that data,
what are some of the ways that they should be thinking about either bringing on additional talent or working more closely with other teams in the organization
or investing
(26:42):
in
new skills development for helping them figure out how best to actually
work with or understand the problem space so that they're not just left flailing or,
you know, trying to bail water out of a sinking ship.
Ido Bronstein (26:56):
Yeah. And I think that what you described now is the feeling of a lot of data engineering teams that we are talking with regularly. I think the most important thing is to understand
how they can use AI
in order to accelerate their work, because AI opens up, for the organization and also for whole communities, the chance
(27:17):
to change and to innovate, but it also opens the same opportunity to the data engineering teams themselves. And
the nice thing that AI agents allow us is to stitch together, in a smart way, a lot of different pieces. And as you know, like, a lot of the technical difficulty of managing data pipelines is the fact that you have data spread across a lot of different tools, across different stages,
(27:44):
and AI can really help you with that. AI can really understand which tool you are using, understand the data, connect between the pieces, and, in a sense, make all the tedious work of data engineers much smaller. So what we see in advanced data engineering teams, for example, a team in
(28:05):
Netflix, in Wix,
they really leverage AI in a smart way in order to abstract away a lot of the technical details in building their pipelines
so data engineers can focus
on
understanding the business, understand what is the next business use case that we can use, which data we need to bring in in order to improve the fraud detection
(28:27):
or the conversion of our product. So all the technical work that data engineers are doing today is done on these teams with AI.
Tobias Macey (28:38):
I think another interesting
aspect of the era that we're in right now and some things that I've seen in my own work is that the introduction of AI
as an end user facing capability in particular
is forcing
more of this combining
of different teams that maybe have worked in their own particular areas and forcing more of that cross team collaboration
(29:02):
because there
is no longer
as clean of a handoff and those lines are blurring. And so
you have to move from application development
to data management because you need the data in the application to power the AI, and you also need to bring in the machine learning or data science teams because they're the ones who understand how to deal with the experimentation
(29:24):
and
deal in that more probabilistic space.
And how are you seeing that shift maybe the organizational structures as well where maybe we've gone from having these dedicated teams to everybody is one team, or maybe you're doing more of the,
embedding strategy where you maybe group clusters of people who are in separate teams into their own little operational units to be able to have more of that end to end context and capability
(29:51):
without having to have as much of a hard handoff between those stages?
Ido Bronstein (29:56):
Yeah. So the first impact that we see in organizations, and it's really, like, you can see it actually in every big organization, is that the amount of data people that they're, like, trying to bring in, and the people that really understand data, is much larger.
so they can really use their data. This is the first thing. The second thing is that the whole organization
(30:21):
starts to talk in the data language. Suddenly, the software engineers start to understand that the data that they produce really brings value to the organization. So you see a lot of organizations starting to implement data contracts and the ability to manage the data across different parts of the application.
(30:41):
And
the last thing that we are seeing is squads of, like, application engineers, data engineers, and data scientists working together on the same task. Where in the past, it was much more separate. It was like the application engineer worked, he threw data somewhere, then the data engineer ordered it, and then the data scientist did something with it. Now we see many more squad teams that are taking a specific task and working on it together in order to solve it.
Tobias Macey (31:13):
And another interesting approach that I can see being feasible
is maybe rather than restructuring all of those teams, taking some of the top performers from each of them and having them be an enablement squad and maybe defining some useful templates or
context
structures for being able to feed into agents that those teams can actually use
(31:36):
to do their own work or act as an automated reviewer of the work that they're doing, or for that team to do more of an
architectural consulting approach of
evaluating what are the current systems and helping to identify
some of the new capabilities or new workflows that need to be developed to make sure that everybody can operate at a higher level.
Ido Bronstein (31:57):
Yeah. So I think we're at the beginning of the journey. Right? So in the beginning, you are putting the best and the most talented people on the innovation task. And I think this is what you tried to say, like, taking the best group of people that you have and saying, we'll, like, understand how to take those POCs into production first so all the organization can follow after.
Tobias Macey (32:20):
And in your own work at Upriver,
where you are trying to be more of that enablement layer and help teams automate some of the data work that needs to be in place to actually
enable them to explore this new
AI era. What are some of the core engineering challenges that you faced in terms of being able to actually build a system that moves more of these data workflows into an autonomous layer or works with those teams to be able to reduce some of the drudgery or automate discovery or,
(32:56):
those various capabilities?
Omri Lifshitz (32:59):
Yes. So I think one of the most interesting things we saw is that, when we started out, we kind of expected the models to be a magic black box that you can just throw whatever you want at. And I think a lot of people had that illusion at the beginning, where you can say, okay, I'll throw everything into the new GPT model, and I'll be able to get things working. And that obviously is not the way to go about it. And we've had to solve a lot of very big engineering tasks in order to make our system actually usable and build, like, the right teams around it, regarding how we understand which context we need to collect from the environment we're connecting to. So this means also collecting
(33:36):
code information. How do we profile data and bring that context in to make sure that what the system is doing is reliable? How do we put in place the right validations in the system? And when do we need to go and then get the human in the loop to get all of these things working correctly? I think being able to do all of these things and understanding how to put them in place was one of the biggest challenges we had to face here. And when we look, for example,
(33:58):
one of the, I think, most interesting things we did was we kind of reverse engineered how Claude Code was working, and Cline, and how all of those coding agents are working. And what you see is that essentially, not only are they very good with the models, they know how to make the model and the full system work like a software engineer does. And you need to have that kind of mentality. You need to understand how somebody doing this task would work and then know how to tie in all of the relevant things from there. And again, as I said earlier, like, the idea of how you build and engineer things and the system architecture you have to do it in, I think that still is one of the most critical things that we have today. And that's where, like, you see the really good software and data engineers being able to elevate how things are being done. I think that's a critical piece that is still going to go forward with us as we automate more things and as we bring in new tools that can take a lot of the busy work away from data and software engineering teams.
Tobias Macey (34:49):
Another aspect of this overall space is
as is the case with software engineering, but I think more critically in, like, a data engineering context,
the fact that every team has their own opinions of what tooling they want to use, what platforms they're building on top of. Obviously, there are broad level abstractions that we can use as commonalities. But when you get into the implementation specifics, there are wide variances in terms of how those technologies operate, especially when you move between batch and streaming.
(35:18):
And I'm curious how you're seeing
that play out as far as what are maybe some of the core primitives that are most essential to be able to actually feed into those AI systems where maybe batch has been working well enough for human driven analytics, but maybe AI is a forcing function for a broader migration to streaming or maybe vice versa. I'm just curious what you're seeing there.
Omri Lifshitz (35:41):
Yeah. So I think, again, it goes back to how we curate this context. One of the key things that's relevant is what abstractions you need when you come to approach data engineering tasks. So a batch pipeline is a batch pipeline no matter what underlying technology somebody's using for it. And the data model, you have all of the semantics within it, but still, you need to understand how things relate together and the dependencies between them. And I think that's one of the critical aspects. Understanding the ontology of the platform and what abstractions you wanna put in place for it is one of the critical things you need to do in order to actually make this usable across different data teams. And I think that's also one of the things you see in software. Like, you might have people writing code in Python, other people writing code in Go. Still being able to understand how you need to look at this modularity
(36:26):
is one of the critical things that you need to do in order to get these systems working. And, definitely, I think we will see more use cases for streaming if people are pushing towards more real time things and on the go analytics that we're seeing now. But the idea of how you abstract away these different things so you can make the model work correctly, I think that's a core that's going to stay across nearly
every aspect of agents working.
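One way to picture that abstraction layer is a tool-agnostic pipeline description that any backend, batch or streaming, can be mapped onto; this is a hypothetical sketch, not a real framework:

    from dataclasses import dataclass, field

    @dataclass
    class PipelineSpec:
        # What an agent or a human needs to reason about, regardless of
        # whether Airflow, dbt, Spark, or a stream processor runs it.
        name: str
        inputs: list[str]
        outputs: list[str]
        schedule: str | None = None          # None for streaming / event-driven
        depends_on: list[str] = field(default_factory=list)

    def lineage(pipelines: list[PipelineSpec]) -> dict[str, list[str]]:
        # Derive a simple dataset -> producing-pipelines map from the abstraction.
        produced: dict[str, list[str]] = {}
        for p in pipelines:
            for out in p.outputs:
                produced.setdefault(out, []).append(p.name)
        return produced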
Tobias Macey (36:48):
And
what are some of the key struggles that you're seeing teams deal with as they are going through all of this effort of building these AI powered systems, getting their data up to par, figuring out what data assets they have and which ones are actually useful?
Omri Lifshitz (37:05):
I think one of the biggest challenges, and I think this goes to, like, data maturity that companies have, a lot of companies are now trying to say, I wanna use AI to enable my data teams, but they're still not clear on what they wanna do with the data. And I think that's, like, the first principles and fundamentals of what you want to do, and then you are able to build things on top of that to actually make this usable. So no matter how much AI and tools you put in place, if you don't know what you want to achieve, I think that's not going to work. Once companies do know how to do that, we see them trying to use the existing software engineering tools, which are amazing. We're also using them, but they're just not the right tools for data engineering tasks because they lack this context. So we see teams trying to manually
(37:47):
prompt all of the context that's relevant for the data into Cursor before going on each task they need to do. And we've seen teams taking lineage screenshots from other systems and putting that as part of the prompt into ChatGPT
to then get it to write the right SQL. So there are a lot of issues there where you still need to understand how you bring in the knowledge that the data engineer has into the system to actually make them do the right task that you want them to.
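Instead of screenshots, that context can be assembled programmatically; here is a minimal, hypothetical sketch of what such a prompt builder might look like, with the table, columns, and notes invented for the example:

    def build_data_context(table: str, schema: dict[str, str],
                           upstream: list[str], notes: str) -> str:
        # Gather schema, lineage, and documentation as text for the prompt.
        cols = "\n".join(f"  - {name}: {dtype}" for name, dtype in schema.items())
        return (f"Table: {table}\n"
                f"Columns:\n{cols}\n"
                f"Upstream sources: {', '.join(upstream) or 'none'}\n"
                f"Notes: {notes}\n")

    context = build_data_context(
        table="orders_daily",
        schema={"order_id": "string", "amount": "decimal", "order_ts": "timestamp"},
        upstream=["raw.orders", "raw.payments"],
        notes="Amounts are in USD; late rows can land up to 48h after order_ts.",
    )
    prompt = context + "\nWrite a SQL query that returns total revenue per day."

The point is only that the knowledge a data engineer holds can be serialized into the prompt rather than pasted in by hand.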
Tobias Macey (38:14):
And as you have been working with some of these teams and onboarding them onto Upriver and helping them understand the overall problem space, what are some of the most interesting
or innovative or unexpected ways that you've seen companies trying to make their data AI ready?
Ido Bronstein (38:30):
So I think that we see companies trying to tackle this across all the maturity levels, from companies that are hiring more data quality analysts that now need to check all the data and all the records and create all the semantic layer, like, manually, just from, like, sitting with all the business and trying to dynamically update that. We see companies
(38:53):
that try to use, as is, like, the automation that software engineers use, especially like Claude Code and Cursor. But then they are trying to understand how to gather context about the data using, like, MCPs and screenshots and some presets for, like, data systems. And we see very mature companies, for example, Netflix
(39:16):
and, like, Wix. They built, like, their own automation tools in order to understand what this context is. For example, we heard how they're, like, connecting to Slack and Notion
and creating all the data documentation
automatically
based on the
unstructured
(39:36):
organizational
knowledge that they have in other systems. This is, like, I think, a very cool example that we have.
Tobias Macey (39:43):
And in your own work of building the system
to be more of that autonomous
data engineer
and
exploring this overall space of how do we actually use AI to build these systems, what are the AI systems that are consuming from this actually need, and how do we make sure that they're speaking the same semantics? What are some of the most challenging or unexpected lessons that you've learned in that process?
Ido Bronstein (40:09):
So as we said, there is no magic. Like, AI is not magic. You still need to properly build the context of the knowledge that you want the AI to use. So for us, the biggest challenge is to understand how to create the right ontologies from the context and understand
(40:30):
how to create the right connection between different
entities. For example, what is a pipeline? What is a table?
What is an entity inside the tables?
And then
to help the AI understand those ontologies
and use
the specific context that we gather from our customer environment in order to enable
(40:51):
the AI for its use case, for the current specific task that it's right now being asked to do. And I think this is the challenge that we solve and craft. Building the right context is the thing that makes our product really work for data engineers.
Omri Lifshitz (41:10):
Yeah. And I will add one thing to that, just because I think it's interesting in the way, like, you build these agentic systems. Not everything has to be directly done with the LLM. Right? There are things that you do know are going to be deterministic in your system, and you want them to be deterministic. And those are things that just regular software engineering can still do. And being able to play on the fine line of what is deterministic
(41:36):
and where can I use the LLM to enrich my capabilities
by miles? I mean, like, that gives you the ability to look at code and understand it. But then maybe I need to go back to a validation module inside to make sure the pipeline was done correctly. Being able to ping pong between these two modes of working, I think, is what gives the edge to a really production ready AI system rather than just somebody
(41:57):
throwing things into an LLM and expecting it to work correctly.
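A small sketch of that ping pong between deterministic checks and model calls; call_llm is again a stand-in for a real model API, and the validation rule is deliberately simplistic:

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder for a real model call

    def validate_sql(sql: str, allowed_tables: set[str]) -> bool:
        # Deterministic guardrail: cheap, predictable, easy to reason about.
        lowered = sql.lower()
        return lowered.strip().startswith("select") and any(t in lowered for t in allowed_tables)

    def generate_step(task: str, allowed_tables: set[str]) -> str:
        for _ in range(3):  # bounded retries instead of trusting the first answer
            candidate = call_llm(f"Write one SELECT statement to: {task}")
            if validate_sql(candidate, allowed_tables):
                return candidate
        raise ValueError("model could not produce a statement that passes validation")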
Tobias Macey (42:01):
And for teams who are
excited about the potential for AI to automate their data engineering work and let them focus on more of the business oriented aspects, what are the cases where an agent based or autonomous approach is actually
not the right way for these teams to be dealing with their data pipelines and data issues?
Ido Bronstein (42:24):
So it depends on the maturity of the team. The point where a team will need AI in order to manage its data is when it knows what the exact business problem is that it wants to solve, and it wants now to deliver that. Teams that are just right now understanding what the business needs and trying to map their data and understand which data they have in the organization.
(42:50):
Their work is more focused on the business
and the ability to
understand,
how they can enable the business. But once the focus moves to the technical part of building the data, structuring it, and managing it, this is where AI can really kick in. Because the business understanding will always, or I think will, stay with the human for a long time. But the technical part
(43:14):
will be abstracted over time, and this is where AI can really enable data engineering teams.
Tobias Macey (43:20):
And as you continue to invest in this problem area and as the overall industry
progresses further into our understanding
of how and when to use AI and what the requirements are of the models as they continue to evolve in terms of their capabilities,
what are some of the trends that you're paying particularly close attention to or any problem areas or
(43:45):
new engineering patterns that you are investing in or doing exploration of?
Omri Lifshitz (43:51):
Yes. So I think one of the major things that came up, I think, like, in the last few months that we're really focusing on is sub agents and the ability to kind of say, okay, what context do I need for specific tasks within my overall task? I think that we've started using this. This has given a lot of value right now, and we see this going forward. And how do you manage these things?
(44:12):
I think that's one thing that will change. The second thing is how we manage context as models are able to gain bigger context windows, and they're starting to gain memory as part of the model as well. That will also kind of change the way in which we need to manage the context we're giving it. So that is a big thing we're paying attention to and seeing how this will affect the entire industry. And, again, I think, like, in essence,
(44:34):
the idea of how we store the context, that will probably change in the next year or so. We see things moving so fast. But the idea of how we curate it and understand what context we need and what it means to actually know your data state, that is essentially what a data engineer does best. Right? Knowing what is actually happening in their data environment, what the semantics and metrics are, what they're trying to do and what they're trying to calculate, that will stay. And I think we're trying to understand what the best way is, as the models and the paradigms change, to be able to deliver this value.
Tobias Macey (45:06):
Are there any other aspects of this overall
situation that we're in of the gap between
the needs of these AI systems and the
ability for data teams to be able to fulfill those requirements that we didn't discuss yet that you'd like to cover before we close out the show?
Ido Bronstein (45:24):
I think yes. And I would love to, like, picture the image of how I see this space in a couple of years from now. So if I'm looking at this space today,
most of the data management
work is done by data engineers. And you have, like, tools that help you, from data quality, data catalog, to semantic layer. In four years from now, I see it completely changed. I think that you will have an AI based platform that helps you to really
(45:54):
automate how data is managed from end to end, from the minute that data lands in your platform, to the quality, to the catalog, to the semantics, to building the pipelines themselves. And data engineers will oversee those AI engines and will define what the business needs. They will define which data they want to enter the platform. And they will, of course, control it and, like, create the right
(46:17):
ontologies and the right metrics
for the business.
And this layer will be much more focused on what the business needs and much less focused on how I tactically create this pipeline and manage it. And we see the first steps of that right now in the industry, especially in the very advanced companies, and
(46:38):
in four years, this will change how data management is done.
Tobias Macey (46:42):
Yeah. I definitely agree with that, and I'm seeing that in my own work where
I am very
knowledgeable about how these systems play together, but I don't want to get bogged down in the details of figuring out, oh, well, I missed this character in this batch call, or, oh, I need to make sure that I add this particular
function call or this particular annotation
(47:04):
to this model to make sure that it gets processed downstream. I just wanna solve the overarching problem and let the AI deal with those granular details. But I am there as the supervisor to make sure that it's not making foolish mistakes or going down the wrong rabbit hole, which is, I think, a skill that people who have been very focused on the individual contributor track are going to need to really level up into because they haven't gotten used to that aspect of being more of that management role and overseeing other people's work where, in this case, the, quote, unquote, people are the AI and just moving
(47:41):
beyond maybe whatever ego they may have attached to the craft of their specific
software practices or architectural patterns
and focusing more on those
broader
objectives
that they maybe don't have the time
to actually think a lot about because they're so busy with those granular details.
(48:01):
Exactly. And I think that the industry will grow more and more in that area. Alright. Well, for anybody who wants to get in touch with you both, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
Ido Bronstein (48:19):
So, again, the biggest gap today is the ability to stitch the pieces together. Still today, the ability to, on one hand, understand the data and understand what it means, and on the other hand,
understand
orchestration,
like, dbt transformations,
partitioning
in storage,
(48:40):
and cataloging the data and data quality. Stitching all of that together is a hard problem. Having to hold all these, like, tools and orchestrate everything in the correct way, it's just too complicated today.
Tobias Macey (48:53):
Yeah. Absolutely.
We are continuing to add more layers of complexity, and we're componentizing so you don't have to necessarily have the entire vertical complexity of some of the tools that we've dealt with over the time. But it just means that that knowledge has to be more diffuse.
Ido Bronstein (49:08):
Yeah. Yes. And from what we see, AI helps you to really solve that, to connect between the parts, and AI can use these tools and learn one time and do it, like, in every company.
Tobias Macey (49:21):
Alright. Well, thank you both very much for taking the time today to join me and share your experiences
of working in this space and your thoughts on the demands and shortcomings
of the data systems that we have as we start to bridge into these more AI driven capabilities. It's definitely a very necessary field to be studying and focusing on. So I appreciate all of the work that you're both doing to help make that a more tractable problem for a broader set of engineers.
Ido Bronstein (49:49):
Thank you very much. We really enjoyed it. Yeah. Thank you for having us. Thank you.
Tobias Macey (50:01):
Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
(50:25):
with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and colleagues.