Summary
In this episode of the Data Engineering Podcast Alex Albu, tech lead for AI initiatives at Starburst, talks about integrating AI workloads with the lakehouse architecture. From his software engineering roots to leading data engineering efforts, Alex shares insights on enhancing Starburst's platform to support AI applications, including an AI agent for data exploration and using AI for metadata enrichment and workload optimization. He discusses the challenges of integrating AI with data systems, innovations like SQL functions for AI tasks and vector databases, and the limitations of traditional architectures in handling AI workloads. Alex also shares his vision for the future of Starburst, including support for new data formats and AI-driven data exploration tools.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust? Are broken pipelines and silent schema changes wreaking havoc on your analytics? You may be experiencing symptoms of Undiagnosed Data Quality Syndrome, also known as UDQS. Ask your data team about Soda. With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks, and the fastest in the category, analyzing 1.1 billion rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like, so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you. Side effects of implementing Soda may include: increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data. Sign up today to get a chance to win a $1,000+ custom mechanical keyboard. Visit dataengineeringpodcast.com/soda to sign up and follow Soda's launch week. It starts June 9th.
  • This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI products at dataengineeringpodcast.com/coresignal.

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
Poor-quality data keeps you from building best-in-class AI solutions. It costs you money and wastes precious engineering hours. There is a better way.
Coresignal's multi-source, enriched, cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files.

(00:35):
Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries.
Go to dataengineeringpodcast.com/coresignal
and try Coresignal's self-service platform for free today.
This is a pharmaceutical ad for Soda Data Quality. Do you suffer from chronic dashboard distrust?
Are broken pipelines and silent schema changes wreaking havoc on your analytics?

(00:59):
You may be experiencing symptoms of Undiagnosed Data Quality Syndrome, also known as UDQS.
Ask your data team about Soda.
With Soda Metrics Observability, you can track the health of your KPIs and metrics across the business, automatically detecting anomalies before your CEO does. It's 70% more accurate than industry benchmarks and the fastest in the category, analyzing 1.1 billion

(01:21):
rows in just 64 seconds. And with Collaborative Data Contracts, engineers and business can finally agree on what "done" looks like, so you can stop fighting over column names and start trusting your data again. Whether you're a data engineer, analytics lead, or just someone who cries when a dashboard flatlines, Soda may be right for you.
Side effects of implementing Soda may include increased trust in your metrics, reduced late-night Slack emergencies, spontaneous high-fives across departments, fewer meetings and less back-and-forth with business stakeholders, and in rare cases, a newfound love of data.

(01:50):
Sign up today to get a chance to win a $1,000+ custom mechanical keyboard.
Visit dataengineeringpodcast.com/soda
to sign up and follow Soda's launch week, which starts on June 9. Your host is Tobias Macey, and today I'm interviewing Alex Albu about how Starburst is extending the lakehouse to support AI workloads. So, Alex, can you start by introducing yourself?

(02:11):
My name is Alex Albu. I've been with Starburst for about six years now, and I'm currently the tech lead for our AI initiative.
And do you remember how you got started working in data?
Yeah, I come from a software engineering background. But a few jobs ago, I was working for IDEXX,

(02:34):
working on their veterinary practice software. We had to build a few ETL pipelines pulling in data from various practices into IDEXX. I think the point where I really got into data engineering was when I

(02:54):
was working on rebuilding an ETL system based on Hadoop. I replaced that with a Spark-based system, and the results were actually pretty spectacular. The performance gains let us go from

(03:15):
running a five-node cluster twenty-four seven to a smaller three-node cluster that was just running a few hours a day. So that got me into big data. When I moved on to my next job, at a company called TraceLink, I built an analytics platform there,

(03:39):
using Spark for ETL and Redshift for querying data. We started running into limitations of Redshift at that point and started looking at other solutions for analytics, and we came across Athena for querying data that we were dumping into a data lake. I thought this was a great solution,

(04:02):
until, again, we started using it for more serious use cases, and I started running into limitations. As I was researching ways to optimize my queries (at that point, you couldn't even get a query plan from Athena),

(04:23):
my research took me to Starburst, and that's basically how I ended up here. At Starburst, I've had kind of a nonlinear trajectory. I started as a software engineer, then I took on an engineering management

(04:43):
job for about four years, and now I'm back as an IC working on the AI initiative.
And for people who want to dig deeper into Starburst, Trino, and some of the history there, I'll add links in the show notes to some previous episodes I've done with other folks from the company.
But for the purpose of today,

(05:05):
given the topic of AI on the lakehouse and some of the different workloads, I'm wondering if you can start by giving a bit of an overview of some of the ways that the lakehouse architecture and platform intersect with the requirements and use cases around AI workloads.
Yeah. So

(05:26):
as part of the AI initiative, we started with two goals in mind. One was to make Starburst better for our users by using AI, and the other was to help our users build their own AI applications. So what does making Starburst

(05:47):
better for users mean? That covers a wide range of things. But one of the things we've done was build an AI agent that allows users to explore their data using a conversational interface. And

(06:08):
the central point of that is data products, which are curated datasets where users can add descriptions and make them discoverable. We wanted to take that a step further, and we've added a workflow that allows users to use AI to

(06:30):
generate some of this metadata to enrich these data products. And then users can review this new metadata that was created, and they have the possibility to correct or add to what

(06:51):
the machine did. What this gives users is not just better documentation to help them understand and discover their data; it also enables the agent that we created to answer questions and gain

(07:12):
deeper insights into the data, instead of just letting it look at schemas. We do other things with AI to make our users' lives easier. For example, we've had auto-classification and tagging of data for a while; we're using LLMs to mark, for example, PII

(07:35):
columns. We're also looking at using AI for things that are more behind the scenes, like workload optimizations: analyzing query plans and using that as input for our work on the optimizer,

(07:57):
producing recommendations for users to write their queries in a more efficient way. And then the other direction that we've been taking with AI is helping our users employ AI in their own applications. What we did for that was build a set of SQL functions

(08:21):
that users can invoke from their queries and that give them access to models that they are able to configure in Starburst. For listeners who are not very familiar with Starburst, I'll say that one of our tenets is

(08:42):
optionality. We allow our users to bring their own backends to query, their own storage, their own access control systems. We provide flexibility at every step of the way, and AI models are no different. We allow users to configure

(09:04):
whatever models they want to use, from a set of supported models, obviously. But the key is that they can use a service like Bedrock, or they can run an LLM on-prem in an air-gapped environment. We support those

(09:27):
scenarios. And, essentially, what these functions allow you to do is implement the RAG workflow in a query. Truly: you can generate embeddings, you can do a vector search, and you can feed the results as context to an LLM call and get your result back. I think it's the easiest way to get started with LLMs. You don't need to know anything about APIs.

(09:56):
You don't need to know Python. Nothing.
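To make that concrete, here is a minimal sketch of the RAG-in-a-query pattern described above. Every name in it (the ai_embed and ai_gen functions, the model aliases, the cosine_distance call, and the table) is an illustrative assumption, not a documented Starburst signature:

    -- Hypothetical end-to-end RAG workflow in a single query.
    WITH question AS (
        SELECT ai_embed('embedding-model', 'Which tickets mention data loss?') AS qvec
    ),
    top_hits AS (
        -- vector search: rank stored chunks by similarity to the question
        SELECT c.chunk_text
        FROM support_chunks c, question q
        ORDER BY cosine_distance(c.embedding, q.qvec)
        LIMIT 5
    )
    -- feed the top hits to the model as context and return its answer
    SELECT ai_gen(
        'llm-model',
        'Using only this context, answer the question: '
            || array_join(array_agg(chunk_text), ' ')
    ) AS answer
    FROM top_hits;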
And another interesting aspect of the overall situation that we're in right now, with LLMs and AI being the predominant area of focus for a large number of people, is that there are a number of different contexts, particularly speaking as data practitioners,

(10:18):
where we want to and need to interact with these models, where you need to be able to use them to help accelerate your own work as a data practitioner in terms of being able to generate code, generate schema diagrams, generate SQL,
but you also need to
provide datasets that can be consumed by these LLMs for maybe more

(10:41):
business-focused or end-user-focused applications.
And I'm wondering, particularly for some of those end-user-facing use cases, what you see as some of the existing limitations of warehouse and lakehouse architectures

(11:06):
in supporting the typical datasets that users will want to feed to LLMs.
So, for example, a lot of the work that users do with AI models is on unstructured data: Excel spreadsheets, video, image data, stuff like that. And that typically doesn't fit well into a warehouse.

(11:32):
There are other areas where things may not be ideal. For example, you were mentioning query generation and things like that. You need good-quality metadata in order to be able to generate accurate queries. And

(11:52):
a lot of times, just reading the schema of a table or a set of tables is going to be insufficient. So there are some limitations around the metadata that a typical warehouse or lakehouse will be able to expose. We say here at Starburst that your AI is only going to be as good as your data, but maybe it's even more true that your AI is only going to be as good as your metadata

(12:22):
at the end of the day. And then there are other aspects. So, for example, consider training a model, providing it a training set. I think that here again, warehouses are maybe not going to be ideal, in the sense that the data access patterns you need when you train a model, where you need to sample specific datasets,

(12:49):
would be an access pattern typical for an LLM, while warehouses are typically optimized for aggregating data and doing huge scans.
And then on the other side, for data practitioners who want to be able to use LLMs in the process

(13:13):
of processing data, iterating on table layouts, or generating useful SQL queries for data exploration, what are some of the current points of friction that you're seeing people run into?
I think one classic one is

(13:33):
around data privacy and regulatory compliance. That's going to be challenging, especially in multitenant lakehouses. I can tell you from experience that many of our customers have pretty strict rules about what data can be sent to an LLM, and they're even stricter than

(13:56):
the rules on what specific users can access. It's possible that a user is allowed to access, say, a column, but they don't want that column sent to an LLM. That's where I think a lot of these friction points are. LLMs can also struggle with large schemas when it comes to query generation, and large tables

(14:26):
and complex lineage are also problems for them.
And then, as far as the interaction of being able to feed things like schemas, table lineage, and query access patterns into an LLM: generally, that would be done either by doing an extract of that information and then passing it along, or by using something like an MCP server or some other form of tool use to instruct the LLM how to retrieve that information for the cases where it needs it.

(15:01):
And that's generally more of a bolt-on use case, whereas what it seems like you're doing right now with the Starburst platform is actually trying to bring the LLM into the context of the actual execution environment. And I'm wondering what are some of the ways that you're seeing that change the

(15:23):
ways that people think about using LLMs to interact with their lakehouse data, and some of the new capabilities or efficiencies that you're able to gain as a result.
Yeah, I think that's a very pertinent observation. I'll say that the advent of MCP

(15:46):
is great. It opens up a lot of data sources to LLMs. It's similar to how Trino has all these adapters for other data sources and opens up access to lots of them.

(16:09):
But if you think about it, MCP defines a protocol for communicating with a data source; it doesn't really say anything about the data that's going to be exposed, or the tools. Those are all left to the implementers' latitude.

(16:31):
And so I think the usefulness of MCP is going to depend on the quality of the servers that are going to be out there. I do think it's going to be a very useful tool. But, like any tool, it has to be used for the right use cases. So, for example, you might have some data sitting

(16:53):
in a Postgres database and some data sitting in an Iceberg table, and you might have MCP servers that provide you access to both of those. And you're going to ask your LLM, or your agent, to provide a summary of data that requires

(17:15):
joining the two datasets. I suppose it may be able to pull data from the two sources and join it, but that doesn't seem right. What we are proposing is to go the other direction: using Starburst, you have access to both. You can federate the data, you can join it, and then you can gather the data and pass it through a SQL function to

(17:45):
the LLM and have it summarize it or whatnot. So I do see what we're building as complementary to MCP, if you want. We're definitely also considering exposing an MCP server ourselves. But, again, it's use the right tool for the right job.

(18:07):
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.
Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.
And they're so confident in their solution, they'll actually guarantee your timeline in writing.

(18:32):
Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold
today for the details.
And digging into the implementation of what you're building at Starburst to bring more of that AI-oriented workflow into the context of the actual

(18:53):
query engine, this federated data access layer: I'm wondering if you can give an overview of some of the features you're incorporating, the capabilities you're adding, and some of the foundational architectural changes you've had to make, both to the Trino query engine and to the Iceberg table format, to enable those capabilities?

(19:15):
So there are a few things. I did mention SQL functions. We have roughly three sets of SQL functions that we provide. There are a few task-specific SQL functions that

(19:36):
use predetermined prompts to perform things like summarization, classification, or sentiment analysis. We also provide a more open-ended prompt function that you can use to experiment with different prompts. We think these may not be opened up to the same groups of users.

(20:00):
The prompt function may be used by the data scientists, or somebody who's more of a prompt engineer, while the task-specific ones don't really require much background in LLMs, so they can be used by

(20:22):
a wider group of users. And then the third category: we provide functions that allow you to use LLMs, again, to generate embeddings for RAG use cases.
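As a rough illustration of the three function families just described; the names and signatures here are assumptions rather than Starburst's published API:

    -- 1. Task-specific functions with predetermined prompts:
    SELECT ai_analyze_sentiment('llm-model', review_text) FROM reviews;

    -- 2. The open-ended prompt function, for users comfortable writing prompts:
    SELECT ai_gen('llm-model', 'Summarize this ticket: ' || body) FROM tickets;

    -- 3. Embedding generation for RAG use cases:
    SELECT ai_embed('embedding-model', chunk_text) FROM document_chunks;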

(20:42):
One thing I think I've mentioned before is that we allow users to configure their own models, and it's worth mentioning that you can configure multiple models. We offer quite a few knobs when you configure a model, not just temperature and top-p and other parameters; we also allow users to

(21:02):
customize prompts per model, because one thing we've learned is that there's a pretty wide behavior gap between them.
And the other thing that we offer for models is governance.

(21:24):
We offer governance at several levels. One: you can obviously control who can access specific functions. But then we take it a step further. These functions actually take as one of their arguments

(21:45):
the specific model that they are supposed to operate with, and so we offer the possibility for an admin to control access to specific models. It does make sense to restrict access to models that are very expensive, for example. So that gives admins

(22:06):
and data stewards quite a few levers in configuring that. Being able to provide governance at the model level has required a bit of a novel approach in the way we have tackled governance and access control in general.
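A rough sketch of what those two levels of control could look like; the grammar below is invented for illustration and is not Starburst's documented syntax:

    -- Level 1: control who may invoke a given AI function (hypothetical syntax).
    GRANT EXECUTE ON FUNCTION ai_gen TO ROLE analysts;

    -- Level 2: since each function takes the model as an argument, access to
    -- individual (for example, expensive) models can be restricted separately.
    GRANT USE ON MODEL frontier_model TO ROLE ml_engineers;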

(22:26):
Some other things that we're building here are around model usage observability. All users are very interested in being able to set usage limits and budgets, even controlling the bandwidth that a specific

(22:48):
user might use up, in terms of, say, tokens per minute. And I did mention the conversational data interface that we've created. We think the differentiator there was building it around data products,

(23:10):
which act as a semantic layer and allow users to provide insights into the data that are actually difficult to glean by an LLM, or even a human. As far as architectural changes, I did allude a bit to

(23:30):
governance and access control. The other area where we have encountered technical challenges was in sending higher volumes of data to LLMs. LLMs are fairly slow to respond, so they don't fit

(23:52):
very well with processing large datasets. So we're looking into ways to process large amounts of data cost-efficiently. LLM providers do offer batch interfaces for such use cases. However, the challenge there is

(24:12):
integrating that with a SQL interface. These batch APIs are typically async, so they're not going to work well with a SELECT statement. We are considering a slightly different paradigm there. Another thing that we've built, or are currently in the process of building,

(24:35):
is auditing. That's another big component. Some of our users actually require capturing a fair amount of audit data for their LLMs, so that's again another challenge.

(24:55):
And then on the storage layer, a lot of people who are using Trino and Starburst are storing their data in Iceberg tables on object storage, and Iceberg as a format generally defers to Parquet or ORC as the storage layer. So there are a lot of pieces of coordination needed to make any meaningful changes to the way they behave.

(25:20):
And I'm wondering what are some of the ways that you have addressed the complexity of being able to store and access some of these vector embeddings living in lakehouse contexts, and some of the reasons for sticking with Iceberg for that versus looking to other formats like Lance?

(25:41):
So that's actually a very interesting question. It turns out that there are already discussions in the Iceberg community about supporting Lance as a file format, and we are looking into that. We're going to be working with the Iceberg community, but definitely, using an alternate

(26:02):
file format like Lance is on the table; it's an option that we're evaluating. I'll also say that for smaller datasets, it's also possible to store data in a different data source like pgvector, for example. We do offer support for pgvector, so it's possible to

(26:25):
use that as a vector database. But there is a lot of interest in storing data in Iceberg. And the right data format and the right shape of the indexes that are going to be required for efficiently

(26:49):
doing vector searches and semantic searches: that's very much an area that's under active investigation and development.
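Pending those purpose-built formats and indexes, one way to picture vectors in a lakehouse today is a plain array column searched by brute force; a minimal sketch, assuming an ai_embed function and a cosine_distance function over arrays:

    -- Store embeddings as an ordinary array column in an Iceberg table.
    CREATE TABLE iceberg.docs.chunks (
        doc_id BIGINT,
        chunk_text VARCHAR,
        embedding ARRAY(REAL)
    );

    -- Brute-force semantic search: no vector index, just a scan and sort.
    SELECT doc_id, chunk_text
    FROM iceberg.docs.chunks
    ORDER BY cosine_distance(embedding, ai_embed('embedding-model', 'billing errors'))
    LIMIT 10;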
And so for teams who are adopting these new capabilities
as you roll them out,
what are some of the ways that you're seeing them

(27:12):
either add new workloads to what they've been using Starburst for, or changing the ways they use Starburst for the work they were already doing?
Yeah. So, like I mentioned before, the capabilities that we've added open up the possibility of essentially

(27:33):
building and running an entire RAG workflow in SQL. You can generate embeddings for data you have, and we do have bulk APIs for doing that. You can perform a semantic search, and based on your top hits, you can

(27:54):
build your context and pass that to an LLM, without using a framework like LangChain, for example. I think this opens up new possibilities for analysts who otherwise would probably not have gotten close to these capabilities.

(28:15):
You can imagine dashboards built based on queries that employ these SQL functions.
And then, in terms of the education around these capabilities: everybody is at a different stage of their journey of overall adoption of generative AI and LLMs, and of

(28:40):
being able to bring them into the work of building their data systems. And I'm wondering
how you're approaching the overall messaging around the capabilities, rolling out the features, and some of the validation that you're doing as you bring these capabilities to

(29:00):
your broader audience, and loop that feedback into your successive iterations of product development?
Yeah. I think the way we ran this project was to get customers in the loop quite early by doing demos of, essentially, unreleased software.

(29:21):
Our product team would demo early builds to get feedback and validate that we're on the right track and building useful stuff for our users. I do think that, in this area, the best way to document and

(29:41):
make customers aware of the value that we're providing is by showing them small applications that you can build using these capabilities. So, for example, you could envision a database

(30:02):
storing restaurant reviews. You could show how you can do a sentiment analysis on that, and then render the daily reviews in a dashboard as red, yellow, and green bar charts. And what's cool about it is showing people how they can actually compose these functions. So, for example, if you had

(30:32):
restaurant reviews in different languages, you could translate them all to English before doing sentiment analysis, or summarize the general complaints that people might have, things like that. So I think, in general, it's important to come up with some good examples that can highlight the truly new capabilities that this

(31:04):
opens up.
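The restaurant-review example might compose as something like the following; the function names and model alias are assumptions for illustration:

    -- Translate first, then score sentiment, then roll up per day for a dashboard.
    SELECT
        date_trunc('day', review_date) AS review_day,
        ai_analyze_sentiment(
            'llm-model',
            ai_translate('llm-model', review_text, 'en')  -- normalize languages first
        ) AS sentiment,
        count(*) AS review_count
    FROM restaurant_reviews
    GROUP BY 1, 2;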
In the work that you did to bring these AI-focused capabilities into Starburst and Trino, and attach them to the Iceberg capabilities, what are some of the interesting engineering challenges that you had to address, and what was some of the

(31:24):
preexisting functionality that helped you on that path?
Yeah. I think the main preexisting functionality is the capability to access a wide variety of data sources, because the name of the game here is getting your AI access to the data. And in that

(31:48):
sense, Starburst is uniquely positioned to be able to plug into all of our users' data. Making that available to LLMs is challenging and is an ongoing

(32:10):
effort. I did mention how processing large amounts of data is challenging from a technical perspective, because it doesn't fit well with the SQL paradigm. But we do have some innovative

(32:31):
ways in which we are going to allow that sort of processing to be embedded in, say, a workflow that our users might have. A few things that we've learned along the way:

(32:52):
essentially, working with LLMs is a paradigm shift. We did learn that LLMs can be fickle, and writing tests is a real challenge. Writing tests is very challenging when the system you're testing is probabilistic

(33:15):
and not deterministic. You need to get, to some extent, into the mindset of a data scientist and embrace experimentation. Everything here is very data-driven. So

(33:36):
generating meaningful datasets is critical for building such a product, and we're looking at various approaches for getting datasets that we can thoroughly test our models and our functionality on.
And as you have been

(33:58):
releasing these capabilities and onboarding some of your early adopters, what are some of the most interesting or innovative or unexpected ways that you've seen these AI-focused capabilities applied?
We're obviously at the beginning of this journey, like the whole industry. But we do have a few customers who are more advanced, and we've seen them do some interesting things. So, for example,

(34:26):
when exploring different datasets coming from, say, different providers they might be ingesting data from, they'll need to join those datasets, but they won't necessarily have similarly named columns. Inferring, say, the join column is not always easy based on just doing a

(34:51):
column-name match. But with AI, you can actually do some semantic analysis and find the matching columns that way, and essentially figure out the join (or have the machine figure it out) in ways that were not possible before.
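One hedged way to picture that semantic matching: embed the column names from the two providers' schemas and rank candidate join keys by similarity. All names below are assumed for illustration:

    -- Rank candidate join columns across two catalogs by embedding similarity.
    WITH a AS (
        SELECT column_name, ai_embed('embedding-model', column_name) AS v
        FROM provider_a.information_schema.columns
        WHERE table_name = 'shipments'
    ),
    b AS (
        SELECT column_name, ai_embed('embedding-model', column_name) AS v
        FROM provider_b.information_schema.columns
        WHERE table_name = 'deliveries'
    )
    SELECT a.column_name AS left_column, b.column_name AS right_column,
           cosine_distance(a.v, b.v) AS distance
    FROM a CROSS JOIN b
    ORDER BY distance
    LIMIT 5;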
Some other interesting things we've seen our customers do, and this is actually something that we're also looking into for internal use, is

(35:17):
generating synthetic datasets, which removes the danger of using PII data in testing, things like that. So using LLMs to generate synthetic datasets is another interesting use case that we've seen.
For people who are

(35:39):
interested in being able to use AI in the context of their data systems, what are the situations where either Starburst specifically or the lakehouse architecture generally are the wrong choice?
So I think at this point, I wouldn't recommend using Starburst

(36:02):
for a use case that uses data like video or large blobs of unstructured data, sensor data, or something like that. We're not ready at this point to deal with those types of data. Also, high-volume, high-concurrency

(36:24):
operations are not going to fit well here. And a lot of it is actually due to the performance of LLMs and the lack of support for such operations. But, again, as with everything:

(36:44):
choose the right tool for the right task. And while I think Starburst can handle a lot of use cases, it's definitely not going to handle all of
them.
And as you continue building out these AI-focused capabilities, the landscape around you continues to evolve at a blistering pace. What are some of the things you have planned for the near to medium term, or any particular projects or problem areas you're excited to explore?

(37:12):
Yeah, you're right, this is moving at a fast pace. We do have a lot of plans. We plan to add MCP support, and we're thinking about how to make it the most useful, working with our customers. But you'll notice that a lot of databases out there just expose

(37:33):
a simple API to run a query in their native query language. I think we can do better. We can do more than that, and allow agents to use an MCP server to automate a lot of the tasks users might want to do, say, in Galaxy, like spinning up clusters or setting up various resources.

(38:00):
I think there are lots of opportunities there for integrating with agents that our users have already built, actually. You were asking about the work on storing vectors in Iceberg; that's definitely an area that

(38:21):
we're building in, essentially making Starburst a more performant vector database. That includes looking at new file types for storing the data, and looking at various vector indexes, as well as

(38:46):
indexes that support full-text search. We were also talking about some of the weaknesses I mentioned that are common for warehouses, and where Starburst is no exception. We are going to be working in that area to provide

(39:08):
better support for data types that are currently maybe not as well supported: PDF files, Excel spreadsheets, more unstructured data, potentially images and video. So we're looking at extending Icehouse,

(39:30):
our managed platform, for data ingestion and transformation. We're looking at extending it to be able to ingest various types of data, generate embeddings for them, and potentially apply AI transformations. Again, there are lots of possibilities that we see there. And we do want to continue

(39:55):
extending the use of AI features throughout the product, for the things that I've mentioned, which would allow us to essentially make the product more efficient and provide recommendations to our users for improving their queries and the way they use the data. And then, finally, we're working on extending

(40:19):
the agent that we built to be able to generate graphical visualizations and data explorations. I personally think this is the way BI tools are headed: allowing users to do ad hoc explorations just using natural language,

(40:44):
and visualizing the results of their questions.
Are there any other aspects of the work that you're doing, the AI-focused capabilities that you're adding to Starburst, or the overall space of building for and with AI as data practitioners, that we didn't discuss yet that you would like to cover before we close out the show?

(41:08):
No, I think we covered a fair amount of topics here. It's definitely a very exciting area that is going to be a game changer for the way we interact with data and the way we gain insights

(41:30):
from it.
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and the rest of the Starburst team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

(41:52):
Oh, tough question. I think if I had to choose something, maybe it's a lack of intelligent data observability and contextual understanding of your data.

(42:13):
Current tools are good at, say, syntax validation and basic data profiling, but they still struggle with semantic understanding of the relationships between data. And, incidentally, I think this is an area where AI is going to

(42:37):
be able to help and provide insights that were not achievable before.
Alright. Well, thank you very much for taking the time today to join me and share the work that you and the rest of the Starburst folks are doing on bringing AI closer into the process of building data systems, and the ways that we can use these models

(43:01):
to accelerate our own work as data practitioners working with these large and complex data systems that we're responsible for. So I appreciate all the time and energy that you're putting into that, and I hope you enjoy the rest of your day.
Thank you. I really appreciate the opportunity to talk to you, and it was a great conversation. Thanks.

(43:28):
Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

(43:52):
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
