Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:01):
Hey everybody, welcome back to a new season of What's New in Data. Today we have a very special guest, Andy Pavlo. If you're in the world of databases, you probably already know his name. Andy is a professor of computer science and databaseology at Carnegie Mellon University. Andy has built a massive following, with thousands of people looking to him for his insights and his flagship
(00:22):
database course at Carnegie Mellon, which is fully open to the public online. What makes Andy's work so exciting is his ability to combine academia, industry, and a little bit of Wu-Tang Clan to make learning about databases gratifying, enjoyable, and practical. Let's get right into it.
Speaker 2 (00:40):
Andy, super excited to have you on the show today. How are you doing?
Hey, John, thanks for having me. Working on my databases. It's going to be a good time.
Always a good time talking about databases. Andy, I introduced you, but feel free to tell the listeners your story as well.
Speaker 3 (00:54):
Yeah, so I am, as you said, an associate professor with indefinite tenure of databases in the computer science department at Carnegie Mellon, and this is my, I think, 11th year. I've been there since 2013. And prior to that, I did my PhD in databases at Brown
(01:14):
University with Mike
Speaker 2 (01:15):
Stonebraker and Stan Zdonik.
Excellent, yeah, and, like I said, many people follow you. One of the very interesting things you do is this very public course. It's officially a CMU database course, but you open it up to the public: Intro to Database Systems, CMU 15-445/645. It's generous how you've made all the courseware, all the
(01:37):
lectures, all the projects. You can submit projects. I went through some of them myself. Why did you decide to make it so generously accessible?
Speaker 3 (01:47):
Yeah, how things started was when I started at CMU, I was pretty much the only database professor there. There was another professor, Christos Faloutsos, but he's in a more data mining, graph mining world. And when you start off as a new professor, you're like, oh gosh, how am I going to get tenure?
(02:09):
Because I'm competing with MIT, where there's Mike Stonebraker and Sam Madden, and competing with the Stanford and Berkeley people. So I'm like, all right, how do I get people to pay attention to what we're doing? And we decided to just put everything on the internet. It's advertising for the Carnegie Mellon Database Group. It's one thing to put the videos online, but we also put all
(02:30):
the course materials online. The thought process there was not everyone can get into CMU and, first of all, not everyone can pay for CMU. CMU is very expensive. It's not like they're paying me a lot of money, but it's a lot. And so not everyone can go to CMU. So, like, why not just put everything on the internet for anybody who wants it? I don't know why more people don't do this.
(02:52):
And the great thing about being at a private school is I didn't ask permission, I just did it, and no one cares. Everyone's happy with it. So I'm glad to help as many people as I can. I will say the positive side of this has been, one, it forces me to be more professional. I say less crazy things because I know it's going to be on the video. But from a pedagogical standpoint, it's been helpful for
(03:14):
me with teaching and lecturing, because I'll watch the previous years, like from one year to the next, and I see where students ask questions, and I say, okay, they must not be understanding something that I'm explaining, or the demonstration or the diagrams aren't clear. So from year to year I can look and see where students are
(03:34):
confused, and I make it better. So that's been wonderful. And the great thing also, I would say the personal satisfaction, has been, you know, people that aren't at CMU, like non-CMU students, send me emails and say, hey, because of the course, I was able to get this job, I got this internship. It's helped their career, and without
(03:59):
paying CMU money. To me, that's fantastic. So I like those emails, I appreciate people sending them. So it's been nothing but positive.
Yeah, absolutely, and I want to say it's developed its own cult following.
Speaker 2 (04:09):
It has a Discord channel that's super active. Your videos are super engaging. They all start out like some sort of Tarantino movie, in a beanie, looking like you just ran away from a crime scene, and that really sets the tone going into the lecture. And then you'll go in and start talking about concurrency
(04:32):
theory or log-structured indexes and things like that.
Speaker 3 (04:33):
I will say, for this fall 2024, the phone call that I started that first lecture with, from the Carnegie Mellon police, that phone call is real. People are like, oh, did I fake it? That is a real CMU phone call, and it's obviously not because I robbed a bank or something like that. Basically, I started getting these emails over the summer
(04:55):
saying, hey, your CMU voice mailbox is full, please go delete things. I'm like, okay, I didn't know I had a voice mailbox, because I've never hooked up the phone in my office. I can tell you why in a second. But so I find where this website is, and I realize I have 10 years of voicemails I've never even looked at, right? And so I'm going through listening to them, seeing if there's anything I should keep.
(05:17):
And sure enough, there's two phone calls from the CMU police, and the one that is on the YouTube video is because some woman, not affiliated with the university, not a student, not faculty, not staff, nothing, called the CMU police because she was saying that I had a photo on GitHub of me looking
(05:38):
very threatening, and she was warning the police to have me take it down. It was something stupid like that. That phone call is the CMU police saying, hey, this lady called and said you have a video or photo of you with a knife from undergrad, and we're asking you to take it down. So, yeah, there's been shenanigans, but not all of it is completely made up.
Speaker 2 (05:59):
Absolutely. Someone really has to bring this mentality into the database world. I think you've successfully done it, and I think that's why the community is so engaged. You have thousands of people following you for this course, and the Discord is super active. I went through the course myself. I couldn't get through all of it; it's quite a bit of dense material.
(06:20):
It definitely reminded me of my college OS class. Shout out to the Pintos class taught by Greg Benson over at USF CS, but just nights and weekends in the lab. But the amazing thing is, even though it's very dense, very challenging, you have people in, you know, Discord who are chatting about going through the exercises.
(06:41):
People are sharing hints and are also very generous in the community. For instance, I just shared the test output for the HyperLogLog project, and someone said, oh, you just have a little off-by-one error, and I said, okay, that's it. It was just my indexing was off. But it's very hard to get people to come around something that's so complex and work and collaborate and ultimately solve
(07:03):
these problems. My personal recommendation is definitely try to go in with a group, but if you do, the community is awesome. Definitely a shout out to you for cultivating that.
Speaker 3 (07:14):
I would say also that Discord channel is not affiliated with me. That is some person in India who said, I'm gonna make a Discord channel, and sent me an email. I'm like, okay, great, go for it. I don't monitor it, so that's all organic, outside of CMU. So give credit to those guys for establishing that community. I had nothing to do with that, but I'm happy somebody did it.
Yeah, it's incredible.
Speaker 2 (07:36):
Definitely. If you feel like the material is really challenging and you just need some help, there's a lot of active communication there through every single project. But yeah, it's even cooler to hear that that started on its own. It speaks to what's become a very viral course. The other thing I want to ask you is: is a strong background in
(07:57):
Wu-Tang Clan an official prerequisite of the course, or just recommended?
It's recommended. In previous years, not this past year because of other troubles, but in previous years, we've had a course DJ.
Speaker 3 (08:13):
We've had someone, like, if you watch the fall 2023 lectures, there's somebody sitting next to me, a full-time DJ, and he mixes and does beats before and after the lectures. We did that because there's a bit of time setting up before the class starts, and students are walking in, and it's always this dead, awkward silence.
(08:33):
I'm like, all right, let's get somebody to play music during this time. So that's how the Wu-Tang stuff came along, because, one, I like the Wu-Tang Clan, but also we had a course DJ. So we plan on having one in fall 2025. That's the number one complaint we got this semester, that we didn't have a course DJ. In previous semesters, I've done Easter eggs, or had things for the
(08:58):
final exam. For example, one year, if a student could list all the members of the Wu-Tang Clan that were on the original 36 Chambers album, including the one member who was actually in jail and couldn't be on the album, then I'd give them a 100% score on the final exam. So they had to list all nine members, but no
(09:18):
one could do it. One guy got close, but he misspelled, like, Ghostface Killah and Masta Killa. He missed the H.
Speaker 2 (09:26):
Oh, automatic fail.
Speaker 3 (09:27):
Right? Yeah, I said it had to be the exact names and everything. So yeah, we've done fun things like that before.
Speaker 2 (09:35):
Yeah, yeah, it's a really fun class. It's really engaging to go through. Sometimes the material can get dense once you're in the weeds of things, but it'll be funny to look up and see RZA and GZA on the board and think, okay, is that some sort of acronym for an indexing scheme? Oh wait, no, that's from the bootlegs.
Speaker 3 (09:57):
Let me comment on the density. The course that I teach is like, as you mentioned, the OS class. The OS classes at universities, those are expected to be very dense and down in the weeds of here's what the kernel of the OS is actually doing. And no one had really done that for databases.
(10:17):
The standard database course at most universities is like, here's what the relational model is, here's what the normal forms are, and maybe that's the first half of the semester, like how to do data modeling, and the second half will be here's what transactions are, here's what the concurrency rules are. And when I first started at CMU, that's the course I used to teach. It was a different number, and I used to co-teach it with
(10:40):
the other professor I mentioned, and honestly, I didn't like it, because I just don't think it set up students for the things they actually really needed to know if they were going to go into industry and work on the internals of database systems. Nobody does normal forms. You don't need to know these things, Armstrong's axioms. So I threw all of that out. At the time
(11:02):
also, when you searched easy computer science courses at Carnegie Mellon, this old database course used to come up as number one, so I was like, I can't have that, I've got to fix that. So we threw it all away, and I basically decided, here's the course that I would want to have taken if I was an undergrad, of all the things I think are important, and that's how it evolved. And, yes, I understand that it can be a bit down in the weeds.
(11:26):
There's so much more material I could cover. There's not enough time in 14, 15 weeks. But the expectation is not that everyone's going to retain everything that the course covers. The idea is that when you go off into the real world, whether you're working on database systems or even just using a database system, which pretty much everybody has to these days in an application domain, you want to
(11:50):
understand what the system is actually trying to do for you when you give it data, or you want to understand what its behavior is. So, yes, it's dense, but it's not meant to be, here's everything you need to remember if you go out in the world. It's enough to give you the background and say, okay, here's how things actually work.
Speaker 2 (12:08):
Yeah, absolutely. It's an incredible class, and people who take courses like operating systems, where you build the internals of a database, or get into computer architecture, people always say those are the most gratifying courses for computer science students in a lot of cases. So it's really cool to see the database lens of that, where,
(12:28):
okay, the database is also this client to the OS and kind of restricted by the OS in a lot of senses. So we'll get into that as well. But a super fun course. I took my first stab at it, got through the first project and half the buffer pool project, and thought, okay, I need to block out more time for that. I would probably have to take a sabbatical to really do the course justice. But I totally recommend it for a lot of folks, especially
(12:53):
with a software engineering background, who want to get more into database engineering. My brother's a CS student. I said, just go look at this class, just go snoop on this class, since they don't offer the equivalent at your school, and he was very interested in it as well. So I do want to get into some of the material in the lectures. Obviously, vector databases are heavily associated with AI in
(13:16):
industry. It's one of the use cases for taking LLMs and tying them to the database. In your course, you categorize vector databases with other forms of indexing, so I want to hear your thought process behind that.
Speaker 3 (13:31):
Yeah. So I guess we should first explain what a vector database is. A vector database, at least as it's been defined in the last couple of years, is a database system that supports taking vectors or embeddings. So think of a big array of a bunch of floating-point numbers that are generated by a transformer.
(13:52):
You take your text document, you run it through this transformer, and it converts it into this vector of floating-point numbers or decimals, and then you build an index on that vector so that you can do quick, approximate nearest neighbor searches. So the idea is that you take all your documents or all your data, you generate all the vectors for them, you load it up
(14:14):
into this index, and now when a query comes along and says, find me all the records that are similar to this other record, and that similarity can be based on the semantic meaning or a bunch of other factors that your transformer could take into consideration, then you do a nearest neighbor search and you find the
(14:35):
vectors that are closest to the vector you're trying to search for. And so you break it down to the components of what it takes to actually do that approximate nearest neighbor search. The first thing is that you store the data that you're going to then build vectors on, and in all of the major vector databases,
(14:56):
this is typically JSON. You take your JSON, identify which fields you want to then turn into vectors. So right away that says, okay, you need something to store some kind of structured data, as JSON, or it could be relational data, it doesn't matter. And then there's the approximate nearest neighbor search that's occurring on this other data structure that can do the nearest
(15:19):
neighbor search. So again, if you just take out the AI veneer on top of all of this, underneath it's just a data store and an index, right? It could be a relational database plus a B+tree index or a hash index. And so the approximate nearest neighbor search, that vector index, it's exactly as I'm saying, it is an index.
(15:40):
So there's nothing radical or completely different about the idea that you want to do this nearest neighbor search on vector data that requires a completely brand-new database system architecture. And go look at when vector databases became hot, in like 2022, when ChatGPT became sort of a household name.
(16:03):
Within one year, all the major relational database vendors had added vector indexes, right? They didn't have to rewrite everything. They just added this new auxiliary index, just like if it was a new B+tree or a hash table. So it is basically just that index. Now, there are some changes you have to make in
(16:23):
the query optimizer. Do you search the index first, or do you do it after, when you want to do additional filtering? Maybe you change the execution engine in how you actually invoke that index. You're not getting exact results, so you may actually have to filter later on. But at a high level, the query processing approach on a vector
(16:44):
index is about the same as it would be on a relational database or a JSON database. So in my opinion, it is just an index.
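To make the search Andy describes concrete, here is a minimal Python sketch of nearest neighbor lookup over embedding vectors. The documents, embeddings, and function names are all hypothetical, and this does the exact, brute-force version; a real vector index (an HNSW graph, IVF lists, and so on) approximates this to avoid scanning every vector.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, records, k=2):
    """Exact k-nearest-neighbor search by cosine similarity.
    A vector index answers the same question approximately,
    without comparing the query against every stored vector."""
    ranked = sorted(records.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical 3-dimensional embeddings keyed by document id.
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.8, 0.2, 0.1],
    "doc_c": [0.0, 0.1, 0.9],
}
print(nearest_neighbors([1.0, 0.0, 0.0], docs, k=2))  # ['doc_a', 'doc_b']
```

The point of the sketch is the shape of the operation, not the data structure: swap the full scan for an approximate index and you have the "vector database" feature bolted onto any data store.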
Speaker 2 (16:55):
Yeah, very cool to hear that perspective, and it definitely makes sense, especially if you look at all the parts that were adjacent to vector databases in your lecture. We talked about Lucene, for example. Lucene isn't necessarily new either, right? It's over a decade old. It's been around for a while, and you've seen approximate
(17:15):
nearest neighbor search as a way to store and retrieve data for a while now. But yes, it seems like now it's viewed as one of the use cases for tying AI and databases together in the storage layer. We'll get more into that, especially when we get into your paper, What Goes Around Comes Around... And Around, but first I just
(17:36):
want to dive into the technical side of it.
Speaker 3 (17:39):
Yeah, I would say one thing also: the vector databases are going to do a better job at that approximate nearest neighbor search than, I think, the relational database systems. One of the things they do better is integration with all the various AI tools or LLM tools and that sort of ecosystem, like LangChain and those kinds of things.
(18:00):
They're going to do a better job of integrating with those things. I know that Oracle has put out some stuff that can connect to OpenAI and other stuff more easily, but I think they'll do a better job at that. But again, the core architecture is essentially the same.
Speaker 2 (18:16):
Yeah, yeah, it's always interesting to see the architectures that these companies have. There might not be anything groundbreaking or new in some of it, but the way they've built the database for the go-to-market around AI. Okay, how do we integrate it? How do we make it easy for developers to operate? The people who are actually doing AI in a company,
(18:38):
how do they work, and how do we make this really adoptable by them? I think those are the kind of ambiguous things that companies that are super focused on delivering something might prioritize over, let's say, an Oracle. And I don't want to downplay Oracle, because, of course, there's a lot of impressive things that they've done, and
(18:59):
they're very focused on this area. But I would not say it's a repeat of, let's say, NoSQL 10 years ago, where MongoDB was able to come into the market, and Cassandra and others, unopposed by the traditional vendors. A lot of interesting things will happen there, for sure.
Speaker 3 (19:19):
I think the JSON story with Mongo is interesting, because Mongo became a popular system in 2009, 2010, and maybe even earlier, 2008. But it still took a few years before the relational incumbents
(19:41):
added JSON support. I think Postgres added it in 2012. The SQL standard added it in 2016. The initial support from Oracle was maybe 2013, 2014. So it took about three, four years before the relational databases said, oh, this is a thing, let's add it to SQL, let's add it to our database systems. Which is funny, because they already had support for XML, and
(20:03):
going from JSON to XML is not that radically different, so they could have repurposed some of that code, but it took a while to happen. Whereas if you look at the vector search stuff, again, like I said, within a year of ChatGPT becoming, this is a thing you want to do, you want to do RAG or do these vector lookups in these specialized databases, it took a
(20:25):
year for them all to add that. And that tells me that the effort it took to add a vector index and approximate nearest neighbor search to an existing system was either easy, because again, it's just another index you plug in, or people perceived this as a very
(20:45):
important use case that everyone needs to have. It's the future, we don't want to be left behind, let's put all our resources at it right away. And it's probably a combination of the two. Within a year of ChatGPT blowing up, all the major databases had vector support, whereas with JSON it took multiple years.
Speaker 2 (21:02):
Yeah, absolutely. That's a great observation, too, and just to see the signals going out there. Okay, Larry Ellison probably wasn't super excited about JSON, but he is going to dinner with Elon Musk and Jensen Huang and talking about AI and trying to build his own nuclear data center, so I don't think he's going to ignore this or let new
(21:23):
vendors come into their territory here. It just has everyone's attention, and, of course, the big hyperscalers are also throwing a lot of IP at this. I could see why Mongo and some of the NoSQL databases and some of the more developer-oriented databases might have been able to take a lead in the early 2010s.
(21:44):
Maybe the large, established vendors said, that's just not our market, it's not super interesting to us. That doesn't seem like the case with AI. It seems like with AI, everyone's going full throttle into it. It'll be interesting to see what happens there. We don't even have an established, repeatable architecture yet for real-world AI applications. There are some examples, but I think a lot of the work is still
(22:06):
pending to see what becomes resilient. I also want to ask you about some of the other parts of your lectures. Without getting into too many specifics, I just want to compare, at a high level, columnar versus row-based storage and some of the trade-offs there. Let's say, could you walk us through some of the deeper
(22:27):
technical reasons why an OLAP system might prioritize columnar storage and compression, whereas an OLTP system, a transactional engine, might focus on row-oriented layouts?
Speaker 3 (22:39):
Yeah, at this point this is established conventional wisdom, so I don't think I'm breaking new ground here. But if you think of the original databases from the 1970s, 1980s, the original relational databases, they would store everything in rows, meaning all the values or attributes for a single record or tuple would be adjacent to
(23:01):
each other. Like account information: you'd have my first name, last name, mailing address. All that would be contiguously laid out, one after another, and then the next record wouldn't start until the current record finished. That's how the data would actually be organized in memory and on disk. And for OLTP workloads or operational
(23:21):
workloads, this makes sense, because the queries are typically looking for single entities in the database: go get Andy's record, go get John's record. And so when the query wants to go get that data, you want to land at some location and just read things contiguously and bring it back to the application. So in that world, storing things in a row store makes
(23:42):
sense. For analytics, you're not looking at single records, you're looking at the aggregation of all the data you've collected. You're extrapolating new knowledge from the collection of data you've accumulated. So in that world, you're looking across multiple records, and typically you're looking at a subset of the columns.
(24:03):
Find me all the records within this zip code. So I don't care about your first name, your last name, and so forth, right? I don't care about your birthday. Those columns are all unnecessary for the query. I'm looking at just one single column. So people basically figured out in the 2000s, although the idea
(24:24):
goes back to the 1970s, that if you store the data within a single column contiguously, break out the attributes of every single tuple and store all the values within a column one after another, then that is better for these queries, because now you can jump to one location on disk or in memory and just read contiguous data for that one column,
(24:48):
and you're not polluting memory. You're not doing as much IO to fetch data that you don't actually need for the query. If I don't need the first-name information about all your customers, why bring that into memory, why fetch that from S3 or from disk for the query? So that's the major distinction between the two. The workload dictates how you actually want to store
(25:10):
the data. But it turns out, for the columnar organization, there's a bunch of other advantages you can exploit, because now all the data within a single column is contiguous. You can start using compression and other encoding techniques to reduce the size of the data on disk, and then, when you actually bring it into memory, there are other optimizations you
(25:31):
can do to keep things in low-level CPU caches and allow you to rip through the data and process it more efficiently.
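The row versus column distinction above can be sketched in a few lines of Python. The records and field names are made up, and a real engine works with pages on disk rather than Python lists, but the access patterns are the same: a point lookup reads one whole tuple, while an aggregate over one attribute never touches the others in the columnar layout.

```python
# Same three records, two physical layouts (a toy sketch, not a real engine).
rows = [
    {"name": "Andy", "zip": "15213", "balance": 10},
    {"name": "John", "zip": "94105", "balance": 20},
    {"name": "Mia",  "zip": "15213", "balance": 30},
]

# Columnar layout: one contiguous list per attribute.
columns = {
    "name":    [r["name"] for r in rows],
    "zip":     [r["zip"] for r in rows],
    "balance": [r["balance"] for r in rows],
}

def get_record(i):
    # OLTP-style point lookup: one whole tuple, read contiguously,
    # which is exactly what the row layout makes cheap.
    return rows[i]

def count_zip(zip_code):
    # OLAP-style aggregate: only the 'zip' column is scanned; names
    # and balances are never fetched in the columnar layout.
    return sum(1 for z in columns["zip"] if z == zip_code)

print(get_record(0)["name"])   # Andy
print(count_zip("15213"))      # 2
```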
An obvious thing would be, if I have a column with someone's sex, and, to keep it simple, say it's male and female, I understand it's more complicated than that, just say it's two
(25:52):
values, male and female. Instead of storing male, female, male, female over and over again, if I sort them based on the value, I can have all the males first, followed by all the females, and now I just need to record, I have a million males, followed by, I have a million females. And now I'm
(26:14):
taking what would have been two million values and storing them down to just two entries that say, here's how many repeated values I have. So that's the idea behind run-length encoding. There are other optimizations you can do, but if things are stored contiguously within the column, there's a whole bunch of other ways to make the database system run faster, and that's why the modern columnar systems just
(26:35):
crush anything that was done in the 1990s or earlier.
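The million-males, million-females example is run-length encoding. Here is a minimal Python sketch with hypothetical data; a real column store applies this per column segment and layers other encodings (dictionary, bit-packing) on top.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of repeated adjacent values into (value, count) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]

# Two million sorted values compress down to two (value, count) entries,
# exactly the reduction described above.
column = ["M"] * 1_000_000 + ["F"] * 1_000_000
encoded = rle_encode(column)
print(encoded)  # [('M', 1000000), ('F', 1000000)]
```

Note that the compression only works this well because the column is sorted; run-length encoding on unsorted data buys little, which is why sort order is itself a tuning decision in columnar systems.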
Speaker 2 (26:39):
Yeah. And do you see opportunities for convergence here? There's a lot written about kind of HTAP, hybrid transactional analytical workloads. Or will Stonebraker's quote, one size does not fit all, still ring true for the foreseeable future?
Speaker 3 (26:57):
So, hybrid transactional analytical processing. This is a Gartner term. It's an amalgamation of OLTP, online transaction processing, which I think dates to the 80s, and OLAP, online analytical processing, which dates to the 90s. That was invented by Jim Gray, the Turing Award winner in
(27:19):
databases. So HTAP is the idea that I can have a single system support both categories of workloads. On paper, this clearly makes sense, right? Why provision one database for your transactional workloads and then another database for your data warehouse or your analytics? It'd be great if I could have a single system just do
(27:41):
everything. In practice, though, the challenge is that oftentimes the stakeholders at an organization are separate, meaning the people that want to run the transactional system are not the same people that want to run analytics. And if you come along now with a system that kind of does a half-assed job on both of them, the stakeholders for
(28:04):
the transactional stuff say, why would I want this sort of handicapped system that can do both? All I care about is transactions. Give me the best transactional database you have. And likewise, the data warehouse people are like, I don't care about transactions, give me the best analytical system that you have. And so oftentimes selling an HTAP system can be challenging, because the people that are writing checks for the software
(28:27):
often have different needs or different desires. I will say, though, I think it makes sense to have some minor support for analytics on the transactional side of things. I don't think it makes sense, or it
(28:51):
would be very difficult, to sell a full-fledged transactional database system bolted on top of your data warehouse. Snowflake has their hybrid tables, but that's certainly not meant to replace a full-fledged system like an Oracle RAC or, say, an Aurora. It can do some transactional stuff, which makes sense for certain workloads, but you wouldn't want to run your full-fledged front-end application on top of
(29:12):
that. On the flip side, if you add some support for analytics on the operational database system, I think that makes sense, because now you can do some minor analysis on the data as it comes in, rather than having to wait for it to stream out to your data warehouse, get processed, do the analytics,
(29:33):
and then send the results back. But certainly you wouldn't want to run heavyweight analytical jobs on the transactional system all the time, because that's going to slow things down. Now, there are systems that do the hybrid approach, which I think is the right way to do it, like Oracle does. It's called fractured mirrors, where you basically have the row store. All new data comes in and gets stored on the row store side of
(29:55):
things, and then they make a copy of the data in the column store format. So now, when a query shows up, you can figure out, okay, does the data I need touch the analytical side, therefore I run on the column store, or do I need to run on the row store side? I think that's probably the right way to go. But trying to build a full-fledged column store that can do transactions
(30:19):
at the same speed that a row store can, I think that's challenging.
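The split design described here can be sketched as a toy router in Python. The structures and names are entirely hypothetical, and this ignores everything a real system must handle (transactions, keeping the columnar copy in sync, the optimizer's actual cost model); it only shows the shape of the idea: writes land in a row-oriented primary, a columnar mirror serves analytics, and queries are routed by kind.

```python
# Primary copy: row-oriented, keyed by record id (good for point lookups).
row_store = {}
# Mirrored copy: column-oriented (good for scans and aggregates).
column_store = {"id": [], "amount": []}

def insert(record_id, amount):
    # Every write goes to the row store and is mirrored into the
    # column store; real systems do this mirroring asynchronously.
    row_store[record_id] = {"id": record_id, "amount": amount}
    column_store["id"].append(record_id)
    column_store["amount"].append(amount)

def route(query_kind):
    """Point lookups run on the row store; analytical scans run on
    the columnar mirror, so they never slow each other down."""
    return "row_store" if query_kind == "point_lookup" else "column_store"

insert(1, 100)
insert(2, 250)
print(route("point_lookup"))        # row_store
print(route("aggregate"))           # column_store
print(sum(column_store["amount"]))  # 350
```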
Speaker 2 (30:23):
And that's one of the interesting things you're bringing up, that they're both better at different things. Right, the column engine is generally better for these large-scale analytics workloads, people trying to get an approximate sense of some metrics in their business, whereas for these transactional workloads you're thinking, okay, this is your ATM system, your travel reservation system, your
(30:46):
health records. All these things are very operational. What's an example of some of the actual technical trade-offs going on there?
Speaker 3 (30:56):
Of, like, why you don't want to run analytics on top of the row store? Or just in terms of, you know, a transactional system, like what is a transactional system
system, like what is atransactional system?
Speaker 2 (31:06):
prioritizing that an analytical system can't do very well, for instance.
Speaker 3 (31:11):
Oh, so a transactional system wants to run transactions, and in that world it's all about minimizing latency; you look at the P99. So you're trying to execute queries and return responses to the application as quickly as possible. And if you're running transactions through the
(31:35):
concurrency control protocol, where the database system is trying to make it look like each transaction is running by itself when it's actually interleaving multiple queries and transactions at the same time, in that world you want things to be done as quickly as possible, because transactions are holding locks, or they're doing so much of the bookkeeping on the inside that they could slow down other transactions. So if your transaction runs slower, there's this butterfly
(31:56):
effect where it's going to cause other transactions behind you to get slow.
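That tail-latency focus is easy to make concrete. A minimal sketch in plain Python, with made-up latency numbers: the P99 is just the 99th percentile of observed per-query latencies, and a handful of slow outliers barely move the mean while dominating the tail.

```python
import statistics

# Hypothetical per-query latencies in milliseconds for an OLTP service:
# mostly fast queries, with 1% slow outliers (e.g. stuck behind a lock).
latencies_ms = [1.0] * 990 + [50.0] * 10

# statistics.quantiles with n=100 returns the 1st..99th percentiles;
# index 98 is the 99th percentile (the "P99").
p99 = statistics.quantiles(latencies_ms, n=100)[98]
mean = statistics.mean(latencies_ms)

# The mean hides the outliers; the P99 exposes them.
print(f"mean = {mean:.2f} ms, p99 = {p99:.2f} ms")
```

This is why a transactional system is tuned against the P99 rather than the average: the queries in the tail are exactly the ones holding locks longest and dragging everyone queued behind them.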
And so, as I mentioned with the fractured mirrors, or the split row store / column store approach, it's a way to do isolation and say: for all my transactions, I'm dealing with just the row data, and I can run those as fast as possible, and maybe the slower analytics run on the side here, on
(32:17):
the column store data. And if I build my system a certain way, they don't have to communicate about who's reading what data at what time, and I can just let the row store run as fast as possible. So the priorities for what you would care about from the application perspective in the transactional system are just different than in the column store system, and that causes you to
(32:39):
make different system design choices.
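That layout difference can be sketched in a few lines. This is a toy illustration, not a real storage engine: the same hypothetical table stored row-wise and column-wise, with an analytical aggregate needing only one column versus a point lookup needing one whole tuple.

```python
# Toy "orders" table, row-oriented: each tuple stored together.
rows = [
    # (order_id, customer, amount)
    (1, "alice", 30.0),
    (2, "bob",   12.5),
    (3, "carol", 99.9),
    (4, "dave",  45.0),
]

# The same data, column-oriented: one contiguous sequence per attribute.
columns = {
    "order_id": [1, 2, 3, 4],
    "customer": ["alice", "bob", "carol", "dave"],
    "amount":   [30.0, 12.5, 99.9, 45.0],
}

# Analytical query: SUM(amount).
# Row layout: every tuple is touched even though we need one field.
row_sum = sum(r[2] for r in rows)
# Column layout: read just the one column, a contiguous run of values
# that compresses well and suits vectorized execution.
col_sum = sum(columns["amount"])

# Transactional query: fetch order 2 in full.
# Row layout: one tuple lookup. Column layout: one probe per column.
row_point = rows[1]
col_point = tuple(columns[c][1] for c in ("order_id", "customer", "amount"))

print(row_sum, col_sum, row_point == col_point)
```

Same logical table, opposite access costs: the scan favors the columnar layout, the point operation favors the row layout, which is the heart of the OLTP/OLAP split.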
Speaker 2 (32:42):
Yeah, absolutely. And in the markets you always see this attempt to unify everything: okay, we're going to unify transactions and analytics, we're going to unify batch and streaming. But I always get the sense that people who have done this for a long time will always say there are a lot of trade-offs
(33:03):
going on there; it's hard to actually do both of those things well. This again comes back to that quote from Michael Stonebraker, which is "one size does not fit all." And yeah, it's one of those interesting things. When you look at it at a high level, at an intuitive level, you think: why wouldn't someone want to just have one storage engine that does everything for me?
(33:23):
I think from a layman's perspective, you'd think that would just be like a killer product for the market.
Speaker 3 (33:28):
But when you work on this stuff, you know there are a lot of technical trade-offs going on when you make a choice as well. But the reason why I don't think the HTAP stuff has really taken off is, as I was saying in the beginning, it's oftentimes not technical, right? So think about it: say you're building a new application from the very beginning, whether you're a startup or inside of a company. What happens?
(33:49):
Well, in the very beginning you don't have any data, right? So you're building an application to interact with the outside world, or some other outside thing, and to collect new data. You'll make a web app, you'll make a new application that can ingest data from the outside world. At the very beginning you don't have any data.
(34:09):
You need to get data. And so you're building essentially a transactional database application, because you're ingesting things from the outside world, you're updating things, updating state. At the very beginning you have nothing, so you start with a transactional database system, like a Postgres or whatever. So now you start ingesting data. Then typically the path of growth for these applications is
(34:32):
not that one day you have one user and the next day you have a billion users. That is rare; it happens, but that's usually not the case. So it's a gradual increase of usage, and new features get added and new data comes in, so the database is gradually growing over time. And then at some point you say, okay, now we do have data,
(34:53):
let's run analytics on this. And so at that point you try to run analytics on the data that you have, and maybe it does okay, but not great.
But then you say, okay, I don't want to touch this operational database, this transactional database, right now; let me start offloading the data to a data warehouse, a Snowflake or whatever, and start running analytics on that.
(35:14):
So over time these two things are growing up separately, and then the HTAP market is basically trying to say: okay, these things grew up together but separate, as siblings; let's now go sell you something that's going to sit in the middle and can do both. And it's very hard to sell a database system to replace a
(35:36):
working and functional transactional database system, because the risk is so high. So you're saying: you're running Postgres, you're running Oracle RAC or whatever as the transactional database; go get rid of that and replace it with my new database. But if that new database fails, then you're screwed, because now you can't take in new data, you can't run orders, you can't
(35:58):
take payments, you can't get new data. Whereas on the data warehouse side, you can still keep up the old data warehouse. You just make copies of the data into whatever the new product is, and if it fails, no big deal, because you're still ingesting data in your transactional database and you can always roll back to the other data warehouse. So the barrier of entry to replace a data warehouse is much
(36:19):
lower than it is to replace a transactional database. And so now if you're trying to say, I have an HTAP database, you're still going to face that same barrier, where people don't want to fix what's not broken. So it's very hard to get people to replace a transactional database once it's grown up and been around for years. There's a reason why IBM still makes a ton of money on IMS. They built that database for the moon mission in the 1960s,
(36:42):
and people don't want to break stuff if it's running just fine, or don't want to change it.
Speaker 2 (36:50):
That's such a great point. Practically, it'd be very hard to take that to market. Because, okay, you have your CTOs and teams who've already built applications running on top of the transactional database, which could be in place for, let's say, a year, or it could be 40 or 50 years. And then you have this new CIO, this chief data and analytics officer, whatever you want to call it, who wants to come in and say: hey, I want to unify analytics across all my systems.
(37:12):
The transactional system is just one of my sources, and then I also have my marketing data, I have my third-party data; just get that all into the analytics warehouse. And then I'm going to have who knows, hundreds to thousands of data scientists and analysts running queries on that thing, right? And from that perspective, you can see
(37:33):
why it would be hard to say: okay, hey, transactional database owner, let's onboard all these analytical users for you. That would be its own kind of logistical nightmare and technical nightmare. How would you even do that without impacting the production, operational applications? It's definitely very cool when you see some of the new technology coming in.
(37:54):
Let's say, for instance, DuckDB. I know that's in some of the exercises in your course as well. I did like that part of the SQL exercises where you run some queries in SQLite, then put a timer on it and run the same queries in DuckDB, and it's just automatically faster. And for me personally, I use both DuckDB and a data warehouse. The product I work on does change data capture and data
(38:18):
streaming. We'll always stream data out from our transactional database into our warehouse with streams, but there are some kinds of in-situ analytical workloads that we want to put on top of the operational database. DuckDB has been awesome for that, because in that same application that's managing the database, I can just instantiate
(38:38):
DuckDB and it'll do these super fast analytical queries, and that just solves our problem. It's not super scalable, but it's fast and it works.
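That in-process pattern can be sketched with the standard library. sqlite3 stands in for DuckDB below so the example stays dependency-free; DuckDB's Python API follows a very similar connect/execute shape, but with an analytics-oriented columnar engine underneath. The table and data here are hypothetical.

```python
import sqlite3

# Embedded, in-process engine: no server, just a library call inside
# the application that is already managing the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5), (2, 2.5)],
)

# An "analytical" aggregate over the local data, no external
# warehouse round-trip involved.
top = conn.execute(
    "SELECT user_id, SUM(amount) AS total "
    "FROM events GROUP BY user_id ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # (1, 15.0)
```

Swapping in DuckDB keeps the shape of the code the same while making the group-by/aggregate side much faster on larger data, which is the appeal being described here.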
Speaker 3 (38:48):
Yes, it's fantastic. I think what Mark and Hannes have built is phenomenal. I'm jealous, not in an "I hate them" way, but they've got a database system. I like database systems, and they got people to actually start using it. They had the idea that this thing needs to be super portable and
(39:09):
have a low barrier to entry from the very beginning, and I think they crushed it. Yeah, and the out-of-the-box performance is always awesome.
Speaker 2 (39:16):
Engineers love publishing mini benchmarks of their internal workloads, and it's always a big win when I can go show my manager: hey look, we sped things up by 5 to 10x by throwing this technology in there. Of course, my management are all incredibly smart people who had a ton of questions, and I said, okay, we're not solving that problem; we just solved this problem and
(39:37):
increased the performance there. There's always some excitement around performance gains. So, we talked a bit:
I mentioned Michael Stonebraker's name, and you did as well. I brought him up in the context of "one size does not fit all," but you recently co-published a paper with him called "What Goes Around
(39:57):
Comes Around... And Around." Yes, I want to hear the story behind that paper.
Speaker 3 (40:04):
Yeah. So we should preface this by saying Mike wrote another paper with Joe Hellerstein at Berkeley in 2006, called "What Goes Around Comes Around," and our paper in 2024 is the 20-year follow-up to it. For background, for people who don't know: Mike Stonebraker won the Turing Award for databases in 2014.
(40:25):
He's the inventor of Postgres; prior to that it was Ingres. He's been involved in the database scene for a long time. He's brilliant. So he wrote that paper in 2006, and basically it was a historical summary of all the data models, going back to the hierarchical and network data models, which predate the relational data
(40:46):
model, in the 1960s. And it basically shows all the attempts to build something better than the relational database, the relational model, and here's why they're not as good, here's what doesn't pan out, and here's why the relational model, or, as he puts it, the object-relational model, which is what technically Postgres' data model is
(41:08):
(it just means it's a relational data model that's extensible), is superior.
And so, I think it was during the pandemic, around 2020, I saw a Hacker News post where someone was like: hey, I don't know why people are using relational databases; we should just use graph databases for everything. And I was like, oh man. Through my proselytization of the
(41:30):
relational data model, I feel like it's basically Mike's thesis of what goes around comes around, where people are just reinventing the wheel over and over again. So this person saying "why are we using relational databases, why aren't we using graph databases" is just unaware that, oh, people tried that with this thing called CODASYL, the network data model, in the 1970s, and it didn't work for reasons X, Y, Z. So I just felt we should put out another paper as a follow-up
(41:54):
and say: here's everything people tried in the past, if you don't know your history, and here's why the relational data model is going to be the best, or the better one.
And so we wrote the follow-up paper that came out this year, and it's basically an updated analysis of the data models that have come out since the original 2006 paper.
(42:14):
We do discuss key-value stores, which predate that; key-value stores are probably late '80s, early '90s. But we cover the array data model, the matrix data model, the vector data model, the document data model, the graph data model, and then sort of MapReduce, which doesn't have a data model, and text
(42:36):
search engines that don't have a data model; we discuss those as well. So that's the first half of the paper. And the second half of the paper, which we haven't talked about too much, is the developments or advancements in database systems since the early 2000s. And the main takeaway from that part of our analysis is that most of the improvements and advancements have been in
(42:57):
the context of relational database systems: cloud architectures, hardware acceleration, the column stores we talked about before. There was this group of systems called NewSQL that do high-performance distributed transactions. So we basically talk about: here's all the things people tried, here's what worked, here's what didn't work.
Speaker 2 (43:20):
Yeah, and it's really important that you put out papers like this because, and I don't want to get too much into this, new database companies are really good at marketing. In a lot of cases they do have a lot of value, but sometimes they'll make claims that are a little hard to defend when you
(43:40):
look at it objectively.
Speaker 3 (43:44):
It's quite often, right? I love to shit on blockchain databases. Blockchain, that's another one. People would be like: why would you ever want to store your things in a relational database? Just use a blockchain. No, it's stupid, it's slow, it's wrong, and here's why. So, yeah, instead of having to set up a Google News Alert or something to find out every time people are saying, hey, relational databases are stupid,
(44:06):
SQL is stupid, here's a better one, and having me go reply to them, we just write this paper and people can point at it. And then in 20 years people will forget we wrote it and we'll have to write the next one.
Speaker 2 (44:17):
Yeah, amazing. And I'm sure it won't be hard, because I think 98% of what you wrote in this year's paper will probably still be relevant in 10 or 15 years as well, and will probably just need to be restated and republished. I did want to get some of your perspectives on data lakes and data lakehouses. You did mention that they've emerged
(44:40):
to challenge the monolithic data warehouse. There are, of course, all these conversations going around Iceberg and Delta, and of course Databricks bought Tabular, which is the... For a lot of money. Yeah, for a lot of money. And now AWS created S3 Tables, which will manage Iceberg tables. From their perspective, AWS is probably seeing all these people
(45:03):
managing tables on top of their object layer on S3, and they're probably saying, hey, why don't we just do that? I want to get your perspectives there. Do you think it's a resilient technology that a lot of companies can go ahead and adopt, or do you think there are some challenges ahead?
Speaker 3 (45:18):
Yeah. I guess we first want to define what a lakehouse, or data lake, is. The way people ran data warehouses before data lakes was that you would provision a very expensive machine, or a large number of machines, and that would be this monolithic architecture where you would put all your data in for the organization. The database system was basically the gatekeeper.
(45:39):
So if you wanted to store data in your data warehouse to do analytics on, you had to define a schema, you had to have the hardware, or you had to get permissions to put it in, and then you would insert it into the database. And the database did what we call managed storage, where it was responsible for ingesting the data, organizing
(45:59):
the actual physical bits you would then store on disk, and maintaining that information. With the data lake architecture, the idea is that instead of having everyone go through central control to put data into this data warehouse, you can allow anybody to just write a bunch of files to S3, or whatever your
(46:20):
object store is. And the idea would then be, if someone wants to do analytics on this data, they wouldn't have to go through the data warehouse again; they could just grab the data off of S3, or whatever your object store is, and process it there.
So these files, worst case
(46:41):
scenario, are just something like JSON or CSV, text-formatted data. But things like Parquet and ORC, these are binary-encoded columnar formats that look as if the data was organized by the database system, but there are libraries to generate these files yourself and just write those things out. So that's the idea of a data lake: instead of having
(47:02):
everyone go through a monolithic data store, you just put things out in S3, and then you use a catalog service, like Databricks has Unity, there's HCatalog, there's now Snowflake Polaris, as a way to go find the data. The lakehouse then provides a SQL
(47:28):
interface to go run queries as if the data was in managed storage in a data warehouse, but actually it's just on S3. Dremio and Databricks are examples of this. So that's the background there.
Where Iceberg comes in, there's also Hudi, and I think there's another one I'm forgetting... Delta Lake from
(47:51):
Databricks. What those do is, instead of having people generate the Parquet files in their application code and then shove them into S3, you now have an interface where you can do inserts, updates, and deletes on what essentially look like relational tables; they look like tables to you, but underneath the
(48:12):
covers it's just JSON and Parquet. And then Iceberg, or these different middleware layers, are responsible for collecting the information, running transactions for you if you want to update things, and then compacting and coalescing them into Parquet files that get written out to S3. And then again, they also provide the catalog service, so
(48:33):
you know what files correspond to what tables and so forth. As you mentioned, Iceberg is the standard now. Iceberg came out of Netflix, Hudi came out of, I think, Uber, and HCatalog came out of the Hive project, which came out of Facebook. So people have coalesced on Iceberg as being
(48:56):
the standard. And, as you mentioned, Amazon now supports it natively. They had support for Parquet with S3 Select as well, which they haven't deprecated, but you can't get it now if you're a new customer; existing customers can still do basically predicate pushdown, or filtering, on Parquet files directly inside of S3. So Iceberg looks like it's going to become the standard.
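The core idea behind these table formats can be sketched without the actual Iceberg spec. In this toy stand-in (JSON data files instead of Parquet, and nothing like Iceberg's real manifest layout), data files are immutable and a small metadata log records which files make up each snapshot of the table, so a commit is just a new snapshot entry and old snapshots remain readable:

```python
import json
import os
import tempfile

# Toy, spec-free sketch of a table format: immutable data files plus
# a snapshot log saying which files belong to the table at each commit.
root = tempfile.mkdtemp()

def write_data_file(name, records):
    path = os.path.join(root, name)
    with open(path, "w") as f:
        json.dump(records, f)  # a real format would write Parquet here
    return path

def commit(snapshots, files):
    # A commit appends a new snapshot; older snapshots stay readable,
    # which is what enables time travel.
    snapshots.append({"snapshot_id": len(snapshots), "files": list(files)})

snapshots = []
f1 = write_data_file("part-0.json", [{"id": 1, "city": "nyc"}])
commit(snapshots, [f1])

# An "append" writes a new immutable file and commits a snapshot
# that points at both files.
f2 = write_data_file("part-1.json", [{"id": 2, "city": "sf"}])
commit(snapshots, [f1, f2])

def read_table(snapshots, snapshot_id=-1):
    out = []
    for path in snapshots[snapshot_id]["files"]:
        with open(path) as f:
            out.extend(json.load(f))
    return out

print(len(read_table(snapshots)))     # latest snapshot: 2 rows
print(len(read_table(snapshots, 0)))  # time travel to first commit: 1 row
```

The catalog service discussed above is what holds the pointer to the current snapshot, so different engines can all agree on what the table contains.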
(49:18):
The backstory is that Snowflake has incrementally added support for Iceberg, I think since 2022 or... sorry, 2021. And then they were in talks to buy Tabular, I think for $600 million, either early this year or late last year, and then
(49:40):
Databricks came in and just threw $2 billion in their face and stole them, on the day of the Snowflake Summit CEO's keynote session, where he was announcing their Iceberg support and Polaris, the Polaris catalog. And then the next week, at the Databricks Summit, I guess, they announced that they were open sourcing the
(50:03):
Unity catalog. And Polaris, I think, is written in Rust or C++, not in Go or Java, and it has become an Apache project, so it's not just Snowflake building it now. Dremio's involved, a bunch of other companies are involved. So yeah, basically Iceberg is now the standard. Under the covers
(50:23):
it's basically, again, as I said, JSON files and Parquet, but the interface to the catalog and the transactional updates, people have coalesced around that, I think.
Do I think this is, long-term, how people are going to build things? Absolutely, yes. I think the idea that you're going to have a monolithic data warehouse where some administrator has complete control over who gets to put things in and take things out...
(50:46):
you're still always going to have that kind of governance. But I think the disaggregated architecture of something like Iceberg is the way to go. It just makes sense, because why spend so many engineering resources to build a storage layer? If you're trying to build a system, just rely on S3.
(51:06):
It's infinite storage, "infinite" in quotes; obviously there's a finite limit at Amazon, but you'll run out of money before you can fill up S3 on Amazon or GCS or whatever it is. I think this is the right way to go. This is basically how you want to build a modern OLAP system. I think it's here to stay; I don't see it changing anytime soon.
Speaker 2 (51:31):
Do you trust that Iceberg is the best table format this time?
Speaker 3 (51:37):
When you say Iceberg... it's Parquet underneath the covers. Do I think Parquet is the best format? No. Parquet was designed in 2012, 2013, in a totally different world, where disk was always considered the slowest thing, and therefore you used heavyweight compression to
(51:57):
minimize the amount of disk I/O you have to do. The hardware has changed enough that network and disk are actually pretty fast, and the CPU has actually become the bottleneck. And so we have a paper that came out in VLDB last year that did an analysis of Parquet and ORC and basically showed: here's a bunch of assumptions they made back in the day that don't make sense now. And we've been pushing for a new
(52:21):
format based on this, which I can talk about. Microsoft put out a paper around the same time as ours that corroborated our results, found the same things about Parquet and ORC, because they're old.
So we have a line of research that is trying to build a new file format to replace Parquet and ORC. And we're not so much
(52:42):
interested in the encodings for the different columns, like the run-length encoding for column-store data I mentioned before; there are implementations of that, and that's going to evolve over time. I'm more interested in the scaffolding around a file: how you can define what the file should look like, and how you store the metadata that says here's how the data is actually
(53:03):
being encoded. The reason why this matters is, if you want to add a new encoding scheme to Parquet right now, you can't do it; you'd have to go modify the original Parquet code. And this also makes your data not portable, because if my application is not using my modified version of the Parquet reader, and there are a bunch of different implementations in different languages, then it can't read this new encoded data that
(53:24):
I have. And so we're interested in building a new file spec that allows for portability and extensibility, so that we don't have to reinvent the wheel every 10 years with a better version of Parquet. We can just build one file spec, similar to POSIX: one spec that has evolved over the years but is standardized. That's what I'm interested in doing, building that
(53:46):
specification, not "here's whatever we want for this exact hardware that we have right now."
Speaker 2 (53:52):
Yeah, that's exciting for sure. Every four or five years the architectures shift a little bit: people standardize on one approach, the vendors hyper-optimize selling around that approach, then things magically become too expensive again, and then people find a way to make architectures cheaper
(54:16):
and more distributed, with more composable components. So yes, I think coming up with a new file format that's more efficient and accessible is also a very exciting area, and it will be cool to see how that gets plugged into some of these systems. Yeah.
Speaker 3 (54:36):
Can I give you a preview of how we're going to handle future-proofing this thing? Please do. I would say, also, we have our format, and we were actually in discussions with the Velox guys; they put out a file format called Nimble. We had discussions with NVIDIA about some stuff; we had a larger collaboration in the works, but for lawyer
(54:56):
reasons it all fell apart. But Facebook, or Meta, has their Nimble file format. LanceDB has their format. There's another system out of New York City called SpiralDB; they have Vortex. And the DuckDB team at CWI has their own file format. So there's a bunch of fun ones going around. As I'm saying, I don't want to have a competing format; it's more about, again, the scaffolding.
(55:19):
And so the way we handle extensibility and portability is that you actually embed a WASM binary, to decode the data in the file, inside the file itself. So now, 10 years later, if I have some file that was created 10 years ago and I don't have the code to process it natively, I can process it using the WASM that's embedded
(55:42):
inside of it, and this allows you to future-proof the architecture. I should fully admit we didn't invent this idea. It actually came from Wes McKinney, who has been working with us, the man behind pandas and Apache Arrow, but he actually got the idea from Mark and Hannes at DuckDB. So the new file spec we hope to put out this year will incorporate this WASM piece.
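A toy sketch of that self-describing idea: here a string of Python source stands in for the embedded WASM module, and the container layout is invented for illustration, not the actual spec being described. The point is that the file carries its own decoder, so a reader that has never seen the encoding can still recover the column values.

```python
import json
import struct

# Invented container layout: [4-byte decoder length][decoder source][payload].
# In the real proposal the decoder would be a sandboxed WASM binary;
# plain Python source stands in for it here.
decoder_src = (
    "def decode(payload):\n"
    "    # run-length decoding: [(value, count), ...] -> flat list\n"
    "    out = []\n"
    "    for value, count in payload:\n"
    "        out.extend([value] * count)\n"
    "    return out\n"
)

payload = json.dumps([[7, 3], [9, 2]]).encode()
src = decoder_src.encode()
blob = struct.pack(">I", len(src)) + src + payload

# A reader that knows nothing about run-length encoding: it loads
# the decoder shipped inside the file and applies it to the payload.
(n,) = struct.unpack(">I", blob[:4])
ns = {}
exec(blob[4:4 + n].decode(), ns)  # load the embedded decoder
values = ns["decode"](json.loads(blob[4 + n:]))
print(values)  # [7, 7, 7, 9, 9]
```

With WASM instead of `exec`, the decoder runs sandboxed and language-independent, which is what makes the approach safe to rely on a decade later.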
Speaker 2 (56:04):
Okay, excellent, definitely looking forward to that. Again, a lot of excellent ideas floating around right now in the data industry; it's really fun to follow along with. Let's say you are a software engineer working in data, or even all the way up at the CIO level.
(56:24):
What's your advice to stay educated and make people's technical skills resilient in this industry? And emphasis on "resilient," meaning, yeah, you could learn one technology, but that technology could be out of date in 10 years. At this rate, everything's going to be out of date in six months. But I'd love to hear your advice on staying resilient in this market. So this gets into...
Speaker 3 (56:45):
What does it mean in the modern era to have, let's say, a degree in computer science? Right? Because I'm in the business of selling education. And without touting Carnegie Mellon too much, the other top schools, I think, make a similar attempt at this. It isn't about learning whatever the hot thing on Hacker News is and how to use
(57:08):
it, and so forth. It's about first principles and the fundamentals of the concepts needed in data processing, data analytics, data analysis in general. And so, as long as you understand the fundamentals: what is a transaction actually trying to do? What does it mean to have atomicity, or isolation, consistency, and durability?
(57:28):
If you understand those fundamentals, then no matter how the hardware evolves or the workload evolves or the use cases evolve, you can always map whatever those new things are, through your background, to the fundamentals. So I would say, for developers, I think understanding the fundamentals is key. At the higher level, like
(57:53):
for a CIO, how do you make sense of whatever the buzzwords are? The challenge there, again, is that it isn't always for technical reasons that you make a choice to go with one vendor versus another. Oftentimes it's: do I already have my credit card with this company, and therefore they can sell me another product, and it's easier because I don't have to go through procurement to get access to it.
(58:13):
I would say, at the manager level, being skeptical about claims that people make about their products... it's always better to be more skeptical than less skeptical, right? As you said, when people come along and say, hey, here's my brand new database system or technology that can change the world and do everything you need, you really
(58:38):
have to understand what your use case is, understand what this company is bringing to the table, and then understand, okay, here's the actual architecture of what they're doing, how it's actually implemented, so you can tell whether your use case is even going to make sense or not. The example I always like to use in my classes is this episode at Uber, I think in 2016, where they were running on
(58:58):
Postgres. Actually, they were running on MySQL, and someone said, hey, let's switch to Postgres. So they switched to Postgres, and because of their application's workload patterns, that was actually a terrible choice. The way Postgres does multi-versioning and so forth was absolutely the worst thing you could do for Uber. So then they had to switch back from Postgres to MySQL. So if
(59:20):
someone had understood, what does our application actually try to do, what do the queries want to do, what does the data actually look like, what is the access pattern, and then understood the fundamentals of how Postgres does multi-versioning versus how MySQL does multi-versioning, they could then map that and say: does this make sense or not?
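The contrast being drawn there can be caricatured in a few lines. This is a deliberate simplification of both engines: Postgres-style MVCC appends a whole new tuple version on every update (dead versions linger until vacuum cleans them up), while InnoDB-style MVCC updates the row in place and pushes the old image to an undo log for readers of older snapshots.

```python
# Caricature of two MVCC styles; not a model of the real engines.

# Postgres-style: an UPDATE appends a brand-new tuple version, and
# the table keeps the dead versions around until vacuum runs.
pg_heap = [{"id": 1, "balance": 100, "dead": False}]

def pg_update(heap, new_balance):
    heap[-1]["dead"] = True  # old version stays in the heap
    heap.append({"id": 1, "balance": new_balance, "dead": False})

# InnoDB-style: the row is updated in place, and the old image goes
# to an undo log that old snapshots can follow backwards.
innodb_row = {"id": 1, "balance": 100}
undo_log = []

def innodb_update(row, undo, new_balance):
    undo.append(dict(row))       # old image into the undo log
    row["balance"] = new_balance  # in-place update

pg_update(pg_heap, 90)
pg_update(pg_heap, 80)
innodb_update(innodb_row, undo_log, 90)
innodb_update(innodb_row, undo_log, 80)

# Same logical state, very different physical footprints.
print(len(pg_heap), innodb_row["balance"], len(undo_log))  # 3 80 2
```

For an update-heavy workload like the one described, the append-a-new-version scheme means more heap growth (and, in the real engine, more index maintenance), which is the kind of mismatch that understanding the fundamentals lets you predict in advance.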
So I think it makes sense to understand what the internals are for certain things. And then, as new things come
(59:43):
along... it's very unlikely that someone's going to invent something crazy, brand new, that no one's ever thought about before. So it's about getting past all the marketing BS to figure out, okay, what are they actually doing?
Speaker 2 (59:55):
And then understanding how that maps to your needs. Absolutely. And I think that's great advice for the data engineer, the software engineer, and the CIOs: ultimately, understand the first principles and then be a little skeptical of vendor claims. Because even when things seem very new, or companies raise a ton of money
(01:00:17):
or something like that, it's rarely something completely novel that no one's thought of before. Maybe they found a better way to commercialize it than others. But yeah, there's always a bit of work to do to actually test this stuff.
Speaker 3 (01:00:29):
So the other thing, can I give one example without naming names? Yeah, sure. I got an email from a CEO of a database company that you've heard of, but I can't say who they are. They sent me an email and said: hey look, I watched your lectures on distributed architectures. There's, like, shared disk, shared nothing. We think we have a new architecture that doesn't fit in any of these. We think it's brand new. We want to tell you all about it.
(01:00:50):
So I got on the call, talked to him, and, sure enough, it was just shared disk, separating compute and storage. And so again, without understanding the history and the background and the fundamentals of these things, people will make claims that wouldn't pass the sniff test for anybody who does know these things.
Speaker 2 (01:01:08):
Yeah, absolutely. And me especially: building change data capture and data integration, we're constantly straddling all these different new-age technologies and replicating the established vendor storage engines. Those have solved the scale issue, for sure, for companies that actually need scale, and I
(01:01:38):
think there's a lot of great technology coming in to make this stuff more practical for enterprises to implement. Now, the other side of this is, of course, there's all this innovation happening in AI and machine learning, and you spoke on this in your paper as well. Which architectures and capabilities would you say are absolutely critical for a CIO or CTO to build into
(01:02:00):
their strategy over the next three to five years to actually realize the potential of AI in...
Speaker 3 (01:02:07):
...terms of, like, database-type architectures or machine learning architectures, or both?
Speaker 2 (01:02:13):
Yeah, that's a good
question too.
Speaker 3 (01:02:18):
The answer is it
doesn't matter, right, because
here's what matters If your datais dirty and it's total crap,
then who cares if you're runningon whatever NVIDIA's latest GPU
or whatever, if your data issuper dirty and messy garbage in
, garbage out.
So I would say putting up theregardless.
(01:02:38):
Actually, if you're going to do AI stuff anyway, putting up the right controls and the mechanisms in place to make sure that people can't give you crap data is super important. And it's hard to justify, because it's like: hey, we think we might need this in three years, so let's do a bunch of stuff to make sure our data is clean now. Yeah, that sucks.
(01:02:59):
I would say that's the most important thing. If your data is completely useless and garbage, no AI magic is going to make it better for you.
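Andy's point about putting controls in place maps directly onto declared schema constraints. Here is a minimal sketch in Python with SQLite; the `customers` table and its rules are hypothetical, just to show the database rejecting garbage rows up front instead of accepting them silently:

```python
import sqlite3

# Hypothetical table with guardrails: NOT NULL plus CHECK constraints
# so bad rows are rejected at write time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL CHECK (length(trim(name)) > 0),
        email      TEXT NOT NULL CHECK (email LIKE '%_@_%._%'),
        created_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

# A clean row passes the checks.
conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
             ("Jane Doe", "jane@example.com"))

try:
    # An empty name and a phone number in the email column both violate
    # the declared constraints, so the insert fails loudly.
    conn.execute("INSERT INTO customers (name, email) VALUES (?, ?)",
                 ("", "555-1234"))
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Declaring the rules in the schema is exactly the "people can't give you crap data" mechanism: the justification is hard today, but the constraint pays off every day after.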
Speaker 2 (01:03:09):
And when you say useless and garbage: let's say a big enterprise database. You have hundreds, maybe even thousands, of these kind of normalized tables, with column names that no one can really make sense of, and tons of foreign keys to actually get any data. And people are saying, okay, we're going to throw vector
(01:03:31):
databases at this problem. The first thing I think is: okay, try getting a human analyst to go through all these tables and make sense of them without reading a lot of internal documentation. Do you actually see AI solving that problem?
Speaker 3 (01:03:45):
It will help. So certainly entity resolution, the idea that, like, John Kutay versus J. Kutay, realizing that they're the same person: that's an old problem, and absolutely I think LLMs or AI tools can help with this. But, like I'm saying, if it's "J. Kutey" versus "J. Kutay" and
(01:04:11):
someone did a typo, then you're screwed; it doesn't matter. Yeah, I see. So I think that's a contrived example, but there are others. The other example people always give is: oh, someone put an email address instead of a phone number. Sure, that one you can check for as well.
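Both halves of that point show up in a few lines of code. This is an illustrative sketch, not a real entity-resolution system (the names and thresholds are mine): a fuzzy string score groups a typo'd name right alongside the genuine match, while the email-versus-phone mixup is a trivial format check:

```python
import difflib
import re

def name_similarity(a: str, b: str) -> float:
    """Crude entity-resolution signal: normalized edit similarity in [0, 1]."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Close variants of the same person score high...
print(round(name_similarity("John Kutay", "J. Kutay"), 2))

# ...but the score cannot tell a typo'd "J. Kutey" from a genuinely
# different person, which is the "you're screwed" case.
print(round(name_similarity("J. Kutey", "J. Kutay"), 2))

# The email-instead-of-phone mistake, by contrast, is trivially checkable.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
print(bool(EMAIL_RE.match("john@example.com")))  # True
print(bool(EMAIL_RE.match("555-1234")))          # False
```

Format errors are mechanical to catch; semantic errors that still look well-formed are the ones no similarity score or schema check will save you from.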
(01:04:31):
It's the nuances of really large databases and what the semantic meaning is: the latent or implicit meaning of what it means to have this column empty versus NULL. All of that is usually in the application domain and oftentimes isn't documented. And so, if you can prevent that from happening,
(01:04:53):
easier said than done if your data has been running for 20 years, I think that would set people up to better leverage whatever new AI stuff comes along in the future.
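The empty-versus-NULL trap is easy to demonstrate. In this sketch (a hypothetical `users` table in SQLite), the two values even aggregate differently, yet nothing in the schema records which one the application treats as "no middle name":

```python
import sqlite3

# One row with a real value, one with an empty string, one with NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, middle_name TEXT)")
conn.executemany("INSERT INTO users (middle_name) VALUES (?)",
                 [("Quincy",), ("",), (None,)])

total, non_null, empty = conn.execute("""
    SELECT COUNT(*),              -- counts every row
           COUNT(middle_name),    -- skips NULLs only
           SUM(middle_name = '')  -- counts empty strings only
    FROM users
""").fetchone()
print(total, non_null, empty)  # 3 2 1
```

SQL keeps the two cases distinct, but whether `''` means "asked and has none" while `NULL` means "never asked" lives only in the application's head, which is exactly the undocumented latent meaning Andy describes.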
Speaker 2 (01:05:02):
Yeah, that's a great point. And I think, with cleaning up messy data, there are all these kinds of master data management tools out there, and there are a lot of people who claim to have great solutions for that. Ultimately, when you see it happen in large companies, someone's just got to grind and go through that work. And I think, yeah, there are some opportunities for LLMs to go
(01:05:25):
solve those problems.
And the other approach to this, aside from vector databases, is text-to-SQL. They're not totally mutually exclusive, but text-to-SQL is this idea that you have a natural language query and know how to convert it into a deterministic SQL query that'll go in and retrieve the exact data that you're looking for. Do
(01:05:49):
you see that as being a strong alternative to vector databases, or just a different approach that could work well with it?
Speaker 3 (01:05:58):
I think it's independent of whether the underlying database is relational versus vector versus JSON. But, as you said, the idea of going from natural language to a structured query language, typically SQL: first of all, as is often the case in databases, it's not a new idea. People have been trying this since the 1970s.
(01:06:18):
LLMs are just a modern incarnation of it. I think it's a good idea. It doesn't replace SQL entirely; it's good for quick, one-off things. The challenge is, if I don't know SQL, I don't know what the answer is. If I knew the answer, I wouldn't run the query. But then to
(01:06:38):
be able to take my natural language, generate a SQL query, and then get back a result from that and know whether it's actually correct or not: that's a challenge right there. And there are some benchmarks on existing data sets to try to figure out how good you are.
I think the challenge is going to be that the English language, or whatever natural language you're using, is
(01:07:00):
imprecise, whereas something like SQL would ideally be precise, but it isn't always. But the idea that you're going from an imprecise language to a precise language, and then trying to tweak the natural language, to contort it if it's not exactly what you want: that's a big challenge.
Where I see the natural language tools, these
(01:07:22):
converters, being useful is as the first attempt: give me a SQL query, but then somehow show a more structured form or interface where I can tweak it by clicking buttons in the dashboard or something like that. So, as a first pass, but then put that into a form that makes it easier to edit. That's where I see the future of this being.
(01:07:42):
But, like, for one-off things, the results are pretty stunning. Obviously, though, you wouldn't write your application doing this. You wouldn't build a web interface or a website where you have natural language generating the queries, because if the LLM model changes, then the query changes, and then everything breaks. What do you do? But for one-off analytics, I think this makes sense.
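One way to make that "first pass" safer can be sketched as follows. Everything here is my own assumption (SQLite, a stubbed model output, no real text-to-SQL system): treat the generated query as untrusted, allow only SELECTs through, and ask the database to plan the query against the live schema before ever executing it:

```python
import sqlite3

def is_valid_select(conn: sqlite3.Connection, sql: str) -> bool:
    """Check that a (possibly LLM-generated) query parses and plans against
    the live schema, without executing it. This is a guard, not a proof of
    correctness: a query can be valid SQL and still answer the wrong question."""
    if not sql.lstrip().lower().startswith("select"):
        return False  # only let read-only statements through
    try:
        conn.execute("EXPLAIN QUERY PLAN " + sql)  # plans but doesn't run it
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")

# Imagine these strings came back from a text-to-SQL model (stubbed here):
print(is_valid_select(conn, "SELECT SUM(total) FROM orders"))  # True
print(is_valid_select(conn, "SELECT SUM(total) FROM ordrs"))   # False: bad table
print(is_valid_select(conn, "DROP TABLE orders"))              # False: not a SELECT
```

This catches the mechanical failures (typo'd tables, non-read statements); whether the valid query actually matches the user's intent is the harder problem Andy is pointing at.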
Speaker 2 (01:08:03):
Yeah, absolutely. And I think this also comes back to the differences between the operational, transactional workloads, the analytical workloads, and the search workloads. Ultimately it comes down to smart people designing their applications the right way, rather than assuming that AI is going to sort of, I don't want to say idiot-proof, but make everything really accessible to, let's say, just
(01:08:25):
natural language business users who don't know how to write SQL. I think, like you said, that's been a challenge for a long time.
Speaker 3 (01:08:32):
But go beyond natural language, though. If you're taking English, or whatever natural language you want, and going to SQL: why not go from whatever query language you have now to SQL, or from SQL to whatever query language you want? So you can think of these things as actually being this integrated bridge that allows for portability in a way that we
(01:08:52):
didn't have before.
Now the challenge, of course, is going to be that there are nuances and semantics of certain operations that make sense in one data system versus another, and it's very hard to get off Oracle because the SQL syntax is just a little bit too different than CodeCast and other things. But you can see LLMs as reducing the barrier to switching off and
(01:09:12):
changing things around. So that's another direction I think is interesting as well.
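The mechanical end of that migration spectrum can be done with plain rewrite rules. This toy sketch (two real Oracle-to-portable mappings, but in no way a migration tool) shows the easy kind of translation; the pitch for LLMs is the long tail of semantic differences that rules like these can't reach:

```python
import re

# Toy dialect translation: map a couple of Oracle-isms to their portable
# equivalents. Real migrations need a proper SQL parser.
REWRITES = [
    (re.compile(r"\bNVL\s*\(", re.IGNORECASE), "COALESCE("),
    (re.compile(r"\bSYSDATE\b", re.IGNORECASE), "CURRENT_TIMESTAMP"),
]

def oracle_to_standard(sql: str) -> str:
    for pattern, replacement in REWRITES:
        sql = pattern.sub(replacement, sql)
    return sql

print(oracle_to_standard("SELECT NVL(total, 0), SYSDATE FROM orders"))
# SELECT COALESCE(total, 0), CURRENT_TIMESTAMP FROM orders
```

Rules like these break on anything context-dependent (string-to-date semantics, empty-string-equals-NULL behavior in Oracle), which is exactly where a model that has seen both dialects could lower the switching barrier.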
Speaker 2 (01:09:18):
Yeah, absolutely. I've seen migration tools that are doing a really good job of the SQL conversion, and that is a bit more of a, you could call it, predictable workload. Andy, I really want to thank you for being on this podcast. You're always very generous with your insights. I do encourage everyone to go follow along with Andy's work.
(01:09:40):
Block out some time, maybe three or four months, to take his database class if you're really adventurous and willing to do the work. It's really fun stuff, even to just go through the lectures; I think even if you just do that, there's so much to learn. Andy, where should people follow along with you? It seems like social media. Where are you most active these
Speaker 3 (01:10:00):
days? I try to get off Twitter. I do have an account there, but we post most of the news stuff to Bluesky now. Okay, great. But there's the YouTube channel. Everything we do is always public; everything's on YouTube. The course next semester will be a special topics course on query optimization, and the lectures will be on YouTube. We don't do any advertising; it's the same:
(01:10:22):
hey, here's the course, and people find it. Excellent.
Speaker 2 (01:10:26):
You can follow Andy on Bluesky. We'll have his Twitter, or X, handle there as well, and a link out to his YouTube course. Andy Pavlo, thank you so much for joining this episode of What's New in Data, and thank you to the listeners for tuning in.
Hey, John, thanks for having me, it's always fun.
Speaker 3 (01:10:43):
Bye.