All Episodes

December 1, 2024 59 mins
Summary
In this episode of the Data Engineering Podcast Sam Kleinman talks about the pivotal role of databases in software engineering. Sam shares his journey into the world of data and discusses the complexities of database selection, highlighting the trade-offs between different database architectures and how these choices affect system design, query performance, and the need for ETL processes. He emphasizes the importance of understanding specific requirements to choose the right database engine and warns against over-engineering solutions that can lead to increased complexity. Sam also touches on the tendency of engineers to move logic to the application layer due to skepticism about database longevity and advises teams to leverage database capabilities instead. Finally, he identifies a significant gap in data management tooling: the lack of easy-to-use testing tools for database interactions, highlighting the need for better testing paradigms to ensure reliability and reduce bugs in data-driven applications.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • It’s 2024, why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. 
  • Your host is Tobias Macey and today I'm interviewing Sam Kleinman about database tradeoffs across operating environments and axes of scale
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • The database engine you use has a substantial impact on how you architect your overall system. When starting a greenfield project, what do you see as the most important factor to consider when selecting a database?
  • points of friction introduced by database capabilities
  • embedded databases (e.g. SQLite, DuckDB, LanceDB), when to use and when do they become a bottleneck
  • single-node database engines (e.g. Postgres, MySQL), when are they legitimately a problem
  • distributed databases (e.g. CockroachDB, PlanetScale, MongoDB)
  • polyglot storage vs. general-purpose/multimodal databases
  • federated queries, benefits and limitations 
    • ease of integration vs. variability of performance and access control

Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from
Mark as Played
Transcript

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
It's 2024.
Why are we still doing data migrations by hand?
Teams spend months, sometimes years, manually converting queries and validating data, burning resources and crushing morale.
DataFold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

(00:41):
And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year long migration into weeks? Visit data engineering podcast.com/datafold
today to learn how DataFold can automate your migration and ensure source to target parity. Your host is Tobias Macy, and today I'm interviewing Sam Kleinman about database trade offs across operating environments and axes of scale. So, Sam, can you start by introducing yourself? Yeah. Sure. I

(01:08):
am a
software engineer with
a eclectic background, I guess.
I, you know, like, I I live in Brooklyn, and I've I I worked for a I spent a long time
working on
MongoDB in a couple of different
roles and then have, like, spent some

(01:29):
time working
on other
other engines,
Neon and and and also a startup called ClearAD. And I think
I think databases are
or database engines are are sort of
the I don't know, have always sort of been the, like, the biggest

(01:51):
lever in the in sort of that you can that you can jump on to, like, impact the way that software engineers
are able to
be productive and build cool things. And so it's,
it's always been a really exciting it's weird to say, oh, databases are exciting, but but they they they have always been an exciting
place to to, like, do work and see and, like, have an impact on

(02:16):
our industry. So that's that's sort
of how I got here,
belatedly.
And do you remember how you first got started working in data?
Oh, yeah. No. I, like, I I have a
a liberal arts
undergraduate degree in social sciences, basically, and and interdisciplinary

(02:37):
stuff that I, and I
had a dumb job,
like, in college, like, filing
paper for the, fundraising
department
in, like, file cabinets,
which I can't imagine anyone ever used. But, you know, it's a sort
of

(02:58):
economic endeavor, and I
I would listen to podcasts, which is the thing. It's ironic here we are on a podcast, but I'm, like, not a I don't listen to a lot of podcasts
since then, but I listened to something and,
where they they interviewed some people who are working on
me as
who's sort of in the the last

(03:19):
data
tooling
boom. And I, you know, listened to people talk about a lot of a lot of these databases that were, like, it was the sort of peak of NoSQL
sort of or the
heading towards the peak of NoSQL's moment. And I and I was like, oh, this data thing is, like, real big and

(03:40):
gonna
you know, like, it it sort of seems you know, and I had been interested in open source. And and I I had always been a nerd, but somehow
as a as a younger young person,
I was very resistant into
in in, like, taking taking

(04:01):
CS classes or whatever. And so I I never really studied it, but I definitely
got I got interested in in learning about the technology and thought this that there's, like, a lot of
cool things to be done,
around
infrastructure. And so I I think, like, I can't you know, it's sort of weird now if you say infrastructure, it means you're a sysadmin, and I'm not exactly a sysadmin.

(04:25):
But I think infrastructure in this, like, abstract sense is really cool because of what it enables and the sort of power
the ways that infrastructure lets you focus on the important things of your problem. And databases mean that you're not
spending lots and lots of time
sort of
figuring out how to

(04:46):
absolutely, like, indelibly save data or
make sure that you, like, recall like, there are these problems that databases just, like, solve for you. So as a programmer, you don't have to sort of wonder if you've, like, written the search algorithm correctly. Right? You just
select whatever whatever you want.
And and I think that that that that there's, like, many levels of that. And I think we're actually, like I mean, we've been having

(05:12):
relational databases
for
however, you know, 4 years. And we are, like, not yet to the point
where the, like,
the abstraction between
application code and
and state persistence and and retrieval
code or the database engine. Like, that boundary is not exactly where it should be, and it's, like, interesting to watch it sort of settle over time, like, over the, like, long time.

(05:40):
But
yeah. So all of those things, like, have have sort of you know? And then I you know? And then you get a job and you keep working at it. And I and I've
and then your next job looks like your previous job, and suddenly you're, you know,
suddenly suddenly you have a thing,

(06:01):
somehow. I think that's a, like, an accident that many of us kind of stumble into.
And on the point of database engines,
they are
the system of record for whatever application or service or system you're building
where
database
is a very broad term where it can be anything from I happen to have written a file to disk and then I am reading it back to

(06:29):
some of these deconstructed database engines where you have the lakehouse architecture that's emerging where you have a query engine, you have some object storage, you're off live somewhere else,
and everything in between across different modalities of data types and data formats,
vectors have become increasingly popular as a data type because of the AI growth.

(06:52):
Yeah. Well and and there's, like, there's, like
indexing is sort of a classic. Like, I I kind of actually think of, like, vet there's a lot of vector search problems, which are, like, subclasses of indexing problems or sometimes they can be. There's also,
like, database is an abstraction.
I think I think that's the, like, the thing for for doing this, like, save access and retrieval

(07:17):
piece.
You said something that I thought was great, just like
deconstructed database. I think the there's this
sort of orthogonally to all of this, what is a database, right? Like,
there's also these questions about how we build applications
and
that shape how you need to save data. So you have

(07:40):
you know, there are very architectural
decisions which influence the shape of the data that you have. There are deployment considerations which shape how you how you deploy things.
There's, like, patterns or, like, workload
properties that constrain how you manage the data that you have.
I I think the truly revolutionary thing that's, like, happened

(08:04):
in
the last 15 years in computing, I think, is that that, like, s 3
has become
reliable
in a way that's kind
of alarming,
cheap in a way that's also alarming,
and
and ubiquitous.
And I think that and the cheapness makes it ubiquitous, but but that

(08:28):
that means that we
it used to be the databases were like, how do you attach enough disk to a machine that has enough RAM to do the thing that you need it to do?
And and the, like, the size of the disk and the you know, like, that was the sort of thing that mattered. And now, like, you just put it on s 3. You can, like and, like, network

(08:48):
having storage that is connected over Ethernet rather than, like,
over, like, some sort of, you know, PCI bus or whatever, SCSI bus, is is like that that shift changes how all the rest of these systems work. And I think that and it lets you it lets us as database makers, engine makers,

(09:10):
separate
storage from compute and have them vary separately. And that's real like, that's really powerful.
And it gives rise to, like, some of these, like, serverless things
that are that are cool.
And I think, again, we haven't quite seen all of it. Like, there's gonna be, like
s 3, as good as it is, is gonna get better along

(09:34):
a number of dimensions.
And and I think we will,
we'll start to see even even cooler things there.
And I think on the s three point, there have been a few developments over the past
5 or so years that have also increased its utility,
most notably the read after write consistency that didn't exist for a while where you would make a write, and then depending on where you are reading from, it may or may not exist yet. Now they've got guaranteed read after rate consistency.

(10:06):
And most recently, they've added a
optional update
pattern where you can say only update this file if it is actually different than the one that I'm trying to upload, which can add a lot of efficiencies as well as far as IO for conditional updates or where you're not actually changing any of the data.
Yeah. Yeah. Yeah. No. It's I even think the, like,

(10:29):
non
the, like, in the inconsistent read after write version of s 3 is, like, is,
like, so
good
in a lot of ways that that you
or that I like I don't even think about all these other things as I'm like, yes. This is great, of course. But but I think there's they have also been adding,

(10:50):
like,
faster s 3
tiers, which are more expensive, but, like, gives you a level of consistency. And then you can do all these, like, crazy tiering. But but regardless of all of that, it means that you can, like,
allocate your compute
very elastically
and and not have to worry about

(11:12):
stateful,
like, not have to worry about managing disks. Right? And and it's actually like when people are like, why do you need a new database now? Aren't the things that we have good enough? And the and, like, the things that like, the these, like, old
industrial
battle tested pieces of software
are are really

(11:32):
good.
And, like, it takes a long time to make a database kind of robust in that way just because, like, you have to bang on it from all sides. But the
but the but they're all they all assume local storage in this way that,
like,
you can sort of rip out the bottom and put something new in, but it doesn't it, like the the transition

(11:55):
from,
like, colocated
storage and compute to separated storage and compute is, like, a really huge change in in, like,
in the software. And and sort of get getting us to that point, I think, is is really,
is gonna be is, like, is an interesting,
like, technological

(12:17):
shift,
that that, like, that all that people writing software today are just just beginning to sort of figure out how to do. And there are a lot of legacy tools which are sort of, like, sort of shoehorning themselves into being
serverless
in a way that is, like,
absolutely

(12:37):
not gonna, like,
fly,
from a,
I mean, frankly, from an accounting perspective, let alone, like, technology.
But but I think it's it's this is another thing that's super interesting to watch.
And one of the points that you made early on is the impact
on the system design that the database has,

(13:00):
and
I think that that's definitely a very real consideration
for anybody who is trying to design or evolve
any sort of service, whether it's a
consumer facing transactional application
or a
analytical workload or machine learning system.
And I'm wondering if you can give some of the

(13:24):
main points of consideration
that engineers and teams need to be thinking about as they're starting to determine what is their persistence mechanism for the system that they're building
and how that shapes the capabilities
of that overall system.
Yeah.
The
it's sort of I I think that I I often kind of, like, think about backing into this in a different in a different way.

(13:52):
Or, like, the the answer to the the way so we'll get I think we'll get there, but it won't feel like it for a little bit. So I think, like, the thing that you end up wanting to do
is to think about
the queries that you're running and think about things like,
how does the data get updated?

(14:12):
Is this aggregating data from a lot of different places, or is this
data that we're writing and reading?
Does
data get modified
after after it's created or not? There are some kinds of data where that's
true or not.
There's also a
a expense question. Like, how much money do you have to spend on

(14:35):
on engine on the, like,
on the database?
And then and then there's this, like
this sort of hints at, like, the the analytics question, but, like, are you returning
records in some kind of what like, business object
record of
that is, like, persisted in a database, persisted in the database, and modified in in the application? Or are you

(15:01):
returning aggregates or computed values that sort of summarize
larger parts of the dataset? And and the thing
or sort of so, like, once you answer those questions, once you have answers to to, like,
what kinds of queries you're doing, which really kinds of depends on
like, there's no wrong answer to any of these questions. There's just

(15:24):
you you sort of,
like,
you have to do the thing where you write all of your business objects on index cards and lay them out in front of you and see how they interact with with,
each other and and where the you know, what what's what needs to where things need to go. But you, that's a very old school way of thinking about this, I guess. But but you you once you sort of have that, like, what are the components, what are the objects in the system, and how do they interact? You can sort of and then you have to figure out how you need to save them and and

(15:58):
or where where the objects exist in existing systems if they're, like, accessed over over APIs. And then
so you
you have so you have to, like
once you have where all of the data is
and have all of
the
the the, like, components and the objects,

(16:21):
you
you start to sort of
think about how you're gonna use it in the in in whatever thing that you're building.
And there are lots of different tools which do,
you know, will do sort of, like,
one thing really well and a and a a few things

(16:43):
less, like, fine, but not great. And the and the challenge is, like, always, like, picking the sort of the best tool
for what you're doing and what you're planning to do based on
how and where you're going. And I think that, like, the thing that that I
the thing the sort of, like, sort of key things are, like,

(17:07):
right throughput
is sort
of hard to
like, that's like there are some I guess, like, maybe maybe actually the way to think about this is, like, what do you need to think about first?
And or, like, what are the things which you can't compromise on? And so there is this, like like, write throughput is really like, if you pick a system that has slow writes,

(17:30):
it's really hard to make it fast.
If you
have a system which doesn't really support efficient
updates,
then you can't like, and your workload is very update heavy, then
you really like,
you're in trouble.
And
and and and then you sort of, like, go there are, like

(17:52):
you know? So, like in some ways, like, insert throughput, update throughput, and then
various kinds of, like, can you do recall at the level that you need? You can make there's, like there are ways of making every
query fast. If you're if you're, like, if you're determined, you can make a database give you a query faster usually,

(18:15):
But but there are it's hard to make it write faster if if you're sort of, like, teetering on the edge of its, like, tolerances.
But the
but I think one of the sort of big,
big one of these, like, big sort of top level sort of properties of engines is, like, is the data stored in a in a, like, row oriented

(18:36):
format or a column oriented format? And I heard and I, you know, as a
early career
pre pre working on database engines, like, heard these people say columnar databases. And it took me a, like, really long time to sort of make sense of, like, how and why this was important. But, basically,
the idea is that if you so row row oriented is very sort

(19:00):
of reasonable, and you can think about it. You have records or rows.
Everyone sort of thinks about this now. It's like an Excel
spreadsheet, but you know what a row is and all of the all of the columns,
all of the fields,
positions in the row are things that you, like, are related to each other. And then columns, right, are sort of all of the all of the different values for each of your records

(19:25):
are
stored
sequentially. And so you will till you have, like, every you know, if you have an ID
and
and
a, you know, like, a a name and a balance and a last updated time, you would store the names and then all of the

(19:45):
balances and all of the last updated times. And when you do that, you can there are, like, ways to compress
it more quickly, and there are ways
of computing averages
very quickly because you you can just sort of, like, shove,
shove the the data through through the
process or really, like, there's, like, very little copying that goes on. It's it's it's straightforward

(20:09):
and and quite fast. And this is how analytics engines,
like, make things fast is they just write the data out in a way that makes the query that they're gonna do
quicker.
And so knowing how these things
these things work is
like, really
lets you answer the questions of, like, what kind of engine do we need? Because

(20:32):
we're gonna be computing averages, and so we need a columnar format. Or we're computing
rows, and so we need or we're, like, recalling
one row at a time or a collection of rows or a a filtering of a section of rows. If you say have,
I mean, this is sort of obvious, but if you're just, like, returning records

(20:54):
again and again, you
like, columnar formats may be very quick for answering some kinds of
queries, but they're quite slow at returning rows. That's obvious. Right? But,
sounds dumb to say it. But but
that, like, either either you store things as rows or or you store them as
as columns are is sort of one of these, like, big choices in in databases. And it impacts things like write throughput and update performance if the data isn't if you're updating a whole row and they're in columns, you have to jump around a lot to get the update to work.

(21:31):
So that's that is, like, one of these, like, fundamental decisions that you make.
That dichotomy
of the row versus
column oriented
record storage
is
the fundamental divide that has introduced
the practice of
extract transform load or extract load transform whichever sequence you want to do it in because of the fact that if you have row oriented data, it's easy to insert an update,

(22:00):
but it's slower to do aggregates. And if you're trying to do aggregates across your transactional database, it's going to negatively impact the application.
So you move all the data to a different system to do all the analytics,
which increases a lot of cost and complexity. You're duplicating the data,
so you have to duplicate the storage cost.

(22:20):
And
so that's definitely one of the fundamental aspects of how do you think about the system where
are you primarily doing analytics, or are you primarily
doing
single record
events or updates, or how how what what do you need to be able to retrieve?
Yeah.
I and I but and I think that's like a I I kind of think of this like the ETL probe problem is

(22:47):
I sort of I I kind of feel like the ETL is kind of this, like, problem of our own creation.
The the there is this to to sort of
that that thing that you
it it I mean, this is this is like,
this is this is a sort of interesting thing. But we have there is this sort of to like, there is this, like, historical myth that you can, like, have one

(23:12):
database, and you'll just, like it will be huge, and all of your
code will access the same database. And and, you know, and this is, like,
a sort of this is the business model of, like, general purpose database vendors is that you're just gonna, like, you're just gonna make one giant machine or a couple of giant machines, and, you know, it's gonna run DB 2 or Oracle or one of these, like like,

(23:38):
behemoths,
and
and it's gonna solve all of your problems with data. And
and this is, like,
basically always been a myth
for the for literally this, like, this thing that you just said where
if you run analytics queries against your transactional workload,

(23:58):
then
the workloads conflict with each other, and they make each other slow. Right? And the mechanism for this is, like, quite simple. You have you have more data than RAM. The data isn't in you know, the data working set
for your instance at at
any given time isn't in memory.
But

(24:20):
if you're really smart
or clever, you can keep the working set for a given
workload
in memory, and you'd sort of optimize the workload in the application so that
for the most part, you're not modify
you're not going to disk
as much as you can. And transactional systems use, like, indexes for this. Right? And indexes are kind of like real time ETLs

(24:48):
of of your of your system,
in a lot of ways. There are these there are things like, materialized
views, which
are, again, the same kind of thing.
And they, like, put the data in a different form
and keep it up to date sort
of sequentially,

(25:09):
or, like like, strictly, like, in a consistent sort of way. But but but
those those also sort of solve this, like, how like, ways that we can kind of keep the data sort of positioned in a way that makes the queries that you're gonna run stay in memory, not have to sort of bounce
bounce to random IO on on disks a lot. And then and then you're sort

(25:33):
of good, but if you have more than what if you have too much of this going on, then the workloads thrash a limited set of memory. And then and then you're into this, like, okay. We have to ETL and copy it over. You can also like, there are
you can also imagine having
having
diff like, using
read replicas

(25:54):
to satisfy some of that some of those workloads.
You can
imagine
having,
you know, like, having ways of persisting data
that
don't involve updates very much. Right? There are these, like, event driven systems,
and so you will, like, materialize
out

(26:14):
the row data,
you know, on some interval. And maybe that's, like, enough. Like, you don't like, having
this, like, having the objects be sort of
strictly
read from memory and, like, material like, in memory
as, like, a row and record is, like, maybe not always the thing that that we

(26:35):
we need. And so there is like, you can write applications
that do both, or you can have some of the like, you can use
materializations
or or other sort of
tools within a system
along with replication and and all of this to get kind of,

(26:55):
to get the the
the sort of query performance that you need for a given workload. Right? Because everyone some things like, you're always doing both.
Right? There's not a no. Look. This is the application where we save numbers, but we never average them. Right? Like, no. I'm like, okay. What are you doing? Like, you know, that's a that's not a that's not a thing. Stop it. Right?

(27:17):
There are,
you know and and and you say this other there is this other thing of this, like, assumption that
that, like, ETL ing is
is, like, all like, has this this, like, flaw of
of, like, your data's stale and you're saving another copy. But if you're doing like, if you're on some kind of event driven,
like, message passing,

(27:40):
like, persistent like, using persistence via like, with messages and records and events,
then and not, like, objects that you're you're updating in place, then you which, like, that's a thing we do. Like, that's sort of, like,
pretty good way of doing things in in a lot of cases.
Then
the thing that you're ETL ing is archives, which you need to do anyway. And so so, like

(28:04):
and and it's not even it's not even an interesting ETL at that point. Right? You're just, like like,
streaming data out to a file. Right? Like, it's and it'll so so so
all of this all of this to say, like,
there are a lot of tools. There are a lot of these, like, patterns that we use.
And I think that we tend to sort of

(28:27):
default to using databases
with in these, like, very sort of specific ways with, like, all of these, like, very you know, like, I'm I'm, like, kind of, like, picking apart all of your assumptions and all of your, like,
the all of these things that you're saying. But you're also sort of saying exactly, like, the thing
that is is sort of like, the the the common

(28:50):
way that we think and talk about
about these, like, solving these data persistence problems.
Like, of course, we use ETL or ELT or
TL I don't know. Whatever. It doesn't like, the the the this is, you know, like,
processing your data and writing it, you know, like, that's it's called a query. Like, you know, it's somewhat like like like like, ETL should, like, be just like, I I've I've always sort of you know, like, every so often, I, like, stumble across an ETL flow. I'm like, this is you're just running a query in a CronJob.

(29:23):
Like, I I get it. I like it. I like CronJobs
a lot, but
but I think that I think that we we often, you know, and we often sort of fall into,
like, leading. And this is
this is
this is a,
like, common

(29:44):
engineering.
Engineers do this a lot, but you sort of start with the tools and work backwards to the business problem. And I think they'll, like, actually,
you
know, they're it take you know, like, you need the sort of grumpy,
like, architect or principal engineer type to sort of say, like, okay.
Like, let's back up and, like, go go in the other direction and sort of say, like like, ask why a few more times. Like, why are why are we doing this, and what what is the thing that we're trying to accomplish? Not, you know, like, do we need to archive? Do we need to get this this one query really fast or this other query?

(30:23):
Or we need write throughput or or whatever the whatever the issue is. So,
yeah, I
I think it's I think it's I think it's these are, like, really important questions, and it and it impacts, like, how we how we
what tool like like, if you you you get to the same like, you still build a thing that works at the end of the day.

(30:44):
But the
the sort of
assumptions that you need a thing of a certain shape is is
you know, leads leads us into some weird contortions,
I think, actually.
That's definitely
true.
And
I think another interesting angle to it is

(31:07):
starting from that business case of I need to solve this problem.
So given the requirements constraints of the problem space, I will then choose this, this, and this solution
is that
businesses
and the context in which they're operating are constantly evolving.
So what is the right solution for a business at a certain point in time

(31:32):
will not necessarily
be the right solution for all time, And so you get into those situations
of iterative development, iterative architecture evolution.
And in a lot of cases, the
database engine that you selected because you thought that it was going to solve all the problems that you had,
and maybe it did for a time,
then becomes a liability

(31:53):
because
the assumptions
of the business are no longer true.
So you need to be able to evolve to accommodate
the new realities of what you're trying to solve for.
And
I'm wondering, what are some of the useful
ways for teams to think about,
1, future proofing, recognizing that that is never the reality? You can never actually future proof, but

(32:20):
planning for
evolution of the architecture
and some of the ways that teams, when they do hit that point of realization
that the engine that they have selected
is no longer viable and is now a liability,
how can they
then make that transition to a new system
as painless as possible?

(32:41):
Yeah. I mean, I think you sort of you, like, you you got it in the, like, you can't future proof anything,
point. So, now I have to think of something cleverer
to say
yeah. So I think that
I think that
what you
I mean, in some ways, like,
don't

(33:02):
try is sort of the like, the answer is, like, you know, I sort of I obliquely referenced this the, like, the, like, x y problem of sort of jumping to a solution
too soon. And I think that, like, the the sort of the thing that you can do
to sort of be the one who says time out time out, folks. Let's let's, like, let's ask a few more questions is to all to think that, like, okay. I'm an engineer.

(33:29):
I
should I should always ask one more question and not or, like, I should always sort of state my assumptions.
And and often,
like, that is the that is the tool that you use to future proof what you're doing.
I think the so that's, like, that's that's a nod in the you know, that's, like, too too sort of hand wavy because people will, like,

(33:51):
you know, like,
engineers will almost always
think that they're doing that and not quite do it. So, like, asking
asking more questions and, like, starting,
like, I assume starting sort of, like, every ticket with every every Jira issue, every document that you write, every meeting with, like, these

(34:12):
are the things that I'm assuming. And then and then you sort of do
do design work and architecture thinking, like, based on that requirements gathering based on on those assumptions. The the sort of
not
the the sort of
the the second so that's like that's a whole that's like a whole kind of we could spend there's like a whole podcast

(34:34):
series that we could have on that question. There's the the sort of on the other side of of it is this, like, realization that engineers
have this
tendency and desire to over engineer things and to know that sort of this, like, planning for future scaling
is

(34:56):
is is always this, like,
over engineering, premature optimization
kind of
path
and and to sort of,
in some ways, just, like, spend less time
thinking about it because
you're always going to have to do it again in the future,
and you should wait for it to be a problem before you start optimizing it

(35:21):
or or fixing that kind of thing. And I
and that's that's, like, okay. That's, like, really easy to say.
But I think, like,
having
having
an architecture that makes sense for the team that you have
that, like, keeps, like, the assumptions, like, well well declared,

(35:42):
uses
standard
tools that that do scale in some way. So,
right, like,
if you
there there is this, like, example. Right? If you have if you're, like, writing all of your data manipulation
using
pandas or polars, you have basically, you have your queries written in Python.

(36:04):
And if you then need to take your 4 gigabyte dataset or your
32 gigabyte data set or what you know, or
or, you know, small and terabytes
of data that you that have been running okay
with pandas, and then suddenly, like like, multiply it by a 100,

(36:24):
you have to redo it. Like like, there is just like that's not you know, there's no way around that. You have to redo it. Fine.
That's fine.
If it were SQL,
you may have to redo it too, but there's a chance that, like, the SQL will, like, translate
better across
scales

(36:46):
because SQL's declarative, and it sort of says, this is the thing I want.
You,
mister database,
missus database,
go forth and give it to me, and I don't care what you do on the inside. And you can sort of load the data into whatever tool you need to and see what happens in terms of performance. And so so that that, like, that sort of using

(37:09):
using
tools in that way, like, maybe is, like, the right answer, which isn't to say that, like, you shouldn't use
pandas,
because, like, a lot of times you're never gonna do that. And
and and there are you know, and sometimes the redoing it is
less

(37:30):
less intense in one way or another.
Or you or
or you have to, like, do some kind of or you have to there's a group by very early in your execution, and you just there's no way to to have
that query
execute in parallel anyway. And so

(37:50):
your scaling options are are limited. Right? So the so it's, like, very kind of use case specific about, like, which which
tools you use to kind of
to to handle handle that, which is why, like, all the things we said before about
about knowing what kinds of queries you're gonna be doing and and how, like, how that relates to where the data is is absolutely crucial.

(38:13):
Going
and then
you,
you know, you sort of hit this scaling wall or you watch the graph kind of get close, and then you and then you sort of say, okay. Now we have to, like we
we deferred doing this earlier,
but now
we have the SQLite database, which is
it's 2 terabytes.

(38:34):
And,
you know, we need we
really need to, like, have it not be,
like, on the application server because it makes deploys take too long.
Fine. Right. You know, like, there there I like, a year ago, I had
had coffee with a former coworker who

(38:55):
who told you know, described the architecture of a component of the of their of the system they were working on now. I was like, oh, yeah. It's just like it's like an EC 2 instance with with a really big SQLite database. And, like,
it's fine.
The you know? And and I was like
and I was like, but really? And and, you know, if you think about it, like, sometimes,

(39:18):
you know, having a having the app and the database in the same place, like, you there was always gonna be
one one thing anyway, and there are backups no matter what it is. And so you you end you end up with these situations where, like, yeah. Actually, like, people probably should use SQLite for way more than they're currently doing with it. And and then having

(39:40):
and then having have there are ways to, like, architect that in a really sort of sane way.
And then, you know, and then
there are
sort of you can kind of take that in in lots of different directions and
having,
you know, having ways of testing it locally. Like like the the oh, this is just a SQLite database or whatever. It's actually, like, gets you, like, really far, you know, in a distressingly

(40:07):
really far way. But if the, like, thing if but but, you know, like and then
maybe that's fine. And, like, you can
later, you know, move it to Postgres
or MySQL or whatever you want.
And and that, like,
that gets you that gets you a lot. And but but and some of this is, like, what tools are you comfortable with? And and some tie and there are definitely database

(40:33):
tools, right, like,
that do
scale up pretty well along
any number of dimensions.
Though
though it is really it is also true. So, so, like, the the databases at scale, like, there is this there are these, like, class of

(40:54):
of databases which have sharding built into the database,
Cockroach,
MongoDB. I mean, like, obviously, like, I know
things about the MongoDB
world pretty pretty well. But there are there are, Couch
Couchbase
does this. There are there are, like,
Cassandra's and other sort of

(41:16):
native, like, sharding inside of the database
tools, and they
and they're really like, that's, like,
next level stuff. It's also
absolutely true that
in practice, if you're gonna use sharding, eventually, you sort of have to, like,
know that you're potentially gonna be at that scale and, like, spend an hour making sure that you haven't sort of tied yourself

(41:43):
to some kind of railroad track eventually. Right? Like, there is this, like, you kind of need to, like, plan how your data is is, like, laid out in order to to make that work. So
it it is the the answer, you know, the answer is sort of like don't overengineer
anything.
Upgrade the scale when you need to.

(42:05):
Use be use the abstractions
to your advantage,
and also sort of gut check. Like,
it's not possible for this dataset to get larger than 2 terabytes.
Okay.
Like or this is a dataset where we're, like, modifying things all of the time, but it's small. That leads to a different set of a different set of challenges.

(42:28):
This is an append only data dataset.
Great. You know, there are different
different
different like, you
if you start at a certain place, you, like, get to
get to an end more more or less quickly.
And and I think
scaling
is always

(42:50):
or scaling is a thing that application
developers
who know things about who, like, once you know things about how databases work, application developers are really the people in the best
position to sort of
in coordination with the product managers that one hopes they have

(43:10):
to sort of make these make these kinds of decisions,
when building when building apps. But it's but you but it's really sort of a thing that, like, works its way all the way up and down
all the way up and down the stack.
Another interesting
outgrowth
of the complexity
of selection of a database engine

(43:32):
can have
is that the engineers who are building on top of that engine
become
skeptical of
the longevity
of that system, and so they will
refrain from actually taking full advantage of the capabilities of the database
and instead move more of that functionality and logic into the application layer.

(43:53):
So the application becomes more complex and heavier weight and harder to maintain,
and the database
becomes effectively just dumb storage.
And so I think that that is definitely
a a risk that exists as well. And I'm wondering how you have seen teams try to
manage that balance or some of the ways that you think about

(44:14):
how and when to actually invest in the capabilities of the database engine to
improve or enhance some functionality
or manage some of the actual application logic versus just being this dumb storage system?
Yeah. I mean, I think
it's I mean, the things like, I think this is, like, a class of, like, over know, like, oh, we're gonna put an abstraction layer in front of x thing because we might change it later. You're never gonna do it. Right? Like, sorry. Like like, that's not a that's not a thing.

(44:47):
Like, no one so there there are a couple of, like, fallacies, I think, in in in the way that people think about databases and data selection. The first is that someone that there's ever a day when anyone in the world ever wakes up and is like, you know what? Today, I need to decide if we're gonna use MongoDB or Postgres, or we're gonna use MySQL or Postgres. Right? That decision has always already been made

(45:10):
by someone else. Right? You know, unless you are, like, literally, like, starting out, like, that's we just don't make that decision
at a at a low level. And the second thing is we don't change it. You know, there are occasionally,
there are people who will do there are systems where you do need to migrate
from one to another

(45:31):
for
some reason,
But
but that's incredibly
rare, and you should and sort of at the point when you need to do that,
you have users and money.
And so so it doesn't so so you can afford to do it. Right? If you don't have users and money, it doesn't matter

(45:52):
because
your data whatever it is, it all fits in RAM and you like like like, you're not making money anyway, so who's gonna pay you to, like, change databases? Like, that's not that's not what we're doing. Unless it's for fun, in which case, you know, go wild.
But
but we don't
we don't have that these kinds of, like, migrations happen very often. So these abstractions like, you should have good tooling around your database that, like, lets you do it. But that that lets you sort of, like,

(46:19):
handle like,
have some abstraction between the database and and
and your code. Right? You, like, you shouldn't be, like, I don't know, like,
writing your own database driver inside of your inside you know, in in your model's package or whatever. But but but but you also
you don't need to sort of build your own abstraction on top of

(46:42):
on top of your database and do this this thing where you put the the all of your business logic in your app that could easily be pushed into the database. It would be faster and cheaper if you did it that way. Again, like, pushing things into the database is like writing SQL. And, you know, if you take a SQL
statement and you're like, okay. Now we have to take this SQLite and make it run on Postgres 1. That that way should be pretty easy. And 2,

(47:08):
like,
it's it's you know, if what you have to do is go through and find a bunch of functions that don't exist or or and write them a different way, like, that's that's a much less
onerous kind of thing. The
I mean, even the real sort of the thing that I think like, you you sort of have this you the question you posed was, like, do you, like, take advantage of the database?

(47:33):
I think the, like, the risk that people like, say, in some ways,
like, it's not just like, taking advantage of the database is, like, a relatively safe thing because people don't change.
The thing that, like, I think is risky and people are less afraid of doing
for the wrong reasons are are the are systems that have
user defined functions,

(47:54):
which means that you've sort of pushed some logic
custom logic into the database, which has to sort of run and it's difficult to version it correctly. Right? That's kind of terrifying
and hard to migrate. There's also,
and and various things like triggers and stored procedures and all that, like, have the same same property. There's also if you're using an extension,

(48:16):
you get,
you know, extend so for systems like Postgres where you just, like, hot load code into into the database, it becomes very version dependent. And using that functionality
is, like,
actually sort
of that. There is some like, I I I get
that gives me the willies.
But,

(48:37):
but I think, like,
I think that I think that all mostly, like,
teams
I think mostly teams should should just not be as worried about it and and think of think of the these, like, adding extra indirection between the database and the application is really, like,
is really a is really sort of a premature optimization.

(48:59):
It's also true
having said all of that, it's also true that sometimes you do actually wanna write application code. And this was, like, one of the things that, like,
the early NoSQL
or, like, early mid NoSQL
era really kind of, like, leaned into is that
you just you know, sometimes

(49:21):
the clearest way to express something was in fact,
like, writing a 4 loop. And
you you you can't do it, like, all of the time, and it's not in a universal rule. But but sometimes
that's what you want. Like, you just want a for loop. And having some application code,
like, there are times when

(49:41):
when you you when it's fine.
But but, I mean, this is like, we're we're sort of the all of my answers to this question are, like, you know, classic engineer. It depends. And and also, like,
have a have a really good understanding
of
the problems that you're trying to solve and

(50:02):
the size of data at any given point in a pipeline.
And
and try and also sort of, like, try and avoid
spending your time solving
problems of your own creation. Right? Like, there are a lot of you know, like, I sort of I think that,

(50:23):
like, teams aren't gonna do this for lots of reasons, but, like, in a lot of cases, ETLing
is
a sort of problem that we've created because we haven't sort of extended engines in the right ways, where we haven't provided,
like, interoperability
in one way or another. And and, like, this is a lot of what Better d's working on is this, like, how do we

(50:44):
provide the same interface, but also
not
acknowledge the the, like, reality of
database this monolith of give all your money to IBM, get d b 2, put all your data in there, go to sleep happy
is
is just not true. And there you know? Or any any

(51:05):
it's very rare that you find an organization that has only 1 database
or only 1 database type. And and I think being
being pragmatic with that is is really,
really sort of crucial,
a really crucial way of of, like,
avoiding

(51:26):
falling into some of these traps.
I think that's probably a pretty good place to start to close out the conversation.
So is there anything that you think is useful as
a closing argument or anything that you wanna say to wrap up that we didn't already touch on?
I think we got through we certainly got through most of it. And there are definitely, like, a lot of, like,

(51:47):
side quests in this conversation,
which I think
bear a little bit more attention
at some point.
I I mean, I think they'll, like, the part the parting
parting shot, parting wisdom is really, like, ask more questions
and acknowledge your ask more questions, acknowledge your assumptions,

(52:10):
avoid premature optimization.
You like,
be aware
and be aware sort of pragmatically
of, like,
what
what's going on, what's possible,
where
you're we're
almost always

(52:30):
data
we have app there's we spend a lot of time talking about applications, but there's, like, on top of there's, like, core application developers and then people working with data generated by applications on top of it. And which are themselves writing
writing application
applications of a different sort. But that all of these

(52:51):
all of these projects really
there's there's, like, this, like,
pragmatic,
like, this this is we don't have control of it. Most of our almost every line of code like, people very rarely just, like, open a text editor and say, today, I'm gonna write a data driven application. Right? That's another this is, like, another one of these, like, things that doesn't happen.

(53:15):
Not you know, don't wake up today and decide which database to use. Don't wake up today with a with a blank application
and no data. Right? There's we're building on top of things. And so being letting making it possible and and
and acknowledging that we're we're gonna be pragmatic. We have to be pragmatic. This is this is,

(53:37):
you know, like, we're not
we're building applications
and tools and software to solve,
hopefully, as much as possible other people's problems. And and,
you know, and I think, like, make it so that
we can make decisions about
businesses, about the world, about our organizations,

(53:58):
about whatever the thing is
based on
data and not vibes.
And if we screw up with
you know, like, if a query is too slow,
people are just gonna make decisions based on vibes. Right? People will always fall back to the, like, vibes
vibes driven,

(54:20):
decision making. And,
you know, people
do okay with their vibes, but we have all these tools.
We have all of this data,
like,
being being
sort of
able to and flexible with with building

(54:40):
building out infrastructure and tools around it is the thing that lets us
make
better decisions, more clear decisions
about
about whatever it is that we're doing. And I think that's, like, a really
that's the thing that it's, like, important to to keep in mind and
and is I know it had and has

(55:01):
been true for a really long time at at,
you know, the scales of 10 years ago are not the scales of today and aren't gonna be the scales of of,
you know,
10 or 20 years from now. And I think that's all of that's okay.
And and our jobs are to sort of
make sure that we don't get

(55:24):
get we don't look back in 10 years and say, oh, we just we just sort of, like,
did that all on 5. It was you know? And I think that's I think that's sort of the the important part. Yeah. And I think to making sure that you actually understand what the
understand what the database that you're selecting is capable of and is optimized for before you decide to invest in it.

(55:49):
Yeah.
Absolutely.
Like, know your tools. Like like, I you know, I you know, as a as a programmer, I, like, have a collection of absolutely absurd
facts in my brain about, like, what various run times do. Like, I you know, like, Python programmers sort of, like, have an intuitive sense of where the GIL is. You know? Like, the

(56:14):
all there are all sorts of all sorts of cases where just, like, knowing your tools
is really good
and really, like, gives you a lot of of power and databases are
kind of the same. Like, it's a it's a runtime. It's a it's a
it I mean, they're they all use the same mostly use the same programming language, but it's all, like, it's all it's all one of those. And so knowing

(56:40):
knowing how all of that works makes you a a better consumer.
And I think that's
that's useful.
Absolutely.
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

(57:07):
This is gonna sound like it's out of left field, but I think
easy
to use
testing tools are really the, like
is the thing that just, like, always feels like it's missing. You you sort of write all of your application code. You can unit test a lot of stuff if you're, like, functional enough. But then you you start talking to a database and you're like, well, okay. Here we go. I hope I I hope I got that right.

(57:35):
And and there's, like, there's nothing. There's like like as great as everything is, there's just, like, absolutely
nothing
that exists where you're like, oh, yeah. I wanna run a I wanna run something that feels like a unit test with
you know, I wanna make some fake data, and it just, like, doesn't exist, and there's no way to do it. And

(57:58):
and I
and so, you know, that's where all the bugs are all the time. Like, you know, you spend if you if you have an app that someone's using
and
all every every outage, every user complain is always something
that isn't
covered by testing. And we have these, like, great tools in other areas

(58:20):
of our code
for for doing testing, and
and, like,
there's some paradigm or syntax or something that just, like, doesn't exist today.
And
and I think,
like, that
that that's it.
Alright. Well, thank you very much for taking the time today to join me and share your thoughts on this overall space of

(58:47):
selection and optimization
around the database engine and how you're applying it to the problems you're trying to solve. It is a perennial
problem and definitely subject to the overarching
it depends answer. So I appreciate you taking the time to explore that with me, and I hope you enjoy the rest of your day. Yeah. You too.

(59:08):
Thank
you
so
much.
Thank you for listening, and don't forget to check out our other shows.
Podcast dotnet covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering podcast is your guide to the fast moving world of building AI systems.

(59:29):
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host at data engineering podcast.com
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Advertise With Us

Popular Podcasts

Dateline NBC

Dateline NBC

Current and classic episodes, featuring compelling true-crime mysteries, powerful documentaries and in-depth investigations. Follow now to get the latest episodes of Dateline NBC completely free, or subscribe to Dateline Premium for ad-free listening and exclusive bonus content: DatelinePremium.com

Stuff You Should Know

Stuff You Should Know

If you've ever wanted to know about champagne, satanism, the Stonewall Uprising, chaos theory, LSD, El Nino, true crime and Rosa Parks, then look no further. Josh and Chuck have you covered.

Intentionally Disturbing

Intentionally Disturbing

Join me on this podcast as I navigate the murky waters of human behavior, current events, and personal anecdotes through in-depth interviews with incredible people—all served with a generous helping of sarcasm and satire. After years as a forensic and clinical psychologist, I offer a unique interview style and a low tolerance for bullshit, quickly steering conversations toward depth and darkness. I honor the seriousness while also appreciating wit. I’m your guide through the twisted labyrinth of the human psyche, armed with dark humor and biting wit.

Music, radio and podcasts, all free. Listen online or download the iHeart App.

Connect

© 2025 iHeartMedia, Inc.