March 25, 2025 52 mins

In this episode of What’s New in Data, AWS VP and Distinguished Engineer Marc Brooker joins us to break down DSQL, Amazon’s latest innovation in serverless, distributed databases. We discuss how DSQL balances consistency, availability, and scalability—without the headaches of traditional relational databases. Tune in to hear how this new approach simplifies architecture, eliminates operational pain points, and sets a new standard for high-performance cloud databases.

Follow Marc on X, Bluesky, LinkedIn, or his blog for more insights on distributed systems, databases, and the future of cloud computing.

What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common real-world data patterns, and analytics success stories.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:05):
Welcome back to What's New in Data.
I'm your host, John Kutay.
Today I'm joined by Marc Brooker, VP and Distinguished Engineer for Databases at AWS.
Marc has been at the forefront of building some of AWS's most critical infrastructure, and today his focus is on their distributed SQL database services.
In this episode, we'll dive into the architecture behind

(00:27):
AWS's distributed SQL offerings, namely DSQL, and we'll also talk about the future of cloud databases and lessons from building resilient, high-performance systems.
Let's get right into it, Marc.

Speaker 2 (00:41):
How are you doing today?

Speaker 3 (00:43):
Yeah, I'm doing great.
Super, super excited to be talking about DSQL, and databases more generally. Always a fun thing to talk about.

Speaker 2 (00:53):
Yeah, absolutely.
I know the listeners are excited about this one.
You've written a lot of awesome technical blogs on this topic, and we're going to dive into some of what you discussed.
But first, Marc, just tell the listeners about yourself.

Speaker 3 (01:08):
Yeah, so I work at AWS, and I've been at AWS for 16 years, engineering the whole time.
I've done a little bit of management on and off, but I've been primarily an individual contributor throughout that career, and I have worked on a whole lot of different services at AWS, most recently in our

(01:31):
databases and AI businesses, and storage before that.
I joined the team that built AWS Lambda soon after the preview launch of that product. I worked on the EC2 team in the early days, when folks were still saying, why would I buy my IT infrastructure from a

(01:51):
bookstore?
And so it's been a super fun journey of just seeing the cloud grow and this whole industry grow around it.
Before that, I joined AWS from a completely different background.
I was in grad school, building radar systems, doing electrical engineering, kind of a

(02:16):
distributed background, but not computing, not distributed computing, not the cloud at all. And so it was a really fun career pivot back in 2008, and one that, in retrospect, I've been pretty happy with.

Speaker 2 (02:33):
Yeah, excellent. Definitely so much value has been delivered with services like EC2 and cloud computing in general, and it's changed the entire landscape for how you build and scale applications, whether you're a mom-and-pop small

(02:55):
business or one of the world's largest and most sophisticated enterprises and tech companies.
So very cool to be chatting with you about what you've been working on recently, which is DSQL.
So I wanted to ask you: who is DSQL designed for, and what key challenges were you aiming to solve for your users with this

(03:16):
service?

Speaker 3 (03:17):
Yeah.
So there are kind of two things there; it's almost kind of two products.
One of them is: we have this set of customers who have been building on serverless.
They've been building on containers, and what they've noticed is that, well, they've continued to love the relational

(03:38):
data model, love SQL for all of the reasons people do, want transactions and all of that good stuff.
But what they've noticed about most relational databases is they require a lot more operational care.
They require a lot more attention to scaling.
They don't have the kind of scale and dynamism that folks

(04:00):
expect from containers and serverless.
Often they have things like connection limits which get in the way.
And so one set of customers we were really trying to help out with DSQL is people building on that kind of modern, cloud-native compute architecture: to give a fully relational database

(04:23):
experience, to give SQL, to give transactions, to give ACID and all of that good stuff, to folks building on that kind of infrastructure, in a way that can keep up on scale, on simplified operations, and so on.
And then, almost at the other end (I wouldn't say the other end of the database market, but in a

(04:43):
different part of the overall market of transactional databases), there's a set of folks building extremely highly available applications, building critical applications, often applications that are critical to their countries' or the world's

(05:04):
economic or transportation infrastructure.
And these are folks who are building, often with internal requirements or regulatory requirements, to run in multiple regions, to have geographic diversity, to have the ability to move workloads away from entire regions.

(05:25):
And one of the patterns that folks love, and for good reason, is active-active, right? They want to build in a way that their application is running concurrently in region A and region B.
They want to have, again, SQL, ACID, the relational data model and all of that good stuff.
They want strong consistency, they want transactions, and DSQL

(05:49):
is also being built for that set of folks, with multi-region active-active, high availability, and durability across three AWS regions: the ability to remain available even when an entire region fails, with no data loss or consistency loss on failure.
And so, two fairly different use cases.

(06:12):
There's some intersection between them, but not an awful lot.
But those are the two big things that we were trying to go after with this new Aurora flavor, Aurora DSQL.

Speaker 2 (06:23):
Yeah, and DSQL is serverless, and that's one of the big advantages of it, and it's distributed.
But it seems like, based on how you designed it, you don't lose the availability and consistency of having a massive, dedicated, maybe clustered

(06:50):
database, right?
So when I traditionally worked in building cloud applications, it seemed like if you wanted to quickly build a demo or something, sure, use serverless. But for anything serious you'd need runbooks for failover, and you're probably going to need some dedicated infrastructure for that, and significant planning just to maintain the infrastructure for an application, in addition to the databases and the software stack

(07:14):
.
But what you're saying is you get all the advantages of serverless and the advantages of a clustered, highly available, distributed database.

Speaker 3 (07:24):
That's been our goal, and as we get into the technical details, we can talk about some of the trade-offs.
But one of the things that we do at Amazon when we're building a big product like this (not always, but especially when we're building a big new initiative like this) is to spend a lot of time talking to customers and prospective

(07:46):
customers about their needs, and one of the words that I heard a lot in those conversations was regret.
Right? Like, hey, I want to start with something that's really easy to get going, that's low risk, that I can build a meaningful application with, with a small engineering team.
But I don't want to regret that choice as I grow, as I'm

(08:06):
successful, right? Like, as my business or this internal product or whatever moves up and gets bigger and gets more critical.
And so that's really been our core focus here: saying, this is going to be a great choice at small scale, it's going to be easy to get going, it's going to be easy to build that sort of V1 of your application, but then you're not

(08:28):
going to regret that choice.
Over time it's going to grow with you, it's going to meet your most critical operational requirements.
It's going to grow with you on scale, on availability, on durability, on enterprise integrations, and all of those other pieces.
And so that's a big part of that vision.

(08:48):
You know, it's great to have serverless products that make it easy to get going with something, easy to build a V1 of something.
That's valuable in itself.
But it's even better if you can build things that way and then slowly evolve them and adapt them into being huge and meaningful businesses.
Because the worst day in any application team's world, or

(09:10):
nearly the worst day, is when they have to go to the business and say, well, for the next year you have to turn customers away while we do a big re-architecture, while we replace our data store.
I think nobody wants to have that conversation, but too many people do have it.
I've had that conversation in my career, and I hope to never have it again.
And so that's been a big part of the drive behind

(09:32):
this vision: that folks can pick DSQL and then it will grow with them, grow with their requirements, to businesses of literally any size.

Speaker 2 (09:48):
Absolutely, and that's a great point: teams build that V1 version of their product on serverless, or however they design it, but it may not scale once they've become successful and have tens of thousands of customers.
And, like you said, migrations can take years. Infrastructure migrations can certainly take years,

(10:11):
with a lot of additional planning and cost for zero downtime in those scenarios.
So I do want to get into some of the trade-offs you mentioned. Specifically, you talk about active-active and multi-region architecture, and I'm sure even across availability zones.
I'm sure that's all there.

(10:32):
Can you elaborate on how you balance the trade-offs between availability, consistency, and latency in this design?

Speaker 3 (10:40):
Yeah, yeah.
So I think probably the first law of physics there is a consistency-versus-latency one.
At the moment that you say to the database, commit, right, like, I want you to make this transaction durable for me, it can make that durable locally. But in the

(11:04):
multi-region setting, it can either send the data over and, while it's in flight, say it's committed; or it can send the data over, wait for it to have been sent, wait to get acknowledgement, know that it's durable on the other side, and then return that commit.
And so the difference between those can be no impact, no RTTs of

(11:28):
communication, versus having to wait for at least one round trip between a pair of regions.
And so that's trade-off number one.
And what we've done with DSQL in its current guise (and we might provide more control over this later as the product grows) is we have chosen synchronous: when you call commit, in that

(11:51):
three-region setup, we're going to make sure it is durable in storage in two of those three regions before we return that commit to you.
And that does increase commit-time latency (not write-time latency, crucially, but commit-time latency), but it does ensure that you can lose an entire

(12:14):
region without losing any data.
Almost secondary to that durability work: now that we've moved the data over, we can, with very little additional latency cost, provide strong consistency.
And so that's the other part of it.
It's to say: can I assume that, if I have done a commit in region

(12:37):
A and I immediately go to region B, I can see the effects of that commit immediately?
And again, if you do this replication asynchronously, the answer to that would be no, right?
That replication might still be in flight and might still be being applied.
But what we've chosen is consistency, and so that's

(12:59):
related to the synchronization.
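To make the synchronous choice concrete, here is a minimal sketch of a commit that acknowledges only once the writes are durable in two of three regions. All names here are hypothetical illustrations, not AWS code:

```python
import concurrent.futures

REGIONS = ["region-a", "region-b", "region-c"]  # hypothetical three-region setup

def replicate(region, changes):
    # Stand-in for shipping the transaction's writes to durable storage in
    # `region` and waiting for its acknowledgement (at least one RTT).
    return region

def commit(changes, quorum=2):
    # Synchronous commit: only return success once the changes are durable
    # in `quorum` of the regions. Commit latency now includes at least one
    # inter-region round trip, but losing any single region loses no data.
    acks = 0
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(replicate, r, changes) for r in REGIONS]
        for done in concurrent.futures.as_completed(futures):
            done.result()
            acks += 1
            if acks >= quorum:
                return True  # durable in 2 of 3 regions: safe to acknowledge
    return False

assert commit({"row": "new value"}) is True
```

An asynchronous commit would return before `replicate` finishes, which is faster, but can lose acknowledged data if a region fails at the wrong moment; that is the trade-off described above.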

Speaker 2 (13:02):
And when you say synchronous versus asynchronous, you mean when you are ensuring the durability, sending the commit out to the other nodes or whatever pieces of the stack exist. Are you sort of holding the world, in a sense, and kind of pausing the database at that point, before you've ensured that it's been acknowledged by the other nodes?

Speaker 3 (13:23):
Yeah, so that's a great question.
The answer, literally, is no.
You don't need to pause the whole database, but you do need to potentially wait on other writes to the same keys.
And so if I have a million rows in my database and I update row A and row B, and those are in separate transactions, they

(13:44):
don't have to wait on each other.
They can be fully parallelized.
However, if I update row A in region one and I update row A in region two, there needs to be some handshake.
That handshake is there mostly to ensure the other important property here that we haven't spoken about yet, which is isolation, right? Transaction isolation,

(14:06):
and providing the transactionality guarantees in this multi-region, active-active setup.
And so you certainly don't want to be in a world where you're slowing down an entire data set, and you don't have to do that from a kind of law-of-physics perspective. But once you have a single key which is being accessed from multiple places at

(14:27):
the same time, there does need to be at least one and a half rounds of communication (I think that's the physical limit) to ensure isolation and to ensure that consistency.
I wanted to jump quickly back to availability, because you asked about that, and I think what's important in our model is

(14:49):
this is a three-region model.
We have two primary regions and a witness, and if one of those regions fails or becomes partitioned off and unable to communicate, the database is not available in that region, but it will continue to be available and strongly consistent in the two remaining regions.

(15:09):
That's kind of the majority side of a partition, and the alternative to that would be to be available on both sides. But to do that you would have to give up consistency, and, once that partition healed, you would need to be able to post-merge data.
And so if you look at another product like DynamoDB Global

(15:32):
Tables, that is the decision that we made there: to be available on both sides and then post-merge.
That's got great availability properties, and what we've decided for DSQL is to be available only on the majority side and then, when that partition heals, you can copy the data over, repair the cluster fully deterministically

(15:54):
without giving up any of the ACID properties, and get back to three running regions.
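A rough model of that majority-side rule: after a partition, a region keeps serving only if its side of the partition still contains a majority of the three regions. This is an illustrative sketch with hypothetical region names, not how AWS implements it:

```python
def available_regions(healthy_links):
    # `healthy_links` is the set of region pairs that can still communicate.
    # A region remains available (and strongly consistent) only if the set
    # of regions reachable from it forms a majority of the three regions.
    regions = {"A", "B", "C"}

    def side_of(region):
        # Flood-fill over healthy links to find this region's partition side.
        side = {region}
        changed = True
        while changed:
            changed = False
            for x, y in healthy_links:
                if x in side and y not in side:
                    side.add(y)
                    changed = True
                if y in side and x not in side:
                    side.add(x)
                    changed = True
        return side

    return {r for r in regions if len(side_of(r)) > len(regions) // 2}

# All links healthy: every region serves.
assert available_regions({("A", "B"), ("B", "C"), ("A", "C")}) == {"A", "B", "C"}
# Region C partitioned away: the A-B majority keeps serving; C does not.
assert available_regions({("A", "B")}) == {"A", "B"}
```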

Speaker 2 (16:04):
Yeah, absolutely incredible.
The other thing that's impressive and notable about this is you fully leaned into the AWS software stack to deliver this solution.
So one example, which you wrote about, is using EC2's precision time infrastructure for synchronized

(16:24):
clocks. People who are familiar with distributed systems engineering know the significance of clock synchronization, and even the precision of the operating system, how it perceives time and things along those lines, making sure that's all in sync.
So can you explain how you use EC2 and AWS's clock synchronization support?

Speaker 3 (16:45):
Yeah, so, as your listeners might know, we rolled out the Time Sync Service on EC2 a few years ago, and we've been rolling out a big improvement on that over the last year or so, to bring time synchronization globally to EC2 instances down

(17:05):
to tens of microseconds. That's available to regular EC2 instances, and over time it will be available to all EC2 instances in all regions.
And so, taking advantage of that,

(17:28):
one of the things that we use that for is ensuring that reads are strongly consistent without needing any coordination between regions, or even within a region, for reads.
We might touch on this in more detail later, but the core idea there is this: when you send a read in a transaction (and that could be a SELECT; it could be an UPDATE, because an UPDATE in

(17:50):
SQL is a read-modify-write; it could be an INSERT, because inserts on unique keys also require some amount of reading; or any combination of those things), then, at the time your transaction starts, we look at that local clock from the EC2 Time Sync Service, we read its current

(18:13):
value, and then, throughout the rest of the transaction, every time the query processor which is running that SQL goes to storage, it says: give me this data I'm looking for, as of that time. We call it t-start.

(18:34):
And by saying "as of t-start", combined with the fact that storage keeps track of multiple versions of data and can answer that as-of question with perfect accuracy, what we can do is give that transaction a fully consistent snapshot across the entire data set in the database,

(18:58):
with no locking, no latches, no locks, even if we don't know (and in SQL, in general, you don't know) which keys are going to be accessed.
And this is a big, partitioned data set.
And so you'll go to partition C and say, give me the

(19:20):
row for cat, as of this time, and you'll get that back.
You do some thinking, maybe you run some of your application code, you go back to the client, and, oh, actually I also need the row D for dog.
You go to D, say give me the row dog as of that t-start, and what you'll be sure of is that the row C for cat and the

(19:45):
row D for dog that you pulled up will be at a consistent point in time in the database.
One won't be a transaction ahead or a transaction behind, or whatever the case may be.
And that's really great for isolation.
But because of the use of time and the use of multiversioning at the storage layer, that doesn't require any coordination

(20:08):
between shards, or even between replicas in a shard.
It doesn't require us to have a single primary replica for a shard, it doesn't require cross-region or cross-AZ coordination, and so there, in an availability zone, between one pair of machines, we can do those reads, do those reads to

(20:29):
another shard or partition, pull that data together, and still have that strong read isolation, snapshot read isolation, which is very cool.
It's a very cool property.
And then doing that with a clock that's well synchronized to physical time allows us to make sure those responses are

(20:49):
also ordered in physical time, which provides this additional property of linearizability, or strong consistency, where you can be sure that you're going to read your most recent writes and you're going to see all of the effects as of a real time in the database.
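The mechanism described above can be sketched as a toy multiversion store: each key keeps (timestamp, value) versions, and every read in a transaction is answered as of the transaction's start time, with no locks and no coordination. A hypothetical illustration, not the DSQL implementation:

```python
class MultiVersionStore:
    # Toy multiversion storage: each key keeps (timestamp, value) versions,
    # so it can answer "give me key K as of time T" exactly.
    def __init__(self):
        self.versions = {}  # key -> list of (ts, value), sorted by ts

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort(key=lambda v: v[0])

    def read_as_of(self, key, ts):
        # Latest version at or before ts; other readers need no coordination.
        result = None
        for version_ts, value in self.versions.get(key, []):
            if version_ts <= ts:
                result = value
        return result

store = MultiVersionStore()
store.write("cat", "v1", ts=5)
store.write("dog", "v1", ts=5)
store.write("cat", "v2", ts=9)  # a later transaction updates "cat"

t_start = 7  # read from the synchronized clock at transaction start
# Both reads are pinned to t_start, so they see one consistent snapshot,
# even though the reads happen after "cat" was updated at ts=9:
assert store.read_as_of("cat", t_start) == "v1"
assert store.read_as_of("dog", t_start) == "v1"
```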

Speaker 2 (21:09):
There's an underlying database engine behind this, but is this synchronization happening in kind of a software layer abstraction that's ahead of the database, or separate from the database itself?

Speaker 3 (21:22):
Yeah, it's kind of separate.
So our literal SQL engine, the thing that is running SQL, running those transactions, doing query planning and optimization and so on, is Postgres.
And we talked about how we made that choice; I think it was a really great choice.

(21:42):
I'm a huge Postgres fan.
But then below that, when Postgres says, hey, I need some data to return to the client or to process this query, that's where our custom implementation starts.
And so Postgres will say, I need this row as of this time, and

(22:02):
that's where we'll have an implementation that goes off and finds the right partition or right shard. It figures out which of the SQL operations it can push down to be processed at storage, which is a big latency win. It finds a healthy replica, it finds a nearby replica, maybe it talks to the control plane to say, hey,

(22:25):
we need some more replicas of this data.
It does that whole thing, and then it gets back to Postgres and says, here are the rows that you need to process this transaction.
And then Postgres will do things like the join logic and all of those kinds of things that relational engines are just really great at.
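That layering might be pictured like this: a mostly stock relational engine asks an abstract storage layer for rows as of a time, while replica selection, routing, and health live entirely below the interface. Class and field names here are hypothetical:

```python
class DistributedStorage:
    # Everything below the relational engine: replica selection, routing,
    # health. The engine never sees any of this.
    def __init__(self, replicas):
        self.replicas = replicas  # partition -> list of replica dicts

    def get_rows(self, partition, predicate, as_of):
        # Pick any healthy replica; multiversion as-of filtering is elided
        # in this sketch.
        replica = next(r for r in self.replicas[partition] if r["healthy"])
        return [row for row in replica["rows"] if predicate(row)]

class RelationalEngine:
    # Stand-in for the Postgres layer: plans and executes SQL, joins rows,
    # but is unconcerned with replicas and failures.
    def __init__(self, storage):
        self.storage = storage

    def select(self, partition, predicate, as_of):
        return self.storage.get_rows(partition, predicate, as_of)

storage = DistributedStorage({
    "p1": [
        {"healthy": False, "rows": []},  # failed replica: the engine never knows
        {"healthy": True, "rows": [{"k": "cat"}, {"k": "dog"}]},
    ],
})
engine = RelationalEngine(storage)
assert engine.select("p1", lambda r: r["k"] == "cat", as_of=7) == [{"k": "cat"}]
```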

Speaker 2 (22:44):
Absolutely, because what you're describing sounds like something beyond what any database engine itself has offered. There have been databases that can clearly scale in the cloud, and cloud databases offered mostly using sharding and partitioning, and then there are clustered

(23:06):
databases as well. But to do this with a serverless offering is truly unique, and it's the first that I've seen of it at this level of control around disaggregation of compute and storage, strong consistency and isolation, and these very sophisticated

(23:29):
multi-region setups with high availability.
So that's why I was asking: I don't think there's a database engine where you could just pull some knobs and set something up like this yourself.

Speaker 3 (23:45):
Yeah, I'm not aware of one, although there are some really interesting distributed SQL engines in the world and some really cool technology there.
One of the choices that we made as we designed DSQL was to try and disaggregate the database into multiple concerns, and so we tried to separate out the distributed-systems concerns from

(24:08):
the core relational concerns.
Now, there's some leakage across that boundary.
The relational engine can't be entirely oblivious to the fact it's running in a distributed context, because it has to know some of that to make good query plans and to do good query execution. But it doesn't have to know anything about how many

(24:28):
replicas there are.
Are there replicas?
Where are they situated?
How do I find them?
Have they failed?
And so, by taking all of those hard distributed-systems concerns and separating them out into a separate component from the database engine, it's allowed us to keep that database engine

(24:49):
quite simple, quite stock, and that's useful operationally.
But it's also organizationally allowed us to have the folks working on that part of the system be deep database experts rather than deep distributed-systems experts.

(25:10):
And that's a really big win, just organizationally and as an engineering team, to be able to have a level of specialization like that.
And so what we've tried to do is make that Postgres layer as

(25:35):
unconcerned (I wouldn't say oblivious, but unconcerned) with the realities of being in a distributed system as we can.

Speaker 2 (25:42):
Absolutely. And I'd love to hear about how you approach offering strong consistency and isolation in multi-region setups.
That's also very interesting and unique, and you wrote about this: the use of optimistic concurrency control, and multiversion concurrency control as well. How does that

(26:04):
contribute?

Speaker 3 (26:07):
Yeah.
So that's, let's say, a really deep one. But let me get into some of that.
I think multiversion concurrency control is the first part of that, and that is this idea that, on the storage node, we have multiple

(26:27):
versions of each piece of data, versions spanning over a time window, and so that allows the query processor to go and ask for data as of a particular time. And so that's a data structure at the storage

(26:48):
layer with multiple pieces of data in it.
And the benefit of that, sort of touching on what I said earlier, is that the query processor can form a consistent snapshot of the data in the database without having to do any coordination with other query processors.

(27:08):
It doesn't even have to be aware that other query processors exist; it doesn't have to go to a leader for a shard. And so that multiversion concurrency control step, on the read side, pretty much eliminates the need for distributed coordination in the system.
And then we get to the write side, and that's where things get

(27:29):
really interesting. And so I would sort of break that down according to the ACID properties.
As you commit a transaction, you need to do two things.
One of them is isolation, and that simply means: given a particular isolation level, can this transaction commit

(27:52):
while keeping a certain set of rules about the transactions that it ran concurrently with?
And at the isolation level that we currently support, which is called strong snapshot isolation, the rule is fairly simple.
It's: a transaction can commit as long as, given the set of

(28:15):
keys that it's updating, the keys that it's writing, nobody else has written those keys between when that transaction started and when it's trying to commit.
And so it does the step where we have another separate piece of the database called the adjudicator: it looks at the set of rows that it wants to commit, and it goes to the

(28:36):
adjudicator and says, my transaction start time was five.
It's now time seven.
I want to commit keys A, B and C.
Are there new versions of these keys in the database since I read them? And the adjudicator can say one of two things.
It can say, no, there are no new versions, that's fine

(28:57):
, you can go ahead.
Or: yeah, there are new versions of these keys; you need to abort this transaction.
And that is all the coordination that is required to meet this strong snapshot isolation level.
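The adjudicator's rule can be captured in a few lines: a transaction may commit only if none of the keys it wants to write has a version newer than its start time. A toy sketch, not the AWS implementation:

```python
class Adjudicator:
    # Toy version of the commit-time isolation check: a transaction may
    # commit iff none of the keys it writes has been written by someone
    # else since the transaction's start time.
    def __init__(self):
        self.last_write = {}  # key -> commit timestamp of latest version
        self.clock = 0

    def try_commit(self, t_start, write_keys):
        # Abort if any key has a newer version than the snapshot we read.
        for key in write_keys:
            if self.last_write.get(key, -1) > t_start:
                return None  # abort: write-write conflict
        self.clock += 1
        commit_ts = self.clock
        for key in write_keys:
            self.last_write[key] = commit_ts
        return commit_ts

adj = Adjudicator()
t1 = 0  # two transactions both start at time 0
t2 = 0
assert adj.try_commit(t1, ["A"]) == 1      # first writer of A commits
assert adj.try_commit(t2, ["A"]) is None   # concurrent writer of A aborts
assert adj.try_commit(t2, ["B"]) == 2      # disjoint keys never conflict
```

This is the optimistic part of the scheme: transactions run freely and only coordinate in this one check at commit time.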
The next step is the most important one, and that is atomicity and durability.
Right, this is where a commit actually becomes a change to the

(29:20):
database. And so what we do, once we've passed those isolation checks, is we package up the set of changes to the database, we choose, based on the isolation checks, a version number for it, a point in time for that transaction to take effect, and we write it to a service called Journal.
This is an internal component we've been building at AWS and have

(29:43):
used for over a decade.
We use it in all kinds of places.
We use it in S3, we use it in DynamoDB, we use it in Lambda, we use it in Kinesis. And what Journal provides to the database is two things. It provides durability: in the single-region mode, that means on storage across multiple availability zones, and in the multi-region mode it means on

(30:05):
storage across multiple regions.
And it provides atomicity, this kind of core database property of: I'm either going to accept this whole transaction, or I'm going to accept none of it.
And so once we've handed that off to Journal, then we know that we've passed our isolation checks, that that

(30:26):
commit is atomic, and that commit is durable, and we can go back to the client and say, your commit has committed, congratulations.
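The hand-off described above can be sketched as follows: the transaction's whole change set is appended to the journal as one record (all or nothing), the client is acknowledged, and the changes are applied to storage afterwards. Names are hypothetical; Journal's real interface is internal to AWS:

```python
class Journal:
    # Toy model of the Journal role: a transaction's whole change set is
    # appended as a single record (atomicity), and once appended it is
    # treated as durable; applying to storage happens afterwards.
    def __init__(self):
        self.records = []

    def append(self, change_set, commit_ts):
        # One record per transaction: the whole set lands or none of it.
        self.records.append({"ts": commit_ts, "changes": dict(change_set)})
        return True  # durable: safe to tell the client "committed"

def apply_to_storage(journal, store):
    # Replay committed records onto (multiversion) storage, in order.
    # This happens after the client has already been acknowledged.
    for record in journal.records:
        for key, value in record["changes"].items():
            store.setdefault(key, []).append((record["ts"], value))

journal = Journal()
journal.append({"A": 1, "B": 2}, commit_ts=7)  # commit acknowledged here
store = {}
apply_to_storage(journal, store)
assert store == {"A": [(7, 1)], "B": [(7, 2)]}
```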
Then, at that point, in parallel, what we're going to do is apply that commit, apply that change, to the relevant storage nodes.

(30:49):
And then, sort of looping back to your question about optimistic concurrency control: that check with the adjudicator is optimistic concurrency control, right?
We have allowed the transaction to run in its entirety, to do a bunch of reads, to spool up a bunch of writes, to communicate with the client.
We have not coordinated with other transactions up until this

(31:12):
point.
And then, at commit time, we go and check: have we met the rules?
And one of two things can happen: yes, or no, your transaction needs to be aborted.
And so some database academics will notice that this is not a pure optimistic concurrency control scheme, because it's sort of

(31:34):
mixed with multiversion concurrency control.
In that way, the core write path is optimistic, and so we can go into the trade-offs between optimistic approaches and pessimistic approaches, if that would be interesting. But the big thing we get out of this is it minimizes the communication between regions and AZs in

(31:55):
these distributed settings.

Speaker 2 (32:09):
Yeah, excellent, and there are always trade-offs you're going to make in distributed systems. The most popular one is the CAP theorem, which is between consistency, availability, and partition tolerance.
But really there are hundreds of thousands of trade-offs that you'll make when you're actually in the weeds of implementing a service, especially at this scale.
I personally can't imagine, but based on what you're describing, it's certainly very sophisticated.
And so, coming back to your earlier point about

(32:31):
relying on Postgres as your main query engine, let's call it that, which is great, because so many applications are built on Postgres.
There's a lot of development happening there, whether it's building relational database applications, or now some recent AI-based applications with pgvector,

(32:53):
and it has a great open source community of extensions. We had Gwen Shapira from Nile, who came on the pod and really dove into just how rich the extension community is too.
We really appreciated learning about that.
Now, of course, Postgres has its own logging as well.

(33:15):
You spoke about journaling, which is a little higher level, but Postgres the database itself, like a single instance, will have its own write-ahead log, and it'll have its own buffer cache.
It'll have all these components that sort of come in with a database off the shelf.
So are you also piggybacking on the logging of Postgres, or is

(33:37):
that something you've also built into your own layer of abstraction?

Speaker 3 (33:41):
Yeah, so that whole durability and storage layer of Postgres we aren't using, because we wanted to build in this fairly unique set of distributed properties, which isn't Postgres'
core concern.
Right, it is designed to be durable to

(34:04):
storage on a single machine.
There's obviously some replication machinery there too, but we needed to replace all of that, both for scale-out reasons and for fault-tolerance reasons.
We wanted to be durable synchronously across multiple AZs or across multiple regions.

(34:25):
We wanted to be able to transparently partition data into multiple write partitions and multiple read partitions, and those were things that that layer of Postgres, as great properties as it offers, doesn't do. And so we replaced those pieces of the engine.

Speaker 2 (34:46):
Yeah, wow, that's incredible to hear about, and certainly a very sophisticated design. It's really focused on serverless, but also high availability and scale, which is a truly unique combination, and for end users it makes it easier and less costly to run databases than

(35:10):
ever before.
So what would you imagine are some of the new types of use cases this architecture can enable that just weren't practical before?

Speaker 3 (35:19):
Yeah, well, before I answer that, and that's a great question, I did want to say one of the reasons that we chose Postgres is that it has this really great, extensible, modular architecture. Actually, for its age and complexity,

(35:40):
it's a beautifully architected piece of software, and so it very naturally allows the kind of deep surgery that we've done on it, and that's exciting. I think it's one of the reasons that Postgres is having this kind of moment across the industry, right? It's just super

(36:03):
popular right now, for good reasons, and one of those reasons is that it provides a huge amount of extensibility and flexibility and lets folks build all kinds of cool things with it and around it in interesting ways. Jumping to your question about use cases,

(36:24):
one of the things we're really trying to do here is simplify architectures. If you look at an architecture today based on a traditional relational database, there are often a lot of blocks, right? You will have your primary database.

(36:45):
You'll maybe have some failover databases in multiple data centers. You'll have machinery for detecting failures and managing those failovers. You'll have replication machinery keeping those replicas in sync. Often, because you only have a single consistent

(37:06):
primary, you'll have a caching layer on top of your database. You'll have some kind of change data capture or replication mechanism keeping that cache up to date, or you'll have logic in your application keeping that cache up to date. Often you'll have a bunch of plumbing out to infrastructure for doing

(37:27):
reporting and analytics. You'll have plumbing to infrastructure for doing backup and restore, point-in-time recovery, audit, and so on. And so these architectures, as you try and take a database into being a reliable, whole kind of end-to-end application solution, become quite complex.

(37:48):
And so what we've been trying to do, both with DSQL and across the whole AWS data family, is really simplify that. With DSQL, what we've said is: hey, this is already multi-AZ, so you don't need to worry about those multiple replicas. We'll handle that replication and moving of data

(38:10):
if there are failures for you. You don't have to worry about caching in the same way. We'll scale out for reads and for your read load. We'll keep your reads local, within an AZ or within a region, and optimize that latency for you. You don't need those blocks in your block diagram.

(38:30):
Backups are built in, restore is built in. These pieces are built in, right? And so what we've really tried to do is take all of those use cases that folks have, all of the things that add complexity to architectures, and really simplify them. And I think, if we've succeeded there, there's a cool set of applications we're going to

(38:53):
unlock, which I'll talk about in a second, but really it's about simplifying those things. And then, what do we unlock? Well, with the new architectures, we're hoping that it becomes much easier and much more cost effective to run active-active architectures, especially in the

(39:15):
multi-region setting. Folks tend to go to active-failover architectures mostly for operational simplicity reasons, mostly because their database doesn't do active-active in a nice way, and those architectures can be hard to reason about. They can be hard to make sure they actually work on the days

(39:36):
that you need them. They can be more expensive to run. And so we've really tried to make it as easy as possible to run active-active. We can have both sides of your application running. You know they work, because they're running all the time; both are handling customer traffic all the time. You're able to keep that infrastructure paying for itself

(39:57):
on both sides; you don't have infrastructure that's just sitting there gathering costs and providing no value to your customers. You can send traffic to the region that's closest to you to get better latency. There's a whole bunch of pluses, and we're just trying to make those architectures easier to achieve.

Speaker 2 (40:18):
Yeah, that's incredible. So, based on what you're saying, you'll have an even higher-level API, in addition to what Postgres offers out of the box, to store and query your data.

Speaker 3 (40:30):
And you were talking about reads and locality and things along those lines, and those are just built in, right? You just look up the DNS name, you connect from your Postgres client, and all of the locality stuff will be handled. If you're running in EC2, even the kind of AZ and data center

(40:51):
locality will be entirely handled by the infrastructure, without you having to worry about it at all as an application programmer or as a system operator.

Speaker 2 (41:03):
That's absolutely incredible. Developers who have worked with Postgres before, and especially at optimizing it, might use shared buffers or pg_prewarm to govern which data stays in the buffer cache. Would those features be abstracted away, or are they also

(41:25):
something that a native Postgres developer who is experienced with them would be able to access?

Speaker 3 (41:31):
Yeah. So there's a sort of mixed answer to that. As a Postgres developer, you'll still need to think about the way that your queries and schema come together into execution plans in the database. You're still going to want to pay attention to that EXPLAIN and that EXPLAIN ANALYZE: how much work am I

(41:52):
asking the database to do on my behalf? Because anyone who's written a lot of database code or operated database systems knows the difference between the database doing O(1), O(log n), or even O(n²) work for you. It's hard

(42:17):
to look at a query and know how much work the database is going to do, and that's where that EXPLAIN comes in: hey, what are you going to do when I ask you to run this piece of code? And so that remains just as relevant. The stuff that we don't think is going to be as relevant is that preheating of the cache, that tuning of buffer and cache sizes, and so on.

(42:39):
We've tried to abstract those away and automate those so the operator doesn't need to worry about them. The internals still exist, but they are not exposed to the operator or the developer. They're things that are handled in our system.
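The EXPLAIN workflow Mark describes can be sketched in a few lines. This is a rough illustration using SQLite (via Python's standard library) as a stand-in, since the principle carries over; against Postgres or DSQL you would use `EXPLAIN` / `EXPLAIN ANALYZE` instead, and the table and index names here are made up for the example.

```python
import sqlite3

# Ask the planner how much work a query implies before trusting it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

# Without an index, the planner falls back to a full table scan: O(n) work.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()
print(plan[0][3])  # e.g. a SCAN over the whole table

# With an index, the same query becomes an index search: O(log n) work.
conn.execute("CREATE INDEX idx_customer ON orders (customer_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42"
).fetchall()
print(plan[0][3])  # e.g. a SEARCH using idx_customer
```

The query text is identical in both cases; only the plan changes, which is exactly why inspecting the plan stays relevant even when the buffer-tuning knobs are automated away.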

Speaker 2 (42:58):
Oh wow. And just to clarify, if I can ask from my side: is that with some sort of per-tenant caching design?

Speaker 3 (43:08):
Yeah, so there is some caching, especially around very frequently accessed tables like the catalog, which is a per-tenant isolated caching design. But most of what we've tried to do is avoid the need for aggressive caching at the query processor layer by pushing down

(43:32):
work to storage. And so the storage interface is not key-value. It's actually quite a rich push-down interface, where you can push down filters, you can push down projections, you can push down aggregates and those kinds of operations. What's super valuable about that is it makes the interface between the query

(43:55):
processor and storage so much less chatty, right? If you think about the interface that Postgres itself would have to storage, it's very chatty, it's very low level. It's like: hey, get me that B-tree page, get me that B-tree page, get me that page. Whereas in

(44:16):
DSQL it's a logical interface, that is: get me all of the rows matching this predicate. All of that low-level data structure wrangling happens locally on the storage node, where it can be done in memory or done locally against fast storage, which saves a whole lot of back and forth, and then

(44:38):
allows us to scale out that query processor layer horizontally without the challenge of keeping a large, coherent cache. And cache coherency is a classic computer science problem for a reason, and the reason is that it's really hard to get right, especially in the distributed setting.
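The chattiness difference Mark describes can be made concrete with a toy model. This is not AWS's actual interface; the hypothetical `Storage` class below just counts round trips between a query processor and a storage node, contrasting a page-at-a-time API with a logical push-down API.

```python
PAGE_SIZE = 100

class Storage:
    """Toy storage node that counts round trips from the query processor."""

    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def get_page(self, page_no):
        # Chatty, B-tree-page-style interface: one round trip per page,
        # and filtering happens back at the query processor.
        self.round_trips += 1
        start = page_no * PAGE_SIZE
        return self.rows[start:start + PAGE_SIZE]

    def scan_where(self, predicate):
        # Logical push-down interface: the filter runs on the storage
        # node, so the whole predicate costs a single round trip.
        self.round_trips += 1
        return [r for r in self.rows if predicate(r)]

rows = [{"id": i, "region": "eu" if i % 10 == 0 else "us"}
        for i in range(10_000)]

# Page-level: pull every page, then filter locally -> 100 round trips.
chatty = Storage(rows)
result_a = [r
            for p in range(len(rows) // PAGE_SIZE)
            for r in chatty.get_page(p)
            if r["region"] == "eu"]

# Push-down: same logical query, one round trip.
pushdown = Storage(rows)
result_b = pushdown.scan_where(lambda r: r["region"] == "eu")

assert result_a == result_b
print(chatty.round_trips, pushdown.round_trips)  # 100 1
```

The returned rows are identical either way; the push-down interface just moves the predicate to where the data lives, which is what lets the query processor layer scale out without a large coherent cache.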

Speaker 2 (44:59):
Yeah, absolutely. It's so interesting to hear about the implementation here and the design choices you made, and it's truly remarkable what this can enable. We talked about use cases; on a tangential point, one of the most popular and

(45:22):
revenue-generating AI use cases right now is code generation. Developers can write code super fast, and, looking ahead, we're going to have AI agents write more code and accelerate that process even more. One of the things that AI is good at is generating procedural code and logic. One of the things you wouldn't want AI to do is provision a

(45:44):
bunch of infrastructure for you. So with a service like this, where you just have a Postgres API, or Postgres interface, and the scaling and high availability are all abstracted away from the developer, whether human or AI agent, couldn't that be a future software development approach that we see

(46:09):
becoming more popular?

Speaker 3 (46:11):
Yeah, I think that's very astute. In a lot of ways, AI and serverless are just really deeply complementary: the ability to have agents or

(46:51):
human-in-the-loop code generation generate code, generate applications that then run against compute services, against databases, against storage services that handle all of the operations, without having to push that complexity onto the AI to solve. Those are going to be hugely complementary technologies. Now, there are also going to be a bunch of AI-powered things that run on more sort of classical architectures, and serverless things that are not AI-powered, right? I think we fill in the whole kind of matrix of options there. But I do think that combination of AI agents and

(47:15):
serverless is going to turn out to be a really popular and really productive one for a lot of folks building systems.

Speaker 2 (47:28):
Yeah, it's really an exciting time to be a software engineer and a data engineer. I know there's some kind of pop science fiction writing out there that AI is going to replace all engineers. I mean, I don't know. If you go look at OpenAI, they're still hiring a lot of engineers, or any AI company, even AWS. So

(47:54):
it's certainly very cool to see some of the new, like you said, complementary technology come in, like serverless. It does at least simplify the deployment process for AI, especially if there's a commonly used API that's built, at least foundationally, on Postgres, which most LLMs are

(48:14):
already trained on, and leveraging open standards. I want to get your thoughts. You also mentioned that, in terms of use cases, simplifying architectures is the first one. Is there anything else radical that you see being enabled by DSQL?

Speaker 3 (48:33):
Yeah, that's a really interesting question. I don't have anything particularly radical that I can share. Obviously this is something that we think about super deeply. But maybe one hint that I can drop is, I think, having this journal at the core

(48:54):
of the system, having this commit log at the core of the system, enables us to do some really cool things with the kind of history of transactions and so on that are hard to do in classical systems, and I think we'll be able to build some very cool data features around that over time. And then I think having this serverless

(49:17):
infrastructure and scalable infrastructure gives us a lot of flexibility in the ways that data is accessed over time. Right now it's Postgres and SQL, but it could be a whole lot more options than that. So very fun. And to your larger point about the industry, I think this is

(49:37):
honestly one of the most exciting times in the data and computer industry, certainly that I can remember. Just the last year in the data space has been extremely exciting. There are so many interesting trends and so many cool new technologies coming along. It's just a really, really cool time to be involved with this stuff. Absolutely, every day

(50:02):
there are new announcements.

Speaker 2 (50:04):
It seems like we're making these big kind of categorical advancements in technology on a much more frequent basis now, since the start of the AI wave. So it's been very fun to follow along with it. I have a distributed

(50:27):
systems and data background, so especially the work that you're doing. I've had a lot of fun following along with your blogs. I just love how generous you are with your insights and the work you're doing, because there have been companies that have built incredible IP and databases but don't really get into the depth of their implementation as much. You have to go find it. But the way you just blog about it

(50:52):
is really cool, and I definitely recommend the listeners of this podcast, if you haven't yet, subscribe to Mark. Follow him on X, follow him on Bluesky. I'll let you actually answer: where can people follow along?

Speaker 3 (51:06):
Yeah, X and Bluesky are great options. You can follow my blog; I tend to put most of my long-form writing there. I've been doing a little bit more on LinkedIn recently, but still not a whole lot there, although there's a really interesting kind of technical community that

(51:27):
seems to be building around that platform. So any of those four options is great and should keep you up to date on the things that I've been doing. And I really appreciate your comments on my blog. Great to hear that folks enjoy it.

Speaker 2 (51:39):
Absolutely, they do. My friends who also work in the industry and I talk about it, and it's just really interesting to read and see what's going on. It's almost like watching a movie and then watching the behind-the-scenes footage. It's just very captivating and cool to

(52:00):
learn about the process and the thought process too.

Speaker 3 (52:05):
Yeah, that's really cool. Thank you.

Speaker 2 (52:08):
Mark Brooker, thank you so much for joining this episode of What's New in Data. And thank you to the listeners for tuning in. Really appreciate it. Thank you.