Schema on Write vs. Schema on Read: What’s Faster for Your Business?

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Brick Thompson (00:00):
Rick, welcome to the dashboard effect Podcast.

(00:08):
I'm brick Thompson

Unknown (00:09):
and I'm Landon Ochs.

Hey Landon, good to see you. We wanted to talk
about a topic today which is alittle bit technical, but we'll
try to explain it in a waythat's not too complicated, but
basically it's a concept ofschema on read versus schema on
write that already soundscomplicated. Maybe we can define
it. Let's define that. So whatis schema on write? What does

(00:33):
that mean? Yeah,

Unknown (00:34):
so schema on write is the whole idea of storing your
data in a format that you wantit to be in, mainly for
reporting, for

reporting. So it's already got the
relationships, the transforms,anything that needs to happen
exactly the data is sitting inplace, sort of, sort of the, I
mean, the old data warehousemodel where you got a SQL
Server, yeah, you've got therelationships, the tables.
Everything's already set

Unknown (00:59):
up. Yeah, yeah. With databases, you know, you define
this as my table. These are thecolumns I need. These are the
constraints. So it's there, youknow, it's defined, okay.

And so then what is schema on read?

Unknown (01:12):
Yeah, so schema on read is the whole idea of taking your
data, giving it, let's say, aselect statement as an easy
example. On top of that, andyour SELECT statement is what's
actually going to transform yourdata and serve it out in the way
that you want it. So you're notactually saving it anywhere.

(01:32):
You're just saving how to getthe data into that format
whenever you need it. Kind

of okay. So you the easy example is you have a
data lake. It's got raw data inthere. It just sits there.
You've got a semantic modeldefined that gets activated when
you're running some BI reportsor doing a query, something like
that, that that is sort of thetranslation layer into the data
lake. So the the schema is beingdefined as you're reading it.

(02:01):
Yeah, exactly. Okay. So that'ssort of the modern way that
we're doing a lot of stuff thesedays. There are still use cases
where you want schema on writeso that the data is sitting
there with the schema alreadydefined. What typically would
lead to that?

Unknown (02:19):
Yeah, definitely. So I will say, you know, we haven't
seen this too often with thesize of clients we typically
work with, but there is one veryrecent, not, not too recent, but
there's a, there's a prominentuse case that that comes to
mind, where this particular firmhad, you know, 10s of billions
of rows that they were dealingwith. They also had six to seven

(02:43):
different sources. So all ofthese sources served up data
differently. They had differentrules, different logic that
needed to happen to get theinsights that they were hoping
to get out of the data. Okay, sowhen we're talking that much
data, and you know that complexof transforms, it was just crazy
trying to use that whole schemaon read methodology. It spun and

(03:07):
spun for Okay,

so, yeah, so you were asking the system basically
to create conform dimensions andall the stuff you need on read,
and it just took a lot ofcompute. I'm sure you could
scale the compute so it wouldperform well, but then that gets
expensive, yep. So you're justbetter off doing it on right,
sort of the old style, Kimballstyle, you know, data warehouse.
Okay, that makes sense. So whenwe're doing schema on read,

(03:32):
typically for our clients, hesaid, you know, our size clients
just, just so listeners canknow, typically, a client of
ours is sort of in the $50million revenue range to maybe a
billion or 2 billion in revenue.
So very mid sized company, not ahuge enterprise company. So when
we're setting up a schema onread, we're typically doing that
with serverless SQL, so usersstill feel like it's sort of the

(03:53):
old model, like they're justdealing with tables with rows
and relationships and so on.
Yeah,

Unknown (04:00):
exactly. They can go through, select the tables they
want, select the columns theywant, and in most cases, it's
about the same speed as if wehad a physical table that they
were using. So yeah,

these tools have gotten so good, that's amazing.
Okay, so for you, then thedecision about whether to do
schema on write, having the datasitting there with a schema
already defined, not goingthrough a semantic layer. I
guess there's an implicitsemantic layer there, but you
decide whether to do that basedon how much compute it's going

(04:34):
to take to get to that schema.
So if you have billions of rows,lots of different data sources,
and basically it comes down toyou're going to you're gonna
have to do lots of transforms onthe way, and you might as well
go ahead and store that on theway, and so you're not doing it
every time a report isaccessing.

Unknown (04:51):
Yeah, yeah. And, I mean, I like to tell my team
even simplify it more, you know?
And don't, don't worry about thecompute, worry about the time,
right? It's easier. Everybodycan attach to it. It can be hard
to know how much compute you'reusing in these serverless
environments. But you know, ifwe have a report that's taken
five hours to refresh, yeah? Soyeah, you know, ideally we'd

(05:15):
want to be in 20 minutes less.
So that's usually,

yeah, okay, but the report itself is going to be
much more performant than

Unknown (05:26):
that. Oh, yeah. This is just getting the data into the
report. Once it's stored in thereport, it's, it's going to be
snappy, yeah,

very few seconds. Okay, all right, that's
an interesting topic. I mean,we've been seeing this change
over the last decade, you know,going from where you know, all,
all data stores, well, forreporting, basically, were data
warehouses, Kimball style datawarehouses, or Inman style data
warehouses, to now Data Lake isthe main, main form, and really

(05:53):
is data lake house. So you'vegot the data lake, and then
you've got these schema on readsemantic layers ready to go?
Yep,

Unknown (06:01):
exactly. Yeah, it's, it's actually been, been cool to
see that. I will say I wasexpecting to have to do that
whole scheme on right a lot morewhen we initially went with this
design,

which designed the the lake house. So when we,
when we

Unknown (06:17):
pivoted from data warehouse to Lake houses, and I
was pleasantly surprised. Istill am at what we're able to
do without that, you know,without doing it on right? Yep,
exactly. And it's just faster,more efficient for our team,
faster for our clients, becausewe're able to add columns, tweak
logic, just way more flexible.
Yeah, so fast, yeah, so youdon't have to rewrite a bunch of
ETLs and exactly, rewritereports to deal with that. Yeah,

(06:39):
interesting.

Okay, well, that's what I wanted to cover on
this. Thanks. Great. Thank

Unknown (06:46):
you. All right. You.

All Episodes

Episode Transcript

Popular Podcasts

United States of Kennedy

Stuff You Should Know

Dateline NBC

.css-15opob5{left:0;position:absolute;top:0.8rem;} All Episodes

.css-14f5ked{margin:0;word-break:break-word;display:-webkit-box;-webkit-box-orient:vertical;box-orient:vertical;-webkit-line-clamp:2;overflow:hidden;}Schema on Write vs. Schema on Read: What’s Faster for Your Business?

Episode Transcript

Popular Podcasts

.css-r6mb8g{margin:0;word-break:break-word;display:-webkit-box;-webkit-box-orient:vertical;box-orient:vertical;-webkit-line-clamp:1;overflow:hidden;}United States of Kennedy

Stuff You Should Know

Dateline NBC

All Episodes

Schema on Write vs. Schema on Read: What’s Faster for Your Business?

United States of Kennedy