Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy?
DataFold's migration agent is the only AI-powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems.
(00:35):
Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multi-system migrations, they deliver production-ready code with a guaranteed timeline and fixed price.
Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold
to book a demo and see how they turn months long migration nightmares into week long success stories.
(00:57):
Enterprises today face an enormous challenge. They're investing billions into Snowflake and Databricks, but without strong foundations, those investments risk becoming fragmented, expensive, and hard to govern.
And that's especially evident in large, complex enterprise data environments. That's why companies like DIRECTV and Pfizer rely on SqlDBM.
Data modeling may be one of the most traditional practices in IT, but it remains the backbone of enterprise data strategy.
(01:24):
In today's cloud era, that backbone needs a modern approach built natively for the cloud with direct connections to the very platforms driving your business forward.
Without strong modeling, data management becomes chaotic, analytics lose trust, and AI initiatives fail to scale.
SqlDBM ensures that enterprises don't just move to the cloud; they maximize their ROI by creating governed, scalable, and business-aligned data environments.
(01:50):
If global enterprises are using SqlDBM to tackle the biggest challenges in data management, analytics, and AI, isn't it worth exploring what it can do for yours?
Visit dataengineeringpodcast.com/sqldbm, that's sqldbm, to learn more.
Your host is Tobias Macey, and today I'm interviewing Serge Gershkovich about how and why data modeling is a sociotechnical endeavor. So, Serge, can you start by introducing yourself?
(02:16):
Sure. Serge Gershkovich. I am a data modeler and data architect by trade for most of my professional career, and I'm also the head of product at SqlDBM, which is an online database modeling tool, a SaaS product that runs in the browser. So data modeling is very near and dear to my heart, and
(02:40):
I am also a Snowflake Data Superhero, a big fan of Snowflake, as you can tell, and of everything that they're doing. And, yeah, that's me in a nutshell.
And do you remember how you first got started working in data?
Kind of by accident, I think, the way most people end up in data. Nobody
(03:00):
tells their parents at a tender young age that they wanna be in data when they grow up. But I started with a technical degree, a computer science background, just kind of landed my first internship at the data side of a company, and just kind of stayed there. It was interesting. It was
(03:22):
challenging in its own way; programming, I'm sure, has different challenges, but that's where I ended up. It was in the SAP ecosystem to start, and I spent over a decade in SAP: first BW, then HANA, then the business intelligence suite alongside that. So eventually
(03:45):
I decided, once I saw Snowflake, that I never wanted to touch another SAP system again, and I've lived up to that promise. I'm very, very proud of that. And, yeah, eventually I just kind of shifted gears to product management.
And so in terms of the overall
space of data modeling, that is a term that can be used across
(04:09):
a large number of contexts and use cases.
And I'm wondering if you can just start by giving
your definition
of the activities that are involved in that overall practice of data modeling.
That's a great question, because I want to first start with what it isn't and maybe dispel some common myths that might persist for those that are unfamiliar with it, not used to doing it on a regular basis, or not doing it systematically as part of a project or part of the data strategy at their current organization.
(04:45):
So if I could summarize it in one phrase or image, it would just be a conversation between a business stakeholder or a subject matter expert and a data team: a data engineer, a data architect, a business intelligence team, an analyst.
It's
understanding
what the business does and encoding
(05:07):
that semantic information so that our data models and our data structures are built in a way that accurately captures those business requirements and the needs of the business. So we can, of course,
broaden the definition from there. But at the core, I think if people walk away with anything, it's being able to
(05:31):
align with your business stakeholders in a way that everybody's on the same page, and have some kind of semantics, whether those semantics are visual, verbal, technical, or preferably all three, that everybody can understand and share a common talking point, a common reference point, around.
(05:52):
And on the flip side, you mentioned some of the things that it's not. And I'm wondering if you can talk to some of the overall
buckets of
misconceptions
that you've seen people come into the overall practice of data modeling with where maybe they have a very narrow view of the set of tasks and responsibilities involved, or they have a different
(06:16):
area of focus or a different set of objectives as far as what that data modeling practice is supposed to achieve and maybe some of the ways that that can derail the overall
exercise.
Sure. I think the biggest misconception about what it's not is the idea that it doesn't exist or that it's optional.
(06:38):
Joe Reis famously said that the lack of a data model is still a data model, albeit a crappy one. And this misconception probably has its roots in the fact that most of us that work with data do so on the analytics side. So we're looking at it not in the transactional system, but in the data warehouse,
(06:58):
where you could make the claim that, broadly speaking, we might not be formally modeling something. We might just be getting data delivered, landed in our raw layer, and then working our way up through the transformations to reporting. So we're not formally creating ERDs. We're not formally discussing our designs and the relationships that our dimensions have with our facts. But by definition, we have a data model. It's just that we don't own it. It comes
(07:26):
cascaded down to us from the source system, where, of course, you can't have the sort of chaos you might find in a data warehouse, because in a transactional system, sales wouldn't get recorded, bookings wouldn't get made, etcetera. So there's absolutely a data model involved in everything you do. Our CRM systems, our master data management systems, they all have a data model. And even if that data model lives in your head, and even if your
(07:52):
data landscape is small enough for you to effectively keep it there, that doesn't mean that the data model doesn't exist. So that's the biggest misconception.
The other one is that a data model is secondary, something that we do post fact. So we build first. And then,
(08:14):
as much as I love dbt, they have a lot of blame in this misconception: that you can build, you can write SQL, and you can effectively CTAS your way to a data model. And if you need to document something, you describe that DDL and off you go. And this is why I started with what it fundamentally is.
(08:35):
It's much easier to check something out on paper before you actually build the thing, before you write lines of code and commit business logic, commit loading and testing and all of these deployment resources, just to find out that you couldn't align on a fundamental
(08:56):
formula or an understanding of something. And
the point I'm making is that data modeling is just a sanity check. When it's done right, it's in the order of: talk to the business, get the requirement, encode it in something very simple, just an ERD, for example, and then go off and build. Because
(09:18):
the chance that you've made a mistake after two pairs of eyes, especially when the business owners or the subject matter experts have had a look at it, is dramatically smaller. And, you know, even dbt has guardrails in place where you can absolutely enforce data contracts and things like that
(09:38):
that prevent you from even building something that doesn't meet your original spec: required data types, number of columns, references, and things like that. And
the analogy I like to use is a map. A data model is effectively a map of where you're trying to go in your data strategy. And
(10:02):
I'll come back to the transformational side of data modeling; let's put a pin in that one. But fundamentally, it's a map. And do you need a map? Well, the answer is
sometimes, not always. If you're standing in front of, let's say, a shopping center and it's just a strip mall, you don't need to consult the directory, because you see all five stores and you walk into the one you're looking for. If you're in an enormous shopping center, then you might well use a map just to find your car in the parking lot. And you wouldn't just wander around aimlessly if you're looking for something very specific. It would be in your best interest to spend two seconds consulting the directory, walking directly to where you need to go, and getting that information.
(10:42):
And unfortunately, that's another of the drivers that makes people overlook data modeling a lot of the time: the fact that, just like in this analogy, if you walk into a shopping center and you just start wandering around and asking strangers, you will eventually get to where you want to go. And what this looks like in the business world
(11:03):
is, to take our previous example: you're a data architect dealing with a data landscape that's effectively small enough to fit inside your head. Imagine it's a startup, a small company. But the company grows, your data sources grow, your data assets grow, your business units grow, your business teams and data teams grow.
(11:23):
The problem is you start to feel the symptoms, you start to feel the pain, before you've identified that a problem exists, because it feels quotidian. Before, I used to know everything. Then one day I didn't, so I asked the colleague next to me. The next day, he didn't know either, so we called a meeting with the business stakeholders. Then the business stakeholders
(11:45):
couldn't even really give us an answer, so we guessed and we built something. Turns out it was wrong, so we went back to design and we rebuilt it. All good. But the problem is, it's like a frog in boiling water: it was okay, and the pain points are all incremental. So it feels like it's fine. Nothing's broken. But fast forward a year, a year and a half: data quality issues, lack of trust,
(12:07):
we're blaming and we're getting blamed, instead of collaborating with those business stakeholders like we're meant to. We're not delivering value, because even when we get it right, the data is suspect. And how did we get here? It wasn't overnight; it was gradual. And this is another pain point:
if you know to look for the red flags and the warning signs,
(12:29):
then you can avoid this doomsday scenario. And, unfortunately, it's one that I think anybody that has worked in data has seen some version of, whether full blown or just starting, because, again, no data strategy is perfect. But these are just tools that help people orient and understand what the company is doing
(12:52):
and what it is we're capturing about what the company is doing and then presenting that to the decision makers.
Another point that you raised in terms of this overall space of data modeling is, in particular, the source systems that you're pulling data from, but also that data modeling is not something that is the sole domain or dominion of the data warehouse environment. You also need to incorporate data modeling
(13:19):
in exercises such as streaming ingest or if you're doing any data distribution
to things like applications
or other contexts or if you're trying to
incorporate unstructured or semi structured data assets either into a warehouse environment or an application context, you still need to understand
what is the actual purpose and semantics of that data.
(13:41):
And I'm wondering how you see that recognition
of the other
modalities and localities of data being brought into that overall conversation
of data modeling, and what bearing does this piece of information or this collection of information have on the overall business domain model?
(14:02):
Great question, because I would say that I don't see it, or at least not as much as I would like to. And, just like myself, many people end up in the data world without a formal background. So I've seen many people come even from the business side: they were the data expert on their team because they were managing the big Excel that nobody else knew how to make sense of. And they eventually kind of get formalized into the data team and get taught the fundamentals of SQL, and then they're off and running. But
(14:33):
many people, for example, don't know that even what we call a database today is founded in relational theory, which is mathematics. With the technology that we're using, of course, all of the cloud compute and caching and storage, there have been massive innovations there. But fundamentally, why do we call it a relational database?
(14:55):
Because it is grounded in relational theory, which goes back to the sixties and seventies and Edgar Codd, who codified all of this, and eventually that gave rise to the database as we know it today. Concepts like rows and columns were revolutionary at some point.
And when all of this was getting codified,
(15:15):
transformational modeling, and like I said, I'll get back to what I think that term should encompass, was nascent or almost didn't exist. The fact that you would be analyzing data across systems, and not just within one system, was not yet a thing. And
today, when we're looking at the contrast between a transactional system, where all of our models,
(15:41):
largely speaking, can be expressed in that classical definition of entities, relationships, dependencies, and primary and foreign keys, that is not so much the case on the warehousing side, where the OLAP system has a different type of dependency. It's not so much a formal
(16:01):
parent-child relationship, but it is formal in the sense that if the table I'm pulling from tomorrow is missing the column I'm trying to pull, then the pipeline is gonna break. So this is a transformational dependency. It's not to say that the dimension is a conformed dimension from parent to child, but you have a data pipeline that depends on all of these things coming together.
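For illustration, here is a minimal sketch of such a transformational dependency, with hypothetical table and column names:

```sql
-- Hypothetical downstream model. There is no declared foreign key back to
-- raw.orders, but if the status column disappears upstream tomorrow, this
-- view, and everything built on top of it, breaks.
CREATE VIEW open_orders AS
SELECT order_id, customer_id, order_date
FROM   raw.orders
WHERE  status = 'OPEN';
```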
(16:25):
So what are you trying to solve on the OLAP side versus the OLTP side? On the OLTP side, you're just trying to make sure that everything is as close to normalized as possible and that your data is consistent and not duplicated, and that you're able to maintain it in a way that avoids
(16:46):
all types of update errors, anomalies, etcetera. On the data warehouse side, we can almost guarantee that if we built something upstream, our data is pretty much clean when it lands. Of course, that's not always the case, but let's just go with it. On the data warehouse side, as you mentioned, you have multiple systems.
(17:09):
You have conformed dimensions by definition. You have transformational dependencies. So now let's unpack that. What does that mean? When the data lands, it's quite dirty. It needs to be renamed. It needs to be cleaned up. It needs to be made consistent. Then there are all the different methodologies that you could be using, and we'll park that for a later question, but let's just go with the general case. Let's go with medallion, or
(17:37):
staging and reporting, whatever you wanna use. You have a raw layer that is just data as is. You have a staging layer where it's a little bit more conformed and normalized, but still not overly aggregated. And then you have a reporting layer that is maybe denormalized completely into one big table, if you wanna go to that extreme. But regardless,
(18:00):
you have pipelines that transform the data across those three layers. So you have dependencies around selection, and you have quality issues in terms of: are you calling the same field by different names along different parts of that process? You have the fact that you can have multiple processes working on that data and creating, let's say,
(18:23):
the sales view and the marketing view of our trading information. And now, by definition, the core is duplicated, and some of the logic around it is specific to a certain department. Which one is right? Can I pull yearly totals from either one, or have they been distorted by business logic that's specific to each business unit? So now you run into all sorts of problems that didn't exist on the transactional side.
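To make that concrete, here is a hedged sketch, with hypothetical names and logic, of how the same core can end up answering the same question two different ways:

```sql
-- Hypothetical department views over a shared core table, each applying
-- its own business logic on top of the duplicated core.
CREATE VIEW sales_orders AS
SELECT order_id, customer_id, order_date, amount
FROM   core.orders
WHERE  status <> 'CANCELLED';   -- sales excludes cancellations

CREATE VIEW marketing_orders AS
SELECT order_id, customer_id, order_date, amount
FROM   core.orders
WHERE  channel = 'ONLINE';      -- marketing counts only online orders

-- "Yearly totals" now have two legitimate-looking answers:
SELECT YEAR(order_date) AS yr, SUM(amount) FROM sales_orders     GROUP BY yr;
SELECT YEAR(order_date) AS yr, SUM(amount) FROM marketing_orders GROUP BY yr;
```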
(18:52):
So how do we manage that? A lot of these things... for example, if you take a view, what is a view doing? It's applying SQL logic to a dataset and giving you a different kind of result. Unfortunately, the word logical has already been co-opted. So in formal modeling, generally, we start with a conceptual model, which is very basic.
(19:16):
Then we go to the logical layer, which is similar to physical but not really tied to any particular physical database, which makes it easy to understand and also to transform into a physical database or multiple physical instantiations.
And then there's physical, where you're actually describing
(19:37):
the particulars, like clustering and partitions and column data types, things that are gonna help you effectively store the data. Whereas on the logical side, you're not so much concerned about storage. You're concerned about nuances, business nuances: is this a subtype of this other type, or
(19:57):
how does a customer place an order? Can a customer, must a customer, place an order to even be a customer? Things like that that you're trying to iron out.
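As a hedged, Snowflake-flavored sketch of those physical particulars (hypothetical names; other engines would use partitioning clauses instead of CLUSTER BY):

```sql
-- Physical model: exact column data types plus a clustering key to help
-- the engine store and prune the data efficiently. The logical model
-- carries none of these storage concerns.
CREATE TABLE orders (
    order_id     NUMBER(38,0) NOT NULL,
    customer_id  NUMBER(38,0) NOT NULL,
    order_date   DATE         NOT NULL,
    amount       NUMBER(12,2)
)
CLUSTER BY (order_date);
```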
And what is a good word to call everything that happens in a data warehouse?
I am trying to go with transformational. In my intro, I forgot to mention that I'm also an author. I have
(20:21):
a book called Data Modeling with Snowflake. After many conversations with folks, because of the nature of my role at SqlDBM, people kind of understand the need, the importance, or they've just heard the term data modeling. They want a little bit of a foothold, because, like we said earlier, maybe they didn't have a formal background in it. People would come to me and say, can you recommend some reading material? And,
(20:46):
honestly, I was struggling, because I wasn't gonna throw C.J. Date at them when they just came from the business world. And I thought, let me write something that is just practical and to the point: what it is, why do it, how it's gonna help you, and here are some basic formulas. That's why I wrote the book Data Modeling with Snowflake specifically, so that users can not just read about a type 2 dimension, but actually take a script
(21:13):
that is optimized for Snowflake architecture, create one, and see what it looks like. And of course, I didn't stop there, because even with a type 2 dimension, there are transformations taking place. So it's important to tie this all back together. It's important to understand
(21:33):
that the transformational component of modeling is real, and is important to document, to understand, to codify. Again, it looks more like data lineage than an entity relationship diagram, but at the heart of it, it's still one entity being connected to another entity, not as a relationship, but as a dependency,
(21:57):
for example. Because those are the types of questions that come up in the data warehousing context: I'm seeing this pretty dashboard, but where is this information coming from? How do I trust it? How do I trace it back to the source?
And of course, there's a whole ecosystem of monitoring tools, lineage tools, and catalogs. And it could be a little overwhelming:
(22:18):
first of all, is it done at all? Second of all, if it's done, is it being done on somebody's prerogative of "let's just buy something so I can claim we have an installation of such a thing"? Or is it actually being implemented to answer practical business questions of lineage? And is it up to date, etcetera? So, a very long answer to actually a very complicated problem. I hope I covered the edges.
No, I think that's a great summary of that problem space. And digging into some of that technical transformational aspect, where you were mentioning things like type 2 dimensions, conformed dimensions, and dimensional modeling:
(23:02):
when you're in that context of building the warehouse as effectively the digital twin of the organization, one of the stumbling blocks or points of analysis paralysis that can often come up is deciding on the dimensional approach to take. Am I going to do the Kimball-style star schema? Am I going to do Data Vault? Am I going to go and explore anchor modeling? Do I just throw up my hands and do the one big table approach? And I'm wondering how you're seeing
(23:34):
teams try to leapfrog or short circuit some of that analysis paralysis and just start doing some building and discovery, to then be able to build the feedback loops that help to inform exactly which approach to take, or exactly how to implement whichever approach they're leaning towards?
That's a great question, and I think I have
(23:56):
a very unique perspective on it that could be very helpful, because especially in my role now as head of product at SqlDBM, not only do I work on the product side, but I naturally also meet with our customers. And not just customers, but people who are early in the funnel; maybe they're just trialing the tool.
(24:17):
And I get to see not just one use case, as I would, let's say, as an architect working with a company long term, or even as a consultant working on a project over the course of six months to a year. I get to speak with dozens of people a week, a month, and I get to touch base with them over those increments
(24:37):
of first week, first month, first six months, a year. And I think that the failures, the use cases that didn't pan out, are just as telling and informative as the ones that did. So
one of the things that I
I I kind of saw from this vantage point is
(24:57):
one of the counterintuitive things, I would say: technical prowess is not a guarantor of success for data initiatives.
Very early on, and this caught me off guard, I would be speaking to people, some of whom literally would brag about having met Kimball and Inmon and having been taught by them. So initially, naively, I would think, surely they know what they're doing. And then fast forward
(25:27):
six months, and I see that they're not really using the tool. Fast forward a year, and they're not renewing, because their plan basically fell apart. And what went wrong is they effectively said: well, I'm the architect. I'm the one that's gonna be doing the heavy lifting, so I'll just need a license for myself, and I'll get 'er done. Problem:
(25:48):
they are not involving the business. They're not asking for viewer licenses. They're not asking for an embed of this into their wiki or their Confluence so that, as they're building, the business can look on and sense check what they're doing. So that's number one. The number one biggest mistake I see, the biggest red flag
(26:10):
for a project that is likely to flounder, is a lack of understanding that the data team does not own the concept of what it is they're building. They own its implementation. They don't necessarily know, or by design, they don't know as well as their business colleagues, the intricacies
(26:31):
of the things they're supposed to be capturing.
Number two, and I think you kind of hinted at this: a lot of times, people want the awards. They want the accolades on their resume. They want the stripes on their epaulettes of "I implemented Data Vault," because they heard about it, they saw it succeed in another context, or "I did anchor modeling."
(26:57):
There are people who understand anchor modeling. They are comfortable in sixth normal form, and God bless them. And there are people that just kind of read the textbook and are ready to go all in and risk their company's data strategy on it. And that's another problem. Wanting to do something for the sake of doing it, I would say, is the second biggest reason I see implementations fail:
(27:23):
the need just isn't there to justify that amount of complexity.
And sometimes it's an ego problem. You might have somebody, like, a really talented data architect, a seasoned veteran, certified, what have you, but not their team. Their team might be relatively junior to them.
(27:45):
So they can't maintain it, or they make mistakes that the architect wouldn't have made. It's holding them back more than it's accelerating what they're doing. Of course, at a certain point, you can't survive without Data Vault or something like it to handle 50 source systems. You're not gonna try to do that by hand.
(28:05):
And the last one, I would say, is kind of the opposite of what I previously said. It's the previous example taken to the other extreme: not having control over the fundamentals. And when I say fundamentals, I really mean fundamentals. So, when I said, what is data modeling?
(28:28):
It's a shared concept of semantics, because semantics communicate a density of information that words, sometimes, when there's lots of them, simply cannot. So those semantics could be visual. We're not doing video here, but I like to show a slide of just a list of tables that you would see when you log into any database. You open the database, open the schema, you see a list of tables.
(28:52):
Even when it's like 10 tables and they're properly, cleanly named customer, product, sales, it might still take us, even an expert, some data wrangling to understand: okay, is a location a customer location, or is it a supplier location? A simple diagram with just boxes and lines can solve that problem
(29:13):
visually, instantly.
Other semantics are verbal. Earlier, we said the words "type 2 dimension." That's three words that, to many of you, likely symbolize so much more than what is in those three words. For example, you know you can expect the granularity of that table, that it's not just gonna be the business key. You understand that it's likely gonna have columns similar to from and to dates, if not others.
(29:42):
Maybe you even know the pseudo code required to load one effectively.
So these are the things that we expect: when I say type 2 dimension, you don't just go scratching your head and building something random.
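As a rough illustration of that shared shorthand, the load pseudo code often boils down to a MERGE plus an insert along these lines (a hedged sketch with hypothetical table and column names, not a recipe lifted from the book):

```sql
-- Hypothetical type 2 dimension load: close out the current version when
-- a tracked attribute changes, then insert the new version.
MERGE INTO dim_customer d
USING stg_customer s
   ON d.customer_key = s.customer_key
  AND d.to_date IS NULL                    -- only the current version
WHEN MATCHED AND d.address <> s.address THEN
  UPDATE SET d.to_date = CURRENT_DATE;     -- expire the old version

INSERT INTO dim_customer (customer_key, address, from_date, to_date)
SELECT s.customer_key, s.address, CURRENT_DATE, NULL
FROM stg_customer s
LEFT JOIN dim_customer d
  ON  d.customer_key = s.customer_key
  AND d.to_date IS NULL
WHERE d.customer_key IS NULL;              -- no current version exists yet
```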
The same thing goes for business concepts. When I say "active customer," that's a popular one, because nobody has an accurate definition of what an active customer is, or
(30:05):
there are probably multiple of them. And unless we know how to steward that knowledge, that context, to say: okay, we used to have just one definition; now we have three departments that have their own definitions. This is the company standard. This is the marketing standard, etcetera.
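A hedged sketch of what stewarding those competing definitions can look like once they are made explicit, side by side (hypothetical names and thresholds):

```sql
-- Hypothetical competing definitions, encoded instead of living in heads.
CREATE VIEW active_customer_company AS    -- company standard: 12 months
SELECT customer_id
FROM   orders
GROUP BY customer_id
HAVING MAX(order_date) >= DATEADD(month, -12, CURRENT_DATE);

CREATE VIEW active_customer_marketing AS  -- marketing standard: 90 days
SELECT customer_id
FROM   email_engagement
GROUP BY customer_id
HAVING MAX(opened_at) >= DATEADD(day, -90, CURRENT_DATE);
```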
(30:26):
These are the things where two people are kind of looking at each other and nodding and not necessarily thinking the same thing, and that is what leads to problems. So,
back to the point I was making: the other issue is, again, a lack of fundamentals,
(30:47):
the principles that drive success. They're just best practice patterns: you don't necessarily have to model in Kimball, but knowing what a fact versus a dimension is is very helpful just in trying to describe what type of information a certain table contains.
So some of these things are foundational, and a lot of times, like I said, they're absent from the vernacular. That's why I started with the very first question of what data modeling is. It's not necessarily an ERD. It's
(31:20):
the minimal effective dose of consensus that moves us forward as a team.
So we've been narrowing in from the overarching purpose, through the incorporation of the business concepts and doing the organizational modeling, to trying to map that to a technical representation,
(31:43):
where the general idea in data warehousing is to serve as some sort of digital representation of the real-world manifestation of the business and its processes, so that you can then do some analysis and reporting and maybe prediction on the overall functioning of the business. And that inevitably
(32:05):
involves the technical details that we've been touching on. Those technical details can be the dimensional approach that we've been discussing. It can be the specific database engine. It can be the ways that the data is represented or stored, whether that is in a lake or a warehouse or as free text, etcetera.
(32:29):
And I'm curious what you're seeing as the most significant impacts that the technical decisions tend to have in either accelerating or hindering the capabilities of the organization to actually implement that digital twin representation of the business.
It's a mixed bag because
(32:50):
technology either accelerates our progress or just allows us to fail faster. If it's complicating things more than it's adding value, then it's doing the opposite of what it's meant to do. So there's plenty of use cases for everything that you mentioned. Starting, let's say, with semi structured data: we've seen relational
(33:11):
kind of set the groundwork. But sometimes, especially when we're dealing with IoT, with sensor information, with logs, things are semi structured. Or we're actually using our data lakehouse or data lake to its full potential, and it has unstructured data.
These are just ways of encoding, of being able to quickly process the information as it's being loaded. But how do we ensure that things inside these
(33:42):
nested relationships are accurate? Our brains are not great at this. Just like we're not great at visualizing exponential growth, we're not great at visualizing nested data, especially once it goes beyond one or two levels.
And, in fact, in one of the exercises in my book, I set out to prove
(34:02):
why the schema on read approach failed. And it's not because it's slower. It's not because modern systems can't handle semi structured data just as well as structured. It's because "is" does not imply "ought," fundamentally.
It's actually the same problem as with relational, but the problem compounds further with semi structured: once you
(34:27):
tear it apart, you realize that there are multiple interpretations of it. You just have certain properties under parent items, but you don't necessarily know: are they their own entity? How many of them could you possibly have? And if that specific feed, let's say, doesn't contain a certain property, you don't know that the entity even exists.
(34:49):
So how are you meant to verify this with your business stakeholders, just to get a conceptual grasp of what you're looking at? And
when we're talking about everything that is possible today, there's so much stuff coming at us in terms of
(35:10):
what we can possibly process and how we can process it. Just by, I guess, my experience and my background, I'm not a data engineer, so I'm not gonna weigh in on what is most efficient or what is most cost effective. I kinda limit myself to going back to the understanding that we can process anything we set our minds to process, but
(35:36):
how do we make sure that we're dealing with information that we understand and processing it in the correct way, no matter what the format? If I go to the example I used in my book, it's just pirates and pirate ships, things people love, so as not to get overly technical. It doesn't matter that it's in a JSON format.
(35:57):
What does matter is that I can't bring JSON to my business team, or my captain in this case, and say, hey, can you assess the crew off of this JSON file? I need to show them something basic, just two boxes that say: this is your captain, this is your ship. A captain has a ship. A crew can have these weapons
(36:20):
or cannot, etcetera. Is this correct? And that's something that they can very quickly confirm or deny. And for me, I'll limit myself to just saying: use every tool in your toolkit, but make sure that you also have a reference point for understanding the information and sense checking that information with people who are less technical than you might be.
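To ground that in code, here is a hedged Snowflake-flavored sketch (hypothetical feed and field names) of pulling such a nested document apart:

```sql
-- Hypothetical semi-structured feed landed in a VARIANT column v. Until it
-- is flattened, nobody can say whether crew is its own entity, or how many
-- weapons a crew member may carry.
SELECT
    v:ship.name::STRING    AS ship_name,
    v:captain.name::STRING AS captain_name,
    c.value:name::STRING   AS crew_member,
    c.value:weapons        AS weapons   -- still an array: entity or attribute?
FROM raw_pirate_feed,
     LATERAL FLATTEN(input => v:ship.crew) c;
```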
(36:46):
And then, inevitably, we have to address the elephant in the room that is AI and the impact that it is having on this overall practice of data modeling, technical representations of the business processes, etcetera.
And
given the broad applicability
of LLMs,
(37:06):
I can imagine them being used across each of the different stages of the data modeling life cycle: translating documents, business policies, conversation transcripts, etcetera, into some summarized form of "these are the things that we care about; this is what the semantic representation should be for this particular process,"
(37:29):
doing the actual SQL generation or schema generation or ERD generation, etcetera. And I'm curious how you're seeing the introduction of LLMs into this overall process and life cycle both accelerate and potentially derail that overall effort?
(37:50):
Alright, a lot to unpack there. Let's start, first of all, by acknowledging that, like you said, LLMs are ubiquitous. AI is here to stay, and if you're not keeping up, you're gonna get left behind like never before. Just as you mentioned, it's gonna be there along the way at every step of your data journey, from the way data is generated,
(38:14):
to the assistants that help you along the way, whether it's creating descriptions or creating the data model itself, to asking business questions of that data and making predictions around it. Let's
kind of go with the theme of the previous question of
(38:34):
AI as a tool: how do we make sure that the tool is working with us and not against us? The lessons of previous iterations and previous business hypes should not be forgotten. We just hearkened back to schema on read,
(38:54):
which was the debacle of the twenty-tens or so, when big data was supposed to solve everything. The data lake was gonna be the new best thing. We don't have to do any modeling. We don't even have to do any transformation. The system will figure it out. Well, we know how that ended.
AI is also promising that the system will figure it out. And in this case,
(39:16):
it actually has a chance of succeeding at that. However, what is an LLM? What is AI at heart? It is a pattern-based prediction generator. I'm oversimplifying, but let's just run with this definition for now. If we're going to ask AI to answer a business question, to understand
(39:36):
a whole volume of context all in a single go, wouldn't it make sense for us to make sure that it's doing the least amount of guessing and interpretation possible? Meaning, wouldn't we try to clean up the dataset before we tried to make inferences out of it? The difference between
(39:58):
AI today and the things we were doing, let's say, in the early twenty-tens is cost. Transactional costs have been falling, storage costs have been falling; that has been the trend. And AI has just flipped that on its head, meaning the cost of being wrong has just gone up. It's not a matter of reloading a multi-terabyte table
(40:21):
using ELT; that we can live with. Retraining a multimillion dollar model, that we cannot live with. Heads will roll. Therefore,
I would encourage people not to believe the hype, and not to indulge the magical thinking of "I can just throw my hands up and AI will do it for me." So in this regard, data modeling,
(40:46):
again, not drawing entity relationship diagrams, but cleaning up your damn dataset, is more important than it's ever been.
I've heard plenty of examples where people have kind of started doing semantic models and AI models around their data, and
(41:07):
they immediately bump up against the fact that you can't build a semantic model without having a pretty good understanding of your conformed physical model. Very quickly, it exposes the duplications, the misalignments, the double names. And it's all part of the same process. It's a spectrum of
(41:28):
how detailed versus how general what you're talking about is, whether it's "customer" generally or "customer ID" specifically in your physical system.
If you have "customer" and you have "client" across two different tables, what do you expect the result will be when you run AI analysis on that? So these are the kinds of basic things where, you know, you don't have to be
(41:51):
an AI engineer to understand that you just have to go back to the drawing board and make sure that your fundamentals are correct. But now to the other question of where else AI comes in along that journey. Well, here
we open up the doors to the realm of possibility. All of us have examples in our lives where AI has come in and fundamentally
(42:16):
systematized something or taken a lot of drudgery out of our day by making it quick and easy.
So if you need AI to help you get to that point of conforming your dimensions, cleaning up your datasets, or even flagging the parts of the
(42:38):
data landscape that are problematic, AI could be very good even in serving your AI strategy, kind of cleaning up for itself before running more system-intensive and cost-intensive processes.
Again, I'm not gonna list every step of the way where AI can step in and make your life easier, because those are infinite.
(43:02):
What I will say is what I started with: you have to have, by this point, a reflexive instinct to see where AI can help.
If you're faced with a problem and your first instinct is "how do I manually figure this out?" and not,
(43:24):
reflexively, "where can AI make this easier?", you're going to get left behind. You're just gonna make yourself obsolete.
But again, once that reflex is in place, how is AI gonna help you clean up your data? Once you've gotten to a place where you're comfortable throwing AI resources at something,
(43:45):
you've already ensured success, basically, as much as you can, by making AI's job as easy as possible.
And to your point, you can't just throw an LLM at the problem and expect it to solve everything without it having some means of grounding itself in the context and semantics of the organization,
(44:08):
where you highlighted that trying to build that semantic representation for the LLM to feed off of, at least on the consumption side, can help highlight some of the shortcomings in your existing data estate. And
one of the recent developments that has been accelerating rapidly is the introduction of tool use, and in particular, Model Context Protocol servers.
(44:33):
And I'm wondering how you're seeing that inclusion of tool use, as a means of building that grounding in the business context and business realities, help to provide guardrails and improve the effectiveness of incorporating LLMs into each stage of the life cycle, from
(44:54):
ideation and definition through to consumption and exploration?
That's a good question, because MCP is still relatively new, and data teams span the gamut. There are those that just operate in SQL; they haven't even really dipped into the Python realm of data analysis.
(45:16):
And then there are modern teams that are comfortable with calling an LLM using an MCP server. But they're still relatively few.
And, especially in the data space, I would say we're still early. For example, if you're staying up to date with some of the innovations of the cloud data platforms
(45:40):
like Snowflake and Databricks, they're introducing features that just boggle the mind, even for someone like me that has, you know, come up through the ranks of analysis and data architecture. You can effectively write a prompt inline in SQL. So you can select column one, concatenated with column two, and then, in your third column, put a prompt that says: analyze the sentiment
(46:06):
of the string that I've concatenated. So, effectively, for every row in your result, you're seeing an LLM response.
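For a sense of what that looks like, here is a hedged sketch using Snowflake's Cortex functions (the table, columns, and model choice are hypothetical):

```sql
-- Hypothetical per-row LLM calls written inline in SQL (Snowflake Cortex).
SELECT
    review_id,
    SNOWFLAKE.CORTEX.SENTIMENT(review_title || ' ' || review_body)
        AS sentiment_score,   -- numeric score, roughly -1 (negative) to 1
    SNOWFLAKE.CORTEX.COMPLETE(
        'llama3.1-8b',
        'In one word, is this review positive or negative? ' ||
        review_title || ' ' || review_body
    )   AS verdict            -- one LLM response per row
FROM product_reviews;
```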
And as we said, with great power comes great responsibility. You're not just gonna be spending your regular compute tokens on this; if you're making mistakes there, it's gonna get expensive. And the use cases are infinite. In the past, just to get some kind of basic sentiment analysis, I might have had to do some gnarly
(46:36):
regex just to tease out positive and negative keywords. Now AI can pretty confidently tell me: this is a positive review; this is a negative review. So it does make life easy. Some of it is provided by the platforms themselves.
A lot of teams, as you mentioned, are
(46:56):
running their own models for security reasons, performance reasons, privacy reasons, which they can effectively hook into using MCP or just regular API calls,
because they have use cases that demand it. Again, I'm not really the best person to ask at this stage in the game, just because
(47:18):
I'm currently not hands on with any specific data problem; I'm working on a tool that accelerates the life of the people who are. What I do hear, specifically for the advanced use cases, from customers that I interact with: for them, there's the technical challenge, kind of the same thing we just discussed, but there's also the challenge of making sure that, once they write those pipelines,
(47:49):
the core understanding of what it is they're analyzing is there for them.
So it's almost like the old is new again. Sure, you can have infinitely new and advanced processing systems on top of your old columns and rows. But the fundamental
(48:09):
understanding of what a customer is, what an active customer is, where the table lives, and what the synonyms for customer are: all of that has to be encoded, unfortunately. Otherwise, you're setting yourself up for failure, and a failure where you're gonna blame AI. You're gonna say it isn't smart enough, when your business users probably can't even figure it out either if you have so many inconsistencies.
(48:35):
The context has to live somewhere. We started with Joe Reis's famous quote: the lack of a data model is still a data model, albeit a crappy one. If it's not systematically encoded, let's say, in a semantic model, it's living in your head. And if it's living in your head, it's not available for AI to consume and orient itself. And that's gonna be a problem, and an expensive one.
(48:59):
In your work of helping organizations approach that overall challenge of data modeling as more than just a technical exercise, what are some of the most interesting or innovative or unexpected ways that you've seen them develop and implement the overall strategy for that?
Okay, that's a great question because,
again, from my unique perspective,
(49:20):
there are people who are hands on with data in the trenches, and there are people who are hands on with the engineering side of things. I am not a consultant, and just by design, for privacy reasons, I'm actually not looking at what our customers are building unless they're willing to share that with me. What I do see are the
(49:45):
frameworks and the technical demands around how they operate. And, again, if you're a consultant and you've seen hundreds of use cases, maybe some of this won't surprise you. But I've seen SqlDBM grow from just a modeling tool into an enterprise modeling platform, and I've seen the needs of the customers evolve from, again, just the basics: can I document something? Can I draw the relationships?
(50:13):
Do you support all of the database objects and properties natively? Once the tool evolved beyond that point, the types of requests we were getting were: it's not one project, it's multiple data domains, and we wanna communicate. We want to make sure that people are consuming our data products.
(50:35):
So it's things like, first of all, data discovery: how do I discover that a certain data source or a certain data domain even exists? How do I reference it without copying it? The fact that we're an online
platform means we make it very easy for users
(50:57):
to take things like entities from one project and create a pointer, a reference, or a read-only copy in another project, and model downstream from that. The other thing we're seeing,
and I'm not exaggerating, is that we have data architects that oversee dozens, if not upwards of a hundred or more, data teams. So some of our features, like global standards,
(51:23):
are ways where a single data architect or a data team can enforce things down to their child projects: what are the conventions we're gonna follow? What are the templates we're gonna use? How are we gonna name things consistently?
Not only is it important for enterprise teams to define that once and only once, but also to make sure it's not just suggested, but enforced. So we have guardrails in the tool
(51:51):
to make sure
that happens.
Once you get to a use case of that size, you would expect... you're kinda thinking openness. But a lot of times, it's openness with guardrails. A lot of companies, especially when they provide data as a product or data as a service,
(52:12):
have internal teams that are encouraged to share, but share to a point. So it's not even about making something accessible. It's making something accessible and making sure that people only see the parts of it that you want them to see. So there's also privacy and RBAC restrictions, even at the data model level. Another part is
(52:34):
security and compliance. With everything that we're building, it's easy to forget that it's not just data; there are laws around the data. It's easy to forget until you get slapped with a million dollar fine. So things like that are also very important: not just the data itself, but the metadata, the tagging, the
(52:55):
PII flags, the stewardship around: is this a trusted table, or is this a staging area? Can I consume it? The more mature an organization becomes, the more even these things cannot live in somebody's head. They need to be systematized. They have to be fed in through the same CI/CD process that deploys
(53:19):
your object, the same process that stamps the metadata tag saying this can be promoted up to the reporting layer and this cannot, or that this should have certain tags and policies applied to it to make sure that we're not accidentally exposing it to people who shouldn't have visibility.
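In Snowflake terms, that stamped metadata might look something like this (a hedged sketch with hypothetical names):

```sql
-- Hypothetical compliance metadata applied by the same CI/CD run that
-- deploys the object.
CREATE TAG IF NOT EXISTS pii_level;

ALTER TABLE analytics.customer
  MODIFY COLUMN email SET TAG pii_level = 'high';

-- A masking policy so roles without clearance never see raw values.
CREATE MASKING POLICY mask_email AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('COMPLIANCE_ADMIN') THEN val
       ELSE '***MASKED***' END;

ALTER TABLE analytics.customer
  MODIFY COLUMN email SET MASKING POLICY mask_email;
```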
So for me, the
(53:39):
evolution has gone from simply: how do I document one table? What are its columns? What do the columns mean? All the way up to: how do I search
across an organization where we have upwards of a hundred data teams or more? Each of those data teams has projects
(54:00):
or data domains that have thousands of tables, and each of those has definitions and relationships. How do we secure it? How do we share it? How do we find it? How do we search across everything? Those are the main challenges.
And those are unique and interesting challenges to solve. There are plenty of people working on the
(54:21):
engineering side, and the data modeling side is equally important, especially once you get past a certain size as an organization.
In your experience of working in this space, talking to customers, and coming to grips with all of the different details and nuance around data modeling as that sociotechnical exercise, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
(54:47):
From my perspective, I think the most interesting thing is just the wide gamut of methodologies and standards that people employ. Even as an architect of nearly two decades in the industry, I've only seen what I've seen. I've been a consultant, so I've seen some of the accelerated pace of product development.
(55:09):
I've worked at companies as a full-time data architect, so I've gotten to know one industry, or several industries, in depth. But with dozens of people a week that I speak to from our client side, I have seen every type of standard imaginable.
Learning to kind of
(55:30):
design for the extremes, I guess, has been very important in my role.
Not confining myself to the best practices, because we certainly want to guide people toward best practices, but also respecting that they might have their own ways of doing things. They might also have their own set of exceptions, so even when we set rules, we wanna be able
(55:52):
to grandfather certain legacy things into their design.
That's number one. Number two is that the degree of data maturity really matters. As a data architect, before joining SqlDBM, I think I might have been a little bit more stringent in the things
(56:14):
I recommended, or, again, the best practices that I would tend to follow or recommend others follow.
I have since learned that it's very much relative to the data maturity of the team and the company as a whole, both for the consumers as well as the producers of the data. So sometimes
(56:36):
I, you know, go all the way down to just one big table: I don't think you should complicate your life more than that. Other times, I see people effectively building Data Vault on dozens, if not upwards of a hundred, source systems and managing it effectively.
So I think that's often underappreciated, because nobody wants to admit: okay, as a team, maybe we're still fairly junior. Maybe we have a few rock stars, but
(57:04):
as a team, we just can't lift this collectively.
And... let's see if I can think of another one that I haven't already mentioned.
And so as you continue to work in this space and explore the evolution of the marketplace and the general growth of data modeling
(57:27):
as an established practice in organizations of varying size, because up until the past few years, it was largely the domain of larger organizations, and smaller companies just did whatever scrappy thing they needed to get by and didn't necessarily adopt these formal practices: I'm curious, what are some of the ways that you are staying up to speed, or particular
(57:49):
aspects of the problem space that you're particularly interested in exploring further?
Yeah, that's a great question. I think that one obvious consideration is that, as the ecosystem grows, the demands on the data modeling tool grow along with it. So people aren't just limited to building
(58:09):
a conceptual model or a physical model. They're gonna be using that model downstream, or rather upstream, of, let's say, dbt. So instead of just generating DDL, are we also able to generate dbt YAML and hand that off to an analytics engineer?
As AI
evolves,
(58:29):
things like semantic models play a much bigger role.
Nobody was asking for semantic models a year ago; now they're a must have. So can we take that same information and generate it as a semantic model? But what's really changing
is the format. Most of the core of that information, whether it's a semantic model or
(58:50):
a dbt model, is effectively the same information being conveyed in a different format. In fact, if the words "semantic model" scare you, or you're not used to seeing a semantic model, it's almost identical to a glossary. If you go all the way to the other side of the spectrum, past logical, past conceptual, even simpler than that, a data glossary
(59:12):
has the same fields as a semantic model. It's just that one is almost like index cards; the other is in YAML. But those are just some examples of
more tools. You mentioned MCP servers earlier. Before, we had an API, and now we have an MCP server on top of that. They're still getting effectively the same context, the same information,
(59:34):
and the same nuance through those methods.
One thing, as I said, is keeping up with the industry demands: being able to do the same things we do with relational objects using semi structured NoSQL, because that plays an ever bigger role in people's tech stacks. That's just on
(59:54):
the technology stack side. Then there's everything I said earlier about having a different set of challenges as organizations get bigger. So it's not about interpreting one data model or one set of information, but sharing it across teams, making sure we're not duplicating things unnecessarily,
(01:00:15):
making sure we're securing things, and making sure that we're tracking data lineage, not just from source systems, but also across the company. All of these are issues that our tool helps solve. So, again, I've
worked at large companies; I've seen what I've seen. But having the
(01:00:35):
privilege of talking to our customers and having them share their use cases is, to me, an education, an accelerated education that I couldn't have gotten anywhere else short of joining one of them and spending a couple of years understanding the nitty gritty of everything that they're working on. But the scope of what they're building is enormous,
(01:00:58):
and only a very disciplined and systematic approach to doing it is gonna help something like that succeed. So for me, it's also been a little humbling to see what our customers are able to manage and build at scale, and how they're doing it has been eye opening.
(01:01:20):
Are there any other aspects of this space of data modeling as a combination of organizational and technical exercise, your experience in the space, or the work that you're doing to help evangelize the best practices around how to implement it at organizational scale, that we didn't discuss yet that you'd like to cover before we close out the show?
Well, I guess I'll make a not so shameless plug and just recommend my book again. It's called Data Modeling with Snowflake.
(01:01:50):
Don't get hung up on the Snowflake part of it. I just wanted to tie it to one specific system so that someone can actually iterate and try these things out as recipes. And Snowflake is as good a candidate as any, because you can literally get started for free; with a trial, it doesn't cost anything, so everyone can follow along.
(01:02:14):
And it's a really great resource for understanding why data modeling matters. It's a book that ties the business value to the technical side of it, which is what we on the engineering side and the data side tend to focus on. And if there's one message that I would like to leave your
(01:02:36):
listeners with, it's the fact that data modeling is a team sport, and that team consists of business users and technical data users, data engineers, and data architects. And it's folly to forget the business side. So to close, I would just say: understand that a blueprint is important. Understand that your business is fundamental in helping shape that blueprint.
(01:03:00):
Understand that they are not as technical as you are, and things that are obvious to you might not be so obvious to them. It's almost a beginner's mindset: why should somebody else care about the important work that I'm doing, and how do I get them involved and make it easy for them to contribute?
And just starting there is a really good framework for making sure your data initiatives
(01:03:24):
are off to a good start and have a higher chance of succeeding.
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
(01:03:46):
I would go with a lack of standards overall. Which is not to say that we don't have standards; it's usually the opposite problem. We have too many; take your pick.
And it makes it really difficult to standardize and make things effective. For example, you can ask anybody,
(01:04:09):
any data engineer: if I wanted a description, where would I put it? In the comment of a column or a table. That much is clear. But beyond that, how do I tag this? Is it dbt table properties, or is it Snowflake tags, or take your system of choice? How do I expose this? We're starting to see things in the catalog space.
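That one near-universal convention, next to one of the many competing ones, in a hedged sketch (hypothetical names; assumes the tag already exists):

```sql
-- Descriptions in comments: about the only universally agreed-upon spot.
COMMENT ON COLUMN analytics.customer.email IS 'Primary contact email';

-- Beyond that, conventions diverge: Snowflake object tags are just one of
-- several competing ways to attach richer metadata.
ALTER TABLE analytics.customer SET TAG data_domain = 'crm';
```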
(01:04:32):
There's still no single standard. There's still no single standard for, you name it. And that can trip people up in terms of which tool to bring in if it's not standardized. Can we use a tool? Can we use this tool? How do we make sure that the tools talk to each other? It's hard, and it's overwhelming.
(01:04:52):
There's no right answer, and it's holding the industry back, I think. On the other hand, we've seen these cycles in the data space: first we couple, then we decouple. And, fortunately, when that happens, we at least tend to learn our lessons along the way about what works and what doesn't. So we tend, overall, to be moving in the right direction,
(01:05:16):
but it still makes it very difficult to recommend something and say that this is a one size fits all solution. So it's really about keeping up to date, staying current, and being able to mix and match when you need to.
Alright. Well, thank you very much for taking the time today to join me and share your thoughts and expertise on this overall practice of data modeling and some of the ways that it manifests
(01:05:46):
in organizations
and the role of technical teams in that overall process and how to best engage with the business. So I appreciate the time and effort you're putting into helping popularize that approach, and I hope you enjoy the rest of your day.
Thank you so much. It's been a pleasure. Great talking to you.
(01:06:11):
Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
(01:06:35):
with your story.
Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.