Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Michael R. Gilbert (00:00):
I was talking with a CEO the other day who was lamenting the endless spend on technology for things that were, as he put it quite frankly, hard to understand the value of.
He'd just been listening to his data analytics team telling him that they needed to invest in a data lake. Even after listening to their explanation, he wasn't clear what a data lake was, why it was different from the incredibly expensive data warehouse they already had, and why tech people
(00:23):
needed to keep making up weird names for all these things. I smiled, offered to buy him a coffee and help him answer the first two of those questions. The last one would have to wait for another day. So, data lakes vs data warehouses, let's talk about it. Welcome to the Technology Sounding Board.
(01:07):
I'm your host, Michael R. Gilbert, and we're talking about data lakes, data warehouses and the differences between them. Perhaps it's best to start with a brief review of the conditions that led to the development of each of them. We'll go in chronological order and start with a data warehouse, but before we do, I'd like to put an idea in your head that will hopefully make more and more sense as we go on, and that is: if the benefits of the data accrue predominantly to the
(01:30):
consumers of that data, but the costs accrue instead disproportionately to its producers, then, unless you have an external marketplace for it (i.e. you're in the business of selling data), failing to account for this imbalance will derail any and all data projects.
Alright, so with that behind us, why did we create data
(01:51):
warehouses in the first place?
Well, in the beginning things were easy, that's to say, they were incredibly difficult and expensive, but universally so. All data processing for a given firm was done by a centralized team using vast monolithic mainframes. Getting data in or out of them was slow and expensive, but they were the only source of the truth. Note that phrase, as it will re-emerge shortly.
(02:13):
During the 80s and 90s, we saw the explosion of microprocessor-based computing. First we saw departments having their own Unix boxes, and then we saw PCs everywhere. Tech skills also grew wildly, and soon enough every department could have their own application written to solve their specific problems, and data exploded. Unfortunately, this data was also all over the place, both geographically and philosophically.
(02:35):
Different groups within the same company would have different views of parts of the company's performance and, worse, based on how you interpreted them, you came up with very different opinions on how things were going. So what was to be done about it? This couldn't go on. Such was the environment into which the data warehouse was born. Organizations everywhere were coming to the realization that
(02:55):
they needed one source of the truth. Now, I'm not sure which consulting company came up with this phrase in relation to company data. The earliest reference I know of is Galileo, and he was referring to God. But whoever it was that spun it into a corporate data mantra was extremely successful, and by the turn of the century, just about every large company was investing in a centralized data warehouse under this banner.
(03:17):
The idea was simple enough: let's get feeds from all these operational systems, that's our name for the systems that actually do the work, as opposed to the reporting systems that would stem from the data warehouses we're talking about. But we're getting ahead of ourselves. So let's take these feeds of data, perhaps on a daily basis, and load them all into one massive centralized database
(03:38):
that IT would look after and keep in good shape. Then we can build the aforementioned reporting applications on top of this data warehouse, and we can trust that everyone is seeing a picture built on the whole view of the company and everyone will be getting the same answer. Of course, as simple as the idea is, the devil is in the details. The first problem is that all of these operational systems were built individually to do their work, without worrying about all
(04:00):
the other systems that overlap with them. This meant that the way the data was mapped out (the data schema, as we would say) was different in each system. In order to look up an entity that we might understand, say a customer, you'd need an identifier, a key. Well, the key for the same customer in different systems was incredibly unlikely to be the same, so relating two views
(04:21):
of the same customer that came from two different operational systems was difficult and fraught with opportunity for errors. Solving this problem led to the idea of the ETL process, that's extract, transform and load. You would extract the data from the operational system, transform that data into a standardized schema for the data warehouse, and that would mean changing the data layout and
(04:41):
updating all the keys that it used to a standardized set to be used whenever that entity (in our example, the customer) was referenced. Then you would load this transformed data into the right place in the data warehouse.
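For readers who'd like to see that in code, here's a minimal sketch in Python of an ETL step like the one just described. The file names, column names, key mappings and warehouse table are invented purely for illustration; a real ETL pipeline would be far larger and usually built with dedicated tooling.

```python
# A toy ETL step: extract rows from an operational system's nightly feed,
# transform them into the warehouse's standard schema and keys,
# then load them into a warehouse table. All names are illustrative only.

import csv
import sqlite3

# Hypothetical mapping from the operational system's customer keys
# to the warehouse's standardized customer keys.
KEY_MAP = {"OPS-1001": "CUST-000017", "OPS-1002": "CUST-000042"}

def extract(path):
    """Extract: read the nightly feed the operational system produced."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: rename columns to the warehouse schema and remap keys."""
    out = []
    for r in rows:
        out.append({
            "customer_key": KEY_MAP[r["cust_id"]],  # standardize the key
            "order_date": r["ordered_on"],          # standardize column names
            "amount": float(r["total"]),
        })
    return out

def load(rows, db_path="warehouse.db"):
    """Load: insert the transformed rows into the warehouse table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales "
                "(customer_key TEXT, order_date TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES "
                    "(:customer_key, :order_date, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("orders_feed.csv")))
```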
And there's the main rub.
Writing these ETLs is a vast, complicated and highly skilled operation. It includes designing a data schema that's massive, large
(05:02):
enough to map every interesting entity in the whole company (and you can imagine just how many different things any non-trivial organization might want to track) and then creating and maintaining key mappings for each of these operational systems to those entities. But it's worse than it seems. In the UK we talk about projects being like the painting of the Forth Bridge. It's an iron bridge about two miles long, which crosses the
(05:25):
River Forth in Scotland. Being made of iron, it's important that the paint is in good shape or it would rust. There's a painting crew that starts at one end and works towards the other in a process that takes three to four years, which is about how long the paint will last in the nasty weather that is native to Scotland. On reaching the end, they must immediately return to the beginning and start putting down the next coat.
(05:46):
Such it is with data warehouses. It takes a vast effort to map all the systems we use into it, but we're continuously adding to and changing our operational systems. Each time we do, the data warehouse and the ETL processes must change. Another major challenge is that the databases we use for our operational systems are designed for very different needs than we have in a data warehouse.
(06:07):
Operational systems tend to write new data or update old data frequently, each time a customer places a new order, for example, whereas data is typically added to the warehouse every night and isn't really changed after that. It's read a lot, however, and because a data warehouse contains all the data from all the operational systems for all time, it is very, very large.
(06:27):
Traditional databases don't scale well, and certainly not cheaply. That means systems that support data warehouses tend to be very large and very expensive. If you recall the idea that we started with, that there's a fatal imbalance in play when the benefits of the data accrue to the consumers of it but the costs accrue disproportionately to the producers, perhaps you can see where this is going.
(06:48):
Getting data into the data warehouse is a very slow and expensive project, but the act of putting it in generates no value at all. The value comes from the small percentage actually used, and it's usually used by different people than the ones who are charged with putting it in there in the first place. This is not a recipe for high-speed success and innovation.
(07:09):
Meanwhile, back in Gotham City, things have not been standing still in the technology space. If we peg the birth of the data warehouse at approximately the turn of the century, then by 2010, and certainly by 2020, things had changed radically. The web had exploded, leading to massive sources of new data, including data from outside the company. The cloud had made storage and compute both very cheap and
(07:31):
readily available, and the world of data science had become a mainstream thing. Teams like the marketing department and the newly formed data analytics teams wanted to be able to import data into their new digital platforms and test ideas within days, releasing new products in weeks or, at the very most, months. They weren't interested in waiting months or even years to get new data feeds into the data warehouse.
(07:52):
They weren't willing to invest the kind of money that IT wanted for it either and, in any case, they often weren't using the typical SQL data skill sets that they'd need in order to access a traditional data warehouse. They had new languages like Python and R that read data in from files, not databases. At the same time, a couple of new technological advancements had arrived: 1) the creation of a massively distributed file
(08:14):
system that could spread data across thousands of cheap, commodity-level machines, each breaking their part of the data into thousands of smaller files. And 2) a technique called MapReduce, which allowed you to run queries and calculations across these thousands of small files, to come up with a consolidated answer with blistering speed, using very cheap cloud-based resources.
To understand MapReduce and why it's game-changing, imagine
(08:37):
that you want to know how many type A widgets you sold last year, and that you have last year's sales written out as a list showing everything sold, with each individual sale appearing on its own line in the order it was sold. It'd probably be a very long list, right? Well, you could answer the question by starting at the top of the list, running down it and keeping a tally of all the type A widgets you find, and when you get to the bottom, you'd
(08:58):
have your answer.
Okay, now imagine every day's sales are on a different list, 365 of them. Now we could give each list to a different person (this would be mapping the problem), and each person could keep a tally of the type A widget sales from their list. When they're all done, they could give that tally back to you, and all you need to do is add up 365 numbers and you have your answer.
(09:21):
That would be the reduce step.
You can see that we can extend this idea. Imagine that every one of your 20,000 stores has its own daily sales list. Now, instead of giving our 365 people a single list, we give them 20,000 lists each, one for each store's sales on that day (our map stage again). They in turn give one of their lists to 20,000 people working
(09:44):
for them.
Another map stage.
Each of their people tallies the sales of type A widgets on their list and hands back their answers to our 365 workers, who sum up those 20,000 store-based answers to get a daily total across all stores. A reduce stage. Finally, they hand back the 365 daily answers to us and we sum them up to get the answer for all stores across the whole
(10:06):
year, another reduce stage.
See how we broke down a huge problem into tiny steps that we can spread across cheap, low-skilled workers? That's the idea behind MapReduce, and it's very, very powerful. It allows us to run queries across absolutely immense amounts of data incredibly quickly by leveraging the burst compute capabilities that the cloud can offer at very
(10:26):
low costs.
We might be using a lot of hardware, but only for a few seconds, or even parts of seconds, at a time.
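If you like to think in code, here's a minimal sketch in Python of that map-and-reduce idea, using store-by-day sales lists like the ones described above. The data layout and function names are invented for illustration; a real MapReduce framework would spread this work across many machines rather than a single process.

```python
# A toy MapReduce: count type A widget sales across per-store, per-day lists.
# Each "list" is just a Python list of (product, quantity) records here;
# in a real system these would be files spread across a cluster.

from functools import reduce

# Illustrative data: sales[day][store] -> list of (product, quantity) records.
sales = {
    "2024-01-01": {
        "store-001": [("widget-A", 3), ("widget-B", 1)],
        "store-002": [("widget-A", 2)],
    },
    "2024-01-02": {
        "store-001": [("widget-B", 5)],
        "store-002": [("widget-A", 4), ("widget-A", 1)],
    },
}

def tally_store(records, product="widget-A"):
    """Map (innermost): one worker tallies a single store-day list."""
    return sum(qty for prod, qty in records if prod == product)

def tally_day(stores):
    """Reduce (per day): sum the per-store tallies into a daily total."""
    return sum(tally_store(records) for records in stores.values())

# Final reduce: sum the daily totals (365 of them in the story, 2 here)
# into a single yearly answer.
yearly_total = reduce(lambda acc, day: acc + tally_day(day), sales.values(), 0)
print(yearly_total)  # -> 10 type A widgets across all stores and days
```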
Let's throw in one last idea, that of columnar data stores. You see, a traditional database would store these lists exactly as you would imagine, by writing out a whole line of data for each individual sale: it starts with the store
(10:47):
identifier, then the date and time, then the customer, then the quantity, then the product and the unit price, extended price, discount, tax, blah, blah, blah, until we've captured everything we need to know about that particular sale. Now we move on to the next line for the next sale and write that down. It's logical and it's exactly how we think.
It's not particularly useful for the way we want to analyze
(11:07):
the data, though.
Take our previous MapReduce example. We need to know the store and the date to split them out into our different workers. Remember, this data is recorded in the store by day. This is probably already information we have by virtue of which list we're reading anyway. The only other two pieces of information we need are which
product this is, so we can ignore anything that isn't a type A
(11:30):
widget, and how many we sold, so that we can update the tally appropriately. We don't care about the price, the tax, the customer or anything else. So what if we still write these lists out, but we write out each entry for the first column for the whole day, then each entry for the second column, the third column and so on?
Now I only need to give my workers just the two columns
(11:54):
they want, and I can completely ignore the data that I don't need, and maybe I don't even understand it anyway. Moving data takes time. Not moving data is really, really quick. So you can see how storing data in this format is incredibly useful. Of course, you couldn't do this in an operational data store. The whole point is that you have to know when you've written
(12:14):
all of the data elements for column 1 before you can start writing column 2. But if you're still selling products for that day at that store, you don't know that this is the last one. In fact, at the start of the day, you're seriously hoping it isn't, but once you're done with the day and there can't be any more sales, you certainly could store it this way.
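Here's a minimal sketch in Python of that row-versus-column idea. The sale records are invented for illustration; the point is simply that a columnar layout lets a query touch only the columns it needs.

```python
# Row-oriented vs column-oriented storage of the same sales list.
# The records below are illustrative only.

# Row-oriented: one complete record per sale, just as it was written.
rows = [
    {"store": "store-001", "product": "widget-A", "qty": 3, "price": 9.99, "tax": 0.80},
    {"store": "store-001", "product": "widget-B", "qty": 1, "price": 4.50, "tax": 0.36},
    {"store": "store-001", "product": "widget-A", "qty": 2, "price": 9.99, "tax": 0.80},
]

# Column-oriented: the same data, but each column written out as its own list
# once the day is closed and no more sales can arrive.
columns = {
    "store":   [r["store"] for r in rows],
    "product": [r["product"] for r in rows],
    "qty":     [r["qty"] for r in rows],
    "price":   [r["price"] for r in rows],
    "tax":     [r["tax"] for r in rows],
}

# To tally type A widgets we only need two columns; everything else
# (price, tax, and so on) never has to be read or moved at all.
tally = sum(
    qty
    for product, qty in zip(columns["product"], columns["qty"])
    if product == "widget-A"
)
print(tally)  # -> 5
```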
Enter the data lake.
Again, the idea is simple.
(12:35):
Forget about boiling the ocean, painting the Forth Bridge or whatever metaphor you like here. Forget about ETLs and standard schemas. Anytime you want to do some research, we'll figure out the data we actually need. If it's already in the data lake, great. Otherwise, we'll just extract it from wherever it is and load it into the data lake, in whatever structure it already
(12:56):
has, just as a set of plain old flat files. If we're loading data every night, we'll just load the new data up as a new file. No need to insert it into the old file, we'll just add a new one. We don't do any transformations at this stage. We don't design any standard schemas or build out complex key mappings. If we have to relate data from two different data sources with
(13:16):
different schemas and keys, we'll add that mapping to our query logic at the time. We'll cross that bridge when we come to it, so to speak. We also don't need to build indexes to accelerate the queries, as the queries we write just scan through the whole data from one end to the other and they're so fast anyway. There's very little point in trying to build indexes. We don't even bother understanding the whole file
(13:37):
necessarily, just the data elements we need for the analysis we want to do at this time. Remember, we're only going to send the columns we care about to our workers, so who cares what the data in the other columns means? If we ever do need to update the information (let's say we get new sales information from two weeks ago for a store whose systems were offline at the time that we got the feeds for that day), we don't try to insert the new data.
(13:59):
We just find the files for that store-day combination, delete them and write the new files with the corrected data back in their place.
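As a rough illustration of that load-and-replace pattern, here's a minimal sketch in Python that lays files out by store and day and overwrites one store-day partition when corrected data arrives. The folder layout and file format are invented for illustration; real data lakes typically sit on cloud object storage and use formats like Parquet rather than CSV.

```python
# A toy data lake: raw daily feeds land as flat files partitioned by store and day.
# To "update" a store-day, we simply delete its files and write replacements.
# Paths and formats here are illustrative only.

import csv
import shutil
from pathlib import Path

LAKE = Path("lake/sales")  # hypothetical data lake root

def write_partition(store, day, rows):
    """Write (or overwrite) the flat files for one store-day combination."""
    part = LAKE / f"store={store}" / f"day={day}"
    if part.exists():
        shutil.rmtree(part)          # the old files are simply deleted...
    part.mkdir(parents=True)
    with open(part / "part-0000.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)       # ...and replaced with the corrected data

# Nightly load: just drop the new day's data in as new files.
write_partition("store-001", "2024-01-15", [{"product": "widget-A", "qty": 3}])

# Two weeks later a corrected feed arrives for that store-day: overwrite it.
write_partition("store-001", "2024-01-15", [{"product": "widget-A", "qty": 5}])
```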
If all this sounds a bit like the Wild West brought to the world of data, well, it is, and that leads to a few issues we'll talk about in a bit. But first let's look at the problems it solves. The cost to the data producers is pretty small.
(14:20):
Just let us have access to your daily data output. No need to transform it. We'll take it in whatever format it's already in. No need to explain the data in detail. We might ask you about a few columns of it that matter to us and we'll take it from there. The costs of putting the file into the data lake and constructing the query all fall on the consumers of that data, and they're only doing the work on the things that they care
(14:41):
about.
There is some overhead, of course. Someone's got to set up the data lake in the first place and there is some cost to storing the data and running the queries, but it's very small in comparison to the cost of a data warehouse. As for the speed of getting it done, adding new data has no impact on any existing data, so there's no need for planning or heavy governance. New analysis can be enabled in days or even hours, rather than
(15:03):
months or the years that it used to take.
The downsides are simple.
Because there is no standardized structure and no mapping into standard key sets, every query writer has to create their own mapping every time. Because there's no formal governance enforced, it's possible that we will be duplicating efforts and potentially duplicating storage. Remember that it's cheap, but it's not free.
(15:26):
There are other things that we've come to rely on in the data warehouse world too. In a data warehouse, if we update something, or even delete something we didn't intend to, one call to IT and suddenly the change is undone, like magic; the underlying database technology allows for this. Furthermore, if incorrect changes keep getting made to the data, every change can be tracked right down to who
(15:46):
made it and when, so we can trace the problem back to its root and fix it. In the data lake world, all we have is a series of files. In order to change any data, we delete the old files and write new ones. Make a mistake and the old data is just plain gone. You have to go back to the source and get another copy, if it still exists, because the old files are just deleted.
(16:06):
There's no tracing what happened to them and who did what, to what, and when; it's just gone. The obvious question is: can we get a hybrid of these two worlds? Can we have our cake and eat it, so to speak? Well, if you combine a data warehouse and a data lake, what would you get? A data lake house, perhaps, and no, I'm not making that up.
(16:27):
That is indeed what they call it, and no, I can't explain Tech humour. I can explain what it is, though. Go back to the analogy that I gave you earlier about lists of sales by store and by day. Imagine that each of these lists is on a piece of paper. Well, if you put all these lists into an envelope, you could still use the lists just like you did before, but you could write extra information, metadata as we would call it, on
(16:50):
the envelope, and this could cure a few problems for you. First, let's tackle the schema problem. As we work with these lists, we figure out what each of the columns means and how it relates to other columns in other lists. We could write down these definitions on the envelope. Then the next person to work with this data can start from there.
They don't have to rediscover what we've already done, and they
(17:10):
can add more details as they find them in their work. Slowly, the data lake will start to become better and better defined, much like a data warehouse. Next, if we update the data, instead of deleting the old pages that are wrong, we could mark the bad pages on the envelope and just put the new pages in as well. We can note who we are, when we're making the change, and why, just like we would in a data warehouse.
(17:31):
Now, if something went wrong, we could look at this change log written on the envelope and, if needed, we could reinstate the old pages to undo the change. After all, they're still there inside the envelope. We just need to change the markings on the outside of the envelope to say that they are good, and to mark the new pages we had replaced them with as bad instead.
Now we have the same type of auditability and reliability
(17:53):
that we had in the old data warehouse world.
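For the programmatically minded, here's a minimal sketch in Python of that envelope idea: a small metadata manifest kept alongside the data files, recording column definitions and marking files as good or bad instead of deleting them. It's only an illustration of the concept; real lake house table formats such as Delta Lake or Apache Iceberg do this with far more rigour.

```python
# A toy "envelope" for one data lake partition: a JSON manifest that records
# column definitions and a change log, and marks files good or bad rather
# than deleting them. Purely illustrative of the lake house idea.

import json
from datetime import datetime, timezone
from pathlib import Path

MANIFEST = Path("lake/sales/store=store-001/day=2024-01-15/_manifest.json")

def load_manifest():
    if MANIFEST.exists():
        return json.loads(MANIFEST.read_text())
    return {"schema_notes": {}, "files": {}, "change_log": []}

def save_manifest(m):
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps(m, indent=2))

def note_column(name, meaning, who):
    """Write a column definition 'on the envelope' for the next analyst."""
    m = load_manifest()
    m["schema_notes"][name] = meaning
    m["change_log"].append({"who": who,
                            "when": datetime.now(timezone.utc).isoformat(),
                            "what": f"defined column {name}"})
    save_manifest(m)

def replace_file(old_name, new_name, who, why):
    """Mark the old file bad and the new file good; nothing is deleted."""
    m = load_manifest()
    m["files"][old_name] = "bad"
    m["files"][new_name] = "good"
    m["change_log"].append({"who": who,
                            "when": datetime.now(timezone.utc).isoformat(),
                            "what": f"replaced {old_name} with {new_name}",
                            "why": why})
    save_manifest(m)

note_column("qty", "units sold in this transaction", who="analyst-1")
replace_file("part-0000.csv", "part-0001.csv", who="analyst-2",
             why="late corrected feed")
# Undoing the change would just mean flipping the good/bad markings back.
```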
Finally, we can add some of the governance back to our Wild West by dividing the data lake into two parts. One part we might call uncertified, for example. This works just like we've described so far. Anyone is free to do with it what they like, but any data you use from it, you use at your own risk. If it's valuable to you, great, but you're responsible for
(18:14):
checking its accuracy.
As we find data that is particularly useful and that many people want to use on an ongoing basis, we refer it to the same type of governance team that we had for our data warehouse: people with high skill levels and good enterprise knowledge who can validate the definitions that have evolved through our explorations and give them, well, the official blessing. The data, and indeed queries, that are so blessed can be moved
(18:37):
into the certified part of the data lake, and all data sources in the certified region can now only be managed and controlled through IT's normal processes. That way, everyone in the company can access any data from this certified region and know that they're good to go, that it can be trusted. In a way, we're building out the old data warehouse idea, but this time using much cheaper technology and not investing
(18:58):
money in any part of it until it's already proven its worth. So, to wrap up, should you be investing in a data warehouse or a data lake? Neither. You should be investing in a data lake house, something with the speed and flexibility of a data lake, but with much of the strength and governance, where it's needed, of a data warehouse.
Now, just before I leave the topic of data stores, there's a
(19:22):
new shiny thing that people are getting excited about: the data mesh. I'm not going to cover that here. It deserves a podcast of its own. It's a big subject. But if you're wondering if you should be investing in that instead, I'd say probably not. There are going to be specific use cases where it'll make a lot of sense, but not for 99% of today's enterprises. The problem isn't one of technology but of behaviour, and
(19:43):
that stems from the warning I started with. If the benefits of the data accrue predominantly to the consumers of the data, but the costs accrue instead disproportionately to its producers, then failing to account for this imbalance will derail any and all data projects. The data mesh falls foul of this, I'm afraid, and until we find a way to correct that imbalance, I won't be recommending it to my clients.
As I said, for those that are interested, I will put out a
(20:06):
podcast on this topic in the future, but for now, I hope you have a great time hanging out in your new data lake house, and don't forget to bring a good book to read. Yeah, sorry, couldn't resist that. Tech humour again.
Thanks for listening.
I hope you enjoyed it and I hope you now know a little bit more about the various options for data storage in today's enterprise environments.
As always, the transcript of this podcast can be found on the website at https://www.thetechnologysoundingboard.com.
(20:27):
If you get a chance, stop by and leave us a review or a
comment.
Until next time.