October 4, 2024 23 mins

Join us as we sit down with Joe Reis, live at Big Data LDN (London) 2024. Joe shares his partnership with DeepLearning.ai and AWS through his new course on Data Engineering. Joe's new course promises to elevate your data skills with hands-on exercises that marry foundational knowledge with cutting-edge practices. We dive into how this course complements his seminal book, "Fundamentals of Data Engineering," and why certification is valuable for those looking for foundational, hands-on knowledge to be a data practitioner. 

But that's not all; we also dissect the hurdles of adopting modern data architectures like data mesh in traditionally siloed companies. Using Conway's Law as a lens, Joe discusses why businesses struggle to transition from outdated infrastructures to decentralized systems and why cross-disciplinary skills are crucial in this endeavor, a concept inspired by mixed martial arts that he cleverly calls 'Mixed Model Arts'.

Check out Joe's Work: 

What's New In Data is a data thought leadership series hosted by John Kutay, who leads data and products at Striim. What's New In Data hosts industry practitioners to discuss the latest trends, common patterns for real-world data patterns, and analytics success stories.

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
John Kutay (00:14):
Hey everybody.
We are live here at Big Data London. Not technically live, it
is pre-recorded.

Joe Reis (00:19):
But we're here live.

John Kutay (00:20):
Yeah, we're here in person live, which is a rare
thing these days.
I'm here with Joe Reis, the guy in data.
Joe, you've been doing so many cool things, glad to run into
you here.
You recently had Andrew Ng on your show.
Yeah, and Andrew is one of the great thought leaders in machine
learning.

Joe Reis (00:41):
He's on the Mount Rushmore as far as I'm concerned.
Yeah, yeah, yeah, yeah.

John Kutay (00:45):
So that was an awesome episode.
You're doing a lot of great work in educating the masses on
AI and machine learning as well.
Yeah, just catch us up on what you have going on.

Joe Reis (00:57):
Well, it's funny, today we're announcing the
launch of the data engineering specialization on Coursera.
So I've been working with the fine folks at DeepLearning.AI
and AWS for the last year on this, DeepLearning.AI's Andrew
and his team.
So, yeah, drops today.
Pretty stoked about that, it's coming out today.

John Kutay (01:18):
Wow, can anyone sign up?
Anyone can sign up.

Joe Reis (01:22):
You can sign up.
Can I sign up, okay?

John Kutay (01:24):
Most importantly, I can sign up.
Yeah, those who are watching this in person can probably sign
up too.
We can probably get you on the wait list.
Go to Coursera.org. That's super exciting.
So tell me about this specialization.

Joe Reis (01:37):
Well, so Matt Housley and I wrote a book on
data engineering a couple years ago and I really felt like that
was a great intro to the fundamentals of data engineering.
But it was a very technology-agnostic book that didn't go
into a lot of code or any examples.
That was by design.
I feel like a book really isn't the place to have code examples.

(01:58):
Some people may disagree with me, but once something goes into
print it's sort of out of date already.
So you want something that's going to last.
We really felt like a course was, like, the real complement to the
book, where you can give a lot of hands-on examples and, I
think, dive deeper into material that you really couldn't get
into in the book.
So, um, so the course really, I think, is sort of the yin to

(02:19):
the yang of the, uh, the book. It covers the data engineering lifecycle and the
undercurrents, I think, but in a way that gives you a lot of
challenging hands-on exercises.
So I mean, even when I took some of the exercises and did
the final versions, it's like, God, yep, that's legit.

John Kutay (02:37):
So anyone who goes through your
specialization,
do they get a certification?

Joe Reis (02:42):
Yeah, you get a certification at the end by
Coursera.
Yeah, signed by myself and, I think, a few other people.

John Kutay (02:48):
Excellent.
So we know that people that go through Joe Reis's data
engineering specialization course are tried and true at
implementing data pipelines that can feed real-world machine
learning and AI operations and analytics. Analytics, still
generally a lot of focus on analytics.
Your book, Fundamentals of Data Engineering, big cornerstone.

(03:10):
Everyone, I would say, it's definitely on the top shelf of my
book cabinet for data engineering, your book, along
with Martin Kleppmann's book, Designing Data-Intensive
Applications, and you mentioned yin and yang.
I feel like those two are great yin and yang because your
book is so practical.
Right, you say all the things that data practitioners already

(03:30):
need to know.
People who've been on the job always read it and say, like,
okay, they kind of say the stuff that, like, is mostly unsaid but
everyone knows, like, that you know from learning on the
job.
So I always recommend it to people who are breaking in.
And then I recommend Martin Kleppmann's book for the people who
are like, they do their jobs in analytics and engineering but they don't
know too much of how it works under the covers.

(03:51):
And now it sounds like with your specialization, your
Coursera class, it's really for technical operators to level up
and get certified.

Joe Reis (04:02):
Yeah, I mean you could start from zero.
I mean the expectation is you might know a bit of, uh, Python, I
think that's about it, not even that much, so it walks you step
by step through everything.
But yeah, it will take you to being at least, like, technically
capable of doing stuff.
I wouldn't say you'd, like, you know, kill it the first day of your
job, but at least you'll know what you're doing.
Uh, you know.
But I think that, uh, yeah, so it's, it's, it's, yeah, yeah.

(04:24):
I think it fulfills a lot of things.
And Martin's book's also awesome.
I think that's sort of the, we originally wrote our book to
sort of be the prequel to his book, where that was very much,
I would say, throws you off the deep end into the innards of
distributed systems, and I think that's a great book if you're
working in it.
But what we also noticed is, like, not a lot of people
actually do the low-level engineering that that book tells
you about, and I think it's great to know how things work.

(04:46):
But if you're not doing it day-to-day, especially if you're
more junior in your career, I think there would need to be sort
of an introduction to being able to think about how to reconcile
using those tools in production.

John Kutay (04:59):
So that's where we came in.

Joe Reis (05:00):
And then, obviously, you know, Martin's working on a
second edition of his book too, with Chris Riccomini.
He's an awesome engineer, so I'm excited to see what they're
doing.
I hope they're doing it.

John Kutay (05:11):
Yeah, yeah, it sounds like he's updating his
book to probably reflect more of the modern tools. It is.
He mentions things like React that were more like early 2010s,
late 2000s open source tooling, but now there's been this flood
of tooling that's come in, and what I'm really excited about,
especially for your Coursera class, is it's all coming
(05:33):
together.
I feel like the stack is sortof stabilizing.
The tool chain is stabilizing.

Joe Reis (05:38):
Yeah, it is.
It feels like that.
I mean, if I look around it's a lot of the same vendors from
last year, you know.
It feels like the vendor space is at least, like, stable.
Yeah. Right, I don't see anything here that's completely
out of left field.
There might be, I don't know.
I mean, you have to walk around and see.

John Kutay (05:59):
No, I agree with you.
Yeah, I did a quick walk around.
That's how I felt, it's like stabilizing.
We're all sort of agreeing on what the layers of indirection
are, right.

Joe Reis (06:06):
What do you think about that, though?
Do you think that that's a good thing, or do you feel like
that's when the industry is ripe for its next inflection
point?

John Kutay (06:13):
No, it's a good thing, for sure.
I mean, if you ask my opinion, you know, pre-COVID there was
your modern data stack.
The modern data stack sort of blew up during COVID.
You saw so much funding, like billions of dollars of funding.
All these unicorns came up.
We all came back to conferences in 2022 being like, what the
heck is all this stuff, right?
And now it's sort of matured, stabilized.

(06:36):
I have a good sense of what you should use now if you're trying
to build a data or AI stack.

Joe Reis (06:42):
What do you think? What would be John Kutay's stack of
choice right now?

John Kutay (06:47):
Yeah, yeah, you've got to have your cloud vendor
that really drives things.
I think that's sort of going to be your gravity, where you're
going to decide, as your core, what tools am I going to use
that are cloud vendor native, and then what things are, like,
especially value add.
So let's just pick AWS.

(07:07):
For instance, I'm going to, obviously, cold storage is going
to be on S3.
You know, I'm going to use an AWS managed database.
Now Oracle is on every cloud, so I don't have to, just because
I'm on AWS, you know, sacrifice having, like, a super performant, you know,
scalable database, and then Oracle has its own cloud now.
So that's the other thing.
The companies that were deemed legacy have actually modernized

(07:32):
all of their own awesome cloud versions.
And my stack of choice is, okay, start there, whether you're on
Oracle Cloud, AWS, Google Cloud, whatever, and then you're going
to have your adjacent tools.
So you're going to have your ingest products like Striim or
whatever.
I'm not going to name names, but you could probably guess.
Data modeling, right, you're going to have your dbt or you're

(07:53):
going to have your specialized data modeling tool, then your
analytics tool, which is also evolving quite a bit.
So now I'm going to turn that question back to you.

Joe Reis (08:03):
Well, my stack of choice?
I think the stack that works for you, right?
So I think, like, the only difference I would say is, like, I
think nowadays I'm also seeing people sort of repatriate from
the cloud, and that's often onto their laptop.
Oh my gosh. Yeah, good point.
You're starting to see that because I think, you know,
DuckDB is getting pretty awesome, so people, I think, are maybe
doing workloads there.
So it kind of reminds me of what notebooks used to be back

(08:25):
in the day, where everyone did local development on notebooks,
and I think you're starting to see that now with DuckDB.
Polars is another big one.
I think that's an interesting one to look out for.
But yeah, I mean in the cloud it's an interesting one.
You know, you could call it still the modern data stack, but I

(08:46):
think we kind of moved beyond that terminology just to the
analytics stack.
Yeah, but now everybody's also an AI vendor.
So now, you know, that's one of the things I'll be talking about,
is sort of, like, the lines between, you know, analytics, AI
and even applications are blurring a lot. Yeah, and so I
would say pick what works for you, because I think the notion
of where we're moving is changing as well, where it's
(09:08):
like the tooling is mature, but now you know what you need to
plug and play for certain use cases.
What I'm looking at next is sort of what happens when these
things start fusing into data-driven products, AI-driven
products, and it's not just about dashboards anymore or
whatever.
It's about how do I power real applications to provide
analytics to users?

(09:29):
That again provides a feedback loop to the app, and same with
machine learning and AI, right?
So I think that's the thing I'm most excited about.

John Kutay (09:40):
And you cover all those core components really
well in your book, Fundamentals of Data Engineering.
Right, now, the other big thing, yeah, you mentioned people are
moving workloads onto their laptops, which is a great point.
I'm actually going to be at Small Data SF, run by the
MotherDuck people, next week, and that idea is like, okay, your
laptop can store a terabyte of data and has hyper-threading and

(10:04):
multi-core processing and you can do so many ad hoc workloads
on your own machine without running up cloud compute bills,
so that's also going to be super interesting.
But how do you still centralize and have governance around it,
not have copies of data and things like that?

Joe Reis (10:19):
This is the big crux, I think, when I figure out
what's the governance of the data sets that are created.
That's why I'm talking about data modeling, because I feel
like the practitioners can't understand how to use these
various forms of data.
You know, we moved beyond tables, right, semi-structured is even
an old story, but now it's about text, even images, video, audio.
And how do you blend all these data sources together? Even Mike
(10:41):
Ferguson was talking about itduring his keynote where it's
like you're going to be doinganalytics, combining that
structured, unstructured datasets.
I mean, it's been happening atsome companies for a while, but
this is sort of what I'm excitedabout.
But then you've got to know howto handle all these different
data sets.
Right, like if you think thateverything needs to be in a
table and you approacheverything from a relational
modeling standpoint and you oreven nested data, right,

(11:10):
semi-structured data, that's a first-class citizen in almost
every database now.
But then, okay, that would definitely violate the third
normal form, relational model, because you can't have nested
data.
So how do you reconcile all these concepts together and
create data sets that make sense, not just for you but for other
users? Because if you're working in a decentralized world,

(11:32):
the data ideally has to have some useful form and shape, right.
So it's been on my mind a lot, like how do you make this work?
Because I think decentralization is sort of the,
I think that's the goal of,
I mean, even now, right, Zhamak talked here two years ago and
you still hear conversations of data mesh, data fabric and

(11:54):
everything else.
That conversation isn't going away.
I think that's an ideal that everyone wants to get to.
But ironically, the centralization of standards and
the federated computational governance is sort of how you're
going to get there.
So it's almost a paradox in a way.
Right, but it's a centralization of practices, or at least an
understanding of cross-team practices, to make it work.

(12:16):
So it is interesting.
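
Editor's note: The tension Joe describes between nested data and relational normal forms can be made concrete with a small sketch. Nested or repeating values already break first normal form, which third normal form builds on. The Python below is only an illustration under stated assumptions: the order record, its field names, and the normalize helper are all hypothetical and do not come from the episode, the book, or the course. It splits one nested document into flat rows that reference each other by key.

```python
# Hypothetical nested record, as it might arrive from an API or a document store.
order = {
    "order_id": 1001,
    "customer": {"customer_id": 7, "name": "Ada"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-9", "qty": 1},
    ],
}


def normalize(record):
    """Split one nested order into three flat, relational-style row sets."""
    # The embedded customer object becomes its own row.
    customer_row = dict(record["customer"])
    # The order row keeps only scalar values plus a key pointing at the customer.
    order_row = {
        "order_id": record["order_id"],
        "customer_id": customer_row["customer_id"],
    }
    # Each element of the nested list becomes one row keyed by (order_id, line_no).
    item_rows = [
        {"order_id": record["order_id"], "line_no": line_no, **item}
        for line_no, item in enumerate(record["items"], start=1)
    ]
    return {
        "customers": [customer_row],
        "orders": [order_row],
        "order_items": item_rows,
    }


tables = normalize(order)
```

Whether to flatten like this or keep the nested shape is the kind of modeling choice the conversation is about: the flat rows suit relational tooling, while the nested form stays closer to what the application produced.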

John Kutay (12:19):
That part.
Every organization has to figure out what decentralization
means to them.
Yeah. We'll talk tomorrow with National Grid on how they've
decentralized their analytics with data products, and there's
always this sort of messy middle where you can't just read a
book and say this is how we're going to do decentralization.
I think it's like trying to fit a round peg into a square hole.

(12:42):
So what are your recommendations for teams that
are trying to go down this journey of decentralization?

Joe Reis (12:47):
I think you hit it, though, pretty clearly.
You've got to understand, what does that mean to you, because
inevitably you're going to run up against Conway's Law, especially
when you're in your own organizations.
Conway's Law describes that you'll design systems according
to the way that your company communicates.
So if you're very siloed, the systems you build are going to
be very siloed, and so that's one of the cruxes, is really

(13:08):
understanding, you know, in your organization, what's a tolerable
level of decentralization, or is it tolerable at all?
That's the other thing.
Maybe you don't need to, because it's physically impossible
anyway.
So you may as well not lie to yourself, which I think a lot of
companies do, because they're like, oh, we've got to do data mesh.
I'm like, yeah, there's no chance in hell that'll ever
happen at your company.
Yeah, like everything, I think it's very rigidly hierarchical

(13:30):
and siloed, and, like, it wouldn't work.
Yeah, like the very org structure is like the epitome of
centralization.
Yeah, so it's the inviolable rule, you know, I talk about.

(13:53):
There's sort of a corollary that I jokingly called Reis's
law, which is you'll design your data models according to
Conway's Law and the architecture that supports the
company, and so it's interesting.

John Kutay (13:56):
Okay, yeah, I think Joe's law, that's certainly
something that everyone needs to coin at this point.
And I agree exactly, the way you do data modeling, even the way
you name your tables, is like a reflection of, like, you know, how
your company operates, like, are you really truly meant for scale
where business users can go click a button and, you know,

(14:19):
get some insight, or are you always going to go need to ask
the data engineer to kind of decipher and decode and prep,
you know, prepare data for you?

Joe Reis (14:28):
Oh, yeah, I mean, I've seen it in some companies where
they, one company I remember, they were still using a 1980s-era
mainframe and they're like, oh, we're going to decentralize
it.
No, yeah, not until that thing's gone.
But you had to design all your architecture according to what
that database, how it was designed back in the day.

(14:49):
You had seven- or eight-character-limit columns and it
was just, it's pretty awesome.
So, yeah, but that's the reality of a lot of businesses.
Right, you have, like, a lot of infrastructure that needs to be
revamped, and that's like, okay, you're just going to gut all
that today and go move to something else?
No, like, that's not how that works.
Yeah, just the reality of it.

(15:17):
So I guess, to answer your question,

John Kutay (15:18):
It's like, what does that mean to you?
And you've got to look at the cold hard facts of, like, oh, what can
we support?
Yeah, right. So, as of the time of this recording, today,
you have a keynote here at Big Data London.
What are you going to talk about?

Joe Reis (15:26):
I'm going to talk about, uh, mixed model arts.
So I'm a big MMA fan, uh, mixed martial arts fan.
Have been for a long time.
The notion is really that, you know, we're still, the data world
is, in a lot of ways, I think we're moving forward.
I think AI was sort of the kick in the pants that everyone
needed to sort of, like, move on, but the discussions have been, um,

(15:49):
very much Napoleon Dynamite.
Yeah, remember that character, Uncle Rico, the guy who was, like,
living in, like, the past, when he, like, almost won state for
football in high school, so he could throw a ball over the
mountain, over the mountain, right.
That's what I feel like a lot of the data industry is.
I feel like we're Uncle Rico, where we're just, like,
reminiscing about the past,
and we're just, like, stuck in the

(16:10):
80s and the 90s.
And data modeling, this is true.
When you mention data modeling, people still talk about
relational.
It's like, oh, you mean Kimball or Data Vault, right? Like, we're
still stuck in this tabular world.
Yeah, nothing wrong with that, it works for what it works for.
But the thing is, applications,
NoSQL has been around 20-plus years already, right, and streaming

(16:34):
is another thing.
Are any of those tables? Sort of, sort of not.
There's a blend of data.
Now we're introducing text, all this other stuff, into it.
The notion of mixed model arts is about adopting what
works across disciplines: machine learning, analytics and apps.
Because what's happening is there's a convergence happening

(16:55):
of all these disciplines as we speak.
When you open up Uber, right, I think at one point, when it
started, that was a Rails app actually, but now it's like, you
know, if you look at the number of Kafka events it brings in and a
number of other things, it's insane how much data this thing
ingests.
It's a data-driven app that also uses machine learning.
All these things work together seamlessly, but this is the

(17:16):
direction products are going.
Yeah, you know.
So it's a recognition that there's more than one way to
think about and model data.
But you have to know all these things and increasingly,
especially when we live in an era of constraints, budget cuts,
teams, more is asked of people.
Whether you're a software engineer, whether you're a data
analyst, scientist, you're going to have to become, I think,

(17:40):
full stack with your data modeling skills.

John Kutay (17:43):
So that's the notion of the talk.

Joe Reis (17:44):
That's incredible, and does this tie seamlessly into
your certification?
Somewhat. So
we do talk about data modeling and we do talk about all these
use cases across different,
so we talk about analytics, machine learning, even working
with application data in the course, for sure.

John Kutay (17:56):
Yeah.

Joe Reis (17:57):
So I feel like this is one of those things where, like,
you just need to understand all the different ways of
handling data.
It's sort of, you know, mixed martial arts, right,

(18:18):
[no transcript] cross-disciplinary sport, and
this is the same way I feel about data, where, like, we need
to catch up to where reality is instead of dwelling on the past
and thinking like there's, you know, I think it's staying
fixated on the old approaches. I always quote you specifically on
this when you talk about, okay, what direction is data going in?

John Kutay (18:41):
And look at software engineering.
Right, because the more I find, like, really good productive data
engineers and, like, people who work with data stacks have the
skills of a software engineer.
And if you hire a software engineer, they can probably
learn that pretty quickly.
Right, they can probably take your certification, read your
book and just hit the ground running and, you know, build

(19:04):
these scalable, uh, fault-tolerant data pipelines, whereas
if you hire someone who's too specialized, just in, like, you
know, analytics, like their only knowledge is SQL, right, they're
going to, they have to work in a very narrow scope, which
ultimately makes it hard for them to be successful.
So I always mention this to folks, like, data engineering is

(19:28):
software engineering, but it's specialized software engineering.
I think your point about mixed model arts, where, yeah, you'd
have to have this multidisciplinary skill set, and
ultimately that's what being a software engineer is. You can't
be a software engineer who just says, oh no, I only use this one
part of one language and I don't deploy it, I just write
the code.
No, no one thinks like that.

(19:49):
Right, I only write structs.
Yeah, I only write structs, just for loops.

Joe Reis (19:54):
No, but I always felt like, you know, if you're taking
the mixed martial arts equivalent of software engineers,
I always felt like software engineers are the wrestlers.
Or if you were to pick, like, one discipline, if you were to do
nothing else and you were to go into martial arts, like
wrestling, you can dictate where it goes, you can keep it
standing up, you can go to the ground, and I feel like the
software engineers, they just, you have the chops to sort of

(20:15):
dictate where things go, because you have the technical ability
to deploy things in production.
That's unlike analysts that just make dashboards.
Nothing against that, but it's just a different skill set.
But I feel like, again, with the way everything's going, it's
like these, a lot of these skills.
What I notice is that my software engineering friends, I

(20:35):
have a ton of them, but a lot of them are interested in
analytics and machine learning.
It's like, I need to start bringing this into my stack now.

John Kutay (20:41):
Yeah.

Joe Reis (20:42):
Right, especially ML and AI.
It's like, this is the stuff that you weren't talking about.
But now everybody, a lot of software engineers are
conversant in vector databases, for example.
Right, two years ago that wasn't even a conversation piece,
right.
So now they're expected to bring in all these different
workloads, like, I need to bring in, you know, make an OpenAI

(21:04):
plug-in call or something, you know.
But it's like, yeah, different world.

John Kutay (21:08):
One thing I am really impressed with is just
how ergonomic the cloud providers have made
using LLMs and just integrating into your
same cron jobs that are doing data processing and data
modeling and transformations.
I can go into Google Vertex AI's Model Garden and go pick

(21:28):
Anthropic's models or Gemini. AWS has Bedrock, and it's super
integrated.
So I just did a talk here right before this with Kramp, which
is an amazing company.
They supply a lot of the parts for agriculture companies in
Europe.
They operate at tremendous scale and they've already
adopted AI and machine learning.
Not because they were just super gung-ho on doing AI, but

(21:49):
they were like, this is the next practical step to make the
experience better for our customers.

Joe Reis (21:53):
Bingo, and that's just it.
It's about making experiences better, and so that's what a lot
of these products do. Better chat interfaces, for example.
It's like, that's low-hanging fruit, right? Okay, but then
what does that depend on?
Well, it depends on, probably, training on some text data that
you have.
Right, you've got to know how that works.
So, yeah, it's an exciting time.
I feel the inflection point that we're in right now is, it's like

(22:14):
I think the stack stabilized,but then, you know, now it's
about adopting, you know, newapproaches to solve new types of
problems, not just shoehorningthe existing stuff to solve old
problems.
I think we've done a good jobat that.
We've solved a lot of thoseproblems, at least from a
technical standpoint.
I still think the peopleprocess technology arm is always

(22:35):
in the picture and that there are always challenges.
But I think, from a technical aspect, when I look at the
vendors here, it's like, it's hard to find, like, a tool that
you would say, like, that's a really bad tool.
I mean, there's a lot of great tools these days.
I mean, the bar is very high, competition's intense.
So now it's time to, now that you've solved a lot of, I'd say,

(22:56):
I'd say, the pretty standard problems,
now that pie widens, right, and now you can solve
more problems.

John Kutay (23:04):
Yeah, absolutely, Joe.
You're the man in data.
That's the best way to describe you.
You're also author of Fundamentals of Data Engineering.
We'll have a link to that book down in the show description.
Also, his new Coursera class.
Joe, have a great time at your keynote today.
I'm definitely going to be there.
I'll be your biggest fan in the back. Maybe heckling, I mean,

(23:25):
depending on what you say.
Joe, great to see you.
Thanks everyone for tuning in.