Advanced Analytics with Spark: Patterns for Learning from Data at Scale

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Have you ever just stopped and wondered how your streaming
service seems to know your musical tastes, like, better than
you do?

Speaker 2 (00:07):
Oh, absolutely. Or, you know, how a credit card company can
instantly spot a fraudulent purchase from halfway across the world.

Speaker 1 (00:13):
Right, blocking it before you even realize your
card was skimmed. I mean, if you're tuning into
this deep dive to figure out how the modern world
actually works, you know, it feels a lot like magic.

Speaker 2 (00:24):
It really does. But the truth is there's a very industrial,
kind of gritty machine behind that curtain.

Speaker 1 (00:30):
Yeah. And there's a quote right at the beginning of
our source text today from Otto von Bismarck that I
think completely shatters the illusion of this like pristine tech laboratory.
He said, data applications are like sausages. It is better
not to see them being made.

Speaker 2 (00:46):
That quote perfectly sets the tone for the book we're
looking at today. It's called Advanced Analytics with Spark, and
the authors, Sandy Ryza, Uri Laserson, Sean Owen, and
Josh Wills, just aren't interested in the sanitized,
theoretical version of data science.

Speaker 1 (01:02):
Right. They want to get into the weeds.

Speaker 2 (01:04):
Exactly. Our mission today is to demystify that whole pipeline because
they focus on the incredibly messy, chaotic reality of taking massive,
ugly data sets and forcing them through to create those
everyday insights for you.

Speaker 1 (01:18):
And to understand the magic, you first have to understand
the machinery, which brings us to a really sharp line
the authors draw. They split this into analytics in the
lab versus analytics in the factory. That really stuck out
to me.

Speaker 2 (01:30):
Yeah, that distinction is crucial because, well, the lab is exploratory.
That's where a data scientist sits with a subset of
data on their laptop, maybe using a framework like R
or Python, just kind of playing around, right, testing wild theories,
seeing if a model actually works. It's highly creative. But then
you have the factory. That is operational analytics.

Speaker 1 (01:48):
Which is a whole different beast.

Speaker 2 (01:51):
Completely. You have to take that delicate, bespoke model you built
in the lab and deploy it into a live production
application like a real time fraud detection system handling millions
of swipes a minute.

Speaker 1 (02:01):
And historically those were completely different languages, right, Like a
data scientist would do the fun math in the lab
and then hand it off to a team of engineers.

Speaker 2 (02:10):
Oh yeah, and those engineers would have to translate and
rewrite the entire pipeline from scratch in Java or C++
just so it wouldn't crash on the factory floor.

Speaker 1 (02:19):
Which sounds like a massive bottleneck.

Speaker 2 (02:21):
A huge one. Older distributed systems like Hadoop's
MapReduce were built to handle factory scale by breaking work
into small tasks across thousands of machines. But MapReduce
operates with a very rigid map-then-reduce format, meaning
every time it finishes a single analytical step,
it's forced to write those intermediate results back out to

(02:43):
a physical hard drive before it can move on.

Speaker 1 (02:45):
Oh wow, I can imagine. That is painfully slow. It's
like doing a complex math equation, but after every single
addition or subtraction, you have to file your scratch paper
away in a cabinet across the room, walk back, and
then retrieve it for the next step.

Speaker 2 (02:58):
That is exactly what it's like, and it's agonizingly slow
when you're dealing with terabytes of data. This is why
Apache Spark, the engine at the heart of this deep dive,
completely revolutionized the industry.

Speaker 1 (03:11):
Because it skips the filing cabinet.

Speaker 2 (03:12):
Essentially, yes. Spark processes data in memory. It caches the
data in the RAM of the cluster rather than constantly
writing it back to a physical disk. By holding it
all in memory, it performs complex operations lightning fast.

Speaker 1 (03:26):
So it bridges the gap between the lab and the
factory.

Speaker 2 (03:30):
Exactly. You can do your exploratory lab work and run your
factory production on the exact same framework without that massive
translation bottleneck.

Speaker 1 (03:37):
Okay, let's unpack this. I want to push back on
the premise for a second. Why is raw computational speed
so fundamentally important to the science itself? I mean, is
it just about saving a corporation a few minutes of
cloud computing costs? Or does speed actually change the quality
of the model?

Speaker 2 (03:51):
Oh, speed changes everything because it enables iteration. Iteration is
like a fundamental truth of this field. The authors make
it very clear that data scientists never get the model
right on the first try.

Speaker 1 (04:03):
They don't just write a brilliant algorithm and call it
a day.

Speaker 2 (04:05):
Never. They don't just hit compile and walk away. They have
to tweak it constantly, adjusting what we call hyperparameters,
testing different inputs.

Speaker 1 (04:13):
Wait, before we go further. Hyperparameters, are those just,
like, the dials and knobs on the side of the algorithm?

Speaker 2 (04:19):
That is a brilliant way to visualize it. Think of
an algorithm like an amplifier, and the hyperparameters are
your bass, treble, and gain knobs. The data scientist has
to constantly twist those knobs, run the data through, listen
to the output, and twist them again until the signal
is perfectly clear.

Speaker 1 (04:38):
Ah, I see.

Speaker 2 (04:39):
And in a framework like MapReduce, where you have
to read the data set from a hard drive every
single time you tweak a knob, that delay physically limits
how many combinations you can try in a day.

Speaker 1 (04:49):
But Spark removes that friction.

Speaker 2 (04:51):
Right. With Spark's in-memory processing, you can iterate endlessly.
You can twist the knobs a thousand times a minute.

Speaker 1 (04:57):
So Spark gives us this incredible, lightning-fast engine.
But an engine still needs fuel, right? You can iterate fast,
but if the raw materials going into our sausage
factory are garbage, that speed is actually dangerous.

Speaker 2 (05:12):
Absolutely dangerous. And the data we collect in the real world is
just incredibly, stubbornly messy.

Speaker 1 (05:16):
Which brings us to what the text calls the
most tedious but absolutely most critical step in the entire
pipeline: data cleansing.

Speaker 2 (05:24):
Yeah, the authors really emphasize that the vast majority of
the work in any successful analysis isn't the cool math.
It lies in preprocessing the raw data. And one
of the toughest challenges there is a problem known as
record linkage.

Speaker 1 (05:37):
Record linkage. You might also hear it called entity resolution
or merge-and-purge, right?

Speaker 2 (05:42):
Exactly.

Speaker 1 (05:43):
This is the process of looking at two different records
in your massive database and trying to figure out if
they are actually pointing to the exact same physical thing
in the real world.

Speaker 2 (05:52):
Right. Let's say you're pulling in data from multiple source
systems like a sales database and a marketing database. A
customer or a business might be entered with slightly different names,
or, you know, a typo in their address.

Speaker 1 (06:04):
And you need to link those to get an accurate picture.

Speaker 2 (06:06):
Exactly. But a simple equality test where the computer just
asks if string A perfectly matches string B, well, that's
going to fail miserably the moment someone misspells a street name.

Speaker 1 (06:16):
The book has a fantastic, incredibly relatable example of this
that we have to talk about. So imagine you are
looking at a list of business listings. Listing number one
is Josh's Coffee Shop, located in West Hollywood. Okay. Listing
number two is Josh Coffee, located in Hollywood, and they
share the same phone number.

Speaker 2 (06:34):
Right. So, to a human being, the context makes it
instantly obvious: a data entry error left off the West
in West Hollywood and the S in Josh's. We intuitively know
they're the exact same small coffee shop.

Speaker 1 (06:47):
But then you can contrast that with the second part
of the example. Listing three is Coffee Chain number twelve
thirty four, located at fourteen hundred Sunset Blvd, Suite two.
Listing four is Coffee Chain Regional Office, located at
the exact same address, and they share the exact same
corporate phone number in Seattle.

Speaker 2 (07:06):
Now, a computer looking at the raw similarity of those fields,
the identical address, the identical phone number, it might assume
they're the same entity.

Speaker 1 (07:16):
But you and I know that a local retail coffee
shop branch and the regional corporate office are two totally
different things. You don't go to the corporate office to
buy a latte.

Speaker 2 (07:24):
Right, human brains spot that difference in a millisecond because
we have a lifetime of context about how businesses operate.
But trying to write a hard-coded rule to teach
a computer that distinction feels impossible.

Speaker 1 (07:36):
Because how do you tell the computer to ignore the
city difference in the first example, but care deeply about
the name difference in the second?

Speaker 2 (07:42):
You've hit on exactly why record linkage is such a
thorny computational problem. The criteria we use to make the
duplicate-or-not-duplicate decision change depending on the specific
context of the pair.

Speaker 1 (07:53):
So you can't just use simple true or false rules.

Speaker 2 (07:56):
No, you have to dive into probabilistic matching. This system
calculates the likelihood of a match based on complex string
edit distances and weighted fields. It is mathematically grueling work.
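
A minimal sketch of the kind of scoring described here: a string edit distance per field, combined with per-field weights. It is plain Python rather than the Spark pipeline the book uses, and the field names, weights, and phone numbers are illustrative assumptions, not values from the source.

```python
def levenshtein(a, b):
    """Edit distance: minimum single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalize edit distance to a 0..1 similarity, ignoring case."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical weights: how much each field counts toward the overall score.
WEIGHTS = {"name": 0.4, "location": 0.2, "phone": 0.4}


def match_score(r1, r2):
    """Weighted average of per-field similarities."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


# The four listings from the example; phone numbers are placeholders.
josh_a = {"name": "Josh's Coffee Shop", "location": "West Hollywood", "phone": "555-0100"}
josh_b = {"name": "Josh Coffee", "location": "Hollywood", "phone": "555-0100"}
chain_store = {"name": "Coffee Chain #1234", "location": "1400 Sunset Blvd, Suite 2", "phone": "555-0199"}
chain_office = {"name": "Coffee Chain Regional Office", "location": "1400 Sunset Blvd, Suite 2", "phone": "555-0199"}

print(josh_a["name"] == josh_b["name"])         # the simple equality test
print(match_score(josh_a, josh_b))              # same shop, typos and all
print(match_score(chain_store, chain_office))   # different entities, shared fields
```

With these made-up weights, the chain store and its regional office score higher than the two Josh's listings, because the shared address and phone number dominate. That is the trap the speakers describe: a single fixed weighting cannot know which field matters for which pair, which is why real systems estimate match probabilities from the data instead.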

Speaker 1 (08:09):
There is a terrifying insight from the text here that
I think you, the listener, really need to hear. We've
all heard the phrase garbage in, garbage out, right? If
your data is obviously full of blank fields and errors,
your result will be obviously bad. But the authors argue
that obvious garbage isn't the real danger.

Speaker 2 (08:27):
It's not. The real danger is getting a completely reasonable-looking
answer from a reasonable-looking data set that secretly
has major quality issues hiding under the surface.

Speaker 1 (08:37):
That is wild.

Speaker 2 (08:38):
It is the silent killer of data projects, because if
you don't properly handle record linkage, your algorithm might run flawlessly.
It might output a clean, confident prediction, and a company
might make massive financial decisions based on it.

Speaker 1 (08:51):
But because the underlying entities were merged incorrectly, like combining
a thousand retail stores with their corporate offices, the entire foundation
is flawed.

Speaker 2 (08:59):
Exactly. As the authors bluntly state, drawing significant
conclusions based on this kind of invisible mistake is the
sort of thing that gets data scientists fired.

Speaker 1 (09:09):
Okay, so we've survived the sausage factory. We've used Spark
to rapidly iterate, we've carefully linked our messy records without
getting fired. What do we actually build with all this?

Speaker 2 (09:21):
Let's apply this pipeline to something you interact with every
single day: music recommendations.

Speaker 1 (09:26):
Yes. To explore how recommendation engines actually function, the authors use
this fascinating data set published back in two thousand and
five by Audioscrobbler.

Speaker 2 (09:34):
Which was the foundation for the first music recommendation system
for Last.fm.

Speaker 1 (09:38):
Right, and even by today's standards, this data set is substantial.
We are talking one hundred and forty one thousand unique users,
one point six million unique artists, and over twenty four
million recorded plays.

Speaker 2 (09:49):
But what makes this data set so interesting isn't just
the sheer scale of it, but the type of data
it relies on. The entire system is built on what
is called implicit feedback.

Speaker 1 (09:58):
Okay, implicit feedback.

Speaker 2 (10:00):
Think about the difference between implicit and explicit data. Explicit
data is when a user actively gives a song a
five-star rating or a thumbs-up. They are explicitly stating
their preference.

Speaker 1 (10:11):
They're telling the machine exactly what they think.

Speaker 2 (10:13):
Right. Implicit data, on the other hand, is just
a passive record of an action. In this Audioscrobbler
data set, it's just a raw record that a user
played a song.

Speaker 1 (10:23):
Okay, here's where it gets really interesting. I have to
push back on this. If the algorithm is basing my
entire taste profile on plays, isn't that a wildly weak signal?

Speaker 2 (10:34):
What do you mean?

Speaker 1 (10:34):
Well, think about your own habits. I fall asleep with
Spotify running all the time. Or what if I put
an album on, realize I absolutely hate the first track,
but the doorbell rings, so I walk away and the
album just keeps playing to an empty room. The computer
thinks I loved it.

Speaker 2 (10:48):
That is a very fair point. A single implicit data
point carries far less information than an explicit five star rating.
It's a very noisy, highly flawed signal.

Speaker 1 (10:58):
Because it captures all those accidents.

Speaker 2 (11:00):
Absolutely. It's going to capture accidental plays, skipped tracks that registered
as plays, shared accounts, all of it. However, the sheer
volume of implicit data more than makes up for that noise.

Speaker 1 (11:11):
Because listeners rarely take the time to explicitly rate
things anymore. I mean, I can't remember the last time
I actually clicked a star rating on a podcast or
a song.

Speaker 2 (11:20):
Exactly. People are incredibly lazy with explicit feedback. You might only
get ratings from the most extreme fans or the most
vocal critics, which heavily skews your data.

Speaker 1 (11:29):
That makes sense.

Speaker 2 (11:30):
But people generate implicit play data constantly, just by going about
their day. So while a rating data set might give
you highly accurate info on a few dozen songs a
user loved, an implicit data set gives you tens of
thousands of data points on their broader behavior. The massive
volume simply overpowers the individual moments of noise.

Speaker 1 (11:49):
Okay, so we have twenty four million vague, noisy plays
spanning one point six million artists. The million dollar question
is how do we turn that massive pile of passive
actions into a highly accurate, personalized playlist, one that makes
you feel like the computer is reading your mind.

Speaker 2 (12:05):
This brings us to the core math: latent factor models.
And the specific algorithm the authors deploy in Spark to handle
this is called alternating least squares, or ALS.

Speaker 1 (12:17):
Okay, let's start with the latent factor part. The book
uses an analogy here that really clarified this for me.
Imagine looking at a single customer's listening history. They are
frequently buying albums by extremely aggressive heavy metal bands like
Megadeth and Pantera, right? But interspersed in those purchases, they
are also buying albums by the classical composer Mozart. Now,

(12:39):
if you just look at the raw metadata, it seems
completely random. What do Megadeth and Mozart possibly have in common?

Speaker 2 (12:45):
On the surface, nothing. They share no genres, no instruments,
no vocal styles, but a latent factor model doesn't care
about the labels we put on music. It tries to
explain those seemingly random observed interactions through a small number
of unobserved underlying reasons.

Speaker 1 (13:00):
And those hidden reasons are the latent factors.

Speaker 2 (13:02):
Exactly. The underlying reason, the latent factor, might be that
this user enjoys a specific, complex spectrum of music that
spans from aggressive metal through highly technical progressive rock all
the way to intricate classical compositions.

Speaker 1 (13:17):
Oh wow, so the algorithm doesn't know the word metal
or classical.

Speaker 2 (13:20):
Not at all. It just mathematically discovers this invisible
bridge connecting these disparate artists based on how millions of
users behave in tandem.

Speaker 1 (13:28):
That is wild. So how does it actually discover those bridges?

Speaker 2 (13:32):
To do that, the algorithm uses matrix factorization. I want
you to picture a massive spreadsheet. We will call this
matrix A. The rows going down the side are the one
hundred and forty one thousand users. The columns going across
the top are the one point six million artists.

Speaker 1 (13:46):
Okay, I'm with you.

Speaker 2 (13:47):
If a user played an artist, there's a number in
that intersecting cell. If they didn't play them, it is
a zero.

Speaker 1 (13:52):
I am trying to picture the spreadsheet in my head.
One hundred and forty one thousand rows by one point
six million columns. That is over two hundred and twenty
five billion individual cells. It is massive, and because no
single user has listened to even a fraction of one
point six million artists, almost every single cell in that
giant spreadsheet is zero, right? It's incredibly sparse.

Speaker 2 (14:11):
It is over ninety nine percent empty. So the mathematical
process of matrix factorization takes this massive, mostly empty matrix
A and breaks it down, or factors it, into two
much smaller, skinny matrices. Let's call them matrix X, which
represents the users, and matrix Y, which represents the artists.
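
The arithmetic behind those figures can be checked in a few lines. This is a back-of-the-envelope sketch: the user, artist, and play counts are the rounded numbers quoted in the episode, the play count is treated as an upper bound on non-zero cells, and the rank of 10 is a hypothetical choice, not a value from the source.

```python
users = 141_000
artists = 1_600_000
plays = 24_000_000   # at most this many cells of A can be non-zero

cells = users * artists      # size of the full user-by-artist matrix A
density = plays / cells      # upper bound on the fraction of filled cells

rank = 10                    # hypothetical number of latent factors
factor_values = users * rank + artists * rank  # entries in X plus entries in Y

print(f"cells in A:       {cells:,}")            # 225,600,000,000
print(f"filled (at most): {density:.4%}")
print(f"values in X and Y: {factor_values:,}")   # 17,410,000
```

Under these assumptions the two skinny matrices hold roughly seventeen million numbers, against more than two hundred billion cells in the full spreadsheet.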

Speaker 1 (14:30):
This reminds me of the jigsaw puzzle visual from
the text.

Speaker 2 (14:32):
Well, that's a good one.

Speaker 1 (14:34):
Yeah. When you only have a few scattered pieces of
a puzzle sitting on a giant table, it's incredibly hard
to describe what the picture is. That scattered table is
our giant, mostly empty matrix A. But if you actually
force those pieces together to finish the puzzle, you realize
it's just a simple picture of a cat.

Speaker 2 (14:51):
Right, the complexity disappears.

Speaker 1 (14:52):
Matrices X and Y are the simple underlying reality of
the cat. The algorithm is trying to find that simple
truth hidden in the scattered pieces.

Speaker 2 (15:01):
Precisely. And to find that cat mathematically, you have to multiply
matrix X and matrix Y together to recreate the original
matrix A. But there's a massive catch-22 here.

Speaker 1 (15:12):
Wait. Yeah, because basic algebra says you can't solve a
complex equation if you have two entirely unknown variables at
the exact same time. We don't know matrix X and
we don't know matrix Y. So how does the algorithm
actually calculate this?

Speaker 2 (15:23):
What's fascinating here is the alternating part of alternating least squares.
Since the algorithm can't solve for both X and Y
at the same time, it simply fakes one of them.

Speaker 1 (15:33):
Wait, it just fakes it.

Speaker 2 (15:34):
Yeah. It initializes the artist matrix, matrix Y, with completely random,
entirely made-up numbers.

Speaker 1 (15:40):
It just guesses wildly.

Speaker 2 (15:41):
It guesses wildly. But here is the trick. Once you
pretend that matrix Y is a known, fixed value, the
unsolvable equation suddenly becomes solvable. It becomes trivial to use
basic linear algebra to perfectly solve for matrix X.

Speaker 1 (15:54):
But X was just solved using a fake Y, so
X is completely wrong too.

Speaker 2 (15:59):
It is wrong, but, and this is key, it is
slightly less wrong than the completely random numbers we started with.
So now the algorithm locks in that newly calculated, slightly
less wrong matrix X. It treats X as the fixed truth,
and it works backward to solve for a brand new matrix Y.
It alternates. It alternates back and forth millions of times.

(16:20):
Solve X using Y, then solve Y using X. With
each single alternation, the squared differences between the predicted matrix
and the actual user data get just a tiny bit smaller.
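
A toy version of that back-and-forth, in plain Python. This is a sketch under stated assumptions, not Spark's implementation: it treats every cell of a small dense matrix (zeros included) as an observed value, whereas implicit-feedback ALS in practice weights observations differently. The function names `solve`, `solve_rows`, and `als`, the tiny matrix, and the small regularization term `lam` (added so the little linear systems stay well-behaved) are all illustrative choices. The tracked loss is the squared differences plus that penalty.

```python
import random


def solve(M, b):
    """Solve the small k x k system M z = b by Gauss-Jordan elimination."""
    k = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def solve_rows(rows, fixed, lam):
    """Hold `fixed` constant; for each row a, solve (F^T F + lam*I) z = F^T a."""
    k = len(fixed[0])
    gram = [[sum(f[p] * f[q] for f in fixed) + (lam if p == q else 0.0)
             for q in range(k)] for p in range(k)]
    return [solve(gram, [sum(a[j] * fixed[j][p] for j in range(len(fixed)))
                         for p in range(k)])
            for a in rows]


def loss(A, X, Y, lam):
    """Squared differences between A and X * Y^T, plus the regularization penalty."""
    err = sum((A[i][j] - sum(x * y for x, y in zip(X[i], Y[j]))) ** 2
              for i in range(len(A)) for j in range(len(A[0])))
    penalty = sum(v * v for row in X + Y for v in row)
    return err + lam * penalty


def als(A, k=2, lam=0.01, iters=15, seed=0):
    rng = random.Random(seed)
    Y = [[rng.random() for _ in range(k)] for _ in A[0]]  # random artist factors
    A_t = [list(col) for col in zip(*A)]                   # artists as rows
    losses = []
    for _ in range(iters):
        X = solve_rows(A, Y, lam)    # pretend Y is known, solve for users
        Y = solve_rows(A_t, X, lam)  # pretend X is known, solve for artists
        losses.append(loss(A, X, Y, lam))
    return X, Y, losses


# Toy play counts: 4 users (rows) by 5 artists (columns); values are made up.
A = [[5, 3, 0, 0, 1],
     [4, 2, 0, 0, 1],
     [0, 0, 4, 5, 0],
     [0, 1, 3, 4, 0]]
X, Y, losses = als(A)
print([round(v, 4) for v in losses])
```

Because each half-step exactly minimizes the loss for one matrix while the other is held fixed, the printed sequence never increases from one pass to the next, which is the "slightly less wrong" behavior described above.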

Speaker 1 (16:30):
Which is the least squares part of the name.

Speaker 2 (16:33):
Exactly. It methodically shrinks the error until it eventually converges on
a highly accurate approximation of the underlying reality. The latent
tastes emerge from the math.

Speaker 1 (16:41):
So the math turns, the matrices alternate, the errors shrink, and
eventually the algorithm spits out a list of recommended artists
for a user. But how do we know if it
actually worked? Like, how do we know if those alternating
guesses generated a genuinely good recommendation for the listener.

Speaker 2 (16:57):
Evaluating a recommendation engine is notoriously difficult. Actually, it raises
this fascinating problem about subjective human judgment versus mathematical truth.

Speaker 1 (17:05):
The authors dive into a specific test case from the
data set that I think illustrates this perfectly. They isolate
a single listener, user 2093760.

Speaker 2 (17:15):
Okay, let's look at them.

Speaker 1 (17:16):
We look at their actual listening history, and it
includes artists like David Gray, Blackalicious, Jurassic 5, Xzibit,
and the Irish rock band the Saw Doctors. It is
this really specific, nuanced mix of mainstream pop, underground nineties
hip hop, and Irish rock.

Speaker 2 (17:32):
Right. So the data scientist asks their freshly baked ALS
model to recommend five brand new artists for this user
based on those latent factors, and the algorithm recommends 50
Cent, Jay-Z, Kanye West, 2Pac, and The Game.

Speaker 1 (17:45):
So did the algorithm just completely fail? I mean, we
fed it a nuanced history including Jurassic 5 and the Saw Doctors,
and it basically just handed us a generic list of
the most mainstream, commercially successful hip hop artists of two
thousand and five. At first glance, that looks like a terrible,
non-personalized recommendation.

Speaker 2 (18:01):
It certainly looks like a failure to a human observer
who might be looking for a clever indie rock recommendation.
But that is the problem with subjective judgment. How do
we objectively, mathematically prove whether the algorithm is doing a
good job across all one hundred and forty one thousand users.

Speaker 1 (18:18):
Right, Because you can't have a human manually judge every
single output.

Speaker 2 (18:22):
Exactly, and you can't just test the algorithm by asking
it to recommend music the user has already played.

Speaker 1 (18:27):
Because a recommender's entire job is to find new things.
If it just tells me to listen to David Gray again,
it hasn't discovered a latent taste. It just memorized my
history.

Speaker 2 (18:37):
Bingo. To solve this, the authors explain the proper evaluation technique.
You take a user's actual listening history, and you purposefully
hide a portion of their plays from the model. You
hold that data back, effectively blinding the algorithm to a
portion of reality.

Speaker 1 (18:52):
Okay, so you train the model on say, eighty percent
of their history, and keep the other twenty percent locked
in a vault.

Speaker 2 (18:59):
Then you ask the algorithm to rank all one point
six million artists in the entire database for that user,
from best fit to worst fit. If the algorithm is
truly working, it should naturally rank those hidden artists, the
ones in the vault, very near the top of that
one point six million list, even though it was never
explicitly told the user listened to them.

Speaker 1 (19:19):
Oh that's brilliant.

Speaker 2 (19:20):
That is how you objectively prove the latent factors are
actually capturing the shape of the user's taste.
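
One way to turn that hold-out idea into a number is an AUC-style score: across every pairing of a hidden artist with an artist the user never played, how often does the model score the hidden one higher? The sketch below assumes the model's scores for one user are already available as a dictionary; the function name `holdout_auc`, the artist labels, and the scores are all invented for illustration.

```python
def holdout_auc(scores, held_out, seen):
    """Fraction of (held-out, never-played) artist pairs ranked in the right order.

    scores   -- dict mapping artist -> model score for one user
    held_out -- artists the user really played but that were hidden from training
    seen     -- artists the model was trained on for this user (excluded here)
    """
    negatives = [a for a in scores if a not in held_out and a not in seen]
    wins = 0.0
    for pos in held_out:
        for neg in negatives:
            if scores[pos] > scores[neg]:
                wins += 1.0
            elif scores[pos] == scores[neg]:
                wins += 0.5   # ties count as half
    return wins / (len(held_out) * len(negatives))


# Hypothetical scores for a single user over five artists.
scores = {"artist_a": 0.9, "artist_b": 0.8, "artist_c": 0.3,
          "artist_d": 0.2, "artist_e": 0.1}
seen = {"artist_a"}                   # the training portion of the history
held_out = {"artist_b", "artist_d"}   # the portion locked in the vault

print(holdout_auc(scores, held_out, seen))  # 0.75
```

A score of 1.0 would mean every hidden artist outranks every never-played artist, and 0.5 is what random scores would give on average. Averaging this over many users gives a single figure that does not depend on anyone's subjective reading of the recommendations.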

Speaker 1 (19:27):
So we step back and we look at this whole pipeline.
We've gone from the messy reality of data collection, the
challenge of linking records with probabilistic matching, the noise of
implicit feedback, all the way to alternating matrices to uncover
hidden preferences. And why does this matter to you, the listener?
Because this exact pipeline is running your life.

Speaker 2 (19:46):
Oh, it really is. Every single time you open Spotify, Netflix,
or Amazon, you are generating implicit actions.

Speaker 1 (19:52):
Just by clicking around.

Speaker 2 (19:53):
Yeah, lingering on a movie title for three seconds, skipping
a track halfway through, clicking a product and then hitting
the back button. All of those tiny, seemingly insignificant actions
are being ingested, cleaned, and fed into giant matrices to
uncover latent tastes you didn't even know you had.

Speaker 1 (20:12):
So when you feel like Spotify just read your mind
with a perfect discovery playlist, it didn't. It just solved
matrix X. It's that illusion of magic we talked about
at the top of the show. We think the machine
knows our soul. We think the tech is clean and brilliant.

Speaker 2 (20:26):
But beneath the surface, it is a sausage factory.

Speaker 1 (20:29):
Exactly. It's a sausage factory of noisy data, probabilistic guesses, and
millions of alternating calculations zeroing in on who we are
based on the faint, messy footprints we leave behind.

Speaker 2 (20:39):
It is an elegant yet brutal mathematical approximation of human preference.

Speaker 1 (20:44):
Which leaves us with a final thought to mull over.
If these algorithms, like alternating least squares, rely entirely on
what you and millions of others have already done, your
past implicit behavior, are they actually discovering new, unique tastes
for you, or are they just mathematically trapping you in
an inescapable echo chamber, constantly reinforcing the person you used

(21:05):
to be.

Speaker 2 (21:05):
A fascinating mathematical loop to end on.

Speaker 1 (21:07):
Something to think about the next time
your streaming app cues up the exact song you expected.
Until next time, keep digging beneath the surface.
