Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Welcome to the deep dive. You know, when most people
hear machine learning or maybe AI, I think the first
thing that comes to mind is the code.
Speaker 2 (00:10):
Right, oh, absolutely. Python scripts, neural nets, all that complex
engineering stuff. That's the flashy part.
Speaker 1 (00:15):
Yeah, the engine. It is the engine. Yeah.
Speaker 2 (00:17):
But what our source material for today really emphasizes, and
it's looking specifically at the prerequisites for even building those engines,
is that the real foundation isn't the code. Okay.
It's math, specifically statistics. You could almost call it
a preliminary requirement.
Speaker 1 (00:33):
Right. So that's our mission today, then. We're aiming to
give you a bit of an intellectual shortcut here. We
want to pull out the essential statistical concepts, the
core vocabulary and the toolkit you need for exploring data,
cleaning it up, getting it ready for predictive modeling.
Speaker 2 (00:48):
Basically saving you the trouble of reading the whole textbook page...
Speaker 1 (00:51):
By page, exactly. This is about getting that statistical fluency
you need before you even think about training...
Speaker 2 (00:56):
A model. And it's not just about passing some exam.
You genuinely need these concepts because, well, every single step
in an ML pipeline, from the moment you get the
data to evaluating how well your model did, is fundamentally
a statistical operation.
Speaker 1 (01:13):
Okay, so where do we start? I guess right at
the beginning, recognizing what kind of data you're even dealing with.
Speaker 2 (01:20):
That's the spot. The sources remind us that data collection
isn't just, you know, chaos. It's usually driven by trying
to answer some real-world question.
Speaker 1 (01:28):
Like market research before you launch a product, maybe.
Speaker 2 (01:31):
Exactly. Is this product feasible? Who are we trying to
reach? That kind of thing.
Speaker 1 (01:34):
And the answers we get, the actual numbers we collect
and store...
Speaker 2 (01:38):
Those are grouped into what statisticians call random variables. They're
the numerical backbone of whatever research you're doing.
Speaker 1 (01:44):
Okay, random variables. So how do we bring some order
to that? How do we structure them?
Speaker 2 (01:49):
Well, we mainly split them based on what they can
actually measure. First up, you've got discrete random variables.
Speaker 1 (01:55):
Discrete, meaning separate?
Speaker 2 (01:57):
Yeah, think fixed counts. They have to be whole numbers,
like counting how many people clicked on an ad or
the number of gold medals a country won in the
Olympics. No fractions. Fixed counts, definite fixed counts.
Speaker 1 (02:09):
So what's the other kind, the stuff that isn't fixed counts?
Speaker 2 (02:12):
That would be your continuous random variable. This one stores
values that can be decimals or floats, and theoretically, at
least you could measure them with infinite precision.
Speaker 1 (02:22):
Like height or weight.
Speaker 2 (02:24):
Perfect examples height, weight, temperature. You can always, in theory,
add another decimal place to make the measurement finer.
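For anyone who wants to see that distinction in code, here's a tiny Python sketch (illustrative only, not from the episode; the numbers are made up):

```python
# Discrete random variables: fixed whole-number counts
ad_clicks = [12, 7, 0, 33]          # how many people clicked an ad
gold_medals = [39, 38, 27]          # medals won by each country

# Continuous random variables: decimal measurements you could, in theory,
# keep refining with more decimal places
heights_m = [1.62, 1.785, 1.9034]   # height in meters
temps_c = [21.4, 19.95, 23.001]     # temperature in degrees Celsius
```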
Speaker 1 (02:31):
Okay, that makes sense for numbers. But what if the
data isn't a number at all? Like if it's just
a label, someone's city or maybe their preferred brand.
Speaker 2 (02:40):
Ah, good question. Then you're working with categorical variables. And
this is where we need another layer of distinction, because
how an ML algorithm handles these depends a lot on
whether the categories have some kind of internal meaning or order.
Speaker 1 (02:54):
Wait, internal meaning? Why does that matter? Isn't red just
red to a computer?
Speaker 2 (02:59):
It matters quite a bit, actually, mostly because it impacts
how you encode that data before feeding it to a model.
If the categories have absolutely no inherent rank or order,
we call them nominal variables. Okay, think gender, like male/female,
or maybe types of fruit: apple, banana, orange. You can't
(03:19):
really rank one above the other logically. They're just distinct groups.
Speaker 1 (03:23):
Right, distinct groups makes sense. But what if they can
be ranked? And it's...
Speaker 2 (03:27):
An ordinal variable. Think about, say, a customer satisfaction rating: low, medium, high.
Speaker 1 (03:33):
Ah, Okay, there's a clear hierarchy.
Speaker 2 (03:35):
There exactly. And knowing this difference is key before you
start your feature engineering. For nominal variables, the algorithm often needs
to treat each category as totally separate, maybe using something
called one-hot encoding. But for ordinal variables you might
be able to use encoding methods that preserve that ranking,
which can sometimes make the model simpler or even more accurate.
(03:56):
So yeah, knowing this distinction is pretty fundamental.
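To make the encoding point concrete, here's a minimal pandas sketch (pandas isn't named in the episode; the columns and categories are invented) of one-hot encoding a nominal variable versus rank-preserving codes for an ordinal one:

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "orange", "apple"],      # nominal: no order
    "satisfaction": ["low", "high", "medium", "low"],     # ordinal: low < medium < high
})

# Nominal -> one-hot encoding: each category becomes its own 0/1 column
fruit_onehot = pd.get_dummies(df["fruit"], prefix="fruit")

# Ordinal -> integer codes that preserve the ranking low < medium < high
order = ["low", "medium", "high"]
df["satisfaction_code"] = pd.Categorical(
    df["satisfaction"], categories=order, ordered=True
).codes

print(fruit_onehot)
print(df[["satisfaction", "satisfaction_code"]])
```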
Speaker 1 (03:59):
All right. So we've figured out what kind of
variables we have. What's the immediate next step? Usually it's
descriptive statistics, right, trying to summarize potentially huge data sets.
Speaker 2 (04:09):
Yes, exactly. We're moving from just defining things to actually
starting to tell the story hidden in the data. The first
step is usually summarizing it, focusing on its center and
its spread.
Speaker 1 (04:19):
Okay, center and spread. Let's start with the center. Measures
of central tendency, is that the term?
Speaker 2 (04:24):
That's the one. And the big three here are the
mean, the median, and the mode.
Speaker 1 (04:28):
Everyone knows the mean, the average, right? Add them all up,
divide by how many there are. Seems simple. But what's
the specific ML insight? Why is it so important?
Speaker 2 (04:37):
Well, mathematically, the mean is the center of balance for
your data. But what's really interesting is how it connects
directly to prediction. How so? When you build, say, a
simple linear regression model, what you're essentially doing is trying
to draw a line that minimizes the squared distance between
that line and all your data points. And the mean
turns out to be the single value that inherently
(05:00):
minimizes that squared error.
Speaker 1 (05:02):
Huh. So it's like the best guess if you knew
nothing else.
Speaker 2 (05:06):
It's the optimal point prediction if you had zero other information. Yes,
it's a point of minimum error.
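You can check that claim for yourself with a quick, illustrative NumPy sketch (not from the episode): sweep candidate "single value" predictions and see which one minimizes the sum of squared errors.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 5.0, 10.0])

def sum_squared_error(center):
    # Total squared distance from one candidate value to every data point
    return np.sum((data - center) ** 2)

candidates = np.linspace(0.0, 12.0, 1201)   # steps of 0.01
errors = [sum_squared_error(c) for c in candidates]
best = candidates[np.argmin(errors)]

print(best, data.mean())   # both print 5.0: the mean minimizes the squared error
```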
Speaker 1 (05:11):
Okay, but the mean has that famous weakness, right, the
outlier problem. Like if you're averaging salaries in a
small startup and suddenly the CEO's twenty-million-dollar salary
gets added in...
Speaker 2 (05:22):
Exactly. That one massive outlier just yanks the average
way, way up, making it not very representative of the
typical employee.
Speaker 1 (05:31):
So that's where the median comes in.
Speaker 2 (05:32):
Precisely. The median is the exact middle value when you
sort your data from smallest to largest. Fifty percent of
the data is below it, fifty percent is above it.
Speaker 1 (05:40):
And because it only cares about the middle position.
Speaker 2 (05:43):
It's incredibly robust to those extreme outliers. That twenty-million-dollar
salary doesn't really affect the median much, if at all.
Speaker 1 (05:50):
And if you have an even number of data points,
no single middle value.
Speaker 2 (05:54):
Simple, you just take the average of the two middle values.
Still gives you that robust central point.
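Here's a small sketch of that salary example (the numbers are invented for illustration), showing the mean getting yanked while the median barely moves:

```python
import statistics

salaries = [55_000, 60_000, 62_000, 65_000, 70_000]
print(statistics.mean(salaries), statistics.median(salaries))   # 62400  62000

# Add one extreme outlier, say a CEO's twenty-million-dollar pay package
salaries_with_ceo = salaries + [20_000_000]
print(statistics.mean(salaries_with_ceo))    # ~3,385,333: no longer "typical"
print(statistics.median(salaries_with_ceo))  # 63500: average of the two middle values
```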
Speaker 1 (05:59):
Okay, so mean is error-minimizing but sensitive to outliers,
median is robust. What about the third one, the mode?
Speaker 2 (06:05):
The mode is even simpler. It's just the value that
shows up most often in your data set. The most frequent? Yep.
It's typically most useful for categorical data, finding the most
popular choice or the most common group.
Speaker 1 (06:18):
Any quirks with the mode?
Speaker 2 (06:20):
Couple of interesting ones. It's the only measure of center that
might not exist at all, say if no value repeats, which sounds
weird but can happen. And you can also have more
than one mode. Like bimodal? Exactly, bimodal if there
are two peaks, or even multimodal. That can be a
clue that your data might actually be composed of a
couple of different underlying groups or clusters.
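A quick Python illustration (invented data, not from the episode), including the multimodal case:

```python
import statistics

favorite_fruit = ["apple", "banana", "apple", "orange", "apple"]
print(statistics.mode(favorite_fruit))    # 'apple': the most frequent category

# A bimodal sample: two values tie for most frequent
scores = [1, 2, 2, 3, 5, 5, 9]
print(statistics.multimode(scores))       # [2, 5]: two peaks, maybe two underlying groups
```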
Speaker 1 (06:39):
Okay, so we found the center using mean, median or mode.
But you said center alone isn't enough. Two data sets
could have the same mean but look totally different.
Speaker 2 (06:50):
Right. Imagine one data set clustered tightly around the mean
and another spread way out, same mean, very different story.
That's why we need measures of dispersion, or spread, and
the main ones are variance and standard deviation, SD.
Speaker 1 (07:04):
Okay, variance and SD. They both measure spread, right? How
far data points tend to be from the center, usually
the mean.
Speaker 2 (07:10):
That's the core idea. A high value for either variance
or SD means the data is really spread out, dispersed widely.
A small value means everything's huddled close to the mean.
Speaker 1 (07:21):
So if they measure the same basic thing, why do
we need both? What's the practical difference, especially thinking about
machine learning?
Speaker 2 (07:28):
Okay, so mathematically, the standard deviation is just the square
root of the variance. The absolute key difference is the
units. The units? Yeah, variance is calculated using squared differences, so
its units are the square of the original data's units.
If you measure height in meters, the variance is
in meters squared, which is kind of awkward to interpret directly.
Speaker 1 (07:49):
Not very intuitive.
Speaker 2 (07:50):
But the standard deviation, because it's the square root, is
back in the original units. So if your height data
is in meters, the SD is also in meters.
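For instance (the heights here are invented), the units difference shows up immediately in code:

```python
import statistics

heights_m = [1.60, 1.65, 1.70, 1.75, 1.80]   # meters

variance = statistics.pvariance(heights_m)    # in meters SQUARED
sd = statistics.pstdev(heights_m)             # back in meters

print(round(variance, 4), "m^2")   # 0.005 m^2: awkward to interpret directly
print(round(sd, 4), "m")           # 0.0707 m: directly comparable to the mean height
```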
Speaker 1 (07:58):
Ah, okay, so SD is easier to compare directly
to the mean.
Speaker 2 (08:02):
Much easier. It makes SD far better for interpretation, for reporting,
and, really crucially, for something called feature scaling or normalization
in ML.
Speaker 1 (08:10):
Why feature scaling?
Speaker 2 (08:11):
Well, often in ML you have features measured on totally
different scales, maybe age in years, income in thousands of dollars,
height in centimeters. Models can sometimes struggle with that or give
too much weight to features with larger numerical values.
Speaker 1 (08:24):
So you need to put them on a level playing
field.
Speaker 2 (08:27):
Exactly. You often rescale features so they have a mean of
zero and a standard deviation of one, and standard deviation
is the metric you use to do that rescaling properly.
It's fundamental for preprocessing data for many algorithms.
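A minimal sketch of that rescaling, the common z-score standardization (the feature values are made up for illustration):

```python
import numpy as np

age_years = np.array([23.0, 35.0, 47.0, 51.0, 62.0])
income_usd = np.array([28_000.0, 54_000.0, 61_000.0, 90_000.0, 120_000.0])

def standardize(x):
    # Shift and rescale so the feature has mean 0 and standard deviation 1
    return (x - x.mean()) / x.std()

age_scaled = standardize(age_years)
income_scaled = standardize(income_usd)

print(age_scaled.mean(), age_scaled.std())       # ~0.0 and 1.0 (up to float error)
print(income_scaled.mean(), income_scaled.std()) # ~0.0 and 1.0
```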
Speaker 1 (08:39):
Okay, we've gone from defining data types to summarizing them
with center and spread. Now how do we pivot towards
using this data for prediction? That feels like the next
logical step.
Speaker 2 (08:50):
It is, and that pivot really starts by defining a
potential cause-and-effect relationship. This is where we introduce
the concepts of dependent and independent variables.
Speaker 1 (08:59):
Right, setting up the experiment.
Speaker 2 (09:01):
Essentially, pretty much. We're defining our modeling goal: what factor
are we changing or observing, the independent variable, and what
outcome are we measuring the effect on, the dependent variable.
Speaker 1 (09:14):
So the independent variable is the input, the thing we control,
or the factor we think is causing a change.
Speaker 2 (09:20):
Exactly. Like in a drug trial, the dosage level would
be the independent variable. Or, using an example from the source,
maybe the type of pitch a pitcher throws to a batter.
That's the input being varied.
Speaker 1 (09:30):
And the dependent variable is the output, the result,
what happens because of the independent variable.
Speaker 2 (09:35):
Yes, it's the variable being tested or measured that responds
to the changes. In that baseball example, the batter's performance,
did they hit it, and how well? That's the dependent variable.
Its value depends on the pitch type.
Speaker 1 (09:47):
And getting these two defined correctly seems absolutely critical. It's
basically framing the entire problem you want your ML model
to solve.
Speaker 2 (09:55):
It is. You're specifying the relationship you intend to model
and predict.
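In code, that framing usually comes down to deciding which column is the input X and which is the target y before you fit anything. Here's a minimal scikit-learn sketch (scikit-learn isn't named in the episode, and the dosage and recovery numbers are invented purely to illustrate the setup):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Independent variable (the input we vary): drug dosage in mg
dosage_mg = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

# Dependent variable (the outcome we measure): recovery time in days
recovery_days = np.array([14.0, 12.5, 11.0, 10.2, 9.1])

model = LinearRegression().fit(dosage_mg, recovery_days)
print(model.coef_, model.intercept_)   # estimated effect of dosage on recovery time
print(model.predict([[35.0]]))         # predicted recovery at a dosage we didn't test
```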
Speaker 1 (09:58):
Now, underpinning all of this statistical analysis, all these
measurements and relationships, there's a really core principle that gives
us confidence in the results, right? The law of large
numbers, the LLN.
Speaker 2 (10:09):
Ah. Yes, the LLN. It's absolutely fundamental. It's kind of
the bedrock that makes statistics work reliably.
Speaker 1 (10:15):
So what does it state, in simple terms?
Speaker 2 (10:17):
Basically says that if you repeat the same experiment over
and over and over again a huge number of times,
the average of the results you get will get closer
and closer to the true expected theoretical value.
Speaker 1 (10:30):
Like flipping a coin.
Speaker 2 (10:31):
Perfect example. Flip a coin just ten times and you might
easily get, say, seven heads and three tails. That's pretty
far from the expected fifty-fifty, right? But flip
that same coin a million times or ten million times,
and the ratio of heads to tails is going to get
incredibly close to exactly one to one. It converges on
the true probability.
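You can watch that convergence in a simple simulation (an illustrative sketch, not from the source material):

```python
import random

random.seed(0)

def heads_ratio(n_flips):
    # Simulate n fair coin flips and return the fraction that came up heads
    return sum(random.random() < 0.5 for _ in range(n_flips)) / n_flips

for n in (10, 1_000, 100_000, 1_000_000):
    print(n, round(heads_ratio(n), 4))
# Small n can wander far from 0.5; large n hugs it. That's the law of large numbers.
```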
Speaker 1 (10:50):
And it's that convergence that lets us trust statistical methods.
Speaker 2 (10:54):
Exactly. It validates the whole idea of using probabilities and statistics
derived from experiments or samples to understand underlying truths. It
allows us to have confidence in probabilistic models.
Speaker 1 (11:04):
So the LLN gives us the confidence then to take
results we see in a smaller sample of data and
make reasonable conclusions about the entire population it came from,
which sounds like statistical inference.
Speaker 2 (11:15):
That's precisely what statistical inference is about, and it leads
directly to the main framework we use for making those decisions.
Hypothesis testing.
Speaker 1 (11:23):
Okay, hypothesis testing. This is where we formally test an
idea using the data.
Speaker 2 (11:28):
Yes, it's the structured process where we use the summary
statistics we calculated, combined with our understanding of probability and
the LLN, to draw conclusions about a whole population based
only on evidence from a sample.
Speaker 1 (11:41):
And it usually involves setting up two competing ideas beforehand.
Speaker 2 (11:44):
Correct. You have a kind of statistical showdown. The main
goal is to see if there's enough evidence in your
sample data to reject the null hypothesis.
Speaker 1 (11:52):
The null hypothesis being the default skeptical position.
Speaker 2 (11:56):
Always. It's the statement of no effect, no difference, or
no relationship. For example, this new drug has no effect
on recovery time compared to the placebo. It's the
status quo assumption, and we test that against the
alternative hypothesis. This is the statement that contradicts the null.
It's what you, as the researcher, might actually suspect or
hope to prove, like, no, this new drug does reduce
(12:18):
recovery time.
Speaker 1 (12:19):
So the whole process is about gathering enough statistical evidence
to confidently say, okay, we can reject the no effect
idea in favor of the there is an effect idea.
Speaker 2 (12:29):
Precisely. And that level of statistical confidence, often
expressed as a p-value or a confidence interval, is
what determines whether you feel justified in acting on your
findings or making a claim about the population.
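For a concrete feel of that showdown, here's a minimal two-sample t-test sketch using SciPy (SciPy isn't mentioned in the episode, and the recovery-time samples are invented purely for illustration):

```python
from scipy import stats

# Recovery time in days: placebo group vs. new-drug group (made-up samples)
placebo = [14.1, 13.8, 15.0, 14.6, 13.9, 14.4, 15.2, 14.0]
drug    = [12.9, 13.1, 12.5, 13.4, 12.8, 13.0, 12.6, 13.3]

# Null hypothesis: the drug has no effect (both groups share the same mean)
t_stat, p_value = stats.ttest_ind(drug, placebo)

print(t_stat, p_value)
if p_value < 0.05:   # the common 95% confidence threshold
    print("Reject the null: the sample suggests the drug changes recovery time.")
else:
    print("Not enough evidence to reject the null.")
```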
Speaker 1 (12:42):
All right, let's pull this together. We've walked through quite
a statistical toolkit, understanding the different types of variables you encounter.
Speaker 2 (12:48):
Discrete, continuous, nominal, ordinal.
Speaker 1 (12:51):
Yeah, then summarizing them with measures of center like the
mean and median, understanding spread with standard deviation, especially why
SD is useful practically.
Speaker 2 (13:01):
Right, those units matter for comparison and scaling.
Speaker 1 (13:03):
Then we moved into setting up predictions by defining dependent
and independent variables, and finally, the framework for making decisions
based on sample data hypothesis testing built on the confidence
given by the law of large numbers.
Speaker 2 (13:16):
It really does form the essential foundation. You can see
how these concepts are, well, mandatory before you jump into
the more complex ML algorithms.
Speaker 1 (13:24):
They really are the entry point for any serious study
or application.
Speaker 2 (13:28):
But here's a final thought, something that connects back to
that law of large numbers. The LLN guarantees convergence. It
gives us certainty, but only over a massive number of trials.
A million coin flips.
Speaker 1 (13:40):
Right, it requires huge scale.
Speaker 2 (13:43):
Exactly. But in the real world, doing market analysis, building a
product prototype, maybe even running a clinical trial, we almost
never have a million data points. We work with samples,
sometimes relatively small samples because collecting data is expensive or
time consuming.
Speaker 1 (13:58):
So the certainty we get isn't absolute. It's usually probabilistic,
like saying we're ninety five percent confident or maybe ninety
nine percent confident.
Speaker 2 (14:06):
Right, which leads to the provocative question. If the law
of large numbers only guarantees truth over immense scale, how
often are our everyday decisions, maybe in business, launching a
new feature, or even interpreting a political poll, actually based
on what could be called the fallacy of small numbers?
Speaker 1 (14:22):
Meaning we're drawing conclusions from samples that might be
too small to really trust the LLN's guarantee?
Speaker 2 (14:30):
Potentially. So the question for you, the listener, is what level
of statistical certainty, that ninety-five percent, that ninety-nine
percent, are you willing to accept? Especially when you're moving
from analyzing a potentially small, expensive sample to making a
big assumption about the entire population, an assumption that could
have major consequences, maybe cost millions.
Speaker 1 (14:49):
How much uncertainty can you live with?
Speaker 2 (14:51):
What's your threshold for risk framed in that statistical confidence?
Speaker 1 (14:54):
Definitely something to think about. A great place to leave
it for the deep dive. Thanks for joining us.