
December 15, 2025 • 15 mins
An educational resource detailing statistical concepts foundational to machine learning, including descriptive statistics (mean, median, mode, and measures of dispersion), probability theory, and methods for parameter estimation and hypothesis testing. The book covers various analytical techniques such as ANOVA, regression models (linear, logistic, and regularized forms), and non-parametric statistics, often illustrating their practical application using Python libraries like Pandas and NumPy. The text also offers an overview of machine learning algorithms, including supervised and unsupervised methods, positioning statistics as the core discipline underpinning these advanced applications.

You can listen and download our episodes for free on more than 10 different platforms:
https://linktr.ee/cyber_security_summary

Get the Book now from Amazon:
https://www.amazon.com/Statistics-Machine-Learning-Implement-Statistical/dp/9388511972?&linkCode=ll1&tag=cvthunderx-20&linkId=334106a284fd7b6360bf1aa51ed5b699&language=en_US&ref_=as_li_ss_tl

Discover our free courses in tech and cybersecurity. Start learning today:
https://linktr.ee/cybercode_academy

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Welcome to the deep dive. You know, when most people
hear machine learning or maybe AI, I think the first
thing that comes to mind is the code.

Speaker 2 (00:10):
Right, oh, absolutely. Python scripts, neural nets, all that complex
engineering stuff. That's the flashy part.

Speaker 1 (00:15):
Yeah, the engine. It is the engine. Yeah.

Speaker 2 (00:17):
But what our source material for today really emphasizes, and it's looking specifically at the prerequisites for even building those engines, is that the real foundation isn't the code, okay, it's math. Specifically, it's statistics. You could almost call it a preliminary requirement.

Speaker 1 (00:33):
Right. So that's our mission today, then. We're aiming to give you a bit of an intellectual shortcut here. We want to pull out the essential statistical concepts, the core vocabulary and the toolkit you need for exploring data, cleaning it up, getting it ready for predictive modeling.

Speaker 2 (00:48):
Basically saving you the trouble of reading the whole textbook page.

Speaker 1 (00:51):
By page, exactly. This is about getting that statistical fluency you need before you even think about training.

Speaker 2 (00:56):
A model, and it's not just about passing some exam. You genuinely need these concepts because, well, every single step in an ML pipeline, from the moment you get the data to evaluating how well your model did, is fundamentally a statistical operation.

Speaker 1 (01:13):
Okay, so where do we start? I guess right at the beginning, recognizing what kind of data you're even dealing with.

Speaker 2 (01:20):
That's the spot. The sources remind us that data collection isn't just, you know, chaos. It's usually driven by trying to answer some real-world question.

Speaker 1 (01:28):
Like market research before you launch a product, maybe.

Speaker 2 (01:31):
Exactly. Is this product feasible? Who are we trying to reach? That kind of thing.

Speaker 1 (01:34):
And the answers we get, the actual numbers we collect and store.

Speaker 2 (01:38):
Those are grouped into what statisticians call random variables. They're
the numerical backbone of whatever research you're doing.

Speaker 1 (01:44):
Okay, random variables. So how do we bring some order to that? How do we structure them?

Speaker 2 (01:49):
Well, we mainly split them based on what they can
actually measure. First up, you've got discrete random variables.

Speaker 1 (01:55):
Discrete meaning separate.

Speaker 2 (01:57):
Yeah, think fixed counts. They have to be whole numbers. It can be like counting how many people clicked on an ad or the number of gold medals a country won in the Olympics. Fixed counts? Definite fixed counts.

Speaker 1 (02:09):
So what's the other kind, the stuff that isn't fixed counts?

Speaker 2 (02:12):
That would be your continuous random variable. This one stores
values that can be decimals or floats, and theoretically, at
least you could measure them with infinite precision.

Speaker 1 (02:22):
Like height or weight.

Speaker 2 (02:24):
Perfect examples: height, weight, temperature. You can always, in theory, add another decimal place to make the measurement finer.
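To make that concrete, here is a minimal sketch in Python using pandas, one of the libraries the book's description mentions; the column names and values are invented purely for illustration:

```python
import pandas as pd

# Discrete variables store fixed, whole-number counts;
# continuous variables store measurements that can take any decimal value.
df = pd.DataFrame({
    "ad_clicks": [12, 40, 7, 23],            # discrete: integer counts
    "gold_medals": [3, 0, 11, 5],             # discrete: integer counts
    "height_m": [1.72, 1.655, 1.80, 1.69],    # continuous: arbitrary precision
    "temp_c": [21.4, 19.85, 23.0, 20.1],      # continuous
})

print(df.dtypes)  # int64 for the counts, float64 for the measurements
```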

Speaker 1 (02:31):
Okay, that makes sense for numbers. But what if the data isn't a number at all? Like if it's just a label, someone's city or maybe their preferred brand.

Speaker 2 (02:40):
Ah, good question. Then you're working with categorical variables. And this is where we need another layer of distinction, because how an ML algorithm handles these depends a lot on whether the categories have some kind of internal meaning or order.

Speaker 1 (02:54):
Wait, internal meaning? Why does that matter? Isn't red just
red to a computer?

Speaker 2 (02:59):
It matters quite a bit, actually, mostly because it impacts how you encode that data before feeding it to a model. If the categories have absolutely no inherent rank or order, we call them nominal variables. Okay, think gender, like male, female, or maybe types of fruit: apple, banana, orange. You can't

(03:19):
really rank one above the other. Logically, they're just distinct groups.

Speaker 1 (03:23):
Right, distinct groups makes sense. But what if they can be ranked? And it's.

Speaker 2 (03:27):
An ordinal variable. Think about, say, a customer satisfaction rating: low, medium, high.

Speaker 1 (03:33):
Ah, Okay, there's a clear hierarchy.

Speaker 2 (03:35):
There exactly. And knowing this difference is key before you start your feature engineering. For nominal variables, the algorithm often needs to treat each category as totally separate, maybe using something called one-hot encoding. But for ordinal variables you might be able to use encoding methods that preserve that ranking, which can sometimes make the model simpler or even more accurate.

(03:56):
So yeah, knowing this distinction is pretty fundamental.
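As a rough illustration of that encoding difference, here is a small Python sketch with pandas; the categories and the rank mapping are invented for the example, not taken from the book:

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "orange", "apple"],      # nominal: no order
    "satisfaction": ["low", "high", "medium", "medium"],  # ordinal: ranked
})

# Nominal variable -> one-hot encoding: each category becomes its own 0/1 column.
one_hot = pd.get_dummies(df["fruit"], prefix="fruit")

# Ordinal variable -> integer encoding that preserves the ranking.
rank = {"low": 0, "medium": 1, "high": 2}
df["satisfaction_encoded"] = df["satisfaction"].map(rank)

print(one_hot)
print(df[["satisfaction", "satisfaction_encoded"]])
```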

Speaker 1 (03:59):
All right. So we figured out what kind of variables we have. What's the immediate next step? Usually it's descriptive statistics, right, trying to summarize potentially huge data sets.

Speaker 2 (04:09):
Yes, exactly. We're moving from just defining things to actually starting to tell the story hidden in the data. The first step is usually summarizing it, focusing on its center and its spread.

Speaker 1 (04:19):
Okay, center and spread. Let's start with the center. Measures of central tendency, is that the term? That's the one, and.

Speaker 2 (04:24):
The big three here are the mean, the median, and the mode.

Speaker 1 (04:28):
Everyone knows the mean, the average, right? Add them all up, divide by how many there are. Seems simple. But what's the specific ML insight? Why is it so important?

Speaker 2 (04:37):
Well, mathematically, the mean is the center of balance for your data. But what's really interesting is how it connects directly to prediction. How so? When you build, say, a simple linear regression model, what you're essentially doing is trying to draw a line that minimizes the squared distance between that line and all your data points. Yeah, the mean turns out to be the single value that inherently

(05:00):
minimizes that squared error.

Speaker 1 (05:02):
Huh. So it's like the best guess if you knew
nothing else.

Speaker 2 (05:06):
It's the optimal point prediction if you had zero other information. Yes,
it's a point of minimum error.
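You can check that claim numerically. Here is a minimal NumPy sketch (the data is arbitrary, just for illustration) that scans candidate guesses and finds the one with the smallest total squared error:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 5.0, 9.0])

def sum_squared_error(guess):
    # Total squared distance from a single constant guess to every data point.
    return np.sum((data - guess) ** 2)

candidates = np.linspace(data.min(), data.max(), 1001)
errors = [sum_squared_error(c) for c in candidates]

print(candidates[np.argmin(errors)])  # the best constant guess: ~4.8
print(data.mean())                    # the mean: 4.8
```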

Speaker 1 (05:11):
Okay, but the mean has that famous weakness, right, the outlier problem. Like if you're averaging salaries in a small startup and suddenly the CEO's twenty-million-dollar salary gets added in.

Speaker 2 (05:22):
Exactly, that one massive outlier just yanks the average way, way up, making it not very representative of the typical employee.

Speaker 1 (05:31):
So that's where the median comes in.

Speaker 2 (05:32):
Precisely. The median is the exact middle value when you
sort your data from smallest to largest. Fifty percent of
the data is below it, fifty percent is above it.

Speaker 1 (05:40):
And because it only cares about the middle position.

Speaker 2 (05:43):
It's incredibly robust to those extreme outliers. That twenty million
dollars salary doesn't really affect the median much, if at all.

Speaker 1 (05:50):
And if you have an even number of data points,
no single middle value.

Speaker 2 (05:54):
Simple, you just take the average of the two middle values.
Still gives you that robust central point.
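A quick NumPy sketch of that startup-salary scenario (the numbers are made up) shows how the mean gets dragged while the median barely moves:

```python
import numpy as np

salaries = np.array([48_000, 52_000, 55_000, 61_000, 64_000])
print(np.mean(salaries), np.median(salaries))   # 56000.0 and 55000.0

# Add one massive outlier: the CEO's salary.
with_ceo = np.append(salaries, 20_000_000)
print(np.mean(with_ceo), np.median(with_ceo))   # ~3.38 million vs 58000.0
```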

Speaker 1 (05:59):
Okay, so the mean is error-minimizing but sensitive to outliers, and the median is robust. What about the third one, the mode?

Speaker 2 (06:05):
The mode is even simpler. It's just the value that shows up most often in your data set. Most frequent? Yep. It's typically most useful for categorical data, finding the most popular choice or the most common group.

Speaker 1 (06:18):
Any quirks with the mode?

Speaker 2 (06:20):
Couple interesting ones. It's the one measure of center that might not actually be present in your data at all, if no value repeats, which sounds weird but can happen. And you can also have more than one mode, like bimodal? Exactly, bimodal if there are two peaks, or even multimodal. That can be a clue that your data might actually be composed of a couple of different underlying groups or clusters.
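In pandas, for instance, Series.mode() returns every tied value, so a bimodal column shows up naturally (toy data again):

```python
import pandas as pd

colors = pd.Series(["red", "blue", "blue", "green", "red"])
print(colors.mode())  # 'blue' and 'red' both appear twice: a bimodal result
```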

Speaker 1 (06:39):
Okay, so we found the center using mean, median or mode.
But you said center alone isn't enough. Two data sets
could have the same mean but look totally different.

Speaker 2 (06:50):
Right. Imagine one data set clustered tightly around the mean and another spread way out. Same mean, very different story. That's why we need measures of dispersion, or spread, and the main ones are variance and standard deviation, SD.

Speaker 1 (07:04):
Okay, variance and SD. They both measure spread, right, how far data points tend to be from the center, usually the mean.

Speaker 2 (07:10):
That's the core idea. A high value for either variance or SD means the data is really spread out, dispersed widely. A small value means everything's huddled close to the mean.

Speaker 1 (07:21):
So if they measure the same basic thing, why do
we need both? What's the practical difference, especially thinking about
machine learning?

Speaker 2 (07:28):
Okay, so mathematically, the standard deviation is just the square root of the variance. The absolute key difference is the units. Units? Yeah, variance is calculated using squared differences, so its units are the square of the original data's units. If you measure height in meters, the variance is in meters squared, which is kind of awkward to interpret directly.

Speaker 1 (07:49):
Not very intuitive.

Speaker 2 (07:50):
But the standard deviation, because it's the square root, is
back in the original units. So if your height data
is in meters, the SD is also in meters.
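A tiny NumPy check of that relationship, with made-up heights in meters:

```python
import numpy as np

heights_m = np.array([1.60, 1.68, 1.75, 1.82, 1.90])

variance = np.var(heights_m)  # in meters squared: awkward to interpret
sd = np.std(heights_m)        # back in meters: directly comparable to the mean

print(variance, sd)
print(np.isclose(np.sqrt(variance), sd))  # SD is the square root of variance
```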

Speaker 1 (07:58):
Ah, okay, so SD is easier to compare directly to the mean.

Speaker 2 (08:02):
Much easier. It makes SD far better for interpretation, for reporting, and really crucially for something called feature scaling or normalization in ML.

Speaker 1 (08:10):
Why feature scaling?

Speaker 2 (08:11):
Well, often in ML you have features measured on totally different scales, maybe age in years, income in thousands of dollars, height in centimeters. Models can sometimes struggle with that or give too much weight to features with larger numerical values.

Speaker 1 (08:24):
So you need to put them on a level playing field.

Speaker 2 (08:27):
Exactly. You often rescale features so they have a mean of zero and a standard deviation of one, and standard deviation is the metric you use to do that rescaling properly. It's fundamental for preprocessing data for many algorithms.
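Here is one minimal way to do that rescaling by hand with pandas; the features are toy values, and in practice a library scaler is often used, but the arithmetic is the same idea:

```python
import pandas as pd

df = pd.DataFrame({
    "age_years": [23, 35, 47, 52, 61],
    "income_k": [38, 52, 75, 110, 64],
    "height_cm": [162, 175, 181, 169, 190],
})

# Standardize: subtract each column's mean and divide by its standard deviation,
# so every feature ends up with mean ~0 and standard deviation 1.
scaled = (df - df.mean()) / df.std()

print(scaled.mean().round(10))  # approximately 0 for every column
print(scaled.std())             # 1.0 for every column
```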

Speaker 1 (08:39):
Okay, we've gone from defining data types to summarizing them
with center and spread. Now how do we pivot towards
using this data for prediction. That feels like the next
logical step.

Speaker 2 (08:50):
It is, and that pivot really starts by defining a potential cause-and-effect relationship. This is where we introduce the concepts of dependent and independent.

Speaker 1 (08:59):
Variables, right, setting up the experiment.

Speaker 2 (09:01):
Essentially, pretty much, we're defining our modeling goal. What factor are we changing or observing, the independent variable, and what outcome are we measuring the effect on, the dependent variable?

Speaker 1 (09:14):
So the independent variable is the input, the thing we control,
or the factor we think is causing a change.

Speaker 2 (09:20):
Exactly, like in a drug trial, the dosage level would be the independent variable. Or, using an example from the source, maybe the type of pitch a pitcher throws to a batter. That's the input being.

Speaker 1 (09:30):
Varied, and the dependent variable is the output, the result, what happens because of the independent variable.

Speaker 2 (09:35):
Yes, it's the variable being tested or measured that responds to the changes. In that baseball example, the batter's performance. Did they hit it? How well? That's the dependent variable. Its value depends on the pitch type.

Speaker 1 (09:47):
And getting these two defined correctly seems absolutely critical. It's basically framing the entire problem you want your ML model to solve.

Speaker 2 (09:55):
It is. You're specifying the relationship you intend to model and predict.
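A small NumPy sketch of that setup, with an invented independent variable (hours of practice) and dependent variable (a performance score), fit by ordinary least squares:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)        # independent variable (input)
score = np.array([52, 55, 61, 64, 70, 73], dtype=float)  # dependent variable (outcome)

# Fit a straight line: score ~ slope * hours + intercept, by least squares.
slope, intercept = np.polyfit(hours, score, deg=1)
print(slope, intercept)

# Predict the dependent variable for a new value of the independent variable.
print(slope * 7 + intercept)
```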

Speaker 1 (09:58):
Now, underpinning all of this statistical analysis, all these measurements and relationships, there's a really core principle that gives us confidence in the results, right? The law of large numbers, the LLN.

Speaker 2 (10:09):
Ah. Yes, the LLN. It's absolutely fundamental. It's kind of
the bedrock that makes statistics work reliably.

Speaker 1 (10:15):
So what does it state, in simple terms? It.

Speaker 2 (10:17):
Basically says that if you repeat the same experiment over
and over and over again a huge number of times,
the average of the results you get will get closer
and closer to the true expected theoretical value.

Speaker 1 (10:30):
Like flipping a coin.

Speaker 2 (10:31):
Perfect example. Flip a coin just ten times, you might easily get, say, seven heads and three tails. That's pretty far from the expected fifty-fifty, right? But flip that same coin a million times or ten million times, the ratio of heads to tails is going to get incredibly close to exactly one to one. It converges on the true probability.
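That convergence is easy to watch in a quick NumPy simulation; this is just a sketch, not something from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (10, 1_000, 1_000_000):
    flips = rng.integers(0, 2, size=n)   # 0 = tails, 1 = heads
    print(n, flips.mean())               # proportion of heads drifts toward 0.5
```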

Speaker 1 (10:50):
And it's that convergence that lets us trust statistical methods.

Speaker 2 (10:54):
Exactly. It validates the whole idea of using probabilities and statistics derived from experiments or samples to understand underlying truths. It allows us to have confidence in probabilistic models.

Speaker 1 (11:04):
So the LLN gives us the confidence, then, to take results we see in a smaller sample of data and make reasonable conclusions about the entire population it came from, which sounds like statistical inference.

Speaker 2 (11:15):
That's precisely what statistical inference is about, and it leads
directly to the main framework we use for making those decisions.
Hypothesis testing.

Speaker 1 (11:23):
Okay, hypothesis testing. This is where we formally test an
idea using the data.

Speaker 2 (11:28):
Yes, it's the structured process where we use the summary statistics we calculated, combined with our understanding of probability and the LLN, to draw conclusions about a whole population based only on evidence from a sample.

Speaker 1 (11:41):
And it usually involves setting up two competing ideas beforehand.

Speaker 2 (11:44):
Correct, you have a kind of statistical showdown. The main goal is to see if there's enough evidence in your sample data to reject the null hypothesis.

Speaker 1 (11:52):
The null hypothesis being the default skeptical position.

Speaker 2 (11:56):
Always. It's the statement of no effect, no difference, or no relation. For example, this new drug has no effect on recovery time compared to the placebo. It's the status quo assumption, and we test that against the alternative hypothesis. This is the statement that contradicts the null. It's what you, as the researcher, might actually suspect or hope to prove, like, no, this new drug does reduce

(12:18):
recovery time.

Speaker 1 (12:19):
So the whole process is about gathering enough statistical evidence
to confidently say, Okay, we can reject the no effect
idea in favor of the there is an.

Speaker 2 (12:29):
Effect idea. Precisely, and that level of statistical confidence, often expressed as a p-value or a confidence interval, is what determines whether you feel justified in acting on your findings or making a claim about the population.
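For a concrete feel, here is a hedged sketch of that drug-versus-placebo comparison as a two-sample t-test. The data is synthetic, and it uses SciPy, which the episode doesn't mention; it's just one common way to get a p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic recovery times in days; the null hypothesis says the drug has no effect.
placebo = rng.normal(loc=10.0, scale=2.0, size=40)
drug = rng.normal(loc=8.8, scale=2.0, size=40)

# Two-sample t-test: is the observed difference plausible under the null?
t_stat, p_value = stats.ttest_ind(drug, placebo)
print(t_stat, p_value)

if p_value < 0.05:
    print("Reject the null hypothesis at the 95% confidence level.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```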

Speaker 1 (12:42):
All right, let's pull this together. We've walked through quite
a statistical toolkit, understanding the different types of variables you encounter.

Speaker 2 (12:48):
Discrete, continuous, nominal, ordinal.

Speaker 1 (12:51):
Yeah, then summarizing them with measures of center like the mean and median, understanding spread with standard deviation, especially why SD is useful practically.

Speaker 2 (13:01):
Right, those units matter for comparison and scaling.

Speaker 1 (13:03):
Then we moved into setting up predictions by defining dependent and independent variables, and finally, the framework for making decisions based on sample data, hypothesis testing, built on the confidence given by the law of large numbers.

Speaker 2 (13:16):
It really does form the essential foundation. You can see how these concepts are, well, mandatory before you jump into the more complex ML algorithms.

Speaker 1 (13:24):
They really are the entry point for any serious study
or application.

Speaker 2 (13:28):
But here's a final thought, something that connects back to that law of large numbers. The LLN guarantees convergence. It gives us certainty, but only over a massive number of trials. A million coin flips.

Speaker 1 (13:40):
Right, requires huge scale.

Speaker 2 (13:43):
Exactly. But in the real world, doing market analysis, building a
product prototype, maybe even running a clinical trial, we almost
never have a million data points. We work with samples,
sometimes relatively small samples because collecting data is expensive or
time consuming.

Speaker 1 (13:58):
So the certainty we get isn't absolute. It's usually probabilistic, like saying we're ninety-five percent confident or maybe ninety-nine percent confident.

Speaker 2 (14:06):
Right, which leads to the provocative question. If the law
of large numbers only guarantees truth over immense scale, how
often are our everyday decisions, maybe in business, launching a
new feature, or even interpreting a political poll, actually based
on what could be called the fallacy of small.

Speaker 1 (14:22):
Numbers, meaning we're drawing conclusions from samples that might be too small to really trust the LLN's guarantee, potentially.

Speaker 2 (14:30):
So the question for you, the listener, is what level of statistical certainty, that ninety-five percent, that ninety-nine percent, are you willing to accept? Especially when you're moving from analyzing a potentially small, expensive sample to making a big assumption about the entire population, an assumption that could have major consequences, maybe cost millions.

Speaker 1 (14:49):
How much uncertainty can you live with?

Speaker 2 (14:51):
What's your threshold for risk framed in that statistical confidence?

Speaker 1 (14:54):
Definitely something to think about. A great place to leave it for this deep dive. Thanks for joining us.