
September 12, 2022 · 52 mins


A layman's guide to the world of Machine Learning, how it can be leveraged in the Enterprise, and a peek at the risks of proceeding without practical expertise, deliberate purpose, and sound guiding principles.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Michael R. Gilbert (00:01):
A few years ago, my world and the world of machine learning collided, and with it came a whole host of new concepts and words that were really quite confusing: AI, Machine Learning, Deep Learning, Neural Networks, Stochastic Gradient Descent, things I couldn't even pronounce. And what I really wanted at the time was a gentle introduction, in plain English, that would explain to me what all these things were, how they fit together, and what it could mean

(00:23):
to me. And I really couldn't find it. If you're in that situation, and you suddenly need to know about machine learning but aren't an expert, well, let's talk about it.

(00:46):
Hello, and welcome to The Technology Sounding Board. I'm your host, Michael R. Gilbert. And in this episode, we're gonna talk about Machine Learning. Now, this is gonna be a little bit different from some of our other episodes: there's no guest on this one, this is just my viewpoint, and it's really meant to be a setup for some episodes that we will have in the future, when we're talking to some real

(01:06):
world practitioners using ML in their space to drive improvements in their enterprises. But, as I outlined in the beginning, there's a lot going on in this space and it's quite confusing. If it's not something that you do on a day-to-day basis, it can be hard to understand what is what, what matters, and how it all hangs together. So what I want to do

(01:27):
is I want to lay out what this means, what the terminology means, what ML is, what it isn't, and what it can do for you. And as I said in the title, this is meant to be machine learning for the rest of us: no PhD required, I promise you there will be no math, and we will keep it in plain English

(01:48):
wherever we possibly can.
So with that, let's begin. If you want to define Machine Learning, we should probably start by asking: what is learning? And it may seem like a silly question, but when you think about it, it's kind of harder to actually lock down

(02:08):
than you first think. So I'll put it out there that learning isn't memorization. Now what do I mean by that? Let's start by thinking about the old card game we've all played when we were younger, the memory card game. And in case you haven't, just imagine a set of cards, let's say 64 cards, and each

(02:30):
card could have one of 32 pictures. So for every picture, there'll be a twin card that has the same picture. We put all these cards out on the floor, face down, and if we're an organized sort of person, we might put them in an eight by eight grid, and if not, maybe it's all higgledy-piggledy; it doesn't really matter. And each person takes a turn: the

(02:52):
first person turns over a card, and then tries to pick another card by turning that over to see if it matches. If it doesn't match, then both cards get turned back again, and it's the next player's turn. If it does match, you get to keep those cards and you get to try again. The aim of the game is, obviously, to find all the pairs. And at the end of the game, the person with the most

(03:14):
pairs wins. Now clearly, memory helps you win this game: being able to remember where something you turned over, or something your opponents turned over, was, so that when you find its twin in your next round, you can remember where the original was and turn it over. But memory will not help you win the game in concept, right?

(03:39):
Winning one game and memorizing all the pairs doesn't make you any more likely to win or lose the next game. We haven't learned anything. We simply memorized. Now, for sure, having a memory is important, otherwise we can't learn anything, but it's insufficient. So what is learning? I'm going to propose that learning is

(04:02):
really about pattern recognition: about recognizing patterns when we see them and remembering those patterns, so that we can then view the data in the world around us and infer new knowledge from the patterns that we've learned. So for example, if you remember that far back, you'll remember that when we teach kids to multiply, usually we make them learn the

(04:24):
times table: two times two is four, three times two is six, four times two is eight, etc., etc., etc. It's effective. They learn how to multiply in the sense that they can apply the table. But the table gets very large very quickly, and remembering all that is difficult. Very soon, whether we teach them the tricks or not, the child will figure out that

(04:49):
there are certain patterns in the data. Let's take the nine times table, for example. Two times nine is 18; one and eight add up to nine. Three times nine is 27; two and seven add up to nine. Well, wait a minute, is that always true? And so we can test these and see that, all the way up to 10, this works out very nicely. So we don't need to remember the nine times table

(05:13):
anymore; we can figure it out. We'll take whatever multiple of nine we're being asked to give, let's say five, and take one off, so it becomes four; then we find the number that makes four up to nine, which is five, so the answer must be 45. And suddenly, we got much, much faster. And as well as being faster, we're no longer having to literally memorize huge amounts of data.
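
To make that pattern concrete, here is a minimal sketch of the nine-times trick in Python. It only covers multipliers from 1 to 10, just like the version described above, and the function name is mine, not something from the episode.

```python
def times_nine(n):
    """Multiply n by 9 (for n from 1 to 10) using the digit pattern:
    the tens digit is n - 1, and the two digits always add up to nine."""
    tens = n - 1          # e.g. for n = 5, the tens digit is 4
    units = 9 - tens      # the digit that makes the tens digit up to nine
    return tens * 10 + units

# Quick check against ordinary multiplication.
for n in range(1, 11):
    assert times_nine(n) == 9 * n
print(times_nine(5))  # 45, just like the example above
```
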

(05:36):
And so that's another way we could look at learning: to say that, in effect, it's data compression. Instead of just remembering everything we could ever remember, we're remembering a set of rules, a set of patterns about that data, which can stand in place of all that data, and so we can infer what's going on. Well, great. Okay, so if we have a handle on what

(05:59):
learning might be, what does it mean for machine learning? Well, that would be teaching a machine to be able to recognize patterns and infer new data from old data. Okay, but isn't that what every computer program does? When we write a program, we have encoded an algorithm, which will take input data in and give some output data, infer the right answer from the input data we

(06:23):
were given. And we're not storing all of this in some massive lookup table; it's an algorithm. And so that's another difference I want to draw here. There's a difference in this world between learning and teaching. Computer programming, writing computer programs, we can think of as teaching the computer how to recognize

(06:46):
that pattern, how to infer the new data. What we want to do is have the computer learn that for itself, figure it out on its own: look at the data and then infer a pattern. We don't want to have to write the code. So machine learning, in a way, is a way of automating the production of programming, so that we can

(07:08):
get machines to make themselves smarter.
Now that touches on another subject, which is one of Artificial Intelligence.

(07:15):
Machine Learning, ML; Artificial Intelligence, AI. And you see AI and ML used almost interchangeably in a lot of literature. Are they the same thing? Well, again, I'm gonna argue no, they aren't, right? Artificial Intelligence is about the idea of making a computer behave in a more human way, about allowing them to be able to deal

(07:38):
with a world of imprecision, to be able to recognize things in natural language where the language itself doesn't necessarily make complete structured sense, possibly even to be able to emote, so that they can relate to us and we can relate to them in more natural ways. Examples today,

(07:59):
obviously, the personal assistants, Alexa and Siri, come to mind. But actually, this concept has been around for a long time. It really started in earnest, I would say, in the late 50s, and made some big progress in the 60s; the first commercial success of artificial intelligence was in the 1980s. And in that time, artificial intelligence really meant expert

(08:23):
systems. And these were systems that could, again, take a set of rules that had been created, created by actually listening to experts, by writing down how they understood the world, and then letting the system interpret the data that it sees to come up with new information. And though there were some commercial successes in the

(08:46):
80s, they did not take off the way we thought they would. And that's because actually encapsulating these rules turned out to be really hard. You have to spend a lot of time with experts, and you only know what you know; you can only grow what can be defined. And so I think we could say that machine learning was a prerequisite for artificial

(09:07):
intelligence to explode: we have to find a way of generating vast amounts of knowledge, vast amounts of new learning, for these things to sit on top of. And that's the area that machine learning looks at. It helps and supports artificial intelligence, but it isn't the same thing. And indeed, machine learning can be useful even

(09:28):
without artificial intelligence. I think of artificial intelligence as being sort of replacing humans in the decision process, whereas machine learning is really decision support in human processes. That's my perspective; I'm not sure that everyone necessarily would agree with it. But I think it's a good way to slice the line between what is AI and what is ML.

(09:50):
Okay, having drawn a line between AI and ML, we're not really going to discuss AI any further. The focus here is on the ML, the machine learning side of things. One last point I want to make before we get into the details of it is that there are two forms of machine learning that we can look at: supervised

(10:11):
and unsupervised. Now, supervised is where most of the work is really being done; unsupervised is what we would really like it to be. The difference? Well, I'd like to be able to say, look, here's some data I've collected from the world around me, tell me something interesting, tell me something I didn't know, that I can use. And the example that we

(10:33):
can use in this space, that is quite commonly used, is clustering. So we've got a website, we do a lot of e-commerce, and we've got a lot of information about what our customers actually buy, their behavior on the site and what have you. And we want to build a marketing campaign to target them for things that might be of interest to them. Marketing

(10:53):
campaigns are expensive, so we don't want to create a huge number of them. On the other hand, the more specific a marketing campaign is to the end user, the target, the more effective it's going to be. So I want to be able to say, okay, I've got enough budget for five different marketing campaigns, let's take this data and let's split these customers into five

(11:16):
groups that are themselves different in some meaningful way. And, you know, hey, let me know what that meaningful way is, so I can then build something that targets those groups, which will hopefully be more successful than just a broadcast, but allows me to sort of cut it down to size, a more meaningful chunk. And we can do that today; clustering algorithms are pretty easy, and it can be done.
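
As a concrete illustration of that unsupervised clustering idea, here is a minimal sketch using scikit-learn's k-means algorithm, assuming scikit-learn and NumPy are installed. The choice of five clusters mirrors the five-campaign example above, but the data and the customer features are invented for illustration.

```python
# A minimal clustering sketch, assuming scikit-learn is available.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up behaviour data for 1,000 customers:
# [annual spend, orders per year, days since last order]
customers = np.column_stack([
    rng.gamma(2.0, 150.0, 1000),
    rng.poisson(6, 1000),
    rng.integers(1, 365, 1000),
])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(customers)

# Each customer now has a group label from 0 to 4: one marketing campaign per group.
for group in range(5):
    members = customers[kmeans.labels_ == group]
    print(f"group {group}: {len(members)} customers, average spend {members[:, 0].mean():.0f}")
```

Nothing in the code says what makes each group "meaningful"; interpreting the groups is still a human job, which is part of the point being made here.
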

(11:39):
That's not where we are in most of machine learning, though. Most machine learning is really in the supervised model. And what that's about is kind of more like the traditional sort of teaching: giving you some data, asking you to answer some

(12:02):
homework, grading your homework, and then allowing you to get better based on the things that you got right on the homework and the things you didn't get right on that homework. And the more we do it, the better you get. That's supervised learning. And that's really where we are today in the state of the art. Okay, to briefly recap where we are: we talked

(12:22):
about the fact that machine learning is about getting a computer to build inference patterns from data that it sees, i.e. kind of program itself; that it isn't AI, it helps AI but it isn't AI, and we're not talking about that here; and that specifically, we are looking at supervised learning, versus unsupervised learning, i.e. how we help the machine learn by

(12:46):
giving it some kind of graded answers to its initial guesses. Well, let's talk about how we do that. And let's again go back to, well, what would we do as humans? And can we take that as a model and use it to help a computer get to the same place?
Let's think about what we would do if we wanted to make

(13:06):
inferences based on data that we could observe, and how we would approach it. Let's take a really simple example. Let's say we commute to work. And obviously this example refers to a time back when we actually did commute to work. But let's suppose we can still remember what it's like to get up in the morning, have breakfast, get in the car, drive to work, and try

(13:27):
to be there for a certain time. We don't want to get there too early, that wastes our time, and we definitely don't want to be there late. But the problem is that actually the time it takes to get from here to there is variable. It would be nice somehow to be able to predict how long it's going to take. And so we might guess that actually the time it takes to travel is dependent on the amount of

(13:48):
traffic on the road: the more traffic on the roads, the longer it's going to take, and the less traffic, the quicker it's going to be. Okay, sounds reasonable. So we could just take a number of trips, and we could record them on a plain old piece of graph paper. We could record on the y axis how long it took us, and on the x axis some

(14:09):
measure of how much traffic there was. And what we would see after a while is that these dots that we put on the graph kind of line up. Now, ideally, they would line up in a perfectly straight line, and we could just draw the line to connect all the missing dots. Then we could say, for any given amount of traffic,

(14:30):
we simply read off the line from the x axis up to the y axis, and that's going to tell us exactly, down to the second, how long it's going to take to get to work. It doesn't work that way, of course, we know that, but we'll see something that is a cluster of dots which generally gets higher as it goes further along, i.e. the more traffic we get, the more the dot cluster goes up.

(14:53):
And we'll squint at that a bit, and we'll sort of draw an imaginary line that cuts more or less through the middle of it, and that line will give us an estimate. And that estimate will be, frankly, good enough. And so, hey, we just learned how to infer from a pattern. Let's see if we can translate that same idea into allowing a computer to do it. It turns out, of course, that actually that's pretty

(15:16):
easy to do; the phrase for this is a regression. As humans, it helps for us to visualize it, and that's why we draw it on a piece of graph paper. But clearly, the computer doesn't need the visualization; it's easy for it to relate one set of numbers to another set of numbers. The difficulty, the complication, shall we say, is

(15:37):
in how we make that straight line appear between the dots. Now, as humans, like I said, we could squint at it and say, well, that fits about right. But "that looks about right", or TLAR as we used to call it, doesn't formalize well to interpret for a computer. How are we going to get more precise than TLAR?

(15:57):
What we now need is the concept of an error function. And what do I mean by that? So we have this line, this imaginary line that we think is the best fit through the cloud of dots. And what we can do is measure how far away from that line each dot is. Now there are various rules that we could use:

(16:18):
mean squared error, L1 errors, etc., etc.; it doesn't really matter. As long as we have a reason for using a particular error, we'll use that. So let's just say we're going to simply measure the distance of each dot away from that line, take the absolute value, add them all up, and average it. And that's going to be a measure of how good or bad

(16:39):
the line is. We can move that line around, and as we do, that error function will either go up or go down. And if we keep moving in the direction that goes down, we will eventually find the lowest point it could be, and that's going to be the best line. It won't be a perfect fit; if it was a perfect fit, they would all line up in a straight line on their own, and it would be obvious. But it's going to be the best fit. And

(17:00):
that's the concept we're looking for. So this is going to come up again and again and again in machine learning: what we're really trying to do is take an approximation to fit a line to some data, and then find something that minimizes the error, or minimizes the loss, as we would say sometimes, in that

(17:21):
approximation. And that's what we're going to do.
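
Here is a minimal sketch of that idea in Python, assuming NumPy is installed: fit a straight line to made-up commute data and score it with the "average absolute distance" error just described. I use NumPy's least-squares line fit (np.polyfit) as the fitting step, which is my choice, not something named in the episode.

```python
# A minimal regression sketch, assuming NumPy.
import numpy as np

rng = np.random.default_rng(1)
traffic = rng.uniform(0, 10, 50)                     # some 0-10 measure of traffic
commute = 20 + 3.5 * traffic + rng.normal(0, 4, 50)  # minutes, with real-world noise

# Find the slope and intercept of the line that best fits the cloud of dots.
slope, intercept = np.polyfit(traffic, commute, deg=1)
predicted = slope * traffic + intercept

# The error function described above: distance of each dot from the line,
# absolute value, added up and averaged.
mean_abs_error = np.mean(np.abs(commute - predicted))

print(f"time ~ {slope:.1f} * traffic + {intercept:.1f}, average error {mean_abs_error:.1f} minutes")
```
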
So before I go any further on loss functions, let's just add a little bit more complexity to our little problem here. So far, we've said we're trying to estimate the time it's going to take to commute based on the level of traffic on the road. And it's not a straight line, and the dots are more or less

(17:42):
sort of lining up, but they're not really lining up; it's a big cloud. How could we get more accurate? Well, we can get more accurate by adding more data points, by adding more information about the problem. So let's say, well, the weather probably has some influence on how long the commute takes. And let's just put a number down, where one is really good weather and 10 is really awful weather, sort of, I

(18:03):
don't know, hurricane-level nasty weather. If we then want to plot not only the time it took to go, but the amount of traffic that was on the road and the weather on that day, we can't do that in two dimensions. But being human, we can understand things in three dimensions, and we can represent a picture that shows that data

(18:27):
in three dimensions using a little bit of art. And now what we see is hills and peaks and valleys, as opposed to just straight lines. But hopefully, we get a reasonably straight interpretation of this, now in three dimensions.
The beauty about allowing machines to do this is that

(18:47):
they don't need any visualization, because this is all about data to them. It doesn't matter how many dimensions there are: they could do this with 100 data points, they could do this with 1,000 data points, with a million data points. There's no way we could draw that, so we can no longer use visualizations to represent it. But the computer doesn't care; it wasn't using the visualizations anyway. And

(19:08):
so we start to get into some of the power that machine learning has versus how we humans would try to tackle this problem.
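
Here is a minimal sketch of that "more dimensions" point, assuming NumPy: the same kind of line fit as before, but with two inputs (traffic and weather) instead of one. The numbers are invented, and np.linalg.lstsq is my choice of fitting routine, not something from the episode.

```python
# A minimal multi-input regression sketch, assuming NumPy.
import numpy as np

rng = np.random.default_rng(2)
traffic = rng.uniform(0, 10, 200)          # traffic level, 0-10
weather = rng.integers(1, 11, 200)         # 1 = lovely, 10 = hurricane-level nasty
commute = 20 + 3.5 * traffic + 1.2 * weather + rng.normal(0, 4, 200)

# Design matrix: one column per input, plus a column of ones for the intercept.
X = np.column_stack([traffic, weather, np.ones_like(traffic)])
coeffs, *_ = np.linalg.lstsq(X, commute, rcond=None)

w_traffic, w_weather, intercept = coeffs
print(f"time ~ {w_traffic:.1f}*traffic + {w_weather:.1f}*weather + {intercept:.1f}")
# Nothing here depends on there being only two inputs; the same call works with
# hundreds of columns, which is exactly the point about the computer not caring.
```
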
Okay, so we've now kind of got the idea that in order to help a computer learn to infer from the data, the challenge really is going to be how it's going to minimize the loss. It can make a

(19:28):
guess we can we can calculate anerror on that guess. How does it
minimize a loss? To answer that,I want to paint a picture for
you. And we're going to use asort of three dimensional
problem just like we talkedabout before. But this time,
imagine that actually you'vebeen out for a walk in the

(19:48):
hills. And the task before you now is to get back to base camp, back to where you started. And you started, conveniently, at the lowest point in the hills, in the valley, at the very, very lowest point. This is obviously representing minimizing, in this case, a loss function. In order to minimize the loss function, i.e. to get back to base camp, all you've got to do is walk down the hill to

(20:09):
the bottom. Well, let's make it just a little bit more complicated and say you're actually shrouded in fog. Now you don't know which way to go, you can't see the base station. So how are you going to get there? One technique you might use would be to reach out with one foot and sort of tap the ground around you and try to find the steepest gradient, i.e. the thing that goes down the furthest, and take a small step

(20:33):
in that direction. And then you might stop there and tap with your foot again and find out, from there, which is the steepest downward direction, the direction of steepest gradient, and take another small step. And if you did this repeatedly, eventually you'd get to the bottom, you'd get back to base camp. You would have minimized, in this case, your altitude, but

(20:54):
what we're representing here, of course, is minimizing our loss function. That's how we're going to teach the computer to do this. And there's a name for this technique, and this technique is called gradient descent. And you can see where the name comes from. You may often hear stochastic gradient descent; that's just a variant of the same thing where we randomize some of the input data that we use, rather than use all

(21:15):
the input data. And that tends to be faster, but it doesn't matter, it's a detail. Gradient descent is the concept; stochastic gradient descent, or SGD, is the technique, the variant of the technique, that gets used the most often in this process. So how does the computer know which way is the steepest way down the slope?

(21:35):
That's a mathematical technique called differentiation, which will take a formula and give you the slope that that formula is giving out. And I'm not going to get into differentiation in this discussion, thankfully. But it's something that's doable in most cases. There are some problems for which the function can't be differentiated, and then we would have to find other techniques to do that. But in

(21:56):
the vast majority of cases, that's how we solve that problem.
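
To give a feel for what "give you the slope" means in code, here is a tiny sketch in plain Python that estimates the slope of a function at a point numerically, by nudging the input a little in each direction. This finite-difference trick is my stand-in for the calculus; the episode itself just appeals to differentiation.

```python
def slope(f, x, h=1e-6):
    """Approximate the slope of f at x by looking a tiny step either side."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: the slope of x squared at x = 3 is about 6,
# so "downhill" from x = 3 is in the negative direction.
print(slope(lambda x: x ** 2, 3.0))
```
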
Okay, so our algorithm is looking pretty good now. We collect some data, and for each element of data, for each of these rows of data, we make sure we have a label on the data to say what the right answer would be. We let the computer basically guess how the data might map to the answer. And in

(22:18):
reality, it doesn't matter how good or bad the individual guess is, because it can then go through and see how well it did. It creates a loss function to define, okay, how do I measure how good or bad this is, and then it uses the algorithm, SGD or some variant like it, in order to find the way to minimize the loss function. And it's going to get better and

(22:40):
better and better. And so we've solved machine learning, right? No problems?
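
Putting those pieces together, here is a minimal sketch of that whole loop in Python with NumPy: guess a line, measure the loss, use the gradient to nudge the guess downhill, and repeat. I use mean squared error rather than the average absolute error from earlier purely because its slope is easy to write down; that substitution, and all of the numbers, are mine.

```python
# A minimal gradient-descent training loop on the commute example, assuming NumPy.
import numpy as np

rng = np.random.default_rng(3)
traffic = rng.uniform(0, 10, 200)
commute = 20 + 3.5 * traffic + rng.normal(0, 4, 200)   # labelled data: inputs plus right answers

slope_guess, intercept_guess = 0.0, 0.0   # the computer's initial guess: a flat line
learning_rate = 0.01                      # size of each small step downhill

for step in range(5000):
    predicted = slope_guess * traffic + intercept_guess
    error = predicted - commute
    loss = np.mean(error ** 2)            # mean squared error: how bad is this guess?

    # The gradient of the loss with respect to each parameter (which way is downhill).
    grad_slope = np.mean(2 * error * traffic)
    grad_intercept = np.mean(2 * error)

    # Take a small step in the downhill direction.
    slope_guess -= learning_rate * grad_slope
    intercept_guess -= learning_rate * grad_intercept

print(f"learned: time ~ {slope_guess:.1f}*traffic + {intercept_guess:.1f}")
```
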
Well, now there is one slight snag that we have yet to deal with, which is, bizarrely, it can get too good. So you may have heard of the problem of overfitting. What does that mean? Now again, go back to the human world, and imagine that we're teaching a class of students how to solve physics

(23:05):
problems. And we're doing this by giving them a test bank of 1,000 questions, and for each of these questions, they are given the right answer. And we say, okay, go learn from these, try to guess what the answers are; when you get it right, great, when you get it wrong, adjust your guessing algorithms, so you get better and better and better. And soon enough, they

(23:25):
will be able to answer any question they're given from this bank of 1,000 and get them right. But it's entirely possible that they'll get it right because they're simply remembering the answer to every question. And if they do that, then when they are faced with an exam with questions that were not in the test bank, they won't be able to get it right, because they won't be able to generalize what they

(23:46):
have learned from the specific data into the real world. And that's a real danger in machine learning. How do we solve that problem? Well, there are techniques to make the most out of smaller sets of data, and I'm not going to get into the details and the weeds of that, though they certainly exist. But the basic principle is a much, much larger training set. At some

(24:09):
point, the data being fed in becomes so large that it's too large for the model to remember the data, and it starts to extract the general answer. And you know this is true again when you look at humans: if you gave them that set of 1,000 questions, it's possible they could remember all of the questions; if you gave them 100,000, there's no way they can do it. Now

(24:31):
they just have to start generalizing, and picking out the rules that will help them succeed.
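
Here is a minimal sketch of that memorize-versus-generalize effect, assuming NumPy: a very flexible model (a high-degree polynomial) trained on a small question bank does much better on the questions it has seen than on fresh ones, while the same model trained on far more data behaves about the same on both. The whole setup is invented for illustration.

```python
# A minimal overfitting sketch, assuming NumPy.
import numpy as np

rng = np.random.default_rng(4)

def true_answer(x):
    return np.sin(x)                      # the underlying rule the model should learn

def train_and_test(n_train, degree=12):
    x_train = rng.uniform(0, 6, n_train)
    y_train = true_answer(x_train) + rng.normal(0, 0.1, n_train)
    x_test = rng.uniform(0, 6, 500)       # fresh "exam" questions never seen in training
    y_test = true_answer(x_test) + rng.normal(0, 0.1, 500)

    model = np.polyfit(x_train, y_train, degree)   # a very flexible model
    train_err = np.mean(np.abs(np.polyval(model, x_train) - y_train))
    test_err = np.mean(np.abs(np.polyval(model, x_test) - y_test))
    return train_err, test_err

print("  20 training examples -> train/test error:", train_and_test(20))
print("2000 training examples -> train/test error:", train_and_test(2000))
# Typically, with only 20 examples the test error is far worse than the training
# error (memorization); with 2,000 examples the two are close (generalization).
```
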
And that brings us to one of the core drivers behind the cost and expense of machine learning. If it's as easy as we just said it is, why is it so expensive, and why do we need such well-educated people, who cost us a large

(24:56):
amount of money, to do it? The number one reason is the size of the data: getting hold of enough data. And remember, this data can't just be something you pull at random; it has to be labeled data, so we have to know what the right answer is. That's an exercise in and of itself. Now, if we take the fact that we're going to chew through an awfully large amount of data, and we're

(25:19):
going to do it in an iterative process, where we're going to get our answer, we're going to measure our homework, and we're going to improve, you can see that, depending on the problem we're trying to solve, this can take a long time, too. So you've got a lot of expensive equipment, and you've got a lot of people, tied up for a long time trying to chew through the data. These things, collectively, can lead to the difficulty. And the

(25:45):
third problem I want to talk about, very, very briefly, is the nature of the problem and its relation to the amount of data that we need. And I don't mean here the number of examples, but the number of indicators. In our really, really easy example, we were looking at two data points: how much traffic was on the road, and what's the weather like? Let's suppose that we're trying to do some analysis

(26:07):
on photographs: facial recognition. And let's say we're using 4K photographs, which are not uncommon today. Then we are talking about something in the region of 8 million data points, and each of those data points actually has three different colors if we're talking about a color photograph. It's 8.3

(26:28):
million per picture, which means we're at 25 million, basically 25 million data points we're looking at to analyze a full-color 4K picture. That's a lot of data points.
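
For the curious, the arithmetic behind those numbers, assuming a standard 3840 by 2160 4K frame (the episode doesn't pin down an exact resolution):

```python
# Rough size of the input for one full-colour 4K picture.
width, height, channels = 3840, 2160, 3      # a common 4K resolution, three colour values per pixel

pixels = width * height                      # about 8.3 million pixels
values = pixels * channels                   # about 24.9 million numbers: "basically 25 million"

print(f"{pixels:,} pixels, {values:,} values, {values**2:.1e} operations if an algorithm scales as n squared")
```
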
Some problems are related, in terms of how long they take to compute, to the order of n, which means the larger this number of data points is, the larger the time taken. Some are related to n squared; well,
(26:48):
larger the time taken somerelated to n squared, well,
literally, as the number goes up, the time goes up by the square of the number, and if you start to square 25 million, you get to a very big number. Some problems are related not to the square, or even the cube, or bigger powers like that, but to the exponential of it. So there's some constant, like, say, 10 to

(27:09):
the power of n; well, 10 to the power of 25 million, I don't know what that number is, but it's so large that it wouldn't matter how many computers you had, you could run all those computers for as long as the universe has existed, and you still wouldn't solve the problem. They're essentially intractable. So there are limits to what this can do, and

(27:32):
the problems themselves have to be in what we call a polynomial space in order to be able to actually solve them. And even then they can take some serious time.
Before we get into what we can and can't do with Machine Learning, I want to talk very briefly about Deep Learning.

(27:52):
Now, what is Deep Learning? And how does it relate to Neural Networks, which seem to come up in the conversation anytime Deep Learning comes up? And how does all that relate to Machine Learning as we've just been discussing it? In order to discuss Deep Learning, we really want to take a moment and think of what we've just been talking about as being shallow learning,

(28:16):
with a caveat that just about no one in the universe uses the phrase shallow learning. Maybe they should, but they don't. What do I mean by that? Well, we're making a couple of really important assumptions that we haven't really made explicit yet in what we've done so far. So let's go back to our example, where we're trying to estimate the time it takes to commute to work based on a couple of

(28:38):
variables. We've said, well, we're going to look at the traffic density, and we're going to look at the weather, and we're going to use those two to predict the outcome. Number one, we said that, collectively, they're going to relate to the outcome via some kind of line, which is a straight line. It's a linear problem, as we would say.

(29:01):
And we accept that it reallyisn't a linear problem, but it's
going to be straight enough thatour approximation is going to be
useful. Well, that's oneassumption that may or may not
hold out. And two we're, we'rebasically saying we can leap to
the answer from looking directlyat the data. And in this case,
we can. We can see the trafficdensity and we can see the

(29:25):
weather, and that will give us a good indication of the expected commute time. But real-world problems that we as humans solve all the time are often a lot more complex than that. So if you were to do something as simple, simple in terms of human capabilities, as looking at a series of pictures that we've

(29:47):
taken, and identifying which of those pictures contained a cat or a dog, you would find that very, very easy. And maybe even sorting between, here's a whole lot of pictures and some of them are cats and some of them are dogs, which ones are which? Trivial.
When you think about it, though, what you're really doing is

(30:07):
actually quite complex. You're identifying that an area in this photograph represents an animal. And you're doing that by saying, okay, well, this is a texture that looks like fur, this is a texture that looks like skin, and this is a texture that looks like scales, whatever. And here are some shapes that look like eyes, and this is a shape that

(30:29):
looks like a mouth, and here's a shape that looks like a nose. And so we can also relate the distance between the eyes and the nose and the mouth, and say, okay, well, that geometry turns up in cats a lot, and so this looks like it's a cat, versus a slightly different geometry that looks like it's a dog. So we

(30:50):
didn't just look at the picture and make the decision based on the data as we saw it: we recognized the pattern in the data, and we recognized the pattern in the pattern of patterns, and we recognized the pattern in the pattern in the pattern of patterns, and so on and so forth. We built up a whole level of reasoning from the raw data into multiple

(31:10):
layers, in order to get to the conclusion. And these multiple layers, that's what we're referring to when we're talking about deep learning: being able to lay one on top of the other, on top of the next, to get deeper into understanding. So neural networks are a technique that is currently probably the

(31:31):
predominant, I mean, almost certainly the predominant, technique used in deep learning for how to construct this layer of learning on top of layer of learning on top of layer of learning. And in neural networks, we're doing the same sort of processing that we were doing before: the idea of constructing a linear reasoning from a layer of data in order to give us an output, then we're feeding that output into another

(31:54):
linear layer, and then feeding that out into another linear layer. And so we're using the same techniques that we've just described, stacking them on top of each other. But in a neural network, we do something very interesting, which is to insert, in between each of these linear layers, a nonlinear layer. And what do I mean by that? And why is this interesting? Well, okay,

(32:17):
so, a nonlinear function. There are lots to choose from, but we're going to talk about, for example, the one probably most commonly used, which is a ReLU, R-E-L-U, which stands for rectified linear unit, in case you ever wanted to know, but you probably never, ever wanted to know anyway. What it is, is very simple, right? For any input, if that input is negative, the

(32:40):
answer is zero. So if I send in minus five to a ReLU, I'm gonna get zero; if I send in minus 10 to a ReLU, I get zero. For any input that's greater than zero, I'm going to get that input back as the output. So if I send in three, the answer is three; if I send in 10, the answer is 10.
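
In code, the whole thing is one line. Here is a minimal sketch in plain Python reproducing the examples just given.

```python
def relu(x):
    """Rectified linear unit: negative inputs become zero, everything else passes through."""
    return x if x > 0 else 0

print(relu(-5), relu(-10), relu(3), relu(10))   # prints: 0 0 3 10
```
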

(33:02):
Now, this is a very, very simple function, but it does something very, very interesting. It basically turns into a switch: any number out of my original first linear layer that comes out as negative is going to disappear, it's going to turn off, and any number that stays positive is going to stay on. And as we go through these layers, and

(33:24):
between each layer we have a switch, what we see is sort of streaks of activity occurring between these layers, which looks very reminiscent of the streaks of activity we see as neurons in a real brain fire. And that's where it gets its name, the neural network.

(33:44):
What that does, mathematically, is quite astonishing. If we now stack these layers together, with nonlinear layers in between them, we can now model any arbitrary function. It doesn't matter whether it's linear or curvy or so curvy there aren't words to describe how curvy it is; we can model it.
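
Here is a minimal sketch, assuming NumPy, of exactly that structure: a linear layer, a ReLU "switch" in between, then another linear layer. The weights are random, so it computes nothing useful yet; training them with gradient descent, as described earlier, is what would turn it into something that fits a curvy function. Sizes and names are mine.

```python
# A minimal one-hidden-layer neural network, assuming NumPy.
import numpy as np

rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=(2, 16)), np.zeros(16)   # first linear layer: 2 inputs -> 16 units
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)    # second linear layer: 16 units -> 1 output

def network(x):
    hidden = x @ W1 + b1                 # linear layer
    hidden = np.maximum(0.0, hidden)     # ReLU: the nonlinear "switch" in between
    return hidden @ W2 + b2              # linear layer again

# One (meaningless, untrained) prediction for traffic = 7, weather = 3.
print(network(np.array([7.0, 3.0])))
```
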

(34:05):
And that means we can answer arbitrary questions by using neural networks. Notwithstanding, we still have the same issues we had before: we need a lot of data, we need a lot of processing power, and it can take a long time. And there are some problems for which it doesn't matter how much time we have, it will take more time than we have in the entire universe

(34:26):
to answer. But the problem space is now much, much larger; the things we can solve with this technique are really complicated, like facial recognition, like identifying animals in a piece of scenery, for example. So I'm sure you can imagine that's where all the excitement is in the machine learning space these days: in deep learning, specifically neural networks, and the things

(34:50):
that they can drive. This really is at the cutting edge of our understanding; how to construct neural networks so that they perform well is more art than science, I would say, at the moment. And so it takes very clever people, a lot of experience, and a lot of, sometimes, just simply trial and

(35:11):
error to find the best approach.
The last concept I want to talk about before I get on to applications of all of this is generative adversarial networks. And that's a mouthful, right? A GAN. Well, so the idea here is, what would happen if you took two of these neural networks

(35:32):
and played them off against each other? And the example that I'm going to give you is: imagine that you created a neural network whose purpose was to be able to take in a picture and determine whether or not it is actually a piece of artwork by Picasso, picked at random. And we train it just

(35:56):
like we did before: we give it lots of pictures, and we label those pictures as, yes, this is a Picasso, or no, this isn't. And we include in our training set some pictures of some really good fakes, but we label them as fakes. And it learns, and it gets pretty good at it. And soon enough, it can tell the difference between a Picasso and something that isn't a Picasso.

(36:19):
So let's say we also create a generative network. What do I mean by that? This is something that's going to take a picture as an input, and it's going to change that picture into a picture that looks like it was done by Picasso. Now, you can imagine that we might start with something that just makes random

(36:40):
changes to the pictures; it doesn't matter what the change is at this point. And it's going to feed its output into the Picasso detector. And of course, when it first does it, the Picasso detector is gonna say, no, that's not even close. Sorry, guys, that's not a real Picasso. But now we've got something that can generate labels to feed back into our Picasso generator. And of

(37:07):
course, we can use exactly the same techniques to change the way it's modifying the picture, and get closer and closer to something that passes the Picasso test. And initially, it wouldn't get very close, but it might get to maybe 99% certain that this is a fake, rather than 100% certain. Well, that's in the right direction, so we'll make more of those types of changes. Maybe

(37:29):
then it gets 95% certain that it's a fake rather than a real Picasso. That's the right direction, so we'll make more of those changes, and so on and so forth. And before long, we have trained this generative network to create pictures in the Picasso style that are so convincing, the Picasso detector

(37:52):
can't tell it's a fake. Well, does the story end there? No, because this is where the adversarial part comes in. Great, so we can now generate good Picasso-looking pictures, which we know are fakes. And so now we can create more training data for the Picasso detector, with lots of labeled data of good Picasso

(38:16):
fakes. We can feed that into the Picasso detector, and we can retrain it, and it gets better and better and better. And before long, it's no longer fooled by the fake generated pictures; it knows the difference between real ones and fake ones. And so does the story end there? Well, of course not, we can flip it around again, and we can train our generator against the new

(38:39):
Picasso detector, and make that get better and better and better. So is this all a game? Not really, no. Obviously, what we're trying to do is create a better Picasso detector. And, okay, let's take the Picasso idea and translate this into some kind of fraud detection system, or some kind of spam detection system, or some kind of computer attack detection system that's going to detect more and more

(39:02):
attacks. You can see how having the computer generate more and more convincing attacks, more and more convincing spam, more and more convincing fraud simulations, helps us get a better defense and a better protection. So that's the idea behind using computer learning, with one computer learning system

(39:23):
being the adversary of the other, to, in fact, improve both.
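
Here is a minimal sketch of that adversarial loop on toy data, assuming PyTorch is installed. Instead of Picassos, the "real" examples are just numbers drawn from a particular distribution, and the generator learns to produce numbers the detector can no longer tell apart from the real ones. The sizes, names, and choice of library are mine, not the episode's.

```python
# A minimal GAN sketch on one-dimensional toy data, assuming PyTorch.
import torch
import torch.nn as nn

def real_samples(n):
    # Stand-in for "real Picassos": numbers clustered around 4.0.
    return torch.randn(n, 1) * 1.25 + 4.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
detector = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(detector.parameters(), lr=1e-3)
bce = nn.BCELoss()
real_label, fake_label = torch.ones(64, 1), torch.zeros(64, 1)

for step in range(3000):
    # 1) Train the detector: real samples labelled 1, generated samples labelled 0.
    fakes = generator(torch.randn(64, 8)).detach()
    d_loss = bce(detector(real_samples(64)), real_label) + bce(detector(fakes), fake_label)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the generator: try to make the detector say "real" for its fakes.
    fakes = generator(torch.randn(64, 8))
    g_loss = bce(detector(fakes), real_label)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, generated numbers should be drifting toward the real data's mean (about 4.0).
print(generator(torch.randn(1000, 8)).mean().item())
```
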
So what does this mean for the enterprise? I think we can split this into two here. There are companies whose product, basically, is the transformation of digital signals. And sometimes you might need to stretch it a little bit to

(39:45):
understand that that's the business they're in, but it's not that hard to imagine. So, I'm delivering a podcast, for example. And obviously what I'm delivering here is digital speech, but I'm using products which are recording signals from a microphone and turning that into something that is easier to

(40:07):
listen to. And one of the tasks that we need to do is noise reduction. There will be background noise, maybe from the air conditioning system or a printer or what have you in the area, and that doesn't need to be in the podcast; what you want to be able to focus on is just what I'm saying. In the past, what we've done is we've used all sorts of frequency band

(40:30):
analyzers to take out certain frequencies that tend to be more problematic, where, oh, I don't know, we might see some hum in the 50 hertz or the 100 hertz range, based on some kind of machinery doing something, and we'll dip that down so that it appears less. The problem is we're messing with the waveform, and so the result is lower noise, but it also changes the

(40:54):
actual speech, and sometimes that makes it harder to understand than it did when the noise was there. So now we have products which understand how to recognize human speech. Because they can recognize human speech, they can separate the sound into two streams, if you like, speech and not-speech, and they can

(41:17):
amplify the bit that's speech and suppress the bit that's not speech. And as if by magic, all the background noise disappears. That's a very specific use of machine learning systems, driven by neural networks in this case. Now, if you're in the business of producing software to process

(41:40):
audio, this would be a technique that obviously you would want to invest in. Okay, but what if you aren't? What if you're, picking something at random, an online retailer? Should you be investing in machine learning expertise, building a team of people that's probably quite expensive, with a lot of

(42:00):
infrastructure, in order to do this? Let's take, for example, fraud. We want to be able to maximize the number of customers we reach, and we want to be able to minimize the amount of fraud that we have to deal with. Obviously, selling something to somebody where we're not going to get the money is bad business; we don't want to do that. But how

(42:23):
do you tell the difference between something which is an odd transaction but still legitimate, and something which is an odd transaction because actually it's fraud?
Clearly, we could use machine learning, use some deep algorithms, to discover patterns in behavior that might separate

(42:46):
out the fraud behavior from the non-fraud behavior. So is it worth us investing in growing that kind of capability? I'm going to suggest it probably isn't. What we want to be doing is investing in products that leverage this kind of capability, and possibly even products that leverage this capability and can be automatically trained on our

(43:09):
data, to make them very specific to our business, but without having to reinvent, without having to understand, the processes that are being used to do that. And so that's the split, I think: it's going to be very important to understand where in the value chain your real application of technology is. Is it part of something you actually sell, something you

(43:31):
monetize? In which case, investing in adding this kind of technology to it is, like you would with any new technology, part of the product chain analysis. If that's not your business, then you want to be looking for partners with expertise in this space to provide you with the technologies that you want.
Nobody wants to build their own ERP; that's just not very

(43:53):
clever. Same here: you want to be buying products that leverage state-of-the-art neural networks that are trainable on your data.
Alright, so we did talk about some of the limits on what can be done. And we talked about the fact that those limits are generally driven by: a) the data, is enough data available,

(44:16):
and do we have labels for those data, so we can train it in the first place? Is the problem itself essentially tractable or not? And do we have the expertise in how to do this? Then there are whole classes of problems where we have to ask ourselves: should we use these techniques to address them? There are some built-in traps for us in this. And we have to remember

(44:39):
that what machine learning is doing is teaching a computer how to look at data and make inferences based on the data we've given it, and also the inferences we made on that data. It's not learning how to infer things from the real world; it's learning how to infer things that we have given it from a

(45:03):
subset of the real world that we've already put our interpretation on. And why is that problematic? Well, if we are giving it biased data, it's going to end up encoding that as a biased set of assumptions, and that's going to cause us a lot of problems. It would be hard to overstate how serious that

(45:24):
problem can be. To put it in perspective, let's talk a little bit about natural language processing, and a game we might play with a computer to see how well it has understood the text, by doing the association game. For example, we might ask it: man is to king as woman is to blank, and we would ask it to fill in the blank. And actually,

(45:49):
we've got our algorithms very, very good, and it will come up with the answer: queen. Correct.
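
For anyone curious what that looks like in practice, here is a minimal sketch using gensim and a small set of pretrained word vectors; it downloads the vectors on first run. The library and the model name are my choices for illustration, not something named in the episode.

```python
# A minimal word-association sketch, assuming gensim is installed (downloads data on first use).
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-50")   # small pretrained word vectors

# "man is to king as woman is to ___": add 'king' and 'woman', subtract 'man'.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically prints something like [('queen', 0.85)]
```
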
But, and you know, given that I've raised the question, there's going to be a but. Well, a 2019 report in the Association for Computational Linguistics was titled "Black is to Criminal

(46:12):
as Caucasian is to Police". Okay, the title is outrageous, and it's clearly the academic equivalent of clickbait. But it was pulled directly from an exercise done on the same kind of natural language processing algorithm, trained on a corpus that almost everybody uses to train their natural

(46:33):
language processing.
The paper was really about how we might de-bias the information we're passing into our algorithm creations. But it underlines the point. I mean, imagine, for example, that you're in the business of selling mortgages, and you're trying to use some machine learning to be able to determine who you should be

(46:55):
giving these mortgages to and who you should avoid. If the data set on which it was trained is systematically skewed to trust some people less than others, then you're going to do yourself some serious harm in multiple ways. I mean, one, obviously, let's put aside the fact that morally it's just not

(47:16):
right. But we have at least three issues that are purely financial in front of us. One, we want to sell our product to anybody who's capable and willing to buy the product, so if we're avoiding a segment of the market erroneously, we're missing out on an opportunity. Two, hey, if it gets out that we are not treating a

(47:41):
certain segment of the market fairly, and you know it will get out, the reputational damage could, well, it could destroy the company from that point on. And three, of course, we could be sued, or criminally prosecuted, or at least civilly prosecuted, for those actions, depending on where we're based. And yet, nothing that

(48:03):
we have done was deliberately aimed at excluding any part of the population; it's because the datasets we were using had this in them. How do we fight that? And I gotta tell you, there's a lot of work being done, and you can imagine that a huge amount of research is being done, on how we might de-bias this data. The results are not conclusive,

(48:27):
shall we say. So this is a significant problem that we need to be aware of, and maybe these are areas where we should be stepping in very cautiously. One of the problems of neural networks, and not all of the machine learning techniques suffer from this, but remember, neural networks are now making up the majority of where the progress is being made, one of the problems of neural networks is that they're not explainable. So we can generate

(48:52):
something which will come up with some inferences, but we can't really look inside the neural network to see how it deduced what it deduced. And so it's very difficult to see, if you want, what it's "thinking", and I use the word advisedly. But that makes it really difficult to know how trustworthy

(49:14):
the answer is. And by trustworthy, I don't mean, is it correct? Is it inferring correct information from the data? But again, how skewed is the data? And what is the data actually telling us, and is what it's telling us correct or not correct? We gave an example there that's pretty outrageous.

(49:34):
And, you know, obviously, that's going to come out one way or another. But what if there was a subtly different assumption baked into the data, still not right, but financially damaging over the long term for the organization? And you just wouldn't know. That's one of the frightening things about deep learning algorithms. Again, a lot of research is

(49:56):
being put in at the moment on trying to make this more explainable, trying to understand what's going on on the inside. But for now, we need to be cautious about what we're aiming our neural network at, so that we don't fall into that trap. Now, setting the question of bias aside for a minute, there's the question of the line between valuable and creepy. As

(50:22):
these systems learn more and more about our behavior, they can tailor their response to us more and more precisely, which you can certainly see is very, very helpful. But it can also get quite a bit unnerving. That's an area that may change over time, at least in terms of how well our customers accept that kind of

(50:43):
helpfulness, versus feeling like it's walking the edge of being creepy. Because we get used to technology, we get used to the way things behave, really quite quickly. But it's something we definitely want to be watching, and we want to be putting our foot in that water very carefully, measuring the response in our customer base, and perhaps adjusting accordingly. So the goal for

(51:05):
this episode was to outline what machine learning is, where it fits in the AI, machine learning, deep learning, neural networks spectrum, help you understand what some of those terms mean, and give you a very, very top-level view of how this all hinges together, and maybe how it might be useful. And I

(51:26):
hope we've done that. As I explained at the beginning, this is kind of a setup for a series of episodes we want to have in the future, where we start to speak to individual practitioners about how this is affecting their business, how it's being implemented in their world, and get some real-world feedback on where we're going next. I hope you enjoyed

(51:46):
this episode. And if you did, please leave us a review. If you didn't, please let us know either way: get involved in the discussion on the website at thetechnologysoundingboard.com, or leave us a review on your podcast streaming service of choice. Thank you very much for listening, and see you next time.