
September 7, 2020 49 mins

What's a Generative Adversarial Network? How can a program create a deepfake video? And how do we tell the difference between what's real and what's computer-generated?


Episode Transcript

Speaker 1 (00:04):
Welcome to TechStuff, a production from iHeartRadio.
Hey there, and welcome to TechStuff. I'm your host,
Jonathan Strickland. I'm an executive producer with iHeartRadio
and I love all things tech. Now, before I get
into today's episode, I want to give a little listener

(00:25):
warning here. The topic at hand involves some adult content,
including the use of technology to do stuff that can
be unethical, illegal, hurtful, and just plain awful. Now, I
think this is an important topic, but I wanted to
give a bit of a heads up at the start
of the episode, just in case any of you guys

(00:45):
are listening to a podcast on like a family road
trip or something. I think this is an important topic
and I think everyone should know about it and think
about it. But I also respect that for some people
this subject might be a bit taboo. So let's go
on with the episode. Back in nineteen ninety-three, a movie called

(01:06):
Rising Sun, directed by Philip Kaufman, based on a Michael
Crichton novel and starring Wesley Snipes and Sean Connery came
out in theaters. Now, I didn't see it in theaters,
but I did catch it when it came on you know,
HBO or Cinemax or something. Later on, the movie included
a sequence that I found to be totally unbelievable. And

(01:29):
I'm not talking about buying into Sean Connery being an
expert on Japanese culture and business practices. Actually, side note,
Sean Connery has an interesting history of playing unlikely characters,
such as in Highlander, where he played an immortal who
was supposedly Egyptian, then who lived in feudal Japan and

(01:49):
ended up in Spain where he became known as Ramirez.
And all the while he's talking to a Scottish Highlander
who's played by a French actor. But I'm getting way
off track here. Besides, I've heard Crichton actually wrote the
character while thinking of Connery, So you know, what the
heck do I know? In the film, Snipes and Connery
are investigators, and they're looking into a homicide that happened

(02:12):
at a Japanese business but on American soil. The security
system in the building captured video of the homicide and
the identity of the killer appears to be a pretty
open and shut case. But that's not how it all
turns out. The investigators talked to a security expert played
by Tia Carrere, and she demonstrates in real time how

(02:34):
video footage can be altered. She records a short video
of Connery and Snipes, loads that onto a computer, freezes
a frame of the video, and essentially performs a cut
and paste job swapping the heads of our two lead characters.
Then she resumes the video and the head swap remains

(02:55):
in place, and that head swap stuff is possible. I mean,
clearly it has to be possible, because you actually do
see that effect in the film itself. But it takes
a bit more than a quick cut and paste job.
But we'll leave off of that for now. The whole
point of that sequence, apart from showing off some cinema magic,

(03:16):
is to demonstrate to the investigators that video, like photographs,
can be altered. The expert has detected a blue halo
around the face of the supposed murderer in the footage,
indicating that some sort of trickery has happened. She also
reveals that she cannot magically restore the video to its
previous unaltered state, which I think was actually a nice

(03:38):
change of pace for a movie. By the way, I
think this movie is really, you know, not good, like
not worth your time, but that's my opinion anyway. For years,
this kind of video sorcery was pretty much limited to
the film and TV industries. It usually required a lot
of pre planning beforehand, so it wasn't as simple as

(04:01):
just taking footage that was already shot and changing it
in post on a whim with a couple of clicks
of a button. If it were, we would see a
lot fewer mistakes left in movies and television because you
could catch it later and just fix it. But the
tricks were possible, they were just difficult to pull off.
It just wasn't something you or I would ever encounter

(04:23):
in our day to day lives. But today we live
in a different world, a world that has examples of
synthetic media, commonly referred to as deep fakes. These are
videos that have been altered or generated so that the
subject of the video is doing something that they probably
would or could never do. They've brought into question whether

(04:47):
or not video evidence is even reliable, much as the
film Rising Sun was talking about. We already know that
eyewitness testimony is terribly unreliable. Our perception and memory play
tricks on us, and we can quote unquote remember stuff
that just didn't happen the way things actually unfolded in reality.

(05:09):
But now we're looking at video evidence in potentially the
same light. I mean, it's scary. So today we're going
to learn about synthetic media, how it can be generated,
the implications that follow with that sort of reality, and
ways that people are trying to counteract a potentially dangerous threat,

(05:30):
you know, fun stuff. Now, first, the term synthetic media
has a particular meaning. It refers to art created through
some sort of automated process, so it's a largely hands
off approach to creating the final art piece. Now, under
that definition, the example of Rising Sun would not apply

(05:53):
here, because we see in the film, and presumably this
happens in the book as well, but I haven't read
the book, that a human being actually makes the changes. People
have used tools to alter the video footage. This would
be more like using Photoshop to touch up a still image,
with the computer system presumably doing some of the work

(06:14):
in the background to keep things matched up. Either that
or you would need to alter each image in the
footage frame by frame, or use some sort of matte approach.
To learn more about mattes, you can listen to my
episode about how blue and green screens work. Synthetic media
as a general practice has been around for centuries. Artists

(06:35):
have set up various contraptions to create works with little
or no human guidance. In the twentieth century we started
to see a movement called generative art take form. This
type of art is all about creating a system that
then creates or generates the finished art piece. That would
mean that the finished work, such as a painting, wouldn't

(06:57):
reflect the feelings or thoughts of the artist who
created the system. In fact, it starts to raise the
question what is the art? Is it the painting that
came about due to a machine following a program of
some sort, or is the art the program itself? Is
the art the process by which the painting was made?

(07:19):
Now I'm not here to answer that question. I just
think it is an interesting question to ask. Sometimes people
ask much less polite questions, such as is it art
at all? Some art critics went out of their way
to dismiss generative art in the early days. They found
it insulting, but hey, that's kind of the history of

(07:42):
art in general. Each new movement in art inevitably finds
both supporters and critics as it emerges. If anything, you
might argue that such a response legitimizes the movement in
you know, a weird way. If people hate it, it
must be something. In two thousand eighteen, an artist collective

(08:03):
called Obvious, based out of Paris, France, submitted portrait
style paintings that were created not by an actual human painter,
but by an artificially intelligent system. Now they looked a
lot like typical eighteenth century style portraits. There was no
attempt to pass off the portrait as if it were

(08:24):
actually made by a human artist. In fact, the appeal
of the piece was largely due to it being synthetically generated.
It went to auction at Christie's and the AI created
painting fetched more than four hundred thousand dollars. And the
way the group trained their AI is relevant to our

(08:45):
discussion about deep fakes. The collective relied on a type
of machine learning called generative adversarial networks, or GANs,
which in turn depend on deep learning. So it
looks like we've got a few things we're going to
have to define here. Now, I'm going to keep things
fairly high level, because as it turns out there are

(09:07):
a few different ways to create machine learning models, and
to go through all of them in exhaustive detail would
represent a university level course in machine learning. I have
neither the time for that nor the expertise. I would
do a terrible job, So we'll go with a high
level perspective here first. A generative adversarial network uses two systems.

(09:31):
You have a generator and you have a discriminator. Both
of these systems are a type of neural network. A
neural network is a computing model that is inspired by
the way our brains work. Our brains contain billions of neurons,
and these neurons work together, communicating through electrical and chemical signals,

(09:52):
controlling and coordinating pretty much everything in our bodies. With computers,
the neurons are nodes. The job of a node is,
you know, supposed to be kind of like a neuron
cell in the brain. It's to take in multiple weighted
input values and then generate a single output value. Now,

(10:14):
the word weighted w E I G H T E
D, weighted, is really important here, because the larger an
input's weight, the more that input will have an effect
on whatever the output is. So it kind of comes
down to which inputs are the most important for that
node's particular function. Now, if I were to make an analogy,

(10:36):
I would say, your boss hands you three tasks to do.
One of those tasks has the label extremely important, and
the second task has the label critically important, and the
third task has a label saying you should have finished
that one before it was handed to you. Okay, so
that's just some sort of snarky office humor that I

(10:57):
need to get off my chest. But more seriously, imagine
a node accepting three inputs. In this example, input one
has a fifty percent weight, input two has a forty percent weight, and
input three has a ten percent weight. That adds up to
one hundred percent, and that would tell you that the output that
node generates will be most affected by input one, followed

(11:21):
by input two, and then input three would have a
smaller effect on whatever the output is. Each node applies
a nonlinear transformation on the input values, again affected by
each input's weight value, and that generates the output value.
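
To make that concrete, here is a minimal sketch of a single node in Python. The sigmoid nonlinearity and the specific numbers are illustrative assumptions on my part, not anything prescribed in the episode:

```python
import math

def node_output(inputs, weights):
    # Weighted sum: each input is scaled by how much it "matters."
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # A nonlinear transformation (here, a sigmoid squashing things to 0..1).
    return 1 / (1 + math.exp(-weighted_sum))

# Three inputs with the fifty/forty/ten percent weights from the example above.
print(node_output([1.0, 0.5, 2.0], [0.5, 0.4, 0.1]))
```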
The details of that really are not important for our episode,

(11:43):
and involve performing changes on variables that in turn change
the correlation between variables, and it gets a bit mathy,
and we would get lost in the weeds pretty quickly.
The important thing to remember is that a node within
a neural network takes in a weighted sum of inputs, then
performs a process on those inputs before passing the result

(12:06):
on as an output. Then some other node a layer
down will accept that output, along with outputs from a
couple of other nodes one layer up, and then will
perform an operation based on those weighted inputs and pass
that on to the next layer, and so on. So
these nodes are in layers, like you know a cake.

(12:27):
One layer of nodes processes some inputs, they send it
on to the next layer of nodes, and then that
one passes it on to the next one, and the next one
and so on. This isn't a new idea. Computer scientists
began theorizing and experimenting with neural network approaches as far
back as the nineteen fifties with the perceptron, which was

(12:49):
a hypothetical system that was described by Frank Rosenblatt of
Cornell University. But it wasn't until the last decade that
computing power and our ability to handle a lot of
data reached a point where these sort of learning models
could really take off. The goal of this system is
to train it to perform a particular task within a

(13:12):
certain level of precision. The weights I mentioned are adjustable,
so you can think of it as teaching a system
which bits are the most important in order to do
whatever it is the system is supposed to do in
order to achieve your task, These are the bits that
are the most important and therefore should matter the most

(13:32):
when you weigh a decision. This is a bit easier
if we talk about a similar system with the version
of IBM's Watson that played on Jeopardy. That system
famously was not connected to the Internet. It had to
rely on all the information that was stored within itself.
When the system encountered a clue in Jeopardy, it would

(13:55):
analyze the clue, and then it would reference its database
to look for possible answers to whatever that clue was.
The system would weigh those possible answers and attempt to
determine which, if any, were the most likely to be correct.
If the certainty was over a certain threshold,
the system would buzz in with its answer. If no

(14:16):
response rose above that threshold, the system would not buzz in.
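
Here is a toy sketch of that buzz-only-when-confident logic. The candidate answers, scores, and threshold are all made up for illustration; this is not how Watson was actually implemented:

```python
# Hypothetical candidate answers with made-up confidence scores.
candidates = {
    "What is Mount Everest?": 0.91,
    "What is K2?": 0.42,
    "What is Denali?": 0.13,
}
THRESHOLD = 0.80  # made-up confidence cutoff

best_answer, confidence = max(candidates.items(), key=lambda kv: kv[1])
if confidence >= THRESHOLD:
    print(f"Buzz in with: {best_answer} ({confidence:.0%} sure)")
else:
    print("Stay quiet; no answer is certain enough.")
```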
So you could say that Watson was playing the game
with a best guess sort of approach. Neural networks do
essentially that sort of processing. With this particular type of approach,
we know what we want the outcome to be, so
we can judge whether or not the system was successful.

(14:40):
After each attempt, we can adjust the weight on the
input between nodes to refine the decision making process to
get more accurate results. If the system succeeds in its task,
we can increase the weights that contributed to the system
picking the correct answer, and decrease the weights of the inputs

(15:00):
that did not contribute to the successful response. If the
system done messed up and gave the wrong answer, then
we do the opposite. We look at the inputs that
contributed to the wrong answer, we diminish their weights, and
we increase the weights of the other inputs, and then
we run the test again a lot. I'll explain a

(15:23):
bit more about this process when we come back, but
first let's take a quick break. Early in the history
of neural networks, computer scientists were hitting some pretty hard
stops due to the limitations of computing power at the time.

(15:44):
Early networks were only a couple of layers deep, which
really meant they weren't terribly powerful, and they could only
tackle rudimentary tasks like figuring out whether or not a
square is drawn on a piece of paper. That isn't
terribly sophisticated. In nineteen eighty-six, David Rumelhart, Geoffrey Hinton, and Ronald

(16:05):
Williams published a paper titled Learning Representations by Back-Propagating Errors.
This was a big breakthrough with deep learning. This all
has to do with a deep learning system improving its
ability to complete a specific task. And basically the algorithm's
job is to go from the output layer, you know,

(16:25):
where the system has made a decision, and then work
backward through the neural network, adjusting the weights that led
to an incorrect decision. So let's say it's a system
that is looking to figure out whether or not a
cat is in a photograph and it says, there's a
cat in this picture, and you look at the picture

(16:47):
and there is no cat there. Then you would look
at the inputs one level back just before the system
said here's a picture of a cat, and you'd say,
all right, which of these inputs led the system to
believe this was a picture of a cat? And then
you would adjust those. Then you would go back one
layer up, so you're working your way up the model

(17:10):
and say which inputs here led to it giving the
outputs that led to the mistake, and you do this
all the way up until you get up to the
input level at the top of the computer model. You
are back propagating, and then you run the test again
to see if you've got improvement. It's exhaustive, but it's

(17:32):
also drastically improved neural network performance, much faster than just
throwing more brute force at it. The algorithm essentially is
checking to see if a small change in each input
value received by a layer of nodes would have led
to a more accurate result. So it's all about going
from that output and working your way backward.
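
Here is a tiny hand-rolled sketch of that idea in Python: a one-hidden-layer network trained on XOR with plain backpropagation. The task, layer sizes, learning rate, and sigmoid activation are my own illustrative choices; real systems do the same bookkeeping at vastly larger scale:

```python
import math, random

random.seed(1)
H = 3  # hidden nodes

# Random starting weights for a tiny two-input, one-output network.
w_h = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b_h = [0.0] * H
w_o = [random.uniform(-1, 1) for _ in range(H)]
b_o = 0.0

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# XOR: a classic task a single layer can't solve on its own.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
lr = 0.5  # learning rate

for _ in range(20000):
    for x, target in data:
        # Forward pass: inputs -> hidden layer -> output.
        h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j])
             for j in range(H)]
        y = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)

        # Backward pass: start from the error at the output...
        d_y = (y - target) * y * (1 - y)
        # ...then propagate the blame one layer back to each hidden node.
        d_h = [d_y * w_o[j] * h[j] * (1 - h[j]) for j in range(H)]

        # Nudge every weight in proportion to its share of the mistake.
        for j in range(H):
            w_o[j] -= lr * d_y * h[j]
            b_h[j] -= lr * d_h[j]
            for i in range(2):
                w_h[j][i] -= lr * d_h[j] * x[i]
        b_o -= lr * d_y

# After training, the network usually reproduces XOR quite closely.
for x, target in data:
    h = [sigmoid(sum(w * xi for w, xi in zip(w_h[j], x)) + b_h[j])
         for j in range(H)]
    y = sigmoid(sum(w * hj for w, hj in zip(w_o, h)) + b_o)
    print(x, "target:", target, "network:", round(y, 2))
```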

(17:54):
In two thousand twelve, Alex Krizhevsky published a paper that gave us the next
big breakthrough. He argued that a really deep neural network
with a lot of layers could give really great results
if you paired it with enough data to train the system.
So you needed to throw lots of data at these models,
and it needed to be an enormous amount of data. However,

(18:17):
once trained, the system would produce lower error rates. So yeah,
it would take a long time, but you would get
better results. Now, at the time, a good error rate
for such a system was about twenty-six percent. That means roughly one out of
four conclusions the system would come to would be wrong.
If you ran it across a long enough number of decisions,

(18:39):
you would find that one out of every four wasn't right.
The system that Alex's team worked on produced results that
had an error rate of about sixteen percent, so much lower.
And then in just five years, with more improvements to
this process, the classification error rate had dropped down to
two point three percent for deep learning systems. So from

(19:04):
twenty-six down to two point three. It was really powerful stuff. Okay,
so you've got your artificial neural network. You've got your
layers and layers of nodes. You've adjusted the weights of
the inputs into each node to see if your system
can identify, you know, pictures of cats, and you start

(19:25):
feeding images to this system, lots of them. This is
the domain that you are feeding to your system. The
more images you can feed to it, the better. And
you want a wide variety of images of all sorts
of stuff, not just of different types of cats, but
stuff that most certainly is not a cat, like dogs,
or cars or chartered public accountants, you name it, and

(19:48):
you look to see which images the system identifies correctly
and which ones it screws up: both images the system
says have cats in them when they actually don't, and images
the system says have no cat in them when there
actually is one. This
guides you into adjusting the weights again and again, and

(20:08):
you start over and you do it again, and that's
your basic deep learning system, and it gets better over
time as you train it. It learns. Now, let's transition
over to the adversarial systems I mentioned earlier, because they
take this and twist it a little bit. So you've
got two artificial neural networks and they are using this

(20:30):
general approach to deep learning, and you're setting them up
so that they feed into each other. One network, the generator,
has the task to learn how to do something such
as create an eighteenth century style portrait based off lots
and lots of examples of the real thing, the domain,

(20:50):
the problem domain. The second network, the discriminator, has a
different job. It has to tell the difference between authentic
port traits that came from the problem domain and computer
generated portraits that came from the generator itself. So essentially
the discriminator is like the model I mentioned earlier that

(21:11):
was identifying pictures of cats. It's doing the same sort
of thing, except instead of saying cat or no cat,
it's saying real portrait or a computer generated portrait. So
there are essentially two outcomes the discriminator could reach, and
that's whether an image is computer generated or it isn't.
So do you see where this is going? You train

(21:31):
up both models. You have the generator attempt to make
its own version of something such as that eighteenth century portrait.
It does so: it designs the portrait based on what
the model believes are the key elements of a portrait,
So things like colors, shapes, the ratio of size, like

(21:52):
you know, how large should the head be in relation
to the body. All of these factors and many more
come into play. The generator creates its own idea of
what a portrait is supposed to look like, and chances
are the early rounds of this will not be terribly convincing.
The results are then fed to the discriminator, which tries

(22:14):
to suss out which of the images fed to it
are computer generated and which ones aren't. After that round,
both models are tweaked: the generator adjusts input weights to
get closer to the genuine article, and the discriminator adjusts
weights to reduce false positives and catch more computer generated images.
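
A skeletal version of that alternating loop, sketched with PyTorch on a deliberately silly "domain" (numbers near three) so it stays self-contained; you would swap in portrait images and much bigger networks for anything like the Christie's piece:

```python
import torch
import torch.nn as nn

# The "problem domain": numbers drawn from a bell curve around 3.0.
def real_data(n):
    return torch.randn(n, 1) * 0.5 + 3.0

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                  nn.Linear(16, 1), nn.Sigmoid())                 # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(3000):
    # 1) Train the discriminator: real samples should score 1, fakes 0.
    real = real_data(64)
    fake = G(torch.randn(64, 8)).detach()  # freeze G for this half-step
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator: it wants its fakes scored as real.
    fake = G(torch.randn(64, 8))
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# The generator's outputs should now hover around 3-ish.
print(G(torch.randn(5, 8)).detach().squeeze())
```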

(22:34):
And then you go again and again and again and again,
and they both get better over time. So, assuming everything
is working properly, over time, the adjustment of input weights
will lead to more convincing results, and given enough time
and enough repetition, you'll end up with a computer generated
painting that you can auction off for nearly half a

(22:55):
million dollars. Though keep in mind that huge price tag
dates back to the novelty of it being an early
AI generated painting. It would be shocking to me if
we saw that actually become a trend. Also, the painting,
while interesting, isn't exactly so astounding as to make you
think there's no way a machine did that. You'd look

(23:16):
at them and go, yeah, I can imagine a machine
did that one. A group of computer scientists first described
the generative adversarial network architecture in a paper in two
thousand and fourteen, and like other neural networks, these models
require a lot of data. The more the better. In fact,
smaller data sets mean the models have to make some
pretty big assumptions, and you tend to get pretty lousy results.

(23:41):
More data, as in more examples, teaches the models more
about the parameters of the domain, whatever it is they
are trying to generate. It refines the approach. So if
you have a sophisticated enough pair of models and you
have enough data to fill up a domain, you can
generate some convincing material. And that includes video, and

(24:01):
this brings us around to deep fakes. And in addition
to generative adversarial networks, a couple of other things really
converged to create the techniques and trends and technology that
would allow for deep fakes proper. In nineteen ninety-seven, Malcolm Slaney, Michele Covell,

(24:22):
and Christoph Bregler wrote some software that they called the
Video Rewrite Program. The software would analyze faces and then
create or synthesize lip animation which could be matched to
pre recorded audio. So you could take some film footage
of a person and then reanimate their lips so that

(24:44):
they could appear to say all sorts of things, which
in some ways set the stage for deep fakes. In this case,
it was really just focusing on the lips and the
general area around the lips, so you weren't changing the
rest of the expression of the face, and you would
have to, you know, keep your recording to be about

(25:04):
the same length as whatever the film clip was, or
you would have to loop the film clip over and over,
which would make it, you know, far more obvious that
this was a fake. In addition, motion tracking technology was
advancing over time too, and this also became an important
tool in computer animation. This tool would also be used
by deep fake algorithms to create facial expressions, manipulating the

(25:27):
digital image just as it would if it were a
video game character or a Pixar animated character. Typically, you
need to start with some existing video in order to
manipulate it. You're not actually computer generating the animation, like,
you're not creating a computer generated version of whomever it
is you're doing the fake of. You're using existing

(25:51):
imagery in order to do that and then manipulating that
existing imagery, so it's a little different from computer animation.
In two thousand sixteen, students and faculty at the
Technical University of Munich created the Face2Face project,
that's face, the numeral two, then face,
and this was particularly jaw dropping to me at the time.

(26:14):
When I first saw these videos back in twenty sixteen, I
was floored. They created a system that had a target actor.
This would be the video of the person that you
want to manipulate. In the example they used, it was
former US President George W. Bush. Their process also had
a source actor. This was the source of the expressions

(26:38):
and facial movements you would see in the target, so
kind of like a digital puppeteer in a way. But
the way they did it was really cool. They had
a camera trained on the source actor and it would
track specific points of movement on the source actor's face,
and then the system would manipulate the same points of

(26:58):
movement on the target actor's face in the video. So
if the source actor smiled, then the target smiled:
the source actor would smile, and then you would see
George W. Bush in the video smile in real time.
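
For a rough flavor of the landmark-tracking half of that trick, here is a sketch using OpenCV and dlib's standard 68-point face model (the model file is something you'd download separately; the path is a placeholder). The genuinely hard part, re-rendering the target actor's face with those moved points, is left out entirely:

```python
import cv2
import dlib

# dlib's stock face detector plus its 68-point landmark model.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks(frame):
    """Return 68 (x, y) facial landmark points for the first face found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]

cap = cv2.VideoCapture(0)      # webcam watching the "source actor"
ok, frame = cap.read()
neutral = landmarks(frame)     # the neutral, zero-point expression

for _ in range(300):           # track a few hundred frames
    ok, frame = cap.read()
    if not ok:
        break
    current = landmarks(frame)
    if current and neutral:
        # Per-point movement away from the neutral face. This is the
        # "expression" you would map onto the same points of the target
        # actor's face before re-rendering the target video frame.
        deltas = [(cx - nx, cy - ny)
                  for (cx, cy), (nx, ny) in zip(current, neutral)]
cap.release()
```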
It was really strange. They used this looping video of
George W. Bush wearing a neutral expression. They had to

(27:21):
start with that as their sort of zero point,
and I gotta tell you, it really does look like
the former president George W. Bush is having a bit
of a freak out on a looping video because he
keeps on opening his mouth, closing his mouth, grimacing, raising
his eyebrows. You need to watch this video. It is

(27:43):
still available online to check out. In two thousand seventeen, students and faculty
over at the University of Washington created the Synthesizing Obama project,
in which they trained a computer model to generate a
synthetic video of former US President Barack Obama, and they
made it lip sync to a pre recorded audio clip

(28:03):
from one of Obama's addresses to the nation. They actually
had the original video of that address for comparison, so
they could look back at that and see how they're
generated one compared to the real thing. And their approach
used a model that analyzed hundreds of hours of video
footage of Obama speaking, and it mapped specific mouth shapes

(28:28):
to specific sounds. It would also include some of Obama's mannerisms,
such as how he moves his head when he talks
or uses facial expressions to emphasize words. And watching the
video and that, you know the real one next to
the generated one is pretty strange. You can tell the
generated one isn't quite right. It's not matching the audio exactly,

(28:51):
at least not on the early versions, but it's fairly close,
and it might even pass casual inspection for a lot
of people who weren't, like, you know, actually paying attention.
Authors Maras and Alexandrou defined deep fakes as, quote, the
product of artificial intelligence applications that merge, combine, replace, and

(29:13):
superimpose images and video clips to create fake videos that
appear authentic, end quote. They first emerged in two thousand seventeen,
and so this is a pretty darn young application of technology.
One thing that is worrisome is that once someone has
access to the tools, it's not that difficult to create

(29:34):
a deep fake video. You pretty much just need a
decent computer, the tools, a bit of know how on
how to do it, and some time. You also need
some reference material, as in like videos and images of
the person that you are replicating, and like the machine
learning systems I've mentioned, the more reference material you have,

(29:55):
the better. That's why the deep fakes you encounter these
days tend to be of notable famous people like celebrities
and politicians. Mainly there's no shortage of reference material for
those types of individuals, and so they are easier to
replicate with deep fakes than someone who maintains a much
lower profile. Not to say that that will always be

(30:17):
the case, or that there aren't systems out there that
can accept smaller amounts of reference material. It's just harder
to make a convincing version with fewer samples. But in
order to make a convincing fake, the system really has
to learn how a person moves. All those facial expressions matter.

(30:39):
It also has to learn how a person sounds. We'll
get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence,
quirks and tics, all of these things have to be
analyzed and replicated to make a convincing fake, and it
has to be done just right, or else it comes
off as creepy or unrealistic. Think about how impressionists will

(31:02):
take a celebrity's manner of speech and then heighten some
of it for comedic effect. You'll hear it all the time
with folks who do impressions of people like Jack Nicholson
or Christopher Walken or Barbra Streisand, people who have a
very particular way of speaking. Impressionists will take those as
markers and they really punch in on them. Well, a

(31:25):
deep fake can't really do that too much, or else
it won't come across as genuine. It'll feel like you're
watching a famous person impersonating themselves, which is weird. Now,
the earliest mention of deep fakes I can find dates
to a two thousand seventeen Reddit forum in which a
user shared deep faked videos that appeared to show female

(31:45):
celebrities in sexual situations. Heads and faces had been replaced,
and the actors in pornographic movies had their heads or
faces swapped out for these various celebrities. Now the fakes
can look fairly convincing, extremely convincing in some cases, which
can lead to some people assuming that the videos are

(32:07):
genuine and that the folks that they saw in the
videos are really the ones who are in it. And
obviously that's a real problem, right? I mean that with this
technology, given enough reference data to feed a system, someone could
fabricate a video that appears to put a person in
a compromising position, whether it's a sexual act or making

(32:28):
damaging statements or committing a crime or whatever. And there
are tools right now that allow you to do pretty
much what the Face2Face tool was doing back
in two thousand sixteen. A program called Avatarify,
which is not that easy to say. Anyway, it can
run on top of live streaming conference services like Zoom

(32:49):
and Skype, and you can swap out your face for
a celebrity's face. Your facial expressions map to the computer
manipulated celebrity face that just looks at you through
your webcam, and then if you smile, the celebrity image smiles, etcetera.
It's like that old Face2Face program. It does
need a pretty beefy PC to manage doing all this

(33:13):
because you're also running that live streaming service underneath it.
It's also not exactly user friendly. You need some programming
experience to really get it to work. But it is
widely accessible, as the source code is open source
and it's on GitHub, so anyone can get it.
Samantha Cole, who writes for Vice, has covered the topic

(33:36):
of deep fakes pretty extensively and the potential harm they
can cause, and I recommend you check out her work
if you're interested in learning more about that. Do be
warned that Cole covers some pretty adult themed topics, and
I think she does great work and very important work.
But as a guy who grew up in the Deep South,

(33:57):
it's also the kind of stuff that occasionally makes me
clutch my pearls. But that's more of a statement
about me than her work. She does great work. I
think most of us can imagine plenty of scenarios in
which this sort of technology could cause mischief on a
good day and catastrophe on a bad day, whether it's
spreading misinformation, creating fear, uncertainty, and doubt, FUD, or

(34:21):
by making people seem to say things they never actually said,
or contributing to an ugly subculture in which people try
to make their more base fantasies a reality by putting
one person's head on another person's body. You know, it's
not great. There are legitimate uses of the technology too,
of course, you know, tech itself is rarely good or bad.

(34:42):
It's all in how we use it. But this particular
technology has a lot of potentially harmful uses, and Samantha
Cole has done a great job explaining them. When we
come back, I'll talk a bit more about the war
against deep fakes and how people are trying to prepare
for a world that is increasingly filled with media we
can't really trust. But first, let's take a quick break.

(35:13):
Before the break, I mentioned Samantha Cole, who has written
extensively about deep fakes, and one point she makes that
I think is important for us to note is that
the vast majority of instances of deep fake videos haven't
been some manufactured video of a political leader saying inflammatory things.

(35:33):
That continues to be a big concern. There's a genuine
fear that someone is going to manufacture a video in
which a politician appears to say or do something truly
terrible in an effort to either discredit the politician or
perhaps instigate a conflict with some other group. There are
literal doomsday scenarios in which such a video would prompt

(35:56):
a massive military response, though it does seem like it
might be a little far fetched. Though heck, I don't know,
considering the world we live in, maybe it's not that
big of a stretch anyway. Cole's point is that so far,
that has not happened. She points out that the most
frequent use for the tech either tends to be people

(36:17):
goofing around or, disturbingly, using it to, in her words,
quote, take ownership of women's bodies in non consensual porn
end quote. Cole argues that the reason we haven't really
seen deep fakes used much outside of these realms, apart
from a few advertising campaigns. Is that people are pretty
good at spotting deep fakes. They aren't quite at a

(36:39):
level where they can easily pass for the real thing.
There's still something slightly off about them. They tend to
butt up against the uncanny valley. Now, for those of
you not familiar with that term, the uncanny valley describes
the feeling we humans get when we encounter a robot
or a computer generated figure that closely resembles a human

(37:02):
or human behavior, but you can still tell it's not
actually a person, and it's not a good feeling. It
tends to be described as repulsive and disturbing, or at
the very best, off putting. See also the animated film
Polar Express. There's a reason that when that film came out,

(37:23):
people kind of reacted negatively to the animation, and it's
also a reason why Pixar tends to prefer to
go with stylized human characters who are different enough from
the way real humans look to kind of bypass uncanny valley.
We just think of that as a cartoon, not something
that's trying to pass itself off as being human. But

(37:44):
while there hasn't really been a flood of fake videos
hitting the Internet with the intent to discredit politicians or
infuriate specific people or whatever, there remains a general sense
that this is coming. It's just not here now. The
sense I get is that people feel it's an inevitability,
and there are already folks working on tools that will
help us sort out the real stuff from the fakes.

(38:07):
Take Microsoft, for example. Their R and D division, fittingly
called Microsoft Research, developed a tool they call the Video Authenticator.
This tool analyzes video samples and looks for signs of
deep fakery. In a blog post written by Tom Burt
and Eric Horvitz, two Microsoft executives, they say, quote, it

(38:30):
works by detecting the blending boundary of the deep fake
and subtle fading or gray scale elements that might not
be detectable by the human eye. End quote. Now I'm
no expert, but to me, it sounds like the Video
Authenticator is working in a way that's not too dissimilar
to a discriminator in a generative adversarial network. I mean,

(38:54):
the whole purpose of the discriminator is to discriminate or
to tell the difference between genuine, unaltered videos and
computer generated ones. So the Video Authenticator is looking for
telltale signs that a video was not produced through traditional
means but was computer generated. However, that's the very thing

(39:14):
that the generators in GAN systems are looking
out for. So when a generator receives feedback that a
video it generated did not slip past the discriminator, it
then tweaks those input weights and starts to shift its
approach in order to bypass whatever it was that gave
away its last attempt, and it does this again and again.
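
To give a flavor of the detection side, here is a toy frame-scoring sketch. The saved classifier, file names, and threshold are hypothetical stand-ins; this is emphatically not Microsoft's actual tool, just the general shape of running some trained real-versus-fake model across a clip:

```python
import cv2
import torch

# Hypothetical: any binary classifier trained on real vs. synthetic
# faces could slot in here, saved earlier as a whole model object.
model = torch.load("fake_face_classifier.pt")
model.eval()

def authenticity_scores(path):
    """Score every frame; closer to 1.0 means 'looks computer generated.'"""
    scores = []
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (224, 224))
        tensor = (torch.from_numpy(frame).permute(2, 0, 1)
                  .float().unsqueeze(0) / 255)
        with torch.no_grad():
            scores.append(torch.sigmoid(model(tensor)).item())
    cap.release()
    return scores

scores = authenticity_scores("suspect_clip.mp4")
flagged = sum(s > 0.5 for s in scores)
print(f"{flagged} of {len(scores)} frames look synthetic")
```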

(39:38):
So the Video Authenticator might work well for a given
amount of time, but I would suspect that in the
long run, the deep fake systems will become sophisticated enough
to fool the authenticator. Of course, Microsoft will continue to
tweak the authenticator as well, and it will become something
of a seesaw battle as one side outperforms the other temporarily,

(40:01):
and then the balance will shift. Though there may come
a time where either the deep fakes are too good
and they don't set off any alarms from the discriminator,
or the discriminator gets so sensitive that it starts to
flag real videos and it hits a lot of false
positives and calls them generated videos instead. Either way, you

(40:23):
reach a point where a tool like this no longer
really serves a useful purpose, and the video authenticator will
be obsolete. Now, this is something we see in artificial
intelligence all the time. If you remember the good old
days of CAPTCHAs, you know, the proving-you're-not-a-robot stuff.
The stuff we were told to do was
typically type in a series of letters and numbers, and

(40:46):
it wasn't that hard when it first started.
That's because the text recognition algorithms of
the time weren't very good. They couldn't decipher mildly deformed
text because the shape of the text fell too far
outside the parameters of what the system could recognize as
a legitimate letter or number. You make the number a little,

(41:09):
you know, deformed, and then suddenly the system's like, well,
that doesn't look like a three to me because it's
not in the shape of a three. But over time
people developed better text recognition programs that could recognize these
shapes even if they weren't in a standard three orientation,
and those systems began to defeat those simple early CAPTCHAs,

(41:30):
which required CAPTCHA designers to make tougher versions, and eventually
the machines got good enough that they can match or
even outperform humans. And at that point, those text based
CAPTCHAs proved to be more challenging for people than for machines,
which meant if you use them, you defeated the whole
purpose in the first place. So while this escalation proved

(41:51):
to be a challenge for security, it was a boon
for artificial intelligence. And while I focused almost exclusively on
the imagery of video here, the same sort of stuff
is going on with generated speech, including generated speech that
imitates specific voices. Like deep fake videos, this approach works
best if you have a really big data set of

(42:12):
recorded audio, so people like movie and TV stars, news reporters, politicians,
and, um, you know, podcasters are great targets for this stuff.
There might be hundreds or you know, in my case,
thousands of hours of recording material to work from. Training
a model to use the frequencies, timbre, intonation, pronunciation, pauses,

(42:38):
and other mannerisms of speech can result in a system
that can generate vocals that sound like the target, sometimes
to a fairly convincing degree. And for a while, to
peek behind the curtain here, we at TechStuff were
working with a company that I'm not going to name,
but they were going to do something like this as
an experiment. I was gonna do a whole episode on it,

(43:00):
and I had planned on crafting a segment of that
episode only through text. I was not going to actually
record it myself, and then use a system that was
trained on my voice to replicate my voice and deliver
that segment on its own. I was curious if it
could nail not just the audio quality of my voice, which,

(43:22):
let's be honest, is amazing. That's sarcasm. I can't stand
listening to myself, but it would also have to replicate
how I actually make certain sounds, Like would it get
the bit of the southern accent that's in my voice,
or the way I emphasize certain words. Would it pause
for effect at all or would it just robotically say

(43:44):
one word after the next and only pause when there
was some helpful punctuation that told it to do so.
Would it indicate a question by raising the pitch at
the end of its sentence. Sadly, we never got far
with that particular project, so I don't have any
answers for you. I don't know how it would have
turned out, but clearly one of the things I thought

(44:06):
of was that it's a bit of a red flag.
If you can train a computer to sound exactly like
a specific person, that means you can make that person
say anything you like, and obviously, like deep fake videos,
that could have some pretty devastating consequences if it were
at all, you know, believable or seemed realistic. Now, the

(44:27):
company we were working with was working hard to make
sure that the only person to have access to a
specific voice would be the owner of that voice, or
presumably the company employing that person. Though that does bring
up a whole bunch of other potential problems, like can
you imagine eliminating voice actors from a job because you've
got enough of their voice and you can just replicate it.

(44:50):
That wouldn't be great, But even so, it was something
I felt was both fascinating from a technology standpoint and
potentially problematic when it comes to an application of that technology.
One other thing I should mention is that the Internet
at large has been pretty active in fighting deep fakes,
not necessarily in detecting them, but removing the platforms from

(45:12):
which they were being shared, Reddit being a big one.
The subreddit that was dedicated to deep fakes has
been shut down. So there have been some of those
moves as well. Now this is not directly against the technology,
it's more against the proliferation of the output
of that technology. As for detecting deep fakes, it's interesting

(45:33):
to me that people are even developing tools to detect them,
because to me, the best tool so far seems to
be human perception. It's not that the images aren't really convincing,
or that we can suddenly detect these, you know, blending
lines like the Video Authenticator tool can. It's rather that it's

(45:53):
just not hard for us to spot a deep fake right now.
Stuff just doesn't quite look right in the way that
people behave in these videos. The vocals and animation often
don't quite match. The expressions aren't really natural, the progression
of mannerisms feels synthetic and not genuine. It just

(46:14):
looks off. It's that uncanny Valley thing, and so just
paying attention and thinking critically can really help us suss
out the fakes from the real thing. Even if we
reach a point where machines can create a convincing enough
fake to pass for reality. We can still apply critical thinking,
and we always should. Heck, we should be applying critical

(46:35):
thinking even when there's no doubt as to the validity
of the video, because there may be enough to doubt
the content of the video itself. If I listen to
a genuine scam artist in a genuine video, that doesn't
make the scam more legitimate. We always need to use
critical thinking. What I think is most important is that

(46:57):
we acknowledge the very real fact that there are numerous organizations, agencies, governments,
and other groups that are actively attempting to spread misinformation
and disinformation. There are entire intelligence agencies dedicated to this endeavor,
and then there are more independent groups that are doing

(47:18):
it for one reason or another, typically either to advance
a particular political agenda or just to make as much
money as quickly as possible. This is beyond doubt or question.
There are numerous misinformation campaigns that are actively going on
out there in the real world right now. Most of
them are not depending on deep fakes, because one, deep

(47:42):
fakes aren't really good enough to fool most people right now,
and two, they don't need the deep fakes in the
first place. There are other methods that are simpler, that
don't need nearly the processing power that work just fine.
Why would you go through the trouble of synthesizing a
video if you can get a better response with a
blog post filled with lies or half truths. It's just

(48:06):
not a great return on investment. So bottom line, be
vigilant out there, particularly on social media. Be aware that
there are plenty of people who will not hesitate to
mislead others in order to get what they want. Use
a critical eye to evaluate the information you encounter. Ask questions,

(48:26):
check sources, look for corroborating reports. It's a lot of work,
but trust me, it's way better that we do our
best to make sure the stuff we're depending on is
actually dependable. It'll turn out better for us in the
long run. Well, that wraps up this episode of TechStuff,
which yeah, I used as a backdoor to argue about

(48:47):
critical thinking. Again, sue me, don't, don't really sue me.
But I think that that's another instance where it's a
really clear example where we have to use that kind
of stuff. So I'm gonna keep on stressing it.
And you guys are awesome. I believe in you. I
think that when we start using these tools at our

(49:08):
disposal that everybody can develop just with some practice, that
things will be better. We'll be able to suss out
the nonsense from the real stuff, and we're all better
off in the long run if we can do that.
If you guys have suggestions for future topics I should
cover in episodes of TechStuff, let me know via Twitter.

(49:29):
The handle is TechStuff HSW, and I'll
talk to you again really soon. TechStuff is an
iHeartRadio production. For more podcasts from iHeartRadio,
visit the iHeartRadio app, Apple Podcasts, or wherever
you listen to your favorite shows.
