Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:04):
Welcome to tech Stuff, a production from I Heart Radio.
Hey there, and welcome to tech Stuff. I'm your host,
Jonathan Strickland. I'm an executive producer with I Heart Radio,
and how the tech are you. I am currently on
vacation celebrating my anniversary, and I didn't want to leave
(00:26):
you without an episode, So the episode We're going to
play for You was recorded and published on September seven,
twenty twenty. It is called Deep Learning and Deep Fakes,
and recent developments in the deep fakes field include researchers
creating tools that can detect tells in artificial voices, for example.
(00:47):
But really, when you think about that, it's just a
seesaw like pattern. We'll see deep fake technology
improve over time, and then our ability to detect deep
fakes will improve, and this will keep going until one
side or the other has the edge permanently. Now we
kind of talk about that in this episode. In fact,
(01:07):
also deep fakes are very much in the spotlight literally
on the popular TV series America's Got Talent. A team
from the startup Metaphysic made it all the way to
the final round of the competition by creating deep fake
copies of the famous judges on the show all in
real time. It's equal parts entertaining and terrifying. Okay, maybe more
(01:29):
terrifying than entertaining. Anyway, enjoy this episode, Deep Learning and Deep Fakes. Now,
before I get into today's episode, I want to give
a little listener warning here. The topic at hand involves
some adult content, including the use of technology to do
stuff that can be unethical, illegal, hurtful, and just plain awful. Now,
(01:55):
I think this is an important topic, but I wanted
to give a bit of a heads up at this
part of the episode, just in case any of you
guys are listening to a podcast on like a family
road trip or something. I think this is an important
topic and I think everyone should know about it and
think about it. But I also respect that for some
people this subject might be a bit taboo. So let's
(02:17):
go on with the episode. Back in nineteen ninety three, a movie
called Rising Sun, directed by Philip Kaufman, based on a
Michael Crichton novel and starring Wesley Snipes and Sean Connery
came out in theaters. Now, I didn't see it in theaters,
but I did catch it when it came on, you know,
(02:39):
HBO or Cinemax or something later on. The movie included
a sequence that I found to be totally unbelievable. And
I'm not talking about buying into Sean Connery being an
expert on Japanese culture and business practices. Actually, side note,
Sean Connery has an interesting history of playing unlikely characters,
such as in Highlander, where he played an immortal
(03:01):
who is supposedly Egyptian, then who lived in feudal Japan
and ended up in Spain where he became known as Ramirez,
and all the while he's talking to a Scottish Highlander
who's played by a Belgian actor. But I'm getting way
off track here. Besides, I've heard Crichton actually wrote the
character while thinking of Connery, So you know, what the
(03:22):
heck do I know? In the film, Snipes and Connery
are investigators, and they're looking into a homicide that happened
at a Japanese business but on American soil. The security
system in the building captured video of the homicide and
the identity of the killer appears to be a pretty
open and shut case. But that's not how it all
(03:44):
turns out. The investigators talked to a security expert played
by Tia Carrera, and she demonstrates in real time how
video footage can be altered. She records a short video
of Connery and Snipes, loads that onto a computer,
freezes a frame of the video, and essentially performs a
(04:04):
cut and paste job swapping the heads of our two
lead characters. Then she resumes the video and the head
swap remains in place, and that head swap stuff is possible.
I mean, clearly it has to be possible, because you
actually do see that effect in the film itself. But
it takes a bit more than a quick cut and
(04:25):
paste job. But we'll leave off of that for now.
The whole point of that sequence, apart from showing off
some cinema magic, is to demonstrate to the investigators that video,
like photographs, can be altered. The expert has detected a
blue halo around the face of the supposed murderer in
the footage, indicating that some sort of trickery has happened.
(04:48):
She also reveals that she cannot magically restore the video
to its previous unaltered state, which I think was actually
a nice change of pace for a movie. By the way,
I think this movie is really, you know, not good,
like not worth your time, but that's my opinion anyway.
For years, this kind of video sorcery was pretty much
(05:10):
limited to the film and TV industries. It usually required
a lot of pre planning beforehand, so it wasn't as
simple as just taking footage that was already shot and
changing it in post on a whim with a couple
of clicks of a button. If it were, we would
see a lot fewer mistakes left in movies and television
(05:30):
because you could catch it later and just fix it.
But the tricks were possible, they were just difficult to
pull off. It just wasn't something you or I would
ever encounter in our day to day lives. But today
we live in a different world, a world that has
examples of synthetic media, commonly referred to as deep fakes.
(05:52):
These are videos that have been altered or generated so
that the subject of the video is doing something that
they probably really would or could never do. They've brought
into question whether or not video evidence is even reliable,
much as the film Rising Sun was talking about. We
already know that eyewitness testimony is terribly unreliable. Our perception
(06:16):
and memories play tricks on us, and we can quote
unquote remember stuff that just didn't happen the way things
actually unfolded in reality. But now we're looking at video
evidence in potentially the same light. I mean, it's scary.
So today we're going to learn about synthetic media, how
(06:37):
it can be generated, the implications that follow with that
sort of reality, and ways that people are trying to
counteract a potentially dangerous threat. You know, fun stuff. Now, first,
the term synthetic media has a particular meaning. It refers
to art created through some sort of automated process, so
(06:59):
it's a largely hands off approach to creating the final
art piece. Now, under that definition, the example of Rising
Sun would not apply here, because we see in the
film, and presumably this happens in the book as well,
but I haven't read the book, that a human being
actually makes the changes. People have used tools to alter the
(07:22):
video footage. This would be more like using photoshop to
touch up a still image, with the computer system presumably
doing some of the work in the background to keep
things matched up. Either that or you would need to
alter each image in the footage frame by frame, or
use some sort of matte approach. To learn more about mattes,
(07:43):
you can listen to my episode about how blue and
green screens work. Synthetic media as a general practice has
been around for centuries. Artists have set up various contraptions
to create works with little or no human guidance. In
the twentieth century we started to see a movement called
generative art take form. This type of art is all
(08:05):
about creating a system that then creates or generates the
finished art piece. That would mean that the finished work,
such as a painting, wouldn't reflect the feelings or thoughts
of the artists who created the system. In fact, it
starts to raise the question what is the art? Is
it the painting that came about due to a machine
(08:26):
following a program of some sort, or is the art
the program itself? Is the art the process by which
the painting was made? Now, I'm not here to answer
that question. I just think it is an interesting question
to ask. Sometimes people ask much less polite questions, such
(08:46):
as is it art at all? Some art critics went
out of their way to dismiss generative art in the
early days. They found it insulting, but hey, that's kind
of the history of art in general. Each new movement
in art inevitably finds both supporters and critics as it emerges.
(09:07):
If anything, you might argue that such a response legitimizes
the movement in you know, a weird way. If people
hate it, it must be something. In two thousand eighteen,
an artist collective called Obvious, located out of Paris, France,
submitted portrait style paintings that were created not by
(09:27):
an actual human painter, but by an artificially intelligent system.
Now they looked a lot like typical eighteenth century style portraits.
There was no attempt to pass off the portrait as
if it were actually made by a human artist. In fact,
the appeal of the piece was largely due to it
(09:48):
being synthetically generated. It went to auction at Christie's and
the AI created painting fetched more than four hundred thousand dollars.
And the way the group trained their AI is relevant
to our discussion about deep fakes. The collective relied on
a type of machine learning called generative adversarial networks or
(10:11):
GANs, which in turn depend on deep learning.
So it looks like we've got a few things we're
gonna have to define here. Now, I'm going to keep
things fairly high level, because, as it turns out, there
are a few different ways to create machine learning models,
and to go through all of them in exhaustive detail
would represent a university level course in machine learning. I
(10:34):
have neither the time for that nor the expertise. I
would do a terrible job, so we'll go with a
high level perspective here. First, a generative adversarial network uses
two systems. You have a generator and you have a discriminator.
Both of these systems are a type of neural network.
(10:56):
A neural network is a computing model that is inspired
by the way our brains work. Our brains contain billions
of neurons, and these neurons work together, communicating through electrical
and chemical signals, controlling and coordinating pretty much everything in
our bodies. With computers, the neurons are nodes. The job
(11:18):
of a node is, you know, supposed to be kind
of like a neuron cell in the brain. It's to
take in multiple weighted input values and then generate a
single output value. Now, the word weighted, W E I
G H T E D, weighted, is really important here,
because the larger an input's weight, the more that input
(11:42):
will have an effect on whatever the output is. So
it kind of comes down to which inputs are the
most important for that node's particular function. Now, if I
were to make an analogy, I would say, your boss
hands you three tasks to do. One of those tasks
has the label extremely important, and the second task has
(12:03):
the label critically important, and the third task has a
label saying you should have finished that one before it
was handed to you. Okay, so that's just some sort
of snarky office humor that I need to get off
my chest. But more seriously, imagine a node accepting three inputs.
In this example, input one has a fifty percent weight, input
(12:24):
two has a forty percent weight, and input three has a ten
percent weight. That adds up to one hundred percent, and that would tell
you that the output that node generates will be most
affected by input one, followed by input two, and then
input three would have a smaller effect on whatever the
output is. Each node applies a nonlinear transformation on the
(12:48):
input values, again affected by each input's weight value, and
that generates the output value. The details of that really
are not important for our episode. It involves performing changes
on variables that in turn change the correlation between variables,
and it gets a bit mathy, and we would get
lost in the weeds pretty quickly. The important thing to
(13:11):
remember is that a node within a neural network takes
in a weighted sum of inputs, then performs a process
on those inputs before passing the result on as an output.
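To make that a little more concrete, here's a minimal sketch in Python of what a single node does with its weighted inputs. The specific numbers, the sigmoid activation function, and the variable names are just illustrative assumptions on my part, not anything from the episode.

import math

def node_output(inputs, weights, bias=0.0):
    # Weighted sum: each input gets multiplied by its weight before being added up.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Nonlinear transformation (here a sigmoid), which squashes the sum into the range 0 to 1.
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Three inputs with weights of 0.5, 0.4, and 0.1, echoing the fifty, forty, and ten percent example.
print(node_output([1.0, 0.2, 0.8], [0.5, 0.4, 0.1]))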
Then some other node a layer down will accept that output,
along with outputs from a couple of other nodes one
(13:33):
layer up, and then will perform an operation based on
those weighted inputs and pass that on to the next layer,
and so on. So these nodes are in layers, like,
you know, a cake. One layer of nodes processes some inputs
and sends the results on to the next layer of nodes, and
then that one passes its results on to the next one, and the
next one, and so on. This isn't a new idea.
(13:56):
Computer scientists began theorizing and experimenting with neural network approaches
as far back as the nineteen fifties with the perceptron,
which was a hypothetical system that was described by Frank
Rosenblatt of Cornell University. But it wasn't until the last
decade that computing power and our ability to handle a
(14:17):
lot of data reached a point where these sort of
learning models could really take off. The goal of this
system is to train it to perform a particular task
within a certain level of precision. The weights I mentioned
are adjustable, so you can think of it as teaching
a system which bits are the most important in order
(14:39):
to do whatever it is the system is supposed to
do in order to achieve your task. These are the
bits that are the most important and therefore should matter
the most when you weigh a decision. This is a
bit easier if we talk about a similar system with
the version of IBM's Watson that played on Jeopardy.
That system famously was not connected to the Internet. It
(15:03):
had to rely on all the information that was stored
within itself. When the system encountered a clue in Jeopardy,
it would analyze the clue, and then it would reference
its database to look for possible answers to whatever that
clue was. The system would weigh those possible answers and
attempt to determine which, if any, were the most likely
(15:25):
to be correct. If the certainty was over a certain threshold,
the system would buzz in with its answer.
If no response rose above that threshold, the system would
not buzz in. So you could say that Watson was
playing the game with a best guess sort of approach.
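Just to illustrate that best guess idea, here's a tiny hypothetical sketch in Python. The candidate answers, the confidence scores, and the threshold value are all made up for illustration; this isn't how Watson actually worked internally, just the thresholding idea.

def maybe_buzz_in(candidates, threshold=0.5):
    # candidates maps possible answers to the system's confidence in each one.
    best_answer = max(candidates, key=candidates.get)
    # Only answer if the top confidence clears the threshold; otherwise stay quiet.
    if candidates[best_answer] >= threshold:
        return best_answer
    return None

print(maybe_buzz_in({"What is Toronto?": 0.3, "What is Chicago?": 0.8}))  # buzzes in with Chicago
print(maybe_buzz_in({"What is Toronto?": 0.3, "What is Chicago?": 0.4}))  # returns None, stays quiet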
Neural networks do essentially that sort of processing. With this
(15:48):
particular type of approach, we know what we want the
outcome to be, so we can judge whether or not
the system was successful. After each attempt, we can adjust
the weights on the inputs between nodes to refine the decision
making process to get more accurate results. If the system
succeeds in its task, we can increase the weights that
(16:11):
contributed to the system picking the correct answer, and
decrease the weights of the inputs that did not contribute to the successful response.
If the system done messed up and gave the wrong answer,
then we do the opposite. We look at the inputs
that contributed to the wrong answer, we diminish their weights,
(16:33):
and we increase the weights of the other inputs, and
then we run the test again a lot. I'll explain
a bit more about this process when we come back,
but first let's take a quick break. Early in the
(16:54):
history of neural networks, computer scientists were hitting some pretty
hard stops due to the limitations of computing power at
the time. Early networks were only a couple of layers deep,
which really meant they weren't terribly powerful, and they could
only tackle rudimentary tasks like figuring out whether or not
a square is drawn on a piece of paper. That
(17:17):
isn't terribly sophisticated. In nineteen eighty six, David Rumelhart, Geoffrey Hinton, and
Ronald Williams published a paper titled Learning Representations by Back
Propagating Errors. This was a big breakthrough for deep learning.
This all has to do with a deep learning system
improving its ability to complete a specific task. And basically
(17:40):
the algorithm's job is to go from the output layer,
you know, where the system has made a decision, and
then work backward through the neural network, adjusting the weights
that led to an incorrect decision. So let's say it's
a system that is looking to figure out whether or
not a cat is in a photograph and it says,
(18:02):
there's a cat in this picture, and you look at
the picture and there is no cat there. Then you
would look at the inputs one level back just before
the system said here's a picture of a cat, and
you'd say, all right, which of these inputs led the
system to believe this was a picture of a cat?
And then you would adjust those. Then you would go
(18:23):
back one layer up, so you're working your way up
the model and say which inputs here led to it
giving the outputs that led to the mistake, and you
do this all the way up until you get up
to the input level at the top of the computer model.
You are back propagating, and then you run the test
(18:46):
again to see if you've got improvement. It's exhaustive, but
it's also drastically improved neural network performance, much faster than
just throwing more brute force at it. The algorithm is
essentially checking to see if a small change in
each input value received by a layer of nodes would
(19:06):
have led to a more accurate result. So it's all
about going from that output and working your way backward.
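For anyone who wants to see the shape of that idea in code, here's a very small backpropagation sketch in Python: a two layer network trained on a single example. The tiny network size, the sigmoid activation, the squared error loss, and the learning rate are all illustrative assumptions; real systems are far larger, but the move of computing the error at the output and pushing weight adjustments backward through the layers is the same.

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# Two inputs feeding two hidden nodes, which feed one output node, with random starting weights.
w_hidden = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
w_output = [random.uniform(-1, 1) for _ in range(2)]

x, target = [0.5, 0.9], 1.0  # one training example and the answer we want
learning_rate = 0.5

for _ in range(1000):
    # Forward pass: weighted sums plus nonlinear transformations, layer by layer.
    hidden = [sigmoid(sum(w_hidden[j][i] * x[i] for i in range(2))) for j in range(2)]
    output = sigmoid(sum(w_output[j] * hidden[j] for j in range(2)))

    # Backward pass: start from the error at the output...
    delta_out = (output - target) * output * (1 - output)
    # ...then propagate that error back to the hidden layer.
    delta_hidden = [delta_out * w_output[j] * hidden[j] * (1 - hidden[j]) for j in range(2)]

    # Nudge every weight a little in the direction that reduces the error.
    for j in range(2):
        w_output[j] -= learning_rate * delta_out * hidden[j]
        for i in range(2):
            w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]

print(round(output, 3))  # after training, this should be close to the target of 1.0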
In two thousand twelve, Alex Krizhevsky published a paper that gave
us the next big breakthrough. He argued that a really
deep neural network with a lot of layers could give
really great results if you paired it with enough data
to train the system. So you needed to throw lots
(19:29):
of data at these models, and it needed to be
an enormous amount of data. However, once trained, the system
would produce lower error rates. So yeah, it would take
a long time, but you would get better results. Now,
at the time, a good error rate for such a
system was twenty five percent. That means one out of four conclusions the
(19:51):
system would come to would be wrong. If you ran
it across a long enough number of decisions, you would
find that one out of every four wasn't right. The
system that Alex's team worked on produced results that had
an error rate of about sixteen percent, so much lower. And
then in just five years, with more improvements to this process,
(20:14):
the classification error rate had dropped down to two point
three percent for deep learning systems. So from twenty five percent to two
point three percent. It was really powerful stuff. Okay, so
you've got your artificial neural network. You've got your layers
and layers of nodes. You've adjusted the weights of the
(20:35):
inputs into each node to see if your system can identify,
you know, pictures of cats, and you start feeding images
to this system, lots of them. This is the domain
that you are feeding to your system. The more images
you can feed to it, the better. And you want
a wide variety of images of all sorts of stuff,
(20:56):
not just of different types of cats, but stuff that
most certainly isn't a cat, like dogs, or cars,
or certified public accountants. You name it, and you look
to see which images the system identifies correctly and which
ones it screws up, both the images it says have cats in
them that actually don't have cats in them, and the images
(21:17):
the system has identified as having no cat
when there actually is a cat there. This guides you into
adjusting the weights again and again, and you start over
and you do it again, and that's your basic deep
learning system, and it gets better over time as you
train it. It learns.
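As a rough illustration of that evaluation step, here's a small hypothetical Python sketch that tallies the two kinds of mistakes just described: false positives, where the system says cat and there isn't one, and false negatives, where it says no cat and there is one. The labels and guesses here are made up.

def count_mistakes(true_labels, predictions):
    # Both lists hold booleans: True means "there is a cat in this image."
    false_positives = sum(1 for truth, guess in zip(true_labels, predictions) if guess and not truth)
    false_negatives = sum(1 for truth, guess in zip(true_labels, predictions) if truth and not guess)
    return false_positives, false_negatives

truth = [True, False, True, False, False]
guesses = [True, True, False, False, False]
print(count_mistakes(truth, guesses))  # (1, 1): one phantom cat, one missed cat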
Now, let's transition over to the adversarial systems I mentioned earlier, because they take this and
(21:40):
twist it a little bit. So you've got to artificial
neural networks and they are using this general approach to
deep learning, and you're setting them up so that they
feed into each other. One network, the generator, has the
task of learning how to do something, such as create
(22:01):
an eighteenth century style portrait, based off lots and lots
of examples of the real thing, the domain, the problem
domain. The second network, the discriminator, has a different job.
It has to tell the difference between authentic portraits that
came from the problem domain and computer generated portraits that
(22:24):
came from the generator itself. So essentially, the discriminator is
like the model I mentioned earlier that was identifying pictures
of cats. It's doing the same sort of thing, except
instead of saying cat or no cat, it's saying real
portrait or computer generated portrait. So there are essentially two
outcomes the discriminator could reach, and that's whether or not
(22:44):
an image is computer generated or it isn't. So do you
see where this is going? You train up both models.
You have the generator attempt to make its own version
of something, such as that eighteenth century portrait. It does
so by designing the portrait based on what the
model believes are the key elements of a portrait, so
(23:05):
things like colors, shapes, the ratio of size, like you know,
how large should the head be in relation to the body.
All of these factors and many more come into play.
The generator creates its own idea of what a portrait
is supposed to look like, and chances are the early
rounds of this will not be terribly convincing. The results
(23:30):
are then fed to the discriminator, which tries to suss
out which of the images fed to it are computer
generated and which ones aren't. After that round, both models
are tweaked. The generator adjusts input weights to get closer
to the genuine article, and the discriminator adjusts its weights to
reduce false positives or to catch computer generated images. And
(23:53):
then you go again and again and again and again,
and they both get better over time. So, assuming everything
is working properly, the adjustment of input weights
will lead to more convincing results over time, and given enough time
and enough repetition, you'll end up with a computer generated
painting that you can auction off for nearly half a million dollars.
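If you want a feel for that back and forth in code, here's a deliberately tiny toy in Python. It is not a real GAN, there are no neural networks or gradients here; it just mimics the adversarial loop with one number each: a generator trying to produce values that look like the real data, and a discriminator trying to tell real from fake, each adjusting after every round. All the numbers and update rules are illustrative assumptions.

import random
random.seed(0)

REAL_MEAN = 10.0  # the "problem domain": real data clusters around 10

def real_sample():
    return random.gauss(REAL_MEAN, 1.0)

def fake_sample(g_mean):
    # The generator's current guess at what real data looks like.
    return random.gauss(g_mean, 1.0)

def looks_real(x, center, radius=2.0):
    # A one-parameter discriminator: anything close to its idea of "real" passes.
    return abs(x - center) < radius

def times_fooled(g_mean, center, trials=32):
    # How often the generator's samples slip past the discriminator.
    return sum(looks_real(fake_sample(g_mean), center) for _ in range(trials))

g_mean, d_center, step = 0.0, 0.0, 0.25

for _ in range(300):
    reals = [real_sample() for _ in range(32)]
    # Discriminator update: shift its idea of "real" toward the real samples it just saw.
    d_center += 0.1 * (sum(reals) / len(reals) - d_center)
    # Generator update: nudge its mean in whichever direction fools the discriminator more.
    if times_fooled(g_mean + step, d_center) >= times_fooled(g_mean - step, d_center):
        g_mean += step
    else:
        g_mean -= step

print(round(g_mean, 2))  # settles near 10, the center of the real data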
(24:13):
Though keep in mind that huge price relates
back to the novelty of it being an early AI
generated painting. It would be shocking to me if we
saw that actually become a trend. Also, the painting, while interesting,
isn't exactly so astounding as to make you think there's
no way a machine did that. You'd look at them
(24:35):
and go, yeah, I can imagine a machine did that one.
A group of computer scientists first described the generative adversarial
network architecture in a paper in two thousand and fourteen,
and like other neural networks, these models require a lot
of data. The more the better. In fact, smaller data
sets mean the models have to make some pretty big assumptions,
(24:56):
and you tend to get pretty lousy results. More data,
as in more examples, teaches the models more about the
parameters of the domain, whatever it is they are trying
to generate. It refines the approach. So if you have
a sophisticated enough pair of models and you have enough
data to fill up a domain, you can generate some
convincing material, and that includes video. And this brings us
(25:20):
around to deep fakes. And in addition to generative adversarial networks,
a couple of other things really converged to create the
techniques and trends and technology that would allow for deep
fakes proper. In nineteen ninety seven, Malcolm Slaney, Michele Covell, and Christoph Bregler
(25:42):
wrote some software that they called the Video Rewrite Program.
The software would analyze faces and then create or synthesize
lip animation which could be matched to pre recorded audio.
So you could take some film footage of a person
and reanimate their lips so that they could appear
(26:03):
to say all sorts of things, which in some ways
set the stage for deep fakes. In this case, it was
really just focusing on the lips and the general area
around the lips, so you weren't changing the rest of
the expression of the face, and you would have to,
you know, keep your recording to be about the same
(26:23):
length as whatever the film clip was, or you would
have to loop the film clip over and over,
which would make it, you know, far more obvious that
this was a fake. In addition, motion tracking technology was
advancing over time too, and this also became an important
tool in computer animation. This tool would also be used
by deep fake algorithms to create facial expressions, manipulating the
(26:45):
digital image just as it would if it were a
video game character or a Pixar animated character. Typically, you
need to start with some existing video in order to
manipulate it. You're not actually computer generating the animation, like,
you're not creating a computer generated version of whomever it
is you're doing the fake of. You're using existing imagery
(27:11):
in order to do that and then manipulating that existing imagery,
So it's a little different from computer animation. In two
thousand and sixteen, students and faculty at the Technical University
of Munich created the Face2Face project, that would
be face, the numeral two, and then face, and this
was particularly jaw dropping to me at the time when
(27:33):
I first saw these videos. I was floored.
They created a system that had a target actor. This
would be the video of the person that you want
to manipulate. In the example they used, it was former
US President George W. Bush. Their process also had a
source actor. This was the source of the expressions and
(27:56):
facial movements you would see in the target. So kind
of like a digital puppeteer in a way, but the
way they did it was really cool. They had a
camera trained on the source actor and it would track
specific points of movement on the source actor's face, and
then the system would manipulate the same points of movement
(28:17):
on the target actor's face in the video. So if
the source actor smiled, then the target smiled, so the
source actor would smile, and then you would see George W.
Bush in the video smile in real time. It was
really strange. They used this looping video of George W.
Bush wearing a neutral expression. They had to start with
(28:41):
that as their sort of zero point, and I
gotta tell you, it really does look like the former
president George W. Bush is having a bit of a
freak out on a looping video, because he keeps on
opening his mouth, closing his mouth, grimacing, raising his eyebrows.
You need to watch this video. It is still available
(29:02):
online to check out. In two thousand seventeen, students and faculty over
at the University of Washington created the Synthesizing Obama project,
in which they trained a computer model to generate a
synthetic video of former US President Barack Obama, and they
made it lip sync to a pre recorded audio clip
from one of Obama's addresses to the nation. They actually
(29:26):
had the original video of that address for comparison, so
they could look back at that and see how their
generated one compared to the real thing. And their approach
used a model that analyzed hundreds of hours of video
footage of Obama speaking, and it mapped specific mouth shapes
(29:47):
to specific sounds. It would also include some of Obama's mannerisms,
such as how he moves his head when he talks
or uses facial expressions to emphasize words. And watching the
video of the real one next to the
generated one is pretty strange. You can tell the generated
one isn't quite right. It's not matching the audio exactly,
(30:10):
at least not on the early versions, but it's fairly
close and it might even pass casual inspection for a
lot of people who weren't, like, you know, actually paying attention.
Authors Maras and Alexandrou defined deep fakes as quote the
product of artificial intelligence applications that merge, combine, replace, and
(30:31):
superimpose images and video clips to create fake videos that
appear authentic end quote. They first emerged in twenty seventeen, and
so this is a pretty darn young application of technology.
One thing that is worrisome is that once someone has
access to the tools, it's not that difficult to create
(30:52):
a deep fake video. You pretty much just need a
decent computer, the tools, a bit of know how on
how to do it, and some time you also need
some reference material, as in like videos and images of
the person that you are replicating, and like the machine
learning systems I've mentioned, the more reference material you have,
(31:14):
the better. That's why the deep fakes you encounter these
days tend to be of notable famous people like celebrities
and politicians. Mainly there's no shortage of reference material for
those types of individuals, and so they are easier to
replicate with deep fakes than someone who maintains a much
lower profile. Not to say that that will always be
(31:35):
the case, or that there aren't systems out there that
can accept smaller amounts of reference material. It's just harder
to make a convincing version with fewer samples. But in
order to make a convincing fake, the system really has
to learn how a person moves. All those facial expressions matter.
(31:58):
It also has to learn how a person sounds. Will
get into sound a little bit later. But mannerisms, inflection, accent, emphasis, cadence,
quirks and tics, all of these things have to be
analyzed and replicated to make a convincing fake, and it
has to be done just right or else it comes
off as creepy or unrealistic. Think about how impressionists will
(32:21):
take a celebrity's manner of speech and then heighten some
of it for comedic effect. You'll hear it all the time
with folks who do impressions of people like Jack Nicholson
or Christopher Walken or Barbra Streisand, people who have a
very particular way of speaking. Impressionists will take those as
markers and they really punch in on them. Well, a
(32:43):
deep fake can't really do that too much, or else
it won't come across as genuine. It'll feel like you're
watching a famous person impersonating themselves, which is weird. Now.
The earliest mention of deep fakes I can find dates
to a two thousand seventeen Reddit forum in which users
shared deep faked videos that appeared to show female
(33:03):
celebrities in sexual situations. Heads and faces had been replaced,
and the actors in pornographic movies had their heads or
faces swapped out for these various celebrities. Now the fakes
can look fairly convincing, extremely convincing in some cases, which
can lead to some people assuming that the videos are
(33:25):
genuine and that the folks that they saw in the
videos are really the ones who were in it. And
obviously that's a real problem, right, I mean that this
technology we've given enough reference data DEFEATA system, someone could
fabricate a video that appears to put a person in
a compromising position, whether it's a sexual act or making
(33:47):
damaging statements or committing a crime or whatever. And there
are tools right now that allow you to do pretty
much what the Face2Face tool was doing back
in two thousand sixteen, a program called Avatarify. It's
just not that easy to say. Anyway, it can run
on top of live streaming conference services like Zoom and Skype,
(34:08):
and you can swap out your face for a celebrity's face.
Your facial expressions map to the computer manipulated celebrity face.
It just looks at you through your webcam, and then
if you smile, the celebrity image smiles, et cetera. It's
like that old Face2Face program. It does need
a pretty beefy PC to manage doing all this because
(34:32):
you're also running that live streaming service underneath it. It's
also not exactly user friendly. You need some programming experience
to really get it to work. But it is widely accessible,
as the source code is open source and it's
on GitHub, so anyone can get it. Samantha Cole,
who writes for Vice, has covered the topic of deep
(34:55):
fakes pretty extensively and the potential harm they can cause,
and I recommend you check out her work if you're
interested in learning more about that. Do be warned that
Cole covers some pretty adult themed topics, and I think
she does great work and very important work. But as
a guy who grew up in the Deep South, it's
(35:15):
also the kind of stuff that occasionally makes me clutch
my pearls, But that's more of a statement about me
than her work. She does great work. I think most
of us can imagine plenty of scenarios in which this
sort of technology could cause mischief on a good day
and catastrophe on a bad day, whether it's spreading misinformation,
(35:35):
creating fear, uncertainty and doubt, FUD, or by making people
seem to say things they never actually said, or contributing
to an ugly subculture in which people try to make
their more base fantasies a reality by putting one person's
head on another person's body. You know, it's not great.
There are legitimate uses of the technology too, of course,
(35:58):
you know, tech itself is rarely good or bad. It's
all in how we use it. But this particular technology
has a lot of potentially harmful uses, and Samantha Cole
has done a great job explaining them. When we come back,
I'll talk a bit more about the war against deep
fakes and how people are trying to prepare for a
world that is increasingly filled with media we can't really trust.
(36:21):
But first let's take a quick break. Before the break,
I mentioned Samantha Cole, who has written extensively about deep fakes,
and one point she makes that I think is important
for us to note is that the vast majority of
(36:44):
instances of deep fake videos haven't been some manufactured video
of a political leader saying inflammatory things. That continues to
be a big concern. There's a genuine fear that someone
is going to manufacture a video in which a politician
appears to say or do something truly terrible in an
effort to either discredit the politician or perhaps instigate a
(37:08):
conflict with some other group. There are literal doomsday scenarios
in which such a video would prompt a massive military response,
though that does seem like it might be a little
far fetched, though heck, I don't know, considering the world
we live in, maybe it's not that big of a
stretch anyway. Cole's point is that so far, that has
(37:30):
not happened. She points out that the most frequent use
for the tech either tends to be people goofing around
or disturbingly using it to in her words, quote, take
ownership of women's bodies in non consensual porn end quote.
Cole argues that the reason we haven't really seen deep
fakes used much outside of these realms, apart from a
(37:52):
few advertising campaigns, is that people are pretty good at
spotting deep fakes. They aren't quite at a level where
they can easily pass for the real thing. There's still
something slightly off about them. They tend to butt up
against the uncanny valley. Now, for those of you not
familiar with that term, the uncanny valley describes the feeling
(38:13):
we humans get when we encounter a robot or a
computer generated figure that closely resembles a human or human behavior,
but you can still tell it's not actually a person,
and it's not a good feeling. It tends to be
described as repulsive and disturbing, or at the very best,
(38:34):
off putting. See also the animated film Polar Express. There's
a reason that when that film came out, people kind
of reacted negatively to the animation, and it's also a
reason why Pixar tends to prefer to go with stylized
human characters who are different enough from the way real
(38:54):
humans look to kind of bypass uncanny valley. We just
think of that as a cartoon, not as something that's trying to
pass itself off as being human. But while there hasn't
really been a flood of fake videos hitting the Internet
with the intent to discredit politicians or infuriate specific people
or whatever, there remains a general sense that this is coming.
(39:15):
It's just not here now. The sense I get is
that people feel it's an inevitability, and there are already
folks working on tools that will help us sort out
the real stuff from the fakes. Take Microsoft, for example.
Their R and D division, fittingly called Microsoft Research, developed
a tool they called the Video Authenticator. This tool analyzes
(39:38):
video samples and looks for signs of deep fakery. In
a blog post written by Tom Burt and Eric Horvitz,
two Microsoft executives, they say, quote, it works by detecting
the blending boundary of the deep fake and subtle fading
or gray scale elements that might not be detectable by
the human eye. End quote. Now I'm no expert, but
(40:01):
to me, it sounds like the video Authenticator is working
in a way that's not too dissimilar to a discriminator
in a generative adversarial network. I mean, the whole purpose
of the discriminator is to discriminate or to tell the
difference between genuine, unaltered videos and computer generated ones. So
(40:23):
the Video Authenticator is looking for telltale signs that a
video was not produced through traditional means but was computer generated. However,
that's the very thing that the generators in GAN
systems are looking out for. So when a generator
receives feedback that a video it generated did not slip
(40:43):
past the discriminator, it then tweaks those input weights and
starts to shift its approach in order to bypass whatever
it was that gave away its last attempt, and it
does this again and again. So the video authenticator might
work well for a given amount of time, but I
would suspect that in the long run, the deep fake
(41:05):
systems will become sophisticated enough to fool the authenticator. Of course,
Microsoft will continue to tweak the authenticator as well, and
it will become something of a seesaw battle as one
side outperforms the other temporarily, and then the balance will shift.
Though there may come a time where either the deep
fakes are too good and they don't set off any
(41:27):
alarms from the discriminator, or the discriminator gets so sensitive
that it starts to flag real videos and hits a
lot of false positives and calls them generated videos instead.
Either way, you reach a point where a tool like
this no longer really serves a useful purpose, and the
video authenticator will be obsolete. Now, this is something we
(41:51):
see in artificial intelligence all the time. If you remember
the good old days of CAPTCHAs, you know, the proving
you're not a robot stuff. The stuff we were
told to do was typically type in a series of
letters and numbers, and it wasn't that hard, at
least not at first. That's because the
text recognition algorithms of the time weren't very good. They
(42:14):
couldn't decipher mildly deformed text because the shapes of the
text felt too far outside the parameters of what the
system could recognize as a legitimate letter or number. You
make the number a little, you know, deformed, and then
suddenly the system's like, well, that doesn't look like a
three to me, because it's not in the shape of
(42:34):
a three. But over time, people developed better text recognition
programs that could recognize these shapes even if they weren't
in a standard three orientation, and those systems began to
defeat those simple early CAPTCHAs, which required CAPTCHA designers to
make tougher versions. And eventually the machines got good enough
(42:55):
that they could match or even outperform humans, and at
that point those text based CAPTCHAs proved to be more
challenging for people than for machines, which meant if you
use them, you defeated the whole purpose in the first place.
So while this escalation proved to be a challenge for security,
it was a boon for artificial intelligence. And while I
(43:15):
focused almost exclusively on the imagery of video here, the
same sort of stuff is going on with generated speech,
including generated speech that imitates specific voices. Like deep fake videos,
this approach works best if you have a really big
data set of recorded audio, so people like movie and
TV stars, news reporters, politicians, and um, you know, podcasters,
(43:42):
we're great targets for this stuff. There might be hundreds
or you know, in my case, thousands of hours of
recording material to work from. Training a model to use
the frequencies, timbre, intonation, pronunciation, pauses, and other mannerisms of
speech can result in a system that can generate vocals
(44:02):
that sound like the target, sometimes to a fairly convincing degree.
And for a while, to peek behind the curtain here,
we at tech stuff were working with a company that
I'm not going to name, but they were going to
do something like this as an experiment. I was going
to do a whole episode on it, and I had
planned on crafting a segment of that episode only through text.
(44:25):
I was not going to actually record it myself and
then use a system that was trained on my voice
to replicate my voice and deliver that segment on its own.
I was curious if it could nail not just the
audio quality of my voice, which, let's be honest, is amazing.
That's sarcasm. I can't stand listening to myself, but it
(44:48):
would also have to replicate how I actually make certain sounds,
Like would it get the bit of the southern accent
that's in my voice, or the way I emphasize certain words.
Would it pause for effect at all? Or would it
just robotically say one word after the next and only
pause when there was some helpful punctuation that told it
(45:09):
to do so. Would it indicate a question by raising
the pitch at the end of its sentence. Sadly, we
never got far with that particular project, so I don't
have any answers for you. I don't know how it
would have turned out, But clearly one of the things
I thought of was that it's a bit of a
red flag. If you can train a computer to sound
(45:30):
exactly like a specific person, that means you could make
that person say anything you like, and obviously, like deep
fake videos, that could have some pretty devastating consequences if
it were at all, you know, believable or seemed realistic. Now,
the company we were working with was working hard to
make sure that the only person to have access to
(45:52):
a specific voice would be the owner of that voice,
or presumably the company employing that person, though that does
bring up a whole bunch of other potential problems, like
can you imagine eliminating voice actors from a job because
you've got enough of their voice and you can just
replicate it. That wouldn't be great, But even so, it
was something I felt was both fascinating from a technology
(46:14):
standpoint and potentially problematic when it comes to an application
of that technology. One other thing I should mention is
that the Internet at large has been pretty active in
fighting deep fakes, not necessarily in detecting them, but removing
the platforms from which they were being shared, Reddit being
a big one, the subreddit that was dedicated to deep
(46:36):
fakes has been shut down, so there have been
some of those moves as well. Now this is not
directly against the technology, it's more against the proliferation of
the, uh, output of that technology. As for detecting
deep fakes, it's interesting to me that people are even
developing tools to detect them, because to me, the best
(46:57):
tool so far seems to be human perception. It's not
that the images aren't really convincing, or that we can
suddenly detect these, you know, blending lines like the video
Authenticator tool. It's rather that it's just not hard for
us to spot a deep fake. Stuff just doesn't quite
look right in the way that people behave in these videos.
(47:21):
The vocals and animation often don't quite match. The expressions
aren't really natural, the progression of mannerisms feels synthetic and
not genuine. It just looks off. It's that uncanny
Valley thing, and so just paying attention and thinking critically
can really help you suss out the fakes from the
(47:41):
real thing. Even if we reach a point where machines
can create a convincing enough fake to pass for reality,
we can still apply critical thinking, and we always should. Heck,
we should be applying critical thinking even when there's no
doubt as to the validity of the video, because there
may be reason enough to doubt the content of the video itself.
(48:04):
If I listen to a genuine scam artist in a
genuine video, that doesn't make the scam more legitimate. We
always need to use critical thinking. What I think is
most important is that we acknowledge the very real fact
that there are numerous organizations, agencies, governments, and other groups
that are actively attempting to spread misinformation and disinformation. There
(48:29):
are entire intelligence agencies dedicated to this endeavor, and then
there are more independent groups that are doing it for
one reason or another, typically either to advance a particular
political agenda or just to make as much money as
quickly as possible. This is beyond doubt or question. There
(48:50):
are numerous misinformation campaigns that are actively going on out
there in the real world right now. Most of them
are not depending on deep fakes, because one, deep fakes
aren't really good enough to fool most people right now,
and two, they don't need the deep fakes in the
first place. There are other methods that are simpler that
(49:11):
don't need nearly the processing power and that work just fine.
Why would you go through the trouble of synthesizing a
video if you can get a better response with a
blog post filled with lies or half truths. It's just
not a great return on investment. So bottom line, be
vigilant out there, particularly on social media. Be aware that
(49:33):
there are plenty of people who will not hesitate to
mislead others in order to get what they want. Use
a critical eye to evaluate the information you encounter. Ask questions,
check sources, look for corroborating reports. It's a lot of work,
but trust me, it's way better that we do our
best to make sure the stuff we're depending on is
(49:56):
actually dependable. It'll turn out better for us in the long run.
Well that wraps up this episode of tech stuff, which yeah,
I used as a backdoor to argue for critical thinking again.
Sue me. Don't, don't really sue me. But I think
that that's another instance where it's a really clear example
where we have to use that kind of stuff. So
(50:18):
I'm gonna keep on stressing it. And you guys are awesome.
I believe in you. I think that when we start
using these tools at our disposal that everybody can develop
just with some practice that things will be better. We'll
be able to suss out the nonsense from the real stuff,
and we're all better off in the long run if
(50:41):
we can do that. If you guys have suggestions for
future topics I should cover in episodes of tech Stuff,
let me know via Twitter. The handle is TechStuffHSW,
and I'll talk to you again really soon.
(51:01):
Tech Stuff is an I Heart Radio production. For more
podcasts from I Heart Radio, visit the i Heart Radio app,
Apple Podcasts, or wherever you listen to your favorite shows.