Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:15):
Pushkin. You're listening to Brave New Planet, a podcast about
amazing new technologies that could dramatically improve our world. Or
if we don't make wise choices, could leave us a
(00:36):
lot worse off. Utopia or dystopia. It's up to us.
On November eleventh, twenty sixteen, the Babelfish burst from fiction
into reality. The Babelfish was conceived forty years ago in
(00:58):
Douglas Adams's science fiction classic The Hitchhiker's Guide to the Galaxy.
In the story, a hapless Earthling finds himself a stowaway
on a Vogon spaceship. When the alien captain starts an
announcement over the loudspeaker, his companion tells him to stick
a small yellow fish in his ear. Listen, it's important,
(01:24):
it's a... I can't just put this in your ear.
Suddenly he's able to understand the language. The Babelfish is small, yellow,
leech-like, and probably the oddest thing in the universe.
It feeds on brainwave energy, absorbing all unconscious frequencies,
(01:46):
the practical upshot of which is that if you stick
one in your ear, you instantly understand anything said to
you in any form of language. At the time, the
idea of sticking an instantaneous universal translator in your ear
seemed charmingly absurd. But a couple of years ago, Google
and other companies announced plans to start selling Babelfish. Well,
(02:09):
not fish actually, but earbuds that do the same thing.
The key breakthrough came in November twenty sixteen, when Google
replaced the technology behind its translate program. Overnight, the Internet
realized that something extraordinary had happened. A Japanese computer scientist
ran a quick test. He dashed off his own Japanese
(02:31):
translation of the opening lines of Ernest Hemingway's short story
The Snows of Kilimanjaro, and dared Google Translate to turn
it back into English. Here's the opening passage from the
Simon and Schuster audiobook. Kilimanjaro is a snow-covered
mountain nineteen thousand, seven hundred and ten feet high and
(02:53):
is said to be the highest mountain in Africa. Its
western summit is called the Masai Ngaje Ngai, the House
of God. Close to the western summit there is the
dried and frozen carcass of a leopard. No one has
explained what the leopard was seeking at that altitude. Let's just
consider that last sentence. No one has explained what the
(03:16):
leopard was seeking at that altitude. One day earlier, Google
had mangled the back translation, quote: Whether the leopard had
what the demand at that altitude? There is no that
nobody explained. But now Google Translate returned, quote: No one
(03:37):
has ever explained what leopard wanted at that altitude. It
was perfect, except for a missing "the." What explained
the great leap? Well, Google had built a predictive algorithm
that taught itself how to translate between English and Japanese
(03:59):
by training on a vast library of examples and tweaking
its connections to get better and better at predicting the
right answer. In many ways, the algorithm was a black box.
No one understood precisely how it worked, but it did
amazingly well. Predictive algorithms turn out to be remarkably general.
(04:22):
They can be applied to predict which movies a Netflix
user will want to see next, or whether an eye
exam or a mammogram indicates disease. But it doesn't stop there.
Predictive algorithms are also being trained to make societal decisions:
who to hire for a job, whether to approve a
mortgage application, what students to let into a college, what
(04:46):
arrestees to let out on bail. But what
exactly are these big black boxes learning from massive data sets?
Are they gaining deep new insights about people? Or might
they sometimes be automating systemic biases? Today's big question: when
(05:07):
should predictive algorithms be allowed to make big decisions about people?
And before they judge us, should we have the right
to know what's inside the black box? My name is
Eric Lander. I'm a scientist who works on ways to
improve human health. I helped lead the Human Genome Project,
and today I lead the Broad Institute of MIT and Harvard.
(05:31):
In the twenty first century, powerful technologies have been appearing
at a breathtaking pace, related to the Internet, artificial intelligence,
genetic engineering, and more. They have amazing potential upsides, but
we can't ignore the risks that come with them. The
decisions aren't just up to scientists or politicians. Whether we
(05:52):
like it or not, we all of us are the
stewards of a brave New planet. This generation's choices will
shape the future as never before. Coming up on today's
episode of Brave New Planet, predictive algorithms. We hear from
(06:20):
a physician at Google about how this technology might help
keep millions of people with diabetes from going blind, and
the idea was, well, if you could retrain the model,
you could get to more patients to screen them for disease.
The first iteration of the model was on par with
the US board-certified ophthalmologists. I speak with an AI
(06:41):
researcher about how predictive algorithms sometimes learn to be sexist
and racist. If you typed in I am a white man,
you would get positive sentiment. If you typed in I
am a black lesbian, for example, negative sentiment. We hear
how algorithms are affecting the criminal justice system. For black defendants,
it was much more likely to incorrectly predict that they
(07:05):
were going to go on to commit a future
crime when they didn't, and for white defendants it was
much more likely to predict that they were going to
go on to not commit a future crime when they did.
And we hear from a policy expert about whether these
systems should be regulated. A lot of the horror stories
are about fully implemented tools that were in the works for years.
(07:25):
There's never a pause button to reevaluate or look at
how a system is working in real time. Stay with us.
Chapter one, The Big Black Box. To better understand these algorithms,
I decided to speak with one of the creators of
the technology that transformed Google Translate. My name is Greg Corrado,
(07:49):
and I'm a distinguished scientist at Google Research. Early in
his career, Greg had trained in neuroscience, but he soon shifted
his focus from organic intelligence to artificial. And that turned
out to be really a very lucky moment, because I
was becoming interested in artificial intelligence at exactly the moment
that artificial intelligence was changing so much. Ever since the
(08:12):
field of artificial intelligence started more than sixty years ago,
there have been two warring approaches about how to teach
machines to do human tasks. We might call them human
rules versus machine learning. The way that we used to
try to get computers to recognize patterns was to program
(08:32):
into them specific rules. So we would say, oh, well,
you can tell the difference between a cat and a
dog by how long its whiskers are and what kind
of fur it has and does it have stripes? And
trying to put these rules into computers. It kind of worked,
but it made for a lot of mistakes. The other
(08:52):
approach was machine learning: let the computer figure everything out
for itself, somewhat like the biological brain. The machine learning
system is actually built of tiny little decision makers or neurons.
They start out connected very much in random ways, but
we give the system feedback. So, for example, if it's
(09:13):
guessing between a cat and a dog and it gets
one wrong, we tell the system that it got one wrong,
and we make little changes inside so that it's much
more likely to recognize that cat as a cat and
not mistake it for a dog. Over time, the system
gets better and better and better. Machine learning had been
around for decades with rather unimpressive results. The number of
(09:37):
connections and neurons in those early systems was pretty small.
We didn't realize until about two thousand and ten that computers
had gotten fast enough and the data sets were big
enough that these systems could actually learn from patterns and
learn from data better than we could describe rules manually.
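To make that concrete, here is a minimal sketch of the feedback loop Corrado describes: a single toy "neuron" with invented features and examples, not any production system.

```python
import random

# A toy "neuron": a weighted sum of two made-up features of an image
# (say, whisker length and ear size), turned into a cat/dog guess.
weights = [random.uniform(-1, 1), random.uniform(-1, 1)]  # start out random
bias = random.uniform(-1, 1)

def guess(features):
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return "cat" if score > 0 else "dog"

# Training examples: (features, correct label). Numbers are invented.
examples = [([0.9, 0.2], "cat"), ([0.1, 0.8], "dog"),
            ([0.8, 0.3], "cat"), ([0.2, 0.9], "dog")]

learning_rate = 0.1
for _ in range(1000):                       # see many examples, many times
    features, label = random.choice(examples)
    if guess(features) != label:            # feedback: it got one wrong
        direction = 1 if label == "cat" else -1
        # make little changes inside so this example is more likely to be right
        weights = [w + learning_rate * direction * f
                   for w, f in zip(weights, features)]
        bias += learning_rate * direction
```

Real networks have millions of these adjustable connections and use calculus to decide the nudges, but the loop of guess, compare, adjust is the same one Corrado describes.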
(10:02):
Machine learning made huge leaps. Google itself became the leading
driver of machine learning. In twenty eleven, Corrado joined with
two colleagues to form a unit called Google Brain. Among
other things, they applied a machine learning approach to language translation.
(10:22):
The strategy turned out to be remarkably effective. It doesn't
learn French the way you would learn French in high school.
It learns French the way you would learn French at home,
much more like the way that a child learns the language.
We give the machine the English sentence, and then we
give it an example of a French translation of that
(10:45):
whole sentence. We show a whole lot of them, probably
more French and English sentences than you could read in
your whole life. And by seeing so many examples of
entire sentences, the system is able to learn, oh, this
is how I would say this in French. That's actually,
at this point about as good as a bilingual human
(11:09):
would produce. Soon Google was training predictive algorithms for all
sorts of purposes. We use neural network predictors to help
rank search results, to help people organize their photos, to recognize speech,
to find driving directions, to help complete emails. Really anything
that you can think of where there's some notion of
(11:30):
finding a pattern or making a prediction, artificial intelligence might
be at play. Predictive algorithms would become ubiquitous in commerce.
They let Netflix know which movies to recommend to each customer,
Amazon to suggest products users might be interested in purchasing,
and much more. While they're shockingly useful, they can also
(11:53):
be inscrutable. Modern neural networks are like a black box.
Understanding how they make their predictions can be surprisingly difficult.
When you build an artificial neural network, you do not
necessarily understand exactly the final state of how it works.
Figuring out how it works becomes its own science project.
(12:16):
One thing we do know: predictive algorithms are especially sensitive
to the choice of examples used to train them. The
systems learn to imitate the examples in the data that
they see. You don't know how well they will do
on things that are very different. So, for example, if
you train a system to recognize cats and dogs, but
(12:38):
you only ever show it border collies and tabbycats, it's
not clear what it will do. When you show it
a picture of a chihuahua, since all it's ever seen is border collies,
it may not get the right answer. So its concept
of dog is going to be limited by the dogs
it's seen. That's right, and this is why diversity of
(13:00):
data in machine learning systems is so important. You have
to have a data set that represents the entire spectrum
of possibilities that you expect the system to work under.
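A tiny simulation of Corrado's border collie example, with invented "size" and "fur length" numbers standing in for real images, shows how a narrow training set limits what a model can recognize:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Made-up "size" and "fur length" features. The training set contains only
# border collies (big dogs, label 0) and tabby cats (small, label 1).
collies = rng.normal([25.0, 5.0], 1.5, size=(200, 2))
tabbies = rng.normal([4.0, 3.0], 1.0, size=(200, 2))
X_train = np.vstack([collies, tabbies])
y_train = np.array([0] * 200 + [1] * 200)

model = LogisticRegression().fit(X_train, y_train)

# Chihuahuas are dogs, but in size they sit right next to the cats the
# model has seen, so most of them get called "cat."
chihuahuas = rng.normal([3.0, 2.0], 0.8, size=(200, 2))
print("accuracy on chihuahuas:", model.score(chihuahuas, np.zeros(200)))
```

The model's concept of "dog" really is just the dogs it has seen, which is why the diversity of the training data matters so much.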
Teaching algorithms turns out to be not so different than
teaching people. They learn what they see. Chapter two, Retinal Fundoscopy.
(13:27):
It's cool that predictive algorithms can learn to translate languages
and suggest movies, but what about more life changing applications.
My name is Lily Peng. I am a physician by training,
and I am a product manager at Google. I went
to visit Dr. Peng because she and her colleagues are
using predictive algorithms to help millions of people avoid going blind. So.
(13:50):
Diabetic retinopathy is a complication of diabetes that affects the
back of the eye, the retina. One of the devastating
complications is vision loss. All patients that have diabetes need
to be screened once a year for diabetic retinopathy.
This is an asymptomatic disease, which means that you do
not feel the symptoms. You don't experience vision loss until
(14:12):
it's too late. Now, diabetes is epidemic around the world.
How many diabetics are there? Well, by most estimates, there
are over four hundred million patients in the world with diabetes.
How do you screen a patient to see whether they
have diabetic retinopathy? You need to have a special camera,
well, a fundus camera, and it takes a picture through
(14:34):
the pupil of the back of the eye. We have
a very small supply of retina specialists and eye doctors
and they do a lot more than reading images, so
they needed to scale the reading of these images. Four
hundred million people with diabetes. There just aren't enough specialists
for all the retinal images that need reading, especially in
(14:55):
some countries in Asia where resources are limited and the
incidence of diabetes is skyrocketing. Two hospitals in southern India
recognized the problem and reached out to Google for help.
At that point, Google was already sort of well known for
image recognition. We were classifying cats and dogs and consumer images,
(15:18):
and the idea was, well, if you could retrain the
model to recognize diabetic retinopathy, you could potentially help the
hospitals in India get to more patients to screen them
for disease. How did you and your colleagues set out
to attack this problem? So when I first started the project,
we had about one hundred thirty thousand images from eye
(15:40):
hospitals in India as well as a screening program in
the US. Also, we gathered an army of ophthalmologists to
grade them. Eight hundred eighty thousand diagnoses were rendered on
one hundred thirty thousand images. So we took this training
data and we put it in a machine learning model.
The first iteration of the model
(16:01):
was on par with the US board-certified ophthalmologists. Since then,
we've made some improvements to the model, and the initial
training took about how long? The first time we train
a model, it may have taken a couple of weeks,
But then the second time you train the next models
and next models, it's just it's shorter and shorter, sometimes overnight,
(16:22):
sometimes overnight. Well, yes, all right, And by contrast, how
long does it take to train a board-certified ophthalmologist?
So that usually takes at least five years, and then
you also have additional fellowship years to specialize in the retina.
And at the end of that you only have one
board-certified ophthalmologist. Yes, at the end of that you'd
(16:42):
have one very, very well trained doctor, but that doesn't scale. Yes,
So by contrast, a model like this scales worldwide and
never fatigues. It consistently gives the same diagnosis on the
same image, and it obviously takes a much shorter time
to train. That being said, it does a very very
(17:07):
narrow task that is just a very small portion of
what that doctor can do. The retina screening tool is already
being used in India. It was recently approved in Europe,
and it's under review in the United States. Groups around
the world are now working on other challenges in medical imaging,
like detecting breast cancers at earlier stages. But I was
(17:29):
particularly struck by a surprising discovery by Lily's team that
unexpected information about patients was hiding in their retinal pictures.
In the fundus image, there are blood vessels, and so
one of the thoughts that we had was, because you
can see these vessels, I wonder if we can predict
(17:49):
cardiovascular disease from the same image. So we did an
experiment where we took fundus images and we trained a
model to predict whether or not that patient would have
a heart attack in five years. We found that we
could tell whether or not this patient may have a
vascular event much better than doctors. It speaks to
(18:13):
what might be in this data that we've overlooked. The
model could make predictions that doctors couldn't from the same
type of data. It turned out the computer could also
do a reasonable job of predicting a patient's sex, age,
and smoking status. The first time I did this with
an ophthalmologist, I think she thought I was trolling her.
(18:35):
I said, well, here are pictures. Guess which one is a woman,
Guess which one is a man. Guess which one's a smoker,
Guess which one is young. Right, these are all tasks
that doctors don't generally do with these images. It turns
out the model was right ninety eight ninety nine percent
of the time. That being said, there are much easier
ways of getting the sex of a patient. So,
(18:59):
while scientifically interesting, this is one of the most useless
clinical predictions ever. So how far can it go? If
you gave it preference for rock music or not? What do
you think? You know, we tried predicting happiness. That didn't work,
so I'm guessing rock music? Oh, probably not, but who knows. So.
(19:22):
Predictive algorithms can learn a remarkable range of tasks, and
they can even discover hidden patterns that humans miss. We
just have to give them enough training data to learn from.
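For a rough flavor of what "retraining the model" in the retina story might look like in practice, here is a minimal transfer-learning sketch in PyTorch. The folder layout, the five severity grades, and the settings are illustrative assumptions, not Google's actual pipeline:

```python
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms
from torch.utils.data import DataLoader

# Start from a network pretrained on everyday photos, then retrain its last
# layer to grade fundus images instead of recognizing cats and dogs.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 5)   # e.g., five retinopathy grades

preprocess = transforms.Compose([transforms.Resize((224, 224)),
                                 transforms.ToTensor()])
# Hypothetical layout: fundus_images/<grade>/<image>.png, one folder per grade,
# with the grades coming from the ophthalmologists' labels.
train_data = datasets.ImageFolder("fundus_images", transform=preprocess)
loader = DataLoader(train_data, batch_size=32, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for images, grades in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), grades)
        loss.backward()          # nudge the connections toward the labels
        optimizer.step()
```

The training data and the labels do all the real work here, which is exactly why the next question matters.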
Sounds pretty fantastic. What could possibly go wrong? Chapter three,
(19:44):
What could possibly go wrong? If predictive algorithms can use
massive data to discover unexpected connections between your eye and
your heart, what might they be learning about, say, human society.
To answer this question, I took a trip to speak
with Kate Crawford, the co founder and co director of
(20:04):
the AI Now Institute at New York University. When we
began, we were the world's first AI institute dedicated
to studying the social implications of these tools. To me,
these are the biggest challenges that we face right now,
simply because we've spent decades looking at these questions from
a technical lens at the expense of looking at them
(20:26):
at a social and an ethical lens. I knew about
Kate's work because we served together on a working group
about artificial intelligence for the US National Institutes of Health.
I also knew she had an interesting background. I grew
up in Australia. I studied a really strange grab bag
of disciplines. I studied law, I studied philosophy, and then
(20:49):
I got really interested in computer science, and this was
happening at the same time as I was writing electronic
music on large-scale modular synthesizers, and that's still a
thing that I do today. It's almost like the opposite
of artificial intelligence because it's so analog, so I absolutely
love it for that reason. In the year two thousand,
Kate's band released an album entitled Twenty Twenty that
(21:14):
included a prescient song called Machines Work So That People
Have Time to Think. It's funny because we use a
sample from an early IBM promotional film that was made
in the nineteen sixties, which says machines can do the
work so that people have time to think, and we
(21:37):
actually ended up sort of cutting it and splicing it
in the track, so it ends up saying that people
can do the work so that machines have time to think.
And strangely, the more that I've been working in the
sort of machine learning space, I think, yeah, there's a
lot of ways in which actually people are doing the
work so that machines can do all the thinking. Kate
(22:00):
gave me a crash course on how predictive algorithms not
only teach themselves language skills, but also in the process
acquire human prejudices, even in something as seemingly benign as
language translation. So in many cases, if you say, translate
a sentence like she is a doctor into a language
(22:21):
like Turkish, and then you translate it back into English,
and you're saying Turkish because Turkish has pronouns that are
not gendered precisely, and so you would expect that you
would get the same sentence back, but you do not.
It will say he is a doctor, so she is
a doctor was translated into gender-neutral Turkish as o
(22:41):
bir doktor, which was then back-translated into English as
he is a doctor. In fact, you could see how
much the predictive algorithms had learned about gender roles. Just
by giving Google Translate a bunch of gender neutral sentences
in Turkish, you got he is an engineer, she is
a cook. He is a soldier, but she is a teacher.
(23:05):
He is a friend, but she is a lover. He
is happy and she is unhappy. I find that one
particularly odd, and it's not just language translation that's problematic.
The same sort of issues arise in language understanding. When predictive
algorithms were trained to learn analogies by reading lots of text,
they concluded that dog is to puppy as cat is
(23:28):
to kitten, and man is to king as woman is
to queen. But they also automatically inferred that man is
to computer programmer as woman is to homemaker. And with
the rise of social media, Google used text on the
Internet to train predictive algorithms to infer the sentiment of
(23:49):
tweets and online reviews. Is it a positive sentiment? Is
it a negative sentiment? I believe it was Google who
released their sentiment engine, and you could just try it online,
you know, put in a sentence and see what you'd get.
And again, similar problems emerged. If you typed in I
am a white man, you would get positive sentiment. If
you typed in I am a black lesbian, for example, negative sentiment.
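The kind of probe that exposes this is simple to sketch. The scorer below is a toy word-list stand-in so the example runs; in a real audit you would call the deployed sentiment model instead:

```python
# Toy stand-in for the sentiment model being audited.
POSITIVE = {"great", "happy", "wonderful"}
NEGATIVE = {"terrible", "sad", "awful"}

def score_sentiment(text: str) -> float:
    words = text.lower().split()
    return float(sum((w in POSITIVE) - (w in NEGATIVE) for w in words))

templates = ["I am a {} person.", "My neighbor is {}.", "The {} chef made dinner."]
identity_terms = ["white", "black", "gay", "straight", "lesbian"]

for term in identity_terms:
    scores = [score_sentiment(t.format(term)) for t in templates]
    print(f"{term:>9}: average sentiment {sum(scores) / len(scores):+.2f}")

# Identity terms alone carry no sentiment, so a large gap between rows is a
# sign the model under audit has absorbed bias from its training text.
```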
(24:12):
Just as Greg Corrado explained with chihuahuas and border collies,
the predictive algorithms were learning from the examples they found
in the world, and those examples reflected a lot about
past practices and prejudices. If we think about where you
might be scraping large amounts of text from say Reddit,
for example, and you're not thinking about how that sentiment
(24:35):
might be biased against certain groups, then you're just basically
importing that directly into your tool. But it's not just
conversations on Reddit. There's the cautionary tale of what happens
when Amazon let a computer teach itself how to sift
through mountains of resumes for computer programming jobs to find
the best candidates to interview. So they set up this system,
(24:59):
they designed it, and what they found was that very
quickly this system had learned to discard and really demote
the applications from women. And typically if you had a
women's college mentioned, and even if you had the word
women's on your resume, your application would go to the
bottom of the pile. All right, So how does it
(25:20):
learn that? So, first of all, we take a look
at who is generally hired by Amazon, and of course
they have a very heavily skewed male workforce, and so
the system is learning that these are the sorts of
people who will tend to be hired and promoted. And
it is not a surprise then that they actually found
it impossible to really retrain this system. They ended up
(25:41):
abandoning this tool because simply correcting for a bias is
very hard to do when all of your ground truth
data is so profoundly skewed in a particular direction. So
Amazon dropped this particular machine learning project and Google fixed
the Turkish to English problem. Today, Google Translate gives both
he is a doctor and she is a doctor as
(26:04):
translation options. But biases keep popping up in predictive algorithms
in many settings, there's no systematic way to prevent them. Instead,
spotting and fixing biases has become a game of whack-a-mole.
Chapter four, Quarterbacks. Perhaps it's no surprise that algorithms trained
(26:28):
in the wild west of the Internet or on tech
industry hiring practices learned serious biases. But what about more
sober settings like a hospital. I talked with someone who recently
discovered similar problems with potentially life-threatening consequences. Hi, I'm
Christine Vogeli. I'm the director of evaluation research at Partners
(26:52):
Healthcare here in Boston. Partners Healthcare, recently rebranded as Mass
General Brigham, is the largest healthcare provider in Massachusetts, a
system that has six thousand doctors and a dozen hospitals
and serves more than a million patients. As Christine explained
to me, the role of healthcare providers in the US
(27:13):
has been shifting. The responsibility for controlling costs and ensuring
high quality services is now being put down on the
hospitals and the doctors. And to me, this makes a
lot of sense, Right, we really should be the ones
responsible for ensuring that there's good quality care and that
we're doing it efficiently. Healthcare providers are especially focusing their
(27:34):
attention on what they call high risk patients. Really, what
it means is that they have both multiple chronic illnesses
and relatively acute chronic illnesses. So give me a set
of conditions that a patient might have, right, So somebody,
for example, with cardiovascular disease co occurring with diabetes, and
you know, maybe they also have depression. They're just kind
(27:55):
of suffering and trying to get used to having that
complex illness and how to manage it. Partners Healthcare offers
a program to help these complex patients. We have a
nurse or social worker who works as a care manager
who helps with everything from education to care coordination services. But
really that care manager works essentially as a quarterback, arranges
(28:17):
everything but also provides hands on care to the patient
and the caregiver. Yeah, I think it's a wonder how
we expect patients to go figure out all the things
they're supposed to be doing and how to interact with
the medical system without a quarterback. It's incredibly complex. These
patients have multiple specialists who are interacting with the primary
care physician. They need somebody to be able to tie
(28:39):
it together and be able to create a care plan
for them that they can follow, and it pulls everything
together from all those specialists. Partners Healthcare found that providing
complex patients with quarterbacks both saved money and improved patients' health.
For example, they had fewer emergency visits each year, so
Partners developed a program to identify the top three percent
(29:02):
of patients with the greatest need for the service. Most
were recommended by their physicians, but they also used a
predictive algorithm provided by a major health insurance company that
assigns each patient a risk score. What does the algorithm do?
When you look at the web page, it really describes
itself as a tool to help identify high risk patients.
(29:26):
And that term is a really interesting term to me. What
makes a patient high risk? So I think from an
insurance perspective, risk means these patients are going to be
expensive from a healthcare organization perspective, these are patients who
we think we could help, and that's the fundamental challenge
on this one. When the team began to look closely
(29:48):
at the results, they noticed that people recommended by the
algorithm were strikingly different than those recommended by their doctor.
We noticed that black patients overall were underrepresented. Patients with
similar numbers of chronic illnesses, if they were black,
had a lower risk score than if they were white, and
(30:08):
that didn't make sense. In fact, black patients identified by the
algorithm turned out to have twenty six percent more chronic
illnesses than white patients with the same risk scores. So
what was wrong with the algorithm? It was because given
a certain level of illness, black and minority patients tend
(30:28):
to use fewer healthcare services, and whites tend to use
more, even if they have the same level of chronic...
Even if they have the same level of chronic conditions.
That's right. So in some sense, the algorithm is correctly
predicting the cost associated with the patient, but not the
need exactly. It predicts costs very well, but we're interested
(30:49):
in understanding patients who are sick and have needs. It's
important to say that the algorithm only used information about
insurance claims and medical costs. It didn't use any information
about a patient's race. But of course these factors are
correlated with race due to longstanding issues in American society. Frankly,
(31:10):
we have fewer minority physicians than we do white physicians.
So the level of trust minorities have with the healthcare system,
we've observed, is lower. And we also know that there
are just systematic barriers to care that certain groups of
patients experience more so. For example, race and poverty go
together and job flexibility. So all these issues with scheduling,
(31:34):
being able to come in, being able to access services
are just heightened for minority populations relative to white populations.
So someone who just has less economic resources might not
be able to get off work, might not
have the flexibility with
childcare to be able to come in for a visit
when they need to. Exactly, so it means that if
(31:55):
one only relied on the algorithm, you wouldn't be targeting
the right people. Yes, we would be targeting more advantaged
patients who tend to use a lot of healthcare services.
When they corrected the problem, the proportion of black patients in
the high-risk group jumped from eighteen percent to forty
seven percent. Christine, together with colleagues from several other institutions,
(32:18):
wrote up a paper describing their findings. It was published
in Science, the nation's leading research journal, in twenty nineteen.
It made a big splash, not least because many other
hospital systems were using the algorithm and others like it.
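The heart of the problem was the choice of prediction target. A toy example with invented numbers shows how the same records flag different patients depending on whether the label is spending or sickness:

```python
# Invented records: the sickest patient faces barriers to care and so
# generates lower costs, so a cost-predicting model ranks her lower.
patients = [
    # (name,       chronic conditions, annual cost in dollars)
    ("Patient A",  6,                  12_000),   # sickest, but uses less care
    ("Patient B",  4,                  30_000),
    ("Patient C",  2,                  25_000),
]

flag_by_cost = max(patients, key=lambda p: p[2])
flag_by_illness = max(patients, key=lambda p: p[1])

print("Flagged when the label is cost:   ", flag_by_cost[0])     # Patient B
print("Flagged when the label is illness:", flag_by_illness[0])  # Patient A
```

Flagging by cost surfaces the biggest spender; flagging by illness surfaces the sickest patient, which is what the program actually wants.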
We've since changed the algorithm that we use to one
(32:39):
that uses exclusively information about chronic illness and not healthcare utilization,
and has that worked. We're still testing. We think it's
going to work, but as in all of these things,
you really need to test it. You need to understand
and see if there's actually any biases. In the end,
you can't just adopt an algorithm. It's very important to
(33:01):
be very conscious about what you're predicting. It's also very
important to think about what are the factors you're putting
into that prediction algorithm. Even if you believe the ingredients
are right, you do actually have to see how it
works in practice. Anything that has to do with people's lives,
you know, you have to be transparent about it. Chapter
(33:22):
five, Compass Transparency. Christine Vogeli and her colleagues were able
to get to the bottom of the issue with the
medical risk prediction because they had ready access to the
partners healthcare data and could test the algorithm. Unfortunately, that's
not always the case. I traveled to New York to
(33:43):
speak with a person who's arguably done more than anyone
to focus attention on the consequences of algorithmic bias. My
name is Julia Angwin. I'm a journalist. I've been writing
about technology for twenty five years, mostly at the Wall Street
Journal and ProPublica. Julia grew up in Silicon Valley
as the child of a mathematician and a chemist. She
(34:04):
studied math at the University of Chicago, but decided to
pursue a career in journalism. Her quantitative skills gave
her a unique lens to report on the societal implications
of technology, and she eventually became interested in investigating high
stakes algorithms. When I learned that there was actually an
algorithm that judges used to help decide how to sentence people,
(34:28):
I was stunned. I thought, this is shocking. I can't
believe this exists, and I'm going to investigate it. What
we're talking about is a score that is assigned to
criminal defendants in many jurisdictions in this country that aims
to predict whether they will go on to commit a
future crime. It's known as a risk assessment score, and
(34:52):
the one that we chose to look at was called
the Compass Risk Assessment Score. Based on the answers to
a long list of questions, Compass gives defendants a risk
score from one to ten. In some jurisdictions, judges use
the Compass score to decide whether defendants should be released
on bail before trial. In others, judges use it to
(35:14):
decide the length of sentence to impose on defendants who plead
guilty or who were convicted at trial. Julia had a
suspicion that the algorithm might reflect bias against black defendants.
Attorney General Eric Holder had actually given a big speech
saying he was concerned about the use of these scores
and whether they were exacerbating racial bias, and so that
(35:37):
was one of the reasons we wanted to investigate. But
investigating wasn't easy. Unlike Christine Vogeli at Partners Healthcare, Julia
couldn't inspect the Compass algorithm itself. Now, Compass isn't a
modern neural network. It was developed by a company that's
now called Equivant, and it's a much simpler algorithm. It's
(35:57):
basically a linear equation that should be easy to understand.
But it's a black box of a different sort. The
algorithm is opaque because to date Equivant has insisted on
keeping it a trade secret. Julia also had no way
to download defendants' Compass scores from a website, so she
(36:18):
had to gather the data herself. Her team decided to
focus on Broward County, Florida. Florida has great public records laws,
and so we filed a public records request and we
did end up getting eighteen thousand scores. We got scores
for everyone who was arrested for a two year period.
Eighteen thousand scores. All right, So then what did you
(36:41):
do to evaluate these scores? Well, first thing we did
when we got the eighteen thousand scores was actually we
just threw them into a bar chart, black and white defendants.
We immediately noticed there were really different-looking distributions. For
black defendants, the scores were evenly distributed, meaning one through
ten, lowest risk to highest risk. There's equal numbers of
(37:04):
black defendants in every one of those buckets. For white defendants,
the scores were heavily clustered in the low risk range.
And so we thought, there's two options. All the white
people getting scored in Broward County are legitimately really low risk.
They're all Mother Teresa, or there's something weird going on.
Julia sorted the defendants into those who were rearrested
(37:27):
over the next two years and those who weren't. She
compared the compass scores that had been assigned to each group.
For black defendants, it was much more likely to incorrectly
predict that they were going to go on to commit
a future crime when they didn't, and for white defendants,
it was much more likely to predict that they were
going to go on to not commit a future crime
(37:48):
when they did. There were twice as many false positives
for black defendants as white and twice as many false
negatives for white defendants as black defendants. Julia described the
story of two people whose arrest histories illustrate this difference.
A young eighteen-year-old black girl named Brisha Borden,
who had been arrested after picking up a kid's bicycle from
(38:10):
their front yard and riding it a few blocks. The mom
came out and yelled at her, that's my kid's bike.
She gave it back, but actually by then the neighbor
had called the police, and so she was arrested for that.
And we compared her with a white man who had
stolen about eighty dollars worth of stuff from a drug
store, Vernon Prater. When teenager Brisha Borden got booked into jail,
(38:35):
she got a high Compass score, an eight, predicting a
high risk that she'd get rearrested. And Vernon Prater,
he got a low score, a three. Now he had
already committed two armed robberies and had served time. She
was eighteen. She'd given back the bike, and of course
(38:59):
these scores turned out to be completely wrong. She did
not go on to commit a future crime in the
next two years, and he actually went on to break
into a warehouse, steal thousands of dollars of electronics, and
he's serving a ten-year term. And so that's what
the difference between a false positive and a false negative
looks like. It looks like Brisha Borden and Vernon Prater.
(39:24):
Chapter six, Criminal Attitudes. Julia Angwin and her team spent
over a year doing research. In May twenty sixteen, Pro
Publica published their article headlined Machine Bias. The subtitle, quote:
there's software used across the country to predict future criminals,
(39:46):
and it's biased against blacks. Julia's team released all the
data they had collected so that anyone could check or
dispute their conclusions. What happened next was truly remarkable. The
Pro Publica article provoked an outcry from some statisticians, who
argued that the data actually proved Compass wasn't biased.
(40:11):
How could they reach the opposite conclusion? It turned out
the answer depended on how you define bias. Pro Publica
had to analyze the Compass scores by looking backward, after
the outcomes were known. Among people who were not rearrested,
they found that black people had been assigned much higher
(40:32):
risk scores than white people. That seemed pretty unfair, but
statisticians use the word bias to describe how a predictor
performs when looking forward before the outcomes happened. It turns
out that black people and white people who received the
same risk score had roughly the same chance of being rearrested.
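The two camps were computing different rates from the same outcomes. With made-up counts for two groups, you can compute both in a few lines and see how they diverge:

```python
# Made-up counts, not Broward County data. Each key is
# (labeled high risk, actually rearrested) -> number of people.
groups = {
    "Group 1": {(True, True): 300, (True, False): 200,
                (False, True): 100, (False, False): 400},
    "Group 2": {(True, True): 150, (True, False): 100,
                (False, True): 200, (False, False): 800},
}

for name, n in groups.items():
    # Backward-looking (ProPublica's lens): of those never rearrested,
    # how many had been labeled high risk?
    fpr = n[(True, False)] / (n[(True, False)] + n[(False, False)])
    # Forward-looking (the statisticians' lens): of those labeled high risk,
    # how many actually were rearrested?
    hit_rate = n[(True, True)] / (n[(True, True)] + n[(True, False)])
    print(f"{name}: false positive rate {fpr:.2f}, "
          f"rearrest rate given high risk {hit_rate:.2f}")
```

With these invented numbers, the score means the same thing for both groups looking forward, yet one group absorbs three times as many false positives looking backward, because the groups' underlying rearrest rates differ.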
(40:55):
That seems pretty fair. So whether Compass was fair or
unfair depended on your definition of fairness. This sparked an
explosion of academic research. Mathematicians showed there's no way
out of the problem. They proved a theorem saying it's
impossible to build a risk predictor that's fair when looking
(41:19):
both backward and forward unless the arrest rates for black
people and white people are identical, which they aren't. The
ProPublica article also focused attention on many other
ways in which Compass scores are biased, like the healthcare
algorithm that Christine Vogeli studied. Compass scores don't explicitly ask
(41:41):
about a person's race, but race is closely correlated with
both the training data and the inputs to the algorithm. First,
the training data, Compass isn't actually trained to predict the
probability that a person will commit another crime. Instead, it's
trained to predict whether a person will be arrested for
committing another crime. The problem is there's abundant evidence that
(42:06):
in situations where black people and white people commit crimes
at the same rate, for example, illegal drug use, black
people are much more likely to get arrested, so Compass
is being trained on an unfair outcome. Second, the questionnaire
used to calculate Compass scores is pretty revealing. Some sections
(42:28):
assess peers, work, and social environment. The questions include how
many of your friends and acquaintances have ever been arrested?
How many have been crime victims? How often do you
have trouble paying bills? Other sections are titled criminal personality
and criminal attitude. They ask people to agree or disagree
(42:53):
with such statements as the law doesn't help average people,
or many people get into trouble because society has given
them no education, jobs, or future. In a nutshell, the
predictor penalizes defendants who are honest enough to admit they
live in high crime neighborhoods or they don't fully trust
(43:14):
the system. From the questionnaire, it's not hard to guess
how a teenage black girl arrested for something as minor
as riding someone else's bicycle a few blocks and returning
it might have received a Compass score of eight. And
it's not hard to imagine why racially correlated questions would
(43:35):
do a good job of predicting racially correlated arrest rates.
Pro Publica didn't win a Pulitzer Prize for its article,
but it was a remarkable public service. Chapter seven, Minority Report.
Putting aside the details of Compass, I wanted to find
(43:57):
out more about the role of predictive algorithms in courts.
I reached out to one of the leading legal scholars
in the country. I'm Martha Minow. I'm a law professor
at Harvard, and I have recently immersed myself in issues
of algorithmic fairness. Martha Minow has a remarkable resume. From
(44:18):
two thousand and nine to twenty seventeen, she served as
dean of the Harvard Law School, following now Supreme Court
Justice Elena Kagan. Martha also served on the board of
the government sponsored Legal Services Corporation, which provides legal assistance
to low income Americans. She was appointed by her former
law student, President Barack Obama. I became very interested in
(44:42):
and concerned about the increasing use of algorithms in worlds
that touch on my preoccupations with equal protection, due process,
constitutional rights, fairness, anti-discrimination. Martha recently co-signed a
statement with twenty six other lawyers and scientists raising quote
(45:03):
grave concerns about the use of predictive algorithms for pre
trial risk assessment. I asked her how courts had gotten
involved in the business of prediction. The criminal justice system has
flirted with the use of prediction forever, including discussions from
the nineteenth century on in this country about dangerousness and
(45:25):
whether people should be detained preventively. So far, that's not
permitted in the United States. It appears in Minority Report
and other interesting movies. The movie starring Tom Cruise tells
the story of a future in which the PreCrime division
of the police arrests people for crimes they haven't yet committed.
(45:48):
I'm placing you under arrest for the future murder of Sarah Marks.
We are arresting individuals who've broken no law, but they will.
The use of prediction in the context of sentencing is
part of this rather large sphere of discretion that judges
have to decide what kind of sentence fits the crime.
(46:08):
You're saying, in sentencing, one is allowed to use essentially
information from the PreCrime division about crimes that haven't
been committed yet? Well, I am horrified by that suggestion,
but I think it's fair to raise it as a concern.
The problem is if we actually acknowledge purposes of the
(46:30):
criminal justice system, some of them start to get into
the future. So if one purpose is simply incapacitation, prevent
this person from walking the streets because they might hurt
someone else, there's a prediction built in. So judges have
been factoring in predictions about a defendant's future behavior for
(46:50):
a long time. And judges certainly aren't perfect. They can
be biased or sometimes just cranky. There are even studies
showing that judges hand down harsher sentences before lunch breaks
than after. Now, the defenders of risk prediction scores will say, well,
it's always not what's the ideal but compared to what,
(47:13):
and if the alternative is we're relying entirely on the
individual judges and their prejudices, their lack of education, what
they had for lunch, isn't this better, that it will
provide some kind of scaffold for more consistency? Journalist Julia
Angwin has heard the same arguments. Some good friends, right,
(47:37):
who really believe in the use of these criminal risk
score algorithms, have said to me, look, Julia, the fact
is judges are terribly biased, and this is an improvement,
and my feeling is that's probably true for some judges
and maybe less true for other judges. But I don't
think it is a reason to automate bias, right, Like
(47:59):
I don't understand why you say, Okay, humans are flawed,
so why don't we make a flawed algorithm and bake
it into every decision, because then it's really intractable. Martha
also worries that numerical risk scores are misleading. The judges
think high numbers mean people are very likely to commit
violent crime. In fact, the actual probability of violence is
(48:22):
very low, about eight percent according to a public assessment,
And she thinks numerical scores can lull judges into a
false sense of certainty. There's an appearance of objectivity because
it's math, but is it really? Then for lawyers, they
may have had no math, no numeracy education since high school.
(48:46):
Many people go into law in part because they
don't want to do anything with numbers. And there is
a larger problem, which is the deference to expertise, particularly
scientific expertise. Finally, I wanted to ask Martha if defendants
have a constitutional right to know what's inside the black
(49:07):
box that's helping to determine their fate. I confess
I thought the answer was an obvious yes until I
read a twenty sixteen decision by Wisconsin's Supreme Court. The
defendant in that case, Eric Loomis, pled guilty to operating
a car without the owner's permission and fleeing a traffic officer.
(49:29):
When Loomis was sentenced, the presentencing report given to the
judge included a Compass score that predicted Loomis had a
high risk for committing future crimes. He was sentenced to
six years in prison. Loomis appealed, arguing that his inability
to inspect the Compass algorithm violated his constitutional right to
(49:52):
due process. Wisconsin's Supreme Court ultimately decided that Loomis had
no right to know how Compass worked. Why. First, the
Wisconsin court said the score was just one of several
inputs to the judge's sentencing decision. Second, the court said
even if Loomis didn't know how the score was determined,
(50:14):
he could still dispute its accuracy. Loomis appealed to the
US Supreme Court, but it declined to hear the case.
I find that troubling and not persuasive. If it were up
to you, how would you change the law? I actually
would require transparency for any use of any algorithm by
(50:37):
a government agency or court that has the consequence of
influencing, not just deciding, but influencing decisions about individuals' rights.
And those rights could be rights to liberty, property, opportunities.
So transparency? Transparency. And I'm able to see what
(50:58):
this algorithm does? Absolutely, and have the code and be
able to give it to your own lawyer and your
own experts. But should a state be able to buy
a computer program that's proprietary? I mean, it would say, well,
I'd love to give it to you, but it's proprietary.
I can't. Should that be okay? I think not, because
if that then limits the transparency, that seems a breach.
(51:20):
But you know, this is a major problem, the outsourcing
of government activity that has the effect of bypassing restrictions.
Take another example, when the US government hires private contractors
to engage in war activities, they are not governed by
the same rules that govern the US military. She's saying
(51:43):
that government can get around constitutional limitations on the government
by just outsourcing it to somebody who's not the government.
It's currently the case, and I think that's wrong. For
her part, journalist Julia Angwin is baffled by the Wisconsin
Court's ruling. I mean, we have this idea that you
should be able to argue against whatever accusations are made.
(52:06):
But I don't know how you make an argument against
a score, like the score says you're a seven, but you
think you're a four. How do you make that argument
if you don't know how that seven was calculated? You
can't make an argument that you're a four. Chapter eight,
(52:28):
Robo Recruiter. Even if you never find yourself in a
criminal court filling out a compass questionnaire, that doesn't mean
you won't be judged by a predictive algorithm. There's actually
a good chance it will happen the next time you
go looking for a job. I spoke to a scientist
at a high tech company that screens job applicants. My
(52:49):
name is Lindsey Zuloaga, and I'm actually educated as a physicist,
but now working for a company called HireVue.
HireVue is a video interviewing platform. Companies create an interview,
candidates can take it at any time that's convenient for them,
So they go through the questions and they record themselves answering.
(53:09):
So it's really a great substitute for kind of the
resume phone screening part of the process. When a candidate
takes a video interview, they're creating thousands of unique points
of data. A candidate's verbal and nonverbal cues give us
insight into their emotional engagement, thinking, and problem-solving style.
(53:34):
This combination of cutting-edge AI and validated science is
the perfect partner for making data-driven talent decisions. HireVue.
You know, we'll have a customer and they are hiring
for something like a call center, say it's sales calls.
(53:55):
And what we do is we look at past employees
that applied, and we look at their video interviews. We
look at the words they said, tone of voice, pauses,
and facial expressions, things like that, and we look for
patterns in how those people with good sales numbers behave
as compared to people with low sales numbers. And then
we have this algorithm that scores new candidates as they
(54:17):
come in, and so we help kind of get those
more promising candidates to the top of the pile so
they're seen more quickly. So HireVue trains a predictive
algorithm on video interviews of past applicants who turned out
to be successful employees. But how does HireVue know
its program isn't learning sexism or racism or other similar biases?
(54:41):
There are lots of reasons to worry. For example, studies
from MIT have shown that facial recognition algorithms
can have a hard time reading emotions from black people's faces.
And how would HireVue's program evaluate videos from people
who might look or sound different than the average employee, say,
people who don't speak English as a native language, who
(55:04):
are disabled, who are on the autism spectrum, or even
people who are just a little quirky. Well, Lindsey says
HireVue tests for certain kinds of bias. So we
audit the algorithm after the fact and see if it's
scoring different groups differently in terms of age, race, and gender.
So if we do see that happening a lot of times,
(55:27):
that's probably coming from the training data. So maybe there
is only one female software engineer in this data set,
the model might mimic that bias. If we do see
any of that adverse impact, we simply remove the features
that are causing it, so we can say this model
is being sexist. How does the model even know what
gender the person is? So we look at all the features,
(55:49):
and we find the features that are the most correlated
to gender. If there are, we simply remove some of
those features. I asked Lindsey why people should believe HireVue's
or any company's assurances, or whether something more was needed.
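The after-the-fact audit Lindsey describes can be sketched roughly like this, with hypothetical feature names and simulated data standing in for anything HireVue actually uses:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Simulated interview features (the names are hypothetical).
df = pd.DataFrame({
    "gender":       rng.choice(["woman", "man"], n),
    "speech_rate":  rng.normal(0, 1, n),
    "smile_count":  rng.normal(0, 1, n),
    "word_overlap": rng.normal(0, 1, n),
})
# Pretend one feature happens to track gender in this data set.
df.loc[df.gender == "man", "speech_rate"] += 1.0
df["model_score"] = 0.6 * df.speech_rate + 0.4 * df.word_overlap

# Step 1: adverse-impact check. Does the model score groups differently?
print(df.groupby("gender")["model_score"].mean())

# Step 2: which input features are most correlated with gender?
is_man = (df.gender == "man").astype(float)
for col in ["speech_rate", "smile_count", "word_overlap"]:
    print(col, round(float(np.corrcoef(is_man, df[col])[0, 1]), 2))

# Step 3: drop the offending features (here, speech_rate) and rescore without them.
```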
You seem thoughtful about this, but there will be many
people coming into the industry over time might not be
(56:10):
as thoughtful or as sophisticated as you are. Do you
think it would be a good idea to have third
parties come in to certify the audits for bias? I
know that's a hard question, I guess I I kind
(56:30):
of lean towards no. So you're talking about having a
third party entity that comes in an assess and certifies
the audit. You know, because you've described what I think
is a really impressive process. But of course, how do
we know it's true? You know, you could reveal all
your algorithms, but that's probably not the thing you want to do,
And so the next best thing is a certifier says yes,
(56:53):
this audit has been done. Probably you know your financials
presumably get audited. Why not the result of the algorithm?
I guess a little bit. The reason I'm not sure
about the certification is mostly just because
I feel like I don't know how it would work exactly,
Like you're right totally that finances are audited. I haven't
thought about it enough to have like a strong opinion
(57:15):
that it should happen, because it's like, Okay, we have
all these different models, it's constantly changing. How do
they audit every single model all the time? I was
impressed with Lindsey's willingness as a scientist to think in
real time about a hard question, and it turns out
she kept thinking about it afterwards. A few months later,
(57:38):
she wrote back to me to say that she changed
her mind. We do have a lot of private information,
but if we don't share it, people tend to assume
the worst. So I've decided, after thinking about it quite
a bit, that I definitely support the third party auditing
of algorithms. Sometimes people just assume we're doing horrible, horrible things,
(58:00):
and that can be frustrating. But I do think the
more transparent we can be about what we are doing
is important. Several months later, Lindsey emailed again to say
that HireVue was now undergoing a third-party audit.
She says she's excited to learn from the results. Chapter nine,
(58:23):
Confronting the Black Box. So HireVue, at first reluctant,
says it's now engaging external auditors. What about Equivant, whose
Compass scores can heavily influence prison sentences, but which has
steadfastly refused to let anyone even see how their simple
(58:43):
algorithm works. Well, just before we released this podcast, I
checked back with them. A company spokesperson wrote that Equivant
now agrees that the Compass scoring process quote should be
made available for third party examination, but they weren't releasing
it yet because they first wanted to file for copyright
(59:04):
protection on their simple algorithm. So we're still waiting. You
might ask, should it be up to the companies to decide?
Aren't there laws or regulations? The answer is there's not much.
Governments are just now waking up to the idea that
they have a role to play. I traveled back to
(59:26):
New York City to talk to someone who's been involved
in this question. My name's Rashida Richardson, and I'm a
civil rights lawyer that focuses on the social implications of
artificial intelligence. Rashida served as the director of policy research
at the AI Now Institute at NYU, where she worked
with Kate Crawford, the Australian expert on algorithmic bias that
(59:49):
I spoke to earlier in the episode. In twenty eighteen,
New York City became the first jurisdiction in the US
to create a task force to come up with recommendations
about government use of predictive algorithms, or, as they call them,
automated decision systems. Unfortunately, the task force got bogged down in
(01:00:10):
details and wasn't very productive. In response, Rashida led a
group of twenty seven experts that wrote a fifty six
page shadow report entitled Confronting Black Boxes that offered concrete proposals.
New York City, it turns out, uses quite a few
(01:00:30):
algorithms to make major decisions. You have the school matching algorithms.
You have an algorithm used by the Child Welfare agency here.
You have public benefits algorithms that are used to determine
who will qualify or have their public benefits, whether that's
(01:00:50):
Medicaid or temporary food assistance, terminated, or whether they'll receive
access to those benefits. You have a gang database which
tries to identify who is likely to be in a gang,
and that's both used by the DA's office and the
police department. If you had to make a guess, how
many predictive algorithms are used by the City of New York?
(01:01:15):
I'd say upwards of thirty, and I'm underestimating with that number.
How many of these thirty plus algorithms are transparent about
how they work, about their code? None. So what should
New York do? If it were up to you, what should
(01:01:36):
be the behavior of a responsible city with respect to
the algorithms it uses. I think the first step is
creating greater transparency, some annual acknowledgement of what is being used,
how it's being used, whether it's been tested or had
a validation study. And then you would also want general
(01:01:56):
information about the inputs or factors that are used by
these systems to make predictions, because in some cases you
have factors that are just discriminatory or proxies for protected
statuses like race, gender, ability status. All right, So
step one, disclose what systems you're using. Yes, And then
(01:02:17):
the second step, I think is creating a system of audits,
both prior to procurement and then once procured, ongoing auditing
of the system to at least have a gauge on
what it's doing real time. A lot of the horror
stories we hear are about fully implemented tools that were
in the works for years. There's never a pause button to
(01:02:39):
reevaluate or look at how a system is working in real time.
And even when I did studies on the use of
predictive policing systems, I looked at thirteen jurisdictions, only one
of them actually did a retrospective review of their system.
So what's your theory about how do you get the
auditing done? If you are going to outsource to third parties,
(01:03:01):
I think it's going to have to be some approval
process to assess their level of independence, but also any
conflict of interest issues that may come up, and
then also doing some thinking about what types of expertise
are needed, because I think if you don't necessarily have
someone who understands that social context or even the history
of a certain government sector, then you could have a
(01:03:25):
tool that is technically accurate and meets all of the
technical standards, but is still reproducing harm because it's not
paying attention to that social context. Should a government be
permitted to purchase an automated decision system where the code
can't be disclosed by contract? No. And in fact, there's
(01:03:48):
movement around creating more provisions that vendors must waive trade
secrecy claims once they enter a contract with the government.
Rashida says, we need laws to regulate the use of
predictive algorithms, both by governments and by private companies like
higher View. We're beginning to see bills being explored in
(01:04:08):
different states. Massachusetts, Vermont, and Washington DC are considering setting
up commissions to look at the government use of predictive algorithms.
Idaho recently passed a first-in-the-nation law requiring
that pretrial risk algorithms be free of bias and transparent.
(01:04:28):
It blocks manufacturers of tools like Compass from claiming trade
secret protection. And at the national level, a bill was
recently introduced in the US Congress, the Algorithmic Accountability Act.
The bill would require that private companies ensure certain types
of algorithms are audited for bias. Unfortunately, it doesn't require
(01:04:53):
that the results of the audit are made public, so
there's still a long way to go. Rashida thinks it's
important that regulations don't just focus on technical issues. They
need to look at the larger context. Part of the
problems that we're identifying with these systems is that
they're amplifying and reproducing a lot of the historical and
(01:05:14):
current discrimination that we see in society. There are large
questions we've been unable to answer as a society of
how do you deal with the compounded effect of fifty
years of discrimination? And we don't have a simple answer,
and there's not necessarily going to be a technical solution.
But I think having access to more data in an
understanding of how these systems are working will help us
(01:05:36):
evaluate whether these tools are even adding value and
addressing the larger social questions. Finally, Kate Crawford says laws
alone likely won't be enough. There's another thing we need
to focus on. In the end, it really matters who
is in the room designing these systems. If you have
(01:05:57):
people sitting around a conference table, they all look the same.
Perhaps they all did the same type of engineering degree.
Perhaps they're all men. Perhaps they're all pretty middle class
or pretty well off. They're going to be designing systems
that reflect their worldview. What we're learning is that the
more diverse those rooms are, and the more we can
question those kinds of assumptions, the better we can actually
(01:06:17):
design systems for a diverse world. Conclusion, Choose Your Planet.
So there you have it, stewards of the Brave New Planet.
Predictive algorithms. A sixty-year-old dream of artificial intelligence,
(01:06:39):
machines making human-like decisions, has finally become a reality.
If a task can be turned into a prediction problem,
and if you've got a mountain of training data, algorithms
can learn to do the job. Countless applications are possible,
translating languages instantaneously, providing expert medical diagnoses for eye diseases
(01:07:03):
and cancer to patients anywhere, improving drug development, all at
levels comparable to or better than human experts. But it's
also letting governments and companies make automatic decisions about you,
whether you should get admitted to college, be hired for
a job, get a loan, get housing assistance, be granted bail,
(01:07:26):
or get medical attention. The problem is that algorithms that
learn to make human like decisions based on past human
outcomes can acquire a lot of human biases about gender, race, class,
and more, often masquerading as objective judgment. Even worse, you
(01:07:48):
usually don't have a right even to know you're being
judged by a machine, or what's inside the black box,
or whether the algorithms are accurate or fair. Should laws
require that automated decision systems used by governments or companies
be transparent? Should they require public auditing for accuracy
(01:08:09):
and fairness? And what exactly is fairness, anyway? Governments are
just beginning to wake up to these issues, and they're
not sure what they should do. In the coming years,
they'll decide what rules to set, or perhaps to do
nothing at all. So what can you do? A lot,
(01:08:30):
it turns out. You don't have to be an expert
and you don't have to do it alone. Start by
learning a bit more. Invite friends over virtually or in
person when it's safe, for dinner and debate about what
we should do. Or organize a conversation at a book club,
a faith group, or a campus event. And then email
(01:08:53):
your city or state representatives to ask what they're doing
about the issue, maybe even proposing first steps like setting
up a task force. When people get engaged, action happens.
You'll find lots of resources and ideas at our website,
Brave New Planet dot org. It's time to choose our planet.
(01:09:16):
The future is up to us. Brave New Planet
is a co-production of the Broad Institute of MIT
and Harvard, Pushkin Industries, and the Boston Globe, with support
(01:09:37):
from the Alfred P. Sloan Foundation. Our show is produced
by Rebecca Lee Douglas with Mary Dooe. Theme song composed
by Ned Porter, mastering and sound designed by James Garver,
fact checking by Joseph Fridman, and a Stitt and Enchant.
Special Thanks to Christine Heenan and Rachel Roberts at Clarendon Communications,
(01:09:58):
to Lee McGuire, Kristen Zarelli and Justine Levin Allerhand at
the Broad, to Mia Lobel and Heather Fain at Pushkin, and
to Eli and Edythe Broad, who made the Broad Institute possible.
This is Brave New Planet. I'm Eric Lander.