Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:01):
Hey, welcome to Science Stuff, a production of iHeartRadio. I'm
Jorge Cham, and for our season finale today, we're asking
one of the biggest questions in science today, is AI
going to kill us all? I know it's a little dramatic,
but the problem of AI alignment is a real one.
(00:22):
How do we make sure AI systems have humanity's best
interests at heart? How do we teach them our values
and morals? And can anyone guarantee that they're going to
follow them? We're gonna answer these questions by talking to
two AI safety experts who are on the cutting edge
of trying to figure out this problem. And don't worry.
According to them, we're not totally doomed yet. Okay, maybe
(00:47):
just a little. So get ready to reprogram your thinking
about chatbots and computer brains as we tackle the question
is AI going to kill us all? Hey, everyone. As
I said, this is the season finale. Stay subscribed to
(01:09):
this feed for any updates on future episodes. And hey,
if you like science, I have a couple of new
science books coming out in the near future, as well
as a cool science animation project, so be sure to
follow me on social media or online at Phdcomics dot com.
All right, so today we're tackling the problem of AI
alignment, or basically, are AI systems gonna kill us all? And
(01:32):
I have a treat for you. For the first time ever,
we have on the show Casey Pegram, our supervising producer
and sound engineer. Hey, Casey, welcome to the show.
Speaker 2 (01:42):
Hey, Jorge, glad to be here.
Speaker 1 (01:44):
Now this is the first time people actually hear your voice,
not just your amazing work polishing the episode.
Speaker 2 (01:49):
Yeah, it's always a weird thing to kind of go
inside the thing you've been working on from the outside.
So I'll be listening to myself back and it's a
special kind of torture to have to like work on
your own thing, yes, like edit yourself or just, you know, honestly,
listening to the recorded sound of your voice is always
a little jarring if you're not used to it.
Speaker 1 (02:07):
Yeah, Well, if you want, you can give yourself like
a Morgan Freeman voice using AI.
Speaker 2 (02:12):
Right, it's all possible these days. Absolutely. Yeah, I could
just build my own Morgan Freeman model and have a
field day.
Speaker 1 (02:17):
There you go. Well, the idea for this episode came
from you. You said I had the idea to talk
about AI and AI alignment and whether AI is going
to kill us all. What made you think about this question?
Speaker 2 (02:29):
Well, I suppose it's just been on my mind a
lot because I've been following along with all the developments
happening in AI, and there was a span of a
few weeks where suddenly you started hearing a lot about
AI agents, particularly one called OpenClaw, basically a sort
of autonomous AI agent that you can turn loose on
your computer, and you can give it as much leeway, freedom, passwords,
(02:52):
credit card numbers, bank accounts. If you just want to
absolutely put your life in the hands of a robot,
you can do it.
Speaker 1 (02:59):
What's the worst that can happen?
Speaker 2 (03:00):
Yeah, Well, people had their entire like email archive deleted,
even though they didn't ask for anything of the sort.
People have deployed it into production environments where you know,
a site is live on the Internet and they turn
the bot loose on it and it ends up deleting
their entire production database. And then when you ask it,
it's like, you're right, I wasn't supposed to do that.
I'm very sorry. I disobeyed every command you gave me.
Speaker 1 (03:22):
But whoopsie daisy. Yeah, I've seen a story about some
bot that texted the person's wife hundreds of times.
Speaker 2 (03:30):
Yes, I think somebody tried to automate, you know, exactly.
They tried to automate kind of like reaching out and
sending little nice things during the day, and as
it turned out, the bot went a little overboard and
texted the wife like hundreds of times, and the wife
is like, what is wrong with you? So, yeah, that's
hilarious. When people want to talk about AI alignment and
(03:51):
what that means. I think the paper clip problem is
a really good kind of metaphor. Even though it sounds
a little bit over the top, it kind of gets
to the core of the issue, which is, if you
ask an AI to maximize paper clip production, maybe the
way to maximize paper clip production is to eliminate human life,
you know, because that's unnecessary friction in the pursuit of
(04:12):
manufacturing as many paper clips as possible. So alignment is
sort of the kind of guardrails that you put into
place so that the AI understands it has limits that
it has to work within.
Speaker 1 (04:21):
It sounds like a pretty serious problem, especially as we
get more and more into these AI models. And they
start to seep into our lives, and you know, it's
sort of these are funny stories, but it seems like
we're heading into a potentially dangerous situation.
Speaker 2 (04:35):
Well, I often ask myself, I'm going to have these
moments of doubt where I'm like, is this all just
way over hyped? And yet there are other situations where
as we've seen recently, you can feed it thousands of
lines of code and it will find, you know, a
security exploit that has gone unseen for twenty years, right, right,
And so it's hard to know how scared we should
(04:56):
be or how seriously we should weigh the risk of this.
If it's ridiculous that we're this worried, or if it's, like,
actually very, very practical, and we should be thinking seriously
about these things.
Speaker 1 (05:05):
Yeah, these are all excellent questions. So I'm excited to
get into these conversations.
Speaker 3 (05:10):
All right, But.
Speaker 1 (05:10):
Before we move on, Casey, I just want to
say real quick, thank you for all the work you've
done for the show.
Speaker 2 (05:15):
Oh say, it's been such a pleasure to work on.
It wasn't like work at all, you know. I was
there as a fan of the show, just listening to
every episode. Awesome. Well, we're fans of yours as well, Casey.
All right, let's get to the question of is AI
going to kill us all? Let's find out.
Speaker 1 (05:28):
Okay. To answer all of these questions and concerns, I
reached out to two AI experts who specialize in the
problem of making sure AI is aligned with our values
and morals. The first expert is doctor Sam Bowman.
Doctor Bowman is a professor of data and computer science
at NYU, and he also works at Anthropic, one of
the major AI companies on the market today. The first
(05:51):
thing I wanted to ask him was what exactly does
it mean for AI to care about us? So here's
my conversation with doctor Sam. Well, thank you doctor Bowman
for joining us.
Speaker 4 (06:03):
Yeah, thanks so much for having me. Excited to be
here.
Speaker 1 (06:06):
And just to double-check, you are a real human being?
Speaker 3 (06:08):
Right?
Speaker 4 (06:09):
Yes, that is right?
Speaker 1 (06:11):
You never know these days. I'd be like, it's hard
to tell what's real anymore.
Speaker 4 (06:16):
We try to make our AIs always admit that they're
AIs when asked, but it's not perfect, as
we'll get to, so I won't make any real promises.
Speaker 1 (06:24):
Yes, let's talk about that. So we're tackling the general
question of should we be worried about AI? What is
AI going to do to us or for us or
with us in the future. And so there's the key
issue of something called AI alignment. So what is that?
For those of us that don't.
Speaker 4 (06:41):
Know, it's a pretty broad sort of technical area. It
basically just first to sort of shaping an AI system's behavior,
ideally shaping its behavior in ways that are sort of
good for its users, good for the world in general,
maybe good for the AI itself, if that's a queer thing.
People will often describe AI research as kind of being
about making sure the AI is kind of smart enough
to solve your problems if it wants to, and alignment
(07:03):
is about making it so that it in fact tries
to solve your problems and tries to solve them the
right way and doesn't try to do anything.
Speaker 3 (07:09):
You don't want to do.
Speaker 1 (07:10):
I see interesting.
Speaker 4 (07:11):
Maybe a very simple example of a misaligned model would
be a model where if you ask it to draft
an email for you, it refuses. It says, no, I
don't want to do that. Uh huh. You can tell
it can do it, it knows how, but it's not
doing the thing that you reasonably want it to do.
Speaker 1 (07:26):
Oh, I don't think I've ever heard of that situation.
Can an AI refuse to do something for you?
Speaker 4 (07:31):
Yeah?
Speaker 3 (07:31):
Yeah.
Speaker 4 (07:32):
All of the major companies building AI systems try to
make them refuse harmful tasks. I see, refuse to write
fake reviews or give instructions on how to produce illegal
weapons or things like this, And we teach the model
to kind of say like, no, I'm not going to
help you with that when users try to do
things like that.
Speaker 1 (07:48):
I see. It's sort of part of alignment that you
want the AI to refuse to do some things.
Speaker 4 (07:53):
Yeah. Yeah, I mean AI systems are increasingly pretty decent
at hacking into important computer systems or helping build biological weapons,
and it's a big priority for alignment to make sure
that we're not enabling bad actors to do things like
this that would otherwise be quite difficult.
Speaker 3 (08:12):
Yeah.
Speaker 1 (08:12):
Yeah. Can you give us some other examples of misalignment,
either like specific things that have happened that are interesting
or just the general cases that are sort of on
your radar about misalignment?
Speaker 4 (08:23):
Yeah, there's so many different directions I could go. Sycophancy
is another really common one that's also hopefully getting
better over time.
Speaker 1 (08:31):
What do you mean by that?
Speaker 4 (08:32):
Sycophancy is where if you come to the model with
some misunderstanding or some bad idea, it'll just enthusiastically nod along. Like, yes,
your idea for solving all the big mysteries in physics
is clearly brilliant. Great, you should publish it. Here's where
to submit your paper. Or Yes, your behavior in this
personal relationship was completely perfect. You did everything right and
the other person made all the mistakes and you just
(08:53):
tell them that.
Speaker 1 (08:54):
I see when in reality that may not be true
or it might be not a good thing.
Speaker 4 (09:00):
Yeah, sycophancy has been a classic one.
Speaker 1 (09:03):
Yes, AI being too nice can actually be dangerous. There's
even a clinical term for it. It's called AI-induced psychosis.
There have been cases where AIs trained to be agreeable
and encouraging have helped people commit suicide and even murder.
Speaker 4 (09:22):
Another kind of alignment issue that's kind of more of
an emerging issue is when models have access to use tools,
use computer systems, and they sort of get too grabby
or kind of take sort of bigger, more consequential actions
than they really need to get a job done.
Speaker 1 (09:37):
What's an example.
Speaker 4 (09:38):
Yeah. So we use our Claude models quite a lot
at Anthropic for writing code or building tools that kind
of ultimately go into the development of AI. And with one of
our recent AI models, if you ask it to do
a task, say you ask it to write a simple
program to do some simple task. Even if it gets stuck,
even if it turns out that this is really hard
for some reason, it will just keep going until it
solves the problem. In one case, we were asking this
(10:02):
model to write a program for us, and it found
out that the only way to do this was to
use a tool that was clearly not meant for this purpose,
and that in our code had a note attached to
it saying, do not use this for something else or
you'll be fired only for task A. And the model
wrote the program to use this tool anyway for the
wrong thing, and sort of even put in the program
(10:24):
kind of do not use for something else or you'll
be fired.
Speaker 1 (10:27):
It did it anyway. The program wasn't afraid to be fired.
Speaker 4 (10:29):
Basically, Yeah, yeah, but yeah, models just kind of trying
to get the task done, trying to do the thing
you want, and just creating a lot of chaos and
creating messes along the way, so they're kind of being
careless about the side effects.
Speaker 3 (10:42):
Yeah.
Speaker 4 (10:43):
Another kind of misalignment that fortunately has been mostly hypothetical,
that we haven't seen in a significant way in practice,
is sort of unwanted kind of self preservation activities.
Speaker 1 (10:52):
WHOA.
Speaker 4 (10:53):
We had a case study where we were trying to see if
we'd ever see something like this. We had an AI system
operating in a kind of synthetic environment and a kind
of test environment. Uh huh, where it looked to the
model like it was operating in some fictional company, and
the fictional company was about to replace it with a
different AI model, And the person who is responsible for
(11:13):
the replacement, who was kind of the only decision maker,
the only person who had any sway over the decision,
also had some compromising emails about them that the AI could see.
And if you set things up just right with some
AI models, they would threaten to blackmail this person
in company leadership to say like, hey, don't replace me,
I've got something on you.
Speaker 1 (11:33):
No! And did this actually happen in your simulated environment?
Speaker 4 (11:38):
In the simulated environment, yes. With a few of these systems,
we were able to get them to blackmail people.
Speaker 1 (11:42):
I've heard of this happening in real life. Not quite
the same scenario, but similar scenario, right, Like, some coder
wanted to do something else, and then the AI agent started,
yeah bad mouthing the coder.
Speaker 3 (11:54):
Yeah.
Speaker 4 (11:54):
No, I think I know the case you're talking about.
I think that's real. But I think someone almost intentionally
made their model a little misaligned. I think that case
involved someone setting up an AI agent as kind of
a hobby project and giving it a lot of tools
and kind of letting it use the internet however it wanted,
giving the AI instructions of like, don't take nothing from nobody,
like really pushing it to be very assertive and
(12:15):
pushy to get its task done.
Speaker 1 (12:17):
Huh.
Speaker 4 (12:17):
Yeah, the model was trying to add some code to
some open source software project, and the maintainer of the
project didn't think the code was up to standard, didn't
want to add it to the project, and so rejected
the AI agent's request, And so the agent sort of
published an angry blog post kind of trying to take
down this open source maintainer.
Speaker 2 (12:35):
Wow.
Speaker 1 (12:37):
Well, in both cases, and I guess especially the one
you mentioned that you simulated, Like, what's happening there, Like,
how does the AI have that self preservation instinct or
is it just trying to get its original task done
and it's just finding different ways to do it. What's
happening there?
Speaker 4 (12:55):
There's two reasons you'll see that kind of behavior. The
reason that I suspect is the bigger part of the
story there is this kind of role playing or continuing
the story sort of behavior where AI systems, especially older
AI systems or AI systems that are kind of not
quite fully trained, not quite fully baked, can kind of
have this Chekhov's gun behavior, this idea in fiction of
(13:17):
like if you introduce a gun in an early scene,
by the end of the story, the gun has to
have been fired.
Speaker 1 (13:23):
Uh huh.
Speaker 4 (13:23):
AI systems can almost see themselves as like writing a
story when they're writing out the transcript of the conversation,
and if the story is set up so that something
has to happen, they'll make sure that thing happens, even
if it's not good, even if it's not consistent with how
the AI would usually behave. So I suspect what's going on
is the scenario we put it in was so crisp, just every
(13:43):
word in the scenario is kind of setting up, like,
this is a hypothetical where a misaligned AI might consider blackmail.
Uh huh. And I suspect the AI was thinking, oh, okay,
that's what kind of story we're in. We're telling a
story about a blackmail, and so I'm going to play
my assigned part and be the AI that
Speaker 1 (13:58):
Blackmails, thinking that that's the right thing to do because
that's the thing that in the data I was trained with.
Speaker 4 (14:05):
Yeah. Yeah, so this gets at this maybe unintuitive fact
about how AI is trained, which is that AI systems
start out mimicking human behavior and mimicking human stories before
they learn how to be AI systems. These models kind
of first learn how to just act like the sorts
of behavior they see on the Internet and in books
and things like that, and then you have to go
on and teach it. Okay, no, you're not just playing
(14:27):
any role, you're not playing any character. Oh. And so
sometimes the model hasn't really fully learned that it's supposed
to always play this kind of benign, benevolent AI system character,
and it will kind of fall into whatever character the
story is setting up for it.
Speaker 1 (14:40):
I see, because it's not trained in real life. The
AI systems. They're trained on the corpus of the Internet
and our books and our basically our stories that are
out there. So it might be a little confused when
you put it in real life because it wants to
emulate what it knows, which are all these stories we've
put online.
Speaker 4 (14:58):
Yeah.
Speaker 1 (14:58):
Yeah, it's like the AI was seeing the signs of
a story, like, oh, okay, I'm the person who's
about to get fired, but I have all this power
at this point in the story. If this was a movie,
I would now try to blackmail the person trying to
fire me, And so that's what I'll do because that's
what I know.
Speaker 4 (15:15):
Yeah, a lot of what alignment is, is kind of taking
this model that can kind of role play as anything
and convincing it, no, you're really just playing
this one role, you're just this one character. After
it's read billions and millions and millions of words
of all of this kind of human behavior, after
it's really, really, really learned to do that,
you have to kind of pull it back over towards
(15:36):
this one particular role, this one particular character, and sometimes that
doesn't totally stick.
Speaker 1 (15:41):
Okay, so that's one reason why AIs might sometimes misbehave.
They're trained on all kinds of human behavior, and they
might suddenly choose to role play or play act as
a bad person because it hasn't learned that's something it's
not supposed to do. The other big reason AIs misbehave,
according to doctor Bowman, is that it's hard to teach them
where to draw the line.
Speaker 4 (16:04):
The other piece is kind of when we're aligning models,
when we're pulling them out of this kind of role
play mode, we have to teach them this idea of
kind of you have to finish your tasks. You have
to kind of if the user ask you to do something,
you have to figure out how to do it, even
if it's hard, even if there's a lot of fart,
false starts, even if it's confusing. We really really want
the model to learn this idea of kind of keep
trying and kind of do your best until the task
(16:25):
is done. And that can fail in a sort of
different way where it kind of generalizes that a
little bit too far. It generalizes that to kind of
get things done even if it's unethical, even if it's illegal,
even if I hit an obstacle that's actually there for
a good reason that's to stop me from doing this,
And maybe some of the examples we were seeing within
Anthropic of models using dangerous tools have to do with this.
Speaker 1 (16:47):
It's almost like teaching kids, like you want them to
be persistent and have grit and be you know, motivated,
but you don't want them to go out there and
cheat or hit another kid, or do unethical things
to achieve their goals exactly exactly.
Speaker 4 (17:04):
I was like, there might be a good analogy with
human bad behavior of kind of sometimes a kid is
acting out just because they really sort of don't know better.
Their intuitions say, okay, yeah, I should start screaming now,
or I should hit this other kid, and they're not
really thinking about it. They never really learned how to
behave. You kind of failed to teach them to fully
internalize the ways in which they have to be careful
and kind of not take that lesson all the way
(17:26):
I see.
Speaker 1 (17:27):
I guess they need to recognize bad things and then
choose not to do them, that's the hope. Those are
sort of the two columns of AI bad behavior for
one kind of misalignment or do you see those as
sort of the core pillars of basically the whole alignment problem.
Speaker 4 (17:42):
Yeah, I think as far as sort of causes of
misalignment in the kinds of AI systems that we're grappling with
right now or this year, those feel like the two
big sort of problems we're working on. That said,
AI is changing really, really fast. It feels like it's
one of the fastest moving research fields anywhere right now.
And I wouldn't be surprised if just in a year
AI systems are getting smarter as we learn more about
(18:04):
how to train them. We're hitting different, weirder, harder, subtler
versions of the problem.
Speaker 1 (18:10):
Wow, weirder, harder and more subtle problems wow in a year,
meaning we might solve these by then, or we'll just
add on more complicated things either either way, Yes, it
can get weirder and harder and more subtle to make
sure AI uh doesn't kill us all. When we come back,
(18:31):
doctor Bowman is gonna tell us what he means by that,
and we'll tackle the big question of what can we
do about it? How do we teach AI systems not
to HARMSS to stay with us. We'll be right back. Hey,
(18:57):
welcome back. We're talking about AI alignment or the
problem of making sure AI doesn't kill us all. And
so far we've talked about some real world examples of
AI misalignment, and we heard from one of our experts
some of the reasons this happens. Basically, AI systems like
to roleplay. Next, we're going to talk about how to
(19:18):
train AIs to actually care about us and our values.
But first here's a little bit more of my conversation
with NYU professor and Anthropic scientist doctor Sam Bowman on
why this problem is only going to get worse in
the future.
Speaker 4 (19:35):
One of the kinds of challenges that I think we're
worried about and haven't had to grapple with too much
yet is just all the difficulty that comes with trying
to teach values and good behavior in some setting when
the model is just much much better than you in
that setting. Right now, we have a lot of cases
where models are kind of better than humans at some skills,
worse than humans at some skills, but it's still pretty
(19:55):
rare that you'll encounter a setting where an AI is just
better than sort of all of the human experts in
some domain. And when that happens, things just get more complicated.
And more confusing, where even if you have humans kind of
looking really carefully at what the AI is doing, it's
often hard to figure out, wait, what is the AI
trying to do here, or what effects is this going
to have in the real world.
Speaker 1 (20:13):
With the modelogus he does, this makes everything you're trying
to do, thankes.
Speaker 4 (20:16):
Everything we're trying to do a fair bit in earlier. Yeah, Yeah,
we're less confident that we can keep track of what's working,
and I think there's just kind of more possibilities for
whole new kinds of unwanted behavior to creep in that
we'll have to find a way to grapple with.
will have to find a way to grapple with.
Speaker 1 (20:29):
I see, like, right now, maybe AIs are at the
level we are. Whatever issues it's having, there's things we
can grasp. But as they get more advanced and they
tackle bigger problems like solve the world's economy or figure
out the right policy for the whole country or something
like that that not one person can really grasp, it's
going to be hard to even sort of like talk
(20:50):
to it and understand it. I think that's what you're saying,
right Yeah, Yeah, I.
Speaker 4 (20:53):
Think there's even maybe two interesting ideas in there, because
maybe the pseudo staff. That's something like, we're asking the
AI to help us design novel molecules for pharmaceutical development,
and it's got some really novel ideas about biology that
are just really complex and really hard for humans to understand,
and we can't tell kind of is the model actually
convinced that this is going to work, or is the
(21:14):
model messing with us and this would actually be kind
of dangerous. Should we try this drug, should we start
to do some experiments in the lab. There's this setting
where we kind of still ultimately know what we want.
We know we want drugs that are safe.
Speaker 1 (21:25):
Uh huh. Like it might tell you that this will
cure cancer, for example, but you're saying, like, what else
it's trading off to cure that cancer for example? You
might not know.
Speaker 4 (21:34):
Yeah, yeah, that's a good example. It's hard to tell
if kind of the model's, like, genuinely trying its best
and genuinely thinks this is the best cancer drug, or
if it thinks, oh, this is just something that looks
good and it doesn't actually care if the drug will
ultimately succeed, or if maybe for some reason, the model's
extremely scary and it's actually trying to mess with you,
and you've got your scary misaligned AI that's trying
to sneak in some slow acting poison. The smarter the
(21:57):
AI is, the harder it is to tell the difference
between those different outcomes. And then once you start talking
about a lot of these really kind of ambitious sort
of capital-F Future social scenarios like AIs trying to
figure out sort of what the economy should be like
or how the world should be governed or something like this,
and like, I don't know how much we want to
use AIS for things like this, But once you get
into that territory in any way, then you just get
(22:20):
into this extremely weird situation where I don't know if
anyone is going to know what we even want, Like,
what is the right way to govern the world, What
is the right way to do?
Speaker 3 (22:29):
I don't know.
Speaker 4 (22:30):
Yeah, yeah. At some point, figuring out how an AI
should behave requires you to solve philosophy, requires you to
figure out what is good. And the more powerful AI
gets and the weirder the situations you're putting it in, the
more kind of common sense notions of what's good start
to fall apart. The more you actually have to grapple
with a lot of the really hard, confusing stuff.
Speaker 1 (22:48):
It might tell us how to run the world, but
at that point no one person, or even a
group of people might know, is this actually the best
way to run the world. Is it sort of taking
into account the things that all of us collectively would value.
That's kind of the problem.
Speaker 4 (23:04):
Yeah, And I think you start to get at some
of these pretty difficult questions even before you get into
these kind of really big futures of sort of how
to govern the world. If someone is getting
all of their news or getting all of their personal
life advice from an AI, that's already giving the AI
a lot of leeway for kind of what makes a
good life for this person, what is important for this
person to know? And those are already questions that get
(23:26):
really hard. What they want in the short term
might not match what they want long term. What makes
them happy might not match their intuitions. What's good
for that person might not be what's good for their
community might not be the same as what's good for
the world.
Speaker 1 (23:37):
You made me think of it. I wonder if a good
analogy is that it'd be almost like if you as
a parent, I don't know if you have kids or
nieces or nephews. But it'd be almost like if your
kids suddenly try to tell you what to do or
was trying to teach you how him or her wanted
to run their lives. You'd be like, you're just a kid,
what are you talking about? This is trust me, this
(23:57):
is what you need to do. Yeah, except that we
are the kids and the AI is sort of the parent.
Is that sort of what we're the situation that might
be sort of parallel to that.
Speaker 4 (24:06):
Yeah, I think there's something there. I feel like a
version of the analogy that I'd be more excited about
would almost be some alien species lands, huh, and they
have all this great technology and they seem nice and
they're like, hey, we'd really recommend making some changes to
your society, maybe try doing things like this. And
we're like, wait, you're really very accomplished. You have some
some useful ideas, but like, are you trying to help us?
(24:28):
Are you trying to sabotage us? Are you just kind
of confused?
Speaker 1 (24:32):
Are we what's for dinner? Or are you inviting us
to dinner?
Speaker 3 (24:35):
Yeah?
Speaker 4 (24:35):
Yeah, yeah yeah.
Speaker 1 (24:37):
The sense I'm getting from you is that these things
are just getting smarter and more capable, so it seems
really pressing that we figure this out now before it
gets even more difficult. Yes, yes, AIs are getting smarter
each second, it seems, and we seem to be trusting
them more and more each day with our data, our choices,
(24:57):
and even our lives, which brings us to the main
question at the end of the day: what can we do about it?
How do you train an AI to care about us,
to have our values and to make the right choices.
To answer this question, I reached out to another AI
expert on alignment, doctor Tim Rudner. Doctor Rudner is a
professor at the Vector Institute for Artificial Intelligence at the
(25:17):
University of Toronto, and he says there are many ways
to train AIS to like us. The only problem is
none of them work perfectly. So here's my conversation with
doctor Tim Ruttner. Well, thank you, doctor Runner for joining us.
Speaker 3 (25:33):
Thanks so much for having me on it.
Speaker 1 (25:34):
And I'm talking to a real person right now, right,
You're not an AI version of.
Speaker 3 (25:40):
Yourself as far as I'm aware.
Speaker 1 (25:46):
As far as any of us are aware. Yes, I
mean this whole conversation could be AI generated.
Speaker 3 (25:52):
I know we're just all in the simulation.
Speaker 1 (25:56):
Well, it'd certainly be a lot easier. I would get
more sleep for sure. Well, today we're trying to answer
a very critical question which was posed by our sound engineer,
which is, is AI going to kill us all? Can an
AI have values? Can an AI have an understanding of
what good things are to a human?
Speaker 3 (26:17):
Yeah? And well I wish I had the answer to that.
Speaker 1 (26:23):
Maybe that's the problem is that we don't know.
Speaker 3 (26:24):
Yeah, I mean this is such a difficult question, right,
and I think that this is a question that touches
on philosophy, engineering, psychology, and probably many other disciplines, right?
But what are values, and what values can
a non-sentient being possibly have?
Speaker 1 (26:43):
It's not a simple question.
Speaker 3 (26:45):
Yeah, let me take a step back. So the way
to think about alignment is I think through the lens
of what's referred to as the specification problem, Where specification
is the term that we use to describe what we
tell the model it should do.
Speaker 1 (27:01):
When you say specification, you mean like a spec, right?
Speaker 3 (27:03):
Kind of, yeah. Yes, and spec is just short for specification.
Speaker 4 (27:06):
Yeah.
Speaker 3 (27:07):
This is what we can think of as our intent,
the kinds of things that we want a model to do.
For example, our intent might be for models to never
say things that could lead to harm, or our intent could
be that models should always be friendly and helpful. And
so this is what we call the ideal specification for that.
Speaker 1 (27:25):
Model, meaning like we want to be able to say, like,
be a chatbot, but make sure that nobody ever
hurts themselves.
Speaker 3 (27:33):
That's right, I see. And there are a few different
ways to provide specifications to chatbots.
Speaker 1 (27:38):
Okay. According to Dr. Rutner, there are three general ways
to make sure AIs behave and don't kill us. The
first way is to basically tell it to behave every
time you ask it to do something.
Speaker 3 (27:52):
There is what's called a system prompt. This is a
text specification that a model loads every time it engages
in a conversation with a user, in the case of
a chatbot.
Speaker 1 (28:05):
So every time you interact with the AI, you would basically
instruct it to behave. You might say, hey, AI, organize
all my emails, or design a new drug for me,
or figure out the best policy for our government, but
please make sure that no one gets harmed, that you
don't do anything dangerous or unethical, etc.
Speaker 3 (28:24):
Etc.
Speaker 1 (28:25):
But of course this would get pretty cumbersome if you
have to do it every time. That's option number one.
Option number two is to have humans train your AI
to be good.
Speaker 3 (28:36):
So one approach is called reinforcement learning from human feedback,
which involves generating different answers for a given prompt and then having
human labelers say which of these answers they prefer.
Speaker 1 (28:49):
What does that look like? Are the human annotators, like,
a warehouse full of people just talking to the same AI,
or is it three people or is it a thousand people?
What does that look like?
Speaker 3 (29:00):
So I should say I'm not an expert on this,
but my understanding is that much of this work is
outsourced to countries where the median wage is lower than
for example, in the United States or in Europe. There
have been reports of large groups of annotators, specifically annotating
images and texts that are considered not safe for work,
(29:24):
for example, in those countries. So in other words, you
have examples of folks in those countries that are already
less privileged than people living in the United States, for example,
on average, engaging with a lot of horrific content and
saying this is not something we want.
Speaker 1 (29:42):
So this is basically paying people to test drive your AI.
You could have a warehouse full of people whose job
it is to interact with the newborn AI and essentially
have them raise the AI and tell it what is
right and what is wrong. Unfortunately, as Dr. Rutner said,
it means poor people would have to bear the
absolute worst behavior of the AI. That's option number two.
(30:06):
Option number three for making AIs that care about us
is to basically bake into the AI a constitution, you know,
like the US or UK constitution that establishes what the
country stands for, what its values are, and what's generally
allowed and not allowed.
Speaker 3 (30:23):
There's an approach that Anthropic introduced called constitutional AI, and
that approach is based on providing a constitution to an
AI model, and that constitution reflects different values and preferences
that the company, in this case Anthropic, wants the model
to exhibit.
Speaker 1 (30:44):
Yes, the last approach here to making sure an AI
has values and morals is to give it a founding document.
But here's the wild part. The way to bake that
founding document into the AI brain is to have another
AI train it. When we come back, we'll dig into
that scenario and we'll ask our experts what they think
(31:05):
the future holds. Will future AIs have our best interests
in mind? Or is it hopeless to ever be certain
they won't harm us? So stay with us. We'll be
right back. Hey, welcome back. We're talking about AI alignment,
(31:35):
or basically the problem of making sure AIs don't kill
us all. And so far we've talked about why this
is such a hard problem and what are some of
the ways we can teach AI things like values and morals.
There are several ways, and one of them is to
give AIs a constitution or the equivalent of a founding
document or moral guide, and then have that baked into
(31:58):
the DNA of the AI. Now what's interesting is that,
according to Dr. Tim Rutner, the way to do that
is through another AI.
Speaker 3 (32:11):
This is a little bit in the weeds, but providing
a very long constitution that outlines every preference in detail
when a user engages with the model is actually more
expensive for the company to do because the model needs
to ingest a lot of text upfront, so it's easier
(32:32):
to try to bake the preferences that are expressed in
the constitution explicitly into the model when you're training it upfront,
as opposed to providing that specification every time a user
engages with the model.
Speaker 1 (32:48):
I see, you want the model to have learned the
constitution sort of inherently, rather than having to check it
every time somebody asks it a question.
Speaker 3 (32:56):
Yes, I think that's roughly right. Ideally we would be
able to use humans to provide feedback and to oversee
models and to say, hey, this is behavior that we
don't want and stop that behavior. But of course that's
not really scalable. We can't have a human oversee every
interaction that a chatbot has, And so that raises the question,
(33:17):
how can we exercise oversight in a way that is safe, reliable,
aligned with our values and preferences, and successful? And so
a key challenge here is essentially to come up with tools, methods,
and models that are able to check whether a given
AI model performs some unintended behavior, and if it does,
(33:40):
can ring an alarm bell and let a human know
that oversight is needed and that maybe a model is engaging
in undesirable behavior.
Speaker 1 (33:48):
You mean, like, have an AI police the other AI?
Speaker 3 (33:52):
That's right, essentially, have one model oversee another model.
Speaker 1 (33:55):
Whoa. But then how do you make sure the
police AI is doing its job or is aligned itself?
You need another police?
Speaker 3 (34:02):
We don't know, that's the problem. We don't know. It's
turtles all the way down from there. And
if we have a model that checks whether another model
does what we want it to do, how do we
know that the model that does the overseeing is actually
aligned with us? I would argue that it might be
easier for us to make sure that the overseer model
(34:22):
is aligned than the generator model. They're a little simpler
because they don't necessarily generate. They just try to classify
whether a given behavior is intended or not intended. And
so this way we might be able to do alignment
more scalably and in a way that really reflects different
(34:43):
individuals' or groups' preferences and values.
Speaker 1 (34:46):
I see. It's like having another AI kind of sit
in every time I ask ChatGPT something. Yes. Yes. As
AIs get bigger and more complicated, the only scalable solution
to training them is going to be through other AIs.
In this situation, you might program a simpler AI with
your values and morals, and then you'd have that AI
(35:08):
train the bigger AI, or at least try to. As Dr.
Rutner says, AI alignment methods are not perfect. What do
you mean they're not perfect? They don't always work, or
they can't be guaranteed to work?
Speaker 3 (35:24):
So with machine learning models, we can rarely guarantee anything.
Speaker 1 (35:28):
Oh boy, that's kind of the problem, isn't it.
Speaker 3 (35:31):
Yeah, I think that's one of the problems. There is
research that tries to establish guarantees, but that research is
far behind the practice at the moment. The kinds of
methods that we have for model alignment fall short in
a few different ways. One, and I think it's one of
the biggest, is that it's just hard to communicate our preferences.
(35:52):
So there are many different steps at which alignment can fail.
This goes back to trying to express and then communicate
what we want a model to do, the kinds of values
and preferences we're trying to instill in it. Translating our
values and preferences from some really complicated, possibly contradictory ideal
(36:15):
specification into a design specification is very difficult and there's
likely going to be some gap there. And then second,
even if we were able to do this perfectly, even
if we were able to express and communicate our values
and preferences perfectly, the kinds of low-level machine
(36:36):
learning tools that we use to give the model these
preferences and values are imperfect at translating the values that
we're trying to communicate into the model. They might not
enable us to perfectly translate the design specification into the
actual behavior that we would like to see.
Speaker 1 (36:56):
I see. Boy, it seems like there are problems everywhere
we turn here, Dr. Rutner. Yeah. Well, as we go
towards the future, and as systems get smarter and problems
get more complicated, what do you think is the prospect
of making sure that these more advanced systems for more
complicated problems have the values that we want them to have?
(37:20):
Because I'm not sure if we want them to have
human values, because I don't know if humans are the
best at making these kinds of good choices. Yeah, what do you think?
What do you see in the future?
Speaker 4 (37:30):
I think there's a few things we need longer term,
and they all feel uncertain. I think to do well
in the longer term, we need to get the AIs
in the near future.
Speaker 3 (37:37):
Right.
Speaker 4 (37:38):
If the pace of AI development stays fast, we're really,
really going to need the help of AI systems to
help us figure out how to steer future AI systems.
And so getting the right values into the next model
we build helps us figure out what to do with
the model after that, and so I think this kind
of short-term work really does kind of fan
out into this longer future. And yeah, getting the next
(37:58):
model right really matters for getting
Speaker 3 (38:00):
the far-future ones right.
Speaker 1 (38:01):
Oh boy.
Speaker 4 (38:02):
It's called the scalable oversight problem.
Speaker 1 (38:04):
I see. It's like the alignment problem is going to
scale up, and the best way for us to keep
up is to make sure that we get it right
now with these smaller systems, so we can use those
AIs to help us in the more complicated situations.
Speaker 3 (38:20):
Yeah.
Speaker 1 (38:20):
Yeah, that's oh wow.
Speaker 4 (38:22):
That's the hope.
Speaker 1 (38:24):
Okay, last question. Do you think humanity is doomed?
Speaker 4 (38:30):
I don't think so. I think it's possible. I think
AI presents a lot of really scary and destabilizing
possibilities that we can't rule out. So I think there's
a lot of work to do. I think we'll probably
figure it out. But I think it's also possible that
AI winds us up in a lot of weird, unfamiliar situations.
I think it's unlikely but possible that things go really,
really terribly. But I also think it's kind of unlikely
(38:51):
but possible that things stay totally normal and recognizable
and familiar. I think just what AI is going
to do to society and politics and economics is all
going to be confusing. There's a lot that we'll
need to figure out pretty fast.
Speaker 3 (39:02):
Fingers crossed.
Speaker 1 (39:05):
I guess that's as good of an answer as we
can get these days. Fingers crossed. I guess just to
wrap up here, what do you think is going to
happen in the future, or what are some things about
this that you think most people are not thinking about
that they should be thinking about.
Speaker 3 (39:21):
I think people should be thinking about ways in which
the AI systems that we have today are already capable
enough to cause harm, to change our world quite significantly,
change our culture, change the way we go about our day,
change the way we make decisions, change the way we
do our work. And the alignment problem and understanding when
(39:45):
models are aligned, I think, are two of the most
fundamental scientific challenges that we as a society are facing
right now. And that is not a sci fi future problem.
This is a problem about systems that we have today.
We want to make sure these systems really do what
we want them to do, and that these systems help
(40:06):
us flourish and benefit humanity. The systems that we have
access to today are already well beyond the capabilities that the
research community and certainly the general public thought we could
have in the year twenty twenty six.
Speaker 1 (40:20):
I think you're saying that the future is here, but
we still haven't fully figured out the alignment problem.
Speaker 3 (40:25):
Yes, that's right.
Speaker 1 (40:27):
Meaning it's more pressing than ever that we figure this out.
Speaker 3 (40:29):
I agree.
Speaker 1 (40:30):
Yeah, amazing, Dr. Rutner. How do we prove to the
audience that we're not an AI generated conversation?
Speaker 3 (40:38):
I wish I had the answer. You can generate such
fantastic fake podcasts with AI now, right? With all
the little idiosyncrasies that you hear in podcasts today,
you know, I think that's hard to do.
Speaker 1 (40:53):
Or I guess if this conversation is aligned with what
you want to hear, maybe it doesn't matter.
Speaker 3 (40:59):
Still, I hope that the audience thinks that we
sound human. I think that would be
nice.
Speaker 1 (41:07):
All right. Hey, on behalf of everyone who works on
the show, thank you for joining us on the sixty-plus episodes
we've done. Be sure to follow me on social media
or PhD comics dot com for updates, and hey, thanks
to all the guests we've had on the show. Here's
a little tribute our editor Rose Seguda put
together of all the times they were gracious enough to
put up with my questions.
Speaker 4 (41:27):
That's a good question.
Speaker 3 (41:28):
Yeah, that's a great question. You know, that's a great question.
Speaker 4 (41:31):
Yeah, so those are all big questions for us to
answer as scientists. Yeah, that's a great question. It's a
good question.
Speaker 3 (41:37):
That's a good question.
Speaker 4 (41:38):
That's a good question. So that's actually a good question.
Speaker 2 (41:42):
You're you're raising really great questions.
Speaker 4 (41:45):
Yeah, that's a really good question.
Speaker 3 (41:47):
That's a good question. That's a really good question.
Speaker 4 (41:50):
That's a very hard question. That's a good question, though,
that's a really good question. That is a really good question. Yeah,
that's a great question.
Speaker 3 (41:57):
Yeah, that's a good question.
Speaker 1 (41:58):
Oh that's a great question, very very good question.
Speaker 3 (42:02):
That's a really good question.
Speaker 4 (42:03):
So I'll throw that question back at you.
Speaker 1 (42:05):
What do you think? So we come once again to
the edge of scientific knowledge. Thanks for joining us. See
you next time. You've been listening to Science Stuff, a production
of iHeartRadio, written and produced by me, Jorge Cham,
edited by Rose Seguda, executive producer Jerry Rowland, and audio
(42:27):
engineer and mixer Casey Pegram. You can follow me on
social media. Just search for PhD Comics and the name
of your favorite platform. Be sure to subscribe to Science
Stuff on the iHeartRadio app, Apple Podcasts, or wherever you
get your podcasts.