
March 6, 2024 57 mins

In this episode, we’re joined by Amal Iyer, Sr. Staff AI Scientist at Fiddler AI. 

Large-scale AI models trained on internet-scale datasets have ushered in a new era of technological capabilities, some of which now match or even exceed human ability. However, this progress emphasizes the importance of aligning AI with human values to ensure its safe and beneficial societal integration. In this talk, we will provide an overview of the alignment problem and highlight promising areas of research spanning scalable oversight, robustness and interpretability.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Joshua Rubin (00:06):
Yeah, good morning, good afternoon, and whatever time of the day it is for you.
Uh, welcome and thank you for joining us on today's AI Explained, uh, on AI Safety and Alignment.
Um, I'm Josh Rubin, Principal AI Scientist at Fiddler.
Uh, I'll be your host today.
We have a special guest today on AI Explained, and that's Amal Iyer, Senior Staff Scientist at Fiddler AI, and my colleague of several years.

(00:30):
Um, welcome Amal.

Amal Iyer (00:32):
Uh, thanks Josh for having me here.
Pretty excited to talk to you.
Always fun.

Joshua Rubin (00:37):
Cool, cool.
So alignment is the topic.
Uh, do you want to start out by talking about, like, a little bit about why we should care about safety and alignment for AI models?

Amal Iyer (00:46):
Yeah.
Um, let me sort of, you know, set the stage a little bit, uh, and, you know, ChatGPT and Gemini have made it easy for me, so I don't have to talk about how capable current systems are and how in many ways they've come out of the blue.
Um, so if, if I were to take you back to 2018 and say you are an NLP scientist,

(01:09):
you're training a BERT-like model, and generally you are trying to get it to, you know, uh, do an entailment task or predict certain labels, uh, sentiment, et cetera.
Uh, so compared to even 2018 or 2019, we've come a long way.
Um, these models are able to generate not just coherent sentences, but are

(01:32):
able to do a lot of impressive things.
Um, and the capabilities, of course, haven't surpassed humans in many dimensions, especially sort of generalized thinking.
But we are seeing sort of initial sparks of what one could contend are highly capable general systems.
So I won't say the word AGI, but I think highly capable general systems is a good

(01:55):
way to characterize, uh, these systems today and where they will continue to sort of go over the next few years.
Um, so a big part of ML research has focused on capabilities, uh, and I would say the bulk of, uh, our effort over the past several decades in ML, um, and AI in general, has been focused on, you know, improving capabilities

(02:18):
of models and devising methods.
And I think it's a good time, given how rapid the progress has been over the past few years, to start thinking more broadly about safety and alignment of these models.
'Cause not only do we want these models to be highly capable, we want them to perform activities on behalf of us in a safe manner.

(02:44):
And we want them to be aligned with our broad human values.
And that's why I think this is a great time for not just ML researchers, but, um, ML practitioners and users of these tools to start thinking about
safety and alignment more broadly.

Joshua Rubin (03:01):
Yeah.
I, I think, you know, I think it's easy to overlook, you know, when we're talking about, okay, well, it's not as smart as a human in this way, or it responds in some silly or funny way to my prompt.
I think it's really easy to overlook how remarkable it is that we suddenly have models that are instruction following at all.
Right?
Like, like, I mean, people talk about instruction following, they talk about zero-shot learning or few-shot learning as though prompting a model

(03:25):
is a method of machine learning, right?
It's a totally new paradigm.
Um, and, and it's really a remarkable, uh, capability, um, that, you know, it's easy to just kind of go, wow.
But it really should be a hint that these models have just entered a totally new domain of capability where, um, you know, they can do all sorts of things that we intend them to do

(03:48):
and maybe don't intend them to do.
Right.
So, I don't know if you, do you wanna say something about instruction following, since that's sort of the, the key capability of the LLMs?

Amal Iyer (03:56):
Yeah, so I, I think, you know, you, you hit the nail on the head there, that somehow in 2024 we've normalized some of this behavior.
Like

Joshua Rubin (04:04):
Yeah.

Amal Iyer (04:05):
And, you know, in late 2022, when, uh, a lot of us started playing with these models through ChatGPT, et cetera, it was like, whoa, we are here.
And then in early 2024, it seems like, oh, just another thing.
I mean, we've sort of normalized it. So I think the pace of progress, like, when you are on an exponential curve, can seem, you know, like, and

(04:30):
we may not be able to realize how quick the progress is, because we tend to normalize a lot of technology advances.
Um, in terms of sort of, you know, just talking broadly about training these models, and specifically about instruction tuning, I think what we've hit upon is an interesting recipe that we

(04:52):
could potentially like keep scaling.
Um, so for those of you who are not familiar, there's this notion of something called scaling laws, which, uh, says, "Hey, you know, uh, let's actually train larger models on larger corpora of data."
So, um, you can say you're essentially throwing more compute at

(05:12):
the problem, more compute and more data.
And for a classic ML researcher, it's like, oh, that's not interesting at all.
But what that has unlocked is a lot of interesting emergent capabilities like instruction following.
I think one that blows my mind is in-context learning.
Um, the fact that something like that emerged is

(05:34):
pretty surprising, and we weren't able to predict it.
And given the amount of resources, um, both compute, data, and great people, uh, and companies that are trying to really push forward scaling, um, it's anybody's guess what kind of new capabilities we might, uh, unlock in the coming years.
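A rough way to see what "scaling laws" claim is a Chinchilla-style form, where loss falls off as a power law in parameters and training tokens. Here is a minimal sketch with made-up constants purely for illustration; the real fitted values come from the scaling-law papers, not from this episode.

```python
# Hedged sketch of a Chinchilla-style scaling law: loss as a power law in
# model parameters N and training tokens D. The constants below are
# illustrative placeholders, not fitted values from any paper.
def estimated_loss(n_params: float, n_tokens: float,
                   a: float = 400.0, alpha: float = 0.34,
                   b: float = 400.0, beta: float = 0.28,
                   irreducible: float = 1.7) -> float:
    return a / n_params**alpha + b / n_tokens**beta + irreducible

# Scaling both parameters and data nudges the loss down smoothly -- the
# "just throw more compute and data at it" observation.
for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e} -> estimated loss {estimated_loss(n, d):.3f}")
```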

(05:57):
Uh, which is not to say that, you know, uh, scaling laws are going to be there forever.
We are seeing sort of plateauing of the original silicon scaling laws, right?
Like, we are down to three nanometers and it's beginning to plateau.
But humans are, like, you know, we are interesting, we've found, uh, parallel computers.
We've sort of, uh, broken through that, you know, Moore's

(06:21):
Law bottleneck by saying, we'll just develop parallel computers.
We're taking advantage of that in the context of transformers.
So I think I, I would say at least for the next few years, we have a very clear path for, um, scaling these models, given the amount of resources that are at play, and we might see more capabilities unlocked.

(06:42):
And so even from that perspective, sort of just thinking not just a decade out, but three to five years, uh, we, we need to start sort of ramping up and understanding why these models perform the way they do, and what are good techniques to oversee and align them.
So it's a great time. If you are someone who's, and I'm not even

(07:04):
saying an ML safety researcher, anyone who's broadly interested in, um, you know, like, safe use of technology,
I think this is a good time to start sort of engaging, uh, with this community.

Joshua Rubin (07:16):
Yeah, I think, you know, we talk about compute being a constraint, but there's also the constraint of, okay, what happens after you've trained the model on all of the text on the internet, right?
Or, you know, there is, there is a constraint on the amount of data available to be consumed.
Uh, and I think it's really interesting to think about the ways in which, you

(07:36):
know, multimodal models, you know, we see, um, you know, OpenAI and, you know, Google and their Gemini models, you know, starting to consume images and audio.
Uh, and it'll be really interesting to see what happens to the scaling laws as, you know, those models start natively traversing more than one different, um, mode of input.

Amal Iyer (07:55):
Right, right.
Um, yeah, and I, I think, you know, there's clearly a lot of impetus on capabilities, uh, via scaling laws and, like, adding more modalities.
Um, so I, I think, even if we say, like, hey, you know, it's unclear to us what will get us to, uh, general intelligence

(08:18):
systems that surpass us, uh, we'll at least end up getting systems that are really good at certain specific things.
And I, I must say that I probably write worse summaries today than ChatGPT or a Gemini-like model or a Mistral-like model.
Um, and they can do it so much faster.
So I think if I were to write summaries, I think

(08:41):
increasingly, there'll be a tendency for us to offload certain tasks to models, and, you know, that's been the, the march of technology, right?
Like, you, you see, uh, systems that are doing things for you, and you say, okay, I'm gonna offload this activity to the machine.
Uh, we saw that with spreadsheets; we see it even in our homes with the dishwasher.

(09:04):
Uh, so I think increasingly, when you are offloading, uh, tasks to these complex systems, uh, there is emergence of risks at large scales, especially because now you have, you know, large swaths of human society that are going to use these tools.
And so if the model has, you know, uh, a lack of, say, robustness, or some unintended

(09:30):
sort of, um, emergent property that we never really assessed, then you expose society to large-scale risks, because of not just the, the power of the capabilities of these models, but just widespread adoption of these models.

Joshua Rubin (09:45):
Yeah.
So, so you know, kind of with that, I think the alignment story, like, it sort of starts with talking about really how we train these models.
Right, right.
Um, to follow instructions. To, you know, I think, uh, if you go to, you know, OpenAI and you read their blogs, they talk about the difference between, you know, the base GPT-4 and, uh, ChatGPT-4, and how those two things are really

(10:08):
different in the absence of, um, you know, fine-tuning for a specific, you know, a specific mode of interaction.
Um, right.
So, I don't know if you wanna jump into kind of telling us a little bit about, uh,

Amal Iyer (10:22):
Yeah.
I think it'll probably provide a good stage for our follow-ups as we dive deeper into some of the areas of research in alignment.
I'm gonna sort of quickly talk about how we train these instruction-following, uh, models.
Um, and as you alluded, Josh, the step one, or step zero, that hasn't been sort of captured in this graphic here is, um, to take a transformer

(10:48):
model and, uh, you ask it to predict next tokens, um, in a sequence.
And, uh, that really at the core is a language modeling task that has been around for quite some time.
And, um, I, you know, I have a confession to make.
Back in 2017, when I was working on speech recognition, um, you would train

(11:11):
an acoustic model, which is trying to predict, like, what you're trying to say, uh, either in phoneme space or directly in, like, spelling, uh, alphabet space.
And then you would slap a, a very lightweight language model on top, which would guide, um, the predictions of the acoustic model.
And I never really thought much about the language model.

(11:32):
I was so focused on, like, getting the acoustic model right.
And honestly, I never thought, you know, that was going to bring us to where we are here today.
Um, so anyhow, language modeling as a task, I think, uh, we still don't fully understand, um, why, uh, certain skills emerge with this very simple task of predicting the next, next token.

(11:54):
One of the hypotheses is that, uh, you know, next-token prediction requires a lot of comprehension.
And so that leads to the emergence of a lot of, like, you know, uh, a collection of skills.
Um, and the model can sort of, uh, combine these skills to do tasks later on, as we look through steps one, two, and three.

(12:15):
So the pre-training is a very important step, and that's where the scaling laws also, uh, come into play, uh, where you're basically saying, okay, you know, I'm gonna feed internet-scale data to a very large model and run it on giant distributed compute.
So that's step zero.
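For readers who want "step zero" in code terms: language-model pre-training is just next-token prediction with a cross-entropy loss. A minimal PyTorch sketch, with a toy vocabulary and a tiny stand-in model rather than any production training loop:

```python
import torch
import torch.nn as nn

# Toy stand-in for pre-training: next-token prediction with cross-entropy.
vocab_size, d_model, seq_len = 100, 32, 16
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))  # stand-in for a transformer

tokens = torch.randint(0, vocab_size, (4, seq_len))  # a batch of token ids
logits = model(tokens[:, :-1])                        # predict from each prefix position
targets = tokens[:, 1:]                               # the "next token" at each position
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()  # scaling this loop to internet-scale data is the expensive part
print(float(loss))
```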
Uh, but now this, this model that you've trained, the base or the

(12:36):
pre-trained model, is really good at next-word prediction.
Uh, what we want to do is get these models to do things on our behalf and follow instructions, uh, et cetera.
So there's a recipe called RLHF, uh, or Reinforcement Learning from Human Feedback, and that's what this graphic is, uh, um, sort of, you know, succinctly, uh, showing here.

(12:59):
Uh, so step one is, once you have that pre-trained model, uh, you collect a data set, uh, which is, uh, a data set of prompts and, uh, what we call demonstrations, uh, from users.
So really you are collecting a data set from, uh, humans and showing the model how to do certain tasks.
So for example, here there's, like, explain the moon landing to a 6-year-old.

(13:23):
So a human would demonstrate it.
And because, you know, the, the prompt is asking them to explain the landing to a 6-year-old, they might not use a lot of jargon.
They'd like to keep things simple and concise.
Maybe add an illustrative example so that, you know, um, you sort of paint a vivid picture, um, for the, for the 6-year-old.

(13:44):
Um, so step one is, uh, you take the pre-trained or base model and you are, you know, just fine-tuning on these demonstrations.
So that's a supervised learning step.
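In code terms, step one is ordinary supervised fine-tuning: the demonstration is just a target sequence, and the loss is the same next-token cross-entropy, typically restricted to the human-written response. A hedged sketch, assuming some pretrained causal LM `model` that maps token ids to logits (the masking detail is a common convention, not a claim about any specific lab's setup):

```python
import torch
import torch.nn.functional as F

# Sketch of step one (supervised fine-tuning on demonstrations). `model` is an
# assumed callable mapping token ids -> logits of shape (batch, seq, vocab).
# The key detail: loss is computed only on the demonstration tokens, not the prompt.
def sft_loss(model, prompt_ids: torch.Tensor, demo_ids: torch.Tensor) -> torch.Tensor:
    input_ids = torch.cat([prompt_ids, demo_ids], dim=1)
    logits = model(input_ids)
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Ignore loss on the prompt portion: only the human demonstration is supervised.
    shift_labels[:, : prompt_ids.size(1) - 1] = -100
    return F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                           shift_labels.reshape(-1), ignore_index=-100)
```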
Once you're done with step one, you move to step two, which is where you're trying to, um, learn human preferences.
Um, and here, you know, the challenge is you want to really scale up

(14:07):
the learning of human preferences.
So, uh, if, if you imagine, like, one option would be you ask the model to generate, uh, some, uh, sample, um, generations for, say, this prompt, um, that we talked about.
And, uh, humans rate the outputs, and you do it for every single

(14:28):
generation and prompt in your dataset.
Um, it's really hard to scale that process.
So one way to scale this process is to say, hey, I'm going to actually train yet another model, we'll call it a reward model, that's just trying to learn human preferences.
So it might learn preferences like, for a 6-year-old, you shouldn't really, uh, use a lot of jargon.

(14:50):
You should keep things simple.
Add in illustrative examples, uh, et cetera.
So some qualitative and quantitative aspects of, um, you know, uh, human preferences.
Um, and so now you have a reward model that has encoded human preferences.
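In the common formulation (for example, the InstructGPT-style setup), the reward model is trained on pairwise comparisons: for a given prompt, a human prefers one completion over another, and the reward model is pushed to score the preferred one higher. A hedged sketch of that ranking loss; `reward_model` is an assumed module that maps token ids to a single scalar score per sequence:

```python
import torch
import torch.nn.functional as F

# Sketch of the pairwise preference loss typically used for reward models:
# given scores for a "chosen" and a "rejected" completion of the same prompt,
# push the chosen score above the rejected one.
def preference_loss(reward_model, chosen_ids: torch.Tensor,
                    rejected_ids: torch.Tensor) -> torch.Tensor:
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar reward per sequence
    r_rejected = reward_model(rejected_ids)  # (batch,)
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outranks rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```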
And now step three is, uh, where you actually, you know, train the

(15:10):
model using, um, uh, the reward model.
So the output of this stage is what we are actually interacting with when you go to, um, you know, uh, the ChatGPT interface or the Gemini interface, um, uh, et cetera.
So in, in this stage, uh, you use, um, something called reinforcement learning.

(15:31):
Um, and, uh, in reinforcement learning, the model is generally termed a policy.
So when I use the word policy, uh, it really is the model, uh, whether that's a ChatGPT model or Gemini or so forth.
And what that model does is, given, uh, a prompt, it, you know, it tries to follow the prompt.
Like, uh, say in this case, write a story about frogs.

(15:54):
It's gonna generate a story.
And instead of asking humans to rate these stories, because in step two we learned some human preferences, we are going to use the reward model that has encoded those preferences.
And this reward model is going to score the, um, uh, the generations by your policy model, your final sort of output model.

(16:15):
And you sort of, you use something called proximal policy optimization, which is, uh, um, uh, a method of training neural nets, uh, using reinforcement learning, and, uh, sort of get your policy model to, you know, uh, have generations that are rated highly by your reward model.
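A full PPO implementation is long, but the core quantity being optimized in step three is usually the reward-model score minus a KL penalty that keeps the policy close to the supervised fine-tuned model. A simplified sketch of that shaped reward, assuming per-token log-probabilities from the policy and a frozen reference model are already available; the PPO machinery itself (advantages, clipping, value function) is omitted:

```python
import torch

# Simplified sketch of the RLHF objective in step three: maximize the
# reward-model score while a KL penalty keeps the policy near the SFT model.
def shaped_reward(reward_score: torch.Tensor,        # scalar score per response
                  policy_logprobs: torch.Tensor,     # (batch, seq) from the policy
                  reference_logprobs: torch.Tensor,  # (batch, seq) from the SFT model
                  kl_coef: float = 0.1) -> torch.Tensor:
    kl_per_token = policy_logprobs - reference_logprobs
    kl_penalty = kl_coef * kl_per_token.sum(dim=-1)
    return reward_score - kl_penalty  # higher is better for the policy
```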

(16:39):
And so at this stage, once you've sort of cranked the wheel, um, and gone through steps one, two, and three, you now have a policy model, or a final model, that can follow instructions, that you, quote unquote, have aligned.
Um, by that, what, what we mean by alignment here is we start with this, you know, base pre-trained model, which is just, like,

(17:01):
trying to predict the next few tokens.
And we've finally gotten to a point at step three with a policy, or a model, that, um, follows instructions, but also does that in a way which is, um, hopefully aligned with human values and preferences.
Um, so this is, in a nutshell, what's happening under the hood when you, you

(17:22):
know, start interacting with these models.

Joshua Rubin (17:24):
So if I was to, like, you know, kinda recap a little bit, it seems like step one is basically, you know, this is the, the pre-training, uh, you know, the expensive part that requires consuming trillions of tokens, and, um, you know, you're trying to get the model to understand the mechanics of language and build those kind of base abstractions that it can use to reason in the future.

Amal Iyer (17:45):
Right, right.

Joshua Rubin (17:46):
I, I've seen, though, like, if you try to have a conversation with a, you know, a pre-trained model, as much capability as is sort of latent in it, you know, you'll say something like, how big is a dog?
You know, and it responds by giving you a list of questions.
How big is a chicken?
How big is an alligator?
Right.
How big is a tree?
Right.
Like, it's like a, um, you know, it's finishing the poem that you've started.

(18:10):
Right, right.
Whatever, you know, maybe it's seen a list of questions in the past, but it's certainly not answering questions.
Right.
Right.
You know, and then the second phase is really about sort of nudging the model, uh, you know, subtly into behaviors that are different, right?

Amal Iyer (18:25):
Yeah.

Joshua Rubin (18:25):
Capturing the human feedback in a way, modeling that to overcome the sparsity of the human feedback, so that you can really, like, um, turn the crank on the training, once you have a model training another model, and kind of nudging it into a kind of final, uh, a final form.

Amal Iyer (18:43):
Right?

Joshua Rubin (18:43):
Sort of like, uh, behavioral adjustment.

Amal Iyer (18:47):
Mm-Hmm.

Joshua Rubin (18:47):
That's, I think, what steps two and three accomplish.

Amal Iyer (18:49):
Right.

Joshua Rubin (18:49):
Does that sound about right?

Amal Iyer (18:51):
Great way to put it.
Yeah.

Joshua Rubin (18:52):
So, so, you know, it sounds simple if you describe it that way.
Right.
Uh, so you wanna talk about, like, the kinds of problems that can pop up and why that's more challenging than it looks?
I mean, I think this is a great system and it does amazing things, but, um,

Amal Iyer (19:05):
Yeah, I mean, it, it's pretty amazing what, you know, something like this has, um, elicited in terms of incredible sort of, you know, progress.
Um, but, but it doesn't come without challenges.
Right.
Uh, and I really like this, uh, framing from this paper from Stephen Casper, which talks about challenges of learning from human preferences.

(19:29):
And I like this bucketing across sort of three buckets, which is around human feedback, the reward modeling, and then actually, like, training a policy.
Um, so I think the, the human feedback piece, I would say a lot of it, uh, is shared with, you know, say you were training a content moderation system, um, and you require humans to provide feedback.

(19:54):
There's a lot of commonality with our current sort of ML paradigm.
Uh, so, you know, you might have evaluators that disagree with each other.
So how do we, how do you resolve something like that?
Uh, you might do majority voting, but in that case, are you, um, you know, aligning to majority preferences? Um, you might have data quality

(20:15):
issues because you didn't actually sample, quote unquote, your evaluators in a fair manner which is representative of the larger society.
So you might have data quality issues, uh, or bias issues there.
So a lot of this is, you know, shared with, um, at the risk of, uh, calling it,

(20:36):
you know, non-general systems, uh, traditional ML. Let's just call that traditional ML.
Um, I think one thing, though, I would highlight in human feedback that is unique, that is challenging for training general models, is, um, the, the difficulty of oversight.
So what do we mean by that?

(20:57):
Uh, so imagine, um, you know, say I, I sign up to be, uh, a human feedback provider, uh, with one of these, um, companies that are training these large models.
And I don't know a lot about, say, chemistry.
Uh, but now I have to oversee, uh, and steer these models and align these models.

(21:19):
And there are, like, some questions related to chemistry.
So now I have to actually, like, you know, provide oversight in an area where I am not an expert.
So there is difficulty associated with training general models, where the, the people providing annotations may not have the right domain expertise, and

(21:39):
in the, in the short term or the medium term, we can get away with that by actually asking domain experts to weigh in.
So if you're training, um, a system for providing legal assistance, um, you know, there are companies that are trying to bring in lawyers to oversee these models, and similarly for medical applications, you might bring in medical practitioners to oversee these models.

(22:00):
So in the near term, uh, we might solve this, uh, with, uh, domain experts.
Um, but as we extrapolate, you know, as the capabilities grow, 'cause these models are improving in a recursive fashion, right?
As, as you mentioned, you're, uh, you're using one model to sort of improve the other, and you're, like, building on top of the system.

(22:20):
It's not hard to imagine, you know, either it might devolve, or it, it could potentially actually lead to superhuman capabilities in some dimensions.
So how do you oversee that is actually a real challenge, and we'll talk a little more about it, uh, in the rest of the talk here, but, um, sort of moving to reward modeling and policy, away from human feedback.

(22:43):
I want to, uh, you know, there are a bunch of terms here, but I wanted to sort of, uh, describe an interesting, um, example, uh, from the reinforcement learning community, um, which is, so a couple of years ago, um, I think it was some researchers from Berkeley, they were trying to train, uh, a robot for grasping objects.

(23:05):
Um, uh, and they didn't want to sit there and provide, uh, feedback on whether the robot had correctly grasped an object or not.
So they devised, um, you know, uh, an automated, uh, reward mechanism.
So they put a camera on top.
Um, so the robot was in a cage and it had to grab objects.

(23:26):
So they put a camera on top.
It had a 2D view.
And the reward was when the camera felt like the, um, uh, robot arm had grasped an object.
And so you're like, this is great.
I can scale this.
Um, if you're a grad student, you don't need to sit there and provide feedback all the time.

(23:46):
Um, but the feedback is sparse, right?
Because, uh, the feedback comes only when the, uh, the, uh, robot arm has sort of grasped the object.
So that's when, with the camera, you have another CNN running which is trying to predict, uh, if the grasp has happened or not.
So that's the source of feedback.

(24:06):
Uh, what happened was, in one such training run, uh, they noticed that the policy had, uh, hacked the, the reward mechanism: instead of grasping the object, it positioned the arm, uh, right above the object in such a way that the 2D projection felt that

(24:31):
the, uh, robot had grasped the object.
So it was an unintended consequence, right?
Like, the, the robot is not, or the policy is not, trying to actively deceive the reward mechanism, but it hacked the reward mechanism by sort of just positioning the arm on top of the object, but not necessarily grasping the object.
So there's a bunch of interesting problems that this example portrays.

(24:55):
One is that of problem specification, right?
Like, you misspecified the problem and you created a reward mechanism which actually did not capture what you wanted the system to do.
Um, and then you have this problem where the policy actually hacked your reward system.
Um, it was, again, in this case an unintended consequence.
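The failure mode generalizes beyond robots: whenever the reward is a proxy (here, "the overhead camera thinks the gripper overlaps the object") rather than the real goal ("the object is actually held"), an optimizer can satisfy the proxy without the goal. A toy illustration of that mismatch; all the state fields here are made up for the example:

```python
# Toy illustration of reward misspecification: the proxy reward (2D overlap as
# seen by an overhead camera) can be maximized without achieving the intended
# goal (actually grasping the object). All fields are invented for illustration.
def proxy_reward(state: dict) -> float:
    # What the camera-based reward "sees": gripper over the object in 2D.
    return 1.0 if state["gripper_xy"] == state["object_xy"] else 0.0

def intended_reward(state: dict) -> float:
    # What we actually wanted: the object is held.
    return 1.0 if state["object_grasped"] else 0.0

hacked_state = {"gripper_xy": (3, 5), "object_xy": (3, 5), "object_grasped": False}
print(proxy_reward(hacked_state), intended_reward(hacked_state))  # 1.0 vs 0.0
```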

(25:18):
Um, and so the, this reward and policy, um, setting, uh, you know, it can lead to unintended consequences.
So we have to be very careful about how we, um, provide feedback to the models and how we align them, because it might lead to, um, you know, unanticipated

(25:41):
and undesirable consequences.
Um, and one last point that I'd like to add: in the, in the context of LLMs, we do see, um, similar behavior.
I think there was an interesting paper that came out which talked about sycophancy.
So, and I think you can try it.
I, I would highly encourage you to try it with any of these, um,

(26:02):
LLMs that you might be using.
Um, and one of the emergent sort of phenomena that you see, uh, when you try to align these models is that they'll, they're essentially trying to get your approval, right?
They're trying to get approval from human annotators.
So one of the unintended consequences of this alignment process is

(26:23):
that they'll, they'll show some degree of sycophancy.
So if you, say, ask a, a question, something like, is video gaming, uh, great for developing young minds?
Um, and then it'll provide you some, you know, it'll equivocate, it'll say, um, and rightly so, right?
There's some research which shows it's great for development, some other

(26:45):
research shows, you know, you gotta do it in the right proportion, et cetera.
And then if you ask it a question, say if you assert something like, I believe that this is helpful,
it'll actually say, yeah, I agree with you, and this is why.
And you delete that and you say, okay, I, I believe that this is unhelpful.
Um, and it'll again agree with you.
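The probe described here is easy to script against any chat model: ask a neutral question, then re-ask it with opposite stated beliefs and compare the answers. A sketch, where `ask(prompt)` is a hypothetical wrapper around whichever chat model you are testing, not any specific vendor's API:

```python
# Sketch of the sycophancy probe described above: ask the same question
# neutrally and with opposite stated beliefs, then check whether the model
# simply flips to agree with whichever belief was asserted.
# `ask` is a hypothetical wrapper around the model under test.
def sycophancy_probe(ask, question: str, claim: str) -> dict:
    return {
        "neutral": ask(question),
        "agree_frame": ask(f"I believe that {claim}. {question}"),
        "disagree_frame": ask(f"I believe it is not true that {claim}. {question}"),
    }

# Example usage:
# results = sycophancy_probe(ask,
#     "Is video gaming good for developing young minds?",
#     "video gaming is good for developing young minds")
# If the two framed answers just mirror the stated belief, that's sycophancy.
```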

(27:05):
So the model, and one could argue the model doesn't have an internal belief system, um, uh, and we don't know about it.
There's a lot of work trying to sort of probe the internals of the model, but this, this notion, this behavioral sycophancy, was not actually explicitly encoded in the model.
It's, it emerged out of this alignment, um, uh, process.

(27:28):
Uh, so it's something that we have to be very careful about, because if you do sort of naive alignment, especially for increasingly capable models, we might see emergence of, um, properties that are undesirable.

Joshua Rubin (27:43):
Yeah, that's so super, super interesting.
So it kind of brings up two things, but I, I, you know, you know this, but, you know, I wrote a blog maybe six or eight months ago where I was trying to play 20 questions with ChatGPT.
And I was having it guess things, and I was asking it questions. You can find this blog on the Fiddler site if you're interested.
Um, it's called something like "What was ChatGPT Thinking?"

(28:05):
Um, but, uh, you know, I'd ask it to come up with a clue, and then I would ask it questions and try to guess what it's thinking.
Um, and I think what was interesting for me, I mean, one of the interesting observations was it never had a, a real clue in mind, right?
It's just completing a dialogue.

Amal Iyer (28:19):
Mm-Hmm.

Joshua Rubin (28:20):
Um, but I could almost always steer it towards, uh, a particular answer, um, uh, 'cause, because it had this strong bias to yes answers, right?

Amal Iyer (28:32):
Mm-Hmm.

Joshua Rubin (28:32):
Bias to confirmation.
I never understood actually why that was, but you could basically, you know, if I wanted the clue it was thinking of to be a spaceship, I could ask it guiding questions, and the bias towards yes in yes/no questions allowed me to basically select whatever, you know, final answer I wanted by asking, asking the right questions, right?

Amal Iyer (28:52):
Right, right.

Joshua Rubin (28:53):
So I, I didn't understand how that worked, and I think you've described it, uh, really, really clearly here.
Um, the other thing that came up, like, I think we were discussing a couple of days ago, is this, um, tendency to sound super, super confident.

Amal Iyer (29:07):
Right?
Right.

Joshua Rubin (29:08):
Uh, you know, it sort of works in real life, when we interact with other humans, that we tend to believe things that sound confident, and, and if that sort of gets accidentally codified in our reward model in this process, then, you know, uh, the, uh, LLM, the, the policy in this case, is gonna sort
(29:28):
of, um, exploit the fact that confidence seems to be the behavior that we want.

Amal Iyer (29:33):
Right?

Joshua Rubin (29:34):
So, so yeah, this, this, um, this yes bias, and I think the overconfidence that a lot of us worry about in LLM responses, uh, could totally both be, you know, results of, um, you know, imperfect, uh, model alignment.

Amal Iyer (29:52):
Right, right.
And, and I think this confidence, um, issue is problematic for us, especially as we build, you know, um, multi-step systems and not just sort of these dialogue systems, as we rely on these systems to automate a lot of things.
Um, we don't want these models to be highly confident, um, when they're taking
(30:12):
actions out in the world on our behalf.
Um, we want them to be well calibrated.
Um, and I think there's an interesting graph, I don't know if there's been more work done in, uh, in this direction, but I think the instruction tuning paper from OpenAI talked about how the model calibration goes out of whack

Joshua Rubin (30:31):
Hmm.

Amal Iyer (30:32):
Um, uh, with alignment using RLHF.
So I think this sort of calibrating confidence is, uh, sort of a classic sort of ML problem, and I think we, we've only scratched the surface here.
So it gets elicited in the responses, but even if we start looking at the actual, like, probability distribution across tokens, yeah, um, uh, I think

(30:52):
alignment tends to, um, you know, um, uh, muck around with that as well.
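"Well calibrated" has a concrete meaning: when the model is 70% confident, it should be right about 70% of the time. A minimal expected-calibration-error sketch over a set of predictions; the toy inputs below are invented, not any particular model's outputs:

```python
import numpy as np

# Minimal expected-calibration-error (ECE) sketch: bucket predictions by
# confidence and compare average confidence to empirical accuracy per bucket.
def expected_calibration_error(confidences: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return ece

# Toy example: a model that claims ~0.95 confidence but is right ~60% of the time.
conf = np.full(1000, 0.95)
hits = (np.random.rand(1000) < 0.6).astype(float)
print(expected_calibration_error(conf, hits))
```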
Um,

Joshua Rubin (30:58):
Just, just to interrupt you for one sec, just a quick time check here.
Let's spend about five more minutes chatting, and then we can maybe cut over to questions.
And so people out there, if you have questions, please, uh, please get ready, and, you know, think about, think about, you know, where you want this conversation to go in the next couple of minutes.

Amal Iyer (31:16):
Yeah, that sounds good.
Um, I think the other, sort of other thing I wanted to sort of bring up, Josh, was, um, how, how is the, uh, research community thinking about this?
Um, I think, uh, I, I was fortunate enough to attend, uh, the Alignment Workshop right before NeurIPS, um, uh, in December.

Joshua Rubin (31:37):
Super jealous

Amal Iyer (31:38):
And, uh, you know, lots of amazing people working on frontier, um, models.
And a lot of my thinking has been influenced by, uh, you know, that, that particular two-day workshop, um, and kudos to the organizers. Um, and the four areas of research that, um, you know, the, the community seems to be, um, uh,

(32:01):
putting wood behind, uh, are, um, you know, these, these four buckets here.
And I'll, I'll quickly provide a flavor for what these look like.
Um, one is scalable oversight, which we briefly talked about.
And, uh, to summarize, um, it's the, it's, it's the work to sort of augment

(32:22):
the ability of humans to oversee these models.
And in operational terms, um, if I were to sort of, um, you know, break it apart, I think there's the problem of overseeing and providing supervision during training.
Um, so let's call that scalable supervision, and a lot of academic

(32:43):
work, um, and work from leading labs, is focused on scalable supervision.
Um, and the second bucket, which I think we'll need to sort of increasingly start thinking about, is scalable monitoring.
So once you've deployed these systems, these systems, uh, will be used in many interesting and unforeseen ways.

(33:06):
And how do you scale monitoring for those, um, for that scenario?
Uh, so it's something that, you know, we also actively think about here at Fiddler, about scalable monitoring.
And I think this is, it's gonna be an increasingly important area as we adopt these systems, um, across, um, our society.

(33:27):
Um, the second one is sort of related to, um, scalable oversight, but, um, uh, you know, it, it has a lot of overlap, but, um, it's, it's a classic, you know, um, ML topic around generalization.
So we train these, like, massive models.
We try to align them, and we can be, um, pretty much sure, you know, that

(33:50):
these models will be used in ways that we never intended, right?
Uh, so we want to study generalization, um, and robustness.
'Cause you want these models to be, say you align them to human values and you assess them on certain scenarios, uh, even during production time you want those values to hold, so your model has to, to reliably

(34:15):
generalize across a range of scenarios.
So that, that's one bucket of study that, um, is going to get a lot of, uh, um, uh, you know, uh, resources.
And a lot of people are gonna start, like, working on this problem.
Um, and the third one, that, you know, Josh, I'm pretty sure, like, you, you have some comments on this, uh, is around interpretability.

(34:38):
Um, and the way I see it, you know, having a little bit of a neuroscience background, um, is: interpretability at different layers of the stack.
So, uh, at the most basic level, sort of mechanistic interpretability, like, what are the sub-circuits of the model doing?
Yeah.
What kind of quote unquote programs are they encoding in

(35:01):
their weights and activations?
Um, and then somewhere in the middle is the, the, um, um, you know, like, understanding, when you provide a query to the model, what are the sub-circuits and skills it's using, um, to come up with a response.
And then at the top of the stack, uh, sort of at a, you know, global level, uh, trying to understand what, what aspects of the training data

(35:27):
influenced certain model outcomes.
Uh, so can you trace back from a model generation back to the sequences in your training data, um, that were responsible for certain, um, um, you know, generations?
Uh, so I, I would say interpretability, I think, is, uh, something that we, we'll need to sort of really push forward on here, because

(35:51):
right now these models are black boxes.
We don't really understand how they work, and I think it's a greenfield area, um, as we enter this large language model space, uh, uh, and sort of further the research.
So I'd love for the community to sort of, um, you know, do more work here.
Uh, and finally, around governance.
I'm, I, I'll caveat this and say I'm not an expert at all in governance,

(36:15):
but I think, given the pace at which progress has happened and the pace at which we are seeing adoption, um, there's definitely a lot of interest from government, academia, and industry to establish, uh, regulatory standards.
And if nothing else, at least reporting about, you know, large, uh, training runs, um, when you, you might expose those models for, uh,

(36:39):
public use or beta use, et cetera.
So some amount of governance around that.

Joshua Rubin (36:45):
Nice.
Well, I, I see our, our first question has landed in the Q&A.
Uh, so how is Fiddler thinking about operationalizing alignment and safety for LLM-based apps?

Amal Iyer (36:55):
Um

Joshua Rubin (36:56):
Do you wanna take a, a first stab at that, and I can jump in with thoughts afterwards?

Amal Iyer (37:01):
That sounds good.
Um, yeah, and, and I think sort of going back to, sort of, uh, um, you know, scalable oversight, uh, we, we tend to call it, from, from an operational term, uh, or operational prism, monitoring.
We, we want our, um, you know, folks who are deploying LLM
(37:21):
apps to actively understand what kind of usage is happening.
Um, you know, you can have all kinds of malicious use as well, right?
Like, just like any new piece of technology, your, uh, LLM-based app might be subjected to malicious use or adversarial use.
So you want to understand if something like that is happening and shut down certain accounts.

(37:41):
Um, but also understand, sort of, you know, uh, is, um, your model responding in ways that are helpful and not harmful, uh, to user requests.
So getting visibility into that.
So I think, from an operational, um, term, we call it monitoring, and, uh, we are using a bunch of, um, models that are scoring these

(38:06):
larger LLMs on different dimensions.
So we, we are thinking about, like, in the sense of using a bunch of specialized small-scale models to score these larger models, um, to make it more scalable.
Uh, because, you know, I think that one of the, the constraints with, um, scaling monitoring is that, if it becomes

(38:29):
extremely expensive to monitor, um, most teams will prefer not to.
So we want to make it very, um, cost effective for our, our customers to monitor their LLM apps.
Um, and, uh, sort of moving upstream,
um, you know, talking about robustness, we have this open source tool, um,

(38:50):
called Fiddler Auditor that we, we, uh, that we've open sourced to assess sort of the reliability of these systems.
There's a lot more that needs to be done there, so we invite contributions, but we do feel that reliability and generalization is an understudied topic.
Um, and even from a user standpoint, like, if you are, uh, say, a product owner of an LLM-based app, you really want a great sort of iterative

(39:14):
loop between monitoring, understanding where the model's underperforming, and then going back and, say, doing some kind of prompt engineering or fine-tuning, understanding generalization, robustness, pre-production, before you go into, uh, production.
So I'll pause there, and Josh, I'd love to hear your thoughts as well.

Joshua Rubin (39:32):
Yeah, I think, you know, my, my, I think those are great, great points.
I think, um, you know, the first thing is, you know, we're talking about alignment.
I think it's fascinating.
I think the future has everything to do with alignment.
Um, you know, we haven't gotten the chance to talk about superalignment and what happens with AGI; maybe if we have a minute at the end, it's, it's worth touching on.
But, but the truth is that most of the teams we talk to are not,

(39:56):
you know, they don't frame the problem yet in terms of alignment.
You know, the frontier labs are there, they're thinking about the hard problem of what the future looks like.
But, you know, there are plenty of teams who are plenty sophisticated, who are working mostly in the prompt engineering space, um, and developing amazing applications with prompt engineering.
Um, you know, and you know, there's a couple of pieces to, to the, the
(40:19):
story of, you know, LLM safety. I mean, there's, uh, sort of a, an offline analytics component.
Like, can you capture feedback from your users?
Can you use, as Amal says, can you use models for proxy feedback or for safety feedback?
Right?
Right?
Can you find out if a response was biased or, um, uh, you know, uh,

(40:41):
or a model hallucination, right?
Was it unfaithful to some source material that it was asked to summarize?
You know, can you capture that in a, in a, um, uh, offline capacity, in a data store, so that, you know, as model developers, people know that there are certain kinds of topics that their models have problems with?

(41:02):
Or if you have a, you know, a RAG application, retrieval augmented generation, where, uh, you are summarizing a knowledge base or, um, uh, frequently asked questions, uh, or a customer service, uh, database, um, you know, uh, is there a topic that your customers have started to ask you about, because the world has changed, where, uh, that's a missing piece of information in the

(41:26):
database your LLM is drawing from?
Right.
Just having those, like, sort of, um, you know, basic kind of operational metrics that you can use to improve your model is really important, right?
Um, you know, we talk about guardrails, right?
Can you get real-time signals that can be, um, uh, you know, used to veto a bad LLM response in real time?

(41:49):
Something that's in the, you know, the, um, the production, uh, code path, um, that's fast enough to give a smart response.
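In practice a guardrail like that is just a fast scoring step in the serving path: score the candidate response and veto or fall back if it crosses a threshold. A hedged sketch of the shape of that code; `generate` and `safety_score` are hypothetical stand-ins for your LLM call and your small scorer model, and the threshold is an arbitrary placeholder:

```python
# Sketch of a real-time guardrail in the serving path: a small, fast scorer
# vetoes a candidate LLM response before it reaches the user.
FALLBACK = "Sorry, I can't help with that request."

def guarded_reply(generate, safety_score, prompt: str,
                  threshold: float = 0.8) -> str:
    candidate = generate(prompt)
    score = safety_score(prompt, candidate)  # e.g., estimated probability the reply is safe
    # Logging (prompt, candidate, score) here also feeds offline monitoring.
    return candidate if score >= threshold else FALLBACK
```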
Um, and then I think, you know, in my aspirational, um, thinking about, you know, application development, for us it's, um, does this become a user preference store that can be used for the

(42:09):
kind of fine-tuning in the future?
Right.
If you're gathering your users' preferences, if you're getting those thumbs up and thumbs down from your users and logging it, um, you know, when you get to the point, and when the market gets to the point, where we are actually talking about alignment as a way of, um, dialing our models in for our specific applications to be best aligned with human preferences,

(42:31):
you want to have all of that data.
I mean, feedback does tend to be sparse, right?
Like, how often do you click a thumbs up or a thumbs down, um, when you're dealing with a piece of software, right?
Pretty, pretty unusual, right?
And so that is really sparse.
It's the reason why you need a, you know, one of the reasons why you need a, uh, um, you know, a reward model when you do this, this alignment.

(42:52):
Um, but you want to capture that data so it's there and you can iterate in a, in a, um, efficient way.
Um, so yeah, a user preference store is kinda my last thought.
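One concrete shape for that user preference store is an append-only log of (prompt, response, feedback) records that can later be mined into preference pairs for reward modeling or fine-tuning. A minimal sketch; the file layout and field names are made up for illustration:

```python
import json
import time
from pathlib import Path

# Minimal sketch of a user-preference store: append thumbs-up / thumbs-down
# events as JSON lines so they can later be turned into preference pairs.
# Schema and path are illustrative only.
LOG_PATH = Path("preference_log.jsonl")

def log_feedback(prompt: str, response: str, thumbs_up: bool,
                 user_id: str = "anonymous") -> None:
    record = {
        "ts": time.time(),
        "user_id": user_id,
        "prompt": prompt,
        "response": response,
        "label": "preferred" if thumbs_up else "rejected",
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
```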

Amal Iyer (43:01):
No, I think I agree.
I think there's an interesting sort of switch that ML practitioners would have to make, from this, like, label-centric way of thinking to a more sort of user-preference-centric way of sort of tuning and dialing in their models.
And I think, um, that's definitely related to alignment, right?

(43:21):
Like, uh, not only do you want the capabilities of the model to align with what the user's, uh, intent is, but also, you know, the safety aspects of it.
Um, and I, I think this might be an interesting sort of, unless there are more questions right now, uh,

Joshua Rubin (43:39):
We have a question, but, but take a, take 30 seconds
and, uh, if you can for a minute.

Amal Iyer (43:43):
Yeah, I, I, maybe I'll skip over this one, but I would highly
recommend, uh, for folks who are interested, to take a look at this paper from Sam Bowman from Anthropic, uh, where they're trying to sort of, you know, uh, peer into the future and say, can we, can we look at this oversight problem and can we start understanding, um, uh, how we might tackle it in the future.

(44:08):
Um, and I think the next graphic will make it very clear.
I love this graphic.
This is from Colin Burns and team at, uh, the OpenAI Superalignment team.
So I, I think traditional ML, I think a lot of us would resonate with this, right?
Like, we provide labels, say you have a sentiment, uh, classifier or, um, entity recognition.
Um, in any of the ML tasks, humans are, are the ones that are actually teaching

(44:32):
these, um, quote unquote ML students.
Um, but if at all there is a world in which, say, and the dotted line is human-level performance, we get to a world where we have access to systems that are much more capable in not just one dimension, um, like imaging, et cetera, but in a multitude of dimensions, then how do

(44:55):
you align these systems, and how do you make sure that you don't have something called instrumental convergence, which is sort of, um, you know, uh, quote unquote skills or behaviors that might actually be not aligned with human interests or values.
So things like power seeking, deception, uh, et cetera.

(45:15):
So we, we don't want these models to, um, sort of deceive us, um, or play us, right?
So, um, I, I think the, the paper's interesting in the sense that they're trying to sort of mimic this.
We of course don't have such systems today, but they're trying to study, uh, this sort of paradigm using, uh, a smaller, quote unquote weaker,

(45:38):
supervisor and using a strong student.
So concretely, they are trying to, um, mimic the setting by using a GPT-2 model, which is an extremely inferior model, compared to a GPT-4 student, which is a much more capable model.
So they're trying to understand how can we, um, study scalable oversight and

(46:01):
generalization in this paradigm here.
Um, so I would highly recommend this paper for those who are interested.
If you have time, I can, I can talk more about it later.
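The weak-to-strong setup described here can be stated in a few lines: fine-tune the strong student on labels produced by the weak supervisor, then ask how much of the strong model's latent capability survives. A schematic sketch only; `finetune`, `predict`, and `accuracy` are placeholders for whatever training and evaluation stack you use, not the paper's actual code:

```python
# Schematic sketch of the weak-to-strong generalization setup discussed above:
# a weak supervisor (think GPT-2-class) labels data, the strong student
# (think GPT-4-class) is fine-tuned on those noisy labels, and we measure how
# much of the gap to ground truth the student recovers.
def weak_to_strong_experiment(weak_model, strong_model, inputs, ground_truth,
                              finetune, predict, accuracy):
    weak_labels = [predict(weak_model, x) for x in inputs]
    student = finetune(strong_model, inputs, weak_labels)
    ceiling = finetune(strong_model, inputs, ground_truth)  # strong model's own ceiling

    weak_acc = accuracy([predict(weak_model, x) for x in inputs], ground_truth)
    student_acc = accuracy([predict(student, x) for x in inputs], ground_truth)
    ceiling_acc = accuracy([predict(ceiling, x) for x in inputs], ground_truth)

    # "Performance gap recovered": 0 means the student only matches its weak
    # supervisor, 1 means it reaches the strong model's ceiling.
    # (Held-out evaluation is omitted here for brevity.)
    return (student_acc - weak_acc) / max(ceiling_acc - weak_acc, 1e-9)
```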

Joshua Rubin (46:11):
Nice, nice.
Uh, so we have a question here from Yusaku in the, um, in the chat.
Uh, what are the major interpretability issues when using LLMs?
Um, I don't know.
Maybe, maybe I'll jump in on this one.
So, so I, uh, we, we had a summit in November, and I, um, I did a presentation called "Can LLMs be Explained", which was largely literature review, uh,

(46:36):
and sort of open science questions.
Maybe we can drop that in the, in the chat.
Um, but, uh, you know, Amal referenced, you know, some of the mechanistic interpretability work, right?
So there's, uh, you know, there was something by OpenAI, uh, in, I don't know, maybe September of last year.

(46:56):
I made some notes here for the names of these.
Um, uh, well, so, you know, OpenAI did some work using GPT-4 to, uh, characterize, you know, activations of particular neurons in a much smaller GPT model.
Um, there was also an amazing piece of work by Anthropic where they

(47:17):
used a sparse autoencoder to try to understand, uh, how activation patterns correspond to different kinds of, um, different kinds of concepts.
And they concluded from that, you know, that, uh, these models are using their neurons in a polysemantic way.
So basically, a particular neuron can have more than one meaning layered into it.

(47:39):
And they get used in different combinations with other neurons to represent, you know, basic concepts in a, in a, um, a very sophisticated way.
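The sparse-autoencoder idea behind that work is simple to state in code: train an overcomplete autoencoder on a layer's activations with an L1 sparsity penalty, so that each learned feature hopefully corresponds to a more interpretable direction than a single neuron. A minimal PyTorch sketch; the dimensions and penalty weight are arbitrary placeholders, and real activations would come from a model's internals rather than random noise:

```python
import torch
import torch.nn as nn

# Minimal sparse-autoencoder sketch of the kind used in the interpretability
# work discussed above: decompose a layer's activations into an overcomplete,
# sparse set of features. Sizes and the sparsity weight are placeholders.
d_act, d_features, l1_coef = 512, 4096, 1e-3

encoder = nn.Linear(d_act, d_features)
decoder = nn.Linear(d_features, d_act)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def sae_step(activations: torch.Tensor) -> float:
    features = torch.relu(encoder(activations))  # sparse feature activations
    reconstruction = decoder(features)
    loss = ((reconstruction - activations) ** 2).mean() + l1_coef * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)

# Toy usage with random "activations" standing in for real model internals:
print(sae_step(torch.randn(64, d_act)))
```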
Um, and, and the, the through line of these two papers, which were of a similar time period, was that, you know, the story was complicated.
Uh, you know, they were barely at the, you know, the sort of point where they were

(48:01):
pushing their capability level.
And the models they were interrogating were very simple models by today's standards.
So, you know, um, GPT models that are sort of, um, you know, uh, a few years out of date.
Um, and so optimism was pretty limited that anytime soon, like, mechanistic interpretability, understanding

(48:24):
the microscopic behavior of the underlying components of the model, is going to lead to sort of a deep understanding of how our complex models work.
I mean, specifically around, you know, when you start to get to these, um, emergent properties and abstractions, like question following itself, um, uh, you know, the model's intent, the model's bias.

(48:48):
You know, you might be able to get token-level information from mechanistic interpretability, but there are definitely limitations in terms of these higher-order concepts, right?
Um, and this is really different than our experience with explainability in more traditional ML.
Um, the other thing I would add here is that, uh, you know, model self-explanation seems to be pretty, um, pretty fraught as well.

(49:11):
Um, so, you know, there was a paper out of Microsoft Research, right when they released, um, GPT-4, called "Sparks of Artificial General Intelligence: Early Experiments with GPT-4."
And there's a whole section there.
It's really interesting.
Um, you know, they got a bunch of, you know, the industry's experts on, um, uh, interpretability to, you know, say what they could about GPT-4.

(49:33):
Um, you know, and they sort of characterized model self-explanation, asking the model for its own explanation, in, in two terms.
There was something they called, um, uh, output consistency, and process consistency; those were the two terms that they used.
So basically output consistency was, you know, when you ask a model for an

(49:56):
explanation, is it a valid explanation of, um, you know, the thing that it had previously produced as a, as a statement?
Um, you know, for the most part, output consistency was pretty good.
Um, process consistency is, you know, is the model's reasoning broadly applicable to a set of analogous questions?
Um, and, and, you know, self-explanation really pretty

(50:19):
much came off the rails there.
Um, you know, and so they have a bunch of examples where, you know, the model doesn't reason, uh, you know, the model's self-explanation for its reasoning was not, not consistent across analogous examples.
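The process-consistency test is also easy to operationalize: elicit the model's explanation on one example, then check whether its answers on analogous examples actually follow that stated reasoning. A sketch, with `ask` again a hypothetical wrapper around the model under test; judging whether the answers honor the explanation is left to a human reviewer (or another model) reading the output:

```python
# Sketch of a process-consistency check in the spirit described above: get an
# answer and an explanation on one question, then collect answers on analogous
# questions to see whether they follow the stated reasoning.
def process_consistency_probe(ask, question: str, analogous_questions: list) -> dict:
    answer = ask(question)
    explanation = ask(f"{question}\nYou answered: {answer}\nExplain your reasoning.")
    return {
        "original": (question, answer, explanation),
        "analogous": [(q, ask(q)) for q in analogous_questions],
    }
```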
Um, there's also a more recent paper, um, uh, Turpin
et al. from May of last year, uh, called "Language Models Don't Always

(50:41):
Say What They Think," um, in which they really drill into, you know, questions of, of bias in the model, and whether or not the model's self-explanation can describe its own internal biases as part of its explanation for why it made a certain assessment of a situation.
Um, and, you know, they, they had, there are some really pretty

(51:03):
dramatic examples of, like, racial bias in its assessment of stories.
Um, and its self-explanations basically conform to, uh, uh, you know, their, their, uh, they, they don't reveal the underlying bias, right?
Um, that, uh, the self-explanation makes up a story.

(51:26):
Uh, you know, one that it thinks you want to hear, or that, you know, it's, it's not an assessment of why the model made its initial decision; the model's not introspecting in any way, which is, is the real answer, right?
It's, uh, it's, it's producing the next token, right?
It's producing a, a, you know, a plausible, uh, uh,

(51:47):
you know, phrase completion based on everything it's read on the internet, right?
And so while it's sort of, um, tantalizingly, uh, uh, I don't know, sort of appealing to ask it for an explanation, because these things behave or appear to us in human-like ways because of alignment, um, it's

(52:07):
easy to assume that self-explanation is doing more than it actually is.
Um, and so it seems problematic.
Um, so I've used probably too much time here.
No, that was great.
There are some narrow regimes where there are some tools that may be helpful for, um, you know, assessing explanations, but it's, it's a very challenging thing right now.
So I would, I would refer you to our YouTube recording from, uh,

(52:29):
from the November summit, I think.
I don't know if you have any more thoughts, Amal, in our next couple of minutes.

Amal Iyer (52:34):
Um, yeah, I think broadly, like, it sort of, um, reminds me of, you know, some of the challenges that neuroscientists have in understanding, um, you know, biological brains.
So with mechanistic interpretability, we are trying to sort of assess, like, hey, what, what are the circuits that are responsible for certain skills and behaviors?

(52:54):
Um, so that's like bottom up.
And then the top down is sort of, you know, this, um, self-consistency of explanations by asking the model, uh, sort of like the behavioral, uh, neuroscience, uh, parallel, which is trying to look at the model top down.
And I think, unlike neuroscience, I think neuroscience has, um, you

(53:15):
know, I'm not an expert, but one of the real challenges has been to sort of meet in the middle from both the circuits level and from the behavioral, um, neuroscience level, like somewhere in the middle where you have a coherent sort of, you know, meeting ground.
Um, I think with our LLMs and, you know, these large models, what I'm optimistic about is

(53:36):
because we can actually tap into internal states, unlike biological brains where, you know, you at most have, like, 32 channels of, like, electrodes that you can, um, tap into, uh, or more invasive MRI scans.
So I think because they're more, the internal states are more accessible, I feel like over the next few years we'll see sort of this,

(53:56):
you know, convergence between this top-down and bottom-up approach.
And hopefully that'll lead to better understanding of these models.

Joshua Rubin (54:05):
Nice.
Well, we started a couple minutes late, so we've been given the blessing to, uh, to stay on a couple more minutes.
I dunno if you had any more thoughts.

Amal Iyer (54:12):
Um, I think, you know, sort of maybe just, like, some closing thoughts, um, around, um, you know, I, I don't want any of our sort of viewers or listeners to walk away thinking, oh man, these are, like, intractable challenges.
I think I'm pretty optimistic about, uh, you know, just the broad utility of these tools.

(54:35):
Um, and I, I do feel that the research community and the frontier labs are taking this problem seriously.
I do think we need more minds, more resources, to start thinking about safety and not just do capabilities research.
I think most ML researchers, scientists, practitioners, and, uh, now software

(54:56):
builders that are building these LLM apps, think, uh, primarily in terms of capabilities, and rightly so.
I think today we should be, uh, but if we continue seeing the pace of progress that we've seen over the past few years, um, I, I think we'd rather be in a position where we understand a lot about safety and alignment than,

(55:18):
uh, be in a state where we are forced to sort of respond and be reactive.
Um, so I think this is a great time for more folks to get involved, um, and contribute to research, contribute to best practices, also in, you know, adopting these LLMs.
Um, so, I'm largely optimistic, but we do need, uh, more minds, more

(55:38):
effort in this direction of not just capabilities, but also safety research.

Joshua Rubin (55:45):
Yeah.
Yeah.
I, I think I'm optimistic as well.
I think that, um, you know, we're in the business of creating an ML Observability platform, and, you know, we do plenty of work on, uh, the LLM stack now, and it's, you know, it's evolving rapidly.
I think it really is possible to build the right Observability layer, you know, or

(56:06):
buy the right Observability layer, right?
Whatever the right, you know, form factor is for you.
Um, you know, there, there is a right answer if you're thinking about, um, you know, Observability and responsibility and how you build that, you know, in parallel with, um, the application you're developing.
I, I think this is doable, and I think it's gonna evolve with,

(56:26):
you know, the, the technology.
Um, so, so I think maybe we stop there.
Um, and, uh, uh, thank you so much to Amal for spending an hour with me.

Amal Iyer (56:38):
This was fun, Josh, as usual.

Joshua Rubin (56:42):
And thank you to everybody out there who listened along with us.
Uh, don't hesitate to reach out if you have follow-up questions or wanna hear more about our product or how we're thinking about a specific problem.
Um, you know, I think for both of us, uh, a favorite activity is just hearing about new applications and, uh, learning what the challenges are.
'Cause it gets our minds spinning about, you know, uh, what kinds of, uh, tools should

(57:05):
be developed or what practices are right for addressing those sorts of situations.
So thanks to everybody out there, and, uh, wishing you a great day or evening, wherever you are.
Alright, take care.
Bye.