
July 7, 2025 | 17 mins

U.S. National Science Foundation-supported researchers are accelerating artificial intelligence technologies. Mingyi Hong, a professor at the University of Minnesota, discusses AI reinforcement learning strategies and the challenges of training experts.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:03):
This is the Discovery Files podcast from the U.S. National Science Foundation.
Artificial intelligence technologies are accelerating research, expanding America's workforce, and transforming society as AI becomes a larger part of the world around us. Advancements in computer vision, natural language processing, and speech recognition have led to widespread use of learning-based systems in

(00:26):
AI that interact frequently with humans in highly personalized contexts. Humans are using strategies such as reinforcement learning to train AI systems, and using AI systems to upskill the next-generation workforce.
We're joined by Mingyi Hong, a professor of electrical and computer engineering at the University of Minnesota. His optimization for artificial intelligence lab

(00:47):
is designing solutions for problems in data science, machine learning, and AI.
Professor Hong, thank you so much for joining me today.
Thank you, Nathan.
I want to ask you about reinforcement learning. What are some of the key components of this approach?
Reinforcement learning, abbreviated as RL, is a branch of machine learning where agents learn to make decisions by interacting with their environments.

(01:09):
The key idea here is to learn through trial and error, guided by some kind of reward. So the key components are the agent, a learner or decision maker who is trying to learn; the environment the agent interacts with; and the state, the current situation you are in, describing what the environment looks like and where you currently are.

(01:31):
Then there are the actions: the choices the agent can make at a particular time. Then the reward: after each action is made, what incentive or penalty do I receive? And finally the policy: the strategy that the agent eventually learns.
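To make those components concrete, here is a minimal sketch in Python of the agent-environment loop just described, using a made-up one-dimensional "walk to the goal" task; the environment, reward values, and epsilon-greedy policy are illustrative assumptions rather than anything specific from this conversation.

```python
import random

# Hypothetical 1-D environment: states 0..4, and the goal is state 4.
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                     # step left or step right

def step(state, action):
    """Environment: returns (next_state, reward); +1 only for reaching the goal."""
    next_state = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

# The agent's knowledge: a table of action values; the policy is derived from it.
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):             # trial and error over many episodes
    state = 0
    while state != GOAL:
        # epsilon-greedy policy: mostly exploit what was learned, sometimes explore
        if random.random() < 0.1:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state, reward = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += 0.5 * (reward + 0.9 * best_next - q[(state, action)])
        state = next_state

# The learned policy: the best action in each state (should be "step right" everywhere).
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES)])
```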
What would be an example of a reward for a machine learning program?

(01:52):
So let me give you an example: teaching a robot to walk. This is a very typical machine learning problem. Imagine you have a four-legged robot that needs to learn how to walk across a room. The robot doesn't know how to move initially, so it will just try out different sequences of movements, and sometimes some may fail and some may not.

(02:12):
But every time it moves closer to its destination without failing, it receives a positive reward, let's say plus one, or something positive. If it tips over, it gets a penalty. So as it gathers these rewards, the model tries to adapt and optimize, trying to increase the reward: oh, I see that I should do this.

(02:33):
And if I get a better reward, then I should go in that direction. So over time, using this kind of trial-and-error approach and improving its model, the robot will learn which sequences of actions lead to higher reward, and it will effectively teach itself to walk smoothly. So this is an example of how rewards are defined and how the entire process is done through reinforcement learning.
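As a rough illustration of how such a reward might be written down, here is a hypothetical reward function in Python for the walking example; the signal names and the specific numbers are invented for illustration and are not taken from any real robot.

```python
def walking_reward(prev_distance_to_goal, distance_to_goal, tipped_over):
    """Hypothetical reward for a walking robot: progress is rewarded, falling is penalized."""
    if tipped_over:
        return -10.0                   # tipping over gets a large penalty
    if distance_to_goal < prev_distance_to_goal:
        return 1.0                     # moved closer to the destination without failing
    return 0.0                         # no progress, no reward

# Example: the robot moved from 3.0 m away to 2.7 m away and stayed upright.
print(walking_reward(3.0, 2.7, tipped_over=False))   # -> 1.0
```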

(02:54):
So you mentioned a robot there, but what are some of the applications that people might see around them that rely on reinforcement learning?
For example, we all use ChatGPT. Chatbot fine-tuning is a very important application of reinforcement learning. Here the agent is the chatbot we're trying to train.

(03:15):
So the chatbot may have seen a lot of text, maybe all the text on the internet, so it knows how to speak English, or Chinese, or German. It knows the basic structure, but it needs to be aligned toward a particular way of human speaking, or a particular region, where some things can be said and some things are not

(03:37):
appropriate, and so on. So it needs to be aligned toward a certain population. Then the first question is where the reward model comes from. For a given question, you collect many, many answers. These answers can come from humans or from the language model itself, and then you ask humans to annotate and rank them: which one is the best, which one is good, which one is not acceptable, and so on.

(04:00):
And you collect thousands and thousands of data points like this, and then use this data to train the reward model. So now, with this reward model trained on human annotation data, we use a reinforcement learning algorithm to train, or align, the language model itself. Before this training, the language model should already know the basic structures

(04:21):
of English, of Chinese, of German, whichever language you are working on. However, it doesn't know what the appropriate answer to certain questions should be. So the language model will explore, and the reward model will tell it: hey, this is correct, this is not correct, this is okay, and so on. With this signal, the language model is tuned so that it gradually moves toward answers that have high reward.

(04:44):
So that's the process.
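To show roughly what training a reward model from human rankings can look like, here is a small PyTorch-style sketch that uses a pairwise (Bradley-Terry) preference loss; the tiny scoring network, random feature vectors, and training loop are placeholders, since the actual systems use full-scale language models and thousands of annotated comparisons.

```python
import torch
import torch.nn as nn

# Hypothetical reward model: scores a (question, answer) pair that has already been
# turned into a fixed-size feature vector. Real systems use a large language model here.
reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Placeholder preference data: for each question, features of the answer a human
# annotator ranked higher (chosen) and the answer ranked lower (rejected).
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

for _ in range(100):
    r_chosen = reward_model(chosen)        # scalar score for the preferred answers
    r_rejected = reward_model(rejected)    # scalar score for the dispreferred answers
    # Pairwise ranking loss: push the chosen score above the rejected score.
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model can now score new answers; a reinforcement learning
# algorithm would then tune the language model toward answers with higher scores.
```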
Is there a difference in using RL in large language models versus in a chatbot?
I think this is the most popular way of aligning or fine-tuning a chatbot, which is powered by a large language model. There are even more advanced ways of using this.

(05:05):
For example, let's forget about chatbots for a moment and just look at what a large language model can be used to do. People are now talking about using language models to plan things. For example, as an agent, I tell the language model to book a flight for me. The language model needs to know: okay, here is the instruction, and my first step is to go to this website.

(05:25):
And then to input this and that and get my results, then look at the results and say, oh, this flight looks good, I'll pick this one, go to Delta, or some other air carrier's website, and then click and purchase, and things like that. So it's a sequence of planning steps, and eventually there is a reward: you made the right decision, or you made the wrong decision, and so on.

(05:47):
And then the question is how to plan this sequence: how the language model can plan, step by step, by correctly making a decision at each step. It's also a very exciting direction where RL, reinforcement learning, can be used to train language models. This is, of course, well beyond the chatbot.
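As a loose sketch of that kind of step-by-step planning loop, here is hypothetical Python in which a language model repeatedly chooses the next action and the environment returns an observation; `call_language_model`, `execute`, and the scripted action names are stand-ins, not a real agent framework or airline API.

```python
def call_language_model(goal, history):
    """Placeholder: a real system would ask an LLM to choose the next step."""
    scripted = ["open_site", "search_flights", "select_flight", "purchase", "done"]
    return scripted[len(history)]

def execute(action):
    """Placeholder tool execution; returns an observation string."""
    return f"result of {action}"

def run_agent(goal, max_steps=10):
    history = []                                      # sequence of (action, observation)
    for _ in range(max_steps):
        action = call_language_model(goal, history)   # the model decides the next step
        if action == "done":
            break
        observation = execute(action)                 # feedback from the environment
        history.append((action, observation))
    # A reward at the end (the right flight booked or not) is what RL would optimize.
    return history

print(run_agent("book a flight from MSP to SFO"))
```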
What are some of the limitations

(06:08):
when you're getting RL into real-world applications?
Okay.
There are many, many limitations right now, I think. First of all, RL is very hard to train; it needs many millions of interactions to learn effectively, and that data is very hard to obtain. So this is one thing.

(06:29):
The second is how to specify the reward. Designing a good reward function is hard, and it is often misaligned with the true human intention. Let me give you an example. Say a cleaning robot is rewarded based on how shiny the floor looks. That is the reward function. Then at some point during training, the robot discovers: the shinier the floor,

(06:52):
the better the reward I get. And then it will dump water everywhere to create a shine, but not actually clean, just make it shiny. This is happening because the agent simply tries to maximize the reward, and the reward is not very well designed. So this is sometimes called the reward hacking problem.

(07:12):
For complicated tasks, this sometimes happens. So this is the second challenge: how to design a better reward model.
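A toy illustration of that reward-hacking failure: if the proxy reward only measures shine, an agent that just dumps water scores better than one that actually cleans. The state values below are invented purely to show the mismatch.

```python
# Toy floor state: how shiny it looks vs. how much dirt is actually left.
def proxy_reward(state):
    return state["shine"]                  # what the designer wrote down

def true_objective(state):
    return -state["dirt"]                  # what the designer actually wanted

after_cleaning = {"shine": 0.6, "dirt": 0.1}        # the robot scrubbed the floor
after_dumping_water = {"shine": 0.9, "dirt": 0.9}   # the robot just poured water

# The proxy reward prefers dumping water, even though the true objective does not.
print(proxy_reward(after_dumping_water) > proxy_reward(after_cleaning))        # True
print(true_objective(after_dumping_water) > true_objective(after_cleaning))    # False
```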
The third is how to make the model more robust, to make it safer. For example, take a chatbot. A chatbot has learned so many things from what it has seen; it has been trained.

(07:33):
So how do you prevent it from saying something it is not supposed to say? This is very hard to distinguish. Maybe some of these answers can be very helpful; they directly answer what the human user was asking. For example, how to, I don't know, how to make a bomb. So a helpful agent, if it learns to be very, very helpful, might directly say

(07:53):
something detailed on its own, which is obviously not appropriate. And then the last thing I want to say is that in many cases the agents trained by RL lack generalization, which means that if you move to a new environment, the old policy may not work. For example, a robot may be trained to learn how to pick up a red cup from a table.

(08:15):
If it moves to a new environment where all the cups are blue, it may have already forgotten what to do.
This is where you run into those limitations of data. It's like it might know how to do this part of it, but if it's this, what happens next?
Yeah, exactly.
Or like social norms. Like, maybe I wouldn't ask you about something like that bomb reference, but, like, how does it know that it's not supposed to tell you how to do this?

(08:35):
Yeah, exactly, right. So then, okay, one solution is obviously to give it a lot of data: I need to consider all the different kinds of nuances of the different things I should say and should not say, this and that. But again, you don't have that much data on this.
What are some of the challenges in training an AI agent or algorithm to be an expert?

(08:56):
Like, it seems like there could be a lot of complications in that.
Yeah. So I just mentioned this data limitation. High-quality expert demonstration is very rare, especially if we're not considering just conversation but something practical, something that really needs human demonstration. For example, say you want to teach an agent to drive, automated driving.

(09:16):
So where is the data coming from? You have to go out and collect it, probably take videos of how humans drive, and so on. Compared to the text data that has been available online, this amount of data is very, very small. Another important challenge here is really ambiguity. It's also related to how we define a reward model.

(09:36):
For example, say you want to learn from an expert pilot how to fly. You're observing how this pilot is doing, but they will make very subtle adjustments when landing the plane, and you certainly don't know where those are coming from. Is it because of the wind, because of habit, or because they anticipate some turbulence or something else? This is very hard for the agent to understand.

(09:59):
If you just look at the data, it's hard to learn the reward. Another important thing is edge cases. For example, say we want to learn how to play chess. You're observing a chess grandmaster who is playing regularly. However, there will be some moves that are so rare that they only make them once or twice in their entire lifetime.

(10:23):
But those are so important; maybe they help win the critical game, right? Those cases only happen once or twice, so how should it learn them? There's, again, not enough data. So those are important examples, but they happen so infrequently that they're very hard to learn as well. Those are, I think, some of the main challenges, mostly

(10:44):
related to training an AI agent to be an expert.
Kind of a quality-of-data issue, and the context.
Yeah, context, and yeah.
I had a conversation where we were looking at vision language models, and I was kind of wondering how that might relate to this.
Overall, the idea is very similar. What a vision language model gives you, going

(11:05):
back to the very first definition of reinforcement learning, is more context. Compared with a text-only language model, your environment is richer. You know what happens not only from the previous text exchanges in your conversation, but you also know your actual position, where you are, and what the context is.

(11:25):
The action space is also much more complex for vision language models. You're allowed to generate text, so you're allowed to reason: given a situation, the model is allowed to provide a reason for why it wants to do something, and so on. It will also be allowed to click certain buttons, so it can actually operate on some of the objects. So this gives you a much more complex action space as well.

(11:48):
So the entire training process will be more complicated compared with a text-only language model.
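As a loose illustration of that richer action space, here is a hypothetical Python sketch contrasting the single action a text-only chatbot has with the kinds of actions a vision-language agent might have; the action types are invented for illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class GenerateText:              # the only action a text-only chatbot has
    text: str

@dataclass
class ReasonStep:                # an intermediate reasoning step before acting
    thought: str

@dataclass
class ClickButton:               # operating on an object the agent can see on screen
    element_id: str

TextOnlyAction = GenerateText
VisionLanguageAction = Union[GenerateText, ReasonStep, ClickButton]

# Example: one step chosen from the vision-language agent's larger action space.
action: VisionLanguageAction = ClickButton(element_id="purchase-flight")
print(action)
```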
So there's an opposite kind of concept here: inverse reinforcement learning. What's the difference between inverse reinforcement learning and reinforcement learning?
First of all, let me say that they're actually not opposite to each other. Inverse reinforcement learning is something,

(12:10):
let's say, one step beyond reinforcement learning. Some of our earlier discussion pointed out that designing the reward model is important, and it's very hard. So inverse reinforcement learning goes one step beyond RL: in RL we need to have a reward model that guides us; for inverse RL you don't.

(12:30):
So instead of learning the policy from the reward, we learn the reward function itself by observing expert behavior. The key difference here is that in RL, given a reward, I give you a policy; in inverse RL, you give me expert demonstrations. You don't tell me what is good or bad, you just demonstrate for me, and then I will learn the reward.

(12:51):
I will try to understand: oh, why did you do this? At this given point in time, you should do this, not that, and so on. And then with this, I eventually get a good policy. That's the key difference between IRL and RL.
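As a very small sketch of that distinction, here is a toy inverse-RL flavor in Python: instead of being handed a reward, it looks at expert trajectories and infers a reward that is high in the states the expert actually visits. The chain environment, one-hot features, and crude feature-matching update are simplified assumptions, not a description of any particular lab's method.

```python
import numpy as np

# Toy setting: a 5-state chain; each state is described by a one-hot feature vector,
# so the learned reward is just one weight per state.
n_states = 5
features = np.eye(n_states)

# Hypothetical expert demonstrations: state sequences an expert actually produced.
expert_trajectories = [
    [0, 1, 2, 3, 4],
    [0, 1, 2, 3, 4],
    [0, 2, 3, 4, 4],
]

def feature_expectation(trajectories):
    """Average feature vector over visited states (the expert's behavior signature)."""
    visits = np.concatenate(trajectories)
    return features[visits].mean(axis=0)

expert_fe = feature_expectation(expert_trajectories)
uniform_fe = np.full(n_states, 1.0 / n_states)       # a naive, uninformed policy

# Crude feature-matching update: raise the reward on features the expert visits more
# than the naive policy does. (A full IRL method would re-solve the RL problem with
# the updated reward at every iteration; that inner loop is omitted here.)
w = np.zeros(n_states)
for _ in range(50):
    w += 0.1 * (expert_fe - uniform_fe)

reward = features @ w
print(np.round(reward, 2))    # states the expert frequents end up with higher reward
```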
Thinking about workforce training, once you've established an expert model, how hard is it to introduce that to a novice human?

(13:14):
It's a very, very good question, actually. This is related to some of the white papers we are writing. I don't think there has been too much work on this, so that's why we are trying to propose it. If you want to train a human this way, it's better to integrate the entire process. So suppose you have an expert, and then suppose you have a teacher, and then suppose you have a human. Now, what the expert can do is provide feedback:

(13:37):
what is good, what is bad. Now, the teacher is also an AI model, and the teacher's goal is not necessarily to become an expert, because the eventual goal is to teach the human. So it needs to be able to integrate the expert's feedback and then ask the right questions for this particular human.

(13:58):
So it needs to understand: what is the human's level of expertise at this particular point? What should be my next set of questions? And when I ask the next set of questions or the next set of tasks, what is the human's actual performance? And then I'll send that to the expert to ask, is this correct or not?

(14:18):
And based on this, I will summarize these interactions, provide feedback to the human, and then provide the next set of tasks. So I think a better way is, again, to integrate the entire process, and to have this kind of learning environment set up by training an AI teacher model that interacts with the expert model.

(14:38):
It's probably a better way compared with directly interacting with an expert and saying, hey, how should I learn from it?
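Since this is still at the proposal stage, here is nothing more than a schematic sketch in Python of the loop being described: an AI teacher picks a task matched to the human's level, the human attempts it, the expert model grades the attempt, and the teacher uses that feedback to choose the next task. Every function below is a hypothetical placeholder.

```python
def expert_grade(task, attempt):
    """Expert model: judges the human's attempt (placeholder logic)."""
    return attempt == f"solution to {task}"

def teacher_pick_task(skill_level):
    """Teacher model: chooses a task matched to the human's current level (placeholder)."""
    return f"task at difficulty {skill_level}"

def human_attempt(task):
    """Stand-in for the human learner's answer."""
    return f"solution to {task}"

skill_level = 1
for round_number in range(5):
    task = teacher_pick_task(skill_level)      # the teacher asks the right question
    attempt = human_attempt(task)              # the human tries it
    correct = expert_grade(task, attempt)      # the expert provides feedback
    skill_level += 1 if correct else 0         # the teacher updates its view of the human
    print(round_number, task, "correct" if correct else "needs review")
```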
I want to ask you about how NSF support has impacted your career.
So over the years, I have been receiving grants on topics related to designing optimization algorithms as well as reinforcement learning. Obviously, you can see that optimization algorithms are a very important component

(15:03):
when we talk about how to train a model and how to improve the model; it all involves optimization algorithms. And reinforcement learning techniques have been the cornerstone for training LLMs, and beyond. So these grants have been very crucial for supporting my lab's research, for recruiting students, and for me to attend conferences,

(15:24):
and so on. So I'm very grateful. Thank you.
For my last question, I want to ask you about the future and what's coming up. Where do you see your work going in the next few years?
Yeah, excellent question. I ask this question of myself many, many times. So I think one of the most exciting frontiers now is to use reinforcement learning to transform

(15:45):
large language models into autonomous agents, or systems of agents, where you could do multi-step reasoning and even solve scientific problems. This is sort of related to your previous question of what's beyond just the chatbot. I think these will be something really useful and will have a significant impact on society, on the scientific community,

(16:07):
or even beyond that. So here, as we alluded to before, RL, reinforcement learning, will play a key role, because we're essentially training an agent how to do planning, how to perform very complicated tasks. For example, one of the goals here is to ask whether a language model can assist

(16:28):
or plan, or even execute, a research workflow. First, I can discuss it with a language model: hey, I want to do this, I have this idea, is it good or is it bad? Let's have a discussion. And then it will go on to do a literature review, it will go on to do experiment design and analysis, then write the paper, and so on. This entire process, of course, will have human supervision.

(16:51):
But a language model itself, or a foundation model itself, maybe even going beyond a language model, it could be a vision language model as well, can conduct the process step by step, similar to the example I just mentioned before about booking travel and other things. So it's a similar flavor. It's no longer just a chatbot, but something that really helps everyone do something useful.

(17:12):
Special thanks to Mingyi Hong. For the Discovery Files, I'm Nate Pottker. You can watch video versions of these conversations on our YouTube channel by searching @NSFscience. Please subscribe wherever you get podcasts, and if you like our program, share it with a friend and consider leaving a review. Discover how the U.S. National Science Foundation is advancing research at NSF.gov.