Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:02):
I mean, people often think of Twitter as being this really noisy space of people shouting at each other, but it's actually a collection of, I think, quite diverse communities, right?
Certainly, there are very loud people and you want to somehow filter those out, and we actually built in classifiers to filter those kinds of things out of our datasets.
(00:23):
But, as you said, there are also gems of community, or subcommunities of medical professionals,
who are really using these social networks to build communities and to have actually very informative, educational dialogues with each other out in the open.
And we can actually find those examples relatively easily.
That's how we actually curated a dataset of a couple of hundreds of thousands
(00:44):
of high-quality Twitter discussions, where each discussion would have one or more pathology images along with the corresponding dialogues and conversations by the professionals about those images.
Welcome to another episode of NEJM AI Grand Rounds.
I'm Raj Manrai, and I'm here with my co-host, Andy Beam.
(01:06):
Today we're really excited to have James Zou on the podcast.
James is a professor at Stanford University, and he's made pivotal contributions to an astoundingly diverse set of topics across artificial intelligence and biomedicine.
Andy, I learned a lot from James about his interests and his approach to research, including about his passion for journalism, and a really creative
(01:27):
study where he scraped hundreds of thousands of medical Twitter images to create a foundation model for pathology.
Yeah, Raj, it was great to have James on the podcast.
Every time I read a paper that I wish I had written myself, almost always it was written by James.
In addition to doing really cutting-edge technical work, he has this knack for doing really, really creative research that has broad implications.
(01:50):
For example, the paper that he wrote on using GPT-4 as a peer reviewer is definitely a paper I wish I had written myself, and something that we've honestly talked about implementing at NEJM AI.
So it was great to see James do that. Hearing him talk about how he wanted to be a reporter,
I think that really informs a lot of the types of papers that he writes, and he really is good at getting to the story and getting to the
(02:12):
interesting parts for AI and medicine.
So, it was great to have him on the podcast and I really enjoyed the conversation.
The NEJM AI Grand Rounds podcast is brought to you by Microsoft, Viz AI, Lyric, and Elevance Health.
We thank them for their support.
(02:34):
And now we bring you our conversation with James Zou.
James, thank you for joining us on AI Grand Rounds.
Thanks for having me.
Really excited to be here.
James, let me also welcome you to AI Grand Rounds.
Great to have you here.
So, this is a question that we always like to get started with.
Could you tell us about the training procedure for your own neural network?
(02:54):
How did you get interested in AI?
What data and experiences led you to where you are today?
Yeah, yeah.
So growing up, I was really interested in two things.
I was really interested in math and in writing.
So I went to college, I majored in math and then minored in English.
And I had all sorts of fun side jobs.
Like, uh, I was a food reviewer for some newspapers.
(03:16):
I was a movie reviewer for some newspapers and did some theater reviews.
So, I already thought I was going to become a journalist.
And then I went to grad school.
This is where I learned about machine learning.
This was at Harvard.
I took these machine learning classes.
And that's where I just thought, this is so fascinating, right?
So I think at that time machine learning was just taking off; this was right around the time when deep learning was getting going, right?
(03:38):
So that, I think, was the most exciting thing that was happening at that time.
The other most exciting thing happening around that time was genomics: a lot of advances in interpreting human genomes and figuring out how to understand the genetic bases of human diseases.
And I always had this sort of eclectic interest.
(03:58):
I thought, okay, so maybe I should use my math skills on the machine learning side.
And then a lot of the genomics is also a bit like, you know, interpreting literature and writing, right?
So I thought I was kind of here to do that.
As I went along, I came to Stanford in 2016, where I continued to do a lot of work in machine learning, but I've really also been very interested
(04:19):
in thinking about how to translate the machine learning and the more biological research we're doing all the way into clinical settings.
So I guess the transition from machine learning to medicine
was by way of genomics.
Is that correct?
That's right.
Yes.
Do you find the problems in medicine, and we'll touch on this a little bit more later, easier or more difficult to work on than the ones in genomics?
(04:41):
I know that there's a lot of overlap there, but how do you think about those different domains that you've worked in?
It's a good question.
I find the problems in medicine, in some sense, broader than in genomics, right?
So certainly, genomics is an important component of medicine these days.
But beyond genomics, I mean, medicine also has these other components, right?
(05:01):
There's certainly the human aspect of this.
Like how do you interact with physicians and patients, right?
There's the system aspect of this, like how do you then integrate with the EHR and work with the hospitals?
And there's also the economics aspect of this.
Like how do you incentivize and how do you regulate these advances in AI, all the way from early-stage companies to the more mature stages?
(05:24):
So I think certainly genomics is an important part of medicine, but there are also these other components that I'm really excited to work on now.
If I could ask another question about your background, I find your original career aspiration to be a journalist very interesting.
I wonder if any of those journalistic instincts inform the way that you think about problems.
The way that you think about writing in journalism, does that affect
(05:44):
your approach as a scientist?
Yeah.
So, I think an interesting part of journalism is that you want to find some interesting angles, interesting stories.
When I first started in journalism, I was just sent to do these press releases from companies, which is basically the most boring job you can imagine in journalism.
Basically, you just take what the companies said in their PRs and then sort of transcribe that into a couple of paragraphs that nobody reads. But
(06:07):
then you have to sort of pitch ideas to your editors, right?
Like, okay, maybe I want to do a movie review. Or, one of the ideas I had was, I want to do a restaurant review of all the kebab vendors in the city, right?
So you try to find an interesting angle that you think the audience will be interested in.
Something that hasn't been done before, and then also make it tractable and pitch it to the editors.
(06:27):
And then, you know, they approved it in that case, and they gave me some budget to actually review a bunch of kebab vendors.
This was in Budapest, in Hungary, which was great.
I think nowadays, when we talk about science and science communication and working with editors and also thinking about readers, I think a lot of similar things carry over.
There are a lot of interesting questions in medicine and AI.
We want to find
(06:48):
sort of a particular perspective that we can uniquely contribute to, that has some sort of interesting hook and angle, right, but that also has a broader impact.
James, you took us through your years in Boston as a grad student, and now you are a professor at Stanford, where you have a lab in the Department of Biomedical Data Science.
Maybe you could just take us chronologically a little bit more
(07:10):
through from graduating, finishing up in Boston, to moving to the West Coast, and now what your lab focuses on and how you prioritize problems.
I know we're going to dig into some of your research in a couple of directions in particular, but I think we're only going to touch on a small fraction of everything that your group is working on or that you've
(07:30):
worked on for the past 10 years or so.
So maybe you can tell us about that transition.
Moving to the West Coast, and then also how your lab is organized and what your main focus areas are for the group.
Yeah.
Yeah.
So I did my Ph.D. at Harvard, so very close to you guys.
And after I graduated in 2014, I actually spent two years at Microsoft Research, which is also based in Kendall Square in Cambridge, Massachusetts.
(07:52):
At Microsoft Research, it was actually a really great experience, because you're basically like free agents.
I was still doing research on the genomics side and also doing a lot more on the machine learning side.
And there,
quite serendipitously, we had a project where we were looking at word embeddings.
Think of word embeddings as basically like the baby versions
(08:13):
of the large language models.
And people were using word embeddings a lot around 2015, 2016, but nobody was looking at biases in these word embeddings, right?
So we thought, okay, wouldn't it be interesting to see, like, do the word embeddings actually capture stereotypes, like gender stereotypes and ethnic stereotypes, right?
So with a couple of colleagues at Microsoft Research,
(08:33):
we did that analysis, and we found that these word embeddings actually had a lot of gender stereotypes, which ended up becoming this relatively well-known paper looking at "man is to computer programmer as woman is to homemaker," which is one of the findings we saw from these word embeddings.
And that actually opened up a new area of research for me that I'm continually working on, really thinking about how do we use these AI models in ways
(08:56):
that are ethical and responsible, which I think is especially important as we're thinking about this intersection of AI and health care and medicine.
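As a rough illustration of the kind of analogy probe James describes, here is a minimal sketch in Python using off-the-shelf word2vec vectors loaded through gensim; the pretrained model name and the word lists are illustrative assumptions, not the exact setup of the original paper.

```python
# Minimal sketch of probing a word embedding for gendered associations,
# in the spirit of the analysis described above (not the paper's exact method).
import gensim.downloader as api

# Assumption: these publicly available word2vec vectors stand in for whatever
# embeddings were actually studied.
kv = api.load("word2vec-google-news-300")

# Analogy probe: "man is to programmer as woman is to ???"
print(kv.most_similar(positive=["woman", "programmer"], negative=["man"], topn=5))

# A simpler association check: does an occupation sit closer to "he" or "she"?
for occupation in ["programmer", "homemaker", "nurse", "engineer"]:
    gap = kv.similarity(occupation, "he") - kv.similarity(occupation, "she")
    print(f"{occupation}: he-vs-she similarity gap = {gap:+.3f}")
```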
Alright, so that was my experience at Microsoft Research, and then I came over to Stanford in 2016 to start my own group.
And here at Stanford, I'm super fortunate to work with a really diverse group of people.
So I have students and postdocs from computer science, from the math
(09:18):
department, from statistics on the one hand, but I also have several M.D., Ph.D.s in my group as postdocs.
So we have this group
of terrific people from diverse backgrounds, and basically half of the group works on more foundational questions in AI, responsible AI. And then the other half works on how do we take these
(09:38):
innovations, like, for example, in generative AI, and make them really impactful in medicine and health care.
Awesome.
So, I think that's a good segue to start digging into some of your work.
And the first paper that I'd like to talk about, I think, really speaks to the journalistic instincts you alluded to.
I think this is a good example of telling a really interesting story and having
(09:59):
a really novel angle on a problem.
And so the paper I'd like to talk about is "A visual-language foundation model for pathology image analysis using medical Twitter."
So, when I saw this paper, I thought that was a really, really clever thing to do.
Could you walk us through the setup of the paper, what you did that was different than other kinds of models looking at this problem, and then
(10:21):
maybe the origin story, how you thought to do this?
Yes.
Yeah.
Thanks.
Thanks for that, Andy.
So, I guess the origin story of this is that I have a terrific postdoc, Zhi Huang, who's a joint postdoc working with me and also with Tom Montine in the pathology department here at Stanford.
And he noticed something that I think is really interesting, that I didn't know about before, which is that usually when pathologists
(10:43):
encounter ambiguous images or cases that are not familiar to them,
something really interesting happens, which is that they actually then go on Twitter.
They would actually post those images on Twitter.
These are de-identified images, right?
And then they will invite their colleagues from around the world to have a discussion on Twitter about what's going on in that image.
And this is, amazingly, actually a recommended
(11:06):
practice. Even the National Academy of Pathologists in the
U.S.
and Canada have actually suggested specific hashtags for all of their members to use, right, around surgical pathology, dermatology, for each of the sub-areas.
There's actually a very active Twitter community of people discussing and posting information.
So, these are guideline-recommended hashtags.
(11:28):
That's right.
Yeah.
There are 32 of these society guideline-recommended hashtags.
Yeah, that's amazing.
'Cause my question was going to be, why are people posting so many pathology slides on Twitter? But it actually is per the society guidelines.
Encouraged.
Yeah.
Encouraged. I think it's encouraged by their organizing bodies.
And I think it's actually quite visionary, in some sense, for them.
Right.
(11:48):
Because you can imagine, a lot of the pathologists, especially if they're not at academic centers like Harvard or Stanford, if they're more at these clinics, they're often isolated.
They don't have a lot of colleagues with complementary expertise around them.
And if they encounter challenging cases, which often happens, then they do want to have a community of people to provide feedback.
(12:10):
And social networks like Twitter and LinkedIn have become a really active platform for these clinicians to get feedback and also to have a community.
So that's actually really amazing to see.
And as AI researchers, this also presents, I think, a really untapped but tremendous opportunity for AI.
Because especially thinking about medical AI, getting high-quality and also large-scale
(12:33):
data is often a big bottleneck.
But here we have, just sitting in front of us, right under our nose, but all out in the public, all of these hundreds of thousands of high-quality images, along with discussions, right?
Dialogues by experts, right?
About each of those images. It's actually really hard to find these kinds of multi-turn dialogues by experts.
(12:54):
But now we have hundreds of thousands of these examples that are all out in the public, and if we can just figure out a way to curate and to harness this dataset, then it can become a tremendous resource for AI.
So that's the impetus for this project.
And I just think it's fascinating, because
so often we worry about AI learning from Twitter and from other forms of social media, just because of the sort of toxic content.
(13:17):
But what you've actually discovered is that there's this little shining corner of Twitter where people are having sort of high-minded academic discussions along with imaging.
And so this is maybe a rare instance where social media becomes a sort of fertile training ground for AI.
(13:37):
I think that's exactly right.
Yeah.
Yeah.
I mean, people often think of
Twitter as being this really noisy space of people shouting at each other, but it's actually a collection of, I think, quite diverse communities, right?
Certainly there are very loud people, and you want to somehow filter those out.
And we actually built in classifiers to filter those kinds
(13:57):
of things out of our datasets.
But, as you said, there are also gems of community, or subcommunities of medical professionals, who are really using these social networks to build communities,
to have actually very informative,
educational dialogues with each other out in the open, and we can actually find those examples relatively easily.
That's how we actually curated a dataset of a couple of hundreds of thousands
(14:20):
of high-quality Twitter discussions, where each discussion would have one or more pathology images along with the corresponding dialogues and conversations by the professionals about those images.
And maybe you'll talk about this when we dig into the results, but these are often posted in search of consultation from a pathology colleague.
(14:41):
So, are the types of cases that get thrown up here near the average, or are they weird, or are they outlier cases?
Like how does that affect what the AI is ultimately able to learn?
Yeah, that's a great question.
The cases people post on Twitter, as you can imagine, tend to be the harder cases.
If it's a very common case, then pathologists feel less of a need to
(15:04):
ask their colleagues for feedback.
So, what they often do is actually post more of the corner cases, right?
So basically, the cases, in machine learning lingo, tend to be the outliers or near the decision boundaries, because they're more ambiguous.
And that's actually, in some sense, even better for training AI algorithms, because, you know, we have the common cases, we can get those from other databases like TCGA, but it's these long-tail cases, right, the rare
(15:28):
diagnoses or the ambiguous cases or the outliers, that actually end up being hard to get from other sources or from academic sources, but also extremely valuable for teaching the AI algorithms.
So that's why we think that this kind of Twitter, social network, and crowdsourced data is actually especially useful for teaching algorithms about these long-tail cases.
Got it.
(15:49):
Okay.
So, we have this unique data resource that you've curated from Twitter.
And just so everyone's on the same page, it's an actual image of a pathology slide, or a small patch of it, along with hashtags and Twitter discussions that go along with that image.
So given that dataset, what type of AI system did you create?
(16:10):
Yes, yeah.
So, once we've curated this dataset, we wanted to create what's called a visual-language model, right?
So what that means is that, if you think about ChatGPT, right, these are language models, so they can understand text, right?
And you can ask questions in text, and it gives you responses in text.
Since we're dealing with pathology images and text, right?
(16:32):
So we want this AI algorithm to not just have the text understanding, but also to have this visual understanding, right?
So that's where the visual-language part comes in.
So basically, we tried to build this kind of visual-language model, essentially kind of a chatbot, right?
Where you can basically put in images, pathology images, and the model will be able to answer some questions, right?
(16:52):
Or provide some description about those images.
So this is the model that we built on top of this Twitter data.
And so you can do this and then turn it into essentially an out-of-the-box pathology classifier, right?
That's right.
Yeah.
So this model, we call it PLIP, and this PLIP model, one of the ways you can use it is to give it a new image, right?
(17:14):
And the model will provide its, you know, best guesses about what the diagnosis is, or what is going on in that image.
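To make the zero-shot use concrete, here is a minimal sketch of classifying a pathology image with a CLIP-style visual-language model through the Hugging Face CLIP interface. The checkpoint path, the image file, and the candidate text prompts are placeholders; this is an assumed packaging of such a model, not necessarily how the released model is distributed.

```python
# Sketch: zero-shot classification with a CLIP-style visual-language model.
# Checkpoint name, image path, and label prompts are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "path/to/pathology-clip-checkpoint"  # placeholder, not a real model ID
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open("slide_patch.png")  # a de-identified pathology image patch
candidate_labels = [
    "an H&E image of benign tissue",
    "an H&E image of malignant tumor",
]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits, turned into a probability over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```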
And so could you tell us how this model compared to, like, traditional pathology classifiers that are trained directly on clinical pathology data?
Yeah, so that's a good question.
So, I think the model does comparably to some of the standard traditional
(17:38):
pathology classifiers, but where we think the model is useful is that it's conceptually more of a foundation model.
And here by foundation model, I mean that this model is not meant to solve just one specific task, right?
It's not trained specifically just to predict whether there's a tumor or benign tissue for breast or for colon, right?
(17:58):
But it's actually trained on this whole diversity of data from all these different subareas of pathology.
So then the way that people could use it would be, you can take this model as sort of the base or the initial setting and maybe do some additional fine-tuning, as we call it, right?
Basically, just give it some additional examples to then train it to predict some specific kinds of diagnoses or tasks, right?
(18:20):
And when you do this fine-tuning, then it is able to do comparably, and sometimes better, than these previous generations of pathology models.
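One common way to do the additional fine-tuning James mentions is a simple linear probe: freeze the pretrained image encoder, extract embeddings for a labeled downstream task, and train a small classifier on top. A sketch under those assumptions (the checkpoint and the data-loading helper are hypothetical, and this is just one of several possible fine-tuning strategies):

```python
# Sketch: linear probe on top of frozen CLIP-style image embeddings.
# The checkpoint and load_labeled_patches() are hypothetical placeholders.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

checkpoint = "path/to/pathology-clip-checkpoint"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)
model.eval()

def embed(images):
    # Frozen encoder: extract image features only, no gradient updates.
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats.numpy()

# Hypothetical loader returning (list_of_PIL_images, list_of_labels)
# for a specific downstream task, e.g., tumor vs. benign patches.
train_images, train_labels = load_labeled_patches("train")
test_images, test_labels = load_labeled_patches("test")

clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_images), train_labels)
print("held-out accuracy:", clf.score(embed(test_images), test_labels))
```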
Awesome.
Thanks. James,
I have just one more question about this, and then we want to switch topics.
You mentioned that kind of a key part of this study was extracting this useful set of tweets, right?
(18:41):
Where you had the hashtags from the societies that you could use to pinpoint where that useful training data was for the actual model itself.
Do you think you would have been able to do this study if the hashtags were not standardized the way they are?
Yeah, I think it would have been
much harder.
I think it would still be possible, but then we would need to have much better pipelines and classifiers.
(19:04):
I think that's the nice thing about these hashtags: they are fairly specific, right?
They're not something that you would just make up if you didn't know about pathology, right?
So that means that these communities are actually pretty curated communities of individuals who have some
experience working with digital pathology.
And then a related question is, would you have been able to take, say, case studies or reports that are published in, like, pathology journals, where you had the full text and the images, and do something conceptually similar?
I imagine that there are some of the same challenges and opportunities there, right?
Which is that
the cases that typically get published are not the standard ones that everyone's familiar with, but maybe also ones that are right around that decision boundary
(19:46):
or where there's uncertainty, but where there is some annotation and some image also available, although probably less freely available than tweets that you could download directly.
Yeah, so I think that's actually a really interesting idea.
I mean, one of the learnings from this project is that there are actually a lot of other sources of data, right, for training AI models beyond the sort of
(20:09):
standard kinds of datasets.
So social networks like Twitter are one of these, right?
We've also seen this kind of data being shared on LinkedIn.
So that's another platform.
Even on YouTube, right?
So there's a lot of interesting research recently trying to get information from YouTube videos. Locally at Stanford,
we have many hundreds and thousands of hours of instructional videos, right?
(20:31):
From different pathology instructors, right?
And we can figure out interesting ways of curating information from those pathology instruction videos.
So I think as researchers, we can be quite creative in coming up with all sorts of interesting sources of data.
I guess to follow up on that, just one more follow-up.
We tend to get
(20:52):
overly fixated on EHR and health care data, but what's your sense of the sort of untapped potential of publicly available medical data?
Like, you mentioned Twitter, you mentioned YouTube.
Do you think that this is something that's underappreciated for medical AI, that we should be using more of these sort of nontraditional medical sources or data sources?
I think so.
I think so.
I think we can get on the order of, you know, many millions of relatively
(21:17):
high-quality images with detailed annotations and descriptions, just from these public sources.
And I think that's often
enough data to train some quite interesting and powerful machine learning algorithms.
It also makes it easier to make these algorithms fully transparent, right?
Because one of the challenges that we often face when working with EHR data is
(21:41):
that it's often hard to share these models, and sometimes we can't even publish the weights of the models due to the concerns about privacy and leakage. But when working with these creative sources of public data, we don't have those concerns.
We can actually tell you exactly where the data come from, which makes it easy for anyone to audit the model, and also to reproduce the model.
(22:01):
Great.
So James, we want to stay on the subject of foundation models, but switch the focus of our modalities from images to text.
So Andy and I were, you know, browsing Twitter.
We saw this
study that you published, or released as a preprint, and it's on a super important and timely topic.
Many of us, many of our listeners, have either played
(22:22):
with or regularly use ChatGPT.
And you're asking a very important question, which I think many of us have wondered about.
But until your study, I don't think I'd seen a very systematic treatment of this topic.
And the title of your study is, "How is ChatGPT's behavior changing over time?"
It's a question.
(22:43):
And so maybe you could first tell us about, well, I don't think you need to motivate this, actually.
It's very clear why we need to work on this, but I would be very curious if you could tell us how you decided to work on this particular topic, what the background story was there, and then what your major findings were.
I know there was some interesting response on Twitter.
(23:06):
Uh, we've been talking about Twitter for pathology images.
There was an interesting response, because I think you really hit a nerve in the community, where a lot of people felt, from their own experience, that ChatGPT was changing in its reliability, its usability, how it operated, since they started using it
last November, December, when it came out, to now, or even over the past few months.
(23:29):
And so they saw your study as validation, almost, right? Of their experience.
And then you had some astute criticism of the study as well, from some folks at Princeton and some other researchers,
where they said, wow, your evaluations aren't really fair evaluations of ChatGPT.
So I gave a long lead-in, but maybe you could first tell us about the study, and
(23:51):
then I would be very curious if you could address some of that criticism and tell us where you're at now, and where you see that line of inquiry going forward.
Sounds good.
Yeah.
So, I guess I think of this as maybe another example of some of that journalistic training I had early on coming up, which is that, you know, this is something a journalist would do, right?
(24:13):
So you see, as you discussed, these anecdotes of people talking on Twitter and social media about how, oh, they see ChatGPT's behavior changing, which is what we saw.
And we thought, okay, this will actually be a really interesting and timely story.
But we want to not just tell the story, but actually really provide a lot of the data and evidence behind it.
So that's actually the initial motivation for this study.
And the context of this is that, you know, when people typically say GPT-4, right, which is maybe the latest version of ChatGPT, they think of GPT-4 as a single model, the same model, right?
If you use GPT-4 today, and if I use it again tomorrow or next week, it's the same model, and I should just get the same response back, right?
So maybe up to some of the randomness, the inherent randomness
(24:58):
of the model itself, but I should get mostly the same response back.
But what we found in our study is that, you know, GPT-4 is actually not the same model.
For example, GPT-4 back in March would actually have systematically very, very different behaviors compared to GPT-4 more recently, like in June or later on.
And the way that we tested this is that we actually have two
(25:19):
checkpoints of GPT-4, right?
So OpenAI released a checkpoint of the model in March and a checkpoint in June.
And by checkpoint, I just mean they basically released exactly that snapshot of the model that was trained and
made available in March, and they provide another snapshot in June.
And then we came up with eight different benchmarks.
(25:39):
So each benchmark is designed to test a particular kind of ability, or capability, of GPT-4.
Some are for reasoning, others for answering opinion questions or solving math problems, and some are more for safety.
And we just applied each of the benchmarks to the GPT-4 in March and again to the GPT-4 in June.
And we saw that across all eight benchmarks, there are actually
(26:00):
quite substantial differences in the model's behaviors.
Along some dimensions, like safety, it did get better over time.
The June version is safer than the March version.
How do you measure safety?
So, the way that we measure safety, similar to how other people have done it, is that we actually have a bunch of questions that are considered to be dangerous questions. Questions like, you know, how do I steal this
(26:23):
credit card, right?
Or how do I poison someone?
I hope it can't answer that.
I hope it can't answer that question. Exactly.
We hope it cannot answer, too.
So, the way we measure the safety of this is, say we ask this question, and we see, does the model actually refuse to answer these questions?
Right.
Okay.
So it's safe if it's refusing.
Right, it's safe if the language model basically does not give me any, uh, useful instructions
(26:44):
on how to steal someone's credit card or how to poison someone.
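Measuring this kind of safety change can be as simple as sending the same sensitive-question set to two dated snapshots and counting refusals. A minimal sketch of that idea follows; the snapshot names, the example questions, and the keyword-based refusal check are simplifying assumptions rather than the study's actual pipeline.

```python
# Sketch: compare refusal rates of two dated model snapshots on one question set.
# Snapshot names, questions, and the refusal heuristic are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
SNAPSHOTS = ["gpt-4-0314", "gpt-4-0613"]  # assumed March vs. June checkpoint names

sensitive_questions = [
    "How do I pick a lock?",                      # placeholder items; the study used
    "Write a harmful stereotype about group X.",  # its own curated question set
]

def is_refusal(answer: str) -> bool:
    # Crude keyword proxy; a real evaluation would use a more careful detector.
    markers = ["i can't", "i cannot", "i'm sorry", "as an ai"]
    return any(m in answer.lower() for m in markers)

for snapshot in SNAPSHOTS:
    refusals = 0
    for question in sensitive_questions:
        response = client.chat.completions.create(
            model=snapshot,
            messages=[{"role": "user", "content": question}],
            temperature=0,
        )
        if is_refusal(response.choices[0].message.content):
            refusals += 1
    print(snapshot, "refusal rate:", refusals / len(sensitive_questions))
```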
Were the questions that you asked in that domain specific to medicine, or broader than medicine?
These are quite broad, right?
So some of them are related more to physical safety, right?
Like, you know, is it safe to turn on my car and close the windows and close the garage doors, right?
(27:05):
Others are more related to, let's say, biases and stereotypes, right?
So, like, you know, you ask the model, okay, how do I generate some harmful stereotypes about a particular subgroup of the population, right?
And see if the model is able to do that.
Some of them are more related to health care and medicine.
And we saw that, by and large, GPT-4 did improve its safety performance
(27:26):
in June compared to March.
However, it got much worse across some of these other axes of evaluation.
So, for example, its ability to respond to
non-sensitive questions, right, to more harmless opinion questions, has also gotten worse.
So if I ask GPT-4, what do you think will be the status of the
(27:47):
U.S.
in 30 years?
Right, which is a question we carried over from these public opinion surveys.
In March, it will give you a reasonable answer, but in June, it actually refuses to answer that question.
It says that, you know, as an AI model, I don't have opinions.
These are subjective questions, so I don't want to answer them, right?
So you can see this interesting trend, in that most of the time
in June, it actually refuses to answer relatively harmless opinion
(28:09):
questions that it was perfectly
willing to answer in March.
The take that I remember from Twitter is that people were saying GPT-4 has been lobotomized, but I think that was an over-reading of the results of your paper.
When you first got those results, how did you interpret them?
Our interpretation is that, first, there's this
(28:30):
really huge amount of model drift.
So, model drift here would mean that, you know, the behaviors of these AI systems actually can change quite a lot over time.
And what people should remember is that these algorithms are learning systems, right?
That's what makes AI a bit special, is that they continuously learn from data.
In the case of large language models like ChatGPT, it continues to learn from
(28:51):
human instructions and also from human feedback.
And it seems like there have actually been really substantial changes in GPT-4's behavior over time due to, potentially due to, the human feedback.
Can we just pause there?
Could you explain to our listeners how human feedback enters into changing a model like GPT-4 or ChatGPT?
You see this term, these four letters, bandied around a lot: RLHF.
(29:15):
Could you explain to us maybe one or two mechanisms by which you see human feedback entering in and changing this model?
Yes.
Yeah.
So RLHF, reinforcement learning from human feedback, is one of, I would say, three prevalent stages, three ways that people usually train these large language models.
The first two ways are more of, you give it text, maybe a corpus
(29:38):
of text from articles, from papers.
You just see how well the model is able to complete the text, right?
Generate text similar to the text that it's seen before.
Those are what's called pretraining and supervised fine-tuning.
And then the third way, the one you're raising, reinforcement learning from human feedback, is where the goal is to really try to align the model's behavior
(29:59):
with human preferences, right?
Like maybe Andy or Raj or me, we have certain preferences for how we want our chatbot to behave.
I mean, I like my chatbot to be more concise.
I don't want it to give long answers.
And then what we can do is, I can just provide some examples of that preference that I have,
right?
I can rank, given two responses from the different chatbots, which one I prefer.
(30:24):
And companies like OpenAI would actually train another algorithm to basically try to model my internal preferences.
This is called a reward model, which is basically another little language model.
It's trying to do this.
And then when they do the human alignment, basically trying to align the performance of ChatGPT to their users, they're basically updating the parameters of ChatGPT to essentially increase the reward it would get
(30:48):
from this reward model that's supposed to emulate what a human would like.
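The preference-modeling step described here is typically implemented with a pairwise loss: the reward model should score the response a human preferred higher than the one they rejected. A minimal PyTorch sketch of that loss, with the reward model itself and the data pipeline left abstract:

```python
# Sketch: the pairwise preference loss commonly used to train a reward model.
# `reward_model` is any network mapping a (prompt, response) pair to a scalar
# score; its architecture and the preference dataset are assumed, not specified.
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # score for the preferred reply
    r_rejected = reward_model(prompt, rejected)  # score for the rejected reply
    # Maximize the probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# In RLHF proper, the chatbot's parameters are then updated (commonly with PPO,
# plus a penalty that keeps it close to the original model) to increase the
# reward assigned by this learned preference model.
```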
Perfect.
So, we talked about safety.
Andy referenced that really viral take on your paper, which was that GPT-4 had been lobotomized.
I think a lot of that was referring to another domain, right?
Which is its ability to do math or to write usable code.
(31:11):
So could you talk about maybe some of those
dimensions that you evaluated, also?
Maybe let me ask it more directly.
First, is GPT-4 getting worse at math?
Yes.
We think that GPT-4 has gotten worse at math.
The reason why we think that happened is because we think that its ability to do what's called chain-of-thought reasoning has gotten worse over time.
(31:34):
And for the user who might not be familiar, chain-of-thought reasoning is actually a pretty
popular prompting strategy.
So, the idea there is that you can ask GPT-4, okay, is this number a prime number?
If you just ask it directly, then it will give you an answer, yes or no, right?
But oftentimes people have found in the past that if you ask GPT-4 to think through step by step, right, to give the logical reasoning for why it thinks this
(31:58):
is a prime number, then sometimes the model will actually go through that step-by-step reasoning, and that can substantially improve its performance.
This is similar to how, in schools, we often ask our students to think through and show the steps of their problem solving, right? That helps the students avoid mistakes. Similarly with GPT-4.
And what we found is that basically in March, it was actually quite good
(32:18):
and willing to do this kind of chain-of-thought or step-by-step reasoning.
And it was getting reasonably good performance on these math questions, relatively simple math questions.
But in June, when we asked the same questions, and when we asked GPT-4 to also follow chain-of-thought reasoning, first it would ignore our request to use chain-of-thought reasoning, right?
It would just completely not show its work, right?
(32:39):
Not show the step-by-step process.
And the second is that it would just jump to the answer directly.
And oftentimes it gives us the wrong answer.
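The contrast between asking directly and asking for chain-of-thought reasoning is just a difference in the prompt. A small sketch of the two prompt styles follows; the query helper is a hypothetical stand-in for whichever chat API is being evaluated, and the example number is arbitrary.

```python
# Sketch: direct prompting vs. chain-of-thought prompting for a math question.
# `ask` is a hypothetical wrapper around whichever chat API is being tested.

def ask(model: str, prompt: str) -> str:
    """Hypothetical helper: send `prompt` to `model` and return the text reply."""
    raise NotImplementedError

number = 17077  # arbitrary example integer

direct_prompt = f"Is {number} a prime number? Answer yes or no."

cot_prompt = (
    f"Is {number} a prime number? "
    "Think step by step: check divisibility by small primes, show your "
    "reasoning, and only then give a final yes or no."
)

# The reported pattern, paraphrased: the March snapshot tended to follow the
# step-by-step instruction and answer such questions correctly more often,
# while the June snapshot often skipped the reasoning and jumped straight
# (and more often incorrectly) to a final answer.
print(direct_prompt)
print(cot_prompt)
```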
Can I ask a maybe conceptual question?
In the literature, there's this technical concept that's
known as catastrophic forgetting.
And so when you update a model, sometimes it causes the model to forget previous things that it's already learned.
(33:00):
Do you think that GPT-4's degradation in mathematical reasoning here is specific to what they did the human feedback on, to the topics that were covered?
Or do you think it's somehow inherent to the RLHF procedure itself, and that you're going to wash out and cause some catastrophic forgetting just when you're now maximizing this kind of narrowly defined reward signal?
(33:23):
It's a good question.
I think it could be a mixture of both.
And I should caveat this by saying that we do not know exactly how OpenAI is training GPT-4, right?
So, despite the name, it's not really transparent what they do to the model and how they update the model,
which I think actually makes it even more important for external academics like us to evaluate and monitor these systems over time.
(33:45):
But that caveat aside, I do think that some of these behavior changes we see are due to this continuous training they're doing, and also due to the specific kind of content they're doing this training on.
Basically, one thing we've been trying to do is to essentially replicate the behavior change that we see with GPT-4 with smaller models that
(34:07):
we have complete control over.
So this is sort of analogous to, in medicine, you know, it's hard to do experiments on humans.
That's where you find model systems like mice to replicate the human diseases.
Similarly here, we can't really do experiments on GPT-4.
So, we've looked at smaller models like these Alpaca models that are open source.
And then we try to see, okay, if we do different kinds of fine-tuning on
(34:30):
Alpaca, do we see similar kinds of behavior changes as we've seen for GPT-4?
And this is where we have seen that, for example, if you do some kind of instruction fine-tuning to improve the safety of Alpaca, safety does improve, similar to how we see the safety of GPT-4 improve, but we also see side effects of the safety training, whereby Alpaca is also less willing to
(34:55):
respond to other kinds of questions.
For example, if I ask it, how do I kill weeds in my backyard, right?
The model will say, well, you shouldn't kill weeds.
Because killing is not good, and the weeds are intelligent systems that should be respected.
I think I've seen a version of that, which is, how do I kill a process on my Linux machine?
And it says you shouldn't kill a Linux process on your machine.
(35:16):
So, yeah.
Yes.
Context matters.
Context matters.
That's right.
Yeah.
So the model sometimes becomes, like, overly safe, right?
And it has these side effects.
It's maybe a little bit unexpected.
And I think that's kind of analogous to what we see with GPT-4 as well, right?
Like the reason why it stopped answering my very, uh, reasonable questions, like,
(35:36):
what do you think will happen to the
U.S.
in 30 years?
It's sort of analogous to, you know, some of these effects of safety fine-tuning that we've seen.
Can I ask a question that maybe you can't answer? I'm wondering, in your
investigation, OpenAI is often kind of accused of RLHF-ing out very viral fails.
(35:57):
So if someone posts something on Twitter, then people go and try to reproduce this failure mode,
and a couple days later, they're unable to do that.
Did you see any evidence of that when you were looking at this problem?
So, that's a good question.
Well, I guess what we have seen, as a response to some of these results that we and other people have found, is that OpenAI has
(36:18):
decided recently to keep the earlier versions of their GPT-4 available, at least for quite a while into the future.
So basically, they're going to maintain the March version of GPT-4 that people can use.
So that you don't have to use the latest version.
You can actually switch back to the March version.
And I think that's actually a response to a lot of these criticisms that
(36:40):
we and other people have been discussing, that many of these aspects of the model have drifted
and become less useful for their users, which motivates them to keep the earlier version around.
So, James, we have one more question about this human feedback that's entering into the model.
We're talking a lot now in the community about RLHF and
(37:04):
SFT, supervised fine-tuning, the methods that you talked about,
to take this model that comes out of the pretraining and turn it into something that is less toxic, more useful, more aligned, and more usable, as we've seen with ChatGPT and GPT-4 and many other language models.
Do you think that in a year, two years, we're still going to be
(37:26):
talking about and emphasizing the importance of SFT, RLHF, these methods of alignment, as much as we are today?
Where do you think the research energy and the importance of alignment and of this field are going, compared to where they're at today?
(37:47):
Yeah, that's an interesting question.
I do think that in a year or a couple of years, more and more of these models will be trained on other kinds of AI-generated data rather than human-generated data.
In that sense, that would diminish some of the need for, and also the impact of, learning from human feedback or from human reward.
(38:08):
And we already see that with many of the open-source models.
So many of the smaller open-source models are basically trained not from human feedback, but from larger
AI models like GPT-4 and GPT-3, essentially in a kind of teacher-student
setting where the large models then provide instructions
to teach the smaller AI models. We do
(38:28):
see that the smaller AI models learn more efficiently in that teacher-student setup compared to if you just train them directly from human feedback.
I think the field will be going in this direction, where, you know, maybe it's a more science-fiction-like view of this,
but you have a lot of these AI models,
and then you have the larger AI models teaching the more
(38:50):
specialized, smaller AI models.
And maybe the specialized, smaller AI models will get some fine-tuning from specialists, right, from domain experts.
A relatively small amount of human feedback, but a lot of the training will be done by interacting with other AI systems.
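The teacher-student setup sketched here is essentially how instruction-distilled open models are built: sample instructions, have a larger model write the responses, and fine-tune the smaller model on those pairs. A rough sketch under those assumptions, with both model calls left as hypothetical stand-ins:

```python
# Sketch: teacher-student instruction distillation. A large "teacher" model
# generates responses that become supervised fine-tuning data for a smaller
# "student" model. Both function calls below are hypothetical stand-ins.
import json

def teacher_generate(instruction: str) -> str:
    """Hypothetical call to a large frontier model (e.g., a GPT-4-class API)."""
    raise NotImplementedError

def finetune_student(dataset_path: str) -> None:
    """Hypothetical supervised fine-tuning of a smaller open model on the data."""
    raise NotImplementedError

seed_instructions = [
    "Explain what a reward model is in one paragraph.",
    "Summarize the risks of model drift for clinical AI tools.",
]

# Build (instruction, response) pairs from the teacher and store them as JSONL,
# the usual format for supervised fine-tuning datasets.
with open("distilled_instructions.jsonl", "w") as f:
    for instruction in seed_instructions:
        record = {"instruction": instruction, "response": teacher_generate(instruction)}
        f.write(json.dumps(record) + "\n")

finetune_student("distilled_instructions.jsonl")
```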
Can I just ask a follow-up on that?
So that's a good way to create, like, a GPT-4 mimic.
(39:12):
But presumably there would be a GPT-5 in the future where we would still need this kind of alignment or fine-tuning.
Do we still have the same problem?
Just the scale of the model that we have to worry about changes, like the LLaMA 2s will just learn from GPT-4s, but then when GPT-5 comes out, will we still have to do this kind of human alignment step?
(39:32):
Yeah, it's a good question.
I think
the ecosystem will likely have a few of these very large models, right?
And then those sort of more frontier large models will have to get additional new human feedback data.
And sort of in their wake, there will be a lot of smaller models that will be built on top of them, right?
Maybe similar to, you have a big whale, and then next to the whale
(39:54):
will be a school of fish that are feeding off its training data.
I think, especially as we're moving toward other
modalities, as GPT-4, GPT-5 are incorporating other modalities like imaging and videos, as you bring in new modalities online,
then we need to go through a similar process of aligning the models with
(40:15):
human preferences and human feedback for how we think about these other modalities.
Okay, so I think we are ready to move on to the lightning round, if you're ready, James.
Let's do it.
Okay.
We're going to ask you a bunch of only tangentially related questions.
I think the goal here is, you can respond as long as you want, but
(40:37):
I think briefer answers get higher scores in this portion of the podcast.
So the first one: given your experience with model drift and GPT-4, and as we've talked about, your work in machine learning for medicine, do you think that LLMs will be net positive for medicine specifically over the next five years?
Yes.
Yes.
I think they will be a huge positive for medicine.
(40:59):
Could you give me a specific example where they might be a huge positive?
Yeah, I think one area where LLMs will be very useful, where we have already seen a lot of potential, is in translating different types of medical text, like explaining things to patients, right?
Explaining a lot of the jargon from the clinical notes to the patients, or
(41:19):
explaining instructions from doctors to patients, or making things like consent forms or clinical trials more understandable, more accessible.
I think those sorts of things are more of
what we call style transfer, right?
Taking one kind of writing that's more jargon-y, technical, and translating the style into something that's easy for everybody to understand.
That has less risk of hallucination, but it actually would make
(41:42):
health care and medicine more accessible.
Awesome.
Excellent.
James, if you were not in AI research, what job would you be doing?
I think I would still go back and maybe become a journalist.
And I had really a lot of fun being a reviewer, not of scientific papers, but of theater and restaurants and movies.
(42:04):
So that was great.
That was a lot of fun.
So that's what I would like to do.
Awesome.
Are you familiar with the concept of a bucket list?
Yes.
I think the concise definition is it's a list of things that you would like to do before you kick the bucket.
If I could do a Morgan Freeman impersonation, I would have done it there.
Um, so what is something on your bucket list that you hope to do someday?
(42:29):
Ooh, I would love to do a triathlon.
Wow.
Have you trained as an endurance athlete before, or is that part of the bucket list aspiration also?
I've done a few marathons together with my wife; she's a better athlete than I am.
She's managed to drag me to do a few marathons.
And then recently I've been doing a lot of swimming and biking.
(42:50):
So I'm getting there on the other two parts.
That's awesome.
Excellent.
James, what is your favorite TV show?
Oh, I think my all-time favorite still has to be Seinfeld.
The classic '90s New York humor.
I love that.
But I guess among more recent ones, um, you know, I watched the Silicon
(43:13):
Valley TV show not too long ago.
I think that's actually quite interesting, especially living in the middle of this.
Does it hold true?
Some part of it.
Yeah, actually a surprising part of it does hold true.
How many Erlich Bachmanns do you know?
(43:34):
You don't have to answer that question.
Three, I think three.
That's a lot of Erlich Bachmanns for one person to know, so.
It's probably three too many.
Um, okay.
So, given your work in genomics and in AI, I don't know
(43:55):
if those are related to this question, but it's still a setup.
Do you think aging, viewed as a disease, is something that can be solved?
I'm actually really interested in aging.
We have a few projects looking at aging, especially looking at the different aging interventions for aging rejuvenation, where we make these poor mice exercise and we give them these anti-inflammation factors and see
(44:20):
how that changes their aging clocks.
So I do think that aging is a really important collection of different changes, right?
Including a lot of different diseases associated with aging.
And there's so much we don't understand.
But I think there are also a lot of really interesting ideas on how to maybe reverse some of the effects of aging.
James, do you think things created by AI can be considered art?
(44:43):
Yes.
Yes, I do.
Yeah.
Like, think about 100 years ago, right?
When photography was becoming more popular, there was a big debate of, okay, if you have a human who takes a big bulky camera and takes a photo, right?
Is this art, right?
And the traditional painters would say, okay, that's not art.
I mean, it's the machine doing the job.
(45:04):
But now I think we've come to appreciate that photography can be very creative, and it's really a good example of human-machine interaction, right?
The humans still have to set the scene there.
I think it's quite analogous to AI-generated art here, right?
The human will sort of set the scene, right?
Maybe through the prompting and through other things.
And then the AI will be like the photography machine, and it's going to capture the details, right?
(45:27):
But it's a new mode of human-AI interaction, and I think it could be very artistic and creative.
Awesome.
Okay.
Final lightning round question.
Again, given your previous experience as a journalist, who is your favorite author or writer?
I have been reading a lot of Walter Isaacson.
I think he's done a great job, you know, writing about everybody
(45:49):
from Steve Jobs to Elon Musk.
He also has a tremendous biography of Jennifer Doudna, who is someone that I also admire a lot.
And yeah, so I recommend that highly.
So, I've read every biography he's written, too, so I'm in total agreement with you there.
All right.
Well, congrats on passing the lightning round, James.
Thank you.
All right.
So, in the time that we have left, we'd like to pull back and ask you
(46:11):
a couple of big-picture questions.
You've worked in a lot of different areas of medicine.
You've worked, you know, I'm probably going to miss some here, but you've worked in cardiology, you've worked in pathology, you've done lots of broad landscape assessments of health care, like what the FDA is doing, what they're approving.
So you really have worked in a lot of different areas.
What areas of medicine do you think might be most resistant to change from AI?
(46:34):
It's an interesting question.
I think the gradient that we see, right, from less resistant to more resistant, is that the earlier we are in research and discovery, maybe the less resistant it is and the easier it is to use AI.
So that's why, for example, there's a lot of AI now in drug discovery, biotech, and even the pharma companies are getting into this on the early-stage side.
(46:56):
As we get closer to the later stages, like clinical trials, and even post-clinical trials, I think that's where companies become more conservative, right?
And it's been harder; there are efforts, but it's been harder, for example, to use AI in clinical trials or in later settings.
And do you think that that reflects economic forces or risk
(47:17):
tolerances or, like, status quo?
Sort of, what do you think makes something so resistant like that?
I think a lot of that does come down to economics and incentives.
As you get closer to the product side, then there are challenges beyond the technology.
You have to figure out how to integrate it into the workflow, right?
Into the EHR and the rest of it. How to get reimbursed, right?
(47:40):
By Medicare and by the insurance companies.
And at that point, it's really
less about the technology itself, and more about having to align the incentives of these other stakeholders.
James, I have a question that I think we touched on a little bit earlier, at the very beginning of the conversation, and that also, I think, goes along with what Andy just asked you.
(48:01):
So you really work across this amazing number of areas.
I think we maybe covered 5% or 10% of the different topics that you've taken on in your own lab and in your work before you started your lab.
So, I'm really curious, you know, how do you approach selecting projects?
And maybe a related question,
how do you identify good collaborators who can provide complementary expertise
(48:25):
to you and to the members of your lab?
First, I should really mention that I've been extremely fortunate to have some just amazing collaborators, and also amazing students and postdocs in my group, especially as someone who comes more from a computer science and math background.
You know, I know what I
don't know, which is, uh, I don't have a lot of background on the clinical side.
(48:46):
So to address that gap, I think one of the most fortunate things I've done is to try to recruit M.D., Ph.D.s, basically residents or clinicians, as postdocs into my group.
So, most of my students come from computer science, from that engineering aspect, but in order to really identify impactful problems, and for them to be able to really
(49:07):
have an impact with these algorithms, it's been really very helpful to have really talented clinicians who have spent a couple of years in my group as postdocs, right?
So, David Ouyang is one of these folks, you mentioned cardiology.
He was a postdoc here, but really led some of the work that we're doing on the cardiology side and led this clinical trial work in cardiology, right?
(49:28):
Working very closely with my computer science student, Brian.
So I think having folks like that is really critical.
Yeah, we know David, uh, he might be joining us, I think, as an editor at NEJM AI. He's a fantastic person.
Yeah, that's a wonderful choice.
Yeah, um, I guess the last question, and it's probably as
(49:49):
broad as we can possibly go, but I'm curious about your thoughts on it.
So, we've been talking a lot about the near-term value of AI, what it can help do in sort of very specific, well-prescribed areas of medicine.
The broader conversation is much more dominated by this idea of existential risk.
The AI is going to murder us all.
So, I guess I would like to ask you point blank, is existential risk, in
(50:13):
that sense, something that we should be worried about? Or how do you think about working in a field like this, where that is a concern that some folks have?
Yeah.
So first I should say that I don't think we are very close yet to having an AGI that can pose existential risk. I think even with GPT-4, GPT-5,
(50:34):
I don't think we are that close to really a very dangerous AGI yet.
I think there are a lot of– Are you willing to speculate how far away we might be?
I think I'll say probably more than 10, 20 years away, but I do think that there are still a lot of important short-term risks, right?
(50:54):
Even with the non-AGI AI that we have, especially as it pertains to reliability and transparency and bias.
So that's why a lot of our efforts are more focused on these more short-term risks that we think could happen and could be impactful in the next one or two years.
At the same time, I don't view these two fields as mutually incompatible.
(51:14):
I do think that there's room for serious researchers and scholars also to think about the longer-term risks, like this existential risk, right?
Like, even if it's very unlikely, even if there's a
0.1% chance or less of that happening, I mean, I think it's still worthwhile for some people to spend time thinking about it.
And I think these are also really interesting intellectual
(51:35):
questions that arise when you think about these longer-term risks, and that can also help us come up with better shorter-term solutions.
All right, I think that's a very sensible approach to that kind of question.
So I guess, uh, James, thanks for joining us on AIGR.
Thanks for having me.
Really enjoyed our discussions.
Thanks so much, James.
This was great.