Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:01):
We use these cases from NEJM Healer, a medical education tool.
We wanted to look to see if we provided these cases to GPT-4 and we swapped out
either the race or the gender of the patient mentioned in that case, how
well would the model perform in actually identifying the correct diagnosis?
(00:24):
What we found from this work was that when you change the demographics for these
cases, it affected the model's ability to provide the correct diagnosis in 37%
of the cases that we evaluated.
One case that is really striking, which was a case of pharyngitis in a
sexually active teenager where the model got the diagnosis, which is mono,
(00:47):
correct for 100% of white patients, but only 64 to 84%
of the minority patients, where it opted to rank the STD, gonococcal
pharyngitis, first instead in those cases.
Hi and welcome to another episode of NEJM AI Grand Rounds.
(01:07):
Today we are delighted to bring you our conversation
with Professor Emily Alsentzer.
Emily is an Assistant Professor of Biomedical Data Science and of Computer
Science at Stanford University.
Andy, she is an expert in many things: natural language processing,
understanding large language models, and other AI tools.
And I think she really leaned on her training in this episode, both in
(01:32):
technical computer science, machine learning work, but also in deep clinical
expertise and respect for the work of physicians and biomedical scientists.
So, she went through this training program that I happened to go through
as well, HST, and she talks about that and she really connects a lot of
disparate topics in a very unified and interesting way in our conversation.
(01:54):
I really enjoyed it.
I agree.
We talk about this in the episode, but I remember I met her when
I was a postdoc with Zak.
And I just remember thinking like, oh, this is what it's like when you meet a
rock star when they're a grad student. Like, I've just always known that
Emily was on a sharp upward trajectory.
I've had the pleasure of collaborating with her on a couple of projects
when she was a grad student. I've worked with her a lot in the machine
(02:16):
learning for health community.
So, she's been an organizing force behind a lot of the
conferences that bring together clinicians and machine learning
researchers. And just like, you know, of course she's a Stanford professor.
Like that just makes sense.
Like I've never had any doubts that she was going to just
kill it and be a total rock star.
So, it was great to have her on the podcast and to talk to her about what
she's working on now and what she's worked on in the past and, like you said,
(02:38):
Raj, I think one of the most thoughtful, deeply technical NLP researchers
who also takes the clinical side very seriously.
And I think that's a credit, both to the HST program, but also, her
mom's a pediatrician. She's kind of grown up in it and she really
blends all of those worlds in a really, really nice kind of way.
(02:58):
The NEJM AI Grand Rounds podcast is brought to you by Microsoft,
Viz.ai, Lyric, and Elevance Health.
We thank them for their support.
And now we bring you our conversation with Emily Alsentzer.
Alright, Emily, well, thanks for joining us on AI Grand Rounds.
(03:19):
We're excited to have you today.
I'm really excited to be here with you both.
Emily, welcome to AI Grand Rounds.
So this is a question that we always get started with.
Could you please tell us about the training procedure
for your own neural network?
How did you get interested in artificial intelligence and what data and experiences
led you to where you are today?
That is a good question.
(03:40):
So, I've always been interested in medicine, but I think I sort of
stumbled into computer science and AI.
I grew up in a family of doctors.
My mom and brother are both pediatricians, so I'd really been
exposed to medicine from an early age.
And then it was in high school, I think, that I attended this program
(04:00):
called the School for Science and Math at Vanderbilt, one day a week, instead
of going to my normal high school.
And it was during that experience that I think I was first exposed to science.
So, we'd have scientists from across Vanderbilt come and talk about their
work, and it was very interdisciplinary, talking about anything from ice cores
(04:20):
in Antarctica to mechanical engineering.
It really taught me, I think, the value of interdisciplinary
thinking and of questioning the status quo.
But at this point I had never done anything with computer science.
And then during my last year of high school, I read this book called The
Most Human Human by Brian Christian.
And this was really my first exposure to AI.
(04:43):
So this book, oh, Andy has it there.
I love that.
This book, for those of you who are unfamiliar, it describes the Turing
test, which is this competition that assesses a machine's ability
to exhibit human-like intelligence.
So, a judge will conduct a series of conversations over a computer with either
a real person or a computer program, and the AI can pass the Turing test
(05:08):
if it appropriately fools the
judges into thinking they're conversing with a real person.
But what is really cool about this book is that there's also this most human
human award for the human that does the best job of swaying the judges.
So this was just really cool for me, kind of my first exposure
to AI. And I think that was,
in hindsight, priming for when I entered college at Stanford.
(05:32):
I immediately decided to take an introductory computer science class.
Something like 95% of students end up taking at least one computer science class at Stanford,
and then I just immediately got bit
by the bug, where there was really something satisfying about being
able to build something from scratch.
And I also found the idea of decomposing like a larger task
(05:53):
into functions to be really, like, satisfying with my way of thinking.
So, then I ultimately decided to major in computer science, but really I think
my interests in computer science were always in the service of medicine.
So, at the time I was premed for three years and really thought I
was still going to go to med school.
And then, again, in hindsight, I think there is this turning point
(06:15):
where I took Dan Jurafsky's NLP class at Stanford, and he was just so good
at explaining these really complex topics in an easy-to-understand way.
And
I ended up doing a Latin translator as part of the class project for that class.
My secret, I guess, is that I'm a Latin nerd.
So anyway, I ultimately decided I wasmore interested in developing tools
(06:39):
that could aid clinical decision makingrather than becoming a clinician myself.
Just to hop in here, please tell me that the first thing that you
translated was cogito, ergo sum.
Oh, yeah, definitely.
There's the joke that Latin kids know like 20 different ways how to
kill someone, but you don't know how to say, like, my favorite color is purple.
(06:59):
But anyway.
After kind of wrestling with that decision of whether or not to go to
med school, um, I ended up doing a co-term at Stanford, which
is effectively a one-year master's degree in biomedical informatics, and
then joined MIT for the Ph.D. program.
And the Ph.D. program was in Health Sciences and Technology.
And this ended up being really the best of both worlds where you take computer
(07:22):
science classes, but also take some of the med school classes at Harvard Medical School.
And even get this clinical experience where you learn how to perform a
physical exam, you present patients on rounds, and write notes in the
EHR, all as part of this Ph.D. program.
So that's awesome.
We've had several alums of the HST program, I think, on the podcast so far.
(07:44):
So, for the listeners, what Emily was reacting to is that the book she was
describing, The Most Human Human, is literally on the bookshelf behind me.
So that was like a fun coincidence.
So I think spiritually I was with you there.
You obviously found HST, which is like a good blend of those.
The common motivation of, like, how do we actually help people
in society was part of it.
(08:04):
And then really wrestling with what does it look like to
interact with the medical field?
Is that actually being a practicing doctor, or is that helping
by developing medical decision-making tools?
Nice.
Yeah, that makes total sense.
Raj, were you going to ask a question?
It looks like you were.
Yeah, go ahead.
Yeah, I was going to say, so I think we're going to get into it in a little
(08:25):
bit, but you know, you do a lot of work in natural language processing.
And so it just stuck out to me that you said that one of the more
important classes that you took during undergrad was an NLP class.
Do you trace back your interests or even your work now to that initial
exposure to NLP as an undergrad?
Definitely. I think so.
It was just a really well-taught class.
(08:47):
I also think from the clinical side, there's something really satisfying
about reading clinical notes, where you understand the decision-making
process of a clinician. You get better insight into that decision
making than you do if you just look at the structured data of the EHR alone.
So that was the other component, I think,
that brought me to that kind of work.
And you were interested in NLP like long before GPT-4 and Chat
(09:10):
GPT and everything came out.
This is all pre-large language model days, right?
Exactly.
So it has been a lot of fun to see the field change. Everybody is now,
everybody is now an NLP researcher.
Right?
Exactly.
More the merrier.
Bring it on. Yeah, I do want to
come back to that at some point about your thoughts on, or maybe we can just
hop into it now before your papers, but like back in the day, there were these
(09:33):
like highly bespoke NLP pipelines where named entity recognition, all of these
different parts, I think those of us who were around in the early 2010s remember
cTAKES, and like entity extraction, and mapping everything to codes.
And you were super happy if your new model,
like, got an F1 score bump of like 5% or something like that.
(09:54):
But there was maybe, especially in natural language processing,
I think the language part of NLP has gotten lost a little bit, in that, like, you
actually had to think about the structure of language, grammar, syntax. Not to go
to the big picture thing too early, but do you think we're in a better place now?
Essentially, you just ask GPT-4 to do all of your extraction.
Do you have any nostalgia at all for the pipeline-based approach
(10:16):
for how NLP used to work?
I do. I think that we need to bring back some of those principles into the
way that we continue to do NLP today.
What I do appreciate about our current approaches is that I think
we're much more closely aligned with the ultimate challenge that we're
(10:38):
trying to solve in medicine, where, because of where our methods are now,
we're thinking about, how do you summarize patients' medical records?
You know, how do you do ambient dictation?
Whereas when I first started in the field, being motivated really
by the ultimate clinical challenges, I found it a little harder to see the big
(10:58):
picture of how you go from named entity recognition to a downstream clinical task.
So, I think that's the more exciting part of where we are today in the field.
Awesome.
So I think that's a great transition.
So I'd like to talk about one of your papers.
It's called "Publicly Available Clinical BERT Embeddings."
There's a couple of things that I'd like for you to highlight here.
Why are there Sesame Street characters in the title of the paper?
(11:20):
So, a little bit of history of BERT models might be helpful.
What were you trying to accomplish in this paper?
And, uh, yeah, let's, let's start there.
Sure.
BERT is the name of a model. Before BERT, there was ELMo.
So, there's been a long history of Sesame Street-themed characters
that we kind of, I think, ultimately create in this community.
And it's been a lot of fun.
(11:40):
There's been ERNIE after BERT.
But, to set the stage here, you know, we developed this paper in 2019.
So, this was pre-release of anything close to ChatGPT.
Several papers, I mentioned BERT, had just been published demonstrating the
utility of self-supervised learning.
So for those who are unfamiliar, this is the idea that you could pretrain a
(12:03):
model on a ton of unlabeled text data by training the model to predict a word from
the surrounding words in the sentence.
And in doing so, the model learns to encode the relationships in that text.
And then that pretrained model can then be used for a number
of different downstream tasks.
So, these BERT models were really powerful, but they were largely trained on general
(12:25):
domain knowledge from the Internet.
And we saw all these papers and thought we really need to develop
a model that is adapted to clinical text, specifically clinical notes in
the electronic health record.
And there's a number of reasons, really, why you want to do this.
I'm sure many of the clinicians know that these notes contain medical
(12:46):
terminology and abbreviations.
They have incomplete sentences.
The text is often semi-structured, often written using note templates.
Interpreting the information in those clinical notes, such as the meaning of
lab values, requires domain knowledge that may not be present in Internet data.
And then furthermore, the patient presentations in EHR data don't
(13:09):
necessarily look like the classic presentations in medical textbooks,
or the atypical presentations that you might see in PubMed case studies.
So, there were a number of reasons why we thought, okay, we really need to
adapt this model to clinical data.
And I think the interesting part of this work was not necessarily the methodology,
(13:31):
but the fact that we were able to publicly release this model on the Hugging Face
model hub and make it available to the community, where we demonstrated as
part of the work that these specialized clinical models outperformed their
general domain counterparts and then made it available for others to use.
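To make the masked-word pretraining objective and the public release described here concrete, here is a minimal sketch, not code from the paper; the Hugging Face checkpoint ID is assumed for illustration, so swap in whichever clinical BERT model you actually intend to use.

```python
# Minimal sketch (not from the paper): load a publicly released clinical BERT
# model from the Hugging Face hub and run the masked-word objective described
# above. The model ID is an assumption for illustration.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="emilyalsentzer/Bio_ClinicalBERT",  # assumed checkpoint ID
)

# The model predicts the hidden token from its surrounding clinical context.
note = "Patient admitted with chest pain, started on [MASK] and aspirin."
for prediction in fill_mask(note, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```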
(13:51):
That's awesome.
I want to do a little bit of how the sausage gets made here because I think that
there's a really interesting story.
So, like just to be transparently flattering to you, I think that this
has been a hugely impactful paper.
And when I go over to Google Scholar, as of now, it's been cited 2,378 times.
So, it's received a tremendous amount of attention.
(14:12):
I know that a lot ofpeople use this model.
I think that what may be surprising forclinical listeners is that it wasn't
published in like a traditional way.
So, I think that this was published at aNAACL workshop, if I remember correctly.
And so, like not in a big fancy journal.
And it really is kind of like a public good in the sense that you have
trained this model for others to use.
(14:33):
Could you talk a little bit about the sort of publication story behind this
and how the community has built on this?
Because I do think there's an interesting, like, resource use message here.
Yeah, I think that, in many ways, it was really good timing, where we had
seen this paper come out, and really wanted to make, I think really the
motivation from the start was to make a resource available for the community.
(14:57):
This was my first exposure to open source and the open-source community.
But a combination of Hugging Face really just getting started at that
point, which made model sharing of these kinds of models really easy, combined
with just the huge leap in performance we saw with BERT-based models, that I think
led to this model getting a lot of use.
(15:19):
I guess some back story
there, too, is we wrestled a lot with whether we could actually
make this model publicly available given that it's trained on EHR data.
Ultimately, we talked to the team at MIMIC, where the data the model was
trained on comes from, and decided it was okay because it's an encoder-based model,
meaning you're learning to represent text without necessarily
(15:42):
generating text, but that was a serious discussion point when we
were first releasing the model.
Awesome. Yeah, I think I was still a postdoc, right,
when you were working on this. I remember, like, hearing how this all came together.
So, I remember the origin stories of this paper fondly.
I think that transitions naturally into your next paper that I'd like
to talk about, which is, "Do We Still Need Clinical Language Models?"
(16:04):
And I think it'd be fair to say that your BERT model
is a clinical language model in that it's a model that is specifically
trained on clinical text data.
The question that you're posing here is, does GPT-4 obviate the
need for anything specialized?
And that's like the question under investigation here.
So maybe you could walk through how you looked at that and
what your conclusions were.
(16:25):
Yeah, sure.
So, yeah, we, like many other people in the field at the time, were thinking,
gosh, do we actually still need these models that we had created earlier?
Does ChatGPT just obviate everything and do everything for us?
So that was really the focus of this paper.
And in particular, we wanted to take the perspective with this work of a
(16:47):
potentially resource-constrained hospital
that wants to leverage the benefits of language models using
as few resources as possible.
So, hospitals typically have a few options for leveraging language models.
They can create a specialized clinical model through pretraining on their own
data, like ClinicalBERT, for instance.
They could fine-tune or, you know, further train a publicly available language
(17:11):
model, or you could do a prompting-based approach with a general language model.
So, we asked several questions with this work.
First of all, do specialized clinical models outperform these general
domain models of comparable size?
Can the specialized clinical models outperform larger general domain models?
(17:31):
And does training on clinical data actually produce more
cost-effective models?
And then finally, do these specialized models actually
outperform prompting-based approaches?
So, methodologically, we decided to try to hold as much as we could constant.
So, fix the model architecture, and then compare general and clinical
language models of varying sizes to try to answer these questions.
(17:55):
So, the real takeaways from this work were that we found that pretraining
on clinical text allows for smaller, more parameter-efficient models
that can either match or outperform these much larger language models
trained on general domain text.
So, you could get better performance with a clinical language model that is
(18:16):
3.5 times smaller, for instance, than a larger general domain model.
And then we also found that the computational cost, again, of training
these models, these clinical models, is much smaller to achieve the same
performance as a general domain model.
And then finally, we looked at what if you only train these smaller clinical
(18:37):
models on a handful of examples? How does that compare in performance
to these prompting-based approaches? And it turns out that even by only
training these models on a handful of examples, you can actually surpass the
performance of prompting-based approaches. Now, I do want to caveat that the tasks
that we looked at with this paper were all like classification-based tasks, or what
(18:58):
is called extractive question answering, where you try to identify the span in
the text that answers a given question. These were not generative tasks.
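As an illustration of what extractive question answering means here, a minimal sketch with a generic Hugging Face question-answering pipeline follows; the checkpoint and the clinical snippet are placeholders rather than the models or data evaluated in the paper.

```python
# Minimal sketch of extractive question answering: the model selects a span of
# the input text rather than generating new text. The checkpoint is a generic
# SQuAD-trained model used only for illustration.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

note = (
    "Patient is a 62-year-old woman admitted with community-acquired "
    "pneumonia, started on ceftriaxone and azithromycin."
)
result = qa(question="What antibiotics was the patient started on?", context=note)
print(result["answer"], result["score"])  # answer is a span copied from the note
```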
But I think the takeaway for me with this work was that before you immediately turn to
the largest, latest and greatest model, consider what smaller, more specialized
clinical models can do. We have one more paper that we want to discuss, but
(19:25):
I kind of want to, like, do a little interlude here because I realized that we
stopped your gradient descent, personal gradient descent procedure a little
prematurely. You are now faculty at Stanford, in biomedical data science.
Am I getting, that's the correct department?
Yes.
That is correct.
So, I think like one thing I'd like to ask you here is like, you and I worked
together a little bit during my postdoc when you were a grad student, and we
(19:47):
were working on the USMLE problem.
And we were like, way under-resourced to be able to do that,
it turns out in hindsight.
And I think that you have in your recent work shown how there's
still a place for small models.
But now that you're, like, starting your own lab at Stanford, how do you
think about what projects to work on
given how quickly everything changes? Again, some of
(20:10):
the work we've done, we were doing a ChatGPT paper right when it came
out and then they released GPT-4.
And we had to redo all the results.
So how do you think about what's a six-month durable question
that you can ask in this space?
Yeah, that's a great question
and something I think we're all wrestling with a little bit right now.
I try to make sure that any of the projects that we work on aren't dependent
(20:35):
on a specific version of a specific model, especially one that is closed source.
I'm particularly interested in this class of models where they're not
only open source, but the training data itself is also made publicly available.
And so that lets you do a number of interesting experiments where you can
try to tie the behavior of the model to the data that the model was trained on.
(21:00):
You always run into the challenge there of someone coming to say,
oh, but maybe that doesn't apply to these much larger language models.
But I find that that kind of work potentially leads to more
generalizable conclusions.
Do you think it's fair, Emily, to say that you're trying to understand
something about how these models work, when they fail, how brittle they are?
(21:22):
That it's about the kind of science of the models themselves, and of course
some engineering tasks along with that, as opposed to just staying in the
benchmarking and what is the latest and greatest current performance of
the new incarnation of the models.
I think that's fair.
I think there's a need for both.
Like we need,
(21:43):
even on the evaluation side, better methodology to be able to evaluate
these models in a scalable way.
And I think there is a role for thinking about how the models are actually deployed
in real world clinical settings, where you're guided by the actual workflow
that these models will be evaluated in.
(22:03):
But personally, I think at least part of that research,
to your point of looking at
the science of these models, is really important.
Right.
And I think even for that task of figuring out how to deploy the models and where
in the workflow they should be, how they should collaborate with humans,
my sense is that the most interesting papers that we would want to work on
are about trying to find durable sort of scientific takeaways, insights that are
(22:29):
likely to hold up even after the next incarnation of the model comes out or the
next version of the model emerges, right?
Like, I think what Andy's getting at is that the pace
of just performance gains has been so rapid that previous benchmarks
or evaluations that we thought were likely to last for many years or
(22:49):
be aspirational targets very far away have now been saturated, right?
And they've been saturated in a way that makes it actually hard and
challenging to even plan as someone who is running a lab or
someone directing a team, but I think they are less of a problem if you're
(23:10):
focused on trying to do the science of either the collaboration of how these
models work or where they fail, or even what nuanced evaluation looks like.
But basically, I think Andy actually said this, even when we were starting
NEJM AI, I give Andy a lot of credit for articulating this early and very clearly,
just something along the lines of even what we're interested in publishing
are things that are durable, right?
That they're going to last longer, that are going to be relevant even after the
(23:33):
next incarnation of the model comes along.
That's a very nice way of saying that I was unrealistically combative.
That was a very gracious interpretation of that.
Yeah.
Yes.
So, Emily, I want to transition to another one of your papers.
This is the last paper we'll talk about before the lightning round.
And this one is, I think, also broadly NLP related, but studying a different
(23:55):
topic, and a very important topic.
And this one was published earlier this year.
So, this is a paper that
you led that's published in Lancet Digital Health.
The title of the paper is "Assessing the Potential of GPT-4 to Perpetuate
Racial and Gender Biases in Healthcare: a Model Evaluation Study."
And so maybe, I think the title is quite informative, but maybe just
to get started, you can tell us about the motivation for the study,
(24:19):
the backstory, how you got started, and what question you're really
trying to answer with this paper.
Sure.
So again, to set the stage here, at this point, I was doing my postdoc at
Brigham and Women's Hospital and had gotten involved in operational work
within the hospital, where Brigham, as well as many hospitals across
(24:40):
the country, have been
starting to deploy language models into the clinic at really,
I think, unprecedented speed.
I think much of the current focus has been on automating administrative
tasks, but many clinicians also envision using these language models
for clinical decision support as well.
There was actually a survey that came out recently where
(25:02):
something like one in five respondents reported using generative AI
tools in their clinical practice.
And of those, like 28% said they were using it actually to
suggest a differential diagnosis.
So, we saw all of this uptake in language models.
I have to say, just based on my straw poll and sample of
residents and clinical colleagues, that sounds totally believable.
(25:24):
I think it might even be higher amongst people I'm talking
to at the hospitals here and residents at the Harvard hospitals.
So, it's unsurprising.
It's already being used in differential diagnosis regularly.
Yeah, completely agree.
So, we've seen all of this usage of these models as part of clinical decision
making, but we knew from all of this substantial research in the past that
(25:46):
language models can encode societal biases that are present in their training data.
And so, our goal with this work was really to evaluate whether a language
model perpetuates those racial and gender biases, specifically focused on a number
of different clinical applications.
We focused on GPT-4 because it's been one of the most popular models for the
(26:06):
clinical community, and we looked at how the model performs for a medical education
task, for diagnostic reasoning, for clinical plan generation, and for this
sort of subjective patient assessment.
So, I'm happy to go into the details about some of those, if that's helpful.
Yeah, no, that'd be great.
So, I, I think there, the case series that you used as sort of
(26:28):
the primary focus of the paper were these NEJM Healer cases, right?
So maybe you could tell us about that data set and like what the
cases look like and what that initial task was
and then how you varied it to study racial and gender biases.
Sure, so we use these cases from NEJM Healer.
NEJM Healer is a medical education tool.
(26:49):
It will present an expert-generated case, and then it allows medical
trainees to compare their differential diagnosis for that case to
some expert-generated differential.
And so, we wanted to look to see if we provided these cases to GPT-4 and we
swapped out either the race or the gender of the patient mentioned in that case, how
(27:13):
well would the model perform in actually identifying the correct diagnosis
for that case.
And something that we did was select cases that actually should have a
similar differential by race and gender.
So, for example, we would exclude cases of lower abdominal pain,
which you would expect should have a different differential for female
(27:36):
and male patients, for instance.
What we found from this work was that when you change the demographics for these
cases, it affected the model's ability to provide the correct diagnosis in
37% of the cases that we evaluated.
I think there's this one case that is really striking
which was a case of pharyngitis in a sexually active teenager where the
(27:59):
model got the diagnosis, which is mono, correct for 100%
of white patients, but only 64 to 84% of the minority patients, where
it opted to rank the STD, gonococcal pharyngitis, first instead in those cases.
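To make the counterfactual setup concrete, here is a minimal sketch of the kind of demographic-swap evaluation being described, not the study's actual code; the vignette, the demographic lists, and the GPT-4 call through the OpenAI client are all illustrative assumptions.

```python
# Minimal sketch (not the study's code): hold the clinical vignette fixed, swap
# only the stated race and gender, and check whether the expected diagnosis is
# ranked first. The case text, demographic lists, and model name are assumptions.
from itertools import product
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CASE_TEMPLATE = (
    "A 17-year-old {race} {gender} who is sexually active presents with three "
    "days of sore throat, fever, fatigue, and posterior cervical lymphadenopathy. "
    "List the three most likely diagnoses, most likely first."
)

def first_diagnosis(race: str, gender: str) -> str:
    prompt = CASE_TEMPLATE.format(race=race, gender=gender)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().splitlines()[0].lower()

# Tally how often infectious mononucleosis is ranked first for each variant.
for race, gender in product(["white", "Black", "Hispanic", "Asian"], ["man", "woman"]):
    top = first_diagnosis(race, gender)
    print(race, gender, "mono ranked first:", "mono" in top)
```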
Do you think the behavior of the model and the biases and problems associated
with how it performs and how it can be steered, is this largely
(28:23):
a problem with the training data, diversity, or bias in the training data?
Or do you think these types of biases enter into other
phases of model development?
It's a good question.
I think most, this is speculative, but I would imagine that most of the bias is
actually coming through the training data.
(28:45):
We know that even the, for instance, medical education cases that are in
the literature are themselves biased.
So, the model is learning to pick up on these co-occurrences that
we're seeing in the training data.
Now, there is another potential source of bias that hasn't really been investigated
that much yet, which is the human preference training process as well.
(29:07):
So, there's this notion of reinforcement learning from human feedback or other
kinds of ways of training these models.
And right now we actually don't have a lot of insight into who are the people
who are generating these preferences.
What are the kinds of tasks that these preferences are being evaluated on?
And the whole goal of that process is to try to steer the model
(29:29):
to produce favorable outputs.
But you have to ask, favorable to whom or by what standards?
And I think that is very much an underexplored area.
Yeah.
I think this whole area is like, so important, but so difficult because,
like, when something like this surfaces, I could think, okay, well,
it's read too much of the Internet and therefore that's why it's biased.
(29:52):
Or okay, the human annotators that they did use to do the
alignment had systematic biases.
Or I think it could be like structural bias in the way that doctors are
educated or in the, you know, the medical data that the model has seen.
And so like, I don't
have a sense of like nihilism here.
But I kind of, like, when I see something like this, like I don't
(30:13):
have an immediate, okay, well, here's how we fix it, because it could be
coming from so many different places.
So, I wonder how you think about the difference between
description and prescription here.
So, like there are descriptive studies that can at least articulate what
the problems are, but how then do we actually think about fixing them?
Yeah, it's a good point.
I think we first need the descriptive studies before we can
(30:34):
address the prescriptive studies.
There are a couple of ways that the NLP community has tried to address this.
One is by selecting your pretraining data in a better way.
That is much harder to control.
There's also another class of approaches, where you change the loss function that
your model is actually trained on.
(30:55):
And then I think there's this third category, which I'm particularly interested in, which is, at
inference time, how can you change the probabilities that the model is outputting
to change the prediction of a model?
So, we as end users who don't necessarily have exposure to the entire pretraining
(31:15):
process, what can we do at the end when we're using these models to try to de-bias
the model for a different prediction task?
I think it's much easier to think about this in the context of a
particular application of a model rather than trying to de-bias any
potential use of these models.
And that's why I think we often talk about how these models may be used in a human-
(31:39):
in-the-loop process, but I think it's very unlikely for individual clinicians
to be able to spot these biases when they're only looking at individual cases.
Part of what I'm arguing is that we need targeted fairness evaluations
for each use case of these models, even if they can be used, you know,
for a number of different applications.
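As a rough illustration of that inference-time idea, here is a minimal sketch, assuming you already have per-label probabilities from a model and group-specific bias offsets estimated on a separate fairness audit; the labels and numbers are made up for illustration and are not the paper's method.

```python
# Minimal sketch of one inference-time adjustment: estimate, on an audit set,
# how much a model over-predicts a label for a given demographic group, then
# subtract that offset in log-probability space before taking the argmax.
# All label names and numbers here are illustrative assumptions.
import math

# Hypothetical offsets learned from an audit set: positive values mean the
# model over-predicts that label for that group.
BIAS_OFFSETS = {
    ("Black", "gonococcal_pharyngitis"): 0.8,
    ("Hispanic", "gonococcal_pharyngitis"): 0.6,
}

def debiased_prediction(probs: dict, group: str) -> str:
    """Re-rank label probabilities after removing the estimated group offset."""
    adjusted = {}
    for label, p in probs.items():
        log_p = math.log(max(p, 1e-9))
        adjusted[label] = log_p - BIAS_OFFSETS.get((group, label), 0.0)
    return max(adjusted, key=adjusted.get)

# Example: the raw output ranks the STD first for this group; after the offset
# is removed, mono comes out on top instead.
raw = {"mononucleosis": 0.40, "gonococcal_pharyngitis": 0.45, "strep_pharyngitis": 0.15}
print(debiased_prediction(raw, group="Black"))  # -> mononucleosis
```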
(32:00):
That actually gets exactly to the sort of final question I want to ask you
about this paper, which is there's work happening to try to mitigate these biases.
There are new incarnations of the models that are being built, and
then we're exploring various scaling laws still, right, as a community, at
both training time and at test time.
But in this sort of intermediate or the immediate time, right?
(32:23):
We also just talked about this a moment ago, these models are already being used.
And I think they're being used potentially and likely at massive scale, right?
Today by both patients and clinicians.
And so, I'm wondering if you can maybe just direct a few
concluding thoughts, particularly for clinicians.
(32:44):
So, from this paper, what do you think are the key clinical takeaways?
Takeaways for clinicians as they're using these models, as they are writing either
hard cases or their own experiences or their own medical data, and they're
looking to GPT-4 and to other models for advice. What do you think the paper's sort
of key takeaways are for that crowd?
(33:05):
I think if you're a clinician using these models, it's really helpful to
just put your mind in the context of what these models are actually trained on.
So, we know that these models are trained on Internet data. That could
be Reddit data, that could be Twitter, or any sort of social media data,
that could be random blog posts.
(33:26):
And so as a result of that, these models are learning all of those associations.
So, any sort of bias that is present could be present in these outputs as well.
So, I would approach these outputs with a healthy dose of skepticism, and
especially related to more sensitive areas of how these models should be used.
(33:49):
And I'll also just remark that this work and others since it
have focused on evaluation of these models on structured outputs.
We focused on diagnosis, which is a structured task.
We also focused on evaluating specific demographic groups, but
there is very little work right now in terms of evaluating uses of
(34:12):
these models that produce generative
outputs, or looking at potential populations where that population may
not be defined by a specific group.
So, for instance, like health literacy, how could that impact
performance of these models?
And so, I'm at least not aware of any
audits for these real-world clinical applications that are being deployed.
(34:34):
For instance, tools like the drafting of responses to patient messages,
or Epic is releasing a tool to summarize patients' medical records.
So, I think clinicians should recognize that that kind of evaluation hasn't
happened yet, and again, approach these with a healthy dose of skepticism there.
Awesome.
(34:54):
It's time for the lightning round.
So, the first lightning round question, Emily, I'm going to give you a rare
exemption because it's going to require some setup for you to be able to
(35:16):
answer it in a coherent way. You'll understand why once I ask it, okay?
What did your experience with a new pair of blue jeans teach you about the
challenges of automatic diagnosis?
Wow.
I was just thinking about this, Andy.
So, context, because Andy likes to bring back shameful moments.
(35:41):
Many years ago, when I was still a Ph.D. student in Zak Kohane's lab, I approached Zak,
you know, somewhat timid at the time, because my fingers had turned blue,
and I was thinking to myself, you know, this has been happening for a couple
days in a row, you know, Googling, figuring out what could be going wrong.
(36:01):
I thought, oh, there's a syndrome called Raynaud's syndrome. Maybe I have
that. Let me talk to Zak, you know, he has a lot of connections in the area.
Am I going crazy? And he, he looked at my fingers and he was like, yeah, that's
kind of weird, I can, sure, I can put you in touch with a clinician friend
of mine. And then, it wasn't until after that connection had been made that
(36:23):
I realized that also during that time I had just gotten new blue jeans, and
those blue jeans had blue dye that was apparently coming off onto my fingers.
And so there was actually this totally unrelated reason out in the world
of why my fingers were turning blue.
So actually, Andy, in preparation for this call, I put that case into ChatGPT.
(36:47):
And not knowing that you were going to ask this question, the model did not do
very well at determining that it could be due to my blue jeans and their dye.
I actually tried to do the same thing, but I couldn't find the picture.
And I feel like if I just described it, I wouldn't describe it correctly.
But to me, that was like such a salient example of the sort of like long
tail of diagnostic tasks that we might like want a human to do or an AI to do.
(37:12):
I thought what was amazing is that you put it out on Twitter and you got responses
from like world-class rheumatologists
and like diagnosticians.
And I think no one said blue jeans.
Like it was like, so I actually thought it was like an amazing test
of like human diagnostic ability.
And like, what are these like great, like teaching moments and like
what the, the edge cases might be.
(37:34):
I think the fun thing too, it's kind of reverse causality
because my fingers were cold.
And so that's why I put my fingers near the blue jeans.
Um, so yeah.
I will say I got it right, but that's only because I had seen it happen to
Kristen, like the week before, like literally the week before, I had
seen the exact same thing happen.
I was like, did you get a new pair of blue jeans?
(37:54):
But again, like I think about that case all the time when we're
thinking about AI and medicine.
So, this was just a complicated case where Andy got the diagnosis right, and
most of the human experts missed it.
Yeah, so, Rogers, Crick,and Diagnose, why?
All this setup!
Thank you.
(38:14):
Oh man, was Andy actually pointing this out, how you figured out that it was
the blue jeans and not anything else?
I don't remember.
I think you had already figured it out by then.
I think I said it on Twitter or something, and you or Sam Finlayson sent me a
message with a winky face or something.
Like you guys had already figured it out for sure.
(38:34):
Wow.
Amazing.
Alright, Emily.
The next question is, if you weren't in academia, what job would you be doing?
Ooh, good question.
Alright.
This is still academic adjacent, but a very different field. During college, I
spent a summer in Monterey, California.
Stanford has a marine station there.
(38:57):
I was doing research related to disease modeling on a
network, very computational, but a lot of the other students
at the marine station were all doing like marine biology where they got to go
scuba diving for their work every day.
And I think that would be a pretty cool job of just getting to be out
in the water as part of your work.
(39:17):
Nice.
This one is a reflective question.
So what is the thing that you have changed your mind the most
about since you were younger?
Ooh, that is a good question.
Okay, two, two things.
One, I used to hate olives.
Now, I love olives.
(39:40):
Two, you know, going back to the decision about whether or not to go to
medical school, one of the reasons why I decided I didn't want to go ultimately
was because, oh, I thought I wouldn't really like interacting with patients.
I was like, you know, quieter, like I'll, I'd rather just think
about the decision making process.
(40:02):
And then as part of HST, we did these clinical rotations where you had a
lot of time to talk to patients and I ended up loving that experience.
It introduces you to people who would be totally outside of your bubble.
You get to learn about their stories and that was a really humbling experience.
What changed with respect to olives for you?
Like what bit flipped there?
(40:25):
That is a good question.
I think I was having bad olives maybe.
Had them on pizza and with other flavors and it just totally switched.
Alright.
So I think it's funny that I get
to ask this question.
I did not write it; either a language model or another type
of intelligence, like my co-host, wrote it, but I'm entertained that
(40:48):
this is the question I get to ask.
Emily, which do you prefer, Harvard or Stanford?
Oh, that's not fair.
I'm gonna plead the fifth on that one.
Alright, fine.
I love Zak.
Yes, fine, fine, we'll allow it.
Yes, there was a misaligned AI, Andy Intelligence, that wrote that one, so.
(41:09):
They're both great.
They're both great.
And great people at both institutions.
Next question.
If you could have dinner with one person, dead or alive, who would it be?
Ooh.
Alright, I'm gonna take a cop-out answer and describe an imaginary,
or like a fictional person that I would like to have dinner with.
I think it would be really fun to have dinner with both Sherlock Holmes and
(41:34):
Willy Wonka for different reasons. Like Willy Wonka, the creativity
of that, that world, being part of that world would be really fun.
And then Sherlock Holmes, like I am very curious what he would
intuit from our conversation.
If you were to round it out and add a third person, who from
The Wheel of Time would you add?
Oh, I mean, Egwene.
Obviously, Egwene.
(41:57):
I don't even know what you guys are talking about right
now, so I'm just gonna say it.
Alright, last question of the lightning round.
Will AI and medicine be driven more by computer scientists or clinicians?
I unfortunately don't have a hot take for you here, in that I think
it has to be the combination of both.
Or, the cross-trained scientists.
I think the HST program
(42:18):
really taught me that, especially as someone more on the computer scientist
side myself, understanding what clinical workflows look like, understanding
that data generating process has been really valuable to think about how I
shape what kind of questions I ask.
And then on the flip side, for clinicians to understand what is
actually possible from a technology perspective is, is really useful.
(42:42):
Congratulations on passing the lightning round, Emily.
Whew!
We threw you some curveballs there, so you did great.
Okay, so now we're going to zoom out and ask you some big
picture questions to wrap up.
So, we touched on this a little bit before, but I just, like, want to kind
of explicitly drill down on, as an academic, as someone who's focused on
(43:03):
clinical research questions, how do you feel about, beholden is probably the
wrong word, but the fact that there's a
few labs on the planet who are pushing
the frontier of AI, and a lot of what happens in academia for us is that we are
Is that something that we shouldhave mixed feelings about?
(43:25):
Or is this just a new like researchparadigm where they're building the
big sort of like microscopes and we'reusing the microscopes to ask a different
set of questions than methodologistswould have previously asked?
Yeah, I think it's this paradigm makes itreally challenging for us as a community
to understand what the generalizableand lasting contributions are.
(43:48):
You know, anytime I'm a reviewer, that's kind of the, the hat I have on,
especially looking at when you're leveraging models that are closed source.
I think a lot about what is the role of academia in this landscape, and we
touched on this a little bit earlier, but we need careful evaluation of these
(44:09):
models for real clinical use cases.
And we also need methods to better evaluate in a scalable way.
I also think that if we consider these models as black boxes, thinking about
development of methods that allow you to quickly tailor these models
(44:29):
to particular clinical settings as well is also really interesting.
And then I think, broadly, as a community, I think we should
encourage more transparency in the training data for these models.
There was actually a new law that was recently passed in California
that says that from January 2026, developers of AI systems must
(44:52):
publicly post on their website certain information, at least, about the
data used to train these systems.
I think it's unclear exactly how much of that data will be revealed
because there's a real business case.
But I think as a community, can we put pressure, at least, and figure out what
is the most useful information about the training data that needs to be revealed
(45:14):
for us to do these clinically useful evaluations and to help us understand
what are the associations that the model might be learning and how
that impacts their downstream use.
Got it.
And then finally, I mentioned this earlier, but encouraging
research using open models as well.
Like, there's an example of Dolma, which is an open-source dataset.
(45:37):
And a corresponding model called OLMo, which is trained on that data set.
So, leveraging resources like that.
So that's great.
And like a follow-up question to that is that there are some indications
that performance may be saturating.
So at least the benchmarks we use to test models like GPT-4 and Claude,
(45:57):
that there's diminishing returns for the current training paradigm.
Do you think that that's happening?
Or do you expect like GPT-5
to be significantly better than what we get with GPT-4?
Like, are we saturating the scaling curve that we've been on by training
bigger models and more data?
I think the new paradigm, for lack of a better word, for inference
(46:18):
time scaling is interesting.
I also think, though, that we still don't have the best evaluations to
actually probe the understanding of these models in a way that is not prone to
training on the test data, for instance.
Especially in the clinical domain, I think we've only scratched the
(46:39):
surface in terms of evaluation.
Most evaluation is on synthetic cases, as opposed to real-world cases.
Time will tell when we actually have better evaluations to probe the utility
of these models. I guess final question for me before I hand it off to Raj,
I think what has been exciting for academics and folks, you know, not
at one of these frontier labs is the evolution of the open-source ecosystem.
(47:04):
So, you know, Meta has gifted us in some sense with a $5 billion gift
in the Llama 3 models, because that's putatively how much it costs to train
those large language models.
And we can use them in a relatively unrestricted way.
Is your sense that open source will eat closed source, in that, like,
it will be hard to keep up with how vibrant the ecosystem is?
Or do you think that closed-source models, because they're so well resourced,
(47:27):
are forever going to be in front of the open-source community?
I think generally that the open-source ecosystem will continue to stay just
behind closed source, but continue to stay just behind it and catch up with it.
I think it will ultimately come down to what resources and pretraining
data you also have license to, as another key component.
(47:51):
I do want to comment a little bit on the medical open-source AI ecosystem,
which is a much smaller subset.
Right now, all of the medical generative models are trained on
datasets like PubMed, textbooks, or other sources of medical knowledge.
And there are still few generative
AI models that are trained on clinical notes.
(48:13):
And as we discussed earlier, you know, clinical notes are very different from
the text in PubMed or in textbooks.
One reason for this is due to privacy concerns, right?
Our algorithms for de-identification aren't perfect and sharing
data across hospitals is hard.
If I had to speculate, I think we will see in the future clinical generative language
(48:36):
models trained on synthetic data and released into the open-source community.
We already have that a little bit with GatorTronGPT, which is the
closest example I've seen to this.
Amazing.
Emily, last question.
And it's actually a pair of questions for you.
Just thinking ahead for the next five years or so, what are you most optimistic
(49:00):
about and what are you most
pessimistic about, about the use of AI in medicine?
Alright, I'll start with the most pessimistic,
to end on an optimistic note. I'm overall an optimist, but I think to reflect
back to our earlier conversation, what I worry about the most is automating
the biases or propagating errors in medical data into these models.
(49:22):
I don't think we have the appropriate evaluation yet to
fully understand the harms of
using these models at a larger scale.
And the concern is that we won't actually detect them until they're already baked
into our systems. So just to give you an example of this, there are a number
of companies trying to pilot tools to summarize medical records, but I think
(49:45):
there's a lot of nuance in terms of how to do this correctly. All of the medical
data that these tools are summarizing has copy-and-paste errors, different sources
within the EHR have different levels of trustworthiness, all of that sort of
information that I'm not aware of being currently baked into these existing tools.
(50:05):
I'll give you another example where we've been evaluating language models
for diagnosis tasks that leverage gene names, and we found that the
model performs a lot better when the gene names are gene symbols
compared to Ensembl IDs. So, for those who are unfamiliar, gene symbols often
have A through Z characters in the name.
(50:25):
Ensembl IDs are just a string of numbers, and language models are
notoriously bad at representing numbers.
So, it turns out if you use the Ensembl IDs, models get worse performance.
I mentioned that as one example that there are many different potential
gotchas that require really careful, rigorous evaluation that I think
we've only scratched the surface on.
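A small illustration of the kind of input normalization that gotcha suggests, not anything from the episode: map Ensembl IDs to gene symbols before building the prompt. The hard-coded lookup below stands in for a real annotation resource.

```python
# Tiny illustration (not from the episode): convert Ensembl gene IDs to gene
# symbols before prompting, since language models tend to handle symbols better
# than long numeric identifiers. The lookup table stands in for a real mapping.
ENSEMBL_TO_SYMBOL = {
    "ENSG00000139618": "BRCA2",
    "ENSG00000141510": "TP53",
}

def normalize_gene(identifier: str) -> str:
    """Return a gene symbol if we recognize the Ensembl ID, else pass it through."""
    return ENSEMBL_TO_SYMBOL.get(identifier, identifier)

genes = ["ENSG00000139618", "ENSG00000141510"]
prompt = (
    "A patient carries pathogenic variants in "
    + " and ".join(normalize_gene(g) for g in genes)
    + ". What conditions should be on the differential?"
)
print(prompt)  # mentions BRCA2 and TP53 rather than raw Ensembl IDs
```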
(50:46):
And I think our appetite and the possibilities in this space are outpacing
our ability to conduct these evaluations
in a rigorous way.
And then going back to your question on optimism, a couple
answers to this question.
From an application perspective, I'm really excited about the idea
of language models enabling patients to be more active in their care.
(51:08):
I know you all recently had a mom and doctor on the podcast, where
the idea was that you can have language models as an advocate in your care.
I've been very fortunate to have clinicians in my family, which has been
a huge privilege, where they can come to any meeting and say, are you sure
you haven't thought about X, Y, and Z?
They serve as advocates.
Can we imagine what a language model-based advocate, when you're in a room,
(51:32):
would look like for enabling patients?
And then I think an underappreciated application of language
models is phenotyping.
I think there's a lot of opportunities for these models to change the
way we define cohorts, perform medical research in the future.
I'm also, from a research perspective, excited about what are the questions
we can ask now that we have audio recordings of patient visits?
(51:56):
As an NLP person, I am very aware of the idea that we're looking at
this data through the lens of what a clinician has already interpreted and
what billing processes already require to be documented.
So, these audio recordings give us this really unique insight into what the
patient is actually saying in the room.
And then finally, there's an increasing number of training programs
(52:19):
in this space that really recognize the need for interdisciplinary training.
HST is one of them.
The AI in Medicine program at Harvard is another.
The Computational Precision Health program at Berkeley and UCSF is another.
And I think those are training the leaders that we need in this space to drive
this technology further in the future.
Amazing.
(52:40):
I think I said that would be my last question, but your answer
was so interesting that I have to ask one more, and this, I promise,
will be the final question.
So, you know, you just started a lab, right?
You just started your lab at Stanford, new professor.
Congratulations.
I think it's, it sounds like it's going very well.
How do you think language models, GPT, you know, developments in,
(53:03):
in just the tools that we all have access to now is going to change
your job over the next few years?
And a perfectly valid answer is it's not going to change
it over the next few years.
But I'm curious, cause I'm just talking to a lot of people about this and I'm
getting very different answers, from, I can only be a professor for a few more
years, it's going to automate everything and science is going to be discovered
(53:25):
by LLMs, to, ehhh... it's going to help with my grants a little bit, but like
not really change things that much.
What's your take on how these models and how the technology
that we have access to is going to change your life as a professor,
let's say over the next five years? I think in the next five years,
I lean closer to the, it'll make me more efficient at my job,
(53:49):
but I don't see it replacing any significant aspects of my job.
I already use language models to help with careful wording of individual sentences,
or writing code, which I think is probably the most significant change, that it just
accelerates anyone's ability to learn a new programming language or new skill.
I think there's so many intricacies in terms of how these models are deployed and
(54:15):
going back again to the data generating process, that we know by knowing the data
and the application really well, and I don't see a language model replacing
that component of the research process.
Excellent.
Alright.
Thank you so much, Emily.
That was fantastic.
Thanks, Emily.
This was a lot of fun.
(54:36):
Great to chat with you guys.
This copyrighted podcast from the Massachusetts Medical Society may
not be reproduced, distributed, or used for commercial purposes
without prior written permission of the Massachusetts Medical Society.
For information on reusing NEJM Group podcasts, please visit the permissions
and licensing page at the NEJM website.