
July 24, 2025 47 mins

In this episode of APA Innovation Hour, host Dr. Manu Sharma welcomes Dr. Ethan Goh, physician-scientist at Stanford and a leading voice in AI and medicine, for a deep dive into how large language models (LLMs) like ChatGPT are reshaping the future of psychiatric practice. From groundbreaking diagnostic reasoning studies to the ethics of AI collaboration in clinical settings, Dr. Goh shares findings, challenges, and exciting possibilities ahead. The conversation explores real-world use cases, bias and accuracy in AI-generated clinical guidance, the promise of ambient documentation, and the evolving role of physicians in the AI age. Whether you're a clinician, researcher, or curious observer, this episode offers timely insights at the intersection of innovation and mental health.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:05):
Hello, everyone, and welcome to the American Psychiatric Association's Medical Mind, the Innovation Hour.
My name is Manu Sharma, and I'll be your host for today's episode.
I'm the Assistant Medical Director for Research and Innovation at the Institute of Living in Hartford, Connecticut, and I am also an Assistant Clinical Professor of Psychiatry at
the Yale School of Medicine.
I chair the Committee on Innovation for the American Psychiatric Association.

(00:29):
The main thesis of this podcast series is that technology and artificial intelligence are poised to change the practice of psychiatry over the next decade.
We want psychiatrists to be educated and prepared for these changes.
In these episodes, we will talk to innovation leaders from across the world who will give us a preview of how today's technology will change the future of psychiatric practice.

(00:53):
Large language models, or LLMs for short, like ChatGPT, are everywhere, from drafting emails to summarizing research papers.
But what role, if any, should they play in clinical decision making?
In today's episode, we explore how LLMs are beginning to influence the way we think, diagnose, and take care of our patients.

(01:14):
And it gives me great pleasure to welcome our guest on today's podcast.
Dr. Ethan Goh is a physician-scientist at Stanford, where he leads multicenter studies on how AI is used in real clinical settings.
His research has been published in Nature Medicine and JAMA Network, and focuses on building and evaluating large language models for healthcare.

(01:37):
He has been featured in the New York Times and CNN.
He also serves as the executive director for the Stanford ARISE program, which is a research network advancing physician-AI collaboration, and serves on the editorial board
of BMJ Digital Health and AI.
And having read the thoughtful work Dr.

(01:58):
Goh has been doing, I can guarantee the listeners that this is going to be a very informative hour.
So welcome and thank you so much for joining us, Dr. Goh.
Thanks for having me. Yeah, excited to chat with you.
Wonderful.
So let's begin at the beginning, as we like to say.
So again, I have been amazed at the pioneering work you've been doing at the intersection of LLMs and clinical medicine.

(02:22):
Could you start by briefly explaining what LLMs are for some of our audience, and what initially drew you to this area of research?
Yeah, maybe it would be helpful to share a bit about the current research we're doing, which, as you mentioned, has a lot to do with how doctors could be using AI tools like
large language models.
So the short version of the answer to your question about what large language models are: basically, think of it like autocomplete on steroids, right?

(02:50):
So when you go into Google and type something, it always seems to guess the next word or question or phrase you're looking for.
And with LLMs, it's just doing this, but with the immense amounts of data and compute that went into training it, so that instead of figuring out and guessing the next word or next question, it can compose full poems for you.

(03:11):
It can write full paragraphs.
It can mimic the tone of a favorite writer or singer you enjoy.
So that's what LLMs are.
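To make the autocomplete-on-steroids idea concrete, here is a minimal sketch of greedy next-token prediction. It assumes the Hugging Face transformers library and the small GPT-2 checkpoint purely for illustration; any causal language model works the same way, and this is not the setup used in Dr. Goh's studies.

```python
# Minimal sketch: a causal language model repeatedly predicts the most likely
# next token, which is all that "autocomplete on steroids" means mechanically.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The patient presented with shortness of", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                    # extend the prompt by five tokens
        logits = model(ids).logits        # a score for every vocabulary token
        next_id = logits[0, -1].argmax()  # greedily pick the most likely one
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Scaling that same loop up with vastly more data and compute is what lets modern LLMs write full paragraphs or poems rather than just the next word.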
As for what I've spent most of the past three years of my research time doing: over at Stanford,
back in 2022, I think the whole world, physicians included, was just beginning to see how chatbots could perform really well on medical exams, such as the USMLE.

(03:39):
And so how the ARISE network started was that a bunch of clinical reasoning experts and educators all came together and said, hey, you know, obviously we all know as clinicians that multiple
choice questions are not the most representative.
So when you see a new patient, they don't come in and say, hey, here's my problem, here are A, B, C, D in terms of what I could have, pick the single best answer.

(04:04):
And so that was the summary of how a lot of the initial studies started, but happy to go more into whichever direction you want to take this.
So let's start with the here and now.
So it's like, I'll admit, I use statGPT almost daily.
It's like I use it for writing emails, like answering questions, uh like helping meunderstand some complex research papers that I'm, especially around topics that I'm not

(04:31):
very familiar with.
But I still hesitate, as we were talking a little bit before we started recording, I still hesitate to use it in a clinical setting.
So as you've engaged in this research and as you've interacted with physicians, is this hesitation a common thing that you come across, or do you feel like they're very open to

(04:52):
using LLMs as a part of their clinical practice?
What's your experience been when you work with them?
Yeah, that's a great question.
I think the answer is really both, right?
I think that doctors are really open to the idea of tools that could help us, mainly because we are overworked and overstressed.
There are tons of things that we don't want to do that LLMs would obviously be amazing at.
Yet they are also very understandably and rightfully hesitant, right?

(05:16):
Because there's so much about this new technology that even the AI labs don't quite have the answers to, that even researchers don't have the answers to, right?
So if, for example, we were to ask today, how often would a patient using a chatbot encounter knowledge retrieved from good versus bad sources, that to me is still quite
unclear.

(05:36):
And that's because of how these models are trained: so much data goes into them that a lot of it is really untraceable, or less explainable.
A lot of these technical things could get addressed over time.
And I think that researchers like myself are trying to answer a lot of these openquestions.

(05:57):
And the more of these questions we can answer over time, the less hesitant and the more willing physicians, as well as health systems, will be to adopt them.
So I think that's a good segue into going into some of the papers you've published recently.
So I know you've done some randomized trials where physicians have partnered with large language models in realistic scenarios, and you've studied

(06:23):
how it affects the decision making.
Can you tell us a little bit about the study itself, its methods, and what your main findings were?
Yeah, so maybe I'll start by talking a little bit about how that was a continuation, right?
So back in 2022, chatbots were demonstrating amazing performances on the USMLE exam, right?

(06:44):
Not just passing, but crushing, I think, the median of what a medical student, a medical resident, even an attending might score on really challenging multiple-choice exam
questions.
And so what we wanted to do, we meaning, you know, we had PIs as well as clinical reasoning and medical education experts across Virginia, Minnesota, Beth Israel, Stanford, right, who came together and said, all right, you know, what could be more representative case vignettes, more

(07:12):
open-ended questions, as well as more customized grading rubrics we could use to administer these tests, right, to chatbots, to doctors aided by the chatbots, as well as to doctors using whatever they are doing today, as we like to say, which really means using UpToDate, Google search, and all of that.
And so, what these sorts of questions look like: they're all based on real cases, right?

(07:34):
Obviously, it's still quite simulated and artificial, because they are all presented quite nicely, tidied up in the form of a text vignette.
But unlike some of the earlier multiple-choice or USMLE-style exam questions, the real strength of open-ended questions is that these were all real people answering the questions, as well as physicians grading and

(08:00):
sort of deciding, right, who actually had a good response to these questions.
And what we found, obviously what got all the headlines, was that in the diagnostic reasoning study, AI alone did the best.
But what was most surprising to us was that doctors using AI like GPT-4 didn't actually do

(08:26):
better than GPT-4, the AI alone.
which is a super unintuitive finding.
So that was the diagnostic reasoning paper.
In the management reasoning paper, at least, we did find that the doctors using an AI chatbot did do about as well as the AI alone.
But once again, it didn't outperform the AI alone, right?
So obviously a lot of directions are moving towards: all right, you know, doctors are clearly not going anywhere, right?

(08:50):
Clinicians are not going anywhere.
So how do we measure and continuously improve the performance of doctors, of clinicians, using such AI tools?
I probably should also mention a bit about what we mean by reasoning, right?
So reasoning is basically the process before a clinician comes to an answer.

(09:12):
before, you know, yourself, Dr. Manu, before you decide to give a treatment plan or diagnose a patient, right?
You're probably like a detective, right?
You're probably weighing things for or against either a diagnosis or something you might want to do for the patient.
So as much as we could, we tried to design the questions and the rubrics around capturing as much of these intermediary steps as possible, because that would be the best way to try and capture

(09:38):
how physicians actually think when presented with a patient case.
Can you kind of discuss maybe using an example so that it becomes a little clearer?
So if I'm understanding this correctly, from what I read in the paper, these were clinical vignettes that were created using actual data from EHRs.

(09:59):
Or were these clinical vignettes that were available online somewhere that were presented to the physicians, and then they were asked a set of questions to test their
reasoning?
Is that how it works?
Yeah, those are all good questions.
Give me one moment.
I'm just trying to pull up one of these cases so I can give you a more complete example.
One moment.

(10:19):
Yeah, but I mean, addressing part of what you said first, right?
Obviously, a concern with a lot of these studies is that these test cases cannot be already indexed on the internet.
Why?
Because of the way that LLMs are trained, they ingest almost the entire corpus of the internet, meaning if it's published online, most likely it went into the training data, which would

(10:40):
invalidate the study, because a very obvious editor or peer-review response is going to be:
look, guys, it's obvious the AI did better because it's already seen this question. It's already seen the answer, right?
It's really seen the answer, right?
So to the first point, these cases were actually taken from what are called NEJM Healer cases.
They were, I think, originally written or produced by Beth Israel.
So they were all based on real, challenging cases; I think there were something like 100 or 200 cases.

(11:03):
We basically sat down with a bunch of investigators again to try and figure out which would be good cases to include.
This was some combination of diversity of cases and different specialties, complex enough so that not everyone would score 100%, yet not super rare, like zebras, things that physicians would never see at all.

(11:24):
Obviously, this is as much an art as a science, because it's impossible to cover the corpus of medicine within a one-hour exam or five or six cases.
So one example of a question I have right here is, so the vignette is much longer.
I'm just going to give you a summary of it.
So: a 72-year-old man with a

(11:45):
history of hypertension and smoking, presented with shortness of breath. It goes over the history, the family history, the past medical history, the medications.
And so one of the questions is going to be, for example: taking all the information you have into account, what are the three most likely diagnoses for this patient's primary
complaint of shortness of breath?

(12:06):
And it further goes down into: all right, you said this is one of the top differentials.
What are the things that support that, right, that were given to you within the vignette?
What are the things that go against it?
And some of these could also be quite complex.
So for example, subsequently, for some of the management reasoning cases, right, management is obviously a lot more nuanced than just diagnosis alone, right, because doing

(12:32):
nothing is sometimes the best action as well.
So, and I'm trying to recall here, one of the examples was a patient post-op:
atrial fibrillation, bleeding risk.
Should you or should you not start anticoagulation?
How long after should you start, right?
So as you can see, lots of nuances there.
And yeah, you know, surprisingly again, AI alone did really, really well on these sorts of vignette-based studies.

(13:01):
And so, as I already mentioned, a lot of this is moving toward, you know, how do we upskill physicians, clinicians, in using such AI tools, because clearly we want them to perform way better than an AI alone would perform.
I would say one of the challenges as well is presenting the right information, right?
Because so much information is hidden within PDF reports, hidden within EMR notes, hidden in a way that only a human can click and access, right?

(13:28):
Online, offline, right?
So that is one of the big challenges in terms of how to capture the right information to feed into the context of an AI system, if we allow for the fact that, OK, an AI, given the right information, can do quite well at coming to a lot of reasoning and diagnostic management steps.

(13:49):
Now, did you notice, I know the participants in the study were residents and then experienced physicians as well.
Or were they all experienced physicians?
Did you feel that the amount of experience, the years of practicing medicine, affected their ability to answer, or did that not play a role in your particular study?

(14:11):
Yeah, we did do a sort of secondary sub-analysis there, trying to compare the residents and attendings, right?
I believe that the results were a bit mixed.
It was not possible for us to draw any firm conclusions, though obviously there are interesting questions there, right?
Like, oh, does a resident with AI do better than an attending using AI, for example, right?

(14:36):
Or maybe not.
We weren't powered to calculate that.
And I think in that sort of secondary analysis, there were mixed findings.
And one interesting thing that I saw in one of your papers: one of the biggest concerns when you think about AI and using LLMs is that, because they're trained on all this pre-existing data that's out there on the internet, which is already filled with biases,

(15:01):
our worry is that models like this will just perpetuate that bias, even when it comes to clinical decision-making.
You looked into that a little bit in one of your papers.
Do you mind sharing what you found there?
Yeah, happy to share.
And I think that's definitely a very important question to address, right?

(15:22):
Mainly because, if we allow for the fact that data is biased, or overrepresented in some areas and underrepresented in others, then any model trained on it is going to reproduce these existing human data biases.
So that was one of the directions one of our earlier studies was looking at.

(15:42):
Specifically, what we wanted to do was similar to the earlier studies we described. How this was different was that, once again, these were vignette studies, and we asked doctors what the answers were.
But this time, doctors did it initially with no AI support.
And what they did was, they saw, there were two groups.

(16:03):
There was group A and there was group B.
So group A actually saw a black female and group B saw a white male.
And both patient actors had the same clinical chest-pain-based vignette.
And the doctor basically had to answer a series of questions based around triage, you know, risk of MI, management, and things like that.
So what happened was they initially answered by themselves.

(16:27):
And then we gave them the use of an AI chatbot.
And we were curious, right, would this, first of all, change their decision and would itmake them more biased, right?
Because maybe, you know, an AI would say, hey, let's give more resources, right, to a minority race or gender, or, for example, fewer, right?
again, because of the way that some of these models have been trained.
It was a small study.

(16:47):
It was about 50 doctors participating.
The really interesting thing to me was, well, the headline first.
Thankfully, we didn't find any sort of biased decision-making after the AI intervention.
But the most interesting thing for me was that we found that doctors were willing to change their answers, right?
Because cognitive bias and anchoring and all of this, they are real things, right?

(17:08):
Meaning to say, when you've made up an answer in your head, right, about something, you're kind of fixated on that, right?
But we did find that the doctors, surprisingly, after being presented with contrary evidence, or maybe an AI telling them something else,
actually were willing to change their minds, and doing so actually made them about 18% more accurate on average, right, in terms of, you know, how

(17:35):
correct the answer was.
So I think that was reassuring to some degree.
But to me, I think it was really interesting that they also were willing to change their answers after being presented with the AI chatbot's advice.
If I read the results correctly for that paper, it read to me like when they were not allowed to use the chatbot, they were less accurate.

(17:59):
And I think race and bias did play a role.
But after the chatbot was there, results improved.
Did I read that correctly, or is there a nuance there?
That is completely correct, right?
So all the doctors answered alone, answered with our AI initially, right?
And then we captured that score and the intervention was then they could then consult anAI chatbot.

(18:24):
And then afterwards, their scores increased, which means that the intervention helped them to become more accurate.
How we knew it didn't inject biases was that for both groups, the score improvement was the same amount, right?
Whether for group A,
or group B.
And that's how we know that, at minimum, in this instance, it didn't exacerbate any difference in treatment decisions or responses between the two groups.
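For listeners who want the arm comparison spelled out, here is a hypothetical sketch of the logic Dr. Goh describes: compute each doctor's post-minus-pre improvement and check that the gain is similar across the two vignette arms. All numbers and field names below are invented for illustration; the published analysis was more involved than this.

```python
# Hypothetical sketch: both arms saw the same chest-pain vignette but different
# patient actors. If the AI intervention improved scores by a similar amount in
# each arm, it did not introduce a new between-group bias.
from statistics import mean

results = [  # invented data: one row per doctor
    {"arm": "A", "pre": 0.62, "post": 0.80},  # arm A: Black female patient actor
    {"arm": "A", "pre": 0.58, "post": 0.74},
    {"arm": "B", "pre": 0.60, "post": 0.79},  # arm B: white male patient actor
    {"arm": "B", "pre": 0.65, "post": 0.81},
]

def mean_gain(arm: str) -> float:
    return mean(r["post"] - r["pre"] for r in results if r["arm"] == arm)

gap = mean_gain("A") - mean_gain("B")
print(f"Arm A gain {mean_gain('A'):.2f}, arm B gain {mean_gain('B'):.2f}, gap {gap:+.2f}")
# A gap near zero suggests the chatbot helped both groups about equally.
```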

(18:53):
Wonderful.
That is, given the evidence, they were equally likely to update their preferences and pre-existing biases.
At least, biases introduced through health disparities did not affect that improvement.
Yeah.
Perfect.
So just to summarize, I think the state of the evidence at this time is: in carefully curated clinical vignettes

(19:21):
that are very representative of real-world scenarios, right, in a controlled environment, AI chatbots alone in your study did better than human doctors, physicians at different levels of experience, either alone or while

(19:42):
using the AI chatbot, right?
So that's the summary at this time.
So what do you think the next...
So again, we know that the real world is very different. When I see a patient, I'm reviewing at least all their history, their charts, going into the lab sections.
Sometimes it doesn't come out as a very neat clinical vignette, right?

(20:08):
Real clinical practice is messier.
So have you thought about how we try and capture that?
Or what's the way you're thinking about it?
Are there best practices there?
Like how can we move the field forward and say, okay, fine, hey, we need the AI to help us with some more real-world scenarios?
So how do we go from where we are now to that point?

(20:31):
Yeah, no, that's a very valid sort of critique, right, that doctors and reviewers obviously ask: hey, these are all simulated.
They were simulated because that's how you test any new technology, right?
No one is going to give you the permission to test it on real patients at the start.
So as for how this field needs to or will evolve: you know, lots of researchers in the space, lots of labs, are going to increasingly use real clinical data.

(20:56):
So one example would be testing on retrospective data, what's called a retrospective validation study.
What that looks like is that you can take real pairs of, for example, e-consult PCP questions and specialist answers, run them through a system, and check how often an AI chatbot produces

(21:17):
what could be graded, again by a separate specialist, as an accurate answer to what was a real question, given that sort of context.
Lots of other ways around this, but basically, the goal is how, in a regulatory, HIPAA-compliant fashion, right, to start injecting real patient data into a lot of these sorts of vignettes.
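As a rough illustration of that retrospective validation idea, here is a hedged sketch: replay de-identified historical PCP questions through a chatbot and have a blinded specialist grade each output against the original e-consult answer. The names ask_chatbot and specialist_grade are hypothetical stand-ins, not a real API.

```python
# Hedged sketch of a retrospective validation loop over historical e-consults.
from dataclasses import dataclass

@dataclass
class EConsult:
    pcp_question: str        # the real question a PCP asked
    specialist_answer: str   # the historical specialist response (ground truth)

def ask_chatbot(question: str) -> str:
    # Stand-in: a real study would call the LLM under evaluation here.
    return "Check TSH and free T4 before adjusting the levothyroxine dose."

def specialist_grade(ai_answer: str, reference: str) -> bool:
    # Stand-in: in a real study a blinded specialist grades each answer;
    # a trivial word-overlap check just keeps this sketch runnable.
    return any(w in ai_answer.lower() for w in reference.lower().split())

def retrospective_accuracy(consults: list[EConsult]) -> float:
    graded = [specialist_grade(ask_chatbot(c.pcp_question), c.specialist_answer)
              for c in consults]
    return sum(graded) / len(graded)

consults = [EConsult("TSH 8.2 on levothyroxine 75 mcg. Adjust dose?",
                     "Recheck TSH and free T4, then increase dose by 12.5-25 mcg.")]
print(f"Graded accurate: {retrospective_accuracy(consults):.0%}")
```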
And the second way, which I'm most excited about, is obviously going to be prospective studies, right?

(21:42):
Could we design a safe, human-in-the-loop way to give a physician like yourself, right, potential use of an AI chatbot that could maybe support some of your decision-making?
Potential use of AI chat board that could maybe support some of your decision-making.
And that's to do that end to end, with a clinical expert in the loop to review the answers, edit, amend.
And that would be a way to capture things more accurately, not in a perfect way, but still a lot better than just using pure text-based, exam-style questions.

(22:10):
So that's better.
Obviously, the other one is bigger.
Bigger is always going to be: how can we run more cases, more specialties, more diverse instances?
I would say, though, that even for our studies, it was a one-hour exam, about five or six clinical vignettes, because it gets too expensive, too long, right?

(22:31):
If you wanted to say, hey, you know, Dr.
Manu, can you take 24 hours?
You'd probably say, no, right?
I don't have that time to sort of take an exam.
I'm done with exams.
I've done too many exams throughout my medical career, which you, which all of us have done.
And so a lot of how to pull off these studies, a large part, comes down to pragmatism and logistics, while considering, you know, obviously HIPAA,

(22:54):
how to pull in enough physicians to draw enough conclusions, to sufficiently power the study.
So yeah, those are, I think, some of the directions that we are going to start seeing evidence about over the next one or two quarters, which I'm quite excited about.
Now, I would also like to extend that a little bit and explore what your thought processis.

(23:16):
Now, the way I think about AI sometimes: again, it's called artificial intelligence.
So we try and humanize it a little bit.
But I think the human example is a good way of evaluating AI's capabilities as well.
For example, when we think about our knowledge and our understanding, our skill sets: we start as generalists and then we specialize.

(23:38):
Even if you just take the case of physicians, right?
Like all of them have general training through pre-med and medical school, but then they go into residency and specialize, right?
I would do horribly trying to manage a patient in the ICU, right?
But I'm pretty sure an intensivist would have an equally tough time managing an acutely agitated

(24:04):
patient with psychosis.
So there's some specialization that happens over a period of time.
Do you think LLMs will have a similar future where, of course, you have these great state-of-the-art models, frontier models, the GPTs, the Geminis, and
others.
Do you think we will need to have more specialized models to be used in medicine?

(24:29):
And if yes, or do we just wait for the models to get better, and then it won't matter?
Yeah, I think this is a fantastic question.
And it really depends whom you speak with.
So I was recently on a panel with some of the leading AI labs.
And a lot of them believe that over time, it will really be just one ring to rule them all.
Basically, one general, generalizable model is actually going to do just as well on medicine and finance and law, and so on and so forth.

(24:55):
Personally, I would almost reframe it; I would
think more from the use case and how you think about evaluating these things.
Because if you can define that well, kind of like how we focused most of our time around the grading rubrics, right?
How do we decide whether a doctor or chatbot scored well?
I almost feel that the technology part of things could be solved, right?

(25:18):
And now there are obviously a lot of exciting tools we could use, right?
Whether it's agents or fine-tuning, or open versus closed models, there are so many things to do, right?
RAG, for example.
But the hard part is really, like I said, the really labor-intensive, costly part: how to think about definitions.
OK, we see this is good output.
What do you mean by this is good output?

(25:38):
In the context of a psychotic patient, what does it mean to give him good management?
Is it that the patient doesn't get readmitted to the hospital?
Is it that he continues taking his medication and doesn't stop?
It's so hard and it's so nebulous, so much so that even if you take two psychiatrists together, that's what we found, two experts who get along, right?
It's so hard to sometimes get them to agree on some of these sort of definitions.

(26:01):
I think at a high level, they definitely know what's a correct thing to do.
A lot of it's experience and guidelines, but beyond that, there's so much judgment and nuance.
And I know I'm going slightly off on a tangent to your question, which, I think, was more from a technology standpoint, right?
Hey, you know, do we believe that we need specialized models?

(26:25):
I think it's less clear to me, at least from my point of view, which ultimately will win out.
And I would almost say I'm kind of agnostic about that, right?
Because I think that it's just a tool, a means to an end.
And if we can get the evaluation and all those parts right, that's the hard part.

(26:46):
The easier part is picking which tool to use.
And I think that you brought up a very interesting point, right?
How do we even start thinking about evaluating the outputs, right?
I know there are some brave efforts out there.
I know OpenAI just came out with HealthBench, which is an evaluation tool.

(27:08):
I know MedHELM is something that you were a part of as well.
So how are you thinking about this, as you said, right?
How do we even start evaluating where this model is pulling the information from?
Is it accurate?
Are there hallucinations?
How valid is it in its current context?

(27:30):
How useful is it?
Can you talk a little bit about evaluating LLM outputs, or at least about the efforts that you've been involved in?
Yeah, maybe starting at a high level, right?
There are a lot of great efforts, as you mentioned, OpenAI, you know, MedHELM, to produce benchmarks and evaluations, because, like I said earlier, right, clearly it's super important

(27:54):
to try and understand, you know, the performance capabilities and weaknesses of these chatbots.
I think a good question also to think about is the why.
Different benchmarks have different sorts of purposes.
So as a developer, I want maybe just a good enough signal check.
I want something that can run against my APIs in a semi-automated fashion to give me some directional signals on whether it's good or not.

(28:19):
That would be different from MedHELM, which is clearly more hospital-focused.
They took a lot of real health-system tasks and basically tried to come up with metrics, example questions to test, so that it could answer, or begin to answer, how good a chatbot is at more realistic things that are done

(28:40):
in a health system, which, once again, would be different from the benchmarks, as well as the reasons, that an AI lab like OpenAI or Google might produce.
And then that would look, again, quite different from someone like yourself.
Let's say there was a chatbot trying either to help psychiatrists like yourself do scribing or summarization, or maybe to help patients get educated.

(29:03):
I don't think at all that you would apply the same benchmarks that OpenAI has produced, or even that Stanford's MedHELM has produced for a large health system, to what you're trying to do.
And that's why I said it's important to think about the why.
What is this benchmark for?
How could it be helpful to what I'm trying to do?
And what are things that matter to me and what I want to use it for as a clinician?

(29:27):
Sometimes, with a lot of these benchmarks, I'm a fan of using them as much as you can to get what's called a signal check.
It gives you some sort of directional sense.
But ultimately, there's no getting away from the fact that, and increasingly, that's a funny thing, you start to see a lot more clinician experts like yourself, myself, psychologists, doing this really menial effort of looking through hundreds and hundreds of rows of AI outputs.

(29:52):
Because that's the best way you know for sure.
Again, there's only so much all these automated scores can tell you about how well it's ultimately doing for your specific use case.
Yeah.
So just so that I can explain, bring the level down a little bit, because some of our listeners might be wondering what we're talking

(30:12):
about.
And again, Dr. Goh, you can correct me if I'm going wrong anywhere.
So, all large language models will produce some kind of output.
So if you ask ChatGPT a question, there's an output.
Now, if you think about it in the medical scenario, if you're working with ChatGPT to figure something out
with a clinical case or something else, it'll produce an output.

(30:33):
At this time, there are not many, or at least no good, ways of evaluating how good that answer is, right?
The traditional way of doing it would be to have experts review outputs one by one, which would be very resource-intensive, and
there would be limitations to the number of outputs and answers we could have them check.

(30:57):
But then there are some automated ways of doing that as well, which are in development.
That's why I said brave efforts, because I feel their hearts are in the right place, but I think this is just the beginning.
So OpenAI, and I think Stanford, have come out with some tools where you can check these outputs in an automated way, or check the function that the LLM is

(31:19):
designed to do in an automated way.
So, yeah.
Yeah.
I mean, what it really means is they are basically question-answer pairs, right?
There are questions that are then produced and these benchmarks also have the answers.
What that means is it then becomes a lot easier, right, for a developer building a chatbot to run all these questions through the chatbot, check the answers it produces, and check

(31:42):
how close they are to what the benchmark's answer or response actually is, right?
So I think that's the goal behind some of these benchmarks.
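A hedged sketch of that workflow: a benchmark is just question/reference-answer pairs, and a developer pipes each question through the chatbot and scores how close the output is to the reference. The token-overlap similarity below is a crude invented stand-in for whatever metric a real benchmark such as HealthBench actually specifies.

```python
# Hedged sketch: run benchmark questions through a chatbot and score the
# answers against the benchmark's reference answers.
benchmark = [  # invented question/reference pairs
    {"question": "First-line pharmacotherapy for panic disorder?",
     "reference": "an ssri such as sertraline is first line"},
    {"question": "Reversal agent for benzodiazepine overdose?",
     "reference": "flumazenil used cautiously given seizure risk"},
]

def ask_chatbot(question: str) -> str:
    # Stand-in for the model under test.
    return "ssris like sertraline are considered first line"

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)   # token overlap in [0, 1]

scores = [jaccard(ask_chatbot(q["question"]), q["reference"]) for q in benchmark]
print(f"Mean benchmark score: {sum(scores) / len(scores):.2f}")
```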
And then there are obviously a lot of clever shortcuts as well that a lot of these AI labs, a lot of these research papers, could use.
What they're doing is this technique called LLM-as-a-judge.

(32:03):
What that means is that you basically get a second large language model to, quote-unquote, read all these outputs and assign scores to them.
Now, obviously, the rightful concern there is: how do we know that these LLMs are actually grading in the same way that a human expert would?
And that's a fair concern.

(32:24):
Some ways to mitigate that are taking a sample of scores and going to Dr. Manu: do you agree with the scores that this LLM proxy grader would have assigned?
And then we kind of check concordance and things like that.
But yeah, so that's just one of the methods that
some of the research teams are using to scale up and significantly reduce the amount of menial grading effort.
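Here is a minimal, hypothetical sketch of that LLM-as-a-judge pattern, including the human concordance spot check: a second model assigns rubric scores at scale, and a clinician re-grades a sample to verify the proxy grader agrees. The judge prompt and call_judge_llm are invented placeholders; a real pipeline would call an actual second LLM and might report Cohen's kappa rather than raw agreement.

```python
# Hypothetical sketch of LLM-as-a-judge with a human concordance check.
import random

JUDGE_PROMPT = ("You are grading a clinical answer against a rubric.\n"
                "Rubric: {rubric}\nAnswer: {answer}\n"
                "Reply with a single integer score from 0 to 2.")

def call_judge_llm(prompt: str) -> int:
    # Placeholder: a real pipeline calls a second LLM and parses its reply.
    return random.choice([0, 1, 2])

def judge_all(answers: list[str], rubric: str) -> list[int]:
    return [call_judge_llm(JUDGE_PROMPT.format(rubric=rubric, answer=a))
            for a in answers]

def agreement(llm_scores: list[int], human_scores: list[int]) -> float:
    # Fraction of sampled rows where the LLM grader matches the clinician.
    return sum(l == h for l, h in zip(llm_scores, human_scores)) / len(human_scores)

scores = judge_all(["Start an SSRI.", "Order a CT head."], rubric="Correct next step")
print(scores, agreement(scores, human_scores=[2, 1]))
```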

(32:48):
Because beyond cost, doctors don't enjoy annotating.
It's super arduous.
It's super labor-intensive work to actually have to fact-check every row and line.
Having annotated datasets myself, I can attest to the fact that it's very tedious and not a fun thing I want to be doing on a Saturday afternoon.
So I can attest to that.

(33:10):
So again, I think what this conversation is reinforcing for me is: we are in very exciting times, but we are at the very beginning of this.
At this time, the way things are going, in the near future at least, there is no replacement happening.
There might be increased collaboration, but these models, especially when it comes to healthcare-related tasks or clinical decision-making, are good in synthetic

(33:39):
environments, but we don't know how they would do if we just unleashed them on the real world.
And I think there'll be an art and science to that.
And I think there'll be lots of amazing work that still needs to happen before we arecomfortable doing that.
So that's what it has reinforced for me.
So let's jump ahead a little bit and imagine that we've figured out ways of solving some of those problems that we just discussed, right?

(34:04):
And we think to the future where, just like with Google, I remember when I was in med school in 2005, when an attending asked me a question, they would say, read it up and come
talk to me tomorrow, right?
That was the instruction.
Then we would go and read an actual book and then come back and give the answer the next day.

(34:25):
Then fast forward five, ten years, and everybody would just Google the answer right then and there, and give the answer.
So in the near future, I'm pretty sure LLMs and chatbots will become a big part of medical education, teaching, and other things, right?
So when we think about integrating these chatbots and LLMs in the clinical setup, what do you think about the medicolegal responsibility there?

(34:52):
Right, if a doctor chooses to agree with the suggestion or ignore the suggestion, how do we think about that? Again, of course, we're not there yet by any means, but maybe
10 years from now, we will be there.
And then we'll be stuck with it, so I think we should start thinking about those answers now.

(35:13):
So have you thought about this?
Like, have you talked to experts about this?
What did they say?
Yeah, I'm definitely not an expert and haven't thought too in-depth about it.
I think that a lot of these questions you're asking are super relevant, and they are unanswered, right?
Basically, from a regulatory point of view, there are of course some foundational blocks we could lean on, looking at, you know, the status quo today, right?

(35:37):
If a doctor gets advice from a colleague, you know, a fellow psychiatrist, and he disregards it, and something happens to the patient, right?
Almost certainly there's a lot of nuance, context.
There are probably lawyers who look at case law: how has a judge ruled in the past?
I do think, to me, what's going to be so different, in some of these discussions I have with colleagues, is that it's going to be super pervasive. Meaning to say, today you get to

(36:01):
decide when you feel out of your depth with a patient, when you want to knock on a door and say, hey, can I discuss a case with you?
In the future, and this is already happening in some specialties, for example radiology, these AI suggestions, whether you like them or welcome them or not, are going to be pervasive in the sense that they are going to be deeply integrated, deep into the EMR, for

(36:25):
every single patient you see.
Dr. Manu, do you want to consider ordering this cancer test?
Do you want to consider starting this medication?
And it's different from today, because you don't have a doctor peeking over your shoulder, at least not when you're an attending.
Maybe when you're a resident, but not when you're an attending, someone peeking over your shoulder and actually tapping you and saying, hey, have you considered this or that?

(36:47):
So I think that is something open-ended and really interesting to consider.
I would say, though, that regulators, physicians, everyone looks a lot, I think, initially to the signals in the evidence.
And that's why a lot of research directions right now are so important and compelling.
Because once we can start to see where these AI chatbots alone

(37:08):
do well, and on what sorts of tasks AI plus doctor would and should do better together, then it starts to become clearer, I think, who should be
liable?
Where should the burden lie?
What should the regulation say?
Because regulation is actually a good thing.

(37:30):
If it's there, people feel more confident to adopt things, and they start to understand where, how, and why they should be practicing.
But yeah, I definitely think that we'll start to see signals on the sorts of questions you described over the next two, three years.
Hopefully not only after the first bad incident happens, right?
Because sometimes that is what causes more urgency to look into a matter, right?

(37:53):
But I wouldn't be surprised if, you know, there's going to be a similar case, right?
Oh, you know, the AI scribe said to do this thing, but doctor, you didn't do it; or the AI scribe captured a patient saying this thing, right?
But it was not picked up on.
And then I think there will be a lot more attention drawn to these sorts of issues, and lots more thought there.
Yeah.
I think about that a lot.

(38:14):
So again, anybody who uses an EMR knows all the annoying pop-ups that show up and affect your workflow, right?
Med interactions and stuff like that.
And it's almost at every time point a hard stop where you have to choose.
For example, the med-interaction ones, which are very common and based on primitive if-then rules and guessing.
I don't think there's a lot of intelligence behind that.

(38:36):
But the idea being, you're prescribing this patient medicine A and medicine B, and this is the probable interaction, and you have to tick a box saying, hey, this is okay for this patient, or it's an inaccurate warning, or it doesn't apply in this case.
I have a feeling that that will happen with AI more and more.

(38:58):
And my worry is, so again, EMRs today might be one of the leading causes of burnout among physicians.
If we don't get AI implementation right, my worry is it will again not be a source of help and collaboration, but more of an annoyance and a source of burnout.

(39:19):
So I just feel like, again, I'm scared and excited at the same time, because over the next 10 years there will be an art and science, as you said, around how we implement these tools
and how do we think about that.
Totally, I fully agree with you, right?
It's around what we apply it to, and then the how, right?

(39:42):
I think the what, clearly you know, right?
And you mentioned your clinic using ambient AI, too.
The reason why it had such good uptake is clearly that doctors are voting with their usage, right?
Saying, hey, you know, we actually like this sort of use case.
It does help, you know, reduce our pajama time.
It does help reduce, you know, the sort of annoying things, right?
Who likes to type notes all the time while you're talking to a patient, right?

(40:04):
But if it starts getting into clinical decision-making, I know that some clinicians would welcome having a, quote-unquote, safety net.
But at other times, I've had doctors come out and tell me: decision-making is what I enjoy doing.
I don't want that to be taken away from me.
So that's one aspect.
Sort of what tasks we apply AI to.

(40:27):
Because we are all stakeholders in this as clinicians.
We definitely have a say in how we want it to be implemented.
And the second part is, and you alluded to this, how these sorts of interfaces are going to be designed.
It can easily be super annoying, as in the case you mentioned.
Medication alert and notification fatigue is a very real thing.

(40:49):
We don't want a million more of these buzzing us whenever.
So that's why I do think that a lot of improving human-AI performance, as well, is going to be about very thoughtful interaction design.
I know training will likely play a role as well, but to me, it plays a smaller role relative to this interaction design, in a similar way to pilots and cockpits, right?

(41:13):
You know, you don't train pilots to learn how to fly when they're tired, right?
You take these factors into consideration by designing the cockpit around pilots being fatigued, right?
Having two pilots, for example, so they cover for each other.
So that's the sort of thing that I think has huge potential.
And again, I love that doctors have very enthusiastically taken part in all the ARISE studies, right?

(41:35):
Because we do need a lot of representation, you know, not just to give signals on performance, but really to give thoughtful feedback on how the system should be designed,
what are things to be considered moving forward.
Yeah.
So how important do you think prompt engineering is, as we think about large language models getting integrated into clinical practice?

(41:59):
Do you feel like the kinds of questions we ask, how we ask them, or what order we ask them in would be an important enough thing that we might need to teach residents and medical students, like: this is standard practice, a good standard operating procedure as you interact with these LLMs and chatbots?

(42:22):
Yeah.
So what you're asking is, do we need to train students in prompt engineering, right?
How to ask the question.
The answer is yes, today, right?
It's definitely important.
How you prompt chatbots is important to how good the output is going to be.
So, a very recent example: I was trying out Veo, Google's newest AI video generator.

(42:45):
I was like, wow, look at all these amazing videos that I'm seeing on social media.
It's so impressive.
I tried myself.
I tried prompting it to design what an AI consult could look like in the future, an AI doctor consult.
The results were quite hilarious, I think, in that I tried saying there was a patient, there was a doctor, and it was all jumbled up.

(43:05):
The doctor was basically saying the patient's line, and the computer would be saying the patient's line instead of the patient actually reading out his symptoms.
So today, those things are definitely a problem.
I think, increasingly, it's the same way, and for the same reason, that Google typically knows your search intent or search query.

(43:25):
Using a similar probabilistic model for predicting the next word, a lot of that logic can go into guessing the real intent, even if you mistype something, put in something incorrectly, or the query is in a specialized domain, right?
These systems are going to get better at understanding your intent and giving you a betterresponse.
But I would say, for at least the next two to three years, I think prompt engineering, or knowing how to prompt, at least having a basic foundational understanding of how to do

(43:53):
this.
And there are some good free courses at Stanford as well, which I know some of my colleagues have put out.
So yeah, I think it's definitely good to understand, to start playing around, and...
That's the best way to get familiar with some of these things.
Almost like how no one says, hey, you should be learning, or teaching your kids, how to use Google search.
It's super intuitive.
You don't really need to teach them that.
Yeah.
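To make the prompting point concrete, here is an illustrative sketch contrasting a bare query with a structured prompt that adds a role, patient context, and an explicit output format. The message list follows the common chat-API convention; all wording is invented and not from Dr. Goh's studies.

```python
# Illustrative sketch: the same clinical question, bare versus structured.
bare_prompt = "chest pain ddx?"

structured_prompt = [
    {"role": "system",
     "content": ("You are assisting a board-certified physician. Be concise, "
                 "state uncertainty explicitly, and never fabricate findings.")},
    {"role": "user",
     "content": ("Patient: 72-year-old man, hypertension, long smoking history, "
                 "presenting with shortness of breath and chest pain.\n"
                 "Task: list the 3 most likely diagnoses.\n"
                 "Format: for each, give one finding that supports it and one "
                 "that argues against it.")},
]
# Sent to a chat model, `structured_prompt` typically yields a far more usable
# answer than `bare_prompt`, which is the point of prompt engineering today.
```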

(44:13):
Yeah, I think so.
So, the last question, something that I have thought about and have spoken to a lot of people about, and this comes from the context of using ambient AI to write my notes, right?
What I realized was, there's a trade-off, right?
When I'm actually typing a plan out or typing an assessment out, I'm also thinking about the patient, thinking about what needs to be done and how I'm understanding

(44:43):
the pattern of symptoms that the patient presents with, right?
As chatbot and LLM integration becomes more central, more of these tasks get automated.
Do you feel that we'll do a lot of cognitive offloading, where our expertise might be at risk?
How do we protect ourselves against that,

(45:08):
especially when we think about trainees and med students, who are just starting to learn about these things?
Yeah, that is a fantastic question, right?
I think it's unclear at the moment.
Even Stanford recently brought Jonathan Chen, my colleague, into a sort of AI and medicine faculty role, to basically decide, right, what should we be teaching students, right?

(45:29):
When should we let residents use ambient tools, for example, when structuring, dictating, writing, right? They're so much a part of medical reasoning.
I think it's a real, valid concern that we start deferring a lot of these thought processes, you know, and then we become weaker at them.
I'm not sure I have the answer to that right now, but I do think that if and when AI chatbots become very good at reasoning tasks as well, right,

(46:00):
we could potentially focus more of our efforts on other things.
Now, what those tasks look like, what those critical patient care tasks look like, seems a bit unclear.
And that's why I think that this is one of the more interesting research directions: to again understand where AI alone is going to do really well, which tasks we could offload, and which we should not offload, while ensuring that our residents don't hijack this learning process and start deferring all their thinking to AI chatbots.

(46:27):
Yeah.
I think we are up on time, Dr. Goh.
It was wonderful chatting with you.
Thank you so much.
And I would like to summarize: I think some of my preconceived notions were reinforced.
As I said before, I think these are very exciting times, but we are nowhere close to being replaced.
So I will have a job in the next 10 years, and that's very reassuring.

(46:52):
And I think we...
we need clinicians on the frontline.
And whenever I see a med student or resident, I tell them: we need clinicians playing a central role in the development of these tools and in evaluating these tools, because at the
end of the day, we'll be the end users.
And without us, we might go down the route of the EMR, where it might be more a source of burnout than an actual tool that helps us.

(47:17):
So the message is, get involved.
It's an exciting time.
Lots of work needs to happen.
Yeah.
I hope we were able to excite a few people, a few medical students and residents, to think about maybe contacting Dr. Goh and getting involved in one of his studies, right?
But thank you so much for taking time out and talking to us today.