
July 16, 2025 67 mins

Dr. Andy Beam has trained models, mentored scientists, and used data to quantify the value of treatments. In this episode of NEJM AI Grand Rounds, Raj Manrai turns the tables on his co-host, reflecting on how Andy's childhood misdiagnosis, and the failure of human recall, revealed the diagnostic promise of machine learning. As a Harvard professor, he mentored hybrid thinkers and built tools to evaluate safety, not just performance. Now CTO of Lila Sciences, he's building an experimental AI system to generate its own hypotheses and test them in the real world. This conversation is a front-row seat to the next evolution of science.



Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:01):
Are these robots in a room?
What is the experimental side?
Yeah, they are robots in a room.
They are disembodied robot arms in a room.
We have a system.
We have an automated experimental platform where if you're familiar with how experiments work, they often work on plates.
So, either like a 96-well plate or a 384-well plate.
These plates magnetically levitate over this planar motor system that we have,

(00:24):
and they can zip along this big rail.
There are benches with experimental equipment on them, and the robot arm will pick the plate up off the rail, put it in the piece of equipment, and when it's done, put it back on the rail, and then the plate can zip off to the next stop.
So, the abstraction that I have for this is that, actually, we're building this new kind of computer. And that this planar motor system, this

(00:45):
rail, is essentially like a PCI bus.
And what we're doing is hooking new devices onto this generalized PCI bus in the real world.
And the idea is not to have a couple of these stations that can do what you can do, it's to have buildings of these stations that can do experimentation at scale, and then it really does start to feel like a new kind of experimental cluster that we

(01:06):
can pair with a traditional GPU cluster.
Hi, and welcome to another episode of NEJM AI Grand Rounds.
I'm your co-host Raj Manrai, and for this episode I had a lot of fun because I got to turn the tables on my good friend and co-host Andy Beam, who is the guest of today's episode.

(01:26):
Andy and I have known each other for a long time.
We were postdocs together at Harvard, literally sitting in cubicles next to each other, and then we went on the academic job market.
And started our labs around the same time. Until last year, Andy was a professor at the Harvard School of Public Health and now he's the CTO of Lila Sciences, a company working on scientific superintelligence.

(01:46):
So, I know Andy really well, but I still learned a lot of new things about him during this conversation, including about his early experiences in health care.
I was struck by how he got interested in medicine, his decades-long fascination with artificial intelligence, and his predictions for AI, both in medicine and more broadly.
The NEJM AI Grand Rounds podcast is brought to you by Microsoft,

(02:10):
Viz.ai, Lyric, and Elevance Health.
We thank them for their support.
And with that, I'm delighted to bring you my conversation with Andy Beam.
Alright, Andy Beam.
Welcome to AI Grand Rounds.
I get to say that this time. So, Andy, let me, let me first say I'm

(02:31):
truly excited that I get to pose this question for the first time to you, and I think I could probably simulate this reasonably well given how much time we've spent together over the last decade.
But Andy Beam, could you please tell us about the training procedure for your own neural network?
How did you get interested in AI and what data and experiences
led you to where you are today?

(02:53):
Yeah, it's funny.
I'll try and give you some new information here, Raj, but you probably can predict a lot of this trajectory.
So, you know, I was always kind of like an engineering nerd as a kid, I wasn't really interested in medicine.
My mom tells this story that in kindergarten they were doing this new experimental setup where they had stations and the kids could rotate through, like reading and arithmetic.

(03:14):
And I literally spent my entire kindergarten year at the Lego station to the point that I couldn't write my name after kindergarten.
So, I've always just been interested in tinkering and building and engineering, and it was really in high school when I started to think a lot about computer science and computer engineering.
I got this Dell. Uh, dude, you're getting a Dell.
For people who remember those ads, uh, with a Pentium III, like 733

(03:38):
megahertz, and I just completely got rabbit-holed by all the things that you could do on a computer.
I got into, like, hardware hacking a little bit. So, I made like a little side hustle in college modifying Xboxes. So, I could take someone's Xbox, you could solder two jumper points on the motherboard that would let you flash the BIOS and essentially turn it into a general-purpose computer so

(03:59):
you can install a bigger hard drive.
You could run Super Nintendo games on it.
And I made like a decent amount of beer money my freshman and sophomore year modifying Xboxes.
You'd come into like my dorm room and there would just be like a stack of Xboxes, like floor-to-ceiling. Because people in my dorm would bring them by, I'd charge 'em like 50 bucks and modify their Xbox. In high school, I also took my first programming class.

(04:21):
So, I took a QBasic course at a local community college, and I think after that experience I was just like, computer science is like what I want to do.
It spoke to, like, so many of the things I was interested in. And really, I think that's sort of been the guiding principle, that's the field that I most resonate with.
I'll keep going through my trajectory here, but I
really have changed fields a lot.

(04:42):
And I think I still view most things through the prism of computer science, but informed by some of these other things that I've been working on.
I will say I have this formative memory from my childhood that got me interested in medicine and its intersection with computer science.
I was actually, I got sick a lot as a kid.
I had strep throat like four times a year.
I had chickenpox and shingles at the same time.

(05:03):
So, I was like in and out of doctor's offices a lot growing up.
I also got like a stick stuck three inches into my quad playing wild goose chase in my neighborhood.
So, I like, I had like lots—.
I don't think I, I don't think I knew about the stick in the quad.
No.
Wow.
Yeah.
So, you have these strong, strong experiences with the health care system even growing up very, very young.
Yeah.

(05:23):
And there's one of these that I think came back to me later in life that reinforced the potential for AI in medicine. And it was in sixth grade, so it was my first year of middle school. And like most nerdy middle schoolers, I went to space camp that year.
I actually went to also something called Spec Camp, which in North Carolina was like the academically gifted nerd camp that you went to in the summer.

(05:44):
But I got home from space camp and I started barking like a dog.
I had this cough that was, like, very much like barking like a dog.
It was the strangest thing my mom had ever heard.
We were at the beach, and it continued and continued and eventually one night, I coughed so much that I became asphyxiated and couldn't breathe, and it eventually just went into vomiting.
Sorry that this is kind of a gross story.

(06:05):
Oh my God.
Oh my God.
And it was just traumatic.
Like, it was traumatic.
And so, my mom took me to the ER.
They had no idea what it was.
They took me to the pediatrician the next day. And in the pediatrician's office, I went into the exact same full spell, like full coughing, bronchospasm, emesis,
right in front of the pediatrician.
And the pediatrician was like, you know,

(06:26):
uh, I think you have a sinus infection.
My mom was, like, this kid does not have a sinus infection.
Like, I don't know what he has, but he does not have a sinus infection.
So, I, like, had this for several more days and my mom woke up in the middle of the night and had this flashback to when she was a kid.
And she remembers being in the car with my grandparents, her mother and father.
And the same thing happened to my grandmother.

(06:47):
And so, they had to pull the car over and my grandmother vomited on the side of the road, and my grandmother had whooping cough.
And so, I was actually having a textbook presentation of whooping cough, but the pediatrician had never seen this during their entire practice.
They had never seen this in their life.
Whooping cough had mostly been eradicated, and so, the next day my mom called the pediatrician, and she was like, do you think Andrew might have whooping cough?

(07:10):
And they're like, well, it's funny you say that.
We had just got this random call from the CDC and there've been a couple documented cases of whooping cough in other parts of the county.
And so, turns out I did have whooping cough.
I got these huge horse pills that were terrible to take but remedied it.
My dad's a dentist.
And so, they actually shut down his practice for a couple weeks because the CDC came to my dad's practice, and came to our house, and

(07:34):
kind of like did a full canvass.
But what this told me was that, you know, I had, like most people growing up, this sort of like reverent view of physicians, that they are some mix of being members of the clergy, but also have some non-trivial amount of omniscience and can correctly diagnose everything.
But the reason why my pediatrician was unable to diagnose it is that they had never seen it, even though it was a textbook presentation.

(07:55):
So, if you're looking at the conditional probability of whooping cough given my symptoms, it would've been close to one.
But because there was this recency bias with the pediatrician, they just had a blind spot.
And that's a very human, well-studied cognitive bias, recency bias.
So, that stuck with me for a long time, that there was this flaw in the way that people think about diagnosis.
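To make the probabilistic point concrete, here is a minimal sketch of how a recency-weighted prior can sink a diagnosis that the symptoms alone strongly support. The numbers are purely illustrative, not clinical estimates and not from the episode.

```python
# Toy Bayes calculation (illustrative numbers only, not clinical estimates).
# P(diagnosis | symptoms) is proportional to P(symptoms | diagnosis) * P(diagnosis).

def posterior(prior: dict, likelihood: dict) -> dict:
    """Normalize prior * likelihood over the candidate diagnoses."""
    unnorm = {d: prior[d] * likelihood[d] for d in prior}
    total = sum(unnorm.values())
    return {d: round(p / total, 3) for d, p in unnorm.items()}

# Likelihood of a barking cough with post-tussive emesis under each diagnosis.
likelihood = {"whooping cough": 0.9, "sinus infection": 0.01}

# A prior informed by the population base rate still favors whooping cough...
population_prior = {"whooping cough": 0.05, "sinus infection": 0.95}
print(posterior(population_prior, likelihood))   # whooping cough ~0.83

# ...but a recency-biased prior ("I've never seen pertussis in my practice")
# can push it off the differential entirely.
recency_prior = {"whooping cough": 0.001, "sinus infection": 0.999}
print(posterior(recency_prior, likelihood))      # whooping cough ~0.08
```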
And so, as I moved through undergrad, I was studying computer science, computer

(08:18):
engineering, electrical engineering, trying to decide what I was gonna do.
I thought about being a network engineer and going to work for Cisco.
I was interning at Qualcomm doing very large circuit design verification
for the Snapdragon processor.
So, this is, this is all, all in North Carolina at the time, right?
All in North Carolina, yeah.
At NC State.
So, I went to undergrad at NC State. And so,

(08:40):
I thought that I was gonnado one of those two things.
And then I took an AI class, theundergraduate AI course at NC State.
Uh, it was the green, uh, like,modern AI book from Russell and
Norvig, like the classic textbook.
And it was just like the completely,the most mind-blowing thing that
I had subject I had ever seen.
We talked about the Ship of Theseus,we talked about, like, all of these

(09:02):
philosophical issues, like what does it mean to be conscious, but then also the, like, very practical things, like, how do you search over large spaces with things like A*?
How do you do theorem proving?
And it was like just essentially an amalgamation of all the subjects that I found super interesting and really was a hard fork for the trajectory of my life.
So, I decided that I wanted to do AI. Didn't wanna do those engineering things.

(09:25):
And so, then I tried to work backwards from that realization and figure out what were the most exciting things that I could work on.
And AI, and again, I had this like flashback to this whooping cough episode in sixth grade and said, like, medicine has to be one of the most impactful things that you could work on and also has all these interesting properties.
So, I then decided, spoke to a lot of my professors, asked them for their advice.

(09:47):
I got very sage advice from the same AI professor who taught the course.
This was in like 2006 or 2007, and he said, you know, this thing called machine learning really seems to be important.
It seems to like be on an upward trajectory.
So, if I were you, I would go like really understand probability theory and reasoning under uncertainty and really get deep in that.

(10:08):
And so, I took his advice.
I ended up staying at NC State and getting a master's in statistics.
Also doing some research at the EPA.
Learned a whole bunch about the foundational theory behind probability theory.
Lots of super interesting stuff.
Finished my Ph.D. at NC State in bioinformatics doing Bayesian neural nets for genome-wide association studies.
This was Bayesian neural nets, before autograd was a thing. GPU computing had just started.

(10:32):
So, I was, like, writing CUDA kernels by hand. Writing the backprop by hand.
There was no autograd.
Anytime you made a change, you had to go back to your code.
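For readers who never lived through the pre-autograd era, here is a minimal sketch of what "writing the backprop by hand" means, for a tiny one-hidden-layer network in plain NumPy. This is an illustrative stand-in, not the actual Bayesian neural net or CUDA code from that Ph.D. work.

```python
import numpy as np

# Hand-derived gradients for a tiny one-hidden-layer regression network.
# Every architecture change meant re-deriving and re-coding lines like these.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 8)), rng.normal(size=(64, 1))
W1, W2 = rng.normal(size=(8, 16)) * 0.1, rng.normal(size=(16, 1)) * 0.1

for step in range(200):
    # Forward pass
    h = np.tanh(X @ W1)          # hidden activations
    pred = h @ W2                # network output
    err = pred - y               # residual for squared-error loss

    # Backward pass, written out manually (no autograd)
    dW2 = h.T @ err / len(X)
    dh = err @ W2.T * (1 - h**2)     # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dh / len(X)

    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2
```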
This is a, this is a really good, back in my day.
Yeah.
It really is back in my day.
And so, I again learned a lot about low-level deep learning 'cause these were Bayesian neural nets.
Yeah.
And just like a lot of the sort of nitty-gritty about how to train those

(10:54):
models. I was part of a two-body problem.
Long-time listeners of the show know that I'm married to a physician, and so, she had gone out on residency interviews in pediatrics sort of all across the country.
I kind of tagged along and looked for postdocs at the places she interviewed.
Boston seemed to be the clear winner in terms of the
two-body problem optimization.

(11:15):
And my Ph.D. advisor was someone named John Doyle who came from MIT and had worked with this guy named Zak Kohane.
So, towards the end of my Ph.D. I started talking to John about, like, who's working on the frontier of AI in medicine.
He's like, you should really go talk to my friend Zak.
So, while Kristyn was up here interviewing, I went by DBMI, what was actually

(11:35):
CBMI at the time, the Center for Biomedical Informatics, and had a talk with Zak, and he was just awesome.
He was like exactly the perfect postdoc mentor that I was looking for.
Also, like, completely understanding about the two-body problem.
So, he was like, I understand how this works.
Like, we'd love to have you up here.
If you guys match in Boston, let me know and I'd love to host you for your postdoc.
So, we went to the match day, which, you know, really feels like the NFL draft.

(11:58):
We actually like brought hats from the different cities with us.
Her family was there, my family was there, got up on stage, got the envelope that said Boston.
And so, I immediately got on my phone and I told Zak that we were coming.
Came to Boston, spent like three awesome years with Zak doing a postdoc.
Really in the early days of medical AI.
I remember I showed up on the first day of my postdoc and it was like, Zak, we need to get some more GPUs.

(12:20):
This was like 2014, and he was like, why? I was like, neural nets are like going to really change almost everything.
He's like, I trained neural nets in the year 2000.
What's different now?
I was like, well, these GPUs, essentially, each one of them has, like, more computing power than the national labs had in the early 2000s.
Like, so, it's, it's just very different.
So, he immediately got it, and we started working on lots of stuff in the space.

(12:41):
So, maybe just a few more points before, I actually don't know what you have in store for me, so, I'm excited to see what questions you have.
Yeah.
Let me redirect it now,
'cause I think we're gonna talk about a lot of your work after your postdoc.
Yeah.
For the next part,
I have so many reactions.
So, the first thing is, I think I could simulate a decent amount

(13:03):
of that, Andy, but you definitely put in some new content there.
And I have so many reactions.
One of them that is really, I think, really important is, you know, you're talking about this story where eventually you would go on to be diagnosed with whooping cough and, you know, a very sort of strong memory that stuck with you and that led you to medicine and led you

(13:24):
to work on problems in medical AI.
There's so many things about this, right?
Your mom was very involved in this.
It's a loved one of the person who's suffering.
This has been a persistent theme, and then you were misdiagnosed first, and now I'm sure the current version of you is wondering – if you haven't already done this – you know, how would ChatGPT or similar models

(13:44):
have done with that presentation?
My guess is it would've been, uh, quite high on its differential diagnosis, but—.
For sure, I definitely have, and it was actually, I mean, you know, some of the early models that we did during my postdoc, like, we would give you the word-by-word change in probability.
Yeah.
Whooping cough was always one that I would use to test it.
So, so even in the primitive era before current large language

(14:06):
models, it was, it was high up there.
Even the LSTMs trained on small data could get it right.
Yeah.
Yeah.
So, this is what I want to dig into next.
So, this is your academic work.
You know, I think you're about to go here, and so, maybe I'll try to briefly, I think, pick up where you left off.
And I do have to say, I think that was a fantastic answer for,

(14:26):
also, the genesis of your interest in AI and medicine as well.
So, you did a postdoc with Zak.
Very successful.
Actually,
maybe I'll just give my comments on this.
You know, we spent a lot of time together with neighboring cubicles to the point that — maybe we've mentioned this on the show before — we were so loud and having so much fun, just joking and distracting each other during the day

(14:47):
that we eventually did get separated. We eventually, I mean, we essentially just took our postdoc and put two mics in front of us and started talking into the podcast.
Yeah,
exactly.
So, that is now AI Grand Rounds.
Yeah.
And it's a lot of fun.
And so, yeah.
So, we, you know, we had a lot of fun talking about everything, right?
From LeBron James to AI to deep learning.

(15:09):
Zak, Zak Bingo.
Zak Bingo.
Zak Bingo was always very fun.
Uh, well, some of our listeners know what that means, but do you wanna, do you wanna tell them what Zak Bingo is?
Uh, Zak had a lot of phrases that he would commonly use in talks.
One was using an analogy of Netflix knowing everything about you, but the health care system knowing nothing.
So, that would be a square on the Bingo card. Information

(15:29):
Theoretic Criterion was another one.
So, we had a whole bingo card for Zak-isms.
Oh, amazing.
You just took me straight back to postdoc.
So, one of the things that I wanted to bring up from your postdoc work, so I think you were very prescient with this, and I think it's been a through line in your work and your academic work afterwards: you were saying things that now we would take for granted.

(15:50):
But you were saying them very early on, which I think felt much more like science fiction when you were saying them back in, I don't know, less than 10 years ago, even 2018, 2017. So, you were trying to solve this problem, as I remember it, of designing a neural network to pass the USMLE.
And of course, we know now that every other large language model can do

(16:11):
this, but at the time, what was the sort of motivation and also, what were the reactions to some of this work when you would tell either doctors
or machine learning researchers?
Yeah, I mean, even going back to like when I was in grad school and Kristyn was in medical school, it just felt obvious to me that computers were gonna be able

(16:32):
to do diagnosis better than people.
And I would tell her friends that. I was super unpopular at the med student parties because, I felt like, the bringer of the apocalypse is kind of, like, I'm sure, how they viewed me.
Why was it obvious to you?
Why, like, what was, like, what are sort of the deep reasons that you saw it as inevitable?
Well, so you can go full first principles here and just say, like,

(16:54):
if you think that cognition is, if there's not any non-physical component to that, then surely, we can recreate those processes in computers.
And computers have scaling properties that human brains do not.
So, it was just a question of when, not if. This is also, like, I was reading lots of futurist literature at the time.
I always tried to be slightly more sober than like what you would see in that.

(17:15):
But it just, if you plotted out where we were in like 2009, 2010, it just seemed like, I didn't know exactly when, but it just seemed inevitable that computers are gonna be better at doing these types of deductions than humans could ever be.
Computers don't get tired, they can read the entire Internet.
They have perfect recall.
It just seemed like a complete mismatch in terms of capability.

(17:38):
So, that was something that was core to my belief, like in early grad school in like 2010.
And I think that's still mainly a core belief that I have.
So then during the postdoc, Zak was actually very supportive of like these types of like semi-heretical views.
Like Zak loves to, I think —.
He loves these, right?
He loves, he loves it.
Yeah.
And so, we started working on this project, and Kristyn was, had

(18:02):
just passed step one and was working on, like, step three, which I think is what you take during your first year of residency.
She had done step two at the end of med school, and so, like, I always joke that it was due to a complete lack of imagination that I was just, if I wanted a computer to be good at medicine, I was just gonna have it do what she did, like take it. And I, Raj, you have done a good job at articulating this point, too.
Like, step one is a necessary but not sufficient condition to be a doctor.

(18:26):
We would get these questions all the time.
Like step one, I gave a talk at GTC in 2017 that, like, step one
should be a benchmark for medical AI.
Step one has all these properties.
People were really interested in getting computers to do differential diagnosis, but it's very hard to grade a differential.
It's very hard to get the data, like, there'll be disagreement as to like what the correct differential should be.

(18:47):
But these questions are canned with unambiguously correct answers.
There's a whole bunch of human performance data that you can get as to how well humans do on this test.
There's exactly the kind of data that you would need to train a model to be able to do this.
And so, from my perspective, step one just felt like an obvious benchmark for not only medical AI, but AI generally.

(19:08):
So, I kind of laid this out in a 2017 GTC talk, and then we started working on it.
We got some funding from the Robert Wood Johnson Foundation, and I had lots – I'll talk about this later, but we got bitter-lessoned, and it was sort of one of my first exposures to the Bitter Lesson.
The idea was like we could train LSTMs on data that we had curated from the Internet where you can kind of like make example step one questions, and you

(19:30):
train it to just, like, predict the correct answer.
And that was the plan.
And we were training models on Titan Xs, like, tiny GPUs by today's standards.
And the models weren't bad.
They were able to get like 40-ish percent of these questions correct.
Which, to my knowledge, at the time was like one of the strongest results someone had had on that. They would do these cute things, too, where they could

(19:50):
like, they would give you the sort of word-by-word probability of diagnosis.
So, you could feed in a patient case and you could kind of watch it think in real time. You know, one of the ones that we always used was Kawasaki disease.
And it would be like, a 1-year-old patient has had a fever for four days, evidence of strawberry tongue. And as soon as strawberry tongue showed up, like, the differential just collapsed in terms of entropy.

(20:11):
Like Kawasaki would jump way up and all the other things would go down.
So, it was kind of neat that you could kind of like get at some of these, see how the model was reasoning.
But ultimately, they were very, very small models trained on limited amounts of data, and therefore the ceiling of those things was, like, pretty low.
So, I think what it taught me from that was one, that the medical AI intuition that I had had for a long time was right, but also, that if you,

(20:36):
in this sort of new era of AI, you should always be working on the most general form of the problem.
I thought that I was working on a relatively general version of the problem, but it actually, it turns out, that the more general version of that problem is predicting the next token.
And so, when GPT-3 and GPT-4 came out, they kind of solved these problems outta the box, like you alluded to. Like people are now completely unimpressed

(20:57):
when a model —
It's amazing.
— can do that. Yeah.
Yeah.
It's amazing how much the goalposts have moved just in the last few years, right?
Yeah.
Both for what it means for the quote unquote intelligence that's in the computer models, but also for what it means for humans, right?
The whole conversation even around the significance of a human passing these tests has changed once AI has cleared them with ease.

(21:19):
So, you were doing all this interesting work, right, on the USMLE benchmark and other tests, and I remember we went through this together.
Then you went from postdoc
to starting your own lab.
And so, you became a professor in the Department of Epidemiology at the Harvard School of Public Health, and you were continuing, I think,

(21:39):
your methodological work, but you also started to work in a very focused way.
And you can tell us about the sort of origin of this, although I can
guess some of this, right, on problems specifically within neonatology, right.
Applying AI to neonatology.
So, maybe tell us about Beam Lab and, you know, life as a junior faculty

(22:01):
member, how you got it off the ground, what the philosophy was for the group.
Yeah.
So, I was excited to join the Department of Epidemiology.
Again, motivated from the AI perspective.
So, this was like 2018.
So, I started my lab July 1st, 2019, and this was still very early, so this was pre-GPT-3. We had run out of steam

(22:24):
for the, like, AlexNet supervised learning, everything-is-an-ImageNet-problem paradigm.
And so, there really was a sense that, like, we were looking for the next paradigm.
And what the Department of Epidemiology at Harvard does like better than almost anywhere else is causal inference.
And so, I was excited to join the department to learn from folks like Jamie Robins, Miguel Hernán, and folks like that, who are world

(22:47):
leaders in causal inference, to see how we can get some of that type of causal reasoning into AI systems.
And we had a couple like really great papers on that, some of which were at NeurIPS and ICML workshops about sort of blending causal inference with deep learning.
The applied side of my lab has always been focused on
neonatal perinatal medicine.

(23:07):
Again, due to a complete lack of creativity on my part, Kristyn went on to be a neonatologist.
And so, we've ended up working together a lot, collaborating a lot.
And, I think, to the credit of you and to her, who have been big influences on my academic career, we did a lot of more traditional epi, health care data science kinds of things under this umbrella of neonatal perinatal medicine.

(23:31):
One of which that I'm really proud about is we looked at, we have a series of papers looking at this drug that's thought to prevent preterm birth.
So, preterm birth is babies born before 37 weeks of gestation.
About one in 10 babies in the U.S. are born preterm.
It's one of the biggest sources of neonatal morbidity and mortality
there is.
So, like, preterm birth is a big problem.

(23:52):
And historically there's only been one drug that you can use to treat it.
It's called 17-alpha-hydroxyprogesterone caproate, or 17-OHPC or 17P for short.
The efficacy for this drug was demonstrated in a 2003 NICHD trial that maybe I'll come back to in a little bit.
But it was, it was an NIH trial that was run, administered for a long time as a compound,

(24:13):
compounded medication, and kind of, like, was standard of care for
women who were at risk for preterm birth.
So, the indication is actually recurrent singleton preterm birth.
So, if you have a history of preterm birth and you're currently carrying a
singleton, you're eligible for the drug.
So, as part of like my interest in getting into AI and machine learning,
in Zak's lab we had access to this amazing clinical insurance

(24:37):
database that had the lives of 40 million Americans over eight years.
When we got access to this data, I was like, we're gonna like machine learn the crap out of this, and we're gonna predict all of the things, and we're gonna like, create the, like, world's best AI system using this huge database.
So, it was instructive to understand how misplaced that enthusiasm was

(24:58):
for this kind of data.
So, one of the things that you learn in health care is like, not all data are capable of answering all questions.
And so, I spent about a year of my life really going deep and figuring out, like, where all the warts were on this data, and trying to figure out what types of questions it could support.
I started collaborating with a maternal fetal medicine doc at Beth

(25:18):
Israel just to try and like get some clinical feedback on these ideas.
She's like, you know, this machine learning thing is great, but really there's this question that we have no idea how to think about in maternal fetal medicine around the 17P drug.
In the year 2011, a drug manufacturer had acquired the rights to 17P

(25:39):
and started reselling it under the branded name Makena, which at first people were pretty excited about because it would increase access.
But they started essentially charging an arm and a leg under this sort of brand name Makena for something that previously had been essentially free.
So, she's like, we would love to, like, understand more about the economic impact of this.
And also like there's a lot of controversy around like, does this drug even work?

(26:03):
So, we ended up writing a series of papers.
The first paper was in JAMA Internal Medicine, just, like, looking at how much patients are being charged for this medication.
And so, we found, on average, the price per pregnancy for Makena was something like $11,000.
And on average the price per pregnancy for the compounded version was $200.
So, something like a 5000% increase with plausibly, like, no meaningful

(26:26):
benefit given to the patients.
Like, the differences in outcomes between the compounded and brand-name version of the drug were essentially identical.
There's no difference.
So, then we did a follow-up paper where we
used ideas from causal inference, and this is where it was super helpful to be in the epi department, to do something called target trial emulation.
So, this is where you write down the inclusion criteria, the study design,

(26:49):
just like you were doing an RCT, and then you use observational data to try and emulate that in your dataset.
And so, there was a parallel RCT going on now that the manufacturer had to do to be able to get the approval renewed.
And so, we followed that inclusion criteria, we followed that study design, we did the target trial emulation and found essentially no evidence of benefit.
And this was like a very robust finding across lots of different

(27:12):
kinds of sensitivity analyses and just felt like very solid.
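Here is a minimal sketch of the target trial emulation recipe described above. The data and column names are synthetic and invented for illustration, and the crude comparison at the end deliberately omits the confounder adjustment and sensitivity analyses a real emulation requires.

```python
import pandas as pd
import numpy as np

# Hypothetical, synthetic stand-in for a claims cohort; column names are invented.
rng = np.random.default_rng(0)
n = 5000
cohort = pd.DataFrame({
    "prior_preterm_birth": rng.integers(0, 2, n),
    "singleton": rng.integers(0, 2, n),
    "gest_age_at_entry_wk": rng.integers(10, 30, n),
    "received_17p": rng.integers(0, 2, n),
    "preterm_lt_37wk": rng.integers(0, 2, n),
})

# 1. Write down the trial's eligibility criteria and apply them to the data.
eligible = cohort[
    (cohort["prior_preterm_birth"] == 1)                 # history of preterm birth
    & (cohort["singleton"] == 1)                         # currently carrying a singleton
    & (cohort["gest_age_at_entry_wk"].between(16, 21))   # treatment initiation window
]

# 2. Define the "arms" by whether 17P was started in that window.
eligible = eligible.assign(treated=eligible["received_17p"])

# 3. Compare the trial's primary outcome (preterm birth < 37 weeks).
#    A real emulation adjusts for confounders (e.g., with IPW or outcome models)
#    and runs sensitivity analyses; this crude difference is just the skeleton.
rates = eligible.groupby("treated")["preterm_lt_37wk"].mean()
print("Crude risk difference:", rates[1] - rates[0])
```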
So, we published that
in a perinatal journal.
Then, after this, the second, subsequent RCT for Makena came out and was actually negative.
There were maybe some subgroup effects.
The FDA reviewed this and decided to remove authorization for this drug in

(27:32):
the marketplace and cited our paper as one of the key pieces of evidence that they used in this decision.
So, like, I'm never super excited about patients having fewer treatment options, but I think that this was an instance where we could actually use some of these data science methods to have clinical impact.
Because if a drug doesn't work, one, we shouldn't be paying $10,000 for it.
And two, there are obvious side effects, too, with a lot of these drugs.

(27:55):
So, there's so much there,
maybe one of the things that I'd love for you to just dig into a
little bit more is, you know, you said something along the lines of knowing what
data can support what questions, right.
How to align different datasets with different questions.
And, in some sense, I think this is what really separates the quality of a lot

(28:16):
of research, which is not that you're, you know, of course there's, there are datasets that just in general are superior and wonderful and useful for a lot of things.
But I think knowing that marriage between the data and the question, and maybe we can also add the compute to the mix of this, is really the sort of art of
setting up a student for success, right?

(28:36):
Or working with a student to come up with an idea that is likely to be fruitful and interesting.
And so, you know, you've got your lab off the ground, you're publishing these interesting papers in neonatology.
Continuing your methodological work around causal inference and AI and then growing your lab, recruiting students.
Maybe you can just reflect a little bit and then I wanna transition to your work now and what you're up to these days.

(28:59):
But you can reflect just before that on how you approached recruiting students and then mentoring them and designing projects for students in your lab.
So, like, what was your philosophy?
I think a lot of people who are, you know, junior faculty are interested in this kind of stuff as well.
Yeah.
Let me first qualify and say that when I started my lab, it

(29:23):
was a particularly crazy time in my life, personally and in the world generally.
I started on July 1st, 2019.
We had our first daughter July 25th, 2019.
So, a full 25 days into starting my lab.
Seven months later, COVID happened.
Daycares shut down.
Complete insanity.

(29:43):
My wife got conscripted into a lot of ICU service at MGH and the Brigham 'cause she was still a fellow, so she was covering a lot of the pediatric ICUs while the pediatric ICU docs got conscripted into the normal ICUs.
So, I have a partial and fuzzy recollection of essentially
the first two years of my lab.
But let me try and give you a sense of how I thought about it.

(30:04):
I viewed my lab to be a place where computer scientists who are deeply interested in health care could come and work on important clinical problems.
So, again, that, I think that the 17P project is a good example of that.
That was led by a student in my group, Joe Hakim, who is an HST student.
So, HST has been featured a lot on this podcast already.

(30:27):
But a bioengineer by training.
And so, he was interested in making a clinical impact.
He would go and meet with the MFM doctors and really, like, dig deep into how can I map your clinical definitions onto what the data can actually answer.
And it was always kind of like a 50-50 split.
So, folks who are coming from purely computational backgrounds. I also

(30:49):
supervised a lot of residents, medical students, and people like that.
I do think that there often had to be like a very sincere interest in AI and machine learning.
So, I would sub-select on folks for that.
Like, we just weren't doing a lot of, like, RNA-seq analysis in my lab. It really had to be something, a clinical question

(31:10):
that you could answer with a large health care data set and ideally some type of machine learning approach.
I tend to be relatively hands off when it comes to day-to-day.
We would do some things that would be organized in a much more structured kind of way.
So, we have a NeurIPS paper on something called proximal inference,
which is a subset of causal inference.

(31:30):
And we ran that very much like in two-week sprints where like, here's what we're gonna do for the next two weeks, we're gonna check back in.
That project, I think was —.
The machine learning conferences are great for encouraging those sprints, right?
Yeah, exactly.
Yeah.
Yeah.
But for the most part, I tended to also start graduate students on a shovel-ready project that was, like, here's the project, here's what success looks like.

(31:51):
Go and execute it.
Project number two is much more of, like, here's a general theme of things that might be interesting to look at. And then the idea was by the third project, they'd be able to just ask and answer their own questions that they found interesting.
So, I did try and sort of ease folks into research by giving them like a little bit more structure in the beginning but then being, like, less structured at the end.
Yeah.

(32:12):
And I, I think you're very thoughtful about that.
And you know, we just spoke with Anil, who's now at Google.
Actually, this hasn't aired yet, but the episode will air soon.
And Anil was one of your first Ph.D. students, right?
Mm-hmm.
And I think the thought and the care that you put into sort of the arc of their career while sort of being hands off, but also letting them

(32:32):
grow, letting them develop independence.
But also giving them a little bit of structure, semi-supervised so that they can succeed, I think is very clear.
And it was very clear in what he said, other than us both trolling you, of course, as necessary.
Alright, so I want to dig into your work now at Lila.
And so, okay, so let me try to frame this.
So, last year you went on leave from your Harvard professor job to

(32:54):
become the CTO of a new company.
I think the company was in stealth at the time; it's now out of stealth, called Lila.
And maybe we can start with your thought process behind the move.
So, given what we're going through at Harvard right now, some would say you look like a genius.
I know you have a very good crystal ball, Andy, but I think your decision to

(33:15):
move preceded the current funding crisis.
You were doing great academic work.
You're mentoring students, you're building a research vision around AI for neonatology with the superior Dr. Beam, your wife.
Why leave?
Why move from Harvard?
Yeah, it's a good question.
Let me first preface by saying that it wasn't my first time going into a startup.

(33:38):
So, again, something that I owe to you is before I started my faculty job at HSPH, I took a year off to help start a company.
And this is actually, again, advice from Raj.
I had been interviewing for faculty jobs.
I had had an offer to join a startup from a company called Flagship.
Flagship is a venture capital firm that

(33:59):
instead of deploying capital in external companies, uses that capital to incubate and spin out companies.
So, I'd been part of an incubation process at Flagship as a consultant.
The thing that I had been consulting on got funding from Flagship and was gonna go get started as a company.
It was centered on using machine learning for protein engineering.
So, can we use machine learning models to make protein therapeutics better, faster,

(34:22):
cheaper, in a more targeted kind of way.
Super interesting.
Hadn't thought about protein engineering before but got to do that.
And so, I was like, kind of torn.
I was like, this is like a really interesting idea, but it's hard
to turn down this faculty job.
And, you know, to your credit you're like, why not both?
Why not both?
Uh,
why not both?
And so, I actually delayed the start date of my faculty job for a year and joined

(34:42):
what is now known as Generate:Biomedicines as the founding head of machine learning.
Helped build the team. Helped build a lot of the early
models. Helped build the strategy.
Was there full-time for a year.
Remained in a part-time capacity for four years after; I think my title was like professor in residence.
So, I got to do the fun, like, do the startup thing for one day a week and then the professor thing for the other four. Generate has gone

(35:04):
on to be I think pretty successful.
They have 300 people.
They've raised something like a billion dollars to date.
They have two drugs in clinical trials.
And that to me is the most important validation of the technologies, that they actually have made real things that seem to work.
So, I had a super pleasant experience. So, I think that experience was
de-risking for me to go join Lila.
So, with that preface, let me answer your question.

(35:27):
So, I was out on paternity leave in February of 2024.
And it was, like, our second child, which by comparison was much easier.
So, no COVID, no starting a lab.
I was also helping start Generate at the time our first was born, and it was actually just kind of like a peaceful time in our lives and that gave
me, like, kind of a chance to reflect.

(35:49):
Going back to our, like, earlier conversation about my motivations, AI has always been the thing that I was interested in.
Health care has always been a super important and interesting domain, but has always kind of been the sandbox versus the thing that has been my primary motivation.
I mentioned before I started my faculty job that it was a pre-GPT-3 world.

(36:09):
We still hadn't seen the benefits of scale.
It still seemed plausible that you could do frontier AI research in an academic setting.
And you know, in 2024 when I was reflecting, it became hard to make that case that you could do frontier AI research without significant resources.
It could also be that, you know, I had done the faculty thing for five years

(36:31):
and I was getting the startup itch again.
And so, I started to ask around.
I do remember some texts along the lines of, I can't believe how much fun it is to actually be able to code and to just spend some time.
I think you were, you were doing some coding again, right?
I was actually, through that period,
I was actually doing some woodworking too.
The desk that I have now is also amazing.

(36:51):
Amazing.
Amazing.
Yeah.
So anyway, I started looking around and there was a company called FL97.
So, Generate was FL57.
That just means they give them like serial numbers at Flagship.
So, FL97 is the 97th company that they've incubated.
So, there were 40 in between Generate and —.
Over the five-year period.

(37:12):
Yeah, yeah, yeah, exactly.
Yeah.
And so, not accidentally, two of my Ph.D. students were at FL97.
And I had been advising FL97 for a little bit, so I kind of had an idea of what it was.
But Flagship companies have these very interesting evolutionary trajectories where they start in one place and over time they tend to evolve and change and adapt, and then they end up somewhere potentially very different.

(37:36):
So, FL97, and I'll just call it Lila from here on out, started to converge on something that was really, really, really interesting and really, really compelling to me.
And what I wanted to understand is, Flagship is known for making biotechs, you know, is this going to be like another biotech AI company or is this like actually an AI-first kind of company?

(37:57):
Meaning like, is the primary goal of this company to create AI or is it to use AI in service of creating an asset, a molecule, something like that?
And so, I got to go spend some time at Lila, got to meet more of the team that they had built, got to meet the leadership team, and just became convinced that
this was a really exciting AI company.

(38:18):
And I'll talk a little bit more about the thesis behind Lila, which was going to be less constrained from a resource perspective.
So, we were going to commit significant resources to GPUs, serious resources to creating the data that you need to create new kinds of AI models.
And it just, it felt like kind of the culmination of a lot of the different things that I had been thinking about over the last 10 years.

(38:42):
And so, I always say that like I had a fantastic job, you know, I technically still do, I'm on leave, but like being a professor is a great job.
What's happened in the last three months notwithstanding, as of March 2024, it was a great job.
Had a wonderfully supportive department, wonderfully supportive school, great colleagues, world-class students.
And so, this wasn't that I was unhappy, it was just trying to be honest about if

(39:06):
the kinds of problems that I want to work on were accessible to me in academia.
And I think when I was clear-eyed, it just became hard to argue that I could work on the problems that I wanted to in an academic setting.
So, the message I'm hearing is that our ability in academia to retain Andy Beam scales with the number of GPUs that we have access to.

(39:30):
It's another scaling law for talent.
But, so you're, okay.
So, from that description, I understand that you're focused on AI first, which means not applications of AI, or not just applications of AI, but AI itself.
And that you need a lot of compute.
You need a lot of GPUs to accomplish your mission.
And

(39:50):
maybe you can tell us what that mission is, right?
What are you trying to accomplish?
Where are you and sort of where do you see this going for the next couple of years?
Yeah.
And just to preface or circle back on that last point, there are interesting problems that you can work on in academia; everyone has their own utility function.
So, I'm not saying that there's nothing interesting happening in academia.

(40:11):
It just happens to be that the ones that I find interesting are hard to work on in an academic context.
So, what, so then what are we doing at Lila?
We recognize that the scaling paradigms of the last five years have been enormously successful.
Again, talking about passing the USMLE as a sort of an accident of these scaling paradigms, but they're probably also saturating.

(40:36):
So, I think it's clear that pre-training is saturating. Or maybe the scaling law still works, but power laws are kind of a hell of a thing: to get the same amount of benefit you still have to scale the compute by an order of magnitude.
So, it might just be that we can't keep scaling it up to 10 million GPUs, and so, people are looking for new scaling paradigms.
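To see why that order-of-magnitude point follows from a power law, here is a back-of-the-envelope sketch. The constants are made up for illustration and are not fitted to any real model family; only the shape of the curve matters.

```python
# Illustrative power-law loss curve: L(C) = a * C**(-b) + L_inf.
# Constants are invented; the point is the shape, not the values.
a, b, L_inf = 10.0, 0.05, 1.5

def loss(compute):
    return a * compute ** (-b) + L_inf

for c in [1e21, 1e22, 1e23, 1e24]:
    print(f"compute={c:.0e}  loss={loss(c):.3f}")

# Each 10x in compute shrinks the remaining reducible loss by the same
# factor (10**(-b), about 0.89 here, i.e., only ~11% per decade of compute),
# which is why "just scale it up again" eventually stops being practical.
```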
We think that models need the ability to essentially generate their own tokens.

(41:02):
And so, the models need the ability to ask and answer questions that people have not asked before.
Large language models, one way to think about them is that they're a wonderful index into human knowledge, so everything people have created is accessible to a large language model.
They're able to access it in this very fuzzy kind of way where they can do fuzzy pattern matching.

(41:23):
And they're really great at accessing human knowledge.
Again, putting my causal inference hat back on though, we know that there are limits to what you can do with what amounts to a big pile of observational data.
So, if you actually want to make a claim about how the world works, the best thing that a model with observational data can do is tell you kind of

(41:44):
like what hypotheses are compatible with the data that it has seen before.
And the only way to essentially pick from a set of hypotheses is to either make strong assumptions, like we would do in causal inference, or actually do the experiment.
And so, that's sort of the key insight of what we're developing at Lila: how do we take these very powerful large language models that have been trained on the entire

(42:05):
Internet and pair them with a scalable experimental platform that will let them break ties that exist in the literature.
And ask questions that have never been asked before in the literature.
So, again, like you and I both know this, I mean, what you did during your postdoc was really focused on this.
The scientific literature is not a record of facts.
It's a record of a debate under varying incentive structures.

(42:27):
So, people are incentivized to publish the most charitable version of their findings.
They're incentivized to downplay things that are inconsistent with the hypothesis that they're trying to support.
And then there will be papers published that sort of rebut that.
So, I think it's obvious to me that you're not gonna be able to derive
what science is happening in 2050

(42:49):
if you have just read those papers. You're gonna have to do incremental experimental steps that build upon what has been done in the literature.
But we're not gonna have some oracle; GPT-6 is not gonna be some oracle that can just reason from first principles conditioned on what we know currently in science.
And so, if you buy that sort of, like, basic premise, then the sort of

(43:09):
immediate conclusion is, okay, how do we connect this with a scalable experimental platform so that the model can push beyond what we know now?
So, that's in essence what we're building at Lila, where half of the house is focused on scalable experiments.
The other half of the house is focused on AI.
But again, we view the experimental platform as a new token generator

(43:29):
for the models that we're training.
So, are these robots in a room?
What is the experimental side?
Yeah.
They are robots in a room.
They are disembodied robot arms in a room.
We have a system. So, we have an automated experimental platform where if you're familiar with how experiments work, they often work on plates.
So, either like a 96-well plate or a 384-well plate, these plates

(43:51):
magnetically levitate over this planar motor system that we have, and they can zip along this big rail.
There are benches with experimental equipment on them, and the robot arm will pick the plate up off the rail, put it in the piece of equipment, and when it's done, put it back on the rail, and then the plate can zip off to the next stop.
So, the abstraction that I have for this is that actually we're building

(44:12):
this new kind of computer. And that this planar motor system, this rail, is essentially like a PCI bus.
And what we're doing is hooking new devices on this generalized PCI bus in the real world.
And the idea is not to have a couple of these stations that can do what we can do, it's to have buildings of these stations that can do experimentation at scale.

(44:33):
And then it really does start to feel like a new kind of experimental cluster that we can pair with a traditional GPU cluster.
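One way to render that "rail as a bus" abstraction in code, purely as a thought experiment: the class names, station names, and steps below are invented for illustration and are not Lila's actual software or hardware interfaces.

```python
from dataclasses import dataclass, field

# Thought-experiment rendering of the "rail as a bus" abstraction: plates are
# routed between instrument "devices" the way work moves between peripherals
# on a bus. All names and steps here are invented for illustration.

@dataclass
class Plate:
    plate_id: str
    wells: int = 96
    history: list = field(default_factory=list)

class Instrument:
    """A device hooked onto the rail (e.g., a liquid handler or plate reader)."""
    def __init__(self, name):
        self.name = name

    def process(self, plate):
        plate.history.append(self.name)   # stand-in for running the actual step

class Rail:
    """The 'bus': routes a plate through a sequence of instruments."""
    def __init__(self, instruments):
        self.instruments = instruments    # mapping of name -> Instrument

    def run(self, plate, protocol):
        for step in protocol:
            self.instruments[step].process(plate)   # robot arm loads/unloads
        return plate

rail = Rail({n: Instrument(n) for n in ["dispense", "incubate", "read"]})
plate = rail.run(Plate("P-0001", wells=384), ["dispense", "incubate", "read"])
print(plate.history)   # ['dispense', 'incubate', 'read']
```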
Do you think, um, actually, do you like the characterization, because it occurs to me that it kind of feels like you're looking for a new scaling law or
you're searching for a new scaling law.
Do you agree with that?
Do you like that characterization?

(44:53):
Is that fair or not?
Literally, exactly how I describe it.
Okay.
Amazing.
I've probably heard you say that to me and I'm just regurgitating stuff.
Yeah.
Literally, exactly how I do it.
Yeah.
And yeah, again, like it's just built on the recognition that, we've seen this a lot in large language models over the last three months.
Like, they rely on verifiers, and for some class of verification tasks, nature has to be the verifier.

(45:14):
And so, we're building a big, scalable, nature-based verifier so that these models can learn to hypothesize and reason about things that we don't really understand yet.
And we think that that will unlock a new scaling paradigm in the same way that pure compute trained on Internet data unlocked the first scaling paradigm.
You know, just to sort of rephrase, like science is subject to the Bitter Lesson,

(45:38):
and we're trying to figure out in what ways it is subject to the Bitter Lesson.
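In schematic terms, the loop being described looks something like the sketch below. Everything in it, including the stub classes and method names, is a placeholder invented for illustration, not a real API.

```python
import random

# Schematic of the hypothesize -> experiment -> update loop described above.
# Everything here is a stand-in for illustration; nothing is a real Lila API.

class StubModel:
    """Placeholder for an LLM that proposes hypotheses from prior knowledge."""
    def generate(self, knowledge, n=4):
        return [f"hypothesis-{len(knowledge)}-{i}" for i in range(n)]

    def update(self, results):
        # Stand-in for fine-tuning / reinforcement on verified outcomes.
        return self

class StubLab:
    """Placeholder for the automated platform: nature acts as the verifier."""
    def execute(self, hypothesis):
        return random.random() > 0.5   # did the experiment support it?

def closed_loop(model, lab, knowledge, rounds=3):
    for _ in range(rounds):
        hypotheses = model.generate(knowledge)
        results = {h: lab.execute(h) for h in hypotheses}        # new "tokens"
        knowledge = knowledge + [(h, ok) for h, ok in results.items()]
        model = model.update(results)
    return knowledge

print(len(closed_loop(StubModel(), StubLab(), knowledge=[])))    # 12 verified observations
```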
So,
one of the other things you said, and sort of the motivation for what you're doing, is that our existing paradigm, our existing large language models, you know, they can do so many things, right?
So many things. And I think the word you used, or you might've used, was byproducts, right?

(45:58):
Mm-hmm.
They're almost, you know, there's no intent by the creators of these autocomplete models that they'd be able to solve differential diagnoses that are very tricky or pass the USMLEs or other things, right?
This just emerged from the sort of scale of compute plus the other training that was applied to the models.
But in describing that existing paradigm, I think you said that we are saturating

(46:22):
or that it's getting saturated. And I wonder, is that an empirical observation that you have of the sort of performance of these models over time?
Or is it more of a sort of inevitability, first-principles, um, deduction that you're making from the way in which the models are trained and the procedure

(46:44):
that goes into creating them?
Is it a, they're sort of saturating, like they're not getting better at the benchmarks, it can only get so much better?
Like, where's that sort of initial spark of LLMs will not be able to do X, Y, Z coming from?
They were very bad at classes of benchmarks that required long-term reasoning and planning.
An example of this is solving complicated math problems, solving complicated

(47:07):
programming problems. By and large, simply pre-trained models are never best in class at those.
Things that have what people call test-time compute, or reasoning capabilities, have taken over this.
It's like the O series, right?
Like the O series from GPT or equivalents.
Again, like putting my host hat back on to explain to some folks who aren't as technical: in this, pre-training is simply predicting the next word.

(47:30):
So you, you know, or the next token. You can do this with unstructured data.
Reasoning models are trained by giving feedback that indicates how good their solution was.
In some sense, pre-trained models are trained to predict the average response; reasoning models are trained to produce the correct response.
And so, the fact that we've already shifted from one paradigm in pre-training

(47:52):
to reasoning slash test-time compute, I think is, is a good base case for saying that pre-training has saturated.
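To make that distinction concrete, here is a toy contrast between the two training signals. These are schematic losses written with standard PyTorch calls; they are not how any particular frontier lab actually implements either objective.

```python
import torch
import torch.nn.functional as F

# Toy contrast between the two training signals described above.
# logits: model outputs over a vocabulary, shape (batch, seq_len, vocab).

def pretraining_loss(logits, next_tokens):
    """Next-token prediction: imitate the average continuation in the data."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), next_tokens.reshape(-1)
    )

def reasoning_loss(logprob_of_sampled_answer, reward):
    """Schematic policy-gradient-style signal: feedback on how good the
    model's own sampled solution was (e.g., did the answer verify)."""
    return -(reward * logprob_of_sampled_answer).mean()

# Usage sketch with random tensors standing in for real data.
logits = torch.randn(2, 5, 100)
tokens = torch.randint(0, 100, (2, 5))
print(pretraining_loss(logits, tokens).item())
print(reasoning_loss(torch.randn(2), torch.tensor([1.0, 0.0])).item())
```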
Alright.
And then one of the other points that you brought up that I'd also like you to just talk about a little bit more kind of reflects, I think, or resembles some of our conversation, uh, with Vijay Pande a couple episodes ago.
And so, you know, he had this very successful academic career and

(48:15):
then he transitioned to industry and to venture capital, right.
And I think he made a very compelling case for, despite himself moving, for why, uh, there are certain problems in academia that are likely only solvable within academia.
And so, maybe my challenge for you, Andy, is like, can you steelman the case

(48:35):
of sorts, while having yourself gone on leave and gone to industry and to Lila.
Can you steelman the case for staying in academia?
What are the types of problems that you should stay in academia, current funding crisis notwithstanding, to be able to solve?
Yeah, I think there's a couple answers to this.
One is the classic, which are problems that have no immediate

(48:56):
or obvious commercial value.
So, like, AI is kind of the opposite of that now, which is why it's so resource intensive, where there's a gold rush to commercialize all things AI. So, classes of problems that have no immediate commercial value and a more long-term horizon are totally in scope, and that would include a lot of theory, both machine learning and other kinds of theory.

(49:17):
I think a place in medical AI specifically that is uniquely well positioned for academics is evaluation.
So, like, actually doing the evaluation to see if AI results in patient benefit.
NEJM AI, you, and Zak have been at the forefront of this, obviously.
I think that there are a lot of perverse incentives for that once you get outside of academia.
And so, having trusted auditors who can know whether or not the

(49:41):
technology actually works is also obviously a great thing for academics to be working on, that has like huge public health and patient benefit that goes along with it.
Alright.
And maybe one last question before the lightning round. Oh no. Which I am so, so excited about, is you described Lila as having these sort of two different components, right?
Like there's an experimental side, robots that are moving plates around on these

(50:04):
magnetic, what are...?
Planar motor systems.
Yeah, planar motor systems.
Planar motor systems.
That's the, that's the term.
And then you have a sort of machine learning side, right?
Mm-hmm.
That is developing models and training and doing computational work.
What do you see as sort of the biggest challenges that you faced already at Lila and what is the sort of the, the sort of key task in the, you know, the thing

(50:26):
that's keeping you up at night, maybe, to focus on for the next year or two in growing Lila and achieving your vision?
Building stuff in the real world is hard, like actually building hardware.
I mean, this goes back to like early days of my life when I was an electrical engineer.
And actually getting stuff to work in the real world is hard.
And there are all these like edge cases, like moving these plates around.

(50:48):
They have liquid in them, which means that they slosh, which means they could be slightly off.
So, when the robot arm goes to pick it up, it's in a slightly different position.
And so, there's like thousands of last-mile challenges like that, that we're solving.
I think the, like, philosophical challenge though, is that all of automation is actually created for people.
And so, one of the things that we're really focusing on is, like, what does automated experimentation look like when there are no people in the loop?

(51:11):
The fact that I said that we put benches next to this planar motor system is a hint that these were actually still designed for people, because people need a place to stand.
They need a place that needs to be approximately, like, shoulder height.
And so, really, there's, like, a second-order set of challenges about how do we actually refactor a lot of these experimental workflows?
If they're just gonna be run by an AI in the cloud and you actually don't have to

(51:32):
have humans in the lab standing there, that's probably the biggest challenge.
And we're making lots of progress on that.
We spend a lot of time thinking about it.
But if I was thinking about really the core challenge, it would be, like, how do we rethink these things from first principles, given that we're doing something that really hasn't been done before?
On the AI side of things, it's all the traditional things.
Like, we're not on O2 anymore.
For those folks at Harvard who use the computing cluster

(51:53):
there, we're not using Slurm.
We're doing very complicated training flows on Kubernetes clusters that have all these orchestration things.
We're scaling up to thousands of GPUs now, and just, like, training at scale is very difficult.
We are building, like, a unique set of training capabilities that gives the model access to a wide set of tools to use.
And actually orchestrating all of that together is also pretty challenging.

(52:16):
I feel relatively better about the sort of AI challenges versus the challenges posed by the real world, but I'm confident that we'll be able to solve both sets.
Are you finding that the folks that you recruit, or the way in which you recruit, is very, very different than in your academic lab, or are there some similarities?
There are similarities.

(52:37):
I would say that the mission of AI for science does a lot of the recruiting for me. When I talk about how we are trying to get an AI that can run the entire wheel of science, to come up with hypotheses, test them, and then update its understanding based on the result, that's a pretty compelling message to a lot of people.
We also are recruiting with industry resources and compensation packages

(53:02):
versus academic compensation packages.
That also makes things easier.
I still think that we get a lot of the same phenotypes, though, of people who are cross-trained in either some hard science or medical science and who are also very, like, deep on the technical side of AI. Again, the recurring theme on this podcast is having multiple sets of expertise live in the same brain.

(53:24):
And, you know, we found that that phenotype has also been good for us and also finds the mission pretty attractive.
Alright.
I think that's a great moment to transition to what— Oh boy. —I'm super, super excited for, which is the lightning round.
Oh God.
So, and Andy, you know all the rules.
Uh, so let's, let's dive in.
Are you ready for this?
I'm not, but let's do it.
Alright.

(53:52):
So, this first one is for your brothers, Andy. Who is the GOAT—that already, that already got you scared—who is the GOAT, that is, the greatest of all time, of basketball? LeBron James or Michael Jordan?
Oh man.
I, I, I feel like when you say greatest of all time, this is not just a statistical consideration, it's a cultural impact consideration.

(54:16):
And I think by that, I'm gonna have to go M.J.
I think that M.J. cha—. Wow.
I think that M.J. changed basketball both globally and in the U.S. in a way that, like, LeBron, while having a statistical claim to greatest of all time, I feel like doesn't have the cultural impact that Jordan had.
I'm gonna disagree with you, but that's fine.

(54:37):
We can move on to the next question.
What is the single biggest barrier preventing large language models from becoming trusted frontline decision support tools in clinical medicine?
I'm gonna say that it's a mix of reliability.
So, the obvious, like, problems with hallucination.
And that they still only represent a partial solution in

(54:59):
a way that a person does not.
So, this is getting better, but they can't use tools; like, if they have to pick up a phone and call someone, they still can't do that.
So, there are still, like, capability gaps that are unrelated to accuracy and reliability that I think still need to be filled before they could totally replace a lot of frontline decision-making services.

(55:20):
Alright, our next question, which is the hardest job. And this is one of my favorite ones to ask now, since we, we've done it to Zak, and I think also to Larry Summers, but

(55:29):
which is the hardest job: being tenure track faculty at Harvard, founding deputy editor of NEJM AI, or CTO of Lila?
Oh man, trying to get, trying to get me in trouble here, Raj.
I'm

(55:49):
gonna go with, I think, I feel like tenure track faculty, just because it's not only the weight of your own ambitions, it's that you're meeting people at this very vulnerable stage in their career.
And I always felt like I internalized a lot of that. If a paper gets rejected, whatever, I have papers.

(56:11):
But for students, those feel like very monumental decisions.
And so, I feel like the rejection hit me harder for that reason than, like, day-to-day challenges in the other two jobs you mentioned.
Great answer.
And I think, again, reflecting how thoughtful you are as a mentor, too, that you can separate sort of your perspective from your students'.
And I totally agree.

(56:31):
It's a very, very important time, and each outcome feels very, very important.
So, that is challenging to navigate.
Alright.
If you weren't in AI, what job would you be doing?
Think outside the box here.
Well, so, I can tell you what I said in kindergarten. In kindergarten, I told my mom that I either wanted to be a Ghostbuster or a trashman. Both noble professions,

(56:52):
but I don't think that's what I would answer now.
If I wasn't in AI,
I actually think some kind of writer.
I always liked writing in undergrad.
I always liked writing essays.
I blogged a little bit during my postdoc.
I think some type of, like, Substack writer, something like that, would be something that I would naturally enjoy.

(57:14):
Yeah.
Nice.
I don't think I would've
guessed that.
So, I like it.
Very great answer.
Pro Smash. Like, a professional Smash player.
That's what, that's what I would've guessed.
Yeah.
Alright, next, next.
And our, our last question: if you could have dinner with one person, dead or alive, who would it be?
I've also thought about this, and I have two answers, 'cause I knew this one was coming.
The first one is just anintellectual one, and I think it

(57:36):
would be David Foster Wallace.
I've read every book he's ever written except for The Pale King.
I've read lots of his stuff over and over again, and I would just be dying to know what he thinks of the future that he largely predicted in a lot of his fiction and nonfiction.
So, I think that would be it.
The sentimental answer is my grandmother, my nana. My mom's mom was the matriarch of our family.

(57:59):
She died about 15 years ago and was always the one that would, like, tell you exactly how it was.
And, like, when you got Nana's approval, that was, like, the best approval that you could get, because she was, like, a tough lady.
You know, a child of the Depression, lived through two world wars, went to college at a time when a lot of women weren't going to college, and was just, like, sort of the bedrock of our family.

(58:20):
And so, I would just love to have dinner with her and kind of be like, so what do you think, Nana?
And then she would tell me exactly what she thought, so.
Nice.
Well, congratulations Andy Beam.
You have survived the lightning round.
Passed it with flying colors. A great job.
Alright, so Andy, I just have maybe one or two last questions here.
More big picture, kind of some concluding thoughts, words of

(58:42):
wisdom that you can leave us with, maybe. The first is, we talk a lot about this on the podcast, and listeners will know that we like to invoke the scale hypothesis as a way to think about large language models.
And we've already talked about it in the context of LLMs, but also in the context of the work that you're doing at Lila.
And maybe I can restrict this, for the sake of this question, just

(59:03):
to medicine and applications of language models in medicine.
So, there's this sort of current state of the models, right?
Like, if we were to just freeze time, freeze the technical capabilities of the models, and ask what they can do, what they'll be able to do within medicine.
We all have predictions for where they are with respect to the things that we

(59:24):
have to do in diagnosis and treatment and other applications in medicine.
And then there's another version of this, which is, how will these models sort of continue to evolve?
Will they continue to evolve?
How much better will they get?
And my question is, can you open up your crystal ball for us again, within medicine, and just forecast, invoking the scale hypothesis where you need to, what is going to happen with LLMs in medicine

(59:49):
over the next few years?
Yeah.
So, again, just to come full circle, I consider the class of problems that I was talking about, from when Kristyn was in med school and during my postdoc, to be solved.
So, estimating the correct conditional probability of disease given symptoms, even if the symptoms are expressed partially, you know, in an incomplete way, even if they need

(01:00:10):
to be elicited from the patient, I consider that problem to be largely solved.
I think I'm going to lay out two classes of problems that we should think about when thinking about the scale hypothesis for health care.
It's automating what we already know how to do, and then doing things that we don't know how to do yet.
So, I would again say diagnosis is automating things that

(01:00:31):
we already know how to do.
My pediatrician missing the whooping cough thing, like, someone knew how to do that.
He just happened to get it wrong.
The big thing that will change health care over the next one to three years is generalized computer use.
So, we've seen tools like this already, from Operator from OpenAI, from Claude agents, but the ability for AI to reliably use a mouse and keyboard solves probably

(01:00:55):
like 90% of the remaining unsolved problems in AI, because you can just have it sit at a workstation and enter orders.
And the question that you asked a while back about what stops it from being a frontline decision tool, I think that's solved when AI can use a mouse and a keyboard reliably to do long-time-horizon tasks.
So, the sort of, like, operational aspects of medicine and health care that AI can't

(01:01:18):
currently do, I think, will be solved by continuing to scale what they're currently doing with computer use.
Go ahead.
Would you include being able to, uh, operate ScholarOne? That might be on the AGI-hard list of tasks.
You know, like, just like, like NP-hard.
Yeah.
Yeah.
The, the final frontier.
The final frontier: operating out-of-date web software that was

(01:01:40):
written in the 90s over a weekend.
So, um, yeah, so I think that generalized computer use is probably one or two orders of magnitude of computing power away from being reliable.
But I'm guessing that will be solved over the next year.
And when generalized computer use is solved, just like step one was solved as a byproduct, many of these other operational tasks will also be solved as a byproduct.

(01:02:03):
So, I imagine that happening, and I know a lot of the frontier labs are pushing pretty hard on that.
Then there are, like, the unknown unknowns.
So, like, there are some diseases that we don't know how to diagnose, that we don't know how to treat, that we don't even actually know how to classify.
There are, you know, many things in public health and medicine generally that are just kind of like dark matter.

(01:02:24):
I think those are gonna have to be unlocked not by scaling, but by something more akin to what we're doing at Lila and what other people are doing, where we're actually just using AI to make science go faster.
I think that, like, that is gonna be more of a five-to-10-year time horizon.
We're gonna need new measurement devices.
So, you know, I know that you know this deeply, Raj, but the resolution

(01:02:45):
that you have on a patient's physiology from the electronic health record is like black-and-white television in the 1940s.
What we actually need is, like, a hundred-foot 8K picture.
Right now, we just don't have that.
So, we don't have high-resolution characterizations of patient physiology, and we'll need new devices that will enable that.

(01:03:06):
I always think about how one of the things that we worked on but never finished in the lab was the ability to do non-invasive measurements via visible light spectroscopy.
So maybe that's not the right technology, but some other sort of, like, mass characterization of patient physiology that you can then feed to the AIs that are being trained in the current scaling paradigm feels like the next, like, big unlock. AI making science go faster will indirectly make medicine

(01:03:32):
and health care better over, like, a five-to-10-year period, but it's hard to know exactly how that's gonna play out.
Going back to one of your earlier answers, do you think that evaluation is gonna remain sort of the critical frontier, the critical academic task, for the next few years?
Yeah, I think so.
I think it has to, just 'cause there's gonna be a lot of stuff coming online.
Integration and implementation science, too.

(01:03:54):
Like, how do you either retrofit Epic with this stuff or do a gut reno so that you can get this type of technology in?
That also feels like a super necessary thing to have happen.
Another area that I think is also gonna take off is human-computer interaction. It's, of course, a very old discipline, but how humans and machines will work together is also sort of poised

(01:04:16):
to really become very, very important for AI and medicine, and for actually getting these tools safely and effectively into the clinic in the next couple years.
Alright.
Last, last question, Andy.
So, we both give a lot of talks about AI to doctor audiences, in various academic grand rounds kinds of settings.

(01:04:36):
And one of the questions I get asked the most in these settings is, physicians come up afterwards and they say, man, this is moving so fast.
You know, great to hear your talk, but what should I study?
What can I arm myself with, what can I learn, so that I'm ready for
this in the next couple of years.
And then there's this other sort of skepticism, which I honestly really like, which

(01:04:59):
is like, okay, some of the stuff you showed was cool, but there's a lot of, like, hype and there's a lot of, like, you know, BS that's out there.
How do I tell what's real, what's not real?
And so, maybe thinking about the physician, the clinician listeners in the audience, what's your advice for staying up to date, other than listening to this podcast, but staying up to date

(01:05:21):
with AI, for providers, for clinicians specifically?
Pick one of the frontier models and use it every day of your life.
So, pay the 20 bucks for ChatGPT.
Pay the 20 bucks for Claude.
Pick one.
Pick both, but use it for tasks.
And see where it breaks.
So, if you're gonna make a taco recipe, ask ChatGPT for a taco recipe.
If you're looking for things to do on your next vacation, ask

(01:05:43):
the model how you would do that.
If you're trying to generate an image for a talk, use the image generation capabilities in these models to do it.
I think that there's no single source of pedagogy that's gonna be helpful here, because the technology moves so fast, and you're gonna get a sense of what it can do and what it can't do just by getting the muscle memory of using it for tasks in your everyday life.

(01:06:05):
So, I've had friends ask me this, and I'm like, just use ChatGPT.
Like, if you think you can't use it, try and use it for the task that you wanna do, and either you'll learn that it actually can do that, or you'll learn that, okay, here's a blind spot in these models.
You'll learn when it hallucinates, you'll learn when to trust it and when to not trust it.
And then, yeah, you'll kind of get a sort of an intuitive sense of how the

(01:06:26):
models work and when they don't work.
I think it's probably not, like, the best use of time to go read, like, the Attention Is All You Need paper or the, you know, the RLHF papers.
I think it's much better to have, like, an intuitive folk understanding of how the models work, and the best way to do that is just to practice every day with them and see where they break.
Amazing.

(01:06:47):
Alright, I think that's a great note to end on.
And Andy, I just gotta say, this was fantastic.
I know a lot about you already, but I learned a lot more on this episode, and thanks so much for coming on AI Grand Rounds.
Yeah,
highlight of my career.
Thanks for having me on, Raj.
This copyrighted podcast from the Massachusetts Medical Society may not be reproduced, distributed, or used for commercial purposes

(01:07:09):
without prior permission of the Massachusetts Medical Society.
For information on reusing NEJM Group podcasts, please visit the licensing and permissions page at the NEJM website.