
September 6, 2023 35 mins

Facing the tough decisions of a serious health threat brings the need for information and analysis into a sharp and personal focus. Computer scientist Regina Barzilay was an expert in natural language processing when she joined MIT; her cancer diagnosis led her to collaborations in healthcare, where she has advanced imaging, prediction, drug discovery, and clinical AI. She joins Munther Dahleh and Liberty Vittert to talk about issues from data collection and privacy to bias and “distributional shift” – when an algorithm is used on a dataset with key differences from the data used to train it.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
[AUDIO LOGO]

(00:03):

How do clinical trials really work? And more importantly, do they work for everyone? Are you being given medical advice really meant for someone else? We touch on this and more today on Data Nation from MIT's Institute for Data, Systems, and Society.

(00:24):
I'm Liberty Vittert. Today, my co-host, Munther Dahleh, the founding director of MIT's Institute for Data, Systems, and Society, and I are speaking with Regina Barzilay. So I wanted to kick us off by talking a little bit about your historical trajectory. When you came to MIT, you were an expert in NLP.

(00:45):
And now you're doing all this work in cancer and in health more generally. Maybe you can take us a little bit into your trajectory and what got you there. And I know that you had your own personal struggles with that, so we'd love to hear more about that. So thank you.
You did know me when I started at MIT 20 years ago.

(01:05):
So when I came to MIT, I was working primarily on natural language processing. And at the time, natural language processing was not what it is today, where everybody knows what ChatGPT is and has used Google Translate, speech recognition, and all the chatbots. At the time when I entered MIT, it was still

(01:27):
a really young, budding field. Translation was not something that people used. The tools were only known to experts, and most of the time when I said I was doing natural language processing, I had to explain what it meant, because people really couldn't visualize any tools in this area. And I was really fortunate to be part of the growth of this field, which moved

(01:49):
from this very experimental laboratory science to something that became a commodity, used today in so many products, in academic research, in our personal lives, in so many industries. And it was a really exciting time. But around 2014, I personally became sick with breast cancer.

(02:10):
And I remember clearly that one of the big surprises of being really sick for the first time was discovering that when you go to a hospital, you don't actually see machine learning in any form, or even data science, or barely even statistics. Even though I was treated at MGH, which is just one subway stop away from MIT,

(02:35):
none of this great technology is part of the patient experience. And when you or your loved one is treated for some really serious disease, information is really key, because you need to make various decisions. And there is no clear answer, and you really want to know: what happened to patients like me? What are the likely side effects that I'm going to get--

(02:55):
all these types of questions. You cannot get answers. You still cannot get answers to many of these questions. But at the time, it was really eye-opening for me. And when I came back to MIT, my first inclination was that I absolutely have to change it. And that's how I did my translation into life sciences,

(03:18):
and I've been working in this field since 2015, for the last eight years. And today, I work primarily in drug discovery and molecular modeling, but I still do some clinical AI, like imaging and other things.
I think that's such an interesting way to dig into so many important topics that you're working on.

(03:40):
And one that I was sort of fascinated by was something that I feel like people have talked about, but a lot of people don't really understand in the same way as what NLP is, or what these large language models are. And it's that people talk about how there's a lack of diversity in the data sets that are training these AI or ML models, or whatever you want to call them.

(04:02):
Could you give some real-life examples, even from your own experience, of what that even means? What are the consequences of having a lack of diversity in these data sets?
So the lack of diversity-- I would say that in the majority of areas of clinical AI, before we even start talking about diversity, we really lack data sets to start with.

(04:23):
So even if you're looking at the most basic areas-- let's say you want to look at mammograms. You can say, what's so special about that? The vast majority of women age 40 and up go and get their mammograms across the country, in rural hospitals, in high-end hospitals. Everybody does mammograms. There is no publicly available data set

(04:46):
of mammograms that can be used. So if you are a machine learning researcher or a data science researcher and you want to try your algorithm on this data set, it doesn't exist. The only way for you to get to this data is to connect with a collaborator in the hospital, and then you need to go through an extremely challenging process of actually getting the data. And for the vast majority of people who

(05:08):
work in computer vision, data science, and machine learning, there is no clear pathway to get to this data. And we're not only talking about mammograms, but about a variety of other diseases. Now, in some areas we do have data sets. And one example here relates to a disease for which risk assessment and diagnostics are crucial: lung cancer, given

(05:30):
the high mortality of this disease. So in that particular case, there is a really big, rich data set of CT scans that was part of a clinical trial, where you have an image and you know what happened to this patient within six years, so you can train the model. And the interesting part is that the way the patients

(05:53):
were included, the inclusion criteria were such that 95% of the cohort are White people. Which means that African-Americans, who have high mortality from lung cancer, and Hispanics and Asians are not represented in the data set. And then you're training your model, which

(06:16):
looks at particular characteristics of the data, and there is really no way for us to say whether it would generalize to the other populations. But we can go even one step further. Not only thinking about race-- even if you're looking at people who are non-smokers. Because one of the things that we observe, and it's

(06:37):
true around the world and in the United States, is that there is a significant increase, actually a frightening increase, in the population of people who were never smokers, never, who get lung cancer. However, the cohort, the way it was constructed, only has people who are very heavy smokers. So the models that you train-- and there was significant investment by the National Cancer Institute

(06:58):
in creating this data set. In the end, it doesn't really deliver the goods, because you optimize for one type of population while you really want to apply it to the general population. So most of the data sets that we collectively have access to today were not designed with machine learning and data science in mind. They were created for other purposes. And all these different kinds of biased collection mechanisms

(07:23):
have a direct impact on the models that we are developing.
This is actually great, and maybe a segue to what you mentioned earlier about drug discovery. My understanding in that field is that you're working at multiple scales. On one scale, you're looking at chemical and molecular biology and interactions at the detailed level.

(07:45):
But then you also have experimental data, and at some level you have observational data. And this data has to be integrated in an interesting way to propose new drugs, and then we continue the cycle. So for one, it's complicated to do at this scale, integrating these multiple scales. And the other issue is, again, the question of bias and diversity

(08:09):
and really looking at the different parts of the population and so forth. So since you are in this field now, what are your thoughts about this?
So data actually penetrates all the levels of discovery, because whenever you are trying to predict a basic thing-- say you're thinking about a small molecule. The vast majority of drugs that people are taking

(08:30):
are small molecules. That's something that, for instance, you take in. And for many of them, the way they operate in our body is that there is some kind of misregulated protein, for instance, and then the molecule binds and puts a patch on it, and then we get the behavior that we want. Of course, there are many mechanisms, but this is one of them.

(08:51):
And you need to understand, if I put in this small molecule, where does it go? Ideally, it would go to the place it should go to help with the misregulated protein, but sometimes-- and that's why we have side effects-- it goes to many other places that we didn't expect it to go. And it also depends on our own chemistry. So the basic step for many points in drug discovery is really understanding this small building block:

(09:12):
if I give you the protein and the molecule, where would it go, and where and how does it connect? You learn this from data. People couldn't solve this problem for decades using more traditional physics-based approaches. Then deep learning came in: you give it the protein and the small molecule and how they're geometrically connected, and you can train the model to put them together.
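To make that concrete, here is a minimal sketch of the kind of data-driven pipeline being described: featurize a small molecule and a protein, then fit a model that predicts how well they bind. The featurization choices, the tiny placeholder records, and the regressor are illustrative assumptions, not the actual models her lab uses; real training labels would come from an experimental binding database.

```python
# Hedged sketch: predict protein-ligand binding from data (illustrative only).
# Assumes RDKit and scikit-learn are installed; the "pairs" below are placeholders,
# not real measurements.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ligand_features(smiles: str) -> np.ndarray:
    """Morgan fingerprint of the small molecule."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)
    return np.array(fp, dtype=float)

def protein_features(sequence: str) -> np.ndarray:
    """Crude amino-acid composition of the target protein."""
    counts = np.array([sequence.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)

# Placeholder (protein sequence, ligand SMILES, binding score) triples.
pairs = [
    ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "CCO", 0.2),
    ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "CC(=O)Oc1ccccc1C(=O)O", 0.7),
    ("MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI", "CC(=O)Oc1ccccc1C(=O)O", 0.4),
]

X = np.stack([np.concatenate([protein_features(p), ligand_features(s)]) for p, s, _ in pairs])
y = np.array([score for _, _, score in pairs])

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict(X[:1]))  # predicted binding score for the first pair
```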

(09:34):
So data is everywhere. But as with clinical data, there are certain questions for which we don't have data today, because the data sits with pharmaceutical companies, who don't want to share it, or for many other reasons. For instance, if you don't only want to know where it connects but you want to know the affinity-- how strongly it connects-- it's actually really non-trivial to get data for that.

(09:56):
It doesn't exist. So again, we're not always solving the problem where we really need machine learning. Many times we're solving the problem because the data is or isn't available. But talking about bias there-- there are many interesting ways that bias comes into play. And let me give you an example related to data availability.

(10:17):
So as I said, at the lowest level you just want to understand how molecules interact, but then you have another question: how do you know what is a good target? You now need to decide which protein you want to connect to in the first place. And one important data resource that is used around the world is the UK Biobank. The UK Biobank is a big collection of medical data

(10:40):
about patients, which is de-identified. You can apply, and you can get it. It is used in the pharmaceutical industry. It is used in academia. And one interesting question here is that we are now learning and designing drugs based on one population, the UK population. And I'm sure some of the drugs would work equally well across populations,

(11:02):
but there will be some drugs which will only work for a certain population, and we don't have access to this data. And this is really a funny situation, where on one hand people are really emphasizing privacy. And you say, we want to keep our data. We don't want to share our data. But on the other hand, fine, you don't share-- but then whatever is developed

(11:22):
may not be developed for people like you. And we know that there are a lot of diseases which really are very much ingrained in your genetic pool. And if it is underrepresented in the set, most likely it's not going to be very effective on you.
That brings up so many interesting questions about bias

(11:44):
for me. Because as you mentioned, whenever I think about bias in AI, the examples that come to my mind really have to do with ethnicity or gender. And I never thought about smokers versus non-smokers, or I imagine age-- young kids. The treatment for a seven-year-old is going to be potentially very different from the treatment for a 40-year-old. And so how is it possible--

(12:06):
I mean, I'm sure there's lots of different angles that you could come at to fix this. Is one option to have regulations imposed on declaring the sources of data used to train these algorithms, or what angle would you come at to fix this?
So I think that the danger is unknown unknowns

(12:27):
because, let's say, you mentioned gender, age, maybe clinical history. If you look at even the most traditional papers published in medicine, if you use a normal statistical model with predictive power, they will break it up and say, this is what happened to women, this is what happened to men, this is for different ethnicities. That part you can directly apply and see,

(12:50):
and now there is, obviously, a lot of emphasis on making sure that all the different parts of the population are represented. But the problem is that most of the time when we are getting these data sets-- and now it's even funnier, because the people who create the data sets are not the ones who analyze the data sets-- we really don't know what the bias is. And this is one of the things that I find particularly

(13:13):
troubling about the recent FDA regulations that just came out, which we as the public are supposed to comment on. It relates to the fact that the way they think we can address the problem of bias is: I give you all the statistics on my training set, and I give you all the statistics on the test set, and then you miraculously would know whether you can apply it to your population.

(13:34):
But a population is not described by just eight variables. And it can be something that we haven't even measured, and we don't know that there is a bias, or some particular skew. And that's why I think that believing we can explicitly put the tables in front of a human, and that we with our minds can identify the biases and abnormalities, I

(13:56):
think is a really misguided idea, because it is a statistical question. If I now give you a population, can you tell me: is it distributionally similar to that other population? Can I identify, in my data that was collected by many different centers, what portion of the data is somehow statistically abnormal?

(14:18):
Can I have a tool that will tell me-- I trained my model on this population, and now I'm applying it in my hospital-- that will tell me, you shouldn't be trusting me on this patient, on this specific patient, because it is different? And the example that I always give in this case: think about our cars or our microwaves. They all have this device that tells you

(14:40):
the car is not working, you need to bring it to the garage. It's not because we understand it, or because we can open it up and see what happens. There is something that alerts you and says, you shouldn't be using this. We don't have that today in our clinical AI models. So of course there is an effort, and we should put effort into collecting data and making sure that it is as representative of our population as possible.

(15:01):
But I think another big part of it is actually algorithms and statistical models that can identify these troublesome populations and prevent them from happening.
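As a concrete illustration of the kind of "warning light" she describes, here is a minimal sketch of one common approach: flag patients whose features look statistically unusual relative to the training data, using a Mahalanobis-distance check. The synthetic data, the threshold, and the choice of score are illustrative assumptions, not her lab's method or anything a regulator requires.

```python
# Hedged sketch: flag test patients that look out-of-distribution relative to training data.
# The data here is synthetic and the 97.5th-percentile threshold is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 8))            # training cohort features (e.g., age, labs, ...)
X_new = np.vstack([rng.normal(size=(5, 8)),    # patients similar to training
                   rng.normal(loc=4.0, size=(3, 8))])  # patients from a shifted population

mu = X_train.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_train, rowvar=False))

def mahalanobis(x):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Calibrate the "trust" threshold on the training cohort itself.
train_scores = np.array([mahalanobis(x) for x in X_train])
threshold = np.percentile(train_scores, 97.5)

for i, x in enumerate(X_new):
    score = mahalanobis(x)
    flag = "DO NOT TRUST MODEL" if score > threshold else "ok"
    print(f"patient {i}: distance={score:.2f} -> {flag}")
```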
So Regina, I think what you're alluding to is a complete system change for us to be able to do this right. Because at one level, you mentioned that some of the data

(15:21):
is available, but it's not shared. At another level, we actually don't have enough diversity in the clinical trials. And then we have failures of the system to be statistically correct for a certain subgroup of people, which requires some measurement and so forth. So these are very different things. One is almost like you have to create a data-sharing market,

(15:44):
for another you have to encourage people with less prevalent features to go for clinical trials, and the third one is actually a better understanding of the statistical implications of all this work. I mean, who is doing all of this? Who's actually bringing all of this together?
So let me start with the third one, because this is something that we at MIT

(16:06):
actually have the capacity to do. I think that there are a lot of very hard algorithmic questions that are currently not addressed, because for the longest time in machine learning-- be it clinical machine learning or any other machine learning-- the only thing that we cared about was our accuracy on the test set. So the first step was to realize that we are actually not testing it on the test set. We're testing it on a diverse population,

(16:27):
and every machine learning problem is a transfer learning problem, because there will always be distributional shift, no matter what you do. But then there are all these other questions that were very strongly studied in traditional statistics, like uncertainty estimation and calibration. All these things-- they were not just second-class citizens, they were n-plus-1 citizens in machine learning.

(16:48):
So I think that one of the things the technical community needs to do-- and it's happening now, but maybe a bit slowly-- is to really bring this capacity that we need to the regulators and to the practitioners, to provide it as part of the models that optimize accuracy. I could give you even a simple question that we had.

(17:11):
We have a hospital network in developing countries that utilizes the AI tools that we develop in general clinics. And one thing that we've done was to say, OK, I have my tool. I tested it fine on many populations, but now I'm going into your population. I am going to validate it on 10,000 examples to make sure that it predicts correctly, and then you can use it.

(17:32):
And in many of these hospitals, people say, we don't have 10,000. It costs us a lot. And then others said, so is 10,000 enough? Maybe I should do 20,000. In an ideal case, you would have some statistical estimate that says, if you want to have this error bound, that's how much data you need. We shouldn't need to rely on our intuition

(17:53):
about some number-- 10, or 5, or 20. We don't have these tools, and there are mathematical solutions to all these questions.
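The kind of estimate she is pointing at can be as simple as a confidence-interval calculation: how many labeled local cases are needed so that the measured accuracy lands within a chosen margin of the true accuracy? The sketch below uses a standard worst-case normal approximation; the specific margins and the 95% confidence level are illustrative assumptions, not numbers from the episode.

```python
# Hedged sketch: how many validation cases are needed for a target error bound?
# Worst-case binomial variance (p = 0.5) with a normal approximation; z-values hardcoded.
from math import ceil

Z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}  # standard normal quantiles

def validation_sample_size(margin: float, confidence: float = 0.95) -> int:
    """Cases needed so the measured accuracy is within +/- margin of the true accuracy."""
    z = Z[confidence]
    return ceil(z ** 2 * 0.25 / margin ** 2)

# Example: a +/-2% margin at 95% confidence needs about 2,401 cases,
# rather than a guessed 10,000 or 20,000.
for margin in (0.05, 0.02, 0.01):
    print(f"+/-{margin:.0%} margin -> {validation_sample_size(margin)} examples")
```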
So I think that we as a research community should really prioritize this question and develop these techniques. Because one of the reasons you can say that the FDA was incorrect in assuming that it's a human's job to monitor and ask, is it diverse enough,

(18:14):
is it biased enough-- how do I know? The truth is, we don't provide the regulator with well-equipped statistical and machine learning tools which can help guide them to answer those questions. So I think this is up to us at this point, because we need to do the development. Now, related to the other questions about the data--

(18:36):
I think it's an extremely complex phenomenon, because there are many different stakeholders. There are legal questions, because HIPAA was written with insurance in mind. It was not written with machine learning in mind. But at the same time, there are some encouraging developments. For instance, NIH is working on All of Us, this very big million-

(18:58):
person data set, and the Broad is the one who actually collects it, along with other centers. So there are ongoing efforts in various parts of the world, but it's still very fragmented. And even if you're thinking about places like All of Us, it's still very hard to use it, because you need

(19:20):
to use it on their computers. It's very expensive. So something is happening. I think that we all need to put in our effort to really make it happen. And regarding the second point, Munther, that you pointed out, which is very correct-- really ensuring that all of us are contributing and donating-- I think that, to me, we really need

(19:41):
to educate the public about this dilemma: that if you want to de-risk, you actually need to help by donating your data and being part of it. Because if you are not part of it, the tools are not going to be optimized for you. And I think that this point is lost today.

(20:02):
It's such an interesting question, because I think when people think of de-risking the AI, they think of protecting their data more and not giving it up. And that brings me to-- this is probably a question I should have asked a while ago, but I can't help but think about it, and I want to really make it clear for our audience. When I hear the word bias, or when I hear, oh, a regulatory body needs to come in and tell you

(20:24):
whether things are too biased or not, it makes me imagine that someone did something wrong. That there was something almost intentional about this bias. But I'm not sure that's the case. And so could you talk a little bit about whether there was ever intentional bias put into these data sets, or whether the bias has always been accidental, and therefore the fix to it

(20:46):
is really education rather than punishment?
So in the majority of cases, the bias comes, I think, without bad intention. And I would give you an example of bias that happened in my lab, and it's a very bizarre bias. Remember, we had just started working on images in 2015.

(21:07):
And what we tried to do was to predict whether the patient has a future risk of breast cancer. But the first test that we did was actually to look at the image and predict whether there is cancer. And to my surprise, when we built the first model, the student came back and said, we got 99.9 on the test. And you know, and Munther knows, that when a student tells you

(21:27):
something like that, something is wrong. It cannot be true. So my first thought was, maybe the test and train sets are not separated. Maybe they are the same. It wasn't the case. We literally spent two weeks trying to figure it out-- because we knew something was wrong, but we couldn't understand what was wrong. We had received the data from another hospital, and we couldn't understand what was wrong.

(21:48):
So after literally two weeks and a lot of exploration, we found out the reason behind it. And the reason was that, for whatever reason, the data providers had put all the positive cases in, let's say, 2010 to 2012, and all the negative cases in 2014 to 2015. And in the middle, they changed their machine.

(22:10):
And on the image, there is an imprint of which device produced the image. So in that case, you had a fully confounding variable-- the source of the image-- which perfectly correlated with the negative and positive cases. So if you just give it the image, it's a very simple and deterministic task to say which machine it comes from, and you can get 99.9% on the test.

(22:33):
And if it hadn't been 99.9-- if it had been 88, I would have said, wow, we really did a great job. It's just that 99.9 never happens in reality. So the point is that it took a lot of investigation-- targeted investigation-- and you can miss these things. So a lot of this bias comes from the fact

(22:54):
that there was somebody who made a decision, and they never shared it or documented it.
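A simple audit would have caught that scanner imprint much faster: check whether any piece of acquisition metadata can, by itself, predict the labels. The sketch below does this with a quick cross-validated classifier on single metadata columns; the column names and synthetic data are illustrative assumptions, not her lab's actual pipeline.

```python
# Hedged sketch: audit a data set for label leakage from acquisition metadata.
# If a metadata field alone predicts the label far better than chance, it is a red flag.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 1000
# Synthetic example mimicking the story: scanner model almost determines the label.
scanner = np.where(rng.random(n) < 0.5, "scanner_A", "scanner_B")
label = (scanner == "scanner_A").astype(int)
label = np.where(rng.random(n) < 0.02, 1 - label, label)  # a little noise
df = pd.DataFrame({"scanner_model": scanner,
                   "site": rng.choice(["site_1", "site_2"], n),
                   "cancer": label})

for col in ["scanner_model", "site"]:
    X = OneHotEncoder().fit_transform(df[[col]])
    auc = cross_val_score(LogisticRegression(), X, df["cancer"], cv=5, scoring="roc_auc").mean()
    note = "<-- possible leakage!" if auc > 0.7 else ""
    print(f"{col}: AUC from metadata alone = {auc:.2f} {note}")
```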
Another example-- again, a bias. I remember I was looking at some prediction from tabular data for predicting who is going to get breast cancer, or something like this. And in that particular table, one thing that struck me was that-- again, some humongous portion of the women in that list

(23:18):
had no children. It's like, wow-- the number of children correlates, it increases your chance, but not that strongly. So I asked them about it, again just from really randomly scrolling through. And they told me that the women have this questionnaire. Some of them put in the number, whatever the number of children is. Some of them just don't, for whatever reason. And then the software automatically put a zero everywhere, and nobody paid attention to it.

(23:42):
This was some default, and nobody paid attention. And then you can come and build yourself a model and make the prediction that women with cancer have no children. So lots of it is just really low-level mistakes, and we don't even know how they got there. There is truly no bad intention.
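That silent default zero is exactly the kind of thing a basic data audit can surface before any model is trained: look for suspicious spikes at default-looking values and treat them as missing. A minimal sketch, with an invented "num_children" column and synthetic numbers standing in for the real questionnaire data:

```python
# Hedged sketch: flag columns where one value (often a silent default like 0) is
# suspiciously over-represented, and treat it as missing instead of as real data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
true_children = rng.poisson(1.8, size=2000)     # what women actually reported
answered = rng.random(2000) < 0.6               # only 60% filled in the field
df = pd.DataFrame({"num_children": np.where(answered, true_children, 0)})  # software wrote 0

counts = df["num_children"].value_counts(normalize=True)
top_value, top_share = counts.index[0], counts.iloc[0]
if top_share > 0.4:  # the 40% threshold is an arbitrary illustrative choice
    print(f"Warning: value {top_value} covers {top_share:.0%} of rows -- possible silent default")
    # Safer handling: convert the suspicious default to an explicit missing value.
    df["num_children"] = df["num_children"].replace(top_value, np.nan)
    print(f"Rows now treated as missing: {df['num_children'].isna().mean():.0%}")
```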
But I will give you another example, where people were very honest about the bias, but it doesn't stop physicians

(24:04):
from using it. There is a model called the Tyrer-Cuzick model. This is a model that looks at your categorical data-- whether you had children, when you had your period, and so on-- and decides what your risk of breast cancer is. And the reason it is used is that, according to the US standard of care, if you are above a 20% risk

(24:26):
according to this model, you can be screened with MRI. You can be given certain drugs to decrease your chance of breast cancer. There is a whole bunch of things that you can get if, by this model, you are predicted to be at risk. Now, this model is not great. The accuracy is around 65 AUC, when 50 is random. But the funny part is that it performs

(24:48):
really abysmally for African-Americans, for Hispanics, for Asians. And the reason is that this model was created-- it's a normal statistical model-- in London decades ago on White women. And it was described as such in the paper. Nobody misrepresented anything, but it is used for everybody else because there was no other model.

(25:09):
So this is an example where the people just described what they did-- they had the data, they did it for the British population-- but then it came in and started to be used in all the other places. So there are lots of sources of bias. It comes in many different ways, none of it with bad intention. And that's why I think we really should

(25:29):
stop using human intuition to try to identify it.
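One statistical alternative to eyeballing tables is simply to report the model's discrimination separately for every subgroup you can identify, so gaps like the ones she describes for the Tyrer-Cuzick model show up as numbers rather than intuitions. A minimal sketch on synthetic predictions (the group labels and scores are invented for illustration):

```python
# Hedged sketch: break model performance down by subgroup instead of trusting one global AUC.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 4000
group = rng.choice(["White", "Black", "Hispanic", "Asian"], size=n, p=[0.7, 0.1, 0.1, 0.1])
y_true = rng.binomial(1, 0.15, size=n)
# Synthetic scores: informative for the majority group, nearly random for the others.
signal = np.where(group == "White", 1.5, 0.2)
y_score = y_true * signal + rng.normal(size=n)

df = pd.DataFrame({"group": group, "y_true": y_true, "y_score": y_score})
print(f"overall AUC: {roc_auc_score(df.y_true, df.y_score):.2f}")
for g, sub in df.groupby("group"):
    print(f"{g:>8}: AUC = {roc_auc_score(sub.y_true, sub.y_score):.2f}  (n={len(sub)})")
```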
Which is great, except that there is also regulation and so forth, as you mentioned earlier. There's a lot of human intervention, which, as you said, can make big mistakes and so forth. So there's one philosophical dilemma that I want to present to you, because we've

(25:51):
been talking about this, at least with this initiative that we have on systemic racism. Because at some level, we want all algorithms to be freed from biases, and so we remove features that describe you being Black, or being from a poor community, and so forth. Yet at the same time in health care, some of these features

(26:12):
are actually critical and important for diagnostics. And so we're trying to balance: when is a feature important to include, and when is it not? And that extends to socioeconomics and demographics. I mean, we know certain areas in the United States have a high probability of getting cancer.

(26:32):
It's environmental. It's habits and so forth. And yet at the same time, when you start asking questions about using census data and location data for diagnostics, you get a lot of pushback, because you are biasing against groups of people. And so how do we manage this tension, between doing what we think is

(26:53):
the right thing, and ignoring important data for good diagnostics, and hence therapeutics?
I think that, first of all, when we think we are removing the data, many times we are not removing the data, and we've demonstrated it. [INAUDIBLE] group demonstrated it. Many people have demonstrated that, for instance, looking at the image for a mammogram,

(27:13):
you can predict the race of the woman. So even if you remove the race, it's still there. It's still there-- not in a way that a doctor can detect, but it's still there. But I actually think it's not about removing it in the training, because the only thing that we care about when we're predicting health outcomes is to be as accurate as possible. I think the question is, what happens next,

(27:36):
when you have the prediction? Because we know that there are a lot of diseases that really are specific to a subpopulation. I'm an Ashkenazi Jew, and there is a whole slew of diseases that happen in Ashkenazi Jews. Would it really help to remove this information, which would most likely result in decreased accuracy when predicting for me? I don't think so.

(27:56):
But what we need to make sure of is that once we discover it and know the potential outcome, we are actually fair, and the system doesn't systemically mis-serve people. And I will give you an example for breast cancer. For instance, for a known reason, many African-American women get onset of breast cancer very

(28:19):
early on-- before 40. And today, per the US guidelines that were just published a few weeks ago, screening starts at age 40. I know an endless number of African-American women who were diagnosed before that. And if you are designing it for one population, and you're not thinking about other populations, especially

(28:40):
populations of young women who don't even think about breast cancer, you are putting them in a situation where their health outcomes are going to be much worse. And indeed, this is the case for African-American women with breast cancer, whose outcomes are much worse than for White patients, because many of them, due to the lack of screening and awareness at a young age, miss a critical time when they

(29:02):
could have been treated with maybe less invasive treatments. So I think it's more about what you do, and how you create regulations that are equal across different groups. And it's true for other cancers. Again, non-smokers and others, like in Asian populations. Given that in the US you need to be a smoker to be screened,

(29:26):
the ethnic groups which have more prevalence of lung cancer are discriminated against.
So it seems like there is really a concern about the efficacy of existing treatments because of this. Just like, if you're an African-American woman, maybe you should be going for a mammogram starting at 30 instead of 40.

(29:46):
And so is this really a big issue right now? Are there a lot of examples of this, where you feel like the efficacy of certain treatments is really in question because of groups? I mean, should the public be worried about this for different diseases and different treatments, and what can they do? What can someone do to know how they should be treated based upon whatever their ethnicity is,

(30:09):
or their environment, or whatever it is? Is there any resource for them?
So I think, of course, it's an extremely complex problem. There is a lot of documentation that there are certain groups that systematically get lower-quality health care. And for instance, we know that prostate cancer disproportionately, in terms of the severe cases,

(30:30):
affects African-American men. And I was just recently listening to a talk where they were describing that, since most of them are not treated in places like MGH, even the quality of the biopsies is not good. Sometimes they're treated in centers where there is an explicit difference in the health care and access to insurance and so on. So this is, of course, super important,

(30:51):
and I'm sure there are many other people who can speak about it more effectively. But the question is what we can do as data scientists to stop it, or at least to intervene. And I think the problem is-- as you suggested, let's say we start screening women at age 30. The question is, what are you going to do? African-Americans-- are they going to be screened every year?

(31:12):
Are you going to do Ashkenazi Jews, because many of them have BRCA mutations? And then it starts: what is the population? And I think everybody would be fine if you do a first mammogram at age 30, you look at the image, and you say, this woman is unlikely to develop cancer in the next 10 years-- we don't need to see her for 10 years-- while this woman really should be coming every year or every two years.

(31:32):
Then you can design patient-specific interventions based on their personal characteristics, not even on a bigger group. Because if you look at African-Americans, there are many different types of African-Americans, and they have different types of predispositions. So the point is, we need to have a predictive tool that

(31:53):
can look at the patient and say, this patient, this is their likely trajectory, and that's what they need. Then you can service a broader population without burdening the system economically, and at the same time without overexposing people to radiation and other side effects of treatment. So I think that AI can actually play a really significant role

(32:13):
in this change.
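The "flexible mechanism" she contrasts with one-size-fits-all guidelines can be as simple as mapping each patient's predicted risk to a follow-up interval. A minimal sketch, where the risk thresholds and intervals are invented for illustration and are not clinical recommendations:

```python
# Hedged sketch: turn a per-patient predicted risk into a personalized screening interval.
# The thresholds and intervals below are illustrative placeholders, not clinical guidance.
from dataclasses import dataclass

@dataclass
class Patient:
    patient_id: str
    predicted_5yr_risk: float  # output of some validated risk model, e.g., image-based

def screening_interval_years(risk: float) -> int:
    if risk >= 0.05:
        return 1   # high predicted risk: annual screening
    if risk >= 0.02:
        return 2   # intermediate risk
    return 5       # low predicted risk: longer interval

patients = [Patient("A", 0.08), Patient("B", 0.03), Patient("C", 0.004)]
for p in patients:
    print(f"{p.patient_id}: predicted 5-year risk {p.predicted_5yr_risk:.1%} "
          f"-> screen every {screening_interval_years(p.predicted_5yr_risk)} year(s)")
```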
So actually, tying this to what you said earlier-- and that is the impact of insurance. I have a history of colon cancer in my family, so I screen. And because we have a history of colon cancer, we cannot screen in a non-invasive way.

(32:35):
You have to have it the invasive way, which always results in many different tests and so forth, which is a pain, right? But the insurance company will not accept a prediction for my case right now. So in other words, even though every test has been clean and many doctors feel that, OK, there are really 10 years here

(32:58):
before you need to come back, the protocol is every two years, and they will not deviate from the protocol because they're afraid of a lawsuit. And so this is what's happening with insurance and this personalized medicine. Is it that what we need to do is potentially more clustering and so forth? I don't know if you have thoughts about this.
Well, absolutely. Absolutely. Because if you think about it, the major societal

(33:20):
dilemmas-- screening frequency is a big one, correct? Because in your case, it's an invasive procedure, which may have side effects and which is really disruptive [INAUDIBLE]. If you're looking at mammograms and how often they are done, it's a big decision. There are lots of people-- I call them mammogram deniers-- who think that it doesn't help.

(33:42):
That there is no point in screening women every year. That maybe it even increases cancer because they're irradiated. There are all these dilemmas. And the reason is that we are trying to create a policy: this is one cluster, a rough cluster, and everybody has to follow it. And if you think about it-- I always give this example, but I think it's really telling-- if you look at Amazon, it's not like Amazon divided us:

(34:04):
you are a woman with a PhD over 50, you are a teenager, and so on. We have a very flexible mechanism that looks at everything we click and buy and do, and we get a recommendation. Why can't we have this type of fluid assessment for our health care system?

(34:24):
Because we don't. Because we're divided into these very rough clusters, and everybody is getting the same thing. I think that's one of the reasons that we are so ineffective. And how we can change it-- I think it's something that we have to change.
[MUSIC PLAYING]
Thank you for listening to this month's episode of Data Nation.

(34:45):
You can get more information and listen to previous episodes at our website, idss.mit.edu, or follow us on Twitter and Instagram @mitidss. If you liked this podcast, please don't forget to leave us a review on Spotify, Apple, or wherever you get your podcasts. Thank you for listening to Data Nation, from the MIT Institute for Data, Systems, and Society.
