Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
You can speak to GPT-4 like you would speak to another human.
(00:04):
We know how to communicate.
I don't speak protein.
I don't know anyone else that does.
Do I spell it out in amino acids?
How did the best machine learning practitioners
get involved in the field?
What challenges have they faced?
What has helped them flourish?
Let's ask them.
(00:26):
Welcome to Learning from Machine Learning.
I'm your host, Seth Levine.
Hello, and welcome to Learning from Machine Learning.
On this episode, we have a very special guest, Dr. Michelle
Gill.
She's the tech lead and applied research manager at NVIDIA.
She works on projects like BioNeMo, a framework
and inference service with bio foundation models
(00:48):
for AI-assisted drug discovery.
She gave the keynote at the most recent PyData NYC.
It was one of my favorite talks, and I'm
thrilled to have you on the podcast.
Thank you for the invitation, Seth.
It is awesome to be here.
I'm really excited.
Me too.
So you are a little unique for the podcast
(01:09):
given that my background is in NLP.
So yeah, why don't you kick us off?
What's your background, and what initially attracted you
to machine learning?
Yep.
I'm proud that I don't fit many molds;
I wear it as a badge of honor.
I did my PhD in structural biology and biophysics
at Yale.
And then for postdoc, I studied enzyme dynamics
(01:31):
with a technique called nuclear magnetic resonance
spectroscopy, NMR for short.
I was particularly unique in that I didn't really
do computational work.
For my scientific career, I was what
we would call a wet lab scientist.
I actually did experiments, expressed proteins,
and studied them.
When I became a scientist at the NIH, it was around 2014.
(01:56):
That was shortly after AlexNet won the ImageNet competition.
And it became very clear to me that there was probably
going to be something to this machine learning thing
that many people were very excited about at the time,
and that, while it had already
had some impact in various fields,
it was going to have tremendous impact across many more.
(02:19):
And certainly, I wanted to stay very close to science,
but I wanted to learn more about this
and basically hopefully do this for my career.
Like I said at that time, machine learning and science
wasn't exactly a thing yet, certainly not to the level
that it is now.
But I managed to stay pretty close,
(02:40):
and now it's actually pretty easy to work
at those intersections.
Very nice.
Yeah, I think AlexNet was a moment for a lot of people,
both in and out of the field, being
able to see something that could come in and just
slash error rates on such a hard problem.
And it started to become more tangible
(03:00):
that this sort of stuff was going to be usable
and applicable to other fields.
OK, yeah, so back in 2014, during that time,
you were working in the lab.
But now, fast forwarding, you are
doing computational work, right?
So having seen the difference between the hands-on
work and how fast things are moving now,
(03:22):
can you speak to that?
So certainly, having worked in the lab
impacts the way I think about data.
It helps you deeply understand the data
and where there might be errors.
But certainly, the field now of machine learning and science
is moving so fast.
(03:43):
When I gave my PyData talk, I had a slide
that was a joke, but not really:
by the time you finish reading this,
there will be a new protein design paper
out completely revolutionizing the field.
And it's kind of true.
Actually, one of my side projects
is using some retrieval augmented generation models
to identify and recommend new literature
(04:06):
as it comes out for me, because it's a nontrivial thing
to follow the literature these days.
So yeah, it's changing incredibly fast.
I think there are so many potential areas
where it's starting to and will change the field.
I think we don't always know exactly the right ways it's
(04:27):
going to fit in, but there are many, many people
trying to figure that out.
Yeah.
Machine learning, AI, NLP, the field's moving so fast.
The joke is that by the time it gets peer reviewed,
it will be outdated.
So that's why there are nice things like arXiv,
where people can just get their work directly to readers.
(04:51):
There are some pros and cons to it, though.
Of course.
There certainly are.
You always have to evaluate what's there
and learn to read between the lines
to see what the strengths
and actual weaknesses of a particular piece of research
are.
You don't always know.
But I also think open access is a model
(05:11):
that has allowed machine learning
to gain a lot of traction.
Certainly, the field of science is
undergoing some transitions.
And not all of the scientific literature
has been open access, but a lot of that is changing.
I think that's been very helpful to the growth and acceptance
of machine learning.
Yeah, absolutely.
(05:33):
So in this field, even with the limited amount
that I know about it, I'm very interested in it.
I wish I knew more, but I can't go on without talking
a little bit about AlphaFold.
What were your initial reactions to it?
Could you tell us some more about it?
Sure.
So AlphaFold is a model
(05:54):
from a family of models called equivariant models.
And the purpose is to predict the three-dimensional structure
of a protein.
So biology, and we can talk maybe a bit about this
in a moment, can often be represented with text,
which is not a coincidence because that's
how we communicate with each other.
(06:15):
So it's not surprising that we have developed ways
to represent various biological moieties
in different fashions with text.
And certainly, it's a valid representation.
Those models have power and use and advantages.
But to describe a three-dimensional protein
structure, as you might guess, you often need three dimensions.
(06:39):
There's some other ways to do it.
So AlphaFold takes in a sequence of amino acids,
the building blocks of a protein.
It takes that amino acid sequence.
There's one-letter abbreviations for each
of the 20 naturally occurring amino acids.
And then it predicts the three-dimensional coordinates
(06:59):
of that protein.
And so in 2018, AlphaFold won CASP,
the Critical Assessment of Structure Prediction,
which is a protein structure prediction competition.
And there's a great paper written by a colleague of mine,
Jeff Hoke.
And he went in and examined all the metrics
(07:20):
and all the different types of metrics
that were measured in CASP.
And AlphaFold didn't just win.
It dominated.
It really, really changed the way
we think about machine learning.
And that was the AlexNet moment for science, in my opinion.
So I started doing machine learning in 2014,
(07:40):
but not too far down the road,
we had our own moment of intense realization
that this was going to change everything.
There are certainly challenges with AlphaFold.
I'm not saying it's a panacea.
Drug discovery and science just in general are challenging.
(08:00):
And it has limitations.
But of course, we iterate on those.
That's how the field progresses.
Right.
So some of the work that you're doing at NVIDIA.
So can you tell us about BioNeMo?
Sure.
So BioNeMo, as you briefly alluded to,
is an inference service and a framework.
The inference service hosts models from the community,
(08:23):
including AlphaFold, OpenFold, all these derivatives
of the folding models.
There are models for docking small molecules into proteins
and finding their binding pockets.
There are models for protein representation learning.
There's a model called ESM, which
(08:44):
was developed by FAIR, a group at Meta.
They're now their own group.
And there are other models as well for small molecules.
And so we basically accelerated those checkpoints
as much as possible.
We did our NVIDIA thing with them, put them behind an API.
And so they are accessible to users.
(09:05):
There's a Python client.
Individuals, companies developing software
could put those API calls into their software
if they don't want to host the models themselves.
There's load balancing.
There are all these nice scalability features,
so you don't have to deal with that.
There's also a framework that's more for ML researchers
to train and develop their own models.
(09:27):
Very cool.
So the users are both, say, research institutes
and companies as well?
Yeah, I would say the users range from data scientists
to even bench scientists who want to use an API
to researchers who want to build their own stuff.
Right.
So yeah, being at NVIDIA, I wouldn't
(09:50):
say the first thing that I would think of is AI drug discovery.
Well, I guess the AI part.
But I wouldn't put NVIDIA and drug discovery together.
But yeah, so can you tell us why is NVIDIA
interested in doing this type of work?
Yeah, great question.
And I get that a lot.
NVIDIA's objective is not to become a drug discovery
company.
We want to enable those doing drug discovery or developing
(10:13):
software for drug discovery with the best of what
we can offer on GPUs.
NVIDIA is in kind of a unique position
because we make the hardware, and we touch a lot
of the software along the stack building up to BioNeMo.
So we can surface optimizations and the best
(10:33):
of what the hardware has to offer in BioNemo.
And in turn, what users need and ask for, and their pain points,
can get filtered back to us.
And that can influence hardware, software,
you never know, down the road.
So that's really our objective.
Right, that makes a lot of sense.
So, you
(10:54):
talked about it in your PyData talk,
there's this cyclic nature of supply and demand, right,
between the hardware and developing
these sorts of solutions.
So you have this architecture that you
can run things in parallel.
And you can do all of these amazing things.
You can test out all of this stuff.
And you have these sorts of machine learning models
(11:14):
that can create tons of outputs for whatever
tasks you're trying to do.
So yeah, it's a very interesting relationship.
So just talking about, well, let's back up for a second,
AI drug discovery.
Why is that so important?
Why is it something that people should care about?
(11:38):
Why should we invest in it?
Why should people be thinking about it?
Yeah, the drug discovery process is long and expensive.
For a particular class of drugs, small molecules,
it's something like close to $3 billion and 10-ish years.
And there's tons of failures along the way.
(11:58):
It's a very manual process.
And there's certainly a lot of repetitiveness.
So that's a place where you can start to think that machine
learning might come in.
For example, small molecule development,
once we have a candidate, a lead that binds to a protein,
then it's optimized.
(12:20):
And it's a very manual loop: the molecule
is synthesized by a chemist.
It's tested in an assay.
It's evaluated.
This design-make-test cycle goes over and over.
If we can start to help superpower those chemists
by enabling some of that to be done in silico,
then our objective is to make this process faster,
(12:42):
make their lives better.
We can give scientists the power that they
need to work more efficiently.
It must be pretty tricky.
I mean, evaluation is always a tough thing
for any machine learning, really, I guess, any problem.
But when do you know, OK, in silico,
(13:02):
this molecule, this protein
has reached the point where it's time?
Like, let's actually synthesize this thing
and let's start testing it in a real lab,
a physical lab.
Yeah, I can't answer that because it's certainly
different for every application, every chemist, every model.
(13:28):
So, yeah, I think
to use these effectively, you would
have to get a sense of how well the model performs
in a given assay.
And certainly, that's something.
Developing generalized models, like foundation models,
for example, that can then be specialized by users,
that's what's important for NVIDIA.
(13:50):
We can train these powerful models,
but users will specialize them to their particular drug
program, their particular target.
You know, we understand that certainly there's
a lot of specifics that have to go into this,
even after the model's created.
Right, that makes a lot of sense.
So you're more focusing on the tools
and enabling people to do this type of work.
(14:12):
Exactly.
So talking about, and you alluded to it
before, some of the overlap or how you can use NLP to help you
with biology.
Where do you see that there's overlap
and where are there some differences?
Yeah, so, like I said,
natural language is an excellent way to represent biology.
(14:35):
The different moieties in biology, DNA, RNA, proteins,
small molecules, they all have their own language
that can be used.
And that enables us to beg, borrow, and steal
from all of these NLP breakthroughs that have happened.
And that has really helped jumpstart that field.
(14:56):
There are also considerations
from a biological standpoint, which
is that there's a lot of NLP data, of text data, available.
For example, with proteins, we have many more protein
sequences than we do three-dimensional protein structures,
like we discussed earlier.
So in some contexts, you have a lot more data.
(15:18):
So that's very powerful.
I think differences from NLP certainly include the vocabulary.
Many times, biological vocabularies
are tokenized three letters at a time,
or one amino acid at a time, in which case there are 20 amino acids.
(15:39):
So the vocabulary size is certainly less than 50.
And the distribution of the tokens is very different.
So those are ways that biology doesn't necessarily always
follow the same trends as NLP.
We still have to run experiments empirically.
Sure, x, y, and z features improve the model for NLP,
(16:01):
but we need to run the tests for biology as well.
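For readers curious what tokenizing "one amino acid at a time" looks like, here is a minimal sketch. This is illustrative only, not BioNeMo code; the example sequence and the <unk> token are invented for the example.

```python
# Illustrative sketch: per-residue tokenization of a protein sequence.
# The 20 canonical one-letter amino acid codes give a vocabulary far
# smaller than a typical NLP subword vocabulary.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 naturally occurring amino acids

# Map each residue to an integer ID; reserve 0 for an <unk> token.
VOCAB = {"<unk>": 0}
VOCAB.update({aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)})

def tokenize(sequence: str) -> list[int]:
    """Encode a protein sequence as a list of per-residue token IDs."""
    return [VOCAB.get(res, VOCAB["<unk>"]) for res in sequence.upper()]

seq = "MKTAYIAKQR"  # a made-up 10-residue sequence
print(len(VOCAB))   # vocabulary size, well under the ~50 mentioned above
print(tokenize(seq))
```

The same structure works for three-letter (codon-style) tokenization by sliding a window of three characters instead of one.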
Right.
Yeah, it's so interesting to hear
you mention tokenization, because that's
something in natural language processing
that you're always thinking about.
And it's like, what level of analysis
do you want to be thinking of your problem?
How do you want to break down your problem?
(16:22):
And then for any sort of problem,
you're going to have to figure out
what the proper way of tokenizing these things should be.
And that's an open question in biology, I would say.
Yeah, there certainly have been explorations
into other types of tokenization.
But I don't think anything has had sufficient sticking power
(16:45):
yet.
That's interesting.
Yeah, there's always new things coming along in NLP.
For a while, there was subword tokenization.
And then when you're doing sentence tokenization,
well, how should you do the sentence tokenization now
(17:06):
with retrieval augmented generation?
It's like, how should you do the chunking properly?
Because it's always about capturing
the right amount of meaning
and getting the right context.
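One of the simplest chunking strategies alluded to here can be sketched as fixed-size word windows with overlap, so that meaning straddling a boundary still appears intact in at least one chunk. This is a toy illustration, not a recommendation; real pipelines usually chunk on sentence or section boundaries.

```python
# Toy illustration of chunking for retrieval-augmented generation:
# overlapping fixed-size windows of words.

def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks

doc = " ".join(f"word{i}" for i in range(100))
for chunk in chunk_words(doc):
    print(chunk.split()[0], "...", chunk.split()[-1])
```

Tuning `size` and `overlap` is exactly the "how much context" question being discussed: too small and meaning is cut apart, too large and retrieval gets noisy.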
And it's interesting, because you think with language,
oh, we understand language.
We should be able to do this.
It's hard there.
So I can't imagine in biochem, it
(17:30):
must be extremely difficult to understand
what's the smallest level of meaning for these systems.
Right, and you don't know, too.
It may be different for different questions.
This is a fundamental challenge.
If you want to do things with in-context learning,
(17:50):
you can speak to GPT-4 like you would speak to another
human; we know how to communicate.
I don't speak protein.
I don't know anyone else that does.
Do I spell it out in amino acids?
I don't know.
How do I tell it?
So that has led to a lot of ideas
that multimodal models might be a way to go if you actually
(18:12):
want to speak to a model that will design a protein for you
the way you would speak about how you're
going to do an experiment.
So certainly, those are active areas.
And that's a difference as well, I guess, with biology versus NLP.
But yeah, back to the discussion about tokenization,
we don't always know.
Proteins have a whole hierarchy of structure.
(18:34):
There's amino acid sequence.
There's something called secondary structure.
And then there's the tertiary structure, which
is sort of the 3D coordinates.
Maybe you need to break it along secondary structure elements.
And many SentencePiece-type tokenizers
will actually do that a little bit.
I've done some experimentation.
(18:54):
But not always.
And so yeah, what's the right way?
Or do you need to break it along three dimensional domains
for it to be useful?
Maybe.
I don't know.
But then it's not always the same amino acid.
And some of the amino acids are kind of interchangeable.
So then you need to be able to account for that too.
Right.
So I can't even imagine.
(19:15):
It's so tricky.
I mean, I'm just going to keep coming back
to natural language processing, thinking
about different domains.
And you think about a task, say, like sentiment analysis.
And then you think about it in product reviews.
Or you think about it in news articles.
Or you think about it in dialogue.
And those three, yes, they're all sentiment analysis.
(19:37):
But those three are all extremely different tasks.
And you have to do specific things for all of those things.
So I'm sure there's parallels for the work that you're doing,
trying to understand what is the specific context
and what's the specific methodology that
would work for this type of problem.
So it's probably a lot of experimentation
(19:58):
that needs to get done there.
And another thing that I would say is very hard:
we, as speakers of the English language,
have some gauge of whether a sentiment prediction
is close or way off.
We can kind of sniff test that.
It's very hard with proteins.
Same thing, I don't speak protein.
(20:18):
So we don't know.
We have to compare it to existing assays.
We have to be in a situation where one can run assays,
or use these lab-in-the-loop ideas, which is a very smart way
of doing things.
But yeah, that's another challenging part with biology,
that we don't have a great sense of how correct a prediction is.
And there are many predictive tasks, as you alluded to,
(20:42):
that are useful in the drug design process.
Certainly, the value of a foundation model
would be the very rich predictive embeddings
that it produces.
Very cool.
So let's go into it.
What does it take to create, or try to create, or begin
to even think about bio foundation models?
(21:06):
Yeah, well, certainly first and foremost, I
think a data strategy.
The hard lesson learned, I think,
by many data scientists and researchers
is that data is the most important thing.
And I certainly think that's increasingly
true, because model architectures, obviously,
(21:26):
will progress and change.
And there are improvements.
But there are a lot of model architectures
that are starting to be very good.
They're easy to use.
So data can be a huge source of advantage.
So I'm part of a team right now that's
starting to think about building these bio foundation models.
And it's a matter of picking a problem where we think NVIDIA
(21:49):
can uniquely succeed.
NVIDIA is not a drug discovery company,
so we don't generate our own data.
That doesn't mean we can't get it from places.
But we have to think very carefully about that.
And we're also trying to think very carefully about,
like I said, we do have a lot of compute.
So that is a very significant advantage.
(22:12):
But what is the data we want to use?
What is the problem?
We're also thinking
about what the grand challenge problems in biology are
and trying to make sure we position ourselves
to work towards something that is a really fundamentally
difficult problem.
Yeah, that makes sense.
(22:33):
What are some of the potential grand challenge problems?
I mean, I don't know that I should say too much.
But I think these are probably fairly obvious things,
like how do you simulate protein-protein interactions?
How do you simulate parts of a cell, the functionality of a cell?
The way cells work, there isn't one absolute cell model.
There are temporal parts to that.
(22:54):
There are tissue-specific aspects.
All these things are very hard.
Yeah, that makes a lot of sense.
OK, cool.
So trying to create these sorts of things,
you probably need a pretty multidisciplinary team.
I know in your PyData talk, you spoke
about building a team in your previous project.
(23:17):
You went from two to about 30 or 40 people.
Yeah, the broader product team is
right close to 40 people now.
Yeah.
How was that?
What was that like building a team?
I mean, there's so much that goes into it.
(23:38):
There's all these machine learning problems that we face,
but dealing with people is a whole other level.
And people are always the most challenging problem.
I think first and foremost, you've
got to get the culture right.
Team dynamics matter so, so much.
And you have to cultivate an environment where people
(24:03):
feel valued for their expertise.
They feel like they're working on things
that are important, that bring some level of enjoyment
to them, but that also align.
It's always about getting the three buckets
to align: your interests, the product, and the impact.
Getting those things to align is very hard,
but you try to strike a balance.
(24:24):
So hiring, certainly, culture is really important.
Not every role, but for some roles,
we do look for some domain experience
because there are a lot of details and nuances
to scientific data.
Even genomics data is different from protein data,
which is different from cheminformatics data,
(24:45):
and then we usually want some deep learning experience.
One thing that is very true is that the field
is changing so fast.
The things that we, as a team, have had to do
have changed and evolved pretty quickly.
So it's also figuring out how you
can hire people who can adapt and learn new things.
(25:05):
And I mean, layer on top of that the speed at which the field is
changing, which we've discussed, but also
NVIDIA comes out with new GPUs pretty regularly.
In fact, I think it's a very regular cycle,
and maybe even increasing in frequency from what it has been.
So we want to surface the best of those GPUs.
So that means constantly updating what you're doing
(25:27):
or the way you have things implemented
in your software stack.
So benchmarking, lots of benchmarking,
and not just predictive power, but inference and training
time.
So we do a lot of that.
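The "not just predictive power, but inference and training time" kind of benchmarking can be sketched with nothing but the standard library. The "model" below is a stand-in function invented for illustration; the same harness would wrap a real model's forward pass.

```python
# Illustrative sketch: benchmarking inference latency. Warm-up runs are
# excluded because first calls often pay one-time costs (JIT, cache fills).

import statistics
import time

def benchmark(fn, *args, warmup: int = 3, repeats: int = 20) -> dict:
    """Time fn(*args) over several runs and report summary stats in seconds."""
    for _ in range(warmup):
        fn(*args)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
    }

# Stand-in "model": a sum of squares over a list.
stats = benchmark(lambda xs: sum(x * x for x in xs), list(range(10_000)))
print(stats)
```

Reporting the median or minimum alongside the mean matters in practice, since a single slow outlier (GC pause, thermal throttle) can skew an average.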
So those are very different, I think,
from what data scientist roles might be like at other companies.
(25:51):
Yeah.
You really need to have a strong background.
I think it's always so helpful.
Well, like in your case, you worked in the lab
before working on these sorts of problems.
So even though I know you were saying
it's hard to know if it makes sense,
you do have a general sense.
Is that number even remotely right?
(26:13):
Right.
Is it even in the right ballpark?
Yeah.
Which is so important for all of these problems.
Yeah, there's a lot of things.
I guess one thing I'm curious about,
so yeah, you've played different roles, right?
So you've been an AI researcher, and that was more hands on.
(26:35):
And now you've kind of transitioned
into a manager role.
Is there anything that I'm missing in there?
No.
Are those the two?
Oh, those are the two?
Yeah.
Those are the two?
I was a teacher.
I taught a data science boot camp after I left the NIH.
I was basically learning more machine learning
as I was teaching it.
OK.
(26:58):
That's the best way to learn.
Yeah, it is.
By teaching it, because then you're
forced to really know all of it.
And then someone will ask you some question,
and you'll be like, I never really thought of it like that.
Or when you're forced to explain it to someone,
you're like, oh, I see this gap in my knowledge.
Yeah.
Humility.
It's good to have a little bit of humility.
(27:18):
That's also very important on a multidisciplinary team,
is to be open to learning from each other.
Yeah.
Any other traits that you're looking for in a team?
I think, obviously, technical excellence and experience
are important.
But culture and attitude are the really big ones,
(27:42):
because I think they can make or break a team.
Yeah.
A coworker of mine just shared this piece.
It was about looking for people who are hungry, humble,
and smart.
But not smart in the traditional way.
Smart in the sense that you know, when you're
in a meeting, how what you're saying
(28:03):
is going to influence the people in the room, that kind of thing.
Emotional.
Yeah.
EQ.
High EQ.
Yeah.
Some high EQ.
Very hard to find people that check all of those boxes.
It is.
But that's the fun part of everything.
Yeah.
I think I was just going to say, yeah, becoming a manager,
(28:25):
it's really fun to see these extremely talented people grow
and develop in their roles and start to lead things.
That's the part that's really fun about being a manager:
seeing them succeed and cheering them on.
Yeah, absolutely.
I've spoken to people who have done a similar transition
(28:48):
from sort of like an IC.
No one's really an IC, right?
You're always working in a team.
A team, yeah.
But sort of that IC role to the manager role
and this idea of, oh, I'm a 5x, 10x engineer.
But when you go into managing, you
(29:09):
can actually unblock so many people
that you can have a bigger impact on your company,
actually.
So yeah, it's such an interesting thing,
because sometimes you
want to be very focused on what you're doing.
And then you don't want to be distracted
by all of these other things.
(29:30):
But I guess it's just each person
needs to sort of find their own balance between that.
So like, NVIDIA is a very ground-up company,
which is one of the things I love about it.
But then it's helping your mentees
to understand how to prioritize things.
(29:52):
Because it gets to be kind of nuanced:
how to help with a situation
but not get completely sucked into something that probably
isn't a top priority for you.
It's all about the nuances.
And I think there's some of it too,
just acknowledging that the work that we do is hard
and helping them understand that it's expected.
I expect that this is going to take a while to learn
(30:14):
how to do it.
So, I kind of made a face at the 5x, 10x engineer thing.
I don't love that stuff.
I don't know why.
Because what does it mean?
What does it really mean?
Right.
Yeah.
I know.
But people say it.
Yeah.
But I like them.
But yeah, just acknowledging that what we do is hard
and that that's OK.
You're going to have to work at this.
(30:35):
Yeah, for sure.
I think working with other people where I'm at,
we have so many initiatives and projects that are taking place.
And you can't get so in the details on everything.
But having a team that can be a sounding board
and having someone that you can say, OK, this is what I'm up to.
(30:57):
These are the next things I was thinking of.
And then someone can kind of say, well, I
did a project that was similar to this.
And ABC, you'll probably go down this rabbit hole.
So maybe do DEF.
You know?
Yeah.
And it's really nice to, yeah, working in a team,
it's incredible.
I think at one point in my life, I always thought,
(31:19):
oh, I can go very far myself.
But being a part of a startup is where I realized, wow,
you can do unbelievable things.
As a team.
Yeah.
Yeah.
For sure.
Yeah.
So not really transitioning, but just the next question,
just talking about machine learning,
and we can still talk about the same stuff.
(31:41):
What's an important question that you believe
remains unanswered in machine learning?
Yeah.
So I alluded to this a little bit when
I was talking about my postdoc.
So biological and chemical modalities.
So number one, they aren't text, even though we
represent them that way.
They're three-dimensional, but they're also dynamic.
(32:02):
And that motion is very fundamental to the roles
that they play in biology.
To protein-protein interactions,
like proteins binding together, to the binding of a compound
or a drug, a ligand.
Usually a drug
(32:22):
is just a ligand that binds in a specific way
and has some effect on an enzyme.
Those are really fundamental to biology,
and we don't have good representations for them yet.
Even AlphaFold is trained on static structural data,
x-ray crystallographic structures
from a repository called the Protein Data Bank, where,
(32:45):
if you publish a structure in a scientific journal,
you have to deposit the coordinates.
But those are static.
And so they're static. And also,
to crystallize a protein, it
has to pack into an ordered lattice.
So it's whatever conformation of the protein
packed into that lattice.
Many people don't think about that, but they're not moving.
(33:10):
Sometimes protein motions are small,
but sometimes they're really big,
really major conformational changes.
So we don't have a good way of describing that
in machine learning models yet.
We're starting to think about it.
There's ways you can collect data.
For example, molecular dynamics simulation data
(33:32):
can be used for that to some degree,
but we just don't have good models yet for it.
Yeah, and so dynamic here, just to make sure,
you're talking like changing through time, right?
Basically, it's not stationary.
It's actually changing.
Yeah, OK, yeah.
So as much as I'm interested in machine learning
(33:54):
and artificial intelligence, I'm very
interested in natural intelligence also
and how the brain works and everything.
And when I first really took a deep dive into that,
I found that to be the case also:
people looked at things a lot in a static way,
when really it's a dynamic, complex, adaptive, changing system,
(34:17):
and it's not in isolation either, which
is what you're mentioning, how it interacts
with everything else.
Yes, cells are changing.
They're different in different tissues.
So there's physical differences.
There are temporal differences with cells,
like in disease states, the cells
(34:39):
interact with each other.
So it's a really, really complex coupled system
that is associated with biology.
Yeah, so all of this amazing progress
that's happening in your field with AI drug discovery
and natural language processing, really just
like across the board, right?
(34:59):
There's like a new multimodal model that comes out every day.
There's a new technique that comes out every day.
How do you view the gap between the hype of this frenzied
state that the field is in and the reality of AI?
(35:20):
I think so.
Certainly, you have to accept that it's a thing.
And you have to evaluate papers very carefully.
I think you have to speak to researchers in the field,
go to conferences.
Actually, after this, I'm going to go listen to
a bunch of NeurIPS workshops.
(35:40):
And actually, I was laughing because when
I went to NeurIPS in 2018, there was one biology workshop.
And now I think there are like five each day
on Friday and Saturday.
It's crazy.
Talking to researchers, talking
to those in pharma companies that
are in the field, can help you really understand
the flaws in what you have developed, which
(36:01):
is not always fun.
But yeah, I think we have to be very careful.
We have to make sure that we're doing good baselines.
There was a paper published recently
in the single cell genomics field
where they evaluated two transformer models
and found that linear regression did better
than both of those models.
That's taboo.
You're not allowed to say that.
(36:23):
So I think really making sure that we do the right baselines.
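That habit can be sketched in a few lines (the numbers below are invented for illustration): fit the simplest possible baseline first, record its score, and make any bigger model beat it.

```python
# Toy baseline check (hypothetical data): a simple least-squares line
# versus a predict-the-mean baseline. A fancier model should have to
# beat numbers like these before we trust it.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope  # intercept, slope

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]  # roughly y = 1 + 2x
a, b = fit_line(xs, ys)
line_mse = mse([a + b * x for x in xs], ys)
mean_mse = mse([sum(ys) / len(ys)] * len(ys), ys)
print(line_mse < mean_mse)  # the linear baseline easily beats the mean
```

The point is not the model; it is having a recorded floor that every later result gets compared against.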
100%.
Man, the amount of times.
I think that every single person that I've spoken to
for this podcast has spoken about the importance
of getting baselines.
And in the work that I'm doing, it's so true too.
(36:43):
I know I was briefly telling you about this before.
But yeah, like natural language processing,
for some of the simpler tasks, people
are trying to throw the heaviest, monstrous of a model
at problems that can be solved using simple embeddings
and either logistic regression or support vector machine,
(37:04):
or at least get a baseline with it, right?
To at least just see, because maybe you'll find a problem
in your other pipeline just based off
of what you did with the other pipeline
that you created to get a baseline.
If nothing else, the baseline is something
that you can put into an interface or your product
to keep building so that you're not blocked by this model.
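As a concrete toy version of that (the texts and labels here are invented): bag-of-words counts plus a nearest-centroid rule make a perfectly serviceable first text-classification baseline before any heavy model enters the picture.

```python
# Minimal text-classification baseline sketch: bag-of-words counts and
# a nearest-centroid decision rule. Hypothetical toy data throughout.
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def centroid(docs):
    total = Counter()
    for d in docs:
        total.update(bow(d))
    return total

def score(text, cen):
    words = bow(text)
    return sum(words[w] * cen[w] for w in words)

pos = ["great product love it", "really great service"]
neg = ["terrible waste of money", "really terrible product"]
cpos, cneg = centroid(pos), centroid(neg)

def classify(text):
    return "pos" if score(text, cpos) >= score(text, cneg) else "neg"

print(classify("love this great thing"))  # pos
```

If something like this already gets most examples right, that is the floor a large model has to clear, and it can keep a product unblocked in the meantime.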
(37:27):
Because my joke is that, especially deep learning,
it's like an ideal gas.
It will expand to fill whatever space and time you give it.
But it's true.
Even in chem informatics, chem informaticians
will tell you that they've used machine learning
for a very long time.
And the random forest models and some simple fingerprint
(37:50):
features of small molecules are pretty hard to beat.
And they're not wrong.
Yeah, it's true.
Yeah, if you ask some Kagglers, they'll
say XGBoost is all you need.
That's the joke there.
But no, ensemble models, you can really
(38:12):
do a lot of amazing things.
And in a little bit of a different perspective,
I'm thinking about your PyData talk.
You spoke about creating a product where
you were visualizing things that wasn't really
heavy on machine learning and the importance of that.
(38:33):
Do you want to speak to it a little bit?
Sure.
I guess, are you asking sort of generally,
or are you asking about that particular part?
Well, no, just the importance of visualizing things
before you even jump into the machine learning parts.
Yeah.
Oh, yeah, I understood that.
Yeah, understanding your data, like I said,
data is the most important thing.
And getting to know it on a very personal level
(38:55):
is really useful.
And it can give you ideas for different scenarios
that you want to test with the data.
How do you split your data?
So for example, with protein structure prediction,
the data are often split temporally
based on the date at which they were entered
(39:16):
into the protein data bank.
So one hypothesis is that that enables you
to predict future structures.
And that's not a bad hypothesis.
However, having been a structural biologist,
I can tell you that there is almost certainly bound
to be a lot of redundancy in proteins
(39:36):
that are deposited later.
Because if you solve a crystal structure,
even if it's something that's been solved before,
but you do it as part of your paper,
as like a check of something to ensure that you didn't disrupt,
I don't know, the structure or something, which, by the way,
that's the thing that is done.
Then it gets deposited in the data bank.
So if it ends up on the other side of that temporal split,
(39:57):
there's actually a lot of redundancy in the data.
So then you start to think, OK, what
are better ways to do this?
Maybe you cluster them.
But how do you cluster them?
You cluster them by amino acid sequence.
Well, protein constructs can be crystallized.
They can be changed when they're crystallized.
Do you cluster it by three-dimensional similarity?
(40:18):
And then how?
And how do you do the alignment?
I guess a very nuanced problem.
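A minimal sketch of that cluster-then-split idea (toy sequences, and a far cruder identity measure than anything used in practice): group near-duplicate sequences first, then keep each whole cluster on one side of the split so redundancy can't leak across it.

```python
# Hypothetical sketch: split protein records by sequence cluster rather
# than by deposition date, so near-duplicates don't straddle train/test.
def identity(a, b):
    # crude per-position identity; assumes equal-length toy sequences
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.8):
    clusters = []  # each cluster's first member is its representative
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters

seqs = ["MKVLAT", "MKVLAS", "GGHHEE", "MKVLAT"]  # near-duplicates + repeat
clusters = greedy_cluster(seqs)
# assign whole clusters, never individual sequences, to each side
train = [s for c in clusters[:1] for s in c]
held_out = [s for c in clusters[1:] for s in c]
print(len(clusters))  # 2
```

Real pipelines use proper alignment-based identity (and, as discussed here, maybe structural similarity instead), but the splitting principle is the same.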
So getting to know the data and visualizing it,
where does the model do really well and why?
Where does the model really fail and why?
And sometimes it's useful to look at an aggregate plot.
But there were situations, I was working
with someone on the team, where there's like angular rotamers.
(40:42):
It's hard to explain, but there's different angles
that are predicted.
And so we started going through the things,
the ones that were really off, because we
saw this periodicity.
And so we started to understand that one of the,
there was many issues, but I think one of the issues
was probably the software that was measuring the angles
didn't understand some aspects of chemistry.
(41:02):
But when you start to see, the aggregate is useful,
but so are the examples.
Right.
Yeah.
I always, as much as in the work that I'm doing,
we want to automate a lot of it, obviously.
We want to streamline things.
There's always this step in the beginning
when you have a new data set, where you just
visualize it in some way.
(41:23):
Just think about how could you even possibly visualize it.
The way that you're thinking about it is like all
the different, either the states or the different features
that you want to be looking at.
But just even going through that exercise
gives you this like inherent, not inherent,
but like a little bit of a more intuitive understanding
of what's taking place.
(41:44):
And then it'll give you some nice hunches
as you try to figure out even just the problem space itself.
Yeah, that was really cool.
I just have a real love of visualizations.
So a lot of the eye candy in your presentation
really attracted me also.
(42:04):
Yeah.
There's probably some cool videos of protein structures
in there.
And that was really just fun eye candy,
because they're really cool to see.
They are beautiful.
Very beautiful.
When I saw those when I was learning biochemistry
and taking biochemistry in undergrad,
that was the thing that made me really
fall in love with structural biology.
(42:25):
It was like, wow, this is really cool.
Right.
I want to do this.
Yeah, a picture is worth... you can't even
put it into words, right?
Yeah.
It really.
Powerful.
Yeah, it can be very powerful.
So now you've been working on drug discovery using AI
(42:45):
and developing these sorts of tools.
So how have you seen the field change
since you started working in the industry?
I think in general, when I started
working in the industry, data science was still
kind of a young new thing.
(43:06):
And so there were a lot of generalists.
And at least maybe this is biased by having been in my field
too.
I start to see a lot more individuals in the field who
are coming to the field, but they have been trained
in a domain.
And then they picked up the data science
as part of their domain specific education.
(43:28):
And I think some of that just reflects
the way universities are starting
to integrate these computational skills
into their curricula, which I think is very important.
Teaching a compute literate, machine learning literate
(43:48):
student is really important.
So I start to see that.
I think that is very useful.
And demonstrating some level of deep understanding of data
will benefit you, even if you change domains, in my opinion.
Because then at least you have a frame of reference
for the kind of ways that things really went
wrong in the other domain.
(44:10):
So I think that's a big one that I see changing.
I think some of the machine learning tasks
are getting to be a lot easier.
We have libraries.
We have AutoML.
We have things that even start to do hyperparameter tuning
for deep learning models.
It's getting a little easier.
It's always hard if you're at the cutting edge of what's
developed.
(44:30):
But I think that stuff's all changing.
Right.
I think, yeah, at one point, there
weren't that many models.
There probably weren't that many resources.
Now there's too many.
There's so much out there.
(44:52):
There's a lot of libraries.
A lot of them are overlapping.
There's a lot of vendors out there that
are trying to do the same things.
Yeah, there's this whole AutoML movement.
But yeah, that's interesting the way
that people can learn about a particular topic.
(45:13):
You can start to apply these things more easily.
So maybe they can get a better understanding
of the pros and cons of it so they can understand
what they're learning there at a deeper level.
But then you can also take that and apply it
to other fields as well.
So yeah, that's going to be a very interesting trend.
Maybe a loaded question, but how do you
(45:33):
think machine learning will change in the next 10 years?
I guess for you, what impact do you
think it will have on scientific discovery?
I think it will profoundly benefit
the set of scientists who understand
(45:54):
how to incorporate it in their work
and how to critically evaluate it.
It's a question I get a lot from my former colleagues,
for example, in NMR spectroscopy
and structural biology.
I think it will continue to become more
of a part of the field.
So thinking about structures from AlphaFold,
(46:17):
can and how should they be deposited
into the protein data bank?
That's a thing to think about.
It's not my job to solve, but certainly it's
an interesting thing to think about.
What does that mean?
Any different?
I mean, structures are computed from other data.
I'm not sure it's different.
Maybe it is, but how do you represent that?
(46:39):
I think the fields in general, like I sort of alluded to,
needs to find better ways to represent biology.
And right now, there is certainly no one way
to represent it.
I think that's a very open question.
With small molecules, there is language called SMILES.
There's graphs.
Graph models do really well, by the way, with small molecules.
(46:59):
There's three-dimensional representations as well.
Same with protein.
There's a variety of ways.
So I think that will crystallize a bit more.
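To make the SMILES idea concrete (this is only a toy illustration, not how cheminformatics tools actually parse): SMILES is a line notation where atoms are letters, so even a crude pattern match can pull the heavy atoms out of simple, uncharged molecules.

```python
# Toy illustration of SMILES as a line notation for small molecules:
# count heavy atoms in simple SMILES strings. Real work would use a
# cheminformatics library; this only conveys the idea.
import re

def heavy_atoms(smiles):
    # match two-letter elements first (Cl, Br), then common single letters
    return re.findall(r"Cl|Br|[CNOSPFI]", smiles)

print(heavy_atoms("CCO"))           # ethanol -> ['C', 'C', 'O']
print(len(heavy_atoms("CC(=O)O")))  # acetic acid -> 4 heavy atoms
```

The same molecule can equally be represented as a graph (atoms as nodes, bonds as edges) or in 3D coordinates, which is exactly the multiplicity of representations being discussed here.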
Trying to think of what else.
Yeah, I think just really big picture.
I think those who learn how to use these tools
and how to interrogate them will figure out useful ways.
(47:21):
I figure out ways to use GPT-4
in my everyday life all the time.
We have internal NLP models at NVIDIA.
We don't enter proprietary data, of course, in public models.
But using those, playing around with those from time to time
to try to do things.
But you have to kind of experiment.
Like, it's not straightforward.
(47:42):
So you actually have to devote time
to figuring out how to do that.
I think, at least.
No, definitely.
And for technologists or for innovators,
having these tools then allows you to do so much more.
I mean, that's why it's so incredible.
(48:02):
I don't know.
The music industry is benefiting from all of this stuff.
I mean, The Beatles released a new song, right?
Like, that's the crazy.
That's nuts.
Because they were able to create a way of segmenting
John Lennon's voice from the TV in this old track
that they had.
And honestly, I listen to that track too often.
(48:26):
It's haunting for some reason.
Yeah, I don't know what the analogous thing is there
for science.
We'll have to see.
I mean, there are models now where
dynamics can be predicted.
So it's literally a model that does this,
that predicts the next frames of the simulation.
(48:49):
They've got a ways to go yet before they're
simulating proteins and simulating proteins
on the full time scale.
But maybe that's the analogy of what you're thinking of.
I'm not really sure.
Yeah.
I'm just saying how cool it is.
Yeah, yeah, yeah.
And just what becomes possible.
And then I just think about this common story
(49:13):
that they say, it's in a book, Prediction Machines.
Like when new technology comes out,
they talk about accountants.
And when Excel came out, there was this new spreadsheet tool
basically.
And then people were saying, oh, all accountants
are going to lose their jobs.
There's not going to be any accountants.
But all it did was actually create so much more,
(49:34):
like a richer practice.
People could then apply more human thinking to it.
And they were able to do more art of finance
and create more sophisticated models.
And yeah, in some sense, it allows
you to do a higher level work.
It does sometimes make things more complex
(49:55):
and conflate things as well.
So it'll be interesting to see over, say, the next decade
what's just kind of hype and what are actually
tools that are enabling us to do these incredible things when
it comes to drug discovery, being able to try and create
(50:15):
all of these different things that can help us
fight different diseases, help us deal with,
create new medicines.
I mean, it's so exciting from the outside looking in.
It's such exciting work.
And knowing, thinking about what
(50:36):
the process is of trying to get a drug from idea to market
and how long that is, and if there's
anything that can be done to expedite that in a safe way,
is really pretty awesome.
It's probably the coolest thing to be working on.
(50:58):
Thanks.
And I mean, it's tremendously important.
We certainly don't wish another pandemic on the world.
But I think it's pretty likely it's
going to happen again, unfortunately.
So how do we be ready and think about the next one?
And how do we have the tools in place
so that we don't have to scale up vaccine manufacturing as much
(51:23):
as we did this time, right?
So yeah.
Yeah.
Yeah, obviously, we don't want anything like that to happen.
But it's sort of inevitable that there will
be something along that level.
But knowing that there are teams and companies and research
institutes that have the tools that can enable them to quickly
(51:47):
combat that sort of stuff is a little reassuring.
There's still the human element.
People have to actually do the work.
I know.
But at least you can just give people the tools
to do the best they can.
So switching gears into the learning from machine learning
(52:10):
aspect, we'll get into some advice questions,
everyone's favorite type of question.
For people who are just starting out in the field thinking,
hey, I want to be a data scientist,
or that they're doing some biochem stuff,
what's advice that you would give to people that are just
(52:31):
starting out?
I think this one's probably always data, data, data.
It's always about the data.
I think the cool thing with machine learning,
it's a very open community.
There are folks who participate in Kaggle competitions
or something if you want to get to know a particular domain
better.
There are active discussions there on those competitions.
(52:54):
You can learn more about the field that way.
That's something that's relatively accessible to all people.
Yeah, I think that's the biggest thing is data.
I think in this field, you need to figure out
what works for you as a way to continue to learn.
And that isn't just reading papers,
(53:15):
or maybe it's testing a few new tools as they come out,
these new visualization and interpretation tools
as you were alluding to.
Maybe it's that.
Maybe it's continuing to refine your software development
skills.
Maybe it's reading papers.
It depends what you're doing.
But figuring out a way to do that is important.
(53:36):
I think it's harder as you get older.
I mean, I was completely self-taught with machine
learning.
But I used to get up at 5 AM on Saturday
and do all my machine learning courses and my homework
for the week.
And man, I can't even.
It was hard.
Yeah.
Yeah.
And Andrew Ng's class.
(53:57):
Yeah.
Oh, yeah.
That stuff, Geoff Hinton's class.
Yeah.
That was how I did it.
And then, yeah, certainly I've kept learning since then.
The field has changed profoundly since then even.
Yeah.
That's a trait that I find, I guess,
in data scientists and software engineers,
the really good ones, it's just they're capable.
(54:20):
But then it's just this idea of just continuous learning.
I'm going to continue to learn the newest things out there.
But then they have-
Tools, VS Code, whatever it is.
Master the tools.
But it's not mastering all your tools.
It's mastering the tools that matter.
And that's the hard part.
And it's the ability to master a tool.
(54:43):
And I think that's like school.
It's not like everything that you learn in school
you're actually applying in your job.
But for someone in a position like you,
you can consume research papers probably better than 99.9%
of the world at this point.
And being able to take the pieces of it
(55:05):
that are applicable for you.
But yeah, in this field, it's important to understand,
yeah, which tools are important and how quickly can I
get onboarded onto that tool.
Right, right.
Yeah, a little variation of the last one.
What advice would you give yourself, I guess,
(55:27):
earlier in your career?
Yeah, I was certainly when I was doing this transition,
I was very intimidated.
It's a big thing to sort of make a switch from something
I'd spent at that time, I don't know, 11, 12 years of my life
studying this.
And certainly, I still get to work
(55:47):
at the intersection of science.
But it's a very different type of job.
I love it.
But it was very intimidating and stressful at the time
to make that jump.
Right, right.
So what would it be?
Just you're going to make it through?
It's going to be OK?
Or don't be intimidated?
(56:08):
Don't be intimidated.
And I think keep your ear to the ground about things like this.
Don't be so focused on your, it's
important to focus on your domain, your world
as a structural biologist, as an NMR spectroscopist as I was.
But it's also important to be aware of other trends
(56:29):
and think about it.
Because I think the people that maybe are earlier on
in the field figured out that, oh, yeah, this is a thing.
I should go learn about this.
There's this whole revolution going on outside
of my purview of my day to day.
Maybe I should at least try to understand it.
Right.
Yeah, keep a pulse on the progress that's
(56:53):
taking place in other fields.
Yeah.
That's something that I learn.
Well, I mean, I guess I've always sort of tried to do it.
Just my interests have been so varied
that I've been able to do it.
But recently, I listened to Jeremy Howard from fast.ai,
just really hammering that point home.
(57:14):
If you're interested in natural language processing,
it's OK to learn something about computer vision.
If you're interested in computer vision,
you can do activity recognition.
Whatever it is, there'll be some thing
that you learn in signal processing that will help you.
Because when it comes down to it,
it's representing things numerically.
And it's doing manipulations to it.
(57:35):
It's pattern matching.
Right.
Yep.
So there's always a way.
Oh, yeah, sorry.
That's why I said that I think there
is a lot of value in having domain expertise,
and knowing and understanding something deeply what works
and what doesn't work.
And I don't actually believe that it prevents you
from switching domains later on if you want.
(57:57):
I think it is an asset because you
have dealt with some very fundamental problems
with data and modeling.
And you will see them apply, many of them in different ways,
perhaps, in that other domain.
But I think it teaches you what it takes to really interrogate
the data and build a good model.
(58:18):
Absolutely.
Something that's interesting, I know
that you work in the field with so many things going on.
But who are some people in the field that influence
you and your work?
I would say certainly my colleagues at NVIDIA.
(58:40):
They are tremendously talented and make me rethink and think
hard about problems every day and sometimes
help me when I'm stuck.
Sometimes I help them when I'm stuck.
I feel very, very fortunate to work
at such an incredible place.
I would say my postdoctoral advisor, Art Palmer,
(59:00):
and my grad school advisor, Patrick Loria and Scott Strobel,
they taught me to think very carefully and critically
about what I do and how to think about problems
that are very empirical.
And it actually has a lot of applications
in machine learning because it's a very empirical field
in many ways.
(59:22):
Certainly in the domain that I'm specifically in,
I think Alex Rives from EvolutionaryScale,
they've built some of the best, very, very well-known protein
representation models out there.
Deborah Marks, who's a professor at Harvard,
has done amazing things in machine learning
and development of some very fundamental techniques
(59:43):
with protein sequences, as well as does experiments
to back up the work.
So I think anyone who does both wet lab experiments
and machine learning, I think, gets a special place
of recognition from me, at least,
because I believe that's very hard to do.
And it really, really informs the work you do.
(01:00:04):
Yeah, absolutely.
Yeah, that makes a lot of sense.
And being on learning from machine learning,
I get to ask this question.
What has a career in machine learning taught you about life?
Yeah, I would say, so the importance of continuing,
of continually learning, whatever it means in your life,
(01:00:25):
whether it's, I learned how to snow ski two years ago.
And I wish I'd learned when I was young and made of rubber.
I'm not afraid to bite it, but I'm
a lot more afraid of tearing an ACL as an adult.
But it's fun.
And so this Christmas, we're going skiing.
And I'm excited to eventually get to the point
(01:00:46):
where I'm able to ski on blues pretty regularly.
My husband is an excellent skier.
So at some point, I want to get there.
That's my goal.
Nice.
I think how to debug complex empirical problems
and how to think rationally about, like, how do I try this?
And how do I debug this quickly?
(01:01:06):
Because it's not always as easy as you might think to solve it.
And then I think, I never really thought about classification
problems that much as a scientist.
So it really makes you think about,
there's many ways of being wrong.
And sometimes, some of those ways
matter a lot more than others.
(01:01:27):
So really understand what being wrong means
and what you need to optimize for.
It's not always accuracy, for example.
Sometimes false positives and false negatives,
they don't always carry the same weight.
Yeah, you beat me to it.
I was going to say, everyone wants
to know the accuracy of the model.
And I try to say, all errors are not equal.
(01:01:51):
They still want to just know the accuracy.
They don't care.
I try to explain it to them.
But then if you bring out a confusion matrix,
it gets a little too confusing.
But no, it's important the takeaway of just like,
all errors aren't the same.
And some matter much more than others.
That's a really good takeaway.
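A small sketch of "all errors are not equal" (the labels and cost weights below are made up for illustration): two classifiers with identical accuracy can have very different costs once false negatives are weighted more heavily than false positives.

```python
# Toy confusion-matrix example: same accuracy, different weighted cost.
# Data and weights are hypothetical.
def confusion(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def cost(y_true, y_pred, fn_weight=5, fp_weight=1):
    # e.g. missing a disease (FN) hurts more than a false alarm (FP)
    _, _, fp, fn = confusion(y_true, y_pred)
    return fn * fn_weight + fp * fp_weight

y_true  = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 1]  # one FN, one FP -> 6/8 correct
model_b = [0, 0, 1, 0, 0, 0, 0, 0]  # two FN, zero FP -> 6/8 correct

print(cost(y_true, model_a))  # 6
print(cost(y_true, model_b))  # 10 -- same accuracy, worse cost
```

This is exactly why reporting accuracy alone hides the errors that matter most for the application.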
(01:02:13):
And then just about the debugging complex empirical
problems.
So what is it?
What makes it so hard?
Just all the moving parts?
Sometimes it could take a long time.
If you have a problem, for example, with,
I'm just going to pick something, training
where it starts to throw nans.
But it doesn't happen until pretty far into it.
How can you figure out what the problem is
while minimizing the time it takes to solve the problem?
(01:02:39):
Are there things you can scale down?
Can you figure out if it's a particular data point?
Can you figure out if, like just learning
how to debug that in an efficient way is,
it can be really tricky.
I don't know.
I certainly think it's hard.
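The "scale it down" instinct can be sketched like this (the data is a stand-in): rather than rerunning a long training job, scan the inputs for the first batch containing a non-finite value and shrink the reproduction to that.

```python
# Hedged sketch of narrowing a NaN problem: find the first batch with a
# non-finite value instead of rerunning the full training job.
import math

def first_bad_batch(batches):
    for i, batch in enumerate(batches):
        if any(not math.isfinite(x) for x in batch):
            return i
    return None  # all values finite

batches = [[0.1, 0.2], [0.3, 0.4], [float("inf"), 0.5], [0.6, 0.7]]
print(first_bad_batch(batches))  # 2
```

The same idea extends to bisecting over model components or training steps; the goal is always a small, fast reproduction of the failure.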
No, absolutely.
Absolutely.
And being able to approach those problems
(01:03:00):
in a cool, calm, and collected way, you will get, well,
most of the time, you'll be able to figure it out.
Yeah.
Sometimes I get CUDA init errors,
and then I don't like those.
Oh, no.
Because I'm like, oh, boy.
Oh, boy.
Yeah, well, then you have to go ask someone.
(01:03:20):
This is going to be a while.
Yeah.
CUDA alignment errors.
Those are not my friends.
No.
Or those out of memory errors.
And it's like, now?
This is when you're going to throw me that one?
OK, fine.
Yeah.
Yes.
Wow, Michelle, it's been such a pleasure.
This was so cool to dive into this area.
(01:03:43):
Thanks.
It was my pleasure, too.
Yeah.
Understanding how people even begin
to approach how to do drug discovery, leveraging AI,
it's such a fascinating field.
I think I'm going to use this as motivation
to continue to learn more about this.
For people that want to learn more about you and your work,
(01:04:06):
where would be a good place to go?
Probably my website.
My papers, talks end up there.
I do need to update it.
That's a holiday project.
That is, to go add a few more things,
including my PyData talk.
But that's probably the best place.
I have Twitter and Mastodon.
Well, I guess X these days.
(01:04:28):
I don't tweet as much these days.
Maybe that will change.
I'm @modernscientist on Twitter.
But whatever accounts I'm using at the moment
will be linked to from my web page.
So it's michellelgill.com.
Cool.
And I can add some of those to the show notes as well.
And I encourage listeners to definitely check out
your most recent keynote at PyData NYC.
(01:04:50):
I'll add that as well.
What an amazing talk.
Michelle, thank you so much.
I really appreciate you giving me the time
and letting me pick your brain for a bit.
Yeah, it was all my pleasure.
Thank you.
Thanks for the invitation.
Thank you.
Thank you.
On this episode of Learning from Machine Learning,
(01:05:13):
Dr. Michelle Gill shared her incredible journey
from wet lab biochemist to driving cutting edge
AI at NVIDIA.
Her work helps address one of health care's biggest
challenges by enabling researchers
to do drug discovery, both faster and better.
Michelle discussed the critical need for better machine
learning representations for biological structures
(01:05:36):
and her insights on leading a multidisciplinary team.
Thank you for listening.
And remember to subscribe and share
with your friends and colleagues.
Until next time, keep on learning.