Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Stephen Wilson (00:06):
Welcome to Episode 27 of the Language Neuroscience Podcast. I'm Stephen Wilson, and I'm a neuroscientist at the University of Queensland in Brisbane, Australia. My guest today is Jean-Rémi King. Jean-Rémi is a CNRS researcher at École Normale Supérieure, currently detached
(00:26):
to Meta AI, where he leads the Brain and AI team. He's been doing some really groundbreaking work using large language models and deep learning to investigate the neural basis of language. Today, he joins me from Marseille, France, and we're going to talk about three recent papers from his lab. Okay, let's get to it. Hi JR. How are you today?
Jean-Rémi King (00:42):
Hi, I'm good.
Thank you very much for having me.
Stephen Wilson (00:45):
Well, thank you
very much for joining me and
it's morning time in Paris.
Right?
Jean-Rémi King (00:51):
Right. I'm actually in Marseille. I live in Marseille although my work is in Paris.
Stephen Wilson (00:56):
Oh, really? Do you? Oh, okay. So that'll be very convenient for you to come to SNL later this year, then, won't it?
Jean-Rémi King (00:57):
Yeah. Absolutely.
Stephen Wilson (01:03):
Okay.
Jean-Rémi King (01:04):
I won't need a hotel.
Stephen Wilson (01:06):
All right. So,
are we going to enjoy visiting
your, your town?
Jean-Rémi King (01:12):
Sure, yeah. I
think it's a beautiful town.
It's by the sea. The city is very nice, there are a lot of museums, a lot of scientific groups that are definitely related to language and the neurobiology of language. So I think it's a great city to do SNL.
Stephen Wilson (01:30):
Yeah, I'm really
excited. I'm really looking
forward to it. So, I usually start the podcast by asking people about their childhood and background. But with you, I want to ask something else first, which is, so we have a mutual colleague, Anna Kasdan, and she's told me lots of stories about you, but my favorite one is that you installed Linux on her computer. (Laughter) She couldn't tolerate it. Is this true?
Jean-Rémi King (01:55):
I don't actually
recall specifically, (Laughter)
but that looks, yeah, that sounds completely possible. Yup.
Stephen Wilson (02:01):
Yeah, so, you're
a Linux guy, she said. But I know that you're doing the recording on a Mac right now.
Jean-Rémi King (02:07):
Yeah. So I'm
currently working at Meta, where
Linux was not an option. It was either Windows or Mac. So I had to go for Mac, which is based on Unix systems. So it's a bit easier to accommodate from a Linux background.
Stephen Wilson (02:21):
Okay, so you've
sadly had to move away from it
huh?
Jean-Rémi King (02:25):
Absolutely.
Stephen Wilson (02:26):
Yeah. Okay, so
then getting back to, how did you become the kind of scientist that you are? Like, what were your childhood interests?
Jean-Rémi King (02:36):
Oh, wow! That’s
a, that's a big question. I
don't entirely know. I think it's a lot of different factors that are involved. I was originally interested in AI when I was quite young. And it started, I think, when I was playing with Legos; there was the Lego
(02:58):
Mindstorms at the time that you could program. It was a very rudimentary sort of programming that you could do. But that made me interested in the topic and in programming, and I did a first internship in an AI lab in 2000, I think, if I
(03:20):
recall correctly, and then I continued in this domain during my undergrad where I did AI and cognitive science. And around that time AI was not really, let's say, working, and so the advice from my professors at the time was to sort of try to do
(03:42):
something else, maybe something with a real future. And so it was….
Stephen Wilson (03:47):
What years would
that have been?
Yeah.
Jean-Rémi King (03:48):
So it was around 2007, I think, at the end of my undergrad, and so I moved to Computational Neuroscience. And that's after I decided to do a gap year. So I did two masters, in Brain and Mind Sciences, between UCL and Paris, at Ecole Normale Supérieure and UPMC. And
(04:09):
then I continued in Paris, did a PhD in neuroimaging, to try to decode brain activity from healthy participants and from patients who suffer from disorders of consciousness. And after that, I moved to language in New York, for a postdoc in David Poeppel's lab. And then I joined Meta: I got a position at Ecole Normale as a CNRS researcher and now I'm
(04:39):
detached to Meta AI, which is a fundamental AI research lab that Meta has.
Stephen Wilson (04:54):
Yeah, so most of
our listeners will probably know Meta as the parent company of Facebook, and I don't know, I mean, I'll ask you about it, but I don't know whether Facebook or Meta has more rules on talking about things than a university would. So, you know, feel free to share what you can. But, you know, what is their long term, do
(05:15):
they sort of have a long term interest in supporting basic research like this, do they see it as being, like, central to their future?
Jean-Rémi King (05:23):
I think Meta
like any big company is very conscious of the potential of AI and the pressing necessity to be at the cutting edge of the research, because things are moving extremely quickly. And they gave, I think, a lot of opportunity to scientists like Geoff Hinton, in the case of
(05:46):
Google, or Yann LeCun, in the case of Facebook and now Meta, to build the lab in a way which would work, and both of them in this particular case went for basically the principles of academia. So, the general idea is to say that it's very difficult to know what's going to work, it's even harder for,
(06:07):
let's say, the hierarchy to know whether a researcher is doing something good or not. And so the best way to evaluate the progress in research is basically to go through peer review, anonymized peer review and publication. And so I think they really managed to convince these big tech companies to follow these principles. And so in this case, the long term future is very hard
(06:35):
to know. And I think no one really knows how to position themselves. There are a lot of questions, it can be a case by case issue. But what's clear is that you need to have sort of the top researchers within your company to be able to compete and to develop the algorithms that will work tomorrow for
(06:56):
their use case. But researchers want to work on general, general principles. So…
Stephen Wilson (07:04):
Yeah, because
the, you know, the fascinating stuff you're doing that we're going to talk about in a moment, like relating these large language models to the brain, I mean, it's not immediately obvious how that gets built into a Facebook app, right? So they're willing to kind of give you free rein on doing what you think, what you want to work on, and they'll see down
(07:24):
the track, like, what it evolves into? Is that kind of the philosophy?
Jean-Rémi King (07:28):
Yeah,
absolutely. I think the philosophy is to hire really good people and then to consider that they will make the choices that are the best ones for their fields. There are some projects which are directly applicable, let's say, to products. In the case of
(07:50):
computer vision, this is quite clear: if you want to filter hateful content, and let's say pornography, on Facebook, you need to have an algorithm that can recognize the content of images. And so those who work on fundamental research, in the case of vision, have a direct impact, even
(08:10):
though they don't necessarily work with actual Facebook or Instagram content; the path is much clearer. And at the other end of the spectrum, there are researchers who are really sort of distant from any application, and the goal is really to try to understand the principles that allow a system to become able to learn much more efficiently. And so I would
(08:34):
be much more at this kind of other end of the spectrum. Yes.
Stephen Wilson (08:41):
Yeah, that's
really interesting. And do you interact much with other AI folks at Meta who are doing more applied things that have sort of nearer term applications?
Jean-Rémi King (08:54):
So, the lab is
quite horizontal. So, we do have meetings together, we have regular moments, or occasions, to discuss. We try sometimes to bridge projects. So for instance, for those who work on language models, we try to discuss what kind of architecture we think is more
(09:18):
relevant to try to learn language at scale. And so this engages a conversation. And similarly in the case of vision, we have an ongoing project with a group that works on DINOv2, which is a self-supervised learning algorithm trained to recognize, let's say, structures from
(09:40):
natural images without supervision, and we have an ongoing discussion on how we can use neuroscience to try to improve or evaluate these models, which can be very difficult to do. So we have interactions, but more, I think, at the ideas level and sometimes, yes, some
(10:02):
coding projects that we'll share together. But generally speaking it's more an intellectual level of collaboration than a very, let's say, product-oriented collaboration.
Stephen Wilson (10:17):
Yeah, but I
think that's a very important level of interaction too. And, you know, there's a history of this, right? Like Bell Labs in the US, you know, developed, I think, information theory came out of, I think Shannon was working there, you know, but they definitely produced a lot of stuff that ended up being pretty core cognitive science. And it wasn't, just like what you're saying, it wasn't being done in support of,
(10:38):
like, we're gonna put this in tomorrow.
Jean-Rémi King (10:41):
Absolutely. Yann
LeCun actually worked at Bell Labs and I think his philosophy of how to organize research in the private sector is heavily influenced by this. Bell Labs, I think, had four or five Nobel Prizes before they closed down. So they really had a major impact, and the core way
(11:04):
of, the core organization was really to let researchers do whatever they wanted, and whatever they thought would be impactful. So I think, yeah, this is sort of what Yann LeCun managed to instill within this company. I think it was kind of the same for Google, and other companies made many
(11:25):
different choices; Amazon and Microsoft work slightly differently.
Stephen Wilson (11:30):
And how did you
land this job? Did they come for
you? Or was there a posting and you were like, I'm gonna apply
to that?
Jean-Rémi King (11:37):
Yeah, they
reached out to me, actually, back in 2018. I was surprised. I mean, like you, if I understand your question correctly, I was wondering whether I had a direct utility or relevance for their goal. And
(11:58):
really, I think, the big argument that convinced me to join them was the fact that they were working on a really open source approach. So they were publishing papers, they were releasing code, releasing models. And I thought that this was a healthy path to create common good and to
(12:19):
continue to do good research. And then once I joined the lab, I was very impressed by the level of the researchers there. It's really a top AI lab, the conversations are always extremely useful and with a lot of intuition. Their
(12:42):
trials don't always work, let's say, and you learn a lot from those failed attempts. So yeah, this environment was very fulfilling, in a sense.
Stephen Wilson (12:53):
Yeah, that's
great. It's so interesting to talk to you about this, because, you know, most of my guests are people in academia and that's the work environment that I'm familiar with. So it's kind of, you know, just neat to hear about what it's like for you. So yeah, let's talk about some of these papers that we plan to talk about. These are a
(13:17):
few recent papers that you've published with some of your students, including Charlotte Caucheteux. Is my pronunciation acceptable within the bounds of my accent? And yeah, we will talk first about a paper called ‘Toward a realistic model of speech processing in the brain with
(13:39):
self-supervised learning’, published in NeurIPS 2022 by Millet, probably Millet (different pronunciation), I'm guessing, should be French.
Jean-Rémi King (13:49):
Yeah, Juliette
Millet.
Stephen Wilson (13:50):
And Charlotte
Caucheteux and yourself as senior author. And we want to start with this one, because this is one of the first papers from your group in which you kind of establish these correspondences between large language models and neural activity, right?
Jean-Rémi King (14:10):
Yes. So, there
were papers before this one that showed some similarities between deep nets and the brain. So perhaps I'm going to backtrack just a little bit.
Stephen Wilson (14:23):
Sure.
Jean-Rémi King (14:24):
So maybe, maybe
just for the anecdote: when I was a student back in the day, and I think this idea continued to be true for a while, the notion of an artificial neural network was really considered to be metaphorical. It's like we say, okay, we speak about artificial neural networks, but this is just a loose analogy, this has nothing to do with what the brain does. These artificial
(14:46):
neurons are just sort of computational units that were kind of inspired from neuroscience, but really, they didn't work in the same way. And I think this has switched or pivoted radically in the field around 2014, when several labs, especially coming from vision, started to compare deep nets to
(15:10):
brain activity. So yeah, the labs of Nikolaus Kriegeskorte, Marcel van Gerven, James DiCarlo, and Bertrand Thirion pretty much all simultaneously compared brain responses to images to the activations of AlexNet, which was sort of one of the landmarks in computer vision
(15:33):
models, and thereafter VGG-19, which is another computer vision model. And what they showed is that, with some fancy linear algebra, either based on so-called RSA or linear mapping, you can find similar types of activations in the brain and in the deep nets. So if you present
(15:57):
an image to the algorithm, the algorithm sort of combines the pixels together and creates new activations in order to identify whether there is a cat or a dog in the image. And when you present the same image to participants, and you measure them with fMRI, or in the case of monkey electrophysiology you record the spiking activity, you can see that, basically, you
(16:19):
can find biological neurons or voxels which respond to different images similarly to the artificial neurons in the deep nets. And there was a lot of friction at the time, but I think people started to understand that perhaps these deep nets, which are algorithms, may transform visual
(16:41):
inputs, to some extent, in the same way as a brain. And so we should not think of those as just a metaphorical model, but perhaps we can actually start to think of those as useful models for the neuroscience of vision. And over the years many other fields tried a similar idea in
(17:04):
other domains: in the domain of spatial navigation for hippocampal place cells and grid cells, in the case of motor control, in the case of auditory processing, and in the case of language and speech. And so we sort of fit, I think, into this general tendency of a systematic comparison
(17:26):
between deep learning algorithms and the brain, to try to see whether indeed these algorithms generate representations, activate themselves similarly to the brain, in response to the same sentences, in response to the same sounds. And so that was sort of the starting point. But one of the motivations
(17:49):
behind the work was to insist on some potential differences. And in the case of language, one of the key differences that is very quickly obvious is that, first, language models work in the text domain. So the input is already sort of a word; it's not quite a word, it's a token. So you can think of this
(18:10):
as a morpheme, really, a subword. And so that's the first difference. And the second difference is that they get trained with just a gigantic amount of data; if you train them with small amounts of data, they just perform extremely poorly. And so in this particular work that we've done with Juliette Millet and Charlotte Caucheteux, we were trying
(18:32):
to test whether we could go towards a more biologically plausible architecture that's trained with the raw audio waveform, and with a sensible amount of data. And for this, we focused on an algorithm that was developed at Meta, actually, by Alexei Baevski, who used to be a
(18:55):
colleague of mine, and his group. And it's called Wav2Vec 2.0; it's an algorithm whose input is a waveform. And it tries to do two things: it tries to predict missing bits of sound, a bit similar to a language model where you try to predict the next bit given the context. But it also tries, and
(19:18):
that's what makes the whole thing much more complicated, it also tries to learn what should be predicted in the first place. So you have this sort of dual goal in the algorithm: you need to learn to predict and you need to learn what should be predicted. And it's this dual goal which is sort of very hard to optimize. And so yeah, so we thought, okay, maybe that's a
(19:39):
plausible candidate, because now we can train an algorithm with a raw speech waveform without supervision. And in this case we trained the algorithm with 600 hours of speech data, which is very roughly about a year of exposure to speech for a human being.
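For readers who want to poke at this kind of model themselves, here is a minimal sketch of loading a pretrained Wav2Vec 2.0 checkpoint and pulling out its layer-by-layer activations for a speech clip, using the Hugging Face transformers library. The public checkpoint name and the dummy waveform below are stand-ins, not the specific 600-hour models trained for the paper.

```python
# Sketch: extract per-layer Wav2Vec 2.0 activations for a waveform.
# Assumes: pip install torch transformers. The checkpoint below is a public
# stand-in, not the specific models trained on 600 hours in the paper.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# One second of dummy audio at 16 kHz (replace with a real speech clip).
waveform = torch.randn(16000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple: one (batch, frames, features) tensor
# per transformer layer, plus the initial feature projection.
for i, layer in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(layer.shape)}")
```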
Stephen Wilson (20:00):
Well, it depends
on how talkative the parents
are. Right?
Jean-Rémi King (20:02):
Exactly. It
depends. It depends on other
things, teenage …
Stephen Wilson (20:06):
Ok, yeah.
Jean-Rémi King (20:07):
Yeah,
absolutely.
Stephen Wilson (20:09):
So yeah, so I'm
seeing, like, there's a couple of different innovative aspects. Like, you know, you're doing it on the raw audio signal rather than on the tokenized transcribed language. You're doing it with a smaller amount of training data rather than training the large language model on, like, the entire internet and all the history of all human thought. Those I understand, and then the one that I don't fully
(20:31):
understand from reading your paper is the self-supervised aspect, which you just kind of alluded to there. So I don't understand what that means, for it to be learning what it needs to learn. I was wondering if you'd be able to explain that?
Jean-Rémi King (20:47):
Yes. Let me try
to unpack this. So, perhaps the best thing to do is to start with what it is not, right? Language models like GPT, they are unsupervised, right? You don't need to have a human labeler that says, this is what
(21:07):
you should do for this sentence, this is what you should do for that sentence. The way this works is to try to predict missing bits of the data. In the case of GPT, the missing bit is always the last word given the context. So it's basically trying to do autocomplete, next word prediction. And it's without supervision, because you can just crawl Wikipedia or the
(21:28):
entire internet to try to predict what is the next word given the 2000 preceding tokens or 2000 preceding words. And so that's unsupervised, but what should be predicted here is determined by the experimenter. We ask the algorithm to predict at the word level, or in this case the subword level, but it's a fixed level of representation. What we don't ask the algorithm to
(21:51):
do is to predict, for instance, the next idea, right? Or the next narrative structure, right? We ask it a very concrete and well defined goal, which is: what is the next word? And the reason why we do this is because it's very well defined; we actually know the ground truth.
(22:13):
So if we go and check, we can say the next word is actually X or Y, and not Z, as you predicted. So for instance, if you have 'once upon a' that starts a sentence, you ask the algorithm to make a prediction: what is going to be the next word, is it going to be table, is it going to be dog, is
(22:34):
it going to be time? And the algorithm has to guess that it's more likely to be time than dog, because 'once upon a dog' is unlikely given the corpus with which it has been trained. So it's well defined. But as soon as you go to other modalities, and that's the case for vision, it's the case for audio, you
(22:54):
realize that this approach is not practical. So in the case of vision, if you try to predict the next pixel given all preceding pixels, at the beginning you do quite well. So if you have, let's say, the first half of your image, and the beginning is a zebra, what the algorithm will try to learn is to predict: okay, this is a black stripe, so I'm going to
(23:16):
try to continue the black stripe, and then it should probably be a white stripe, so I'm gonna go white. But then it becomes non-determined. And so what it's going to try to do is basically to predict something which is half of the time black, half of the time white, and so it's going to predict, basically, gray, which is the wrong prediction. And the
(23:37):
reason for this is because what it should be doing is to try to predict a high level feature, not defined in a deterministic fashion at the pixel level, but determined at a higher level, which is, in this case, the notion of texture or stripe. And so in the case of vision,
(23:58):
people, I think, have understood for a while that forcing the algorithm to make predictions at the level of the inputs is not practical. Same for audio: if you try to predict the next amplitude of the waveform, which is sampled at 44 kilohertz, that's going to be very, very
(24:19):
painful, because it takes a lot of compute just to predict every single sub-millisecond. And so what is being done these days, this is one possible path which I think is promising, is this idea of self supervision. So in the case of self supervision, you also learn the level of
(24:41):
representation which has a chance to be predicted accurately. So to take the example of the zebra and the stripe, basically, you ask the algorithm to find a representation such that you can predict accurately what's going to be in the next 10, 100, 1000 pixels. Right. And so in the
(25:05):
case of speech here, that's precisely what happens. It's a deep net for which you input the audio waveform, the audio waveform is transformed, and then at some level it generates a categorical representation, a quantized representation. And then the deep net continues with a transformer, and the goal of the
(25:26):
transformer is to predict this middle representation that it learns in the first place. And there are trivial solutions to this problem, which are, for instance, predicting constant values or just predicting zeros all the time. And so you have tricks to try to avoid this collapse. And those tricks are basically
(25:48):
contrastive learning tricks, where you try to make a prediction such that, if you have several elements in your batch, you would find the right prediction amongst the different elements, which prevents predicting the same thing all the time. Perhaps this is going too much into the details, but the basic idea is that the standard, let's say,
(26:13):
autoregressive and VAE models, they are evaluated at the end of the day at a fixed level of representation which is determined by the experimenter, whereas the self supervised learning algorithms, they have to not only learn to predict, but they also have to learn what level of representations are likely to be predicted. So it's a dual problem, which is harder to learn. Yeah.
Stephen Wilson (26:38):
Yeah. That was a
great explanation, I think I understand it a lot better. So how big are these chunks that end up getting predicted? Like, are they at the level of phonemes? Or morphemes? Or, can you think about it that way or no?
Jean-Rémi King (26:56):
So they are defined with a time constant; they're not defined functionally. So I don't think we can directly associate them with phonemes or morphemes or words, but what we can say is that they are on the order of 100 to 200 milliseconds. So they are slightly below the phonetic units. And there have been a lot of experiments on what
(27:19):
timescale should be best for this learning, as evaluated with downstream tasks: if after this you do, let's say, a speech-to-text task, do you get better by training with longer or smaller units? And the authors have converged on this relatively small temporal scale. And I think this
(27:43):
has to do with the fact that the algorithm learns only one level of representation. So it's not predicting at the raw waveform, it's predicting at a high level, but it's still one level of representation that it is trying to predict, which is around the phonetic level. But we touched on this issue a bit in a different article with Charlotte
(28:05):
Caucheteux, actually, in a paper that was recently published in Nature Human Behaviour, which is really tapping into this idea.
Stephen Wilson (28:14):
Yeah we can talk
about that. Yeah, yeah. Okay, I mean, maybe these things are about the size of syllables if they're that length; maybe they're a little shorter than syllables. Okay, so then in this paper, we're talking about the NeurIPS 2022 paper, you then show in
(28:37):
which brain areas these predicted representations track with the BOLD responses, after a suitable convolution with the HRF. Can you tell us what you see there?
Jean-Rémi King (28:49):
Sure. Yeah. So what we try to do is to quantify this functional similarity between the deep net and each voxel with a linear mapping. So basically, what we do is we learn a regression from the activations of the deep net to the voxel activations, to try to see whether we can
(29:11):
accurately predict whether the voxel is going to be high or low given the speech sound, given the activations of the deep net. And that's a pretty standardized approach these days; I mean, the GLM in fMRI is already based on this idea, but it was formalized for this goal in the
(29:32):
paper by Jack Gallant and Naselaris in, I think, 2011. So the method is basically linear algebra. I'm not gonna go too much into the details, but it gives us one number, which is, for each voxel, how similar is it to the activations of the
(29:52):
deep net. And the first thing that we do is that we do this similarity analysis for each layer of the deep net. So the deep net is organized hierarchically: we have the first layer, which just takes the raw waveform, and this representation is passed on to a second layer, which again transforms the representation, which is passed to the third layer,
(30:13):
and so on and so forth. And so for each layer, we have a set of activations that we can compare to each voxel. And what we observe is that different voxels in the brain are more or less similar to different layers in the deep net. And the striking observation, when you look at the overall result, is
(30:37):
how structured this similarity is. So if you look at A1 responses, you basically get activations which are most similar to the very early layers of the transformer in the deep net. And the further away you go from A1, the more the activation that's being recorded with fMRI gets similar to deeper
(31:00):
and deeper layers in the deep net, such that if you go to the temporal pole, or to the temporoparietal junction, or the prefrontal areas, you end up with voxels which are most similar to the deepest layers in Wav2Vec 2.0. And what is really striking is how monotonic this relationship is: the further you are from A1, in a sort of direct
(31:26):
path distance, the more your representations appear to be similar to deeper and deeper layers in the algorithm.
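A schematic of the voxel-wise linear mapping described here might look like the following: for each layer, fit a ridge regression from that layer's activations (in practice HRF-convolved and resampled to the fMRI TRs) to each voxel's BOLD time course, score the predictions on held-out data, and record which layer best explains each voxel. All arrays below are random placeholders for the real activations and fMRI data.

```python
# Sketch of a layer-wise encoding analysis: ridge-regress each layer's
# activations onto each voxel, score on held-out data, and note the
# best-fitting layer per voxel. Arrays below are random placeholders.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n_trs, n_features, n_voxels, n_layers = 300, 128, 1000, 12

# Placeholder data: layer activations already aligned to fMRI TRs
# (in practice: convolve with the HRF and resample), plus BOLD signals.
layer_acts = rng.standard_normal((n_layers, n_trs, n_features))
bold = rng.standard_normal((n_trs, n_voxels))

train, test = slice(0, 200), slice(200, 300)
scores = np.zeros((n_layers, n_voxels))

for layer in range(n_layers):
    X = layer_acts[layer]
    model = RidgeCV(alphas=np.logspace(-1, 4, 6)).fit(X[train], bold[train])
    pred = model.predict(X[test])
    # Brain score: correlation between predicted and observed BOLD per voxel.
    pred_z = (pred - pred.mean(0)) / pred.std(0)
    bold_z = (bold[test] - bold[test].mean(0)) / bold[test].std(0)
    scores[layer] = (pred_z * bold_z).mean(0)

best_layer = scores.argmax(axis=0)   # preferred depth for each voxel
print(best_layer[:10])
```

Mapping each voxel's best-fitting layer onto the cortical surface is what produces the layered, roughly concentric figure discussed next.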
Stephen Wilson (31:36):
Yeah, if I, like, just to kind of try and help, you know, audio listeners visualize it, I mean, this is figure three in the paper. And to me, it resembles kind of concentric circles coming out of A1, right? So in A1, you've got prediction being most successful from, like you said, the earliest layers, the most superficial, most
(31:56):
similar to the input. And then, like you said, the further you go out, it's almost like these concentric rings, as you go into the temporal lobe and into the inferior parietal lobe. So, like the angular gyrus, it kind of gets to further and deeper and deeper layers, but it's not completely concentric, because it doesn't just randomly go into the insula, and it doesn't just randomly go into the sensorimotor strip, right?
(32:16):
It very much goes out into the temporal and inferior parietal regions, and also frontal.
Jean-Rémi King (32:24):
Absolutely.
Stephen Wilson (32:24):
Which is kind of
noncontiguous with the frontal. So it's very much, like you said, it's a beautiful figure, by the way, it's like the heart of the paper, but it's obviously capturing something pretty basic about the structure of the language system.
Jean-Rémi King (32:41):
Absolutely. I mean, when I saw this figure, when we were playing with the data, I was instantly shocked. I was like, wow! You don't usually get this in fMRI. My experience with fMRI before is you get this contrast between, I don't know, Jabberwocky and meaningful text, and you end up with a blob, or let's say a set of
(33:03):
blobs, which are different depending on the contrast, and it's very difficult to make sense of these things. Whereas here, the map is remarkably smooth and continuous and simple to describe, in a sense. And I think the reason for this is because we are working with a large number of participants that were made available from different groups.
(33:28):
So in this case, I think there were more than 400 participants listening to natural stories, and it's really the big numbers that I think allow retrieving this very simple structuring of language processing in the brain. But it's not just a concentric circle, either. Because if you look at the prefrontal cortex, you have this
(33:50):
very interesting sort of gradient within the prefrontal cortex, where you have a stripe that starts from the motor and premotor areas and goes towards IFG, and within the inferior frontal gyrus, you also have some gradients, which I think could make sense in light of anatomy, because we know that different parts of
(34:10):
IFG project, through the white matter tracts, to different parts of the temporal lobe. And if you pay close attention, you will see that these actually match our expectations. So it's a very striking figure, I find.
Stephen Wilson (34:27):
Yeah.
Jean-Rémi King (34:27):
Because, so not only visually, but also because of what this means. And so when I listen to my previous postdoc advisor, David Poeppel, sometimes I hear him sort of criticize the whole approach for being too technical and too fancy and sort of forgetting the ideas, the criticism being: that's
(34:49):
okay, but these are models with billions of parameters and you're just doing a huge regression and we don't understand anything in there. But here what I find striking is that the optimization function is very simple to describe. It is just one equation: you say, you have two things to do, you have to learn to predict, and you have to learn what should be predicted. And this is the goal, right? And this is sort of the
(35:11):
essence of it all. And if you let the algorithm optimize this function, then it naturally comes up with a hierarchy of representations, which seems to provide a very strong structure, or at least a seemingly simple, or simple enough, organization of speech processing in the brain.
(35:34):
And to me, that's quite striking.
Stephen Wilson (35:39):
Yeah. No, I
just, I mean, it's not overly complicated. Well, I mean, I think it's probably very complicated to implement, but like you said, there is a simplicity to it as well, and when it gives you a result that makes sense, it's definitely reassuring. You know, like you mentioned in the paper too, you do
(36:00):
interpret these gradients in the frontal areas as well. And I looked at it a bit, and I was like, you know, I think you might have to do a bit of convincing for me there, because I get it with the temporal, the temporoparietal thing is just pristine. And, you know, it's not trivial either, because there were, like, big debates between, like, Greg Hickok and Sophie Scott, for
(36:21):
instance, as to whether the predominant direction of processing in the temporal lobe was headed anteriorly or posteriorly from A1, and your data basically shows, well, there's no winner there, they're actually both right, because it's going in both directions. So, you know, I do think this data is not just a pretty picture, it does address open questions. But I'm not totally convinced about the
(36:45):
frontal gradients. Like, I'm not sure if you have more data that might sort of prove that those are replicable and meaningful and related to the connectivity in some way that makes sense.
Jean-Rémi King (37:00):
No, I think it's just a hunch. So, this is just a first study; we did not look for these gradients in the prefrontal cortex, we just observed them. In retrospect, I think they make kind of sense from an anatomical point of view. Let's take one example, in the case of the motor cortex. As I said in the beginning of
(37:22):
this conversation, I don't have a background in the neurobiology of language, I'm not a strong defender of, let's say, the motor theory of language. To me, it was kind of a story, like we have many stories in science and cognitive science in particular. And so first of all, seeing that you had strong activation in the motor cortex,
(37:42):
it was like, okay, maybe there is something to the story. And then to see that the representation in the motor cortex appears to be lower level than the representation that we observe in the premotor areas or in SMA.
Stephen Wilson (37:56):
Yeah.
Jean-Rémi King (37:56):
I think that's also going in the right direction, right? It could have been the other way.
Stephen Wilson (38:02):
Yeah. I see it now. Yeah, you've got, earlier, you've got a lower layer response in this sort of dorsal part of ventral premotor cortex, which matches up to this area that, like, my friend Eddie Chang, who you're probably familiar with his work, you know, so he has
(38:22):
this paper from 2016. I think the first author is, I think, Cheung, where they show that that area up there, do you know the paper I'm talking about? It has, like, auditory properties.
Jean-Rémi King (38:35):
I think it's neural correlates of larynx. Is that correct?
Stephen Wilson (38:35):
Well, that's not what they say. But they kind of show that it's basically an auditory area. This paper is published in, some good journal, I forget which one. (Laughter) But anyway, it's 2016. And they show that that area basically has auditory
(38:56):
properties. Like, it doesn't really behave like a motor area, it behaves like an auditory area. And so yeah, now I see it. I didn't see it when I was looking at this before. But yeah, out of all your frontal areas, that's the one that's linked up to the earliest layers in your model. So you're capturing the fact that that's more of a sensory area, and then the more prefrontal regions are deeper, so yeah, okay, I buy it. I buy it now.
Jean-Rémi King (39:20):
I'm not trying to sell it. (Laughter) But, um, the first time I think I encountered these motor activations, I mean, again, it's pretty recent given that I am a newcomer in the field, was with MEG. So when we do the source reconstruction with MEG, we also see, very early on, activation in the motor areas. And at the
(39:41):
beginning, I was a bit suspicious because I thought we had just a source reconstruction issue, but now we actually see it also in intracranial recordings, and here with fMRI. So I think all of these different pieces of evidence point toward a similar finding. So, clearly, to me, it's just the beginning, right?
(40:03):
Again, these are just activations, these are just correlations. We don't know how important these activations will turn out to be, and I think the lesion studies and all this remain completely relevant. But it's the simplicity of the overall organization revealed by this mapping which strikes me first, and I really think of this as: okay,
(40:24):
now we can see a bit better how the language network is structured to process language, but everything remains to be done and to be investigated more thoroughly in light of recent studies and anatomy and individual variations.
Stephen Wilson (40:41):
Yeah, sure. But yeah, this is definitely good bedrock to build on. Yeah, definitely recommend everybody checks out that figure. So, can we move on to the next paper? Or do you want to say anything more about this one?
Jean-Rémi King (40:58):
It's your podcast.
Stephen Wilson (41:02):
Okay, I just, I
do my best to structure things.
But you know….
Jean-Rémi King (41:05):
There's one thing that I can say here. Because, again, I was surprised, and I think in retrospect I shouldn't have been, given what was said in the literature. But I was just surprised. So when we do this comparison between, in this case, Wav2Vec 2.0 and the brain, we obtain a similarity score. And as I mentioned, we do it for each layer, and we find this
(41:25):
structure. And then the rest of the paper really goes much more in depth into what kind of learning strategy leads to more or less high similarity scores. So we train Wav2Vec 2.0 on speech, on non-speech, on speech from a different language than the participants were exposed
(41:49):
to, and so on. And the striking thing to me at the time was that if you take random weights, you already get very high similarity scores; you get at least 70% of the variance explained from the best model. And at the beginning, I thought we did something wrong, there was
(42:09):
something like a mistake in our pipeline and all this, but I think it was already described as such in different papers, including the paper by Josh McDermott from 2017. It just was not necessarily emphasized. And I think in retrospect, perhaps we should not forget about this:
(42:31):
even an architecture which is not trained actually has representations which linearly map onto the brain. Just simply because of the convolutional hierarchical structure of the network, you already get sort of a very good first step. And so the learning there comes as something
(42:53):
which will increase the similarity, but it's clearly not the only thing which makes the model similar to the brain.
Stephen Wilson (43:00):
Okay. I, yeah, that was one of the things that I had written down to ask you, so I'm glad you went back to it. Why does the untrained model succeed at all? But I still don't really understand why, based on what you just said, because why would the structure of the model be enough to make it match up to BOLD fMRI
(43:25):
data?
Jean-Rémi King (43:27):
So I do not know why. And again, here are only some intuitions. The way I think of this is that sound is structured in time. And so if you apply a mathematical operation which preserves the temporal components, you will generate a
(43:47):
representation, a new representation, in the sense that it's information which was not linearly readable before but is now linearly readable. You will have something which is not completely random. And so, I mean, that's the best way to explain this. The way I see
(44:11):
representation learning, or learning in general, is that you need to find combinations of features which are most usable to act on the world, or to predict what's going to happen. And this combination has to be structured, but space and time basically provide you with very strong inductive biases. So
(44:31):
convolution in space or convolution in time preserves the temporal or spatial structure. And so when you do these nonlinearities in between layers, you learn more and more complex, or you generate more and more complex, representations. And if they are sort of biased towards preserving temporal or
(44:52):
spatial structure, even the random ones may be a good start, as opposed to just completely scattering or shattering the information. So that's how I think of this. But the truth is that, again, I do not know why it works so well
(45:12):
with the random networks.
Stephen Wilson (45:16):
Do the random
networks also replicate this
kind of almost concentric structure that we were talking about?
Jean-Rémi King (45:24):
They have a bit of it, yeah, but it's less strong than what we have observed with Wav2Vec 2.0.
Stephen Wilson (45:31):
Could it be that
the temporal receptive field
increases, as you go deeper in the layers, even in the random, in the untrained network? Would that be a potential explanation?
Jean-Rémi King (45:43):
So, Wav2Vec 2.0 is organized into two bricks, right? There is a first deep net which is convolutional. And so here, the deeper you go in the network, the larger the temporal receptive field of each unit, simply because each unit is having its own receptive field. So it's built hierarchically, so naturally
(46:04):
you get this gradient. But in a transformer, there is no such thing. So you need to learn to build larger and larger receptive fields, because the transformer basically sees it all: even the first layer is able to combine all of the tokens from anywhere in the context. But learning will bias,
(46:25):
and in fact this is what we observe, learning will bias the first layers to focus on what's happening nearby, from a positional embedding point of view, and so they will naturally build smaller receptive fields, whereas the deeper layers will tend to learn larger receptive fields. But in principle, if you take a random transformer, then you do not
(46:47):
have this bias.
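The growth of the temporal receptive field through the convolutional front end can be worked out directly from the kernel sizes and strides. Here is a small sketch, assuming the commonly reported Wav2Vec 2.0 feature-encoder configuration (seven 1-D convolutions); if the actual configuration differs, swap in the right kernels and strides.

```python
# Receptive field and hop size of a stack of 1-D convolutions.
# Kernels/strides below are the commonly reported Wav2Vec 2.0 feature
# encoder; treat them as an assumption and substitute your own if needed.
kernels = [10, 3, 3, 3, 3, 2, 2]
strides = [5, 2, 2, 2, 2, 2, 2]

receptive_field, hop = 1, 1
for k, s in zip(kernels, strides):
    receptive_field += (k - 1) * hop   # each new layer widens the window
    hop *= s                           # and multiplies the effective stride

sr = 16000  # Hz
print(f"receptive field: {receptive_field} samples "
      f"({1000 * receptive_field / sr:.1f} ms)")
print(f"hop: {hop} samples ({1000 * hop / sr:.1f} ms)")
# Each conv output frame sees ~25 ms of audio every ~20 ms; the transformer
# layers on top have no such built-in limit and see the whole context.
```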
Stephen Wilson (46:50):
Okay. So it
couldn't explain that. It could
only explain sort of asymmetries in the convolutional layers, not in the transformer layers.
Jean-Rémi King (46:58):
Absolutely.
Yeah.
Stephen Wilson (46:59):
Okay, so there's still some things to understand here. Okay. Let's talk about the next paper. Yeah?
Jean-Rémi King (47:10):
Sure.
Stephen Wilson (47:10):
This one's called ‘Brains and algorithms partially converge in natural language processing’ by Charlotte Caucheteux and yourself in Communications Biology, 2022. And this one, I think that the essential step
(47:31):
forward of this one is to show how this convergence between the models and the brain is really driven by the ability of the models to predict. So that prediction is what explains the success. Is that a fair way to summarize its main point?
Jean-Rémi King (47:53):
Yeah, I think the main result is, so the question that we ask is: what factors lead an algorithm to be more or less similar to the brain? And we already tapped into this question through the previous question. So, okay, we observe a functional
(48:14):
mapping between language models and the brain, and we see that some models correlate better with the brain and some correlate less with the brain. And in the literature, what was not clear is what makes an algorithm more or less similar to the brain, because they varied in pretty
(48:34):
much everything, right? So if you compare GPT-2 and BERT and, I don't know, LSTMs and all these that are available online, they have different architectures, they have been trained with different objectives, with different optimizers, with different databases, with different sizes of databases, and so you don't really know if, let's say, GPT-2 working
(48:57):
better than any other algorithm is because it's a better architecture, because it's been trained with more data, or because of some other factors. And so we released this study the same week as a study from Martin Schrimpf and Ev Fedorenko, where they did this kind of mapping with the
(49:24):
existing models like BERT and GPT-2 and RoBERTa and all of this zoo of models available. And they came to a similar conclusion: it seems that one variable that predicts very well whether a
(49:46):
model will be similar to the brain or not is its ability to predict the next word. So we were really happy to see that two independent labs sort of came to the same conclusion.
Stephen Wilson (49:55):
Okay, let me say that again, just to make it real clear, because I think that's so important. There's many different model architectures you can consider and many different parameters you can vary, but the biggest factor that predicts whether a model is going to do a good job of matching the brain is how well it can predict the next word, in that sense in which
(50:17):
all of these models are set up. Because now we've kind of gone back to talking about text-based models, right? So we're not working with the audio or auditory signal anymore. We're back in sort of classic, by which I mean, like, the last two years of, language models, where it's word prediction. Okay. Let's go on.
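To pin down what "predicting the next word" means operationally, here is a minimal sketch using the public GPT-2 small checkpoint as a stand-in for the language models being discussed; it scores the candidate continuations from the earlier 'once upon a' example.

```python
# Sketch: what "next-word prediction" means operationally. Uses the public
# GPT-2 small checkpoint as a stand-in for the models discussed here.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "Once upon a"
ids = tokenizer(context, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]          # scores for the next token
probs = torch.softmax(logits, dim=-1)

for word in [" time", " dog", " table"]:
    token_id = tokenizer.encode(word)[0]
    print(f"P({word!r} | {context!r}) = {probs[token_id].item():.4f}")
```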
Jean-Rémi King (50:34):
Absolutely. Yeah. And so that's sort of the main result. And how do we know about this? So again, what we did is, we analyzed fMRI data, but also magnetoencephalography data, which were recorded by Jan Mathijs Schoffelen at the Donders Institute. And in this study, participants had
(50:54):
to read sentences in a heavily decontextualized fashion. So you only have a sentence, and then you have a five second delay, and then it's another sentence, which has nothing to do with the previous sentence. So it's quite different from the previous study, where people are listening to podcasts.
Stephen Wilson (51:13):
Yeah, this is an
RSVP Paradigm, right? Rapid
serial visual presentation. So, yeah.
Jean-Rémi King (51:20):
Absolutely. So it's a reading task. And so the question that we had is, okay, so what drives an algorithm, or a language model to be more precise, to be more or less similar to the brain? And so what Charlotte did is basically to retrain a lot of different architectures that are based on the GPT-2 architecture, based on the BERT architecture. She
(51:43):
tried to vary the depth of the transformer, how many activation units there are for each layer, how many attentional gates there are. So she really did sort of a systematic grid search there, and for each model, for each embedding, you can get one value, which is: okay, how similar is it to the brain after
(52:03):
a given training. And then you can just feed this to an ANOVA. In this case, we use a nonparametric analysis, but in principle the idea is you say: amongst all of those factors, amongst the depth of the architecture, the width of the architecture, the number of attentional gates, what we ask the algorithm to do, which variable contributes to
(52:26):
making a better brain score, a higher similarity with regards to brain activations. And as soon as we have one variable which is the performance of the model to predict the next word, it basically absorbs all of the variance. So the ability of the algorithm to predict the next
(52:49):
word, no matter how big or deep the network, this ability suffices to predict whether the model will turn out to be more or less similar to the brain. And that was very intriguing in some sense. And we did not necessarily anticipate that the other variables would have such a small contribution.
(53:12):
They are all significant. Again, we're working with a lot of participants there, I think it's 100 participants for fMRI and the same for MEG. So basically, all of the variables have a statistical significance, but they're really small as compared to this next word prediction effect. And that
(53:34):
suggests one thing, which is that the behavior, or the task, of the model is really what matters. And then the architecture and the attentional gates and numbers of layers and all this, these are merely means towards that task. And if you can do this task, then basically, even if you're a
(53:55):
small network, that should suffice to represent things similarly to the brain.
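A schematic of this kind of analysis, with made-up numbers, might look like the sketch below: for each trained variant, record its architecture factors, its next-word-prediction accuracy, and its brain score, then ask which variable carries the variance. An ordinary least-squares model stands in here for the non-parametric statistics actually used, and the data frame is entirely synthetic.

```python
# Sketch: which factor explains the brain score? Made-up results table;
# an ordinary least-squares model stands in for the non-parametric
# statistics used in the actual paper.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_models = 200
df = pd.DataFrame({
    "depth": rng.choice([4, 8, 12], n_models),
    "width": rng.choice([256, 512, 768], n_models),
    "heads": rng.choice([4, 8, 12], n_models),
})
# Toy generative story: next-word accuracy depends loosely on model size,
# and the brain score depends mostly on next-word accuracy.
df["next_word_acc"] = (0.3 + 0.01 * df["depth"] + 0.0001 * df["width"]
                       + 0.05 * rng.standard_normal(n_models))
df["brain_score"] = 0.5 * df["next_word_acc"] + 0.01 * rng.standard_normal(n_models)

fit = smf.ols("brain_score ~ next_word_acc + depth + width + heads", data=df).fit()
print(fit.summary().tables[1])   # next_word_acc should dominate the coefficients
```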
Stephen Wilson (54:00):
So does that
make you think that the brain is engaged in predictive coding? Or do you think that to get the next word right, you need to develop good representations of language?
Jean-Rémi King (54:11):
I don't think so, for either question. So, I think the first thing that it says is that we should take this seriously and not spend perhaps too much time on the architecture, trying to sort of pinpoint
(54:31):
exactly what the relevance is of a particular layer in learning intelligent representations, but rather focus on the goal, and in this case the goal is indeed next word prediction, or word completion. And this is the goal that basically drives the rise of smart representations, that is,
(54:57):
representations that can be useful for something else. Whether the brain does follow the same principle, I very much doubt it. And the main reason here is that, unlike language models, we are not exposed to the same amount of words. So we
(55:18):
cannot just rely on trying to predict the next word, because in our lifetime we just don't hear a sufficient number of words to complete this task. So it's very interesting: over the past 60 years, there's been a lot of debate, right, on what needs to be innate, what can be acquired, in the context of language. And there were a lot of arguments, really sort of math-based
(55:42):
arguments, saying no, it's just not possible to learn the structure of language, to learn syntax, with simple exposure. Conclusion: you need to have an innate bias for, let's say, recursive structures, if we go towards generative grammar. And I think now it's pretty clear, and this has been argued, for
(56:03):
instance, by Steven Piantadosi, that this argument is clearly wrong: language models now can process language, they can retrieve syntactic structure. And they are trained with a huge amount of data. And so statistics, let's say, suffices to learn the
(56:27):
structures of language. But now this perhaps clarifies the debate, saying that, okay, maybe it's possible, but this requires a huge amount of data, data that we thought before was just not accessible. I think that's also why people got it wrong: it was not conceivable that we could feed an algorithm with so much text.
(56:47):
And so, now that this has been proven, the question still remains: okay, with a relatively small amount of word exposure, what computational architecture, or perhaps what objective, suffices to learn language efficiently? And my strong conviction is that next word prediction is not the right objective, because, again, we don't hear a sufficient number
(57:10):
of words per day. So just as a sort of rough estimation, the few studies that I could find suggested around 13,000 words a day. It varies immensely across individuals, depending on if you're a teenager, if you're a child, depending on
(57:30):
your social class, and everything. But the average was 13,000 words a day, which fits within 50 books a year, right? So, 50 books a year. And then we can decide how many years of language exposure you want for language acquisition. But basically, it's going to be in the order of 500 books, if
(57:52):
really we want to take a large margin. And GPT-4 now, we don't actually have the exact numbers, but they train on hundreds of millions of books.
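For a rough sense of the scales involved, the back-of-the-envelope arithmetic behind those numbers works out as follows (the ~100,000 words per book figure is an assumption made here for illustration, not something stated in the conversation):

```latex
% Rough scale check (assuming ~100,000 words per book):
\[
13{,}000\ \tfrac{\text{words}}{\text{day}} \times 365\ \tfrac{\text{days}}{\text{year}}
  \approx 4.7\times10^{6}\ \tfrac{\text{words}}{\text{year}}
  \approx \frac{4.7\times10^{6}}{10^{5}} \approx 50\ \tfrac{\text{books}}{\text{year}},
\]
\[
50\ \tfrac{\text{books}}{\text{year}} \times 10\ \text{years} \approx 500\ \text{books},
\]
```

whereas the training corpora of modern large language models are many orders of magnitude larger.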
Stephen Wilson (58:04):
Yeah.
Jean-Rémi King (58:05):
So it's just orders and orders of magnitude higher. And so, clearly, we missed something fundamental here. The objective that these algorithms are trained with, this next word prediction objective, I mean, they clearly work at scale, and this is very impressive; like everyone, I'm amazed every six months by the new power of deep learning,
(58:27):
and language models in particular. But still, we have to recognize that there is something extremely inefficient, that they require an amount of data which is just ridiculous compared to what children do. And so I think that the historic question remains, and clearly we haven't
(58:48):
solved this problem yet. And so I'm still very excited by this. So this was a long tangent towards your question. But the question was about this next word prediction objective, which is: is this, at the end of the day, what we do? I don't think this is what we do, because language models require too much data for this rule to succeed.
Stephen Wilson (59:09):
Well, I mean, yeah, so you're answering from a learning perspective. And that's a very interesting tangent for sure. A lot of things one could follow up on there. I mean, just quickly, like, you know, it's not even sufficient to be able to learn from what the average child receives, right? You have to be able to learn from what, you know, the child in a poor
(59:29):
environment receives, because people can learn even in very impoverished environments. So it's gotta be able to deal with, like, maybe 10% of what would be normal, and still, the kids will acquire language with no problem.
Jean-Rémi King (59:42):
Yeah, absolutely. I mean, we can tie our hands behind our back to make the challenge even more difficult, but even with, let's say, rich environments, the amount of data that we are exposed to is ridiculously small compared to language models, so we don't even need to go into the extreme cases. But I certainly agree that children have a natural bias for babbling, for
(01:00:06):
learning languages. This is obviously something that we do not see in other species. And even after heavy training, we don't manage to train gorillas or chimps to learn, for instance, sign language, or at least whatever they learn is extremely poor compared to what children are able to acquire in a couple of years. And so clearly there is an algorithm, there is an
(01:00:29):
objective, or an architecture, which allows us to learn language extremely efficiently. I think this is what Chomsky and many others had in mind in their theory. And so I think this line of thought remains extremely relevant. And we should not dismiss it just because we have now language models that work at scale. It depends, again, on the
(01:00:53):
objective. If the objective is just to have an AI system that learns to process language at any cost, sure, we've arrived there, and this is done. And so perhaps we don't need generative grammars and theory. But if the question is, how do we learn, what are the rules or principles that suffice to learn efficiently? There I
(01:01:16):
think that we are just at the beginning. We haven't found it yet, at least.
Stephen Wilson (01:01:23):
Yeah. Yeah, I agree. There's something in this paper that's interesting to me, and Schrimpf et al., who you mentioned, did the same thing, right? Which is that when you're doing these model comparisons, you kind of need a nice, simple, objective way of talking about how well the models fit the brain. And they use this
(01:01:44):
concept of a noise ceiling, which you do too. And so the noise ceiling is, basically, how well you can predict one participant's brain from the other participants' brains. So it's kind of an inter-subject correlation analysis. And the idea would be, well, we can never hope to predict from a language model what isn't shared among all humans,
(01:02:07):
right? There's always going to be individual variability in people's BOLD activation. So it's unfair to make the model try and capture that, right? The model can only possibly capture what is shared among all humans. So the correct denominator, when you're evaluating performance, is how well you can predict one person from other people. Okay? Did I say that right?
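For readers who want to see the logic concretely, here is a minimal sketch of that kind of noise ceiling: predict each participant's voxel responses from the average of the other participants and take the mean correlation per voxel. The array names, shapes, and synthetic data are illustrative assumptions, not details from either paper.

```python
import numpy as np

def noise_ceiling(bold):
    """Leave-one-subject-out noise ceiling.

    bold : array of shape (n_subjects, n_timepoints, n_voxels)
           BOLD responses of all subjects to the same stimulus.
    Returns an array of shape (n_voxels,): for each voxel, the mean
    correlation between one subject and the average of the others.
    """
    n_subjects = bold.shape[0]
    scores = []
    for s in range(n_subjects):
        target = bold[s]                                   # held-out subject
        others = bold[np.arange(n_subjects) != s].mean(0)  # average of the rest
        # Pearson correlation per voxel, computed over time
        t = (target - target.mean(0)) / (target.std(0) + 1e-8)
        o = (others - others.mean(0)) / (others.std(0) + 1e-8)
        scores.append((t * o).mean(0))
    return np.mean(scores, axis=0)

# Toy example: 10 subjects, 200 timepoints, 500 voxels sharing a common signal
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 500))
bold = shared + rng.normal(scale=2.0, size=(10, 200, 500))
print(noise_ceiling(bold).mean())  # average fraction of signal shared across people
```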
Jean-Rémi King (01:02:28):
Yeah.
Stephen Wilson (01:02:29):
And so in Schrimpf et al., I know they get very close to 100, they say very close to 100%. Like, these models are predicting almost everything that can be predicted from other humans. Do you get that in yours as well, or no? I didn't really get from your paper whether yours was like that too.
Jean-Rémi King (01:02:50):
No, we don't get that. But there is a first major difference, which is that in Schrimpf et al., they focus on the 10% best voxels. So for each of the 10 individuals that they analyze, I think it's 10, maybe it's seven, I forgot, they take the 10% best
(01:03:12):
voxels and then they do the whole analysis on this. And so that's quite different from what we do, in a sense, because we do the analysis on all the voxels.
Stephen Wilson (01:03:23):
I think that
would be enough to explain the
difference. Yeah.
Jean-Rémi King (01:03:26):
So that's the first difference. The second difference is that the noise ceiling that they use is based on an extrapolation. So if you dig fully into the methods, they extrapolate: okay, if we add more participants, can we expect to have a noise ceiling which ramps up more or less quickly, and they derive the noise ceiling from this sort of
(01:03:46):
projection. And we don't do that, we just take the whole cohort and we say, okay, this is the noise ceiling. We don't try to extrapolate whether, if we added a cohort of 1000 participants, we would get something better. And so we've had a lot of pushback on noise ceilings over the years, I think it's a bit less the case
(01:04:09):
now. But at first it was really hard for us, because we were always asked, okay, but you guys don't provide noise ceilings, participants never hear the same sentence twice, and so we don't know how much variance is explained, and therefore we reject the paper. And I think this is missing the point. So I think noise ceilings can be useful, right? Because they give us an
(01:04:30):
estimate of how good we are. But in many cases, in the data that we've analyzed, we actually have models that were better than the noise ceiling that we built. And this is for an obvious reason, which is that when we train to predict brain activity of a given subject from all other subjects, all of the other subjects are also noisy. And so you're
(01:04:52):
learning to predict something from noisy data, and that can be challenging. And so there are a lot of arbitrary decisions in how you build this noise ceiling: as I said, the projection is one thing, the voxels that you select are another thing, whether you base this on repetitions or not, within subjects or across subjects, all of this. And then you end up, I think, sort of building a whole analysis on
(01:05:15):
something which is not so stable; it depends on how you choose your noise ceiling. And so what I tell students is to not care about the noise ceiling whatsoever until the very end, we'll do the noise ceiling at the end. But I think what is reproducible, and much more robust, is to provide the actual effect sizes without a noise ceiling. So we say, okay, this is our score on the raw waveform, on the raw BOLD signal, on the raw imaging signal. And this is
(01:05:38):
useful, I think, because if you're another lab, you will also get this raw signal, and so you can evaluate your model with this kind of data. And it will be easier to compare across studies than if we have sort of
(01:05:59):
a zoo of different noise ceilings. Again, I don't want to dismiss noise ceilings altogether, I think they are useful, but there are too many choices at the moment for them to be a must.
Stephen Wilson (01:06:18):
Okay.
Jean-Rémi King (01:06:18):
And in the case of language, it's even more the case than in other modalities. So in the case of images, for instance, we know, again from monkey electrophysiology and fMRI, that if you present an image multiple times, most of the activations are similar across repetitions. But we know that in language this is not the case. If you hear the same
(01:06:39):
sentence twice, we know that, for instance, prefrontal areas are activated less the second time, even less the third time. And this is not just an adaptation effect, you can have a repetition with sort of a distractor of multiple minutes in between. For instance, the work of Ghislaine Dehaene-Lambertz from the early 2000s shows this: if you hear the sentence a second or third time within the
(01:07:02):
session, the prefrontal cortex reacts a lot less. And the reason for this, at least my intuition, and obviously we need to do a lot of studies to confirm this, but the intuition is that we build language structures on the fly the first time we hear them. But as soon as we know what is meant in a
(01:07:23):
given sentence, we sort of form these idioms online, we can extract the meaning without having to build the whole syntactic structure and solve the ambiguities. And so many of the voxels, or many of the neurons, will not have to be recruited to achieve the same goal. So perhaps this is also the case in vision, but in the
(01:07:44):
case of language it's particularly the case. And so the consequence of this for noise ceilings is that you cannot present the same sentence multiple times and hope that the participants will process it in the same way. And therefore the very premise of the noise ceiling here is jeopardized.
Stephen Wilson (01:08:01):
Yeah, that makes sense. I mean, language is just more contextualized. It can't not be contextualized, relative to something like looking at a visual scene, where you can look at the visual scene and at least early visual areas will respond the same way. Yeah, okay. Do we have time to talk about one more paper? Sure. All right, let's talk about Caucheteux et al., 2023. This
(01:08:26):
one's called ‘Evidence of a predictive coding hierarchy in the human brain listening to speech’, in Nature Human Behaviour. Just came out, congratulations.
Jean-Rémi King (01:08:35):
Thank you.
Stephen Wilson (01:08:35):
And this one kind of starts from the premise that large language models are not as good as humans at processing language. And then you sort of ask why that might be. And you have a possible explanation in mind, which is that whereas the LLMs are predicting just the next word,
(01:08:56):
what humans might be doing is making longer-term predictions, so predicting more words and perhaps predicting some kind of hierarchical structure. And I really liked, in figure one, you have this nice layout of the experiment, and the example sentence is ‘great, your paper’ and then the prediction is ‘is
(01:09:18):
not rejected’, (Laughter) which is what we all hope for with our papers, right? That's the prediction that we want to make, although it's not always borne out.
Jean-Rémi King (01:09:28):
Absolutely. Yeah. We start from this, I mean, this resonates with a lot of the points that we discussed earlier. So the example that we chose in this figure, which is clearly an inside joke and I am not sure it's entirely appropriate, but whatever, is to focus on negation. And I very much like negation, because it's, to me, a very minimalist example of
(01:09:52):
a very interesting composition. In the case of negation, if you say, for instance, it's not rejected, you know that you need to combine the words in a nonlinear fashion in order to retrieve the meaning of the
(01:10:13):
phrase or the sentence. If you were just to do a linear composition, it would be sort of a bag of words. So at best you would be able to say that it's something about rejection, but you cannot retrieve not rejected. And this is even more obvious when you have a slightly more complex
(01:10:35):
sentence. So if you say, it is not small, and green, and you have another sentence, which is, it is small and not green, if you do a linear combination of these, you won't be able to understand the meaning, because you need to know that ‘not’ is applied to one adjective, or specific words, and not the
(01:10:56):
other. And this representation has to be a nonlinear composition. So that's why we focus on this example. And the idea here is to focus on this issue that I mentioned earlier, which is that the next word prediction objective carries you at scale, because you have a lot of data, but you're not pushing the algorithm to try to learn to
(01:11:20):
predict the next idea. And so in the case of a composition, we wanted to focus on this: if we say it's great, it means that the following part of the sentence should be something positive. Rejected is negative, at least from our point of view, it's kind of negatively correlated. But if
(01:11:40):
it's combined with ‘not’, it's fine, you have the right prediction in mind, you know it's going to be something positive, you get ‘not rejected’ in, let's say, your validation, you verify that ‘not rejected’ is something positive, and therefore your prediction was correct. Whereas if you had tried to predict, let's say, ‘great, your paper is accepted’, you would have gotten that wrong, because
(01:12:04):
the true word was ‘not’ and the predicted word was ‘accepted’. And so you would have told the algorithm, okay, you are completely wrong, try something else, whereas actually you had the right idea. So again, that's a long tangent to come back to the same idea, which is that we should try to have algorithms that do not just try to do autocomplete. If
(01:12:27):
you can do autocomplete, why not? But they should also try to predict the next ideas. The next idea, perhaps, and then the next hierarchy of ideas, because you have structures that unfold over different timescales: what is going to be said within this constituent, within the sentence, within this paragraph, what is the narrative structure of the story. All of these things are to some extent determined and can be predicted. And so we should optimize these algorithms on this goal, as opposed to the sole goal of trying to predict the next word.
Stephen Wilson (01:12:58):
Yeah.
Jean-Rémi King (01:12:58):
The analogy that I have, and again, perhaps it's a wrong analogy, but when I think of this, I think of how we would teach a kid to ride a bike. And so here, what we have with language models is, we basically tell the child or the language model to just focus on what's
(01:13:18):
exactly in front of the wheel. Okay, try to predict whether there's going to be a little stone, or whether you should turn left or right, right now just avoid the obstacle, which is a very proximal, short-sighted objective. And of course, you need to do this; if you don't do this, you will fall. But if we want the child or the agent to be intelligent, we also need to say, anticipate your turn,
(01:13:39):
anticipate where you're going to direct your gaze. And ultimately, also, how do you drive around a city? How do you plan your route if you want to go from point A to point B? And if you have enough driving experience, perhaps you can do this only by looking at
(01:14:00):
what's exactly in front of your wheel, and you will learn every turn of the city. And I've no doubt that this is what language models basically do. But that's probably not the right and the most efficient way to learn. And so that's sort of the idea here: we should have algorithms that are trained to predict multiple levels of representation, and not just hope that these levels of
(01:14:22):
representation will emerge just from the mere amount of data that we feed them with.
Stephen Wilson (01:14:27):
Yeah. So there's so much behind that example, which in the paper, you just put in the figure, and, you know, I don't think you talk about negation and the unique challenge of it. That's neat. So, in the paper, you address this by introducing a forecast window, where the models have to predict
(01:14:49):
different numbers of words into the future. And the hope is to see whether introducing these forecast windows into the model improves the correspondence between the models and the brains. So can you kind of explain how the forecast windows fit into the whole architecture? That was a
(01:15:10):
little bit, I didn't really understand that when I was reading it.
Jean-Rémi King (01:15:15):
Absolutely. Perhaps I should first say that the negation example is something we are pursuing with Arianna Zuanazzi in David Poeppel’s lab, so we have a paper on arXiv specifically focusing on negation, but outside the domain of language models. For those interested in the brain basis of minimalist composition, like negation, that's, I think, a cool paper to
(01:15:35):
have a look at. In this paper with Charlotte Caucheteux and Alex Gramfort, we indeed change the objective. That's the goal: we want to change the objective of a standard language model so that it doesn't just predict the next word, but it potentially forecasts longer-term representations. And for this, we use two different strategies,
(01:15:59):
independently from one another. One is based on linear algebra, and the other one is based on optimization. So perhaps I can start with the optimization one, because it's, I think, simpler, but also a bit less conclusive, because it's sort of deep learning magic, as opposed to linear algebra, which
(01:16:19):
sort of decomposes things in a clear fashion. Which is the exact reverse of what we did in the paper. So, in the optimization case, what we do is we take GPT-2, we train it to predict the next word. And then we take another GPT-2 and we train it to predict the next word and the latent
(01:16:41):
representations of the next words, and I think we take something like the seven or eight words after the current item. So if, for instance, you have ‘once upon a’, the first model is trained to predict ‘time’, and the next model is trained to predict ‘time’ and what's going to happen in seven
(01:17:04):
words. But what's going to happen in seven words is non-deterministic, it's very hard to know what word will be said in seven words from now, just because there are so many possibilities, sort of the forking paths problem. So what we train the algorithm to do is to learn to predict a latent representation of the future words.
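To make the dual objective concrete, here is a minimal sketch, assuming a frozen GPT-2 supplies the "future" latent targets and a second GPT-2 is trained on next-word prediction plus a regression toward the hidden state of the word eight positions ahead. The layer choice, forecast distance, loss weighting, and variable names are illustrative assumptions, not the exact recipe of the paper.

```python
# Sketch of a dual objective: next-word loss plus regression toward the
# latent state of a word ~8 positions ahead, provided by a frozen GPT-2.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Model, GPT2TokenizerFast

DIST = 8  # how many words ahead the "distant" target sits (assumed value)
tok = GPT2TokenizerFast.from_pretrained("gpt2")
student = GPT2LMHeadModel.from_pretrained("gpt2")    # model being trained
teacher = GPT2Model.from_pretrained("gpt2").eval()   # frozen, provides future latents
head = torch.nn.Linear(student.config.n_embd, teacher.config.n_embd)
opt = torch.optim.Adam(list(student.parameters()) + list(head.parameters()), lr=1e-5)

def dual_objective_step(text):
    ids = tok(text, return_tensors="pt").input_ids
    out = student(ids, labels=ids, output_hidden_states=True)
    lm_loss = out.loss                                # proximal: next-word prediction
    with torch.no_grad():
        future = teacher(ids).last_hidden_state       # latent state of every word
    h = out.hidden_states[-1]                         # student's current hidden states
    # align position t with the teacher's state at position t + DIST
    pred = head(h[:, :-DIST])
    target = future[:, DIST:]
    distant_loss = F.mse_loss(pred, target)           # distant: predict the future latent
    loss = lm_loss + distant_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return lm_loss.item(), distant_loss.item()

print(dual_objective_step("Once upon a time there was a language model that read stories."))
```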
Stephen Wilson (01:17:24):
Not the actual
words.
Jean-Rémi King (01:17:25):
Not the actual words, but the latent representation. And so, two objectives: one which is proximal, the language model, next word prediction, and one which is distant, which is trying to predict the latent representations of what's going to happen seven words from now. And what we show is that these dual objectives lead to activations which are more similar to the brain than the
(01:17:48):
sole proximal objective, which is next word prediction. That's sort of the bottom line. And then we have this other approach, which is not based on GPT-2 fine-tuning or retraining. It's based on sort of a linear algebra decomposition. So, what we do is a bit more complex technically, but conceptually
(01:18:10):
it's the same. We take the activations of GPT-2 in response to a given word and its preceding context. And we ask, okay, what is the similarity between GPT-2 and the brain? That gives us one score. And then we say, if we were to add to these activations of GPT-2 the future activations of GPT-2, would that increase the
(01:18:33):
similarity with the brain? And the answer is yes, up to, well, it peaks around 9, 10 words, if I recall correctly. And so we can do this systematically: we can say, if we add the future activations of the words, so let's say we sort of
(01:18:55):
peek into what's going to happen in the future, we embed these words into GPT-2, we extract the activations, we take these additional future activations and we stack them onto the current GPT-2 activations, we ask, is it similar to the brain, yes or no, and we obtain a higher similarity score, a higher brain score. And we can vary very systematically the number
(01:19:19):
of words we peek into in the future, and we can vary how deep the representations of these words should be, to do this similarity assessment systematically. And the point, all of this is very technical, and I cannot imagine how hard it is to follow what I'm saying in a
(01:19:43):
podcast without any diagrams. But the point is that we have methods to evaluate whether an algorithm which has long-term forecast predictions is more similar to the brain than an algorithm which has only short-term forecast predictions, like GPT-2. So that's the method. First result is that it works better.
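Here is a rough sketch of that stacking logic: fit a ridge encoding model on the activations of the current word alone, then on the current activations concatenated with activations taken d words ahead, and compare the cross-validated brain scores. The synthetic arrays, shapes, and the choice of d are illustrative assumptions, not the paper's exact pipeline.

```python
# Compare "current word only" vs "current + future activations" brain scores.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

def brain_score(features, bold):
    """Cross-validated correlation between predicted and measured voxel activity."""
    model = RidgeCV(alphas=np.logspace(-2, 4, 7))
    pred = cross_val_predict(model, features, bold, cv=5)
    pred_z = (pred - pred.mean(0)) / (pred.std(0) + 1e-8)
    bold_z = (bold - bold.mean(0)) / (bold.std(0) + 1e-8)
    return (pred_z * bold_z).mean(0)          # one correlation per voxel

rng = np.random.default_rng(0)
n_words, n_dims, n_voxels, d = 1000, 768, 200, 8
acts = rng.normal(size=(n_words, n_dims))     # (toy) GPT-2 activations per word
bold = rng.normal(size=(n_words, n_voxels))   # (toy) voxel responses per word

current = acts[:-d]                           # activations of word t
future = acts[d:]                             # activations of word t + d
base = brain_score(current, bold[:-d]).mean()
stacked = brain_score(np.hstack([current, future]), bold[:-d]).mean()
print(f"current only: {base:.3f}   current + future(+{d}): {stacked:.3f}")
```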
Stephen Wilson (01:20:06):
Hang on a second, I'm also curious to know whether people understand. I think so, actually, because, for me at least, I mean, I guess I've already read the paper, but I understand it more already having heard you say it out loud. I think there's something about just describing things in natural conversational language that
(01:20:28):
makes them easier to understand, at least I hope so. That's the premise of the podcast. So yeah, I think people will understand the gist of it, and there's always the paper if they want the details. Okay, so tell us what you found.
Jean-Rémi King (01:20:41):
Sure. I mean, I didn't mean to underestimate your audience, I know this is a pretty advanced audience. So the result is that if you enhance these GPT-2 LLMs, so a language model, with long-term forecast predictions, the activations end up being more similar to the brain. That's
(01:21:03):
sort of the basic finding. And this is not the case everywhere in the brain. It's the case really in the standard language network. So it's Superior Temporal sulcus and gyrus, prefrontal areas, especially IFG, a bit of the Angular Gyrus, but it's not the case in, let's say, the
(01:21:25):
ventral visual stream or in the motor areas. I have a doubt, I haven't looked at the picture recently, I don't know whether we have a gain in any of the voxels in the motor cortex. But generally speaking, it's really the expected language network. It's typically the type of areas that you would end up with if you were to do a localizer on language as opposed to some other task. And so those are the regions which are better explained by, more
(01:21:47):
similar to, the algorithm which has a long-range forecast than the short-range forecast one. And from this, we can systematically decompose how the forecast is structured, because we can systematically vary whether the forecast should be short-range or long-range or middle-range. And so we try predicting the next word, or two words from now, three
(01:22:11):
words, four words, and so on and so forth. And it peaks, I think, again, between eight and 10 words. It varies slightly, it's not an exact number, depending on the voxel you look at. And what's really interesting is that the range of the forecast depends on where
(01:22:35):
you are in the hierarchy of language. For instance, if you're around the primary auditory areas, the forecast seems to be peaking at a shorter range than if you are in the prefrontal cortex. Again, this is not just the representation, it's the prediction.
(01:22:55):
You would have a better model of the prefrontal cortex if you enhance your language model with a long-range forecast, and you would have a better model of the auditory areas if you had a short-range forecast, and you have sort of these gradients in between those two extremes.
Stephen Wilson (01:23:16):
Yeah.
Jean-Rémi King (01:23:17):
That's sort of one dimension of this forecast structure. And the other dimension is not how far ahead the forecast happens, but how deep it is. And so for each future word, we can try to predict the word level, which is sort of the lowest possible level, but also the
(01:23:39):
representation that it has in the first layer, the second layer, the third layer, and so on and so forth. And that gives us sort of a level of abstraction, loosely defined as how deep the representation is in a transformer. And again, for each voxel in the brain, we can say, is it better to have a forecast which is rather shallow or rather deep in the network.
(01:24:00):
And again, we observed that prefrontal and parietal areas tend to be associated with deeper forecasts, and auditory areas tend to be associated with shallower forecasts. So that resonates a lot with this idea of predictive coding, where you would have not just one prediction, but a hierarchy of
(01:24:21):
predictions. And these predictions are organized similarly to the hierarchy of inference, of representation, which is that lower-level areas represent the past and predict the future at a relatively short timescale and a relatively shallow level, whereas the deepest levels of the language network would be learning and representing much
(01:24:46):
longer contexts and would be anticipating much further away, well, 'much' is perhaps an exaggeration, it's further away than the lower-level regions, and would be predicting these more abstract levels of representation.
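As a rough illustration of how those two dimensions can be mapped, the sketch below sweeps forecast distance (words ahead) and depth (transformer layer) and records, for each voxel, the combination that yields the best encoding score. The arrays, layer indexing, and the ridge scorer are illustrative assumptions, not the paper's exact pipeline.

```python
# Sweep forecast distance and layer depth; keep the per-voxel best combination.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_predict

def brain_score(X, y):
    """Cross-validated per-voxel correlation of a ridge encoding model."""
    pred = cross_val_predict(RidgeCV(alphas=np.logspace(-2, 4, 7)), X, y, cv=5)
    px = (pred - pred.mean(0)) / (pred.std(0) + 1e-8)
    py = (y - y.mean(0)) / (y.std(0) + 1e-8)
    return (px * py).mean(0)

def forecast_map(acts_by_layer, bold, distances=range(1, 11)):
    """acts_by_layer: {layer_index: (n_words, n_dims) activations}; bold: (n_words, n_voxels).
    Returns, per voxel, the best score and the preferred forecast distance and depth."""
    n_voxels = bold.shape[1]
    best_score = np.full(n_voxels, -np.inf)
    best_dist = np.zeros(n_voxels, dtype=int)
    best_layer = np.zeros(n_voxels, dtype=int)
    current = acts_by_layer[0]                              # "current word" features
    for layer, acts in acts_by_layer.items():
        for d in distances:
            feats = np.hstack([current[:-d], acts[d:]])     # current + future(layer, d)
            scores = brain_score(feats, bold[:-d])
            better = scores > best_score
            best_score[better] = scores[better]
            best_dist[better], best_layer[better] = d, layer
    return best_score, best_dist, best_layer

# Toy usage with synthetic data; in real data, auditory voxels are expected to
# prefer short distances and shallow layers, prefrontal/parietal voxels the opposite.
rng = np.random.default_rng(0)
acts_by_layer = {l: rng.normal(size=(300, 64)) for l in range(3)}
bold = rng.normal(size=(300, 50))
score, dist, layer = forecast_map(acts_by_layer, bold, distances=range(1, 4))
```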
Stephen Wilson (01:25:01):
Yeah, and it looks like, specifically, prefrontal, not premotor, and Angular Gyrus is the part of the parietal lobe which looks to be the most extreme on that measure. And then also, I'd say ventral temporal, kind of inferior temporal. I mean, that also all
(01:25:21):
kind of makes sense in terms of being further downstream than those primary auditory areas. I probably should go and eat dinner with my family, but I do want to ask you one more thing, if I can?
Jean-Rémi King (01:25:37):
Sure.
Stephen Wilson (01:25:38):
You have this really neat analysis. It's very complex. (Laughter) I'm a little bit hesitant, but I'd love to hear you explain it, where you look at this semantic versus syntactic forecast. I mean, this is a topic which I just think is interesting, the extent to which we can, you know, parcellate out the language network along those
(01:26:01):
lines. So can I get you to tell us how you distinguish between those different kinds of predictions?
Jean-Rémi King (01:26:10):
That's actually, I think, my favorite paper from the PhD of Charlotte Caucheteux, who defended her PhD recently. So this analysis is derived from that paper: we had a paper at ICML, I think in 2021, where we developed this analysis to disentangle syntactic from
(01:26:33):
semantic representations in the brain using language models. And here, we're just applying this analysis in the context of forecasts, whereas in that paper we applied it in the context of just representations. But it's completely analogous, analytically speaking, and the idea is not that difficult.
(01:26:54):
The paper is quite mathy, but the idea is, I think, pretty simple. Usually, what we do is we compare a deep net to the brain in response to the same inputs. So the deep net hears ‘once upon a time’, the participant hears ‘once upon a time’, and we evaluate whether the activations are similar to the activations of the brain. And in
(01:27:15):
this paper, we thought, okay, perhaps what we can do is not present the same inputs, but present an input with the same syntactic structure and a different semantic content. So for instance, ‘once upon a time’, I'm not able to parse this quickly. (Laughter) I can take another example. If you take the following sentence, ‘the giant bowl is on the table’, you can create a sentence which is ‘a red cow
(01:27:44):
lies near the house’. It has the same constituency tree, it has the same dependency tree, but of course it doesn't have the same meaning. And so what we do in this paper is, we made a little algorithm which generates a ton of sentences, and we try to
(01:28:08):
optimize this algorithm to, basically, at the end, generate sentences that have the same dependency tree as the original sentence. And we present them to the algorithm, and we extract the activations for each of those sentences, which are syntactically matched. And the result of this process is that we have
(01:28:33):
activations in the deep net in response to sentences with the same syntactic structure. And we can use those activations to try to predict brain activity. So with this, basically, what we have is a model that tells us what the expected activations are, given the syntactic structure. And this model is not
(01:28:57):
derived from linguistics, we don't put in any ideas about merge and movement and all this. It has some constraints, because we do generate sentences which have the same dependency trees, so it's not completely random either. But it's kind of a linguistics-free model, in that sense. And we can
(01:29:21):
try to see which areas are predicted effectively by these syntactic activations in the model, as opposed to a full language model. Okay, so that's one analysis. And then we can compare these
(01:29:42):
effects to a random model, or a model which only has access to position. So you generate sentences which have the same number of words, but they don't have the same syntactic structures. And finally, compare this to a model which has the exact same sentence, and so has both syntax and semantics. And by doing this systematic comparison, we can try to see
(01:30:03):
which areas basically are accounted for by syntactic representations, which areas are accounted for by semantic representations, and which areas are associated with both representations, so you need both syntactic activations and semantic activations to best account for the activation in a given voxel. And so that's what we do here in this paper with forecasts, and what we observe,
(01:30:25):
very briefly, is that the syntactic forecast seems to be relatively shallow and relatively around the Superior Temporal gyrus and Superior Temporal sulcus. And it's not heavily associated, I was a bit disappointed by this, but that's just the data, it's not associated heavily with, for instance, IFG or with the Angular Gyrus. It
(01:30:46):
tends to be relatively centered around the temporal lobe, whereas the semantic forecasts appear to be more distributed. So that perhaps is a clue, it's a first step towards trying to
(01:31:07):
systematically decompose these activations into something that we can relate our theories to, as opposed to just saying, this is a similar activation between the deep net and the brain, but we have no idea what this activation actually represents.
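As a toy illustration of the matching idea described here, the sketch below uses a dependency parse to keep only candidate sentences whose dependency-label sequence matches the original, averages GPT-2 activations over those matched sentences to get a "syntax-only" feature, and keeps the original sentence's activations as the full (syntax plus semantics) feature. The candidate sentences, the matching criterion, and the fallback are simplified assumptions; feeding these two feature sets into the encoding model sketched earlier would give syntax-only versus full brain scores. Assumes the spaCy en_core_web_sm model is installed.

```python
# Build "syntax-only" vs "full" sentence features from dependency-matched sentences.
import numpy as np
import spacy
import torch
from transformers import GPT2Model, GPT2TokenizerFast

nlp = spacy.load("en_core_web_sm")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2").eval()

def dep_signature(sentence):
    """Sequence of dependency labels, a crude proxy for the dependency tree."""
    return tuple(t.dep_ for t in nlp(sentence))

def activation(sentence):
    """Mean GPT-2 hidden state for a sentence (one vector per sentence)."""
    with torch.no_grad():
        ids = tok(sentence, return_tensors="pt").input_ids
        return gpt2(ids).last_hidden_state.mean(1).squeeze(0).numpy()

original = "the giant bowl is on the table"
candidates = ["a red cow lies near the house",
              "the small dog sits on the bed",
              "my old car waits in the street"]

# Keep dependency-matched candidates; fall back to all if the parser disagrees.
matched = [s for s in candidates if dep_signature(s) == dep_signature(original)] or candidates
full_feat = activation(original)                                  # syntax + semantics
syntax_feat = np.mean([activation(s) for s in matched], axis=0)   # syntax only
print(len(matched), "syntactically matched sentences")
print("cosine similarity of syntax-only to full features:",
      float(np.dot(full_feat, syntax_feat)
            / (np.linalg.norm(full_feat) * np.linalg.norm(syntax_feat))))
```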