
February 24, 2023 62 mins

This episode welcomes Nils Reimers, Director of Machine Learning at Cohere and former researcher at Hugging Face, to discuss Natural Language Processing, Sentence Transformers and the future of Machine Learning. Nils is best known as the creator of Sentence Transformers, a powerful framework for generating high-quality sentence embeddings that has become increasingly popular in the ML community, with over 9K stars on GitHub. With Sentence Transformers, Nils has enabled researchers and developers (including me) to train state-of-the-art models for a wide range of NLP tasks, including text classification, semantic similarity, and question answering. His contributions have been recognized by numerous awards and publications in top-tier conferences and journals.

Resources to learn more about Nils Reimers and his work:

https://www.nils-reimers.de/

https://www.sbert.net/

https://scholar.google.com/citations?...

https://cohere.ai/

Resources to learn more about Learning from Machine Learning:

https://www.linkedin.com/company/learning-from-machine-learning

https://www.linkedin.com/in/sethplevine/

https://medium.com/@levine.seth.p

YouTube Clips

02:29 What attracted you to Machine Learning?

06:32 What is sentence transformers?

28:02 Benchmarks and P-Hacking

33:53 What’s an important question that remains unanswered in Machine Learning?

38:41 How do you view the gap between the hype and the reality in Machine Learning?

50:45 What advice would you give to someone just starting out?

52:30 What advice would you give yourself when you were just starting out in your career?

57:22 What has a career in ML taught you about life?


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
How did the best machine learning practitioners get involved in the field?

(00:09):
What challenges have they faced?
What has helped them flourish?
Let's ask them.
Welcome to Learning from Machine Learning.
I'm your host, Seth Levine.
Welcome.
I have the honor of having Nils Reimers on the show today, the Director of Machine Learning
at Cohere, former researcher at Hugging Face, the creator of Sentence Transformers,

(00:34):
and researcher on dozens of papers in NLP.
Welcome to the show.
Great.
Yeah, really happy to be here.
Why don't we start off, just give us some background on your career journey.
How'd you get to where you are today?
Yeah, so it actually started, I would say, a long time ago.

(00:55):
So I trained my first neural network in the early 2000s.
So at that time, I was playing with this Lego, I don't know, these Lego robots you can
build.
And so I thought, okay, maybe it's cool to add some artificial intelligence to the robot
and control the robot with a neural network.

(01:15):
And yeah, this was kind of toy example in the early 2000s before, I don't know, AI was
in the media or in the hype.
And another key point was like in 2009, I had a great lecture, Berkeley's Introduction
to AI, with a super awesome professor who introduced the concepts with Pac-Man.

(01:40):
So every concept he presented started with Pac-Man, like some A-star algorithm to find the
shortest distance between two points, which is like for a route planner.
And then he showed, how can you do it with Pac-Man, or reinforcement learning,
how can your Pac-Man become smarter, how can your ghosts become smarter.
And it had a lot of challenges in a playful way, it was like super amazing to do this.

(02:04):
But then yeah, sadly, I did not go directly into machine learning and AI.
I first did like a detour into information security, so I did a master's degree in information security.
But then after the master's degree, I said, okay, I want to go back to artificial intelligence,
machine learning, and started to do my PhD in that field.

(02:26):
Very cool.
So what was it that you initially figured out like that machine learning is something
different and it's something powerful?
I think it was super fun in the beginning.
So it was like super fun to watch your Pac-Man and try to build it smarter.
And it was like really stupid in the beginning.

(02:47):
So you see, okay, your Pac-Man was hunted by the ghost, and then it takes the wrong turn.
So it goes to the left, into a dead end, into the trap, instead
of taking the right turn to victory.
You like yelled at the machine and said, why did you take the left turn?
Why didn't you go to the right?
Why are you so stupid?
And why is it so hard to make you smart?

(03:10):
And so it was extremely playful and it triggered your ambition.
So you want to say, okay, I want to make it better and better and better.
With every iteration it got better, and it was fun to create.
And then after the master's degree, I said, okay, these are super powerful things you can build.

(03:31):
And at that time I said, okay, language has a lot of potential because so
much information is stored in language.
So whether it's spoken language or written language like text, if a machine can understand
text, can understand what's written in Wikipedia and all the books, all the news articles everywhere

(03:51):
and everything, if a machine can really understand this, this will give you an
extremely powerful machine.
So Pac-Man was nice and fun, but at the end, a smart Pac-Man can just win the game.
But if you have a machine that can read text and can produce text, that's
super amazing and super powerful.
And that's what got me interested in 2013.

(04:13):
Awesome.
So what was one of your first major projects in natural language processing?
Yes.
So at that time, NLP was a completely different field.
So my professor, Iryna Gurevych, with whom I started my PhD, she heard a lot of buzz from the computer
vision domain about neural networks.

(04:35):
So neural networks had a lot of hype really a long time ago, like in the 50s and 60s and
70s.
And half a century ago, people were extremely excited about neural networks, but it didn't
work out.
And people said, no, neural networks, that's not working.
That's like bad technology and you shouldn't use it.

(04:55):
And I don't know, I know some colleagues who talked to Hinton in the early 2000s and he
started to talk about neural networks again.
And they were like, oh, please stop talking about this boring topic.
I don't want to learn anything about neural networks.
It's bad technology.
Why are you still doing that?
Then, in 2012, there was the ImageNet moment, where the neural network from Hinton's group was

(05:20):
so much better on ImageNet at recognizing images than any system seen before
that computer vision got extremely excited about neural networks again.
And then my professor asked me in 2013, hey, is neural networks something that might be
relevant for NLP?
So in NLP at that time, we were still using Naive Bayes, support vector machines, bag

(05:45):
of words, TF-IDF, these types of features and stuff.
And yeah, my first task, or the whole task for my PhD, was to figure out whether neural networks
will have an impact on NLP and what type of impact they will have.
And the first project was named entity recognition.

(06:05):
In a text, can I identify named entities like company names, person names
and so on?
And yeah, but more generally, it was like neural networks, how will they change NLP
and how we do machine learning?
Awesome.
Having the creator of Sentence Transformers, I would love to dive into it with you.

(06:31):
Can you explain what is Sentence Transformers?
What was its original goal?
And why is it so powerful?
Yes, so Sentence Transformers is an open source library which maps text into a vector space.
Sounds a bit abstract.
So for us as humans, when we read text, we can make sense of the text.

(06:55):
But for computers, it's really, really hard, if they just look at the text, just
read the text, to really make sense of it.
So they don't understand the words, how the words are connected.
So for example, the word hotel and motel, it's like these things are really similar.
But from a computer perspective, these are two different tokens, two different words.

(07:16):
And the computer doesn't know like how are these two words connected?
What are concepts?
Are they similar, dissimilar and so on.
And so what we do is take the text written by humans and transform it into a representation
that's understandable to a computer, and here vector spaces are extremely powerful.

(07:36):
And then in these vector spaces, we can embed and encode relationships of words.
So like hotel and motel, yeah, both are, it's a place, a physical place, which has rooms
where you can go and stay overnight, which has a reception where you pay money to stay
overnight.
And then we can encode all this information we have into the vector space so that the computer

(08:02):
can start to reason about the text, like really understand the text.
And yes.
Yeah.
So, so the text embeddings are basically taking the text and converting it into a numerical
representation.
And then you can sort of do operations with the text.
Can you speak to that?
Correct.

(08:22):
So it encodes all the information we have in the text, all the semantics
of the text, and makes it accessible to the computer.
And then you have certain dimensions, certain directions.
So for example, you have singular and plural words.
You have like, for example, I don't know, gender, you know, king is connected to male

(08:46):
and queen is connected to female.
You know, the relationship between London and England, that London is the capital of
England.
And then, you know, okay, what's the capital of Germany?
And then we can take the same relationship, the same direction, the vector space to infer,
okay, the capital of Germany is Berlin.

(09:06):
And then the computer can learn from all these encoded relationships
of words and sentences and phrases and paragraphs and infer meaning, like, okay, in which direction
and how do I talk about that?
Singular, plurals, synonyms, relationships like capitals, politicians, companies and

(09:26):
their relationship to founders.
Basically, you encode all these relationships you kind of have about the world into the
vector space to enable efficient access for the computer to the information
in the text.
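(For readers following along, here is a minimal sketch of that idea in code, using the open source sentence-transformers library; the model name is just an illustrative choice, not necessarily the one discussed in the episode.)

    from sentence_transformers import SentenceTransformer, util

    # Illustrative model choice; any sentence-embedding checkpoint works the same way.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    sentences = ["I booked a hotel for the night.",
                 "We stayed in a motel near the highway.",
                 "The capital of Germany is Berlin."]
    embeddings = model.encode(sentences)  # one vector per sentence

    # Related sentences end up closer together in the vector space.
    print(util.cos_sim(embeddings[0], embeddings[1]))  # hotel vs. motel: relatively high
    print(util.cos_sim(embeddings[0], embeddings[2]))  # hotel vs. capital city: lower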
What do you view as the biggest jumps in text embeddings, from, you know, bag of words to word2vec

(09:49):
to where we are today?
Um, yes.
So the first big splash, which caused a lot of interest, was word2vec.
So before that, you represented text as, yes, unique tokens, like one-hot

(10:12):
encoding, one hot encoding.
There was like no relationships.
So the distance between hotel and motel, and the distance between hotel and dock, was the
same.
So for a computer it was impossible to know that motel is probably more similar
to hotel than dock is to hotel.
And then word2vec enabled this on a word level.
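(A hedged sketch of that word-level idea, using pretrained word vectors loaded through gensim; the vector set named below is one of gensim's downloadable options and only an example.)

    import gensim.downloader as api

    # Small pretrained GloVe vectors, used here purely as an example word-embedding set.
    vectors = api.load("glove-wiki-gigaword-50")

    print(vectors.similarity("hotel", "motel"))   # similar words: high score
    print(vectors.similarity("hotel", "banana"))  # unrelated words: much lower

    # The classic analogy direction: king - man + woman lands close to queen.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))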

(10:35):
So they had a really cool paper showcasing that you can encode this on a word level.
And then the second big splash was in 2017, 18, with ELMo, which was contextualized word2vec.
So the issue with word2vec is that it's on a word level.

(10:57):
So it knows, okay, hotel and motel are close by, and apple and banana and strawberry have certain
relationships.
But we use words in a bigger context, like the word apple: I can refer to the fruit, or I
can refer to the company, or I can refer probably to some movie or song or some podcast series

(11:18):
or some website.
So how do you know what apple stands for in this context?
So when I say apple is a great company, I probably talk about the company.
If I say apple is my favorite food, I talk about the food.
And yeah, ELMo was the first way to show that you can compute contextualized word embeddings,

(11:38):
meaning the model learns, do you talk about the company apple or about the food apple
or some other meaning of apple.
And this enables like, yeah, more complex, better understanding of how the words are
used.
And then on top of that, we started to build like understanding of sentences and understanding
of paragraphs.

(11:58):
Right.
So yeah, so the other ones that I view as really big stepping stones are things like
Top2Vec, you know, just being able to represent either words, sentences,
you know, full documents, numerically, and then being able to sort of do these operations.

(12:19):
So creating powerful text embeddings is obviously something interesting for people in the NLP
field.
But you know, so what?
You know, why should businesses care?
Why should people outside of NLP care?
So yeah, my favorite application here is search.
Because so far, search is horrible
in a lot of applications, if you exclude Google and Bing and so on.

(12:44):
So for example, if you go on Wikipedia and hit that search bar and ask a question, what's
the capital of the United States?
The first entry, the first search result, is about capital punishment in the US.
The article about Washington DC is not even in the top 20 search results.
And yeah, that's a complete failure.

(13:04):
So even though the search query is simple, it doesn't retrieve any relevant results.
Because Wikipedia right now, and like most other search systems in the world, use lexical
search where the system has no understanding of the text.
So it has no understanding that capital of the United States is connected to Washington

(13:28):
DC.
So it has no idea that there's a relationship between this and Washington DC.
Because if you look at the surface level, just at the characters, it's different:
capital of the United States and Washington DC.
That's different.
And with embeddings, we have these relationships built in.
So the vector space knows that Washington DC and capital of the United States is connected.

(13:52):
And also that United States and US and USA are connected.
So it can retrieve extremely good search results.
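(As a rough sketch of what such an embedding-based search looks like with the sentence-transformers library; the model name and the tiny corpus are illustrative only.)

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # illustrative model choice

    corpus = ["Washington, D.C. is the capital of the United States.",
              "Capital punishment is a legal penalty in some U.S. states.",
              "Berlin is the capital of Germany."]
    corpus_emb = model.encode(corpus, convert_to_tensor=True)

    query = "What is the capital of the United States?"
    query_emb = model.encode(query, convert_to_tensor=True)

    # Rank the corpus by embedding similarity instead of by word overlap.
    for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
        print(round(hit["score"], 3), corpus[hit["corpus_id"]])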
Also these embeddings, they make classification much nicer.
So I built a system to filter spam and unwanted content from my inbox.
I got a lot of cold emails from people trying to sell me stuff, or headhunters trying

(14:17):
to send me CVs where I say, no, please move it away from my inbox.
And yeah, with these embedding approaches, I can say, here are five examples of unwanted
emails, and now it works extremely well and filters out all the unwanted emails just by
providing these five examples, because it learned what I don't want to see.

(14:39):
I don't want to have any cold calls, cold emails, people trying to sell me stuff.
So please filter it out.
Here are five examples of how such a cold email looks, and then it learns that and
then knows and can generalize to other content in the field.
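(A minimal sketch of that few-example filter, assuming an embedding model and a hand-picked similarity threshold; both are illustrative, not the exact setup described in the episode.)

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

    unwanted_examples = [
        "Quick call this week? I'd love to show you our lead-gen platform.",
        "We have three great candidates, can I send over their CVs?",
        "Last chance: 50% off our outreach automation tool.",
        "Are you the right person to talk to about procurement software?",
        "Following up on my previous email about our SEO services.",
    ]
    unwanted_emb = model.encode(unwanted_examples, convert_to_tensor=True)

    def looks_unwanted(email_text, threshold=0.5):  # threshold is an arbitrary starting point
        emb = model.encode(email_text, convert_to_tensor=True)
        # Flag the email if it is close to any of the few provided examples.
        return util.cos_sim(emb, unwanted_emb).max().item() > threshold

    print(looks_unwanted("Hi! Can I send you our sales deck and book a quick call?"))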
Right.
So embeddings are sort of the first step.

(15:00):
You're going to take a text and then you represent it.
And then you can start to do similarity tasks, which allow you to do search better, information
retrieval, text classification.
And as you were referring to, there's different techniques to do search.
So lexical search, that's more where you're just looking for exact matches.

(15:25):
And then neural search is where you get to use sort of the power of text embeddings.
Are there any applications where neural search has limits, where there are limitations
for neural search?
Yeah, of course.
I mean, there's always like pros and cons.
Obviously like neural search, I mean, neural search is like a really broad topic.

(15:51):
It's not only embeddings.
It's like, I don't know, probably 20 different techniques.
How you can use these technologies to improve search results.
These embedding approaches themselves have challenges if you want to do lexical
search.
So, I don't know, if you search for a phone number, you want to find the entry with
this specific phone number.

(16:12):
And there's not like a lot of semantic meaning in a phone number.
So you cannot infer, hey, at position five, there's a seven.
You cannot say, okay, there's like a lot of meaning.
Or if these two positions are off by one number, that's a completely different phone number.
So this is like a limitation.
These approaches, obviously they have challenges understanding new words and learning these

(16:35):
concepts of new words.
So we constantly innovate: new companies, new products, new movies are released, new people
become known.
So a big question in the field is, how can these models learn these new concepts
and the relationships between these new concepts?
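(One common workaround for the exact-match limitation mentioned above is hybrid search, mixing a lexical score with an embedding score. The sketch below uses the rank_bm25 package for the lexical side; the mixing weight and the raw-score combination are simplifications, since in practice the two score scales would be normalized first.)

    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    docs = ["Call us at 555-0192 for support.",
            "Our support line is open on weekdays.",
            "Washington, D.C. is the capital of the United States."]

    # Lexical side: exact-match signals (good for phone numbers, IDs, brand-new terms).
    bm25 = BM25Okapi([d.lower().split() for d in docs])

    # Semantic side: embedding similarity (good for paraphrases and related concepts).
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    doc_emb = model.encode(docs, convert_to_tensor=True)

    def hybrid_scores(query, alpha=0.5):  # alpha is an arbitrary mixing weight
        lexical = bm25.get_scores(query.lower().split())
        semantic = util.cos_sim(model.encode(query, convert_to_tensor=True), doc_emb)[0]
        # Simplification: the raw scores live on different scales; normalize them in practice.
        return [alpha * l + (1 - alpha) * s.item() for l, s in zip(lexical, semantic)]

    print(hybrid_scores("555-0192"))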

(16:56):
Yeah.
So just, just digging into a sentence transformers, you know, seeing that it has like 9,000 plus
stars on GitHub and 25 million plus installations.
How has the package changed over time?
What's been your experience with open source?
Can you speak to that?
Yeah, sure.

(17:18):
So yeah, originally what a lot of researchers do in the field, they do some research and
then they mainly publish the code to reproduce their research.
Say, hey, here's my paper and here's some code you can run to get the same numbers.
That's like the common understanding in the machine learning research community, which

(17:39):
in my opinion limits the usefulness of this software.
I mean, you build amazing models, amazing tools, but other people do not really want
to use your tool to get the same numbers.
I mean, it's kind of boring as an output to say, yeah, at the end it prints out 82.5 and

(18:02):
that's the same number I put in my paper, but they want to take your software and build
a cool tool.
They want to do a semantic search on Wikipedia or do a semantic search on the emails or do
a semantic search on notes or do a semantic search on podcast transcript.
So this changed a lot.
So in the beginning I was similar, say, okay, mainly publishing code just to reproduce the

(18:24):
experiments and results from the paper to more, no, no, let's make a product out of
it.
So what's a cool tool coming from research which allows other people to build cool product
and cool use cases out of this?
And that's then the main thing I did in the past years in all the research.
Say, okay, do some research.

(18:44):
How can we enable X?
And then if we find a way to do this, build a product, an open source product, and give it to
people to use it to build other cool stuff.
Over the three plus years having that library, what are some of the biggest challenges that
you faced?
I mean, as a researcher, you are mainly judged on, our main criterion is, the amount of

(19:11):
research you put out, like how many papers you publish, what your citations and contributions are.
If you publish and maintain like an open source library, there's like, you have to do work
in terms of maintenance, update it, update it to the most recent version of Python or
dependencies.

(19:32):
Do some bug fixing because someone wants to use it on a MacBook.
And this takes time away from you doing research.
And so it can happen that you do all this work, which is amazing for the community,
but you don't have any time anymore to do research and contribute and improve human knowledge.

(19:55):
And so you have to find the right balance.
Right.
Yeah.
So finding that balance between having a library that's useful for as many people as possible
where they can use it as like a huge building block for their work.
And then also you want to be continuing to push your limits and continuing to expand

(20:15):
the work that you're doing.
One of the most exciting use cases of Sentence Transformers, for me at least, is SetFit.
I see that you're an author on that paper with some people from Intel and Hugging Face.
Can you talk about the experience of creating SetFit, or helping to?

(20:36):
Yeah, sure.
So when OpenAI published GPT-3, they had a paper showing that these large generative
models are extremely good at few-shot classification, and it started a lot of hype.
So if you write the right prompt, you can classify if a news article is about sports

(21:00):
or business or technology.
They showed, okay, you only need really few examples.
So you show the model like three examples of sports news articles, three examples about
business and three examples about technology.
And then for a new article, the model can infer the category.

(21:20):
But if you really use this, it becomes cumbersome to use.
People invented the term prompt engineering because it makes a difference
if you add a colon, semicolon or an exclamation mark at the end of the prompt, and sometimes it's
helpful if you ask, please classify this article, instead of, classify this article.

(21:45):
So it becomes really, really cumbersome, really hard to use from a user perspective.
And then last year Moshe from Intel AI said, okay, I think with embeddings, this is much
easier.
So let's just take the examples, take your three examples of tech news, business news
and sports news, embed them in a vector space, and train a classifier on the vector space

(22:08):
to see where in the vector space the tech articles, business articles and sports articles are.
And assuming the model, or the vector space, has learned all these relationships about
sports and players in sports, and players in technology, and people
in politics, we can use this for few-shot classification.

(22:32):
And yeah, what we showed, or Moshe showed, is that this can be much better than GPT-3 in a few-shot
setting while using a much smaller model, like a model you can run on your phone,
a model that's like 50,000 times more efficient and faster.

(22:53):
And then he approached me and said, hey Nils, I found this super cool tool.
It's super amazing.
Do you want to do research on it?
And when we tested it, we were totally amazed, because you don't have to do any of this
prompt engineering where it makes a difference if you end the prompt with a period or an
exclamation mark or a colon.
It works really nice.

(23:13):
It can scale to any size of training data.
It's super efficient.
It runs on your phone.
It's better than GPT-3 on your phone for text classification.
It works extremely well in a multilingual setting.
So we tested these in-context learning examples for different languages.
We did not find any method, for example, that worked in Japanese.

(23:37):
So we tried really hard and connected also with native speakers.
So the girlfriend of one of the co-authors is Japanese.
So we really made sure that we got the right Japanese prompt,
and asked if she had some ideas for how we could modify the prompts to do classification in Japanese.
But yeah, with these embedding approaches, because they are language agnostic or can

(23:58):
be language agnostic, it doesn't matter if your examples are in English or German or
Japanese or Arabic.
So you take like three Japanese news articles, you say, okay, these three Japanese articles are tech,
sports, business.
And then you train the classifier and then you have an extremely powerful few-shot
classification system in Japanese.

(24:19):
And yeah, we were totally amazed by the ease of use, because it's super easy to use and super fast
to run.
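(To make the recipe concrete, here is a stripped-down sketch of the embed-then-classify idea: encode a handful of labeled examples and fit a small classifier head on the embeddings. SetFit itself goes further and first contrastively fine-tunes the embedding model on pairs built from those few examples; the model name and the toy data below are illustrative only.)

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    # Illustrative model choice; a multilingual checkpoint would make the same
    # recipe work for Japanese, German, Arabic and so on.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    train_texts = ["The striker scored twice in the final.",
                   "The club signed a new goalkeeper.",
                   "Shares fell after the earnings report.",
                   "The startup raised a new funding round.",
                   "The new phone ships with a faster chip.",
                   "The update patches a security flaw."]
    train_labels = ["sports", "sports", "business", "business", "tech", "tech"]

    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(model.encode(train_texts), train_labels)

    print(classifier.predict(model.encode(["The quarterback threw three touchdowns."])))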
Yeah, I have been using it and I absolutely love it.
The results that you get are incredible.
Yeah, ease of use is, it's like, it's a pleasure.
It's really nice to use.
You can run it on your laptop.

(24:42):
The beauty of it for me, I think, is the contrastive learning application there,
and how you can take an embedding and leverage the information that you have, right?
That these two samples are similar and these two samples are not similar.
And then you can use that to create an even better embedding that can help you with whatever

(25:05):
your downstream task is, say, text classification.
In the NLP field, I feel like it's very hard to know when you're making meaningful progress.
I know that you have spent time exploring different sorts of benchmarks.

(25:28):
I'd love to just get your take on it.
Can you speak to the big NLP benchmarks, the GLUEs and SuperGLUEs and all of that?
Yeah, sure.
So yeah, machine learning field, NLP field loves benchmarks.
That's totally what people do.
You take a benchmark, you see what's the latest number.

(25:50):
Let's say it's 84.0.
You try a lot of tricks and then you get a better number, 84.2.
And then you think, yeah, 84.2 is higher, larger than 84.0.
So you write a paper and say, this is better.
We seldom really ask, is it really better?

(26:11):
Is this 0.2 improvement, 0.1, or 1.5, whatever the delta is, really, really better?
And I experienced this with sentence transformers itself.
The original models from the paper in the benchmarks that were used at that time and
which were the common standard in the field, the sentence transformer, the first version

(26:33):
of the Sentence Transformers models looked better in all the benchmarks than, for example,
the Universal Sentence Encoder, which was like one, two years older.
But if you really used it for your own application and really tried both on your own stuff, you often
saw, OK, no, Universal Sentence Encoder was at that time much, much better than Sentence

(26:54):
Transformers.
And so how can there be such a gap?
Like all the benchmarks say, yeah, Sentence Transformers, Sentence-BERT, this one is
better than Universal Sentence Encoder.
But this got me interested, like, OK, how good are our benchmarks?
Are we really benchmarking what we want?
And how can we create better benchmarks?

(27:15):
And sadly, a lot of benchmarks in the field become, or are becoming meaningless now.
So in the beginning, when you publish a benchmark, it's useful, it's meaningful.
But then over time, the value of the benchmark decreases.
And at some point, it hits like zero value.

(27:35):
So it's nowadays completely meaningless if you get a new state-of-the-art result on GLUE,
because the benchmark is over-saturated, overfitted.
It doesn't really tell you anything.
So even if the model is better on the GLUE benchmark, it doesn't tell you anything about whether the
model is really better, whether people will use it.

(27:56):
Right.
And I love the comparison that you made in a recent talk, where you were talking
basically about how, once these benchmarks are out and you have all these hundreds of
people trying to work on them to make the numbers better, you're going to be overfitting
on that particular data set.
You don't know how it's necessarily going to generalize.

(28:18):
And you talked about how it was similar basically to like, you know, p-hacking, right?
Like if you run a hundred different experiments, yeah, you're going to see that something is
correlated with something else, but it doesn't mean it necessarily is a meaningful relationship
between the two.
Yeah.
I mean, if we take, for example, GLUE, GLUE just has short text classification or

(28:44):
text understanding tasks in it, which are mostly on the sentence level.
So, take this sentence and tell me if it has a positive or negative sentiment.
And yeah, what people did working on this, they found ways how to train models that are
better on the Glue benchmark.
So now a lot of papers and model reports say, yeah, we just train our pre-trained models

(29:10):
on short sequences.
So we just pre-train it on sequences up to, I don't know, 128 word pieces.
Because to work well on the GLUE benchmark, we don't need long text understanding.
It's sufficient if the model just understands the sentence.
And yeah, it's sufficient for GLUE.
And then people put it out and you think, okay, this is a great model, it performs well

(29:33):
on GLUE and you use it for your own application.
But sadly your application is, I don't know, email classification, which is longer than
a sentence.
And there it works really badly because it was never trained to work on like longer text,
never tested on longer text.
And maybe the older model, BERT, which was not overfitted heavily on GLUE, is much,

(29:55):
much better because it works well for your emails, which are longer than a sentence.
Yeah, it reminds me, maybe it's like even the same thing.
How similar are the datasets that you work on when you're training and evaluating your
model offline and then when you put it into production and you see the real data that

(30:17):
it's dealing with and then you realize like, oh, it's not performing the way that I thought
it was.
But if you look at why, sometimes it can be fairly clear if you trained it all on short
text, so then it's not gonna work well on long text.
One thing that I find to be really incredible and yeah, the datasets that you used to train

(30:40):
sentence transformers, the variety of them, can you talk about that a little bit?
Yeah, sure.
So yeah, the original Sentence Transformers model was just trained on a small dataset
for NLI, natural language inference.
So there's the Stanford NLI dataset, where everything is short

(31:02):
text, really clean, nicely written.
And as mentioned, the original model was not as good as Universal Sentence Encoder.
That was like really bugging me because Universal Sentence Encoder was trained on a lot of Google
internal data, like millions, billions of data points from all sources.

(31:23):
But sadly, it was not like publicly available, so it was like Google internal data.
So yeah, I spent a lot of time on Sentence Transformers to get more and more
data and then, over more and more iterations, make every dataset available to build up
a public open source collection.
And so now, publicly available, there are over a billion training pairs from

(31:46):
all places like Stack Overflow, Stack Exchange, Reddit, news articles, duplicate questions,
and image captions.
And yeah, this allows the model to learn more.
So not only to learn what the relationships between two nicely, cleanly written sentences are.
So that's what the first version of sentence transformers was trained on and sadly evaluated

(32:10):
on.
So now it was trained on like a whole bunch of texts, like really ugly, noisy, social
media text full of hashtags and emojis.
And now the model understands like emojis and hashtags and knows, okay, what's the similarity
between hashtags and relationship in hashtags?
And this gives you a much, much better model.

(32:32):
Sadly, sometimes if people use it on the old benchmarks on the nicely cleanly written text,
it doesn't perform as well as models overfitted on these settings.
And people say, oh no, how is this state of the art if it's like two points weaker on
this absolutely irrelevant benchmark? Because we have big trust in benchmarks and think,

(32:54):
okay, this benchmark tells what is the best model.
Right.
It's pretty incredible, you know, how text that's in a book compared to text that's,
you know, in a social media post compared to text that's in a dialogue, you know, how
different they really are.
You know, yes, it's all people trying to express themselves, you know, using natural language,

(33:19):
but there's just such, such a variety.
I think that's what makes it hard to get this one, you know, encoder that's gonna work for
everything.
And I think that's why you sort of need to figure out what's the best one, what's the
best embedding that's gonna be applicable to your use case.

(33:39):
It was so nice to talk to you about, you know, sentence transformers and set fit to zoom
out a little bit just to talk about machine learning in general.
I'm just wondering in the field, what's an important question that you believe remains
unanswered?

(34:01):
So what amazes me is the human ability to quickly learn and update information.
It's still a big challenge for a lot of models; if you take the BERT model, BERT still
thinks that Barack Obama is the US president, has no idea about Donald Trump, no idea

(34:21):
about Joe Biden.
And for us as a human to update the knowledge that there's like a new US president, it's
like one sentence like person X is the new US president.
I don't know, Joe Biden is the new US president.
And we have this information in our head, but we also update the relationships.
So we know, okay, there's a new US president, we know there's a new first lady, we know

(34:45):
maybe the party changes.
So before was Donald Trump Republican, now it's Joe Biden Democrat.
It changes the number of presidents who attended different schools, the number of presidents
who have been former vice presidents and so on.
Like a lot of knowledge is updated in our head from this short, I don't know, five words

(35:05):
examples like Joe Biden is the new US president.
If we do this for a model like BERT, like some transformer model, it's super, super
hard to update it.
So often we need like a lot of text, like millions of examples mentioning that Joe Biden
is the new US president and now the Democrats are back in the White House and that there's

(35:29):
a new first lady and so on.
So it's like extremely super inefficient.
And this makes it really, really hard to keep models up to date, models that
can learn niche topics, because it's not data efficient.
And I think going forward, that's an extremely interesting research topic, how can we make

(35:51):
models update as efficiently as humans can acquire new knowledge?
Right.
So that ability to adapt to change.
And that sort of touches on my next question, which is how do you think machine learning
will change or evolve say in the next 10 years?

(36:16):
And what do you think the impact will be on society?
Big question.
Big question.
So currently the trend we see is we go to more exciting applications, which are harder

(36:37):
and harder to quantify.
So far machine learning research is a lot about quantification.
So you take a benchmark like GLUE, where you have like 1,000 examples, I don't know, movie
reviews, which you annotate as positive or negative sentiment.
And that's like really easy to benchmark your numbers and say, okay, this model is better

(36:59):
than previous models.
How does it score?
Like, I don't know, 95% correct, and the previous model got 94% correct.
With these more complex use cases like generative things like, I don't know, here's an email,
write like a nice response to that email.
Like how do we evaluate this?
Or, I mean, we saw these applications: ChatGPT, write a poem about how bubble sort is

(37:24):
working.
How do you evaluate if the generated poem is correct?
So you have to evaluate the content, does it make sense?
How amusing is it?
Does it rhyme?
And this puts a lot of stress on the science, like, okay, how can we know, of these two systems,
whether this system creates a nicer poem about how bubble sort works,

(37:48):
or whether that system does?
And I think we're more and more tapping into these use cases that we now see are possible.
It's interesting: how can we create a science out of this, which requires experiments?
And how can we continuously improve on this?
And yeah, the machine learning research field moved away a lot from human experiments,

(38:12):
like asking humans what they think, towards quantitative numbers, computing accuracy
or F1 scores.
I think now it's going back again to human experiments.
So we show 100 humans these two generated poems, ask which poem is nicer, is better, and
then try to derive the numbers from that.

(38:32):
But yeah, it will be really hard for science to figure out how to compare it, how to scale it.
So it will change how we do the scientific work.
Speaking of ChatGPT and, you know, the generative models, I see that it's creating
a big hype, right, an even larger hype for AI.

(38:55):
How do you view the gap between the hype and the reality right now in machine learning?
Yeah, that's a good one.
So, I mean, it shows that there are a lot of applications possible which we have

(39:16):
not thought of before, which could be possible.
I mean, so I just read, I don't know, a really nice, cool idea.
As a manager or director or whatever, you get a lot of emails, so you have to respond
to a lot of emails.
And then yeah, use these generative models to create a draft for every email you

(39:38):
get, based on your previous responses. You do some post-editing and then you can send
it out, and this will save you as a manager a lot of time when your
main job is communication, and you don't have to write these drafts by hand each time,
but just from what you previously sent around, take the same text, distill it and generate

(40:00):
this.
There's still a lot of uncertainty, a lot unknown in the field, like what's the right business
model, and that's what people currently try to figure out. It was similar
when the mobile internet was launched and smartphones were launched.
Like how do you create a business out of an app?
So what's an app business?

(40:20):
That was kind of the question, and over the past 15 years, we learned how you can create
a business model out of apps.
And I think it's similar with AI now.
So what is a potential business model for AI, for generative things? The majority of things will
not work out.
But there are some things that work out.

(40:42):
And then it can be copied over and over again.
Okay, that's that's a new way how you can use it.
Right, with the, you know, advent of generative models and sort of how they're permeating
through society, it's really crossing over that boundary of, it's not like a natural

(41:03):
language processing thing.
It's not a machine learning thing.
It's affecting all different types of work.
I think it's really how humans interact with those generative models.
Like for example, you know, you're talking about like drafting up emails or drafting
up, you know, outlines for essays.

(41:24):
And then, you know, the human can then take that and work it and make it you know, make
it better.
It's going to be really interesting to see sort of how humans and machines, how that
interaction evolves over the next couple of years, when we're in this time, where it's
like now everyone is sort of seeing the power.

(41:48):
It's going to be interesting to see.
Yeah, like all the different applications of it.
So switching into our Learning from Machine Learning part of the show, who are some people
in the field that influenced you?
Yeah, a big, big impact: Andrew Ng.

(42:13):
So he gave, not too long ago, quite recently, maybe one or
two years ago, a really cool, thought-provoking talk about data-centric AI.
Which I totally love. In machine learning, people focus a lot on modeling. They say,

(42:34):
okay, this is my data set, the MNIST data set.
It's given, it's God-given.
That's the training set.
That's the dev set.
That's the test set, I have to get the highest accuracy on the test set, given the train
and dev set.
And I try to hyperparameter-tune my model as much as possible.
But yeah, so he argues that working on data is a lot more fruitful.

(42:59):
So in many settings, in real cases, it's not like, okay, here's one dataset you can train on and one
test set you're going to evaluate on, and that's it.
You can modify how the training works, you can change the training data, add features,
clean it, get more data, annotate more data.

(43:21):
And this often improves your model a lot.
And yeah, that's also what I learned.
Okay, it's often not relevant to improve the model.
So a lot of work I did in the past years is not trying to make the model better and add
another, I don't know, skip connection or the latest add-on, whatever variation there's
out there, but find ways how to make the data better.

(43:44):
And this paid off a lot for search.
So we have these initial search models, and then we found ways how we can make the training
data nicer and better so it works really well to train these vector spaces, and remove bad
examples from the training data.
And that had like massive impact.

(44:06):
So if the model is really trained on nice, clean data, it works much, much better.
And then often you can just copy paste the same approaches and just use it with good,
nice, clean data.
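(A small, hedged sketch of that kind of data cleaning: using an existing embedding model to flag candidate training pairs that look unrelated before they ever reach training. The threshold and data are made up for illustration; this is one simple filter, not the exact pipeline described here.)

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

    candidate_pairs = [
        ("how do I reset my password", "password reset instructions"),
        ("how do I reset my password", "buy cheap watches online"),  # likely a noisy pair
        ("capital of the united states", "washington d.c."),
    ]

    # Keep only pairs that an existing model does not score as clearly unrelated.
    kept = []
    for text_a, text_b in candidate_pairs:
        score = util.cos_sim(model.encode(text_a), model.encode(text_b)).item()
        if score > 0.2:  # arbitrary starting threshold
            kept.append((text_a, text_b))

    print(f"kept {len(kept)} of {len(candidate_pairs)} candidate pairs")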
Right.
Yeah, it's interesting because when you're learning about data science, I feel like the
attractive thing are the algorithms and you want to be modeling things.

(44:30):
Then you might go to a resource like Kaggle, which is incredible, but you're given, you're
handed a data set, right?
And it's pretty clean usually.
That is never the case in industry.
So I think that that's another part of the data-centric approach to things, just sort
of always making sure that, are you getting the best data that you can?

(44:54):
Are you processing it in the right way?
And yeah, I'm definitely a big proponent of that.
Yeah, totally.
I mean, in the talk Andrew showcases some cases from computer vision where they take
a picture to try to detect defects in manufacturing.
So some parts which have errors.

(45:15):
And what they do is like, I don't know, use a different camera, set it at a different
angle with different light and model accuracy jumps like 20 points.
So that's completely outside of data pre-processing, feature engineering, models, hyper-permitter
tuning.
But yeah, how you acquire the data had a big impact.
Like, okay, use different angle of the camera, different lighting, and now everything is

(45:38):
super easy to spot these mistakes in the production.
That's a nice way of thinking, where you don't just take the data as given and then try to, I don't
know, clean the data or tune the model,
but go even a step before: how do you acquire the data?
Right.
Just, yeah, figuring out different ways of, you know, what are the inputs going to be,

(46:03):
you know, for the model.
That's a very interesting way to augment data for computer vision.
With so many things going on in machine learning and it being such a rapidly evolving field,
how do you stay, you know, up to date with the latest developments and techniques in
the field?
Yeah, that's a challenging one.

(46:27):
I try, I do not stress too much about like trying to stay up to date with latest techniques
because a lot of things or majority of things that are published are kind of irrelevant.
Every month you have thousands of papers on arXiv, but if you do a retrospective of

(46:47):
which papers were relevant last year, you can break it down to 20, maybe 50 papers or
so.
So you don't have to read all the thousands of papers that are uploaded to arXiv, and
what's relevant will be resurfaced in some form, because there's maybe some follow-up paper

(47:08):
using this old technique because this old technique works well.
And then you say, okay, cool.
There's some follow-up work.
And in general, what works for me is Twitter.
So see what are people talking about, what are people tweeting.
Open papers, just read the, I usually just read the title and look at figure one or table

(47:31):
one, maybe the abstract, and then just get a gist of, okay, that could be interesting,
a feeling, and then over time, maybe at some point, I read some paper and say, oh yeah,
there was some old paper that had this cool figure.
And then yeah, sadly it's back to a search problem; trying to really find those papers is still really,

(47:54):
really challenging.
And it's not a-
You could use a normal search.
Yeah.
I mean, Google search doesn't work that well when you say, yeah, I read, some time half
a year ago, an arXiv paper whose figure one showed how you can modify, kind of, I
don't know, momentum in the optimizer; can you please show me this paper again?

(48:16):
It goes back to prompt engineering, right?
Yeah.
Yeah.
I mean, yeah.
I mean, many, many people compare search query formulation with prompt engineering.
So we, I don't know, when Google launched, people didn't know how to use it in the beginning
and they had to learn what's the right query to write.

(48:36):
And yeah, that's similar with prompt engineering of generative models.
Yeah.
But-
Prompting Google in the right way can get you pretty far in this world right now.
Yeah.
In your machine learning journey, what's one piece of advice that you received that has

(48:56):
helped you or stuck with you?
One piece of advice that stuck with me or helped with me.
The next question is going to be harder.
So-
The next question is going to be harder.
Yeah.
I didn't reflect on that question.

(49:20):
So yeah, I cannot really narrow it down to a single piece of advice.
What I did a lot of in the past years is to stay closely connected to
the community, to see what common questions they have where there's no answer yet.

(49:42):
So for example, a common question after launching the Sentence Transformers models, which only worked in
English, was like, hey, that looks promising.
I want to use it for another language.
And there was like no answer for this.
And I got this question over and over and over again.
I said, oh yeah, if there's no answer, that's great research question.

(50:03):
Let's do research on this.
And so we did cool research.
And then another common question from a lot of people was, yeah, I want to use this for,
I don't know, searching in CVs, but I don't have any training data.
So how can I tune this model without training data?
The answer was then, oh, you need training data.

(50:24):
It's not possible without training data.
So I started to do a lot of research last year, how to train models without labeled
training data.
So it's more like: be connected to the field, see what the challenges are, and find answers to them.
Yeah, that's good advice.
In a similar vein, for someone who is just starting out in data science or machine learning, or

(50:50):
say is making the transition from another field, what advice would you give them?
It depends a bit on the role, whether they go more into a PhD role, a research role, or
more into a product role, wanting to ship a product.
Say industry, say they want to go into industry.
Let's say go into industry.

(51:11):
For both roles, first I would say, do not trust everything that you read in papers.
Often the best approaches are not the newest approaches, so don't try to get
the state-of-the-art approach for whatever problem you try to solve.
Look at the earlier approaches: there's often a first iteration on something, like,

(51:36):
I don't know, how to generate images.
And then there's a second iteration and a third iteration.
And this is often the best kind of model.
And then at some point people start to overfit on the benchmarks and create like systems
that are complex and overfitted and unstable and not robust and not efficient just to beat

(51:57):
these benchmarks.
So try to find the sweet spot where we really made progress and then find a solution
that's easy.
And in general, yeah, do a lot of testing; find quick ways to test your hypotheses.
Don't try to be too clever; start with simple approaches.
Often simple approaches bring you a lot further than super complex methods

(52:21):
and ways.
Yeah, I think that that's really good advice.
Start, start with the foundation.
Here's the tricky question.
What advice would you give yourself when you were just starting out in your career?
That's also a good question.

(52:43):
Thank you.
Yeah, so I would say, yeah, go early on with user-centric research.
I'm a big fan of user-centric research, which means doing research that actually helps people
and making really sure to release something that's helpful for others.

(53:05):
So a lot of researchers, including myself in the beginning, were just saying, hey, we
just published this paper and the work is done if the paper is accepted at some conference.
I say, no, that's like, I don't know, that's not really the purpose.
We want to find a problem that's big.
And that's really challenging to find.

(53:28):
So partner up with someone experienced to see these big problems and then create like
a really nice solution for this and do the research, but also make the results accessible
in a really simple and easy way.
I like that.
I think the way that I view research is basically there's a big puzzle in front of us and you're

(53:52):
working on a single puzzle piece sometimes.
And I think you can sometimes lose sight of the so what.
So why should someone care outside of this field?
And that can kind of make your work more relatable and allow other people in.
And that's always a way to sort of enhance the work that you're doing.

(54:18):
I mean, accessibility is a big issue in machine learning.
So we have so many papers on, for example, optimizers.
Like, I don't know, every month someone publishes an optimizer that's much better than Adam.
But, I don't know, everyone is still using the Adam optimizer, which is already, I don't
know, five years old or so, or older.

(54:41):
Yeah, there's some funny memes about that, like Adam asking why me or something like
that.
I don't know how old Adam is.
It's from 2014.
So it's like nine years old.
And in these nine years, there have probably been hundreds of papers
on better optimizers than Adam, but no one is using them.

(55:04):
And I think one big issue is accessibility.
So maybe someone has found a better optimizer than Adam, but they did not make it accessible.
And really making it accessible means having it nicely, efficiently
implemented and available in common frameworks like TensorFlow and PyTorch and JAX, and integrated

(55:26):
in libraries like, I don't know, Hugging Face Transformers.
So make it extremely simple for people to use it, test it out.
And then often you see, if you do it, yeah, it works on these few example use cases you
tested.
But if you take all the users of, I don't know, TensorFlow, you will see, yeah, maybe it

(55:48):
doesn't work for 98% of the users of TensorFlow; it just works for a tiny slice of users.
So then you can do research, okay, how do you make it more broadly suitable?
So how can you increase it?
Or how can you better predict for which 2% of users is it actually like a benefit to
use this new optimizer?

(56:09):
For listeners that are just getting involved in machine learning, you know, what is an
optimizer?
How would you explain that to someone?
Sure.
So optimizers are the fundamental way we train a model. So we take a model, give it
input like an email, and then ask it, hey, do you think this is a spam email or is

(56:32):
it a ham email?
And then the model says, no, I think, yeah, that looks legit.
You want to buy some medication over the internet?
That sounds good.
And then you have a label saying, no, no, no, this is spam.
I don't want to see this in my inbox.
And then you say, okay, there's a mismatch.
There's a difference between what the model predicted and what the correct answer is.

(56:54):
And then the optimizer can trace it back to the input and say, okay, which words did you
think made it look legit?
And how can I modify the weights so that the next time you see this example, the email
will be correctly classified as a spam email.
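(For listeners who want to see that loop in code, here is a minimal PyTorch sketch of one training step; the tiny linear model and random features are stand-ins, not a real spam classifier.)

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 2)                     # toy classifier over a 10-dim feature vector
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    features = torch.randn(1, 10)                # stand-in for an encoded email
    label = torch.tensor([1])                    # 1 = spam, 0 = ham

    logits = model(features)                     # the model's guess
    loss = loss_fn(logits, label)                # mismatch between guess and correct label
    loss.backward()                              # trace the mismatch back to the weights
    optimizer.step()                             # the optimizer nudges the weights
    optimizer.zero_grad()                        # reset gradients for the next example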
That's great.

(57:14):
Yeah, thank you for that.
That's a good explanation for optimizers.
So getting to the conclusion, what has a career in machine learning taught you about life?
Career in machine learning taught me about life.

(57:37):
That's what we're all here for.
Yeah.
I mean, it's more what has life taught me about machine learning.
That's the easier one.
Well, if you want to start with that, you can.
Yeah.
Oh yeah.
What has machine learning taught me about life?
So I'm a parent of two kids, now two and four years old.

(57:59):
And yeah, you kind of see them like your model, learning, improving, making mistakes
at the beginning and then improving as you provide feedback.
So you're kind of like the gradient update and the optimizer for your kids.
And it's kind of interesting for me, at least, to see what's similar between

(58:23):
machine learning, trying to improve the model, and what you as a parent do to raise
your kids.
And I'm still amazed by the learning capabilities of humans.
I mean, even at a young age, I don't know, when they are like two or three years old,
you can invent names.
So you take like some toy or some stuff or some concept and you create like a fake name

(58:48):
for it and they can talk, use this name and start to reason about the name.
So if this is called like this invented name, then this must be some other invented name.
It's like extremely interesting how quickly kids pick up language and be able to draw
these conclusions and reason about this.

(59:11):
And yeah, if you try this even with the smartest GPT, ChatGPT model, and say, hey, please call
my car, whatever, call my car John, it's not really able to learn this and not able
to reason about it.
Right.
Super interesting.

(59:31):
So yeah, thinking of children and how reinforcement learning is influencing their behavior and
how they're representing the world around them, just like the models that we're trying
to train.
I'm going to butcher his name, but François Chollet, I love some of his tweets.

(59:55):
He talks about raising children and how it's like training a machine learning model.
That's great.
That's such a great take on it.
So just to wrap things up, if there are listeners that want to learn more about you, where could
they go to learn more about you?

(01:00:17):
Yeah.
So when you Google my name, you can find my personal website, nils-reimers.de.
You can also find my Google Scholar profile with my research.
And yeah, you can also watch cohere.ai for the work we're going to do, like semantic search
and text understanding, and bringing it to production.

(01:00:38):
So really, I'm now more focused there; I want to move away a bit from the research side, more to the production side.
So how can we deploy these systems and face the challenges of going from nicely clean research
benchmarks to, okay, I actually put it into production and see all the challenges you
have with ugly, noisy data in a production setting, and how can you ensure that your system

(01:01:04):
still works well and nicely in these settings.
Right.
Nils, it was such a pleasure to have you on Learning from Machine Learning.
Thank you so much for being here.
Likewise.
It was great chatting with you.

(01:01:26):
Thank you for listening to Learning from Machine Learning.
This episode featured an expert in natural language processing, Nils Reimers, the creator
of Sentence Transformers and currently the Director of Machine Learning at Cohere.
Be sure to check out the show notes to learn more about this podcast and some of the topics
discussed.
Talk soon and keep on learning.