
November 12, 2025 52 mins

The transformer architecture has dominated AI since 2017, but it’s not the only approach to building LLMs - and new architectures are bringing LLMs to edge devices


Maxime Labonne, Head of Post-Training at Liquid AI and creator of the 67,000+ star LLM Course, joins Conor Bronsdon to challenge the AI architecture status quo. Liquid AI’s hybrid architecture, combining transformers with convolutional layers, delivers faster inference, lower latency, and dramatically smaller footprints without sacrificing capability.

This alternative architectural philosophy creates models that run effectively on phones and laptops without compromise.


But reimagined architecture is only half the story. Maxime unpacks the post-training reality most teams struggle with: the challenges and opportunities of synthetic data, how to balance helpfulness against safety, Liquid AI's approach to evals, RAG architectural approaches, how he sees AI on edge devices evolving, hard-won lessons from shipping LFM1 through LFM2, and much more.

If you're tired of surface-level AI takes and want to understand the architectural and engineering decisions behind production LLMs from someone building them in the trenches, this is your episode.


Connect with Maxime Labonne:

LinkedIn – https://www.linkedin.com/in/maxime-labonne/

X (Twitter) – @maximelabonne

About Maxime – https://mlabonne.github.io/blog/about.html

HuggingFace – https://huggingface.co/mlabonne

The LLM Course – https://github.com/mlabonne/llm-course

Liquid AI – https://liquid.ai


Connect with Conor Bronsdon:

X (twitter) – @conorbronsdon

Substack – https://conorbronsdon.substack.com/

LinkedIn – https://www.linkedin.com/in/conorbronsdon/


00:00 Intro — Welcome to Chain of Thought

00:27 Guest Intro — Maxime Labonne of Liquid AI

02:21 The Hybrid LLM Architecture Explained

06:30 Why Bigger Models Aren’t Always Better

11:10 Convolution + Transformers: A New Approach to Efficiency

18:00 Running LLMs on Laptops and Wearables

22:20 Post-Training as the Real Moat

25:45 Synthetic Data and Reliability in Model Refinement

32:30 Evaluating AI in the Real World

38:11 Benchmarks vs Functional Evals

43:05 The Future of Edge-Native Intelligence

48:10 Closing Thoughts & Where to Find Maxime Online



Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
Yeah. So post-training is quite
simple to define. So you turn a model that is able
to do this auto-completion stuff into a useful assistant that is
able to answer questions and follow instructions.
Welcome back to Chain of Thought, everyone.

(00:20):
I am your host, Conor Bronsdon, and today we're going to be
going down the stack as I'm joined by Maxime Labonne.
Maxime is head of post-training at Liquid AI, where he is
leading the development of their Liquid Foundation Models, which
we're going to talk about: a very interesting and fundamentally
different architecture that's challenging the transformer
monopoly. He's also a prolific open source

(00:42):
contributor. Maxime is the creator of the
wildly popular LLM course on GitHub with over 66,000 stars
and much more. But we're going to say, hey,
Maxime, welcome to the show. Great to have you here.
Hey, Conor. Hey everyone.
Thank you for having me. It's honestly our pleasure.
We had several listeners that reached out to us and said, hey,
look, we'd love a deeper conversation with Maxime.

(01:04):
We think he would be a great guest.
And so it's a delight to have you on the show.
And we're going to get deep into architecture, some post-training
techniques perhaps, and the trade-offs that have come from
your experience with open source, and so much more.
But let's start with Liquid AI. In an industry where "attention is
all you need" has been gospel since 2017.

(01:27):
You've gone back to first principles with Liquid
Foundation Models, or LFMs. Can you walk us through what
LFMs actually are for those who don't know, and how they
differentiate at an architectural level from other
models? Yeah.
Thank you for this question. So Liquid started working on
architectures since the very beginning.

(01:49):
This is something that we're very keen on.
And we had this first version of LFM models, now called LFM 1,
that was released in 2024. And now we released LFM 2 in
2025. And this generation is open
sourced. We have a ton of models.

(02:09):
I checked just before this recording and we've open sourced
17 models since July. So in, in just like 3 months or
something and it's, it's going strong.
I can, I can tell you we still have more.
But to go back to the question about the architecture, what we
wanted to do with these LFM 2 models is having an on-device LLM

(02:33):
that is fast and accurate. And the current breed of LLMs is
accurate, but it's not very fast.
So we wanted to go back to the architecture level to be able to
design something that was truly optimized for this kind of
hardware, like a phone, like a wearable, etcetera.

(02:55):
So the architecture that we ended up with is a hybrid that
has some attention layers. It has 6 attention layers in the
350 million parameter, 700 million parameter and 1.2
billion parameter models. It also has a new component
which is a short convolution layer and here we have 10 of

(03:17):
them. So there are more convolution
than attention layers in these models.
And what we gain from it is that inference speed is a lot faster
and the memory usage also is a lot lower when you have long
context. So your KV cache doesn't explode
because of the convolution layers and all of that while

(03:39):
maintaining the level of quality of a pure transformer model.
So those are the three metrics that we look at.
And usually there's a trade-off between the memory usage, the
inference speed and the quality. And here we managed to kind of
raise this Pareto frontier of optimizing the model for the

(04:00):
three tasks. And it sounds like one of the
key goals for Liquid is to enable your models to run
extremely well on edge devices so that anyone around the world
can leverage an LLM on their phone, regardless of whether or
not they have a subscription to OpenAI, or whatever else.
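As a rough, hypothetical illustration of the hybrid stack Maxime describes (a few attention layers interleaved with short causal convolution layers, so that only the attention layers carry a KV cache), here is a minimal PyTorch sketch. The layer counts follow the conversation, but the dimensions, block design, and ordering are assumptions, not Liquid AI's actual LFM2 implementation:

```python
# Illustrative sketch only -- NOT Liquid AI's LFM2 code. Layer counts (6 attention,
# 10 short-convolution) follow what Maxime describes; all dimensions are made up.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortConvBlock(nn.Module):
    """Causal depthwise convolution with a small kernel: no KV cache needed,
    so memory stays flat as the context grows."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim,
                              padding=kernel_size - 1)  # pad, then trim to stay causal
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        y = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.proj(F.silu(y))

class AttentionBlock(nn.Module):
    """Standard causal self-attention block; each of these holds a KV cache at inference."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                       device=x.device), diagonal=1)
        y, _ = self.attn(x, x, x, attn_mask=causal)
        return self.norm(x + y)

dim = 512
layer_types = ["conv", "conv", "attn"] * 5 + ["attn"]   # 10 short-conv + 6 attention blocks
blocks = [AttentionBlock(dim) if t == "attn" else ShortConvBlock(dim) for t in layer_types]
model = nn.Sequential(*blocks)

tokens = torch.randn(1, 128, dim)   # (batch, seq_len, hidden)
print(model(tokens).shape)          # torch.Size([1, 128, 512])
```

Because only 6 of the 16 blocks are attention, the KV cache grows with 6 layers' worth of keys and values instead of 16, which is the long-context memory saving Maxime points to.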
Yeah, exactly. We started working with

(04:20):
customers in 2024 and even a bit earlier than that, and we
realized that the model that was the most in demand was our 1B
model. And to be very honest with you,
our 1B model, the first generation of LFM, it was a bit
of an afterthought. We said, yeah, like we, we can

(04:41):
do it, so why not, right. But it was not like a flagship
model at all. And yet that was the most
popular one. So that got us thinking that
there was probably a really interesting market to explore
because nobody is in this niche of small models that are really
optimized for on device inference.
And this was the reason behind this choice of going on device

(05:06):
LLM for LFM 2. I love how unique the approach
is because I think most competitors here who are
building models are really focusing on, OK, like how big
can we make this damn thing? How many data points can we get
in it? And you're taking the exact
opposite approach. You're saying, what do we need
to do to get this on the minimum viable piece of hardware that

(05:30):
someone is going to have in their pocket?
And I, I think that is going to deliver dividends long term.
And I've heard you describe these LFMs as being built with
computational units deeply, deeply rooted in dynamic
systems. Can you unpack a bit what this

(05:52):
means in, in practice for people, since I think some of us, and
I'll put myself in this camp here.
We hear that and we go, OK, like what are we really trying to say
here? Like like, OK, great, we're
trying to get it on an edge device, but what, what's really
happening here? What's the goal?
Yeah. So this is lingo from people who work on
architectures, and what, what they mean by that is

(06:17):
basically going back to mathematical operators.
And when you go back to mathematical operators, you
realise that the attention layer is super strong actually.
So it's very it's a very, very strong baseline.
And that's another lesson that is very important when you work
in this space is that theoretical gains might not be realised in

(06:42):
practice. And it's not because you have an
architecture that is supposed to run very fast, because the
algorithm, the math says so, that it's going to fully realise once
you lower the architecture on the target hardware.
That was a lesson that we learned with the first
generation of LFM models. And for the second generation,

(07:04):
something that we did right at the beginning was optimising the
models on target hardware. So on a Samsung phone, for
example, so we would not be misguided into believing that
our operator is really good, it's very fast. Now we could
measure it and make sure that it's actually working in

(07:26):
practice. And I think that part, plus
having a ton of pre-training benchmarks, like over 100
different evaluations also helped in converging into like
the best architecture for this task.
And I, I think many of us are much farther or much higher up

(07:47):
the stack, where we are. We're using models in practice.
Maybe we're coding with them, maybe we're writing with them a
bit. But beyond surface level
architectural understanding, like many folks don't even know
how to get involved in the model development space.
Now that's not true of everyone. We have some very smart
listeners. I'm not trying to call all of
you out, but what I'm driving at here, Maxime, is I think you've

(08:10):
developed a really unique and powerful skill set.
And I'm curious from your perspective, like what led you
into this path and to this, not just edge compute, but I think
the edge of the entire space, this frontier?
Thank you. First of all, I think it's about
like the curiosity of knowing how these models are built.

(08:33):
I think it's it's shared by a lot of people.
For example, Hugging Face released a very long blog post
about how to pre-train models, how to post-train them too,
actually. And that is widely popular
because I think there's like curiosity in everybody that
uses LLMs. Like, OK, like how do you train
one? Like how does it work in

(08:54):
practice? And if you're interested in
that, my recommendation would be, yeah, you can go into this
rabbit hole and try to explore the architectural side or maybe
the pre-training side, like it doesn't matter, like what
you're interested in in particular, but just make
projects. And that's how I really started

(09:15):
with my own projects in the open source.
And then it leads to new opportunities that you wouldn't
have otherwise. So we are mostly driven by the
curiosity of understanding how you make these models.
And now I really enjoy this seat of being able to drive

(09:36):
post-training, drive the data, the evaluations and
iterate and release the models and see how they're used in real
life by end users. I would love to ask you about
just that. How are you seeing end users
leverage Liquid models on their edge devices?
So we, we started doing hackathons with Liquid.

(09:57):
So if you're interested, know that we have regular hackathons
all the time and people are super creative, like extremely
creative. The last hackathon that we had
was in Tokyo and there's a team that won second place, but they
had like a crazy idea. They said we're going to
fine-tune the LFM model and put it on the bike and it's going to be an

(10:21):
AI bike. And at the beginning of the
presentation, I was like, this is the most stupid idea I've ever heard
of, right? This is like so dumb.
And at the end of the presentation I thought these
guys are absolute geniuses. Like who am I to even think that
they're dumb? Like they just
outsmarted me completely. And actually it's a model, it's

(10:42):
a vision language model built on top of LFM2-VL 1.6B.
And what they did is kind of role play of a bike.
For example, if you ask this model, what is the transformer
architecture, it's going to tell you, I don't know, I'm a bike.
And that's that's crazy, right? You never have that in real

(11:04):
life. Like you never fine-tune a
model to refuse to answer a question.
So people are like extremely creative in the way that they
use these models. This is just one example.
There's another guy who made a, an app on your phone and it, it,
it reacts when you use your phone at night and it tells you

(11:25):
no to stop doing that basically. So you browse Reddit and it's
going to tell you the best kind of subreddit is sleep and this
kind of stuff, it's just popping on top of your screen.
It analyzes your screen and the content that you see and it
creates a snarky message to tell you to stop using your phone.
Those might not be like the most useful

(11:47):
applications always. And are all of these applications snarky
ones? We got a snarky bike.
We got snarky Reddit. But you can test them.
You can test them today. They're available online and
that's beautiful. Did they put giant googly eyes
on the bike to indicate that it had a vision model as part of
it? I think they did.
I think they did. All right, I I might have to try

(12:08):
that myself. That one sounds really fun.
I have to admit, I think my wife would be like, what are you
doing in the garage today? And I'm like, don't, don't worry
about it. OK, this, this is fantastic.
I, I love to hear the creativity people are using this with.
And if if folks want to join in on these hackathons, is there
somewhere they can go to get more information?
Yeah, the official account of Liquid AI on Twitter, on

(12:30):
LinkedIn. You will find all the
information there. So as much as I'm excited to
have an AI bike, and I think I'm only beginning to scratch the
surface of the possibilities as I add more and more googly eyes
to it and indicate that it can really see everything.
I'm certain there are other use cases that the folks are having.
And I've in fact heard, you know, several businesses that I

(12:51):
talked to who are excited about this.
Where are you seeing Liquid models in the wild outside of
the hackathon? Yeah, there are a lot of
different use cases, right, Because we have different
modalities as well. So I talked about vision
language for the bike and for the app, but we also have an
audio model that is really good and is able to do this kind of

(13:14):
audio foundation model. We can build applications on top
of it. It can do speech to text, but
also text to speech or speech to speech or text to text.
It can do like any combination of those and that allows you to
do a lot of creative applications.
For example, if you're in a car and you want to talk to an
assistant, it can replace Siri and do a better job most of

(13:38):
the time. And we also have Nanos, and these
Liquid Nanos are small fine-tuned versions of our
models. So for example, you have a data
extraction model and this data extraction model, you give it
some text and it will return a JSON object from it.
You can specify the JSON object that you want to use.

(14:00):
And an interesting application, because we've been talking about
on-device deployment quite a lot, is that with these small
models, like the 350 million parameter model, actually you
can deploy it on GPU. You can deploy it at scale and
do big data operations with it. And that unlocks a lot of
applications that were not really possible before with

(14:21):
generative AI models. So for example, if you are
in e-commerce, finance, all these kinds of fields, and you have a
ton of operations, you can use this tiny model and do
unstructured-to-structured text conversion with
that. So that's one way of using them.
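As a rough sketch of the unstructured-to-structured extraction pattern Maxime describes, here is what prompting a small instruction-tuned model for JSON extraction might look like with the Hugging Face transformers pipeline. The model id, the schema, and the prompt format are all placeholders, not a specific Liquid Nano:

```python
# Hypothetical sketch of unstructured-to-structured extraction with a small model.
# MODEL_ID is a placeholder -- substitute whichever small extraction model you use.
import json
from transformers import pipeline

MODEL_ID = "your-org/your-small-extraction-model"  # placeholder, not a real checkpoint
generator = pipeline("text-generation", model=MODEL_ID)

schema = {"product": "string", "price": "number", "currency": "string"}
text = "The Aurora X2 headphones are now on sale for 129.99 EUR."

prompt = (
    "Extract the following fields from the text and answer with JSON only.\n"
    f"Schema: {json.dumps(schema)}\n"
    f"Text: {text}\n"
    "JSON:"
)

output = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
# The pipeline returns prompt + completion; parse only the model's completion.
record = json.loads(output[len(prompt):].strip())
print(record)  # e.g. {"product": "Aurora X2", "price": 129.99, "currency": "EUR"}
```

Because the model is only a few hundred million parameters, the same pattern can be batched on a GPU for the large-scale e-commerce or finance pipelines he mentions.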

(14:44):
And we have a list of these fine-tunes that people can download
and play with to understand, like how to adapt them to the
use cases. We have RAG, we have function
calling and many others. Are you seeing devs leverage
Liquid models within agentic structures?
This is harder to do with small models, right?

(15:06):
Because agentic capabilities are pretty much frontier.
But it's possible to do function calling and tool use.
It's possible to add them as part of a framework.
I've got good feedback actually about the function calling
capabilities of even the 350M model.
So it really depends on your task.

(15:26):
It will not be able to replace cloud model quality if you need
some kind of complex workflows. But we think of it more of a one
task thing. You have one task.
Maybe in a multi-agent system? Exactly.
A multi-agent system would fit the description here.
I'm curious more broadly about your perspective on agents.

(15:47):
Obviously that's been maybe the the topic of conversation for
almost the past year now within the AI community.
And yet I think there are many folks, and I'll include myself
on here with Karpathy and others who think, you know, we're a
little early here still, like there's a lot of work to be
done. What's your perspective on this
debate, of sorts? Yeah, it's, it's funny because

(16:08):
people have different definitions of agents as well.
So it's also a very tricky topic to talk about.
I am a bit skeptical about the success of agents so far.
I do like code assistants. I do like some workflows that
are a bit more automated, but it feels like it's an engineering

(16:30):
problem as much as a machine learning problem, and we're
mostly focused on the machine learning problem so far.
So this is something that I would like to be able to power
with LFM models on the phone directly and have some kind of

(16:50):
simple agentic workflows. I think that usually agents
are good when they're quite simple and straightforward.
They tend to degenerate quite fast when it's about more
complex workflows where there's a ton of reasoning involved.
But just being able, you know, to directly change the settings
of your phone with natural language or even just talking to

(17:14):
it, I think it's already like a big improvement.
You know, like adding, stacking up these little features is
something that would change the way that we interact with the
systems. And this is the core of this
agentic, not really revolution, but evolution in the way that we
interface with hardware. I think this is like the, the

(17:37):
main thing that excites me is understanding, OK, we can really
give models for people to power their workflows.
It can be agentic. It doesn't have to be, but this
is a way to go forward and I hope that it will not just be

(17:58):
with cloud models, but we can also do it with more on-device LLMs.
Are you seeing Liquid models be used in RAG systems in
particular because of the injection of context that is
occurring? Therefore, I guess it probably
enables Liquid models to succeed at these like extremely high

(18:19):
levels with great response quality while not requiring the
massive amounts of pre-training and data sets that would be
required to kind of pull it off while being an edge device or
edge-device capable. Yeah, RAG is a good use case
because it's not too complex and a small model can do it very well.
We have a RAG-specific model actually to do it.

(18:42):
And the other thing that is interesting is the RAG pipeline.
So you have like the model doing the summarization, the
question answering, but you also need at least one model to do
the retrieval part. And the retrieval part is also
super interesting. So we, we released a ColBERT
model, which is a late interaction retrieval model.

(19:04):
And the idea here is that you have three types of embedding
models. You have classic embedding with
BERT, where you give an input to the model and it will output
a vector, and then you can do some computation with the
vector. You can calculate the similarity
with other vectors and this is how you do embedding and

(19:25):
retrieval. You have the cross-encoder
or reranker architecture where you have the query and the
document at the same time and then it's processed across
all the layers and then you get the similarity score.
So this is good, but this is very expensive.

(19:46):
And finally you have this late interaction family where what
you do is that you take kind of the best of both world, meaning
that you can pre compute your vectors and also you can do re
ranking in the same model. So.
I think this is a very exciting way of doing retrieval.
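To make the "late interaction" idea concrete: instead of one vector per document (bi-encoder) or a full joint forward pass (cross-encoder), ColBERT-style models keep one vector per token and score with a sum of per-query-token maximum similarities (MaxSim). A minimal sketch with made-up embeddings standing in for the retrieval model's outputs:

```python
# Minimal MaxSim scoring sketch (ColBERT-style late interaction). The embeddings
# are random stand-ins; a real system would produce them with the retrieval model.
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """query_emb: (q_tokens, dim), doc_emb: (d_tokens, dim), both L2-normalized."""
    sim = query_emb @ doc_emb.T          # (q_tokens, d_tokens) cosine similarities
    return sim.max(dim=1).values.sum()   # best document token per query token, summed

torch.manual_seed(0)
query = F.normalize(torch.randn(8, 128), dim=-1)            # 8 query tokens
docs = [F.normalize(torch.randn(n, 128), dim=-1)            # pre-computed offline
        for n in (40, 120, 75)]                              # per-document token counts

scores = torch.stack([maxsim_score(query, d) for d in docs])
print(scores, scores.argmax())   # rank documents by late-interaction score
```

The document-side vectors can be pre-computed offline like a classic embedding index, while the per-token max recovers some of the re-ranking quality of a cross-encoder, which is the "best of both worlds" trade-off Maxime describes.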
And we thought, OK, since we have this super fast LFM 2

(20:09):
architecture, this makes a lot of sense to try to do it because
with this we can have a model that is bigger, so it's a lot
better, but it's as fast as a model that is a lot smaller.
So you kind of get like a super good trade off.
And yeah, I think that RAG is a really exciting application for

(20:30):
small models in general because you can get super fast retrieval
thanks to it. And you can also get very decent
quality, very good quality in terms of QA because it doesn't
require like a ton of reasoning. I think the latency point here
is particularly important because we see this with a lot
of production RAG systems where if they've tried to build on,

(20:53):
you know, let's say the Anthropic API for example,
there's just latency challenges with that compared to using a
smaller model that's hosted on edge or or elsewhere.
Because if you're continuing to try to, you know, call out to
other sources, I mean, depending on your use case, that may not
be feasible. Like maybe this is a just in

(21:14):
time delivery system for someone who's in a call and they need an
answer immediately based off of your document corpus, so they
can help get a deal. They can't afford to wait a
minute and a half for the model to generate the
response. I need it to be much faster than
that. So I, I, I love that you're
thinking about that. And it seems clear to me that
smaller models, models on the edge are key parts of how we're

(21:38):
going to bring AI into so many other opportunities, whether
that's robotics or whether that's, you know, simple chat
bots. There's, there's a lot of
opportunities here ahead, but let's talk about your specialty
of post-training. This is where models can go from
general purpose to, you know, really useful for specific
applications. We, we've talked about a couple

(21:59):
of those applications here, but it's also where a lot of teams
struggle. You know, that infrastructure
work you mentioned earlier can be a challenge for people.
And then I think there are some fundamentals that maybe slip
by folks. So let's maybe level set there.
How would you define, for the audience, post-training, and
why has it become so critical in your view?

(22:20):
Yeah. So post-training is quite
simple to define. It's the step that happens after
pre-training. So you turn a model that is able
to do this auto-completion stuff into a useful assistant that is
able to answer questions and follow instructions.
So this is the idea behind post-training.

(22:42):
It's OK, we have this model that is good at modelling language.
Now we're going to turn it into something that I can actually
use in practice. And to do that we have a lot of
different training techniques. The most useful one, like the
first one, is supervised fine-tuning.
Supervised fine-tuning is when you give the models the

(23:04):
questions and answers that you expect, and this should closely
mimic how users actually use your models in real life.
And then there's a ton of other techniques.
We can dive into them if you want, but more related to like
optimizing for preferences, optimizing for reasoning, or

(23:26):
having a teacher model that is able to distill some knowledge
into the student model that you're currently training.
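For readers who want to see what the supervised fine-tuning step looks like mechanically, here is a minimal, generic PyTorch sketch: the prompt tokens are masked out of the loss so the model only learns to produce the answer. The checkpoint name is a placeholder, and this is not Liquid AI's training code:

```python
# Generic SFT sketch: train only on the answer tokens of (question, answer) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-org/your-small-base-model"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

prompt = "Question: What does SFT stand for?\nAnswer:"
answer = " Supervised fine-tuning." + tokenizer.eos_token

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

labels = input_ids.clone()
labels[:, : prompt_ids.size(1)] = -100   # -100 = ignored by the loss: don't learn the prompt

loss = model(input_ids=input_ids, labels=labels).loss   # cross-entropy on answer tokens only
loss.backward()
optimizer.step()
```

In practice this runs over many such pairs that, as Maxime says, should closely mimic how users will actually query the model.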
How do you think about the decision making around when to
use different techniques? Obviously supervised fine-tuning
is kind of industry standard at this point, but you know,
preference alignment and other other options you're mentioning

(23:46):
here are also used. What are the trade-offs that you
consider, and how do you make those decisions, because
obviously cost and time can be factors.
No, absolutely. I think it really depends on
the end goal that you have. If you're creating a chat bot,
you really want to optimize for preferences.
This is not optional, it really makes a huge difference.

(24:09):
So preference optimization techniques like DPO, direct
preference optimization, or more elaborate ones like PPO, can be a
good idea. If you want to do a reasoning
model that will output a reasoning trace, it makes sense
to use reinforcement learning in the loop at some point, which

(24:31):
can also be used all the time, to be fair.
But once again, it really depends on the capabilities that
you're targeting. Does it make sense for you to do
it for math, for example? Not entirely sure.
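For reference, the DPO objective Maxime mentions compares how much more the policy prefers the chosen answer over the rejected one, relative to a frozen reference model. A minimal sketch with placeholder log-probabilities instead of real model outputs:

```python
# Minimal DPO loss sketch. The four inputs are summed log-probabilities of the
# chosen/rejected responses under the policy and a frozen reference model;
# here they are placeholder tensors rather than real model outputs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected (Rafailov et al., 2023).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.1]), torch.tensor([-14.2]))
print(loss)  # smaller when the policy prefers the chosen answer more than the reference does
```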
And with small models it also raises a lot of other questions
like OK, will someone actually use a 1 billion parameter model

(24:55):
for math ever? Do we need to actually care
about this too much? And so, yeah, depending on the
target that you have, you might have a lot of questions like
this. And there's a common maxim
here which is garbage in, garbage out, really saying that
data quality is, is everything. I mean, this is true for RAG

(25:18):
systems. It's true for, for so many
things that frankly, it's true for education for humans, right.
And so data quality seems to be at the heart of effective
post-training. How do you think about the data
generation and acquisition process here?
How do you identify what a high quality training sample is going
to look like? Yeah, it's, it's the most

(25:39):
important question I think when we talk about training models is
how to define data quality, because as you said, this is the
cornerstone of everything that we make.
Training techniques are nice, but they will never replace good
data, and having the ability to curate the highest quality data

(26:00):
is essential here. So it's very difficult to define
it. I think about it in like 2 or 3
categories. There's the accuracy of the data
because of course you want something that is factual.
If you ask a question, you want the right answer, right?
So that one is obviously very important and sometimes very

(26:24):
difficult to verify as well. And then you have the diversity.
So this time it's not about 1 sample, it's about the entire
data set. And you want to have a lot of
coverage and it has to be like as diverse as possible.
Diversity is so important that you might actually want to
include wrong samples just because they're more diverse.

(26:50):
And so yeah, it's a difficult trade-off to find sometimes, and
there's a lot of other aspects. For example, multi-turn
conversations are really good, and all the data should be
natively multi-turn, right? Chain of thought, or some kind of
reasoning at least when you output an
answer, is also very good for the model to have this structure of

(27:13):
first I think about something and then I give you the answer.
Because a lot of models, they start with the answer and then
they retrofit an explanation to justify why they chose
this answer. And you see that it tends to
degrade performance, of course, because they didn't have this
compute budget to think about the solution.

(27:34):
So yeah, a lot of different ways to approach this topic.
I don't think that anyone has cracked what true
data quality is about, because that would be like the most
successful lab in the world. Right, they'd already have one.
They'd be at AGI, if we think that's happening.
And I will say, thanks for the chain of thought shout-out

(27:56):
here. Obviously it's the name of the
podcast for a reason. We're big believers in this idea
of don't come in predisposed to what you're doing.
Instead, work through the problem, gather data and be
willing to dive into complexity. And you alluded to complexity in
data generation and its importance, having multi-turn
conversations. I know you've done quite a bit

(28:17):
of work around synthetic data generation, which feels like a
topic that you know, people go, Oh yes, of course, you know, we
will do synthetic data generation, but they don't
necessarily dive deep into often in these conversations.
When do you feel like synthetic data generation works well?
And then where are you seeing it fall short today?
It works well all the time, to be honest with you.

(28:40):
Like I think that 99% of post-training data sets are
synthetic. There's not much human in them,
maybe for the prompt actually, but that's pretty much it.
The problem with synthetic data generation is about the
diversity, because it's very easy to just collapse into something

(29:02):
that you think is diverse because you have like a ton of
different seed data or a ton of different ways to modify it.
But actually it's the same data generation process.
And because it's the same data generation process, it will
limit the diversity. So to me, this is where
synthetic data generation falls short.
It's really in this idea that it's very difficult to reinject

(29:26):
some diversity. There are some techniques that
are super interesting. For example, this persona sampling,
where you give the model a character to role-play.
For example, you are like a truck driver and you're from
this country and you have like this background, blah, blah,

(29:46):
blah, and the model will then act as this truck driver.
This is very surprising. This is work from Allen AI, for
example, that discovered that it works even in math, you know,
like the truck driver role play actually improves math
performance. So there are these techniques to
reinject some diversity. But still you probably want to

(30:08):
have, if you want to cover some domain
like math, you probably want many different data
generation processes to be able to have a good coverage of it
and not just one sampling technique that might just be
restricted. And this role play technique has
been used in quite a bit of prompting that you'll see as

(30:31):
well. If you see someone's prompt that
they're posting on LinkedIn, this is so great.
I mean, this is particularly true a few months ago, but often
you'd see them say, oh, let me start by defining who you are.
You're like, you are this incredible mathematician who can
do every problem I need you to do.
So we'll see people do that evenjust in their individual prompts

(30:52):
that they're leveraging. So it it makes total sense that
we're seeing that be successful within synthetic data generation
and the difference it makes to the data set as well.
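As a small sketch of the persona-sampling idea for synthetic data generation, here is one hypothetical way to vary the role-played character across seed prompts; the `generate_answer` call is a stand-in for whatever model or API you actually use:

```python
# Persona sampling sketch: vary the role-played character to reinject diversity
# into synthetic data generation. `generate_answer` is a hypothetical stand-in
# for a call to the LLM that produces the synthetic samples.
import random

personas = [
    "a truck driver from Norway who likes puzzles",
    "a retired math teacher in Buenos Aires",
    "a nurse working night shifts in Tokyo",
]
seed_questions = [
    "If a delivery takes 3 hours at 80 km/h, how far is the trip?",
    "What is 15% of 240?",
]

def build_prompt(persona: str, question: str) -> str:
    return f"You are {persona}. Answer the question step by step.\nQuestion: {question}"

def generate_answer(prompt: str) -> str:   # placeholder for a real model call
    raise NotImplementedError

synthetic_dataset = []
for question in seed_questions:
    persona = random.choice(personas)      # different persona -> different phrasing and reasoning
    prompt = build_prompt(persona, question)
    synthetic_dataset.append({"persona": persona, "prompt": prompt})
    # In a real pipeline: synthetic_dataset[-1]["answer"] = generate_answer(prompt)
```

The persona is just one lever; as Maxime notes, covering a domain well usually takes several distinct generation processes rather than a single sampling trick.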
I'm curious to dive into a bit more of where you've seen
challenges in this new generation of LFM models as you

(31:13):
were doing post-training. What have been the hardest
problems to solve? You mentioned diversity.
Are there particular areas? It's a very good question.
I don't think like we had like one major problem that we we
really spend a lot of time solving because we had this

(31:34):
experience with the first generation.
I think it was a bit smoother this time, but there's
definitely like some issues around, for example function
calling. Function calling is really
difficult to nail down, so I use function calling and tool use
interchangeably. To me it means pretty much the
same thing. And for this you need to have

(31:56):
like a specific formatting right in your chat template and you
also need to have data that is extremely diverse, even more
than for math for example. We realise that function calling
data is quite difficult to generate at a level of quality
and quantity that we want to have.
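To illustrate the formatting point (this is a generic example, not Liquid AI's actual chat template), function-calling training data typically pairs a tool schema with an assistant turn that emits a structured call and a follow-up turn that uses the tool's result:

```python
# Hypothetical example of the kind of structured function-calling sample Maxime
# describes; the exact chat-template syntax differs per model family.
sample = {
    "tools": [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}},
                       "required": ["city"]},
    }],
    "messages": [
        {"role": "user", "content": "Do I need an umbrella in Paris today?"},
        {"role": "assistant",   # the model emits a structured call instead of prose
         "tool_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}]},
        {"role": "tool", "name": "get_weather",
         "content": '{"condition": "rain", "temp_c": 12}'},
        {"role": "assistant", "content": "Yes, it's raining in Paris, take an umbrella."},
    ],
}
```

Generating many such samples that stay syntactically valid while varying tools, argument types, and conversation shapes is exactly the diversity problem Maxime describes.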

(32:19):
And here the solution is pretty much the same.
It's really about reading the data, understanding when you
have benchmarks where this fails, trying to solve it with
new data. And this is the iterative
loop, the cycle that you have in post-training, where you start
with the data, you have to spend like most of your time on the

(32:40):
data, honestly. Then you go through training,
you get the models, you get some evaluation, some feedback, and
then you can use that to go back to the data set and see how it's
possible to improve the results. And usually what people don't do
enough, in my experience, is reading the responses from the

(33:01):
models and reading the samples of training data.
There's no shortcut from what I could see.
You can try to ask ChatGPT to help you with that, but it's not
going to be so good. Yeah.
So you really need to do this manual groundwork of reading the
data and understanding, OK, like why?

(33:22):
What is the problem here? Why did the model,
why was it the wrong answer? Or even like, why is it the
right answer? What makes it easy for the model
to be able to succeed here when it was not able to do it for the
other prompt? There's not a lot of magic here.
It's it's a ton of groundwork and understanding of the data

(33:46):
quality and complexity. I resonate with that because I
think it's a very common problem in software engineering,
where everyone loves to submit a PR and get their PR merged, but
they may not want to spend a ton of time reviewing others' PRs.
They may not want to spend a ton of time editing others' PRs, and
that might be the most important part of the entire software

(34:08):
development life cycle. And it's certainly something
where I think all of us need to iterate and provide feedback,
whether it's in our day-to-day lives, whether it's in the
software we're building, and clearly in our LLM-driven
architectures. I love that you explicitly
mentioned something that it sounds like you're going to
carry forward to your learnings for, I presume, the future LFM 3,

(34:31):
which is, look, we need to approach function calling this
way. We're going to learn from this
experience. How, what, what did you learn
from LFM 1? What learnings are you now
taking from LFM 2 that you see fueling the future?
Yeah, it's interesting. I think there's all the

(34:52):
architecture work that is really important,
like besides the pre-training, right?
I think there's a ton of inference work and compatibility
with the open source community that we got a lot better at.
And we want to double down on that because if you create your
own architecture, the problem is that it's not compatible with
anything. So you have to reimplement it

(35:16):
from scratch in every library that you want to use.
Hugging Face Transformers, it can be vLLM, it can be
SGLang, llama.cpp, et cetera, et cetera.
So this part was a big lesson, I think, with the first generation
of models; the second generation, it was still very costly to
do it. So with the third generation, I

(35:38):
hope that we can find operators that are maybe a bit more
friendly to implement. And I think honestly this might
be one of the biggest lessons that we've had with the LFM
series. But about versioning in
particular, I think there's been a lot of lessons about data

(36:00):
quality, but also going from copying training techniques that
we see from papers and from like other people into making our own
in-house that are customized for our models and also small
models. I think this is really nice as a

(36:23):
scientific topic because everybody is kind of focused on
the big models, on the Kimi K2 and DeepSeek ones of the world.
And I think this is very interesting, right?
There's no issue with that at all.
It's just that I think there's not a lot of interest in general
for these small models, but they are truly interesting because
you have a lot more complexity that you need to work around.

(36:47):
Something that we found is like knowledge really, really
depends on the number of parameters and you can try to
squeeze as much knowledge, for example with knowledge
distillation into a 1B model. It will never be as smart as a
3B model. That's a fact.
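As a generic illustration of the knowledge-distillation objective he mentions (not Liquid AI's recipe), the student is trained to match the teacher's softened token distribution on top of the usual cross-entropy; the logits below are random stand-ins:

```python
# Generic knowledge-distillation loss sketch: blend hard-label cross-entropy with
# a KL term that pulls the student's token distribution toward the teacher's.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

vocab = 32_000
student_logits = torch.randn(4, vocab)   # 4 token positions from a small student
teacher_logits = torch.randn(4, vocab)   # same positions from a larger teacher
labels = torch.randint(0, vocab, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

Even with this kind of transfer, as Maxime says, a 1B student cannot store as much factual knowledge as a 3B model, which is why tool use and retrieval matter so much for small models.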
And you need to find fixes around it so it doesn't

(37:08):
completely hallucinate when the user asks a simple question.
So this is one of the core lessons that I think we'll we'll
drive forward with the next generations is this expertise in
working with small models, knowing their capabilities, but

(37:29):
also their pitfalls, and you need to play on their strengths,
basically. I think that's the main lesson
here. I think that's a fantastic
point. And too often we see AI
engineers come in and say, oh, I want to use the most powerful
model. And actually it's finding the
right model for your use case and understanding your

(37:50):
constraints. Maybe your constraint is it
needs to be on an edge device. And so maybe you can only run
that 1B model there. Or maybe you're actually better
off with, you know, we mentioned multi-agent systems briefly
earlier. Maybe you have three 1B model agents
working together, and maybe that outperforms a more
generalized agent there. There's a variety of

(38:11):
considerations here, and I'm sure that another one is occurring
when you are doing your, your post-training here, as you're
doing your evaluations of success and you're understanding
essentially what, what you want to deliver with each of these
models. It sounds like you're having to

(38:31):
customize quite a bit from the more generalized benchmarks that
we're seeing. Can you talk a bit about that
process as obviously it's something that's near and dear
to our hearts as well? Yes.
So in terms of benchmarks, we have our own stack internally
because there are a lot of really good open source
evaluations, but you want your own to be able to be a bit more

(38:56):
narrow in the skills that you focus on.
For example, it can be a specific type of function
calling that you want to be really good at because you think
it's, it's very important for small models to be good at that.
An example of this is web search.
We think that because you don't have the knowledge that is
needed to ace some questions, it's fine.

(39:17):
You can give a web search function to a 1B model, and that
might allow it to just Google it.
Google the topic of the question and retrieve
the right answer. So this is the kind of
evaluation that we want to design and focus on.
And also we try to repurpose evaluations for frontier models

(39:40):
that are a bit older, because now, you know, it's
the frontier models that create all these fancy evaluations, but
it doesn't make sense to evaluate a 1B model on it.
It would get like 1% at best. But the funny thing is that...
Have you had any Liquid models

(40:00):
play the werewolf game or anything?
I am curious. Not yet, Not yet.
OK, OK, sorry I'm derailing but I was just like wait a second, I
got to know. But this is interesting what
you're mentioning, because I think that now small models
might be able to do a better job at this kind of stuff because in
the past they were really bad at multi-turn conversations, for
example, they were really bad at long context length.

(40:22):
But as we progress, it's not only the frontier models that
progress. I think there's even more
progress in terms of small models like they're catching up
also quite fast. It's still going to be limited
by some elements like knowledge that I mentioned.
Sure. Still, we can do a lot with
them. And for example, if you have a

(40:43):
benchmark where you need knowledge, then fine, like
SimpleQA for example from OpenAI, fine, you can give it tool
use and then it becomes a kind of tool use benchmark instead of
a pure knowledge benchmark. And I know a lot of what has
fueled this, as you alluded to is open source evaluations, open
source models, open source techniques coming.
I mean, out of DeepSeek and, and many others that you're learning

(41:06):
from, you're customizing, you're applying to your own viewpoints.
You've obviously also been an incredible open source
contributor, whether it's your LLM course, which as I mentioned,
39,000 plus stars on GitHub. Maybe by the time this comes out
it'll be 40,000 plus, I don't know.
You've worked on open source models, tools, courses, books.

(41:28):
What drives your commitment to open source, and how do you see
it as integral to the AI and machine learning community?
OK, first let me fact-check you. It actually has 66,000 stars.
Oh man, so this is actually funny because this is what I get
for relying on an AI model to help me draft something.

(41:48):
Clearly, I clearly pulled something.
I didn't, I didn't fact-check the number of stars.
So this is... we should probably leave this in, actually.
So you've released an incredible amount of open source work.
You've got 66,000 stars on GitHub with your LLM course, and
you've done so many other things as well, whether it's
models, tools, courses, books. What drives your commitment to

(42:12):
open source, and how do you see it impacting AI and machine
learning going forward? Yeah, so open source for me is
quite selfish, honestly. It's mostly me trying to learn
stuff online and then it gets picked up or not.
But I think it's it's actually quite important to do it for
yourself and not for other people because, well, this is

(42:33):
your own interests, right? And this is an incredible way of
learning new things. Whether you write articles about
the topic that interests you, I think this is the most powerful
way of really learning about it, especially if it's a technical
article where you also provide code, you also provide a
notebook. There's a ton of times where I
was sure I knew something and then you write about it and

(42:55):
you're like, actually I need to double check the implementation.
It can be your own thing as well.
Like sometimes I have to reread my own articles to make sure I
still understand something. No, I, I feel like until I write
or build something, I truly don't understand a topic.
And if I write about it, I will very quickly understand where I

(43:17):
don't understand things. I'll be like, oh God, these five
areas. I, I got to go dive deeper here.
I totally get it. Exactly.
And this is the beauty with open source work:
because you know that there are going to be a ton of
strangers looking at your work, you really do not want to mess
it up, so you double down. You're making me

(43:37):
sweat over here. You double down on it, but to be
fair, like people are generally nice online, except in some
communities, but usually they're really nice online.
Maybe not on Twitter, yeah. Twitter is not the worst
honestly, I have to say. But yeah, this is a good way to
learn. And it's the same with data sets

(43:57):
and models and tools that I've made.
I did them for myself, honestly. Like the tools, I made them
because I needed them. I needed a nice way to run
benchmarks because it was too complicated back then.
I needed a way to merge models automatically because I was
writing an article about it and I wanted to run ablations to

(44:18):
understand better. OK, like do I need to use this
merging technique, this merging technique, etcetera.
And it's funny because then it was used in scientific articles.
Like some authors reached out to me and said like, oh, thank you.
Like actually we ran our experiments with your tool and
this is this is beautiful like because you you started doing it
for yourself. And then yeah, maybe if there

(44:40):
were better options they would have used another one, but they
used mine, so this is cool. I, I think that's fantastic.
Is there a particular open source project that you're most
proud of? I tend not to maintain them very
well, to be honest with you, but I would say that's a lot of
work. Yeah, exactly.
The LLM course is like what I'm proud of because it it's more

(45:02):
popular than me, honestly. People tell me, oh, you're the
guy from the LLM course. OK, I know now.
And there's also the LLM data sets for people who are
interested in fine-tuning. It's another repo that has a lot
of data sets for different stuff.
It can be for math, for function calling, and even preferences.

(45:25):
So this one I think is truly useful and I I will try to
update it soon now that I've talked about it.
Yeah, we're we're holding you accountable.
I'm making you talk on the podcast, but we'll definitely
link to it in the show notes.
So be careful here. But as we wrap up, I want to
look forward a bit. Obviously you have this unique
vantage point, you're building next generation architectures,

(45:48):
you're working across the full stack in many ways, and you're
deeply plugged into the open source community in AI.
What are you most excited about, or most concerned about?
I think something that excites me right now, and I've started
thinking about it more and more, is how we're going to build
operating systems in the future. Operating systems for phones,

(46:13):
tablets, wearables, laptops, whatever, because you probably
cannot do it without AI anymore. And it feels like these tools
are becoming so useful that either you do it on an
application level and each application downloads and loads

(46:35):
its own LLM or has API calls. But it's, it's not always
possible, right? Latency, to your point, cost,
yeah. Exactly, and online
connectivity. If, for example, if you do it in
a car, well, like, goodbye; it's not going to be
very useful most of the time. So there's a way that we might

(46:57):
have a foundation layer of AI models.
It can be LLMs, but it can also be other things that you provide
to developers so they can build the applications with this,
knowing that they always have this model that they can call on
an OS level. And yeah, I think this is
really, really exciting because it will deeply shape the way

(47:18):
that we create apps. And I hope that in the future
the apps get a bit smarter and, you know, you don't have to
fiddle with your settings and yeah, you just have better
interaction with these systems. I think that could be really,
really cool. Maxime, this has been an
incredible conversation. I've
had a ton of fun. Thank you so much for diving

(47:40):
deep with me. Two questions to close this out.
One, where should listeners go to follow your work?
You've got a lot of places where they can find you.
Yeah, you can search Maxime Labonne
on LinkedIn or Twitter, this is where I am most of the time.
Fantastic, and I'll recommend your
GitHub website as well. I will link all of those in the
show notes, but you've got a ton of great links and some blogs in

(48:03):
there. The final question I'd love to
ask you is what you're seeing in the future.
So you've alluded to a few things like, oh, you know,
obviously you believe in this idea of models on the edge
changing how we interact with them at times, but what about
new architectures? What about, you know, new tools

(48:23):
that are coming out? What are you seeing in the
coming years or or months that you expect to change how we all
think about AI and build with that in terms of architectures?
It's, it's sad to say, but it's always a
question of trade-offs, so having a new revolutionary
architecture like the transformer seems to be unlikely

(48:44):
to me honestly. But there's a lot of interesting
work with tiny recursive models and that kind of stuff for
reasoning. And I like these models because
they, they decouple knowledge from reasoning.
And I think this is a powerful paradigm.
The problem is I would like to integrate them into LLMs.

(49:09):
So there's been this paper by François Fleuret called the Free
Transformer that tries to integrate this
reasoning engine into an LLM. And I think that could be, in
terms of architecture, something that we see more and more
because it's not just about building blocks.
It's like another idea of kind of offloading the reasoning part

(49:31):
of LLMs into another component. And that might be a huge change
if it's successful. It's interesting you, you say
that because I, I don't know if I believe
there's going to be a short-term change around the transformer
architecture. I agree with you.
The next couple of years I think we are where we are.
There's going to be, you know, adaptations, hybridizations,
obviously we're seeing that already with Liquid and many others.

(49:53):
Well, actually not many others. Y'all are really at the bleeding
edge here. But I, I do think long term
we're going to see a change fromthe current transform
architecture. But I, I, I, I think it's a
question of how far out is that?Is that 10 years?
Is that five years? Is that, is that 40?
We've seen this happen throughout AI history, right?
Because really it goes back to, I mean, back to the 50s and

(50:15):
elsewhere, where we've, we've had these continual moves
forward. So I don't know, I, I, I look at
how humans learn and how LLMs learn.
It's still so different that it feels like we are just
scratching the surface of the possibilities to me.
But maybe, maybe I'm a little too out in my sci-fi here.

(50:37):
No, I think I, I agree. Yeah, like there's definitely a
better training algorithm that we haven't found yet.
And that might be also even more important than changing the
architecture, really, like RL, for example, right, where it's like
we, we basically go and say, oh, like
the model succeeded on this route.
Let's upvote everything it does in this route, regardless of
whether all the steps to get there actually worked as well as

(50:59):
they ought to. And I mean, we see, I look at
like, I don't know, silly videosof, of kids where they, they all
have to reinvent crawling from first principles in order to
begin to move about the world. And sometimes they do it a
little differently. Some people roll, some people
are kind of scrabbling around before they, they start

(51:19):
crawling. So, yeah, I, I don't know,
there's such an interesting open opportunity here around
learning. And then I will say it makes
this field so fascinating because not only are we learning
all the time, but we're, we're basically trying to teach
machines to learn, right? So it's it's it's a lot of fun
and I'll say I've, I've learned a lot just talking to
you here. So thank you so much for coming

(51:41):
on the show. It's been a distinct pleasure.
We will, we will link your LinkedIn, everything else in
the show notes here. I will definitely link your
Hugging Face profile as well. I know you've got quite a few
things on there. And of course, the LLM Engineer's
Handbook. Listeners, if you enjoyed this conversation
and it hopefully delivered the technical depth you're looking
for, let us know. Please drop a comment on

(52:02):
LinkedIn when you see this posted, or Spotify, telling
us what resonated, or YouTube, and if there are other deep dives
that you want to see on Chain of Thought.
We are listening. Maxime, we're going to have to
have you back sometime too because this was a ton of fun.
Thanks so much for joining us. Thanks a lot.