
July 2, 2025 · 46 mins

In this episode, we sit down with Joey Conway to explore NVIDIA's open source AI, from the reasoning-focused Nemotron models built on top of Llama, to the blazing-fast Parakeet speech model. We chat about what makes open foundation models so valuable, how enterprises can think about deploying multi-model strategies, and why reasoning is becoming the key differentiator in real-world AI applications.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Jerod (00:04):
Welcome to the Practical AI podcast, where we break down the real-world applications of artificial intelligence and how it's shaping the way we live, work, and create. Our goal is to help make AI technology practical, productive, and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're

(00:24):
in the right place. Be sure to connect with us on LinkedIn, X, or Bluesky to stay up to date with episode drops, behind-the-scenes content, and AI insights. You can learn more at practicalai.fm.
Now, onto the show.

Chris (00:48):
Welcome to another episode of the Practical AI Podcast. I am your host, Chris Benson, and today we have a wonderful guest from NVIDIA. We've had some other guests from there along the way, as everyone knows. And today I would like to introduce Joey Conway, who is the Senior Director of Product

(01:09):
Management for AI Models at NVIDIA. Welcome to the show, Joey.

Joey (01:13):
Yeah, thanks, Chris. Good to be here.

Chris (01:15):
I'm looking forward to it. I know we're gonna talk about a couple of recently announced models that you guys have put out there. But before we do that, I always like to get a sense of your own background, how you came to NVIDIA, and specifically this particular area of work. I'd love to know how you

(01:36):
got into this and what the special sauce is in what you do.

Joey (01:40):
Yeah. From my background, I've done some software development in the past and also some product management. And looking at opportunities, say maybe ten years back, at exciting things in the future, one thing I was personally excited about was machine learning and AI. And looking at opportunities,

(02:03):
NVIDIA, and this is almost a decade back, was at a great spot: they were involved in many things, and things were just getting started.
So I had a great opportunity to join NVIDIA. And being here, the company works on all sorts of amazing technologies. One space that our team has focused on has been

(02:24):
essentially the non-vision workloads. And so we started many years back with things like BERT and NLP, and maybe simpler types of language models that could do classification of intent and those types of things. We've been on this journey for a while, and we've been excited that there's

(02:44):
been great research and breakthroughs in the last, say, five years that have made, we'll say, exponential improvements and brought it to a much more mainstream type of use case.
And so the background on my side is being familiar with software development and comfortable with new technologies. And then the

(03:04):
excitement of new opportunities and places to grow, NVIDIA has been very well positioned for that. So it's been a few factors coming together at the same time. And if you had asked maybe five or six years ago, when we first started on some of this journey, I probably wouldn't have guessed we'd be at such a great inflection point as we are now. But I think

(03:25):
we're very excited to be here, and there's a lot of fun stuff happening we can talk about.

Chris (03:29):
Gotcha. So today I'd like you to introduce to the audience the two models that were announced. But if you could, frame them a little bit in the current landscape of open foundation models and where AI research is at this point, and why NVIDIA

(03:51):
is putting these models out at this time. What is it about them that's different from all the other stuff out there? And why have you made some of the choices in terms of open versus closed, things like that?
So if you would, tell us about these models.

Joey (04:07):
Yeah. And I'm happy to start with the landscape, where the world is at, and I can give a little bit of context there too. So on the NVIDIA side, we've been working on publishing models and open-weight checkpoints, and to some degree datasets, for many years now. It's been quite a while, five, six, seven years, probably even longer. And we've trained

(04:30):
many large language models as well.
I think the first one, and I'm trying to remember the formal name, was Megatron-LM, or Megatron NLG; there were a few variations of it. But that was probably four or five years ago. And we do it for a few reasons. One is we want to understand how to take the best advantage of our infrastructure,

(04:51):
from compute to storage and networking. We also wanna prove out the software stack and make sure the software runs great. And so we do that ourselves; we learn a lot along the way and we can make improvements. And then we also do that because we want the community to benefit and learn. And so we publish all that software, those techniques, the papers, and we do that so

(05:12):
everyone else has higher confidence and can start from a better beginning spot than we did. We've been doing that for many years in many different domains, things like speech or transcription, large language models, and even simpler, smaller language models like BERT.
So we've been doing that for quite a while. In parallel, there are lots of companies in the space, and they

(05:34):
all have different business models, and our goal is to support them. And so there have been a few big moments in the language model space. I'd probably say BERT was a big one, quite a few years back now. That was an inflection point where language models could do essential classification tasks that, previous to that, we weren't

(05:56):
able to do.
And so being able to parse out language from people typing or speaking, and being able to help understand what they want, what they're looking for, what types of actions they're asking for, that was a great breakthrough moment. We're very happy with that. We've published lots of software to help support it, make sure it runs efficiently on our infrastructure,

(06:17):
and people can benefit from it. I think probably another big moment for the world was ChatGPT. We were super excited to see all that happen, and OpenAI is a wonderful partner.
And so when that happened, it was an inflection point where many people started to realize the capabilities of what was possible. And there was amazing research that went in behind that. And so that was another big milestone that

(06:39):
happened along the way. And as each of these occurred, we were always asking how we can help, how we can help more people take advantage of the technologies and benefit from them.
And so as that ChatGPT moment happened, many companies were starting to ask how they could take advantage of these technologies. And as we spent time working in that space, we

(07:00):
love to support our partners and we think it's great for companies to use them, but we started finding that there were some scenarios where not every company could use some of the solutions out there. Scenarios where, say, a company has proprietary intellectual property that they don't want to leave their premises, that they need to keep on premise. There might be scenarios where they want control over the

(07:22):
model, the model architecture. They want control over the data that goes into it, as well as what they fine-tune it on.
And so in those scenarios, there were a lot of open source contributions a few years back. Many companies were training foundation models, and we were really excited. And so we did our best to support that, both in the software we published to

(07:43):
make sure it runs well and all of these partners can build the best models. We also do some of that ourselves, to make sure that we're not just publishing things for others but using it ourselves to make sure it runs well too. We're always trying to stretch the scale of things going forward, always trying to push the limits of what's possible.
And so in that broader effort of pushing the limits,

(08:05):
what we found is that there are opportunities for us to contribute. Say new infrastructure comes; we can sometimes be the first to show people how to do that. In terms of the large-scale contributions we've been making over time, that's one of the incentives and reasons we have to keep participating in the space. And so going forward from the moments of, say, two years back, when lots of companies and

(08:28):
partners were publishing open models, to where we are today, the biggest breakthrough we've seen happen was probably around January or February, in terms of open-weight models now supporting reasoning capabilities. And this came through DeepSeek, as one of the leaders in this space, being able to add reasoning capabilities, meaning we can

(08:49):
take complex queries and now start to break them down, think through them, and come up with answers that previously we couldn't.
Previously, we often just had one question in and one answer back out, and we had to be fast at it. Now with reasoning, the models can take some time and think about it. And so that's probably the next big milestone that we're really excited about

(09:10):
seeing, and one of the main reasons that we're publishing models at this juncture: wanting to help move that technology forward.

Chris (09:19):
I'm curious. That was a fantastic answer, by the way, and there was so much there that I'd like to dive into with you. The very first thing: we're hearing about reasoning a lot now from various organizations. Glad to hear it from you. But I think it's one of those phrases

(09:40):
that raises the question: in the context of generative models, what does reasoning mean?
Could you talk a little bit about what the word reasoning means in this context from NVIDIA's standpoint? And how does reasoning in that context differentiate from some of these really powerful models we're seeing from NVIDIA and other

(10:02):
organizations, models that have been able to do amazing things but weren't necessarily classified as reasoning models?

Joey (10:10):
Yeah. I can give a few answers here. There is a little bit of variation in definitions among the community, but I'll try and share where I see the most consensus. So maybe I'll go back a few stages.
Say going back four or five years, when we had these models, we'll say GPT-type architectures, that were

(10:30):
autoregressive, meaning thatthey would go through a loop.
And so they generate one word ofa sentence and then they feed
that back in, generate the nextword and the next word. And this
was kind of the technique theyused to be able to generate
paragraphs and kind of this longgenerative content that we
hadn't seen before. And and thisallowed them to write sentences,
to write stories. And at thatkind of original juncture, the

(10:54):
challenge we had is that it would just do this next-word prediction.
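To make that loop concrete, here is a minimal sketch of greedy autoregressive decoding with Hugging Face Transformers; the model choice is illustrative, and any causal language model works the same way:

```python
# A minimal sketch of greedy autoregressive decoding.
# Model name and prompt are illustrative; any causal LM works similarly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
for _ in range(20):                      # generate 20 tokens, one at a time
    logits = model(ids).logits           # forward pass over the whole context
    next_id = logits[0, -1].argmax()     # pick the most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # feed it back in

print(tokenizer.decode(ids[0]))
```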
We had struggles knowing how to control it, how to direct it, how to guide it, and how to keep those answers at a high accuracy. And so one of the great breakthroughs that OpenAI achieved with ChatGPT, which the world got to experience, was that you could give the model better guidance and directions
(11:16):
and it would adhere to them. Everyone was very impressed and excited about some of these techniques, like alignment and reinforcement learning. It was a wonderful breakthrough, and I think we've all benefited from that technology.
That next stage then allowed us to take these models and, instead of just doing the next token, they now would actually stay on the topic of what we asked. And so they could follow directions. If

(11:38):
you said, now take your story and do it in a bullet-point format, or do an intro, a body, and a conclusion, it would now actually do that, instead of just giving you the next word and a big paragraph. And so that was one of the big breakthroughs along the way. As of this year, the big breakthrough is reasoning. The way we think about this is that there are these sets of questions or challenges that

(11:59):
models have been able to solve up to today.
What we see with reasoning is there's a whole other set of questions and challenges that we couldn't previously solve. Here's the rationale behind it, and then I'll go into some examples. Previously, when we would interact with a model, we would either give it a prompt, so you give it a question, or you could give it a few examples in the

(12:23):
prompt and then a question. You could say: I want to do math; here are some examples of how math works, and now here's a math problem.
And the models were pretty good at that. What we've seen, though, is that the more complex the questions are, the more difficult it was for the model to solve them in the first pass. And so often people would just give it these complex queries,

(12:45):
say a word problem, the classic of two trains coming toward each other at different speeds. And there's reasoning you have to walk through: the first train is at this speed, the second train is at that speed, what are their directions, what is their rate? The ability to walk through and ask the four or five sub-questions needed to answer that question, the models weren't very good at doing that.

(13:07):
And often what we would have to do is break down the question into sub-questions, either manually ourselves or with another model, and then try and find ways to answer each of those sub-questions with different models. There was just a lot of manual stitching and ad hoc work that had to happen. With reasoning, the big breakthrough is that we're now able to train the models, at

(13:28):
training time, on that skill: we show them, here's a question, and then here are the different ways to break it down. These are sometimes called reasoning traces, where we show that there are multiple ways to solve it, and we give all those examples, and then we give the answer. Previously, it was very much focused on here's the question, here's the answer, and that's how we teach.
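As a hypothetical illustration of the difference (this schema is invented for the example, not the actual Nemotron dataset format), a training sample with a reasoning trace pairs the question with the worked steps, not just the final answer:

```python
# Hypothetical training sample with a reasoning trace (illustrative schema,
# not the published Nemotron dataset format).
sample = {
    "question": "Two trains 300 km apart head toward each other at "
                "70 km/h and 80 km/h. When do they meet?",
    "reasoning_trace": [
        "The trains close the gap at a combined rate of 70 + 80 = 150 km/h.",
        "Time to close 300 km is 300 / 150 = 2 hours.",
    ],
    "answer": "They meet after 2 hours.",
}
# Contrast with the older question -> answer-only style:
legacy_sample = {"question": sample["question"], "answer": sample["answer"]}
```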

(13:49):
But it kind of makes sense if you think about how people learn. When you're doing math problems, it's always good to get the right answer, but sometimes it's even better to understand how to solve it, as opposed to just getting the right answer. And so that's been the big breakthrough with reasoning: we can now teach the models a way to think through complex problems, and they can give not just the right answer but all the supporting thought and process used to

(14:12):
reach that right answer. And that applies to scientific domains, say biology, chemistry, physics; it applies to math; it applies to software development. It should apply to the majority of domains.
It's the next tier of challenging problems that we haven't been able to solve very well. And in the open space,

(14:33):
these are a big breakthrough, in terms of both the data and the techniques of how to teach the model, as well as the model capabilities themselves and the final checkpoint that people can download and use.

Chris (14:45):
So that was pretty fascinating from my standpoint to hear that laid out. I think that's the best explanation I've ever heard of what reasoning is in this modern context. Could you dive in a little bit to introducing the actual models themselves? And for each of them, describe how they fit into the ecosystem and what

(15:07):
they're trying to solve. I know at least one is a reasoning model. And talk about why them, versus some of the others within their subgenres, ecosystem-wise.
Could you go ahead and introduce the models themselves?

Joey (15:23):
Yeah. In terms of what we've worked on at NVIDIA and where we wanted to contribute, there's a great, thriving community of open-weight models. And there are many great partners out there, from China to the US to Europe to other parts of Asia. We're very excited to see these ecosystems growing.
And where we wanted to focus at NVIDIA was on some of the

(15:46):
harder problems that we knew would be difficult for people to solve, and to do it in such a way that we could benefit all of the community. And so on this path of seeing the growing capabilities of open-weight models, we tried to think through what all the techniques and skills could be to create even better reasoning models. And so our focus was: we

(16:08):
wanted to be able to take the best of what's open and make it better. One of the themes you'll see as we go forward is that we're constantly evaluating what's best in the open community and how we can improve it. So leading up to how we decided to publish these models, I'll share at a high level some of the key techniques that went into thinking about

(16:29):
where we could contribute, and then I'll explain the models that we published and why we did that.

Chris (16:35):
That'd be perfect.

Joey (16:36):
Yeah. Great. So in thinking through what's out there in the community, what we realized is that while there are some great open-weight models, often the datasets aren't necessarily all open, the tooling isn't necessarily all there, and the techniques aren't necessarily as transparent or as published as they'd need to be for everyone to reproduce them. So in

(16:57):
thinking through these challenges, we took the life cycle of creating a model. Say you start from some data, you start from some architecture, and at the end you produce a checkpoint that people can deploy, use, and gain value from in a business setting.
And so walking along that journey, we saw things on the pre-training side, where there's usually a large set of unlabeled

(17:19):
data and we're teaching the model general knowledge and skills about the world, languages, and content. On that side, there's not as much open data, and the techniques for how to do it aren't as public or as published as we'd like. That was one place we thought about.
The next stage we thought about was: once there's this base

(17:42):
model that has some set of knowledge of the world and some capabilities, often people then take it and fine-tune it, or make it an expert in how to interact with people or how to solve a certain domain of problems. And so we wanted to focus heavily on what we felt was important to enterprise companies. And some of the skills that we decided would be

(18:02):
really helpful for the community and for enterprise companies, which is our focus in this space, enterprise adoption and growth, were things like being able to work through scientific question-answer problems, things like math, coding, tool calling and instruction following, and then being conversational.
Those are some of the key places we felt enterprises

(18:25):
would benefit from the most. And the flip side of those challenges is that in the enterprise setting, they're very much looking for the most correct answer. They wanna avoid hallucinations or incorrect answers. They want the models to follow directions: if I'm asking for three bullets, I want it in three bullets, not five, and not a paragraph.

(18:45):
And then also on the scientific question-answer side, there's a whole domain of companies working on things like drug discovery, or other quite technical domains, where they have complex problems that can benefit from the reasoning capabilities of the model: being able to think through, having more time to run these inference calls, and reflecting and progressing through complex challenges. So

(19:08):
those are the capabilities and skills on the accuracy side that we wanted to make more available to the community. And then on the infrastructure side, we knew that these models are super capable. And with reasoning, another challenge we introduced is that it requires more compute and more iteration. Every time a token is generated, it takes some compute.
And when the model thinks, it generates more tokens. And so the

(19:32):
challenge here is that the more the model thinks, the more compute and potentially more expense there is. But the upside and breakthrough is that we can now answer more difficult problems that we couldn't before. So we wanted to think through how to optimize the model to be more efficient on the compute side, so that as we spend more time reasoning, we don't grow all the expense for the end customer when they want to solve

(19:54):
more complex problems. And so those were the key challenge sets we were thinking through.
And so as we went on this journey with the Nemotron family of models, what we published, starting back in March, celebrating the beginning of this venture, is what we're calling Llama Nemotron, meaning that we

(20:15):
started from a base Llama model and then used the best of the open models and datasets in the community. So we pull data from many of the public models, things like Mistral, things like our own Nemotron, as well as things like DeepSeek and Qwen, where there have been amazing breakthroughs in the open community. We used those to gather the best data and the best knowledge, and then took some of the state-of-the-art

(20:37):
training techniques in our software stack that's open and available, called NeMo Framework. And we were able to take the Llama models, improve their capabilities and skills for reasoning, and publish and win many of the leaderboards in those domains. Along the way, some of the other work we did was shrinking the model architecture.

(20:58):
So, what we call neural architecture search: Llama used an amazing and quite common and popular transformer architecture, and there were ways we were able to essentially shrink that model architecture while keeping the accuracy the same. That allowed us to reduce the cost and the compute footprint a bit as well. So

(21:18):
at the same time that we introduced reasoning and made the model more capable, which also slows it down a bit, we were able to shrink the model architecture to try and keep that speed as quick as we could. And at the end, we published a family of three models. We have what we call the Nano, generally a quite small model that would fit

(21:39):
on maybe a smaller data center GPU. Then we have the Super, which fits in the middle; that fits on one of the more common large-scale data center GPUs, like, say, an H100 or an A100. And then we have the Ultra, the third of the family.
The Ultra fits within one node, so eight of the H100s or eight

(21:59):
of the A100 GPUs. And the Ultra is often the model that shows the best capabilities, the state-of-the-art accuracy. The Nano and the Super are often where we see most people begin, put things into production, and build and fine-tune on top of. And as we published these three models in this family, we also published the data we used to post-train them.

(22:23):
All of that data we made open source and available; it includes all the math and the scientific question answers, the chat, the instruction following, reasoning and non-reasoning. One clever thing we set out to do here: prior to this, the models that were open were either reasoning or non-reasoning, and they were separate models. And we could

(22:44):
empathize with enterprises that deploying two models is twice as much work as deploying one. And so one thing we did when we first published these was put them into one model. That way you can ask the model: can you reason through this? This is more complicated; I'm willing to spend the time, wait, and invest in the answer. Or: this is a super simple question, like what's two plus two? You don't need to reason; just give

(23:05):
me the answer and don't spend the compute on it. And so we published the datasets to support that capability, as well as the model checkpoints.
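On the published Llama Nemotron checkpoints, that toggle is exposed through the system prompt. Here is a sketch of what calling it can look like; the "detailed thinking on/off" string follows NVIDIA's model cards, but treat the endpoint, model id, and prompt as assumptions to verify against the card for your checkpoint:

```python
# Sketch: toggling reasoning on a combined reasoning/non-reasoning model.
# The "detailed thinking on/off" system prompt follows NVIDIA's Llama
# Nemotron model cards; endpoint and model id below are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # assumed local server

def ask(question: str, reason: bool) -> str:
    resp = client.chat.completions.create(
        model="nvidia/llama-3.1-nemotron-nano-8b-v1",  # illustrative model id
        messages=[
            {"role": "system", "content": f"detailed thinking {'on' if reason else 'off'}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("What is 2 + 2?", reason=False))             # fast path, no trace
print(ask("Two trains 300 km apart...", reason=True))  # spends tokens thinking
```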
And then some of the software we used inside NeMo Framework, things like NeMo RL: there are training techniques inside there that we also published. So all of this made up the family of models, data, and tools that we published under the

(23:27):
umbrella we're calling NVIDIA Nemotron.

Chris (23:29):
Gotcha. And just for reference, to give people a sense of size and what GPUs fit, we often talk about parameter counts. Could you give the parameter count for each of the three versions? Are we talking, like, 8 billion for Nano, or something like that? I'll let you run with that.

Joey (23:48):
Yeah. I'll tell you where we're at today and a little bit about where we see things going. So where we're at today: for the Nano, it's an 8-billion-parameter model. We do have a smaller 4-billion variant we just published, but we expect to stay at the 8-billion parameter size when it's a dense model architecture. And the rationale there is that we're targeting, say, a 24-

(24:13):
gigabyte NVIDIA GPU, roughly, memory-capacity-wise. And in that size range, we wanna maximize the accuracy capabilities. And so likely around 8 billion dense is probably where we're gonna stay. Yep. On the Super side, we're targeting the more common, larger data center GPUs, like the H100 with 80 gigs capacity, or the A100 80 gig.
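The rough arithmetic behind those pairings, as a back-of-envelope sketch (weights only; real deployments also need headroom for KV cache and activations):

```python
# Back-of-envelope GPU memory math for dense models (weights only;
# KV cache and activations need additional headroom).
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_gb(8, 2))    # 8B at FP16/BF16 -> ~16 GB, fits a 24 GB GPU
print(weight_gb(49, 1))   # 49B at FP8      -> ~49 GB, fits an 80 GB H100
print(weight_gb(253, 1))  # 253B at FP8     -> ~253 GB, ~4 x 80 GB GPUs (half a node)
```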

(24:36):
And so in that space, we expect probably around 50 billion parameters for a dense model will be the best fit, and we published a 49. So we'll likely stay in that ballpark going forward. On the Ultra side, what we published, and I should mention these are all variants of Llama: so the Nano is an 8B. We distilled down to a 4B, but

(24:59):
we realized the limits of reasoning capabilities at a small scale. There are some challenges there, and so the 8B does do quite well. On the Super side, we started from the Llama 70B, which was a great size, but we wanted it to fit on one GPU, and so we distilled that down to the 49B. And on the Ultra side, we started from Llama's 405B from last summer, which,
(25:20):
running at, say, FP eightprecision does roughly fit
within one node. But our goalwas to see if we could shrink it
and maintain the accuracybecause one node is still quite
a large deployment footprint.And so with our ULTRA, we have
two fifty three billionparameters on the dense side.
And so that fits in roughlyfour, so about half a node. And

(25:41):
so we're excited about those breakthroughs, because it does relate to the cost it takes to run the model, and we're achieving the same, if not better, accuracy from what we built on. Going forward, there will likely be some changes in this space. There's work that NVIDIA has published on the research side recently around hybrid dense

(26:03):
architectures, where there are techniques around, say, SSMs, or Mamba-style architectures, where we can make the output generation much more efficient. And we expect that with reasoning, with the longer generations of reasoning traces and the ability to think, that output generation will continue

(26:23):
to be more of a challenge.
And so we'll likely expect to see on our side, say, a 10 to 15% throughput speedup on the output generation going forward, in newer iterations of this, using some of the latest research. And then the other big exciting thing we're looking forward to is on the mixture-of-experts side: we
(26:44):
expect that at the very large scale, likely around the Ultra size range, where we've seen a lot of the community, say Llama 4, DeepSeek, and Qwen, all using mixture of experts, especially at the large scale, that will be a new trend going forward. And we think we'll

(27:06):
probably also be participating in that space, because at the very large scale, mixture of experts allows us to get great accuracy and also to be more inference-efficient at that larger scale.
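For readers newer to the term, here is a minimal, illustrative sketch of top-k mixture-of-experts routing (not the Nemotron architecture): each token is routed to only k of N expert networks, so total parameters grow while per-token compute stays roughly fixed:

```python
# Minimal top-k mixture-of-experts layer (illustrative sketch in PyTorch).
import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, dim=512, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                       # x: [tokens, dim]
        scores = self.router(x).softmax(-1)     # routing probabilities
        weights, idx = scores.topk(self.k, -1)  # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):              # only k of n_experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

y = MoE()(torch.randn(4, 512))  # route 4 tokens through the layer
```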

Chris (27:15):
I'm curious, as you've talked about building off of that Llama 3.1 base as you go: are you and the Meta teams that produce Llama targeting the same types of features and performance metrics going forward? Because there are so many different places

(27:36):
to allocate the effort. Are you very much in alignment, or do you find yourselves deviating a bit from Meta, as two large corporations that are partners, working together, both producing the same line of open source things, at least at a base level? How

(27:57):
does that work?
And is there any collaboration with Meta, or do you each just say, I'm gonna go do my own thing and build on top? Because they had built the best base so far for what you guys wanted to build off of next.

Joey (28:10):
Yeah. Meta is a great partner, and so we do work really closely with them in lots of different ways. We've been very excited about all the Llama work, and they did have a conference, LlamaCon, probably a month and a half ago now. And we're very supportive.
In their keynote, you'll see there is a slide on Llama Nemotron, celebrating some of the

(28:30):
collaboration and achievements. So there's definitely overlap, and those are the places where we try and collaborate as much as we can. They're also very focused on some of these challenges, like reasoning and some of these enterprise use cases. And so we're always excited to see the next iteration of Llama, because it gives us an even better starting point to think about

(28:51):
where else to contribute. So going forward, I expect that will continue to be a great collaboration.
We're always excited for the next versions of their models to come out, and we celebrate them, both in our software stack, making sure they run efficiently, and in helping enterprises deploy them directly. And then we try, on the Nemotron side, to see what else we can contribute from the rest of the community, and what techniques and

(29:14):
breakthroughs we can make. So some places where we might see differences going forward could perhaps be in the model architectures. Those could be places where different research breakthroughs come at different points in time, so there might be timing differences there.
In terms of accuracy or capabilities, generally speaking, we're looking at very similar types of

(29:36):
achievements. And so I think that will feel more like incremental growth, say every few months. And that will be a place where we publish all the data, so we make it in such a way that everyone can benefit from it. And so I expect going forward we should see more achievements.
Beyond Llama, part of our effort, and we did have a

(29:56):
conference last week in Europe, in Paris, and there we announced partnerships with a handful of model builders over in Europe, a little over 10. And so our goal over there is also to try and enable a similar ecosystem, where there are many different languages, cultures, and histories in Europe. And so what we'd like to see happen, and what our partners

(30:19):
over there are super excited to invest in and do, is take some of these models and these techniques and datasets and, say, bring reasoning to Polish, or to different languages in the regions there, where some of these are more nuanced and complicated. They have the history and the culture, and we have the general skills. And so going forward, we

(30:39):
expect to see a lot more of that out in the community, where people in certain countries, certain languages and cultures, can benefit from a lot of the breakthroughs that happen in English first, in such a way that they can bring those skills over. Some things are generally transferable: math, generally speaking, is pretty consistent across languages. Software development is another one of those. We're pretty

(31:01):
optimistic that the work that'shappened in English and the
datasets we publish should beable to help, say, bootstrap, so
to say, other languages and getthem up and going. Each of those
countries and domains havepoints that they can celebrate
and places that they can adoptand different challenges or
obstacles, say scientificquestion answer in Polish that
they're trying to work through,for example. So I think that'll

(31:23):
be the other place we expect to see a bunch of growth, and we're excited about it.

Chris (31:26):
Alright. So, Joey, that was a great introduction to the models, laying them out. And to build on that a little bit as we get more in-depth on them: it is often cast in the industry as competition, and maybe, depending on the organization, it is. But there's also, as you've laid out very well, a clear sense of

(31:49):
partnership across organizations here. So say you're someone listening to this right now, you're very interested in Nemotron, and maybe you already have Llama 3.1 deployed in your organization. And you may have the proprietary ones.
You may have Gemini or ChatGPT or whatever deployed

(32:13):
as well. With the models that you have produced here, how should people think about that? Obviously, progress keeps being made, and models build on each other, and so I think everyone's quite used to the fact that you're iterating on the models that are deployed in your organization. But now,

(32:35):
as you are looking at Nemotron, you may have the Llama model. Where should they be thinking about Llama?
Where should they be thinking about Nemotron? Where might they think about other things? How do you fit into someone's business today, when they have all these different proprietary and open options available? What kind of guidance would you give on that?

Joey (32:55):
Yeah. I'll give two answers. First, I'll talk about how we generally think of evaluating models and understanding capabilities. Then second, I'll answer specifically for Nemotron. Generally, the mental model we encourage people to have is to think about models as something like a digital employee.

(33:16):
There's a set of skills and capabilities that they were taught, that they were trained on, and things that they're really good at. And those could be from, say, OpenAI or Gemini or Claude; there are amazing models out there. They could be from Llama; they could be from Mistral, Qwen, DeepSeek. There's a whole variety of options.
And the way we think about it internally, and where we

(33:36):
encourage our customers to think about it, is that all these models were trained on different datasets, different sets of skills. There are things that their publishers are proud of and excited about. And the main challenge often is for companies to understand where these models are great, and then match them up with where their internal opportunities are to use them. I

(33:58):
think that's the bigger exercise: knowing these iterations will keep happening, we really want enterprises to get comfortable with this discovery and opportunity-fitting process. To do that, we have software called NeMo Microservices that we've been publishing, where there are some evaluation techniques and tools, and some ways for enterprises to take internal data and create evaluation sets

(34:20):
out of it.
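A library-agnostic sketch of that "interview a model" idea, scoring candidate models against a small evaluation set built from internal data (the scoring rule and the stand-in model below are placeholders, not the NeMo Microservices API):

```python
# Sketch: "interviewing" candidate models on a small internal eval set.
# `ask_model` stands in for each model's API call; this is not the
# NeMo Microservices interface, just the shape of the exercise.
eval_set = [
    {"prompt": "What is 12 * 7?", "keyword": "84"},
    {"prompt": "Name the capital of France.", "keyword": "Paris"},
]

def score(ask_model) -> float:
    """Fraction of eval cases whose answer contains the expected keyword."""
    hits = sum(case["keyword"] in ask_model(case["prompt"]) for case in eval_set)
    return hits / len(eval_set)

# Toy "model" to show usage; swap in real API calls per candidate model.
print(score(lambda p: "Paris" if "France" in p else "84"))
```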
And so that's a great place where we hope to see more people invest, because just like you interview an employee, looking for a set of skills and capabilities, you should be able to interview models. And so we're hoping that's something people will become more and more comfortable with over time. And then the second piece, to talk about Nemotron: the places that we're

(34:42):
really excited about Nemotron are gonna be around enterprise agentic tasks. So if there are scenarios where you're looking at things like complex tool calling, or scenarios where you have more complex queries that will benefit from the ability to reason through, meaning you have a query that might require answering from different data

(35:02):
sources, or from using, say, a calculator plus a search plus a data retrieval.
In those more complex scenarios, we're very excited that Nemotron should be one of the best models out there. The other things we would encourage people to think through are where you're going to deploy it. If you have constraints around the data, or constraints around your compute,

(35:24):
maybe it has to be on premise, or it has to be in a certain geographic region, or there are regulatory constraints, then I think the Nemotron family of models gives a lot of flexibility in being able to move where you need them, whether that's on prem or across different types of cloud deployments in different types of regions. And so those are probably the two key places where we would encourage people to think

(35:44):
through using them. There often are places where we see many enterprises using multiple models.
And I think that's often the way we encourage people to think about it, because generally people think, oh, I'm using OpenAI, I'm all set. And then they don't realize that there is maybe a different set of problems, or a different set of challenges, where there could be another solution to use in

(36:04):
addition. And so our view is that we expect the use cases and opportunities to grow. We don't view this as a fixed pie; every day we see more and more places that models can solve for, and more and more opportunities to grow.
And so we expect that, in the end, there'll be a world where there are many different models all working together on different tasks, and enterprises can find the models that work

(36:26):
best for them. They might even take, say, a Nemotron model and fine-tune it. They might say, hey, here's a task that it's really good at, say tool calling, but I actually have all of my own internal APIs, my own internal tools inside my company, and I need it to be an expert at those. And so they can take some of the dataset we published, mix it with the dataset they can create using some of the NeMo software, and

(36:47):
then fine-tune it.
And then this variation of Nemotron becomes their expert at tool calling across their internal domain of tools. And they could still use that in a workflow with, say, OpenAI or Gemini. So I think we see a world where all of these models get used together to help solve business problems and outcomes.
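A hypothetical illustration of that mixing step (the sample schema and tool names are invented for the example, not the published dataset format): combine public tool-calling data with samples covering your own internal APIs, then fine-tune on the union:

```python
# Hypothetical mixing of public tool-calling data with internal-API samples
# before fine-tuning; the schema and tool names are invented for illustration.
public_samples = [
    {"prompt": "What's the weather in Austin?",
     "call": {"tool": "get_weather", "args": {"city": "Austin"}}},
]
internal_samples = [
    {"prompt": "Open a sev-2 ticket for the checkout outage.",
     "call": {"tool": "acme_ticketing.create",            # your internal API
              "args": {"severity": 2, "summary": "checkout outage"}}},
]
train_set = public_samples + internal_samples  # fine-tune a checkpoint on this
```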

Chris (37:03):
I love that. I think that's great. I think that is
where we're going. But I think alot of organizations that aren't
AI global leader, organizationslike NVIDIA and stuff, are
trying to find their way intothat. They've kind of gotten
into using a model or maybe acouple of models, and they're
they're working on that kind ofAI maturity level of how do they

(37:25):
get their internal processes aligned with this multi-model future that we have.
So I think there are a lot of stories unfolding in that arena. One of the things I wanted to bring up real quick, not to deviate necessarily off of Nemotron: I know you guys also have a new speech model called Parakeet. And I was

(37:47):
wondering if you'd talk a little bit about that as well, and share what that is and where it fits in.

Joey (37:53):
Yeah. Thank you. We do quite a bit of work, and there's a lot of research that comes out of NVIDIA, and it varies across model architectures, types, use cases, and datasets. In the speech space, we've been working for quite a long time as well. And in the transcription domain, the challenges have often been: can we transcribe the audio accurately

(38:15):
across different accents and dialects, across different languages, and can we do that very fast and efficiently?
In terms of what we've been publishing, we've been on that journey for many years, and there's a great leaderboard on Hugging Face, I think called Open ASR, with an English dataset, an English use case. And we've been working very

(38:37):
diligently over time to keep improving the models that we publish there. And so I think you'll usually see us in the majority of the top 10 with different variations of models. And often we trade first place with other companies; we're happy to see the community pushing things forward, and we're going to keep working on that. But one of the latest breakthroughs we've had in that space that we've been

(38:58):
excited about is on the Parakeet side: there are some architectural improvements that have made a significant leap forward for us, so to say. And to talk for a minute about those, I can go into a little bit of technical depth here.
On the Parakeet side, essentially it's based off of a

(39:19):
fast conformer architecture, which improves on the original conformer from Google. What we're excited about is that, in terms of the model architecture, and you'll see us doing this with LLMs too, we always explore the model architecture space in terms of what's most compute-efficient on the GPU. And so on the Parakeet side, there are changes we made to

(39:39):
the way we do depthwise separable convolutional downsampling, essentially meaning that at the start of the input, there are clever ways to shrink that input so we can cut some of the computational cost as longer audio segments get streamed in, and we can keep the memory down. In doing that, we're able to see, roughly, in aggregate, a two to three x

(40:03):
speedup in inference, meaning we can ingest two to three times more audio in the same amount of time and transcribe it without reducing the quality of the transcription. And then there's other work we've done in there.
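An illustrative sketch of depthwise separable convolutional downsampling in PyTorch (not Parakeet's exact layer): a depthwise conv filters each channel independently, a 1x1 pointwise conv mixes channels, and stride 2 halves the time axis, which is where the compute and memory savings on long audio come from:

```python
# Illustrative depthwise separable conv downsampling (PyTorch), not the
# exact Parakeet layer: depthwise conv (groups=channels) + 1x1 pointwise,
# with stride 2 halving the time dimension.
import torch
import torch.nn as nn

class DepthwiseSeparableDownsample(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size=3,
                                   stride=2, padding=1, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):            # x: [batch, channels, time]
        return self.pointwise(self.depthwise(x))

feats = torch.randn(1, 256, 1000)   # 1000 audio frames
print(DepthwiseSeparableDownsample()(feats).shape)  # time axis -> 500
```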
There's a whole bunch of clever work in there, around things like changing the attention window to make it more global. And then there's work we've done around some of the frame-by-frame

(40:25):
processing in there, being able to properly chunk up that audio. I have a long list of great things we've done in there; I'll mention a few other things too.
There's been some work we've done on the decoder part of the model architecture. There's a set of software we call CUDA Graphs, where we're able to take smaller kernels and

(40:45):
more efficiently schedule them on the GPU. That gives us another roughly 3x boost in speed. And so at the end of this, you'll notice, especially on that Open ASR leaderboard, that the RTFx factor there, the real-time factor for audio, is quite high, especially compared to the alternatives up there. And that's because we spend a lot of time, and have a

(41:05):
lot of insight into how to do that on the GPU.
We try and do that in such a way that we can open it and publish it, so ideally other companies and partners can adopt some of those technologies and pull them into the models that they build and release as well.
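A minimal sketch of the CUDA Graphs idea using PyTorch's torch.cuda.CUDAGraph (the general capture-and-replay mechanism, not Parakeet's actual decoder code): many small kernel launches are captured once, then replayed as a single launch, cutting per-kernel CPU launch overhead:

```python
# Minimal CUDA Graphs capture/replay in PyTorch (general mechanism only,
# not Parakeet's actual decoder). Requires a CUDA-capable GPU.
import torch

x = torch.randn(256, 256, device="cuda")

def decoder_step(inp):                 # stands in for many small kernels
    return (inp.relu() @ inp).tanh()

# Warm up on a side stream, as the CUDA Graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    decoder_step(x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):              # capture all kernel launches once
    out = decoder_step(x)

x.copy_(torch.randn(256, 256, device="cuda"))  # new input, same buffer
g.replay()                             # one launch replays the whole graph
print(out.sum().item())
```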

Chris (41:18):
Fascinating. Well, I appreciate that. Thanks for laying that out. As we are starting to wind things up: I know we have already delved a little bit into the future and where things are going. But I'm wondering, from your chair, as you're sitting there driving these efforts forward at

(41:39):
NVIDIA, and you're looking out; I mean, this is probably the most fascinating time in history, in my view. There are all sorts of things going on in the world, but in the technology space, the development of AI and related technologies is just going faster and faster, broader and broader. And as you are thinking about the future, I often say,

(42:01):
when you're going to bed, or you're taking a shower at the end of the day and relaxing from all the things you've been doing, where does your mind go on this? What are the possibilities that you're excited about over the next few years? And what do you think might be possible that isn't today? If you'd just share, as a final thought,

(42:25):
your aspirations in this space, I'd really appreciate it.

Joey (42:28):
Yeah. I'll probably go up a little bit in abstraction level and then tie it back. Going forward, what we're really excited about is the idea of having a digital set of employees, a digital workforce, to help the current workforce. And so we view, going forward, the idea that we continue to have people doing great work at great

(42:50):
companies, and then augmenting and improving that work with digital employees. And so in that future view of the world, where, say, we interact with these digital employees, either for simple things, like retrieving information from complex systems across a company or doing simple data analytics, or for maybe more complex things, like being able to do, say,

(43:10):
forecasting, or helping predict things coming up in the future, I think there'll be a whole massive space around having these digital employees solve more complex tasks, and around being able to either hire them or rent them across companies.
You can imagine there are certain industries where people are experts in their domain. They might rent out digital employees to other companies who are building products with them

(43:32):
as a dependency, or with them as a partner. And so in that future world of having all of these digital employees or agents working together, backing into things like Nemotron, we view the idea of being able to improve the capabilities across single models, across many models, and across the ecosystem. All of that, in the end, helps us get

(43:53):
these more accurate and more productive digital employees. And there's a whole set of software that goes around that, not just the model, but having multiple models work together. There's a whole other set of challenges: as you have these digital employees based on these models, how do you keep them up to date?
How do you ensure they stay current, that they know the latest

(44:13):
information about your business, if your supply chain changes or your inventory changes? And so there are opportunities we're looking at around data flywheels, where we have a set of software we published a month back, called NeMo Microservices, to help people take these digital employees and keep them current and recent on interactions, enterprise knowledge, and data changes over time. But going forward,

(44:34):
we're really excited for that space, because often there are a lot of difficult or mundane types of challenges and tasks today that prevent us from getting to the things we're more excited about, or where we add more value. And I think we can all relate to that in our day-to-day.
So going forward, I expect that these digital agents or employees will be able to help us significantly get past a lot

(44:54):
of the mundane, repetitive things that we end up having to do because systems are hard, or technology is hard, or things haven't been built as well as they could be. And then we can focus more on the more exciting places, where we can move efforts forward, move businesses forward, and contribute much more to the community and the economy.

Chris (45:10):
That's an amazing vision you have there. I love that. Thank you for sharing it. You've given me, yet again, some more things to be thinking about as we finish up here. So I just wanted to thank you very much, Joey, for coming on the show, sharing your insight, and telling us about the new models you've got here.
And I hope that you will come back when you have the next

(45:33):
things that you might wanna share, and share them with our audience. Thank you very much.

Joey (45:37):
Yeah. Sounds good. Thanks for having me.

Jerod (45:46):
All right. That's our show for this week. If you haven't checked out our website, head to practicalai.fm, and be sure to connect with us on LinkedIn, X, or Bluesky. You'll see us posting insights related to the latest AI developments, and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the show.
Check them out at predictionguard.com. Also,

(46:09):
thanks to Breakmaster Cylinder for the beats, and to you for listening. That's all for now, but you'll hear from us again next week.