Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:10):
Hey everyone.
Welcome back to the EdgeVerse Techcast, your go-to spot for the latest in
NXP software, tools, and tech that brings processors and MCUs to life.
I'm your co-host Bridgette Stone.
And I'm Kyle Dando, your other co-host.
Today we're getting real about AI.
Oh no, not that cloud stuff you're busy working with, but
(00:31):
AI that runs right at the edge.
We're talking about eIQ GenAI Flow, that's NXP's new tool for
deploying large language models.
Yeah, sure.
It's the ones that you know about, and you probably thought they only
ran on those power-hungry servers in those data centers down in the desert.
But we're gonna talk aboutsomething more interesting for you.
(00:53):
And helping us unpack all the layers of quantization, tokenization, GPTs, and RAG
magic is Davis Sawyer, who spends his days wrangling AI models and LLMs and his
nights presumably dreaming in PyTorch.
Welcome Davis.
Okay folks, thanks for
that introduction, Bridgette.
My name is Davis,
(01:13):
I'm Canadian, but I grew up in Houston, Texas, and I joined NXP in 2024 to
make the intelligent edge a reality.
It'll probably become pretty obvious over the course of this conversation that I
just love working on AI-powered products.
And before NXP, I was co-founder and Chief Product Officer of an AI
model compression startup, which was acquired earlier this year.
When I'm not working on the intelligentedge, I love to spend time with
(01:33):
family and work on my golf game.
Well, we're so happy to have you today.
Thanks for being here.
I just wanna kick us off with the basics.
What exactly is eIQ GenAI Flow, and how is it making it possible to run powerful LLMs
where we never thought we could before?
Meaning at the edge?
So.
eIQ GenAI Flow (GenAI being short for generative AI, as the name implies) is an end-to-end
(01:58):
pipeline for deploying optimized, secure, and responsive generative AI applications
on embedded systems, AKA the edge.
It breaks down barriers to deploying complex models like these LLMs and VLMs,
or Vision Language Models, at the edge.
What GenAI Flow is all about is enabling the latest models like Llama
and Qwen from open ecosystems to run effectively on NXP application processors
(02:21):
and NPUs, like the i.MX 95 for example.
To do this we use quantization, which you alluded to earlier, and NPUs, or
Neural Processing Units, for acceleration.
Furthermore, the flow integrates audio tasks like wake word detection
and speech recognition with this super impressive,
domain-specific question-answering combination that is empowered by RAG,
(02:42):
which we'll break down a little bit.
The whole point is it's designed to work in constrained environments
with reduced dependency on the cloud or those power-hungry GPUs.
And most of all, it promotes privacy-preserving AI by
processing data locally, right?
And that minimizes the risk of data exposure.
So we're basically giving LLMs a passport to travel light and stay local.
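A minimal, self-contained Python sketch of the flow just described, from wake word to speech recognition to RAG-grounded answer. Every name below is an illustrative stand-in, not an actual eIQ GenAI Flow API:

```python
# A toy sketch of the stages described above:
# wake word -> speech recognition -> RAG lookup -> local LLM answer.
# All names here are illustrative stand-ins, not eIQ APIs.

KNOWLEDGE = {
    "tire pressure": "Recommended cold tire pressure is 35 psi.",
    "oil change": "Change the oil every 10,000 km or 12 months.",
}

def detect_wake_word(utterance: str) -> bool:
    # Real systems run low-power keyword spotting on audio frames;
    # here we just look for a literal wake phrase.
    return utterance.lower().startswith("hey car")

def transcribe(utterance: str) -> str:
    # Stand-in for on-device speech recognition.
    return utterance.lower().removeprefix("hey car").strip(" ,")

def retrieve_context(query: str) -> str:
    # Stand-in for RAG retrieval over local documents.
    hits = [text for key, text in KNOWLEDGE.items() if key in query]
    return " ".join(hits) or "No local context found."

def generate_reply(query: str, context: str) -> str:
    # Stand-in for the quantized on-device LLM, which would condition
    # its answer on the retrieved context to avoid hallucinating.
    return f"Based on the manual: {context}"

utterance = "Hey car, what is the tire pressure supposed to be?"
if detect_wake_word(utterance):
    query = transcribe(utterance)
    print(generate_reply(query, retrieve_context(query)))
```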
(03:02):
I absolutely love it.
So I've seen a lot of articles. I guess RAG stands for Retrieval
Augmented Generation, and from what I read, it improves how large language
models can interact with local data.
It seems like that could be the secret sauce behind making
(03:22):
edge-based AI devices smarter?
Can you do a little bit better job of explaining what RAG is for our listeners
and maybe, uh, zero in on some of the key benefits of NXP's GenAI Flow with RAG?
I'm happy to.
That's a great question, Kyle.
So in a nutshell, RAG allows LLMs to access domain-specific data
(03:43):
or private knowledge without having to retrain the model.
And that's super important.
And we'll come back to why not having to retrain benefits both the
customers or end users and individuals.
But first and foremost, it's about improving the quality of these models.
What RAG really does is help eliminate what are called hallucinations.
Basically, you know, it's not a psychotic episode, but the
LLM just gets things wrong.
(04:04):
Their outputs, their statistics, their computation gets the answer wrong.
So RAG helps eliminate these wrong answers or wrong outputs by grounding the model
in reliable, real-time data sources that a human operator can provide.
This keeps sensitive information secure by processing everything locally, right?
Like what you talked about earlier, Bridgette: that local knowledge processed
locally is a really good combination.
(04:24):
This also can support dynamic use cases.
Think of stuff you might be doing on our PCs, like document-based Q&A, but
through a conversational interface.
Like a user manual for a vehicle or the service log for a manufacturing machine.
Doing this on demand.
This also makes AI flexible and modular.
I think we'll use those words a couple of times in this conversation.
(04:46):
This means developers can plug in new documents, basically a PDF,
and get a data source as needed.
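A library-free sketch of that plug-in-a-PDF idea, assuming the document has already been extracted to plain text. Real pipelines use embedding models and a vector store; the word-overlap scoring below is only a stand-in:

```python
# Minimal retrieval step: chunk a local document, score chunks against
# the question, and ground the prompt in the best-matching ones.

def chunk(text: str, size: int = 40) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(question: str, passage: str) -> int:
    # Naive relevance: count shared lowercase words.
    return len(set(question.lower().split()) & set(passage.lower().split()))

def build_prompt(question: str, document: str, top_k: int = 2) -> str:
    best = sorted(chunk(document), key=lambda p: score(question, p),
                  reverse=True)[:top_k]
    context = "\n".join(best)
    # The grounding instruction keeps the model answering from the document.
    return (f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}")

manual = ("To reset the service light, hold the trip button for five seconds "
          "with the ignition on. Check tire pressure monthly when cold.")
print(build_prompt("How do I reset the service light?", manual))
```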
All right.
Well, here comes Kyle's silly analogy, Davis.
It sounds like RAG is the ultimate coach.
What I mean by that is, our NXP device is there and it wants to use
the LLMs, but it doesn't know how it can use that in its environment.
(05:06):
Well, RAG allows the NXP edge device to use the LLMs with the information
and the resources it has locally.
RAG can take that information, understand what's available, provide
it to the edge device, but protect it from having to retrain itself, or in my
analogy, have to get all new equipment.
(05:27):
What do you think?
Kyle, we know you like a good coaching analogy.
Perfect.
These LLMs are just coming outta grade school, and now they
need to enter the real world.
They need a good coach to do so.
I think that's what we're providing.
So Davis, let's talk reality.
Edge devices aren't exactly packing server racks.
So how does GenAI Flow shrink these massive models to fit into the power
(05:48):
and compute limits of the edge?
Great question.
One of those keywords we used was quantization. We can apply these techniques
to basically shrink the complexity of the data while preserving quality.
And this can actually reduce the computation cost significantly.
In the AI space, we love fun names, right?
So we have things like SmoothQuant at INT8 precision or SpinQuant
(06:10):
at INT8 or INT4 precision.
At INT4, that's actually an 8x reduction from the original 32-bit
training size, using these techniques.
And we take advantage of the hardware support at the NPU level, such as NXP's
eIQ Neutron NPU, to accelerate and boost the throughput or the inference
time, or if it's conversational, the response time of these models,
especially a key metric in this space known as Time to First Token.
(06:33):
The combination of this quantization plus this NPU acceleration
really, to your original question, makes this stuff a reality,
by supporting these small language models, which are a subgenre of
large language models: lighter, yet still capable, like, say,
1-billion-parameter or half-a-billion-parameter models from open ecosystems.
We offer flexible execution or inference backends, so you can use
(06:55):
the CPU or NPU, or both, depending on the power and performance needs.
Life's about trade-offs; edge AI is no different.
I think that's what we need to use to bring this to reality.
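A back-of-the-envelope illustration of symmetric INT8 post-training quantization. SmoothQuant and SpinQuant are far more sophisticated; this only shows the basic float32-to-int8 mapping and the resulting 4x size reduction (8x at INT4):

```python
# Symmetric per-tensor INT8 post-training quantization in a few lines.
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.abs(w).max()) / 127.0      # one scale for the tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)   # fake weight matrix
q, scale = quantize_int8(w)
print(f"float32: {w.nbytes / 1e6:.1f} MB -> int8: {q.nbytes / 1e6:.1f} MB")
print(f"max abs rounding error: {np.abs(w - dequantize(q, scale)).max():.4f}")
```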
So to make this real for everyone, what are the demonstrated improvements?
This isn't on paper; this is actually on real hardware, real products today.
We've gone from 9.6 seconds Time to First Token, TTFT, using your off-the-shelf
(07:15):
CPU at float32, so 32-bit precision,
to less than 1.5 seconds using the NPU at that INT8 precision I mentioned.
So you can see almost 10x in there, with just a couple of
really powerful technologies.
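For reference, TTFT is typically measured as the wall-clock gap between submitting a prompt and receiving the first streamed token. A minimal sketch, where stream_tokens is a hypothetical stand-in for any streaming inference backend:

```python
# Measuring TTFT: time from prompt submission to the first token.
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    # Placeholder generator; a real backend yields tokens as the model
    # produces them (from the CPU or the NPU-accelerated path).
    for tok in ["The", " pressure", " is", " 35", " psi", "."]:
        time.sleep(0.05)            # simulated per-token latency
        yield tok

start = time.perf_counter()
stream = stream_tokens("What is the correct tire pressure?")
first = next(stream)                # blocks until the first token arrives
ttft = time.perf_counter() - start
print(f"TTFT: {ttft:.3f} s, first token: {first!r}")
print("full reply:", first + "".join(stream))
```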
Uh, yes.
I think that our listeners are really gonna understand that, and that's what
they are listening to this episode for.
It's like, what is the tangible improvement?
(07:37):
That Time to First Token, from 9.6 to 1.4, that's ultimately
what they're looking for.
It's the difference between going to get a coffee during a break
or just the blink of an eye.
And that's the kind of acceleration that our listeners are looking for as
they're planning their new products.
The solution that you're providing with GenAI Flow will allow them to get that
(07:59):
type of performance into the architecture,
to solve all those aggressive requirements that their marketing
teams are throwing at 'em.
All right, Davis.
I think the best way to entice our listeners to consider this GenAI Flow that
we're talking about is to share with them where they can find it in the real world.
So maybe you can start with a few standout examples where you see GenAI Flow already
(08:23):
making a difference out in the wild.
Absolutely.
This is the fun one.
I'll break it down by different domains that NXP is a major player in,
and then a couple of use cases that use this technology.
So right outta the gate, one of the first implementations was an automotive cockpit
for infotainment, using an AI-driven system that can respond to voice commands
using proprietary service manuals.
(08:45):
So basically think of that big book that's in your dashboard or glove box that you
never read until you absolutely need to.
We put that into an LLM-friendly knowledge base with RAG that could then
respond to real-time voice commands in the vehicle, in the e-cockpit.
So I think that's a pretty cool one that we've been working on for over 12 months.
And it doesn't stop there.
(09:05):
With healthcare, you could think of touchless interfaces for equipment or
medical diagnostics to prevent infection.
So, preventing contact, we can use voice-triggered AI to access
patient or procedure info securely.
Again, no data leaving; simply the AI inference happening locally.
Another really big growth market that I'm personally quite
excited about is mobile robotics.
So, robotic machines that can be autonomous, that can interpret
(09:28):
human instructions through documents or maybe even gestures.
Think of, like, making a drawing of something and then having the system
use OCR + RAG to understand how to execute on what you're trying to do.
I think that's a pretty cool HMI that's possible through this GenAI Flow and RAG.
Then, just to maybe round things out, industrial automation:
AI assistants answering real-time machine health and maintenance diagnostics.
(09:51):
A non-programmer operator being able to leverage ChatGPT-level technology
in a remote area of the factory.
I think that's pretty powerful.
That highlights the versatility of what GenAI Flow is for industries
where latency, privacy, connectivity,
they're all critical.
Awesome.
Very cool to see current use cases, and it's already a reality.
(10:12):
When we wrap up the show, we always like to talk about what's also on the horizon.
So what is on the horizon for edge-based Gen AI?
Where are we heading next, and how is GenAI Flow geared up for that future?
I'm a big believer that true progress is always interdisciplinary.
And one AI embodiment of this is the growing demand for multimodal
(10:33):
AI that combines vision, language, and voice as
inputs into a single edge device or edge-centric pipeline.
That's gonna be really leveled up by the reasoning and cognitive abilities of
LLMs that are now connected or plugged into multiple modalities and domains.
Compounding that is quantization.
There's only one place to go from fast, and it's faster.
(10:56):
I think we're already finding ways to optimize for larger language
models to have better quality, but also optimize the smaller models
to make them still performant, but more responsive and faster.
Expanding that, I think, is when we combine that with agentic AI and physical AI.
Those two are not just buzzwords or hype-driven domains, but real applications that
(11:16):
are being enabled through Gen AI, which we, as a technology-innovation-driven
organization, are already creating with context that is enabled through RAG.
So, AI that can act and reason to adapt locally, that's not just
shipped once and then updated in 15 years, but is constantly improving.
That's super exciting.
I think, to round it out: enhanced developer tooling.
That's always in hot demand, always needed, because AI itself is a moving target.
(11:39):
We have to move with it, and that's done by focusing on scalability, by
bringing Gen AI, which I think, as you alluded to at the beginning, Kyle,
most people might not suspect is real.
But it is.
And I think by focusing on that and compounding it to a wider range
of devices, possibly MCUs, I'm not sure, but definitely MPUs today.
And I'm curious what could be done when a lot of creative
people put their minds together and make AI smaller, faster, better?
(12:00):
Awesome.
Well, thanks for being with us today, Davis, and I'm gonna
wrap it up for our listeners.
We explored how eIQ GenAI Flow is bringing generative AI to the edge:
securely, efficiently, and locally.
From RAG-powered Q&A to multimodal AI and robotics, we're seeing what's
possible when LLMs escape the cloud.
(12:22):
And whether you're building for automotive, healthcare, or industrial
automation, NXP's new GenAI Flow is setting the bar high for real-world
AI performance at the edge.
So consider today's episode an invitation for you to explore how
your edge devices could benefit from having access to ChatGPT-like
functionality without the keyboard.
(12:44):
Thanks again to Davis for joining us, and to all of you for tuning in.
Be sure to subscribe and check out more EdgeVerse episodes
wherever you get your podcasts.
Thanks for having me, guys.
It was a really fun conversation, and I look forward to seeing what's in the future.
Yeah, we'll have to have you back, Davis.
Thank you very much.
And to our listeners
(13:03):
Stay curious, keep learning, maybe faster than those
machines out there, and we'll see you on the next EdgeVerse Techcast.