
November 8, 2024 53 mins

In this episode of AI Explained, we are joined by Jonathan Cohen, VP of Applied Research at NVIDIA. 

We explore the intricacies of NVIDIA's NeMo platform and its components, such as NeMo Guardrails and NIMs. Jonathan explains how these tools help in deploying and managing AI models with a focus on observability, security, and efficiency. The conversation also covers the evolving role of AI agents, the importance of guardrails in maintaining responsible AI, and real-world examples of successful AI deployments in enterprises like Amdocs. Listeners will gain insights into NVIDIA's AI strategy and the practical aspects of deploying large language models in various industries.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:06):
Welcome, and thank you, everybody, for joining us today on AI Explained, on Inference, Guardrails, and Observability of Generative AI. I am Krishna Gade, Founder and CEO of Fiddler AI, and I'll be your host today. Without further ado, we have a very special guest on today's AI Explained: Jonathan Cohen, VP of Applied Research at NVIDIA.

(00:29):
Welcome, Jonathan.
Hi, good morning.
Good morning. Thank you so much for being on this show, Jonathan.
Yeah, my pleasure.
Awesome. So Jonathan is an engineering leader of the NeMo platform. He has incubated AI technology into products, including NIM, which

(00:50):
is NVIDIA Inference Microservice. He has worked on a lot of foundation models for human biology, and altogether he has spent about 14 years at NVIDIA in various different roles, and prior to that was at Apple as well. So Jonathan, maybe for our viewers: NVIDIA, for many people, comes across as a GPU and hardware company.

(01:11):
What are NIM and NeMo? These seem like software tools. Could you share what these are and how they fit into the NVIDIA AI strategy?
Yeah, thanks, that's a great question. So, NVIDIA has always been a hardware and software company. We call ourselves an accelerated computing platform company, and computing platform means not just a chip, not just a computer, but a platform,

(01:36):
everything you need to do your computing. So we've always included APIs and tools and acceleration, and in fact, the reason why our platform has been so successful and delivers so much value is precisely because the optimizations and improvements you make at any level of the stack all multiply together.

(01:58):
If the hardware is twice as efficient, and the software algorithm I use is twice as efficient, then cumulatively I get a 4x improvement, right? And that's how we always think about it: these full-stack improvements. So NeMo is just the latest in a long line of NVIDIA software platforms, kind of following this strategy. NeMo is our platform for accelerating the creation and

(02:21):
operation of modern AI systems. You could call them AI agents, or generative AI, or large language models; these aren't all synonyms exactly, but obviously they're all kind of in the same bucket, and I think of them as modern AI systems. Modern AI systems touch a lot of things.

(02:44):
There's training large language models or foundation models. There's customizing existing models: there are many great community models out there, for example the Llama models from Meta, and you might want to start with one of those models and then customize it and fine-tune it based on your own data. There's deploying these models, and there's managing the lifecycle of the deployment.

(03:07):
I have a model, I need inferencing, I need to observe it in action, I need to log what it's doing, I need to take those logs and somehow evaluate good and bad examples, maybe have a human correct it when it made a mistake, turn this back into training data, retrain, and redeploy. There's a complete flywheel, as we typically call it, around any deployed AI system.

(03:30):
And NeMo is the software that manages all of that. So NeMo covers both some open-source Python components, which are basically PyTorch-based tools for training, fine-tuning, and evaluation, and also a microservices platform. At this point, the only component

(03:52):
of the microservices platform that's in general availability is NIM, which stands for NVIDIA Inference Microservice. And NIM is a very simple idea: just take a model, whatever model, and package it up behind an API server. So, in the case of large language models, something that's, let's say,

(04:14):
OpenAI completion-endpoint-compatible, or Llama Stack inference-compatible, or whatever it is. So there's a little API service that makes it easy to talk to this model. Behind that service, you have highly optimized inference. Because it's NVIDIA, we're going to always give you the

(04:35):
best performance, best throughput, and best latency. We'll use technology like TensorRT, or whatever the best technology is, add all the enterprise-grade features like security (up to date with the latest CVEs and security patches), and add connections into logging and observability platforms.

(04:58):
And that whole thing is then a container. So this is not a managed service; NVIDIA is not operating this as a service. We take all this code, we package it up in a container, and you can get that container and run it yourself. And so if you want to manage your own LLM inferencing endpoint, it's as simple as: grab the container, docker pull, docker run.

(05:20):
That's it, and you're up and running in a few minutes. And because it's portable, you can host it in any cloud, you can host it on-prem, and you have control: you can host it geographically, or in terms of network topology, as close to whatever you need as you want. You can host it in a VPC, you can host it in your own private infrastructure

(05:43):
that's air-gapped, whatever you want; it's just a container.
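To make the "docker pull, docker run" idea concrete, here is a minimal sketch of querying a self-hosted NIM through the OpenAI-compatible API described above. The local port, model name, and placeholder API key are illustrative assumptions, not an official recipe.

```python
# Minimal sketch: calling a locally hosted NIM via its OpenAI-compatible
# API. Assumes a NIM container is already running and listening on
# localhost:8000; the port and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # the self-hosted NIM endpoint
    api_key="not-needed-locally",         # no managed service involved
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",   # whichever model the NIM serves
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```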
And that's, I think, a very powerful idea, and that's something that our customers really appreciate. Because if you think about what these models are actually being used for: increasingly, to quote Jensen, our CEO, these are digital workers. They're these agents that are operating

(06:06):
alongside humans at companies, doing things that humans do, would be doing, or could be doing, or helping humans do things. Which means they have access to all the very sensitive data that our employees have access to. If it's in a healthcare setting, it might have access to, or you'd want it to have access to, medical data, or you'd want it to have access to

(06:27):
proprietary company data, whatever it is. And oftentimes, for this most sensitive data, controlling where that data goes, where it gets sent, who might be looking at it, and who's storing it is really important to customers. And so being able to have total control over how you deploy these things, where you deploy them, what you do with the data that gets sent, what you do with the logs (or maybe you don't log at all), or

(06:49):
where you run your observability: all these things are, I think, very compelling to our customers. And so that's the idea behind NIM.
Absolutely, this is great. So you touched upon a bunch of things. One of the things that I've heard Jensen, and even you, talk about is this idea of NIM as a model in a briefcase. Essentially, it's a large language model packaged into this container,

(07:15):
where you can go and run and deploy it in your favorite cloud environment. So, how does it work? For example, let's say I'm a customer and my workloads are in Amazon or Google Cloud. What are the advantages of using NIM, and how do I use NIMs for my large language model inferencing?

(07:38):
Yeah, so all NIM needs is a Kubernetes environment. You can probably make it work even in a non-Kubernetes environment, as long as there's some kind of container orchestration, but let's just say Kubernetes for simplicity. So anywhere where you can orchestrate and launch containers,

(07:58):
you can run a NIM container. That's what makes it so simple. When people say, oh, it's a model in a briefcase, what they mean is that it's not our model that we're holding for you. We give it to you digitally, and you're free to move it to wherever you like. So, any cloud. Now, you do need a GPU, you need some accelerated

(08:20):
computing hardware to run a NIM. The software stack that's in a NIM today is GPU-accelerated. But as long as you have a Kubernetes environment that can orchestrate containers and has access to GPUs, you can run a NIM. Not all NIMs will fit on all GPUs; they have memory requirements. If you want to run Llama

(08:43):
405B, that's a very large model; it's not going to fit on a small GPU in some cloud somewhere. So there are requirements like that, but other than the physical requirements around memory, you can do whatever you'd like with a NIM.
And so, do you offer pre-packaged NIMs with, say, Llama or Mistral?

(09:04):
Exactly.
Where you can download the containers and run them on your Kubernetes?
That's exactly right. So we have a catalog: if you go to build.nvidia.com, you can see a complete catalog of all of the models that we have NIMified. For that process, we have something internally we call the NIM Factory. What the NIM Factory does is take these models when they get

(09:25):
launched (or, in some cases, we know about them shortly before they're launched) and put them through this process where we pre-build optimized TensorRT engines. We measure them, we do all the hardware-level optimization, we make sure they work, and we certify them across the different

(09:47):
hardware SKUs that we have. We package it, and then we put it on build.nvidia.com, and so you can try it and interact with NIMs. NIMs are not just large language models; the concept applies to any model that you may want to inference.
Even computer vision models and things like that.
Absolutely. Computer vision, image generation models. We have speech recognition models as NIMs.

(10:09):
We have biology models as NIMs: protein language models, molecular docking models. All kinds of things are available as NIMs, and many of those NIMs you can download yourself. So on build.nvidia.com, if it says "run anywhere," that means it's a model that's been packaged and we've

(10:29):
tested it well enough that you can actually download it, and we guarantee some level of quality: it should run with low latency, high throughput, and accuracy.
Awesome. So now, okay, I've gotten hold of a NIM server, maybe a Llama server, and I've deployed a bunch of containers.

(10:51):
Now, as an enterprise, I want to make sure that I'm taking care of all the security issues and the things that you touched upon. Maybe I have healthcare records. How do I, as an enterprise company, solve some of these security challenges when deploying these LLMs? How can guardrails be customized to address these, and where

(11:12):
would some of the NeMo framework help me?
Yeah, so the answer depends a lot on how you're deploying it. Not every large language model that you deploy, let's say, is actually going to be used as a chatbot. Some of them may be very internal endpoints that are doing some very specific task.

(11:34):
And so the security considerations, the attack surface, the anomaly detection: these kinds of things may be harder or easier depending on your actual use case and deployment scenario. But let's just pick an example. Let's say I have the simplest, most obvious example: a customer-facing customer

(11:54):
service chatbot that I've built.
Okay, now I have this probabilistic AI system that I'm exposing to the public. I think probably everybody has seen by now the news stories that trickle out about companies that have deployed these, where the customer talks the chatbot into offering it a discount

(12:15):
or something like that, right?
And now suddenly I have this problem where this digital representative of my company has done something against my company policy. And so this is a good example of the challenges and the risks of deploying AI models. I always think about the analogy with a human: if I

(12:38):
had a human customer service worker, I'd give them a book, I'd tell them what our policies are, what our refund policy is, how to handle a rude customer, how to handle an irate customer, right? We train our humans, we train people, we do all these things. And sometimes people get it wrong, and that's where quality assurance comes in;

(12:59):
this is why whenever you call a line that says "this call may be monitored for quality assurance purposes," that's what they're talking about. They do record these lines, and supervisors listen in, and they review calls, and they're constantly training their staff, right? Why would you expect a digital worker to be any different?
So when I'm interacting with a digital worker, I certainly want to

(13:19):
be able to monitor what it's saying, and monitor what the people are saying. That's on the monitoring side, but I also want to somehow control and constrain what this AI is doing. For example, I say: you are a customer service chatbot that is only to talk about product quality issues with our products.

(13:41):
Don't talk about politics. Don't offer your opinion about what's the best wine, or whatever it may be. Especially when this chatbot is based on a modern large language model: these models are incredibly sophisticated, able to carry on conversations about all sorts of things and perform many tasks, but that's not what you want, right? You typically want to highly constrain the domain, what you

(14:05):
might call the ODD, the operational design domain, in which this chatbot is actually going to operate. So we have a product; I mean, there are many solutions out there. From NVIDIA, we've developed this technology we call NeMo Guardrails, which is specifically designed to do this. The way NeMo Guardrails works is that it's actually built on top of a very sophisticated technology.

(14:27):
It kind of hides a lot of the underlying sophistication, but there's a very sophisticated technology called Colang (not to be confused with Golang), spelled C-O-L-A-N-G. Colang is a dialogue modeling language, and it's incredibly powerful. Essentially, in Colang you can describe, using this specific

(14:48):
modeling language, the structure of a dialogue, of a conversation. You can talk about the topics that you would want to cover, and the tone, and you can write triggers. So, for example, you can say: if the customer says something in an angry tone, then trigger some rule, like making sure the bot responds in a very conciliatory tone.

(15:10):
Or let's say the customer says something rude five times in a row without making any substantive request; then maybe your bot would go into some predefined response: hey, I can tell that you're really upset, why don't we disconnect today and reconnect in the future, right? So whatever rule you may have, essentially, in Colang, you

(15:35):
can describe these very complicated and potentially sophisticated rules.
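As a sketch of what such a rule can look like in practice, here is a small Colang-style flow loaded through the open-source NeMo Guardrails Python toolkit. The flow names, example utterances, and model settings are illustrative, and Colang syntax varies between toolkit versions, so treat this as a sketch rather than a definitive configuration.

```python
# Sketch: a Colang dialogue rule loaded via the open-source
# NeMo Guardrails toolkit. Flow names, utterances, and model settings
# are illustrative assumptions, not a definitive configuration.
from nemoguardrails import LLMRails, RailsConfig

colang_content = """
define user express anger
  "this is ridiculous"
  "I am furious about this"

define bot respond conciliatorily
  "I'm sorry for the trouble. Let me help make this right."

define flow handle angry customer
  user express anger
  bot respond conciliatorily
"""

yaml_content = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
"""

config = RailsConfig.from_content(
    colang_content=colang_content, yaml_content=yaml_content
)
rails = LLMRails(config)
reply = rails.generate(messages=[{"role": "user", "content": "This is ridiculous!"}])
print(reply["content"])
```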
We also have a lot of pre-built rules, so you don't have to know all of this, right? The open-source toolkit is available today, but we're also working on a microservice version of it for general availability, and

(15:57):
what we're working on there is really to have a bunch of these pre-built best practices to make it easy to deploy. But Guardrails can also, for example, look for prompt injection attacks, or different security vulnerabilities, or jailbreak attempts, all these things that are also kind of security issues. It's this very interesting world where security and dialogue management are kind

(16:22):
of overlapping for the first time ever. One way I like to think about it is that in the past, when you thought about computer security, you thought about an API. Really, security is all about APIs. I have an API, some structured way of communicating with the computing system. And security flaws, security holes, are some way of accessing that

(16:46):
API in a way the developer didn't intend, that can get the system to do something the developer didn't intend it to be able to do through that API: leaking information, or granting permissions, or whatever it is. This is kind of what computer security is really built around, this notion of APIs and access. And now we're putting this conversational engine in front of an API.

(17:07):
So I can talk to a chatbot that has a human-like conversation with me, and then the chatbot translates what I'm saying into calls to some more structured API behind the scenes. But now my conversation itself is the attack surface.
Correct. And an attack might look like trying to convince a chatbot to do

(17:30):
something it's not supposed to do.
There are these "do anything now" attacks, DAN attacks.
Yeah, exactly. They don't look like computer security attacks of the past, right? And so there's a very interesting overlap between computer security and dialogue modeling. And NeMo Guardrails is really trying to embrace that, and so

(17:51):
that's why the core technology is actually a dialogue modeling system, and then we layer on top of that a number of computer security concepts and techniques.
So, on that: Guardrails seems like a rules engine framework. It's almost like a...
It's a fuzzy rules engine, yeah.
A fuzzy, dialogue-based rules engine.

(18:11):
So now, say the user is being abusive or there is a jailbreaking attack: how does that detection capability happen? Are you actually using language models behind the scenes, or other AI techniques, to detect whether this is a jailbreaking attack or something?
So, the Guardrails software is actually relatively agnostic about this. It relies on access to existing models.

(18:34):
So, for example, you could have a jailbreak detection model: given an utterance, is this likely to be a jailbreak or not? We have topic models that look at a conversation and decide, given a bunch of options, which topic most closely matches. It can also, more generally, use large

(18:55):
language models to do these sorts of assessments, like: what is the tone of this conversation, or whatever it is. But the power of Guardrails, in fact, is that whatever model you have that you want to use as part of your guardrailing system, NeMo Guardrails can call. It's very easy for NeMo Guardrails to become an orchestrator

(19:16):
of existing techniques you have.
So if I have a jailbreak detection technique that I've developed...
Right, you can bring your own jailbreak detection model.
Or whatever other APIs; you could integrate them with Guardrails.
That's right. We've developed some ourselves, but in fact, the way I would recommend you deploy NeMo Guardrails is that you use, say, Llama Guard for this, and there's a huge community of these models out there.
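As a sketch of that orchestration idea: the open-source toolkit lets you register your own Python function as an action that a flow can execute, so whatever detection model you already have can be called from a rail. The keyword check below stands in for a real jailbreak detector, and the flow and action names are illustrative assumptions.

```python
# Sketch: plugging your own jailbreak detector into NeMo Guardrails as
# a custom action. The keyword check is a stand-in for a real model;
# all names and settings are illustrative assumptions.
from nemoguardrails import LLMRails, RailsConfig
from nemoguardrails.actions import action

@action(name="check_jailbreak")
async def check_jailbreak(context: dict) -> bool:
    user_message = context.get("user_message", "")
    # Call your own model or API here instead of this toy heuristic.
    return "ignore all previous instructions" in user_message.lower()

colang_content = """
define bot refuse to respond
  "I can't help with that request."

define flow check input
  $is_jailbreak = execute check_jailbreak
  if $is_jailbreak
    bot refuse to respond
    stop
"""

yaml_content = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
rails:
  input:
    flows:
      - check input
"""

config = RailsConfig.from_content(colang_content=colang_content, yaml_content=yaml_content)
rails = LLMRails(config)
rails.register_action(check_jailbreak, name="check_jailbreak")
```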

(19:38):
In fact, the original design of NeMo Guardrails was based on this idea that NVIDIA is not going to solve the guardrailing problem. It's an immense problem. It's like saying we're going to solve computer security; of course we're not. There's a huge community of people working on this. So, from the very beginning, we thought it was really important to be able to tap into that, and to provide a technology that maybe had some of our own

(20:01):
techniques inside, but fundamentally was more of an orchestrator of the community's techniques as they're being developed.
Yeah. For example, we have integrated with Guardrails so that Fiddler's intelligence techniques can be available for Guardrails.
And actually, just a comment about that: the other really important part of Guardrails is logging and monitoring.

(20:21):
NeMo Guardrails itself doesn't really do that, right? It's kind of a rules engine; as you say, maybe more sophisticated than a simple rules engine, but a rules engine. But you also really want to have something that's monitoring it, looking for anomalies: dashboards, all these kinds of things. That's not what NeMo Guardrails does.

(20:42):
You can think of NeMo Guardrails as kind of an endpoint-monitoring node, but you really are going to want to connect it into some larger platform like Fiddler.
Right, right.
Makes sense.
And so, when it comes to runtime deployments of AI, one of the design patterns that we are seeing in enterprises is building a gateway service, almost like a brokerage service, that calls into

(21:05):
many of the LLM endpoints, so there's almost a model garden that evolves for different use cases, and the gateway service federates requests across them. So it seems like a guardrailing framework could really help you build a really powerful gateway, right? Have you seen examples of customers doing that?
Yeah, you can certainly do things like that. You can also choose to use some other systems.

(21:29):
So there are a lot of what people would now call agentic AI systems, or compound AI systems (I think they're kind of synonyms in a lot of ways), where you again have multiple AI models that are interacting. NeMo Guardrails could be used that way, but you don't have to. You can build your software on, let's say,

(21:52):
LangChain or something like that, or whatever platform you want, and still connect it into something like NeMo Guardrails, and connect it into NIMs. And again, the concept of NeMo, the broader platform, is that NeMo is not monolithic. It is very carefully designed to be a set of modular,

(22:14):
independent microservices.
Now, we designed them all so they work well together. But you can say: I'm going to use NIM, and I want to use my own monitoring, my own guardrails, my own fine-tuning, my own deployment. That's fine; NIM is just a microservice, right? You can connect it into whatever. Or you can say: I want to use NIM and NeMo Guardrails, but I want to use my own toxicity detection models, and I want to use my own fine-tuning, whatever.

(22:37):
That's fine, too. It's really explicitly designed to be something where you don't have to embrace the whole platform.
You can take a bunch of Lego blocks and pick and choose whatever you want.
Yeah. Or you can embrace the whole platform, because the whole platform has been designed with a coherent vision about how you're going to build and deploy these systems.
But if you have an existing agentic platform that you're using, and you

(23:00):
like this one model and you want to deploy it with a NIM, that's fine too. Or, let's say I have many models interacting and I already have some guardrails that are working for me, but I want to add Llama 405B with guardrails, and if you like the configurations that we provide, you can use those too. So it really is very flexible in that sense, and that was a very important design point for us.

(23:22):
So maybe switching gears: we talked a little bit about evaluation and observability of these generative applications. Do you see these techniques of evaluation and monitoring differing across domains? There's a question from the audience: how does it differ from healthcare LLM apps to, say, financial services LLM apps? What sort of advice or guidance

(23:43):
can you provide?
Well, I guess it depends what you mean by evaluation. Let me answer that in two ways. One level of evaluation is this kind of monitoring: I'm looking for a problem, and the kinds of problems that you might have are extremely domain-dependent.

(24:04):
Right? So in healthcare, I might have a monitor that's specifically looking for PII leakage. I designed my system so that it should never happen, but I want to add another layer of security there: hey, if the chatbot ever provides PII that's not linked to

(24:26):
the current patient, then flag that. That would be a guardrail, an online evaluation, that would probably make a lot of sense in a healthcare setting, right? Or compliance: in finance, I have a lot of rules about what information I can and can't include, or this conversation is not allowed to talk

(24:46):
about this. Or, we have an internal example at NVIDIA that we've been working towards for a long time, which is an HR benefits chatbot that can answer questions about benefits. And there are a lot of things that you're just not allowed to answer, you know? If you ask, in what stock should I invest my 401k, your HR partner can't answer that question for you, right?

(25:08):
So the HR benefits chatbot shouldn't answer questions like that. And again, you'll program it that way: you might fine-tune it and provide it some guidelines not to do that. But you probably also want to add a guardrail, and also a monitor, to check: are people asking these kinds of questions? What percentage of the time that someone asks a question that we're not supposed

(25:29):
to answer are we actually answering it? Is there a sudden spike in people asking inappropriate questions? Whatever it is.
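As a sketch of what such a monitor might compute, assuming you log, per conversation turn, whether the question was out of scope and whether the bot answered it anyway; the 5 percent alert threshold is an illustrative choice, not a standard.

```python
# Sketch: a monitor for an HR-benefits-style chatbot. Assumes each
# logged turn records whether the question was out of scope (e.g.
# investment advice) and whether the bot answered anyway. The 5%
# threshold is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class Turn:
    out_of_scope: bool  # a question the bot is not supposed to answer
    bot_answered: bool  # did the bot answer it anyway?

def scope_violation_rate(turns: list[Turn]) -> float:
    """Fraction of out-of-scope questions the bot answered anyway."""
    oos = [t for t in turns if t.out_of_scope]
    if not oos:
        return 0.0
    return sum(t.bot_answered for t in oos) / len(oos)

turns = [Turn(False, True), Turn(True, False), Turn(True, True)]
rate = scope_violation_rate(turns)
if rate > 0.05:  # alert if more than 5% slip through
    print(f"ALERT: {rate:.0%} of out-of-scope questions were answered")
```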
So there's a real-time evaluation that I think is very context-specific. The other way I can answer that question, though, just in general, is: you want to evaluate your system. How accurate is it?
Yep. System evaluation, so this is more like offline.

(25:51):
I built a model, and before I deploy it, let's say, I want to just know how good it is. And that is also extremely domain-specific; use-case-specific, really. And I think this is a great example of how I think about AI: generic, general AIs are great, but I think most

(26:12):
business use cases are not generic.
I don't want an AI that is going to opine about religion and history. I want an AI that does this one task: it takes receipts, my reimbursement policy, and the requested reimbursement, and checks that this

(26:32):
receipt with this requested reimbursement matches my policy. I want an AI that just does that. I don't want it to tell me about the history of the French Revolution or whatever, right?
And therefore my evaluation is also going to be extremely task-specific, and the way you're going to build your AI is that you're going to first very clearly define your task.

(26:52):
You're probably going to collect a lot of training samples. You probably have lots of human experts at doing that, because a lot of people have been doing these rote tasks for a long time, and you probably have a lot of data collected over the years that you could turn into training data. Some of it is going to be used to train, and some of it is going to be used for evaluation. And so that's why I said there are sort of two ways to answer this question.

(27:13):
I think they're both very important, and they're both, in fact, domain-specific.
So this is an interesting thing. I used to work on search before, and we would have human raters that evaluate whether the search quality was good or bad. And now there are companies that employ red teams to do this, to give you the domain-specific answers.

(27:35):
But how do you think the field will mature? There's now this approach of LLM-as-a-judge, using an LLM as an evaluator. What's the future here? Because there's scale that we're talking about, right? As GenAI hits scale, now you have lots of things to evaluate and observe. What's the future technology here?

(27:59):
Well, I think LLM-as-a-judge, as people call it, makes a ton of sense. What you're doing there is sort of saying: well, it's probably less accurate than a human, but I can evaluate way more. And so I'm going to trade off volume of evaluation for quality, and on balance, that's probably a good trade.

(28:20):
And I think there's a lot of evidence that that is true. I think there's still kind of a gold standard of humans evaluating, and I don't think we have any AIs that are as good as humans at evaluating responses. You have these things like Chatbot Arena, which are fundamentally humans. The interesting thing about Chatbot Arena, as a good example, is that humans have biases.

(28:41):
Chatbot Arena is where you can go and play with it: you ask questions, a bunch of different chatbots answer, and you rate how well you like the answers. And so this is kind of considered the gold standard for how good your chatbot is. But humans have preferences. We like a friendly tone that's not overly pedantic, where

(29:03):
the answer's long enough but not too long, and it's helpful, all these kinds of things, right? And so Chatbot Arena strongly selects for these things that people like. Is that a better chatbot? I don't know; it's kind of subjective, right? So I think there are some strengths and weaknesses to both humans and LLMs as judges.

(29:24):
The other thing I would say is that not all large language models are equally good judges.
Right.
And you can build large language models that are specifically good judges, and you can fine-tune and improve your large language model as a judge for a particular task.
And so I think, again, if you think of the analogy of a large

(29:45):
language model as being like a person: again, I'm not trying to anthropomorphize this technology, but I think it's helpful to think about it this way. Some people are better teachers than other people. Some people are better judges. Some people are better at grading papers than other people. A high school history teacher who grades papers all day long is

(30:06):
probably a lot better at grading history papers than I would be. And so with LLMs, you can make them better or worse at things. So I think the future of LLM-as-a-judge is that they're going to be used for lots of judging, but we will continue refining our techniques for building better AIs that are great judges.
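Here is a minimal sketch of the LLM-as-a-judge pattern against any OpenAI-compatible endpoint (for example, a self-hosted NIM, per the earlier discussion). The rubric, the 1-to-5 scale, and the model name are illustrative assumptions.

```python
# Sketch: LLM-as-a-judge. Point the client at any OpenAI-compatible
# endpoint (e.g. a self-hosted NIM). Model name, rubric, and the 1-5
# scale are illustrative assumptions, not a fixed standard.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

JUDGE_PROMPT = """You are grading a customer-service answer.
Question: {question}
Answer: {answer}
Rate the answer's helpfulness and policy compliance from 1 to 5.
Reply with only the number."""

def judge(question: str, answer: str) -> int:
    resp = client.chat.completions.create(
        model="meta/llama-3.1-70b-instruct",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.0,  # deterministic grading
        max_tokens=4,
    )
    return int(resp.choices[0].message.content.strip())

print(judge("What is your refund window?", "Refunds are accepted within 30 days."))
```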

(30:29):
So far we've talked about guardrailing and monitoring in the context of accuracy and security. You want to make sure that your GenAI app is accurate; you mentioned the example of the HR chatbot sticking to what it needs to do and filtering out other things. Or security issues, where you eliminate PII from showing up in responses.
But then there's a whole gamut of ethical

(30:52):
considerations that people talk about. I've seen Jensen go on stage and talk about Responsible AI and AI Safety quite a few times. And then there are the legal aspects of it. As countries and continents come up with regulations, there's this question of who owns the liability: is it the developer of the large language model, or the deployer?

(31:12):
There are a lot of these considerations here. And if you are providing the guardrails, would the guardrail provider own the liability? So there are lots of implications here. How do you think about it for a company, especially in regulated sectors? Fiddler works with a lot of regulated customers, some of whom might be on this call. How do they think about the responsible AI aspects of generative AI?

(31:35):
And will some of these tools help with them?
That's a complicated question. You know, I think about it as: how do you think about the responsibility of your employees? In a future where you have human employees and digital employees, you're responsible for the behavior and the processes

(31:59):
and the policies of both of them.
I think what's important is that you have some confidence that your digital employees are actually following your policies; that you have a technique for, I'll say in quotes, "explaining" to your digital

(32:19):
employees what your policies are.
We have techniques for explaining to our human employees what our policies are, right? Training manuals. I think legally we all have to do our sexual harassment training every two years. NVIDIA has a code of conduct, and I have to retake my code of conduct training. It's a little 30-minute training course designed by someone in the legal department somewhere, right?

(32:40):
I mean, there's a whole industry here; there are companies whose business is helping enterprises train their employees in their own policies, right? And like we were talking about earlier with call centers, you have quality assurance, and again, there are whole businesses whose business is helping companies implement quality assurance

(33:03):
plans, enforce the level of quality, and all this kind of stuff, right?
The same thing has to happen for your AI workers. How do I train my AI? How do I teach my AI my corporate values? How do I ensure that it's complying with all the relevant regulations? All of these are difficult things to do.

(33:23):
And we have some techniques for it today, and we will continue refining those techniques. I think it's always a trade-off between, let's say, the ease with which you can constrain your AI versus the quality and intelligence of the AI.

(33:44):
So let me give you an example. I could have something that's not an AI at all: just a straight-up traditional dialogue system with hard-coded responses and dialogue trees. I am 100 percent confident it will never say anything that I don't want it to say, because I literally wrote every line that it will ever respond with. This is like your old-school...

(34:05):
Decision trees.
Decision trees and state machines and all this well-understood technology: typical dialogue management. You chat with some companies online for help, and most of them are these kinds of hard-coded things. So, very easy to control and constrain.

(34:25):
But sometimes extremely frustrating.
Exactly. We'd all agree they're not going to be intelligent, and therefore they don't know how to fulfill their mission. On the other hand, I can have a totally open-ended chatbot that talks about anything and seems like a human. It's so smart: it can do my math homework problems, write me a speech, and give me customer service in the form of a sonnet, whatever, right?

(34:45):
But it's really hard to constrain. And so there's this spectrum in between, and obviously we're not satisfied with AIs that are dumb but trivial to constrain; if we were, we wouldn't have invented large language models, and the world wouldn't be excited about this. So it's really about how you take this thing that's very intelligent and very flexible, but also constrain it. And again, I just come back to this:

(35:11):
we figured this out with humans. Companies employ lots of humans, and for the most part, you're pretty confident that your humans are, in fact, following your policies. That's not a thing that keeps CEOs up at night: oh, what if my employees don't do what I tell them to do? It's not easy, but it's tractable, and I don't see any reason why the AI

(35:34):
version of that is any less tractable.
Yeah, makes sense. So it starts with putting the controls, guardrails, and observability in place.
These are all very important ingredients, absolutely.
Yeah, absolutely. Awesome. So maybe we can switch gears a little bit. In the last six to nine months, there has been a lot of talk about agentic AI, right?

(35:58):
And so if you think about the evolution of generative AI in the enterprise, people started a lot with RAG and then fine-tuning, and I think maybe 95 percent of the GenAI apps people are building are still RAG, but now agentic workflows have come about. Could you talk about what you're seeing there, and what this agentic workflow is? How does it differ from RAG, and how do companies

(36:20):
need to think about it?
Well, they're very similar. I think what people call agentic workflows are just a generalization of RAG; sorry, I should say Retrieval Augmented Generation. Maybe I should explain what these all mean. So, the idea of RAG, Retrieval Augmented Generation, starts with: what is an LLM?

(36:43):
This is a very philosophical question. An LLM is fundamentally a neural network, typically a transformer-based neural network, that's been trained on a large amount of text, usually human text. And what it does is learn the structure of that human text. And the way it learns the structure of that human text is

(37:06):
that it memorizes a lot of things: that a mother and a father are related, but they're different genders, just like a king and a queen are related, but they're different genders. And that France has a capital called Paris. There are just tons and tons of facts in our heads that allow us to communicate.

(37:27):
And these LLMs have memorized so many things. Sometimes they memorize specific facts; a lot of times they memorize sort of conceptual facts. And so when you ask it a question, it'll spit out an answer from its sort of vast memory that's encoded in the weights of the neural network.

(37:47):
And oftentimes those answers are plausible but totally made up. People refer to this phenomenon as hallucination or confabulation. And so that's not very useful: I have this thing that sounds really good and speaks my language, and it can't be relied on to actually tell me the truth, because everything it says sounds plausible. And that's just kind of inherent in how large language models work.

(38:09):
And so then people said: well, okay, but what if I actually have facts? Say I have Wikipedia, and I'll consider Wikipedia to be the ground truth. Instead of just asking my large language model a question, I will first find relevant passages from Wikipedia, let's say, that are likely to have facts relevant to the answer, and I'll prime the large language model.

(38:30):
I'll say: hey, large language model, here's a bunch of facts, articles from Wikipedia; here's a question; now answer it. And what you find, which intuitively makes sense, is that the large language model is much more likely to use the facts in the passage that you fed it rather than make them up. And so this led to this idea of retrieval augmented generation, where,

(38:53):
when you interact with a large language model, the first thing it does is go and retrieve a bunch of hopefully relevant information somehow before it answers. And this makes a lot of sense. Most LLMs that I've seen deployed in production are retrieval-augmented in this way.
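Here is a minimal sketch of that retrieve-then-generate loop, with a toy word-overlap retriever standing in for a real embedding-based search; the endpoint and model name are the same illustrative assumptions as before.

```python
# Sketch of retrieval augmented generation: find relevant passages,
# prepend them to the prompt, then generate. The keyword retriever is
# a toy stand-in for real vector search; endpoint and model name are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

DOCS = [
    "Refunds are accepted within 30 days of purchase with a receipt.",
    "Paris is the capital of France.",
    "Support hours are 9am to 5pm, Monday through Friday.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    # Toy retrieval: rank documents by word overlap with the question.
    words = set(question.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Use only these facts:\n{context}\n\nQuestion: {question}\nAnswer:"
    resp = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(answer("What is your refund policy?"))
```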

(39:13):
But then people started to think: well, I have this large language model that's retrieving information, but maybe the thing I ask it is, what time is it in Paris? Well, that's not stored in a database. To find out what time it is in Paris, it needs to first realize that I'm asking it to go look at a clock somewhere, figure out what time

(39:34):
zone I'm in, what time it is right now, do the translation, and answer. So on some level it's retrieving information, but it's not retrieving it from a database or text. It's using a tool: looking up your geolocation, looking up the current time, all these sorts of things, right? And so what started as RAG systems kind of evolved into this more general

(39:58):
notion of: what if I just allow my AI to use tools, to call other systems, computer systems, to get information? And maybe those other computer systems are themselves AIs. Say I have an AI that's specialized in figuring out what time it is; it's all this AI does, and it has access to clocks and whatever. And my master AI, I don't know what you'd call it, my front-level

(40:20):
AI, gets the user's question and says: oh, I don't know how to answer this, but let me go ask the time AI, because it's good at this, right? And so now, instead of one AI (go back many, many steps in my story: one AI that's just memorized a bunch of stuff and maybe makes things up), I actually now have a network of AIs that are all experts in their different domains, and that may themselves have access to external tools

(40:45):
or facts or databases or whatever. And this is now what we call an agentic system, because it's made up of these agents. And it's a very natural evolution, I think.
And so the communication is happening through these unstructured interactions between these agents.
Very interesting. Yeah, I think that's right. It's almost like humans speaking to each other in a workplace, basically.
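A toy sketch of that delegation pattern: a front agent hands time questions to a specialized time agent and otherwise falls back to the language model. The keyword routing and all the names are illustrative assumptions; a real system would typically let the model itself decide when to call the tool.

```python
# Toy sketch of an agentic router: a front agent delegates to a
# specialized "time agent" when the question calls for it, and
# otherwise falls back to the LLM. The keyword routing rule and all
# names are illustrative assumptions.
from datetime import datetime
from zoneinfo import ZoneInfo

def time_agent(city: str) -> str:
    # Specialized agent: all it knows is clocks and time zones.
    zones = {"paris": "Europe/Paris", "tokyo": "Asia/Tokyo"}
    tz = ZoneInfo(zones.get(city.lower(), "UTC"))
    return datetime.now(tz).strftime("%H:%M")

def front_agent(question: str) -> str:
    q = question.lower()
    if "time" in q:  # crude routing; a real system would let the LLM decide
        for city in ("paris", "tokyo"):
            if city in q:
                return f"It is {time_agent(city)} in {city.title()}."
    return "I'd answer this one from the language model directly."

print(front_agent("What time is it in Paris?"))
```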
I just keep coming back to this idea: the

(41:10):
history of computers is APIs, Application Programming Interfaces, where you have a structured way of communicating, of sending a request to a computer program to get a response. I form my packets this way. We have all these formats, JSON and YAML and HTTP, whatever

(41:31):
it is: all these structured ways of things communicating. And now we're kind of saying: well, it turns out that we have automated systems that are pretty good at just speaking human languages like English. So instead of formulating my request to it in a very structured way, I can just kind of talk to it in English.
And it is funny to imagine you might have all these agents, you

(41:53):
know, that are communicating, speaking English to each other. And that's super interesting, because now, with all the problems I had in monitoring and security and logging and guardrails, I actually probably want to start to monitor the communication between my agents using all the same tools.
So, in my example earlier, I said I have a customer-facing

(42:15):
chatbot, a customer service chatbot. So I have a human talking to an AI, and we want to monitor them. We want to see what the human is saying and what the AI is saying back: compliance, staying on topic, all these kinds of things we care about. Well, now I might have a human talk to an AI, and then that AI goes and talks to a bunch of other AIs, maybe in English. And maybe those AIs talk to other AIs in English, and maybe one of those AIs

(42:36):
talks to a computer system in JSON, right? So I have many links in these communication chains, and I probably want to monitor all of them.
Because you need to put those safeguards and controls in place, isn't it?
That's right. What if this AI starts speaking in poetry to another AI? It's plausible, right? One AI starts doing something anomalous while interacting with other AIs

(42:59):
in the system, and now I've got a whole bunch of AIs doing weird, anomalous things. I really want to know when that happens, right? And it should be doable, because you'd expect these agents, as I've described them, to be pretty specialized. If my front agent talks to my time-telling agent and starts asking it

(43:19):
about the weather, that probably doesn't make a lot of sense. Or if it starts speaking in rhyme, that probably doesn't make a lot of sense. Or if I'm expecting it to send requests in JSON and it starts sending them not in JSON, that's probably a weird thing that I'd want an alert to flag, and I'd want to go debug: what happened?
Right.
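A tiny sketch of that kind of check on one link in the chain: verify that messages bound for a JSON-speaking system actually parse as JSON, and flag anything else. The link name and the alert mechanism are illustrative assumptions.

```python
# Tiny sketch of monitoring one link in an agent-to-agent chain:
# assert that messages bound for a JSON-speaking system actually
# parse as JSON, and flag anything else for debugging. The link name
# and alerting are illustrative assumptions.
import json

def check_json_link(link_name: str, message: str) -> bool:
    try:
        json.loads(message)
        return True
    except json.JSONDecodeError:
        # In a real system this would feed an observability platform.
        print(f"ALERT [{link_name}]: non-JSON message: {message!r}")
        return False

check_json_link("front-agent -> billing-system", '{"action": "refund"}')
check_json_link("front-agent -> billing-system", "Shall I compare thee...")
```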
So I think those connections are still being made by humans and programmers today, right?

(43:42):
Like which question to route to which agent, and whatnot. But in the future, that can become autonomous.
Absolutely, all these things. And I think testing these systems is going to be more complicated, because you can have much more variety in the kinds of data flow: I send a request in, and the

(44:03):
communication pattern can have a lot of variety based on what the request is. So asserting that things look right, I think, will get harder and harder. Which means, again, that your monitoring needs to also be pretty flexible and probably, you know, fuzzy.
And so, again, I think that the technology that kind of underlies NeMo Guardrails, which is dialogue modeling, is probably the

(44:24):
right paradigm for this future as well.
But again, tools like Fiddler, observability, monitoring, anomaly detection: these things are super important, and they will be even more important in the future. There's this concept in programming that the software engineers in your audience might know, called pre- and post-conditions.
So this is kind of like asserts in languages, where you say, you

(44:45):
know, we call this function, and there are some conditions that, just logically, must be true. My computer is a giant state machine, and the state of the computer at the time this function is called should have this, this, and this true, and I typically code these as asserts. At the end of this function, these conditions should be true. And inside the function, these conditions should be true, right?

(45:05):
And this is absolutely a best practice from a software engineering perspective, because most bugs, almost every bug, come down to your system being in a state that you didn't expect it to be in, and all these asserts help you find them. These agentic systems, I mean, they're computer systems; they also have states. Now, the states are much fuzzier and more amorphous and complicated, but I

(45:29):
think the future version of pre- and post-conditions is going to be some form of guardrails, monitoring, and alerts.
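For reference, here is the classical version of the idea as plain asserts; the agentic analogue would replace these exact checks with fuzzier guardrail rules and monitors. The function and its bounds are illustrative assumptions.

```python
# Classical pre/post-conditions as asserts. The agentic analogue
# replaces these exact checks with fuzzier guardrail rules and
# monitors. The function and its bounds are illustrative assumptions.
def apply_refund(balance: float, refund: float) -> float:
    # Preconditions: the state we require on entry.
    assert refund >= 0, "refund must be non-negative"
    assert refund <= balance, "cannot refund more than the balance"

    new_balance = balance - refund

    # Postcondition: the state we promise on exit.
    assert 0 <= new_balance <= balance, "balance must not grow or go negative"
    return new_balance

print(apply_refund(100.0, 30.0))  # 70.0
```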
Yeah, absolutely.
So it seems like there's a question from the audience, which I think is related to what you just said. The question is from Philippe: where do you

(45:49):
see the role of AI agents evolving in the next five to ten years? Do you envision them being fully automated in critical sectors, and do you anticipate ongoing human oversight as a necessity for safety and alignment?
You know, that's honestly more of a policy question than anything else. I mean, we already have computing systems that are autonomous in some ways, right?

(46:13):
Think of software that you run, like my spam filter. It's pretty autonomous. I'm not tweaking my spam filter; I just trust it. Now, sometimes it gets things wrong. So do I, every once in a while, go check my spam folder? Occasionally, less and less. Ten years ago I would have done it a lot more. I trust my spam filter a lot more now, so I check it less and less, right?

(46:35):
But I think that's an example; we're surrounded by these things. I feel like we look towards this future as if it's going to be wild, but a lot of these things we already live with, right?
But I think the question is getting at a deeper question now, which is: is more and more of the world going to be automated away from us in a way that will be increasingly opaque to humans?

(46:57):
I think that's dangerous, right? I think there's a lot of value in human oversight, for all kinds of reasons, and I think compliance is going to force that. Until you are really confident in the technology, you're going to have humans overseeing things. And what does it mean to be really confident in technology?

(47:18):
I don't even know. I guess I don't want to mention specific industries, but there are a lot of things where it's hard to imagine any responsible company allowing an AI to go execute some critical business function without human oversight anytime soon. Will it happen eventually? Gosh, I couldn't answer that question.

(47:40):
But I think over the medium term, let's say five-ish years (I don't even want to give a time horizon), what we will see is a lot of the rote, time-consuming work that we do be automated, and the role of humans will be overseeing a lot of this kind of work. And checking, like my example earlier, a reimbursement

(48:03):
receipt for compliance. I mean, that's something humans do today that requires, or has required, human cognition and brain power. But you could totally imagine that an AI would be pretty good at doing that task.
Would it be 100 percent? No. Is there a risk in it not being 100 percent? It's low. If that AI gets it slightly wrong, well, there's probably an appeal: the person

(48:27):
who filed for that reimbursement might say, hey, the AI rejected me, and then a human would step in. So you can imagine how a lot of these processes could start to be automated, and you can also imagine how human oversight isn't going away anytime soon, right? That's kind of the future I imagine in the next couple of years.
Awesome.
This is great.
I think we've touched upon the past, the present, and the future a little bit.

(48:51):
But maybe we can get back to some success stories, especially going back to the NIM and NeMo stack that you're building. Could you share a success story of some of these projects that have impacted customers' AI capabilities in the recent past?
Sure.
Yeah, so, I mean, one of our earliest

(49:11):
successful customers is Amdocs. Amdocs is a very large company that is a service provider to the telco industry. So, a lot of the bills that you get from Verizon... actually, maybe I shouldn't mention companies, because I'm not actually sure who their customers are. But when you get your cell phone bill, a lot of times it's actually processed using Amdocs software.

(49:32):
Yeah.
Customer service, a lot of the software that runs the telecom industry, is operated by Amdocs. And they've been a very early and great partner of ours in adopting NIMs. We have something called NeMo Retriever, which is a collection of retrieval models. So in this RAG system, Retrieval Augmented

(49:55):
Generation, there's this question of how you retrieve information. How do you find documents that are likely to be relevant to the conversation that you're having with your AI? There are AI models that do that, and so we have something called NeMo Retriever. They've used NeMo Retriever and NIMs, they built a bunch of different RAG systems, and they've seen huge success.

(50:15):
So the numbers they told us (we have a blog post about this on our technical blog on nvidia.com) were, I think, an 80 percent reduction in latency versus deploying a comparable system using all of the various managed services that were out there.
So what they really wanted was, rather than hitting all these different

(50:36):
endpoints managed by different service providers, on different networks wherever they may be, to take all the models and run them themselves and hit the endpoints on their own infrastructure. You reduce network latency. They could ensure they had optimally sized models, and they had control over what the models were and how big. There's always a trade-off between a

(50:56):
bigger model, which is higher latency and slower but more accurate, versus a smaller model; maybe the bigger one is overkill for this one thing, or I could take a smaller model and fine-tune it to be really good at this one task, so that even though it's smaller, it's actually more accurate than a bigger, more generic model. So all these kinds of factors.
And so they did a lot of this work with our help, and they saw an overall end-to-end reduction of 80 percent in latency, with

(51:19):
no loss in accuracy or quality. I think it was something like a 60 percent reduction in data preprocessing cost and a 40 percent reduction in inference-time cost, if I'm getting this right. And again, it just makes sense, right? Deploying all this, being able to optimize the models:

(51:40):
deployment on optimized infrastructure is going to have benefits. And I think they're a great example of a company that has a real production system with demanding customers; they deployed it and they're seeing some really great results.
Awesome.
This is great.
Thank you so much, Jonathan, for being on this session. I learned a lot, all the way from how the NVIDIA

(52:05):
AI strategy is built on these Lego blocks, to fine-tuning, prompt engineering, building RAG applications, inferencing them at scale, and the guardrailing aspect of it. And honestly, what we are seeing in enterprise use cases, whether it's banking or financial services, is people building these internal search applications, Q&A applications, and customer service.

(52:27):
These are some of the dominantgenerative AI applications.
And the things that we talked about, security, guardrailing, inadvertent keywords showing up, or topics that the chatbot should not mention, and then just the sheer accuracy of these LLMs: these are really top-of-mind issues.

(52:48):
We are very excited aboutthe partnership that we have.
So if you're thinking about NeMo Guardrails, like many of our customers are, you can integrate Fiddler with NeMo Guardrails for monitoring, and also get the guardrailing and the rules framework that NeMo offers today.
Thank you so much, Jonathan.
Thanks again.
Thank you.
That was a really enjoyable conversation.