Metrics to Detect Hallucinations with Pradeep Javangula

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Joshua Rubin (00:06):
Welcome and thank you for joining us on today's AI Explained, uh,
on Metrics to Detect Hallucinations.
This is kind of a fireside.
Um, I'm Josh Rubin, PrincipalAI Scientist at Fiddler AI.
I'll be your host today.
We have a very special guest today onAI Explained and that's uh, Pradeep
Javangula, Chief AI Officer at RagaAI.
Welcome, Pradeep.

Pradeep Javangula (00:28):
Hi, Josh.
How's it going?
Hi, everyone.
Thank you so much.
I appreciate the invitation.
So looking forward tothis conversation, so.

So, uh, maybe before we jump to intros, uh, I'm just
looking at the poll questions here,um, I guess as advertised by the, the
name of the fireside, hallucinationsare really top of mind for people.
Um, but, uh, I don't know if you'relooking at the responses also, but
it looks like, I'm actually, it'sactually interesting to me that we
have some coverage over all of thedifferent, uh, all of the different

(00:59):
questions that people are worried about.
So, latency, cost, privacy.
Um, but I think I would have probablypredicted a distribution like this,
uh, with some, some significantconcern about safety and toxicity.
It's certainly top ofmind for us right now.
Um, so probably, maybe, maybe you cando a quick introduction of yourself.

Yeah.

Yeah.

Well, first of all, thank you, Josh.
Uh, appreciate the invitation andthank you everyone for joining.
So, so my name is Pradeep Javangula.
I'm the Chief AI Officer for RagaAI.
Um, we are a comprehensive testingvalidation and verification sort of
like product and service that, um,we're a sort of like a fledgling

(01:41):
company that's been around for roughlyabout two, two and a half years.
And, uh, consisting of a variety of datascientists, machine learning engineers.
My background is I've been in the,in the Bay Area for lots of years.
Since 1996, I often like to callit the dawn of the internet.
I've been a founder of differentcompanies and I've been fortunate to be

(02:03):
involved with many aspects of data, datascience, machine learning platforms,
primarily through Search engines.
So I used to be at a company calledInk2Me, which was a big web search
engine before there was a Google.
We used to power Yahoo,Microsoft, and AOL.
And after that, started acomputational advertising

(02:23):
company and spent time at Adobe.
As the head of their ML andAI efforts, and it worked in a
variety of different domains.
So, enough of that, but, but yeah,like I said, it's like, you know, so
if you'd said 20 years ago that like,oh, this is the level of, um, sort of
interest and progress and progress thatwe would have achieved with AI, um, In

(02:49):
2024, I'd have been a bit surprised,but here we are, so it's all good.

Yeah, it's an amazing, amazing sort of resume for
the stuff that you're doing now.
I often point at Adobe as, uh, you know, asort of, uh, a really impressive, uh, You
know, organization for having mobilizedAI, especially generative AI, you know,
for a lot of really valuable use cases,you know, I think maybe more than most

(03:18):
applications, um, Adobe's products now arebenefiting from, you know, years invested
in, um, you know, generating reallypowerful, useful generative AI tools.
Um, but I guess that's not,not so much the topic of today.
Uh, right, right.
So, um, I, you know, we're, I thinkwe're going to talk, at least start

(03:39):
with, um, hallucinations today.
And, uh, you know, I think as everyoneis racing to roll out one kind of, you
know, LLM or generative applicationor another, Um, you know, it's sort of
this new, new paradigm of programmingwhere, uh, you know, the output
is not particularly deterministic.
So, uh, maybe, maybe we start withthe obvious question, uh, you know,

(04:01):
what, what's a hallucination anyway?
You know, and maybe why shouldpeople be worried about it?

Right.
So, um, I guess maybe it's a goodidea to start at the beginning in
terms of sort of like, uh, defining.
First of all, I'm not a huge fan ofthe word hallucination as a way to
describe the phenomenon that occurswhen leveraging, large language
models or foundational models interms of generating, content, right?

(04:30):
And so the generative AI terminologyis probably a better sort of like in a
use case for what this thing is, right?
You know, it is, primarily given thata large language model has been trained
on a very large corpus of text, right?
Where there's Some discerningand determination of what
the underlying patterns are.
It is attempting to predictsort of the next token, right?

(04:53):
Or the sequence of tokens, right.
So depending on what the question is,what the prompt is, there is content
that is being generated and, and in thecontent that is being generated, it is
maintaining a certain level of fidelity tofacts, or actual truth, is something that

(05:15):
a large language model has no sense of.
It has no idea that it is actually spewingout something that is, sort of like,
appears statistically, stochastically,as the next logical token to show up.
So.
Because of that, you would end upwith text or images or multimodal
sort of like output that is generatedthat doesn't jive with the truth.

(05:37):
Therefore, it appears tobe making stuff up, right?
And that's what's sort of likebasically referable as hallucination.
And as you rightly pointed out, sothe risks associated with producing
something that is untruthful or somethingthat is not sort of like grounded
in reality have adverse implicationsthat really do need to be addressed.

(05:58):
So, which is I guess the reasonwhy 70 percent of our audience
says hallucinations is ournumber one problem, right?
Um, and, um, yeah, so, so I woulddefine hallucination as, as, as, as
basically the, the generated contentNot being grounded in truth, not

(06:18):
being grounded in anything that isactually contextually relevant, right?
Um, and so it'll often appear to be thatthis thing that's on the other side,
the model that's on the other side is,um, is producing arbitrary content, so.
Um, so we can call it chat.
In some circles, it is actuallyconsidered a feature, not a bug, right?

(06:42):
You know, because it's actuallyspeculating, and it is sort of like
generating things that are, that anormal human would probably not do.
So it's not a scientific or mathematicaldefinition, but there it is.

Yeah, right, right.
I think, I think the kind of fundamentalmisunderstanding here is sort of that,
uh, You know, a large language model ora generative AI application and it is by
definition supposed to be doing somethingsort of factual and well constrained
like computer software usually does.
Um, you know, and it's sort oflike, it's sort of like making

(07:19):
stuff up in one flavor or another.
In a sort of statistically,uh, justified way is sort of
the nature of the beast, right?
So, you know, instead of programmingwhere it's sort of this additive process
where you're building things thatbehave in a well specified way, it's
a little more like sculpting, right?
Where you start with something thathas this behavior of spewing plausible

(07:43):
statistically grounded stuff and thensort of carving away the bad behaviors.
Uh, to get at, uh, something that'sa little bit more, um, appropriate
for a specific business applicationor, uh, whatever application.
Um, you know, another kind of relatedconcept that, you know, that I think is
interesting are, um, You know, sometimesit can be factually correct, but sort

(08:07):
of contextually inappropriate, right?
Like, you don't, if you're a business,you don't so much want, uh, you
know, if you're an airline, youdon't want your, uh, customer service
chatbot giving medical advice, right?
Even if, even if that's kind ofbaked into the brain of the chatbot.
There's also this very related,interesting sort of issue of, you know,
constraining the model to be sort ofcontext appropriate or policy following.

(08:29):
Um, you know, some of those things wedo with kind of fine tuning, uh, but,
uh, you know, a lot of LLM developmenthappens through, uh, through prompting,
um, anyway, uh, you know, and basicallyasking the model for what you want.

That's well put.

So, um.

Really well put.
Yes.

Yeah, yeah.
So, uh, I don't know, do you want to,maybe, maybe you can say something
about how, um, how RagaAI thinks about,uh, sort of, uh, evaluating these
things, like, you know, how, how,how do you measure this and, uh, you
know, sort of business specific way.

Maybe it's like a brief introduction to RagaAI
would be appropriate, right?
Meaning that, first of all, we serve sortof the enterprise community, meaning that
enterprises that are building applicationsthat are AI enabled or AI driven, either
for consumers or for other businesses.
So it's essentially B2B, sort oflike, is our primary orientation.

(09:26):
So in that context, as you rightlypointed out, unless you are an OpenAI or
Anthropic or someone else that is buildingsort of general purpose things that are
aimed at everybody, most enterprises areleveraging these large language models
of different stripes for the purposes ofsolving specific problems, conversational

(09:49):
agents, or some sort of reasoning agents,or workflow agents, or insight deliverers,
um, based off of their enterprise corpus.
Whatever that corpus may be.
It could be large volumes of text,it could be some structured data,
it could be image data, and so on.
So RagaAI is primarily involved withAttempting to figure out what your

(10:17):
problems are in a machine learningor an AI app development cycle.
And our primary focus is pre deployment,although there's some overlap with
Fiddler and other, and other companiesin terms of like what we end up
doing on the inference side as well.
Primarily, we look at what we deliver asa, as a platform to deal with data issues.

(10:39):
Model issues and the operational issues,meaning that as you are contemplating
on building a RAG application orbuilding a much more of a classical
machine learning application, whatis the level of rigor that you apply
with respect to sampling, with respectto the segmentation between the two.

(11:00):
test and train, or train and test, andfiguring out how well your model is
performing with a certain level of all themetrics that data scientists care about.
Precision, recall, AUC ROC curves, and,and, and all sorts of things, and think
of sort of like a suite of tests that wedeliver to to the developers or to the

(11:26):
machine learning engineers or to datascientists that they wouldn't have to
hand code but given data, given model,and given some sort of like inference
input, we would assess and will tell yousort of like where things are, right?
So, so that's the, that's the thrust oflike what we're doing and as a consequence
of like what we do, we kind of have toplug ourselves into the stream of the

(11:50):
application development life cycle.
So the ML or the AI appdevelopment lifecycle.
So, so we end up often being askedto sort of like deliver on compliance
and governance oriented sort of likesolutions in specific verticals.
And I should also point out that thisproblem of being able to detect all
problems and point to the root causesof like what's causing those problems

(12:15):
is highly non trivial, number one.
Number two, it is.
very verticalized in nature right.
And so we by no means are claimingthat we've solved all problems for
all verticals for all times right.
So and on top of that, ourecosystem continues to evolve
very rapidly and very quickly.

(12:37):
Our focus has been on, on,on, on specific verticals.
So we've started out with computervision and then expanded to sort of
like large language models and text.
And we also have supportfor structured data sets.
So that's, that's what we are.
So maybe I actually missed outon, on, on the question you asked.
So, so yes.
So it's a pretty...

I lost track of it already.

So pre deployment is, is, uh, is our primary focus,
so to probe into the applicationdevelopment lifecycle with respect to
data and model and identify issues.

Nice.
Um, I can relate to your sentimentabout the challenges associated with,
uh, you know, drilling down and findingproblems and offering remediations.
I think, you know, sort of the firststep is And maybe you would, I'd be
interested in, you know, your perspectiveon this, but I think we sort of see

(13:33):
a first step as observability, right?
Like, can you characterizehow well the thing works?
You know, for us, we pay a lot ofattention to, you know, production time.
Has the world changed in some way?
Has some new topicemerged that's a problem?
Uh, you know, the first step is like, youknow, having some sort of comprehensive,
uh, observability into the performance ofthe model by some number of dimensions.

(13:57):
Um, You know, but of coursethat's not enough, right?
A customer, you know, what we hearall the time is, okay, you've told
us there's a, an issue, uh, butthat only gets us halfway there.
Like, you know, where,where do we take this?
Um, so I don't know if you have any,any, any thoughts around that, or
we can maybe move on a little bit.

Like I said, this is a hard problem, but as I said,
ultimately, sort of like an applicationis evaluated in terms of how well it is
or how satisfactory its functionalityis as discerned by the user, right?
So, you know, but we can knowsomething about how well.

(14:40):
It's being received or howsatisfactory the responses are from
sort of like inference data, right?
Which is, which is what you guys doan amazing job off and try to point
to sort of like other interpretabilityor explainability aspects of sort of
like, what, what, what's going on?
But then you want to takeit all the way back, right?
Meaning that, how do I re.
So, you know, in what way should I changemy training data set so that it's much

(15:05):
more uniformly representative of the kindof model that I ought to be building?
Is there how much of a drift is there?
How much of, how much ofsort of like failure mode
analysis is actually occurring?
So, how do you deal withthings like active learning?
So this is what we do.
This is basically the communityrefers to as active learning.

(15:25):
It is taking what's happening outin production and identifying the
things that you ought to do, um, interms of improvement or a continuous
sort of like improvement in termsof trying to make that happen.
And often it is gated by How muchsort of like volume do you have right?

(15:45):
It's like, it would be great if you,we could have sufficient traffic
to be able to sort of distributeit across a swath of models.
It could be like just an A/B test or itcould be like an A/B/C/D test or even
multivariate testings on for differentsort of like champion challenger models
out in production so that would beprobably the best way to deal with it

(16:06):
but often you are dealing with sortof like one at a time and then you are
trying to come back to some correctiveor remediation measures so uh so we given
that we were involved in the core oftheir Sort of like development pipeline.
So the development life cycle,uh, we have observed what

(16:27):
they did, how they've labeled.
For instance, if we do supervisedlearning use case of how well they have
labeled things and what the labelinginaccuracies may be, and using very.
A very methodical statistical measuresabout in and out of distribution
of the training data set, right?
And of looking at a time seriesevolution of the metrics, um, of, of the

(16:54):
different types of model metrics, andeven of multiple models to be trained.
to be, to be tested on, on, on, onthe specific data set and so on.
So that's the kind of thing that we do.
And given that we also follow sortof the lineage of the data as it
gets processed through, uh, right.
And, uh, so, and, you know, like Isaid, that's, that's why it's like, you

(17:16):
know, uh, there's a, there's an overlapbetween what Fiddler does and what,
what Raga does from the perspective ofsort of like observability in general
i would even argue that what we doshould be called observability as well
but then you know that it is right so

To my taste i'm happy to I'm happy to call you observability
also Okay, my questions are online, I'mkeeping notes, and they're branching
in a very, many different branches ina non linear way, so I'm trying to,
let me see how I want to slice this.
So, you know, one question that's sort ofinteresting to me is, you know, when we

(17:55):
think about things like basic, you know,uh, classifier metrics, precision recall,
very under the curve, that kind of stuff.
Um, I'm curious how the largelanguage model is different for you.
So when, so, so, you know, oftenwhen we're thinking about things like
active learning, uh, gosh, yesterdaywe were, uh, discussing a problem

(18:19):
with a hard classification use case,and there's this old paper from, uh,
Facebook Research, uh, this Focal Losspaper, where they figured out a bunch
of stuff for computer vision models.
Uh, but basically by asking themodel, during its training, to
take the cases where it was unsuremore seriously than the easy cases.
So there's an adjustment to thelock function that causes it to sort

(18:41):
of, uh, get really good at thingsit finds challenging or really
focus on things it gets challenged.
Um, so I guess maybe one, one question foryou is how, how is the, um, generative AI
or large language model world, how is thatdifferent from a sort of, you know, you
may not be training these models, right?
You might not be fine tuning, um, youknow, and, and, and what knobs do you have

(19:04):
for, um, do you think about in terms of.
You know, ways to improvean underperforming model.

Right.
So firstly, you know, in the, inthe enterprise scenario, and I'll
limit my comments to sort of likethe enterprise, the non general,
general types of use cases, right?
So the most predominant of LLMadoptions are really happening
In terms of retrieval, augmentedgeneration use cases, right?
And, and the LLM is largely being calledupon to generate, summarize, to sort of

(19:40):
like, uh, uh, to, to, to, to provide ananswer that isn't a, a blue link, like in
a web search scenario with some sort oflike a snippet that surrounds it, but much
more about a summarization of a co, of a,of a set of results that are appropriate
to the prompt that has been given.
So given that, that is the, that's theprimary use case that we see, right?

(20:00):
Um, in terms of sort of likethese foundational models being
deployed, you basically start theworld with your context, right?
Meaning your context DB or the corpus thatis relevant to the kind of rag application
that you want to develop, right?
And you go through a variety of sortof like embedding generations of those

(20:22):
documents and put them into a intoa vector db and then when and and
you know you go through yet anothersequence of steps that allow for a user
to be somewhat guided through a setof prompts that are relevant to the
domain or the context in question rightand hopefully you can apply a variety
of some personalization techniques aswell knowing who the user is or what

(20:45):
they're attempting to do, and so on.
And then, uh, uh, from, from aresponse perspective, you want to
leverage the results of such retrievalmechanism to be summarized in a
cohesive manner, depending on whereyour, what your style of choice is.
Now, so what are the potentialproblems that could occur?
So first is, I wouldcall coherence, right?

(21:08):
So, you know, we were justjumping into sort of like the
metrics associated with that.
What would cause hallucination, right?
And so the things that could causehallucination first This looks
like, you know, hallucinationneeds to be expected, right?
And so what you want to do is to tryand figure out how much hallucination is
likely to have happen, and that requires areasonable amount of rigor as far as your

(21:33):
quality assessment process is concerned.
And, you know, to the extent possible,like you pointed out, have kind of hard
guardrails about sort of like, I don'twant to answer a question about politics
or about healthcare if this is a customerservice bot or, something to that effect.
so, coherence is really sortof the measure of complexity

(21:55):
of the generated text, right?
Meaning that the hallucinatedtext may actually exhibit a
high level of perplexity due tosort of like inconsistencies or
kind of nonsensical sequences.
Now, you know, The quote unquote notionof nonsense is really a human judgment
thing, right, which is the reasonwhy what you want to do is to surface

(22:18):
something like a perplexity metric,and allow for a human to be able to
figure out what's going on, right?
So now, if in response to a given prompt,the probability of such perplexity
arising is really high, then really sortof, you know, given that we don't often
have sort of like the innards of, of amodel, of an LLM that we are leveraging.

(22:44):
We might just want to back off,right, saying it's like, look, you
know, I am not trained to answer thisquestion or that, you know, so you
do some sort of like a boilerplateresponse in order to make that happen.
Yes, it's not a particularly satisfactoryanswer, but like, you know, so there
are potentially other ways of dealingwith it in terms of maybe you meant
this, right, in terms of, and sinceyou, since we are dealing with these

(23:04):
sort of like large dimensional vectors,we can always think about things
that are within its neighborhoodaccording to a, uh, and a metric.
So one is sort of like perplexity.
That's, you know, among the coherentsub bucket, I would put perplexity.
Second is word overlap, right?
Meaning comparing the generatedtext with reference corpus.

(23:26):
I mean, I'll often go back to, so thereference corpus being sort of like,
quote, unquote, the ground truth or usbeing able, being able to evaluate up
front how closely it matches up withthe corpus that is being used is one
of the most sort of like foolproofways of, of, of dealing with it.

(23:46):
You know, I've seen less instances ofthis sort of the, uh, sort of third
element inside of, uh, inside ofcoherence is basically grammar and syntax.
So I think all of these large languagemodels seem to be just almost impeccable.
In terms of the grammar withthe degenerate, it seems like,
you know, really hard to detectcases where it is the case.

(24:08):
But I did see some cases where it'slike, no, I wouldn't write that.
That's not sort of like English,you know, and not in that model.
So that's sort of like onebig bucket of coherence.
The second bucket I would say isbasically fact checking, right?
Which is assuming that everything thatyou fed into your context DB is the truth.

(24:31):
This is, this is, this is, this is, thisis your facts and this is the enterprise's
fact or the data scientist's facts.
Knowing, uh, knowing that and, and,and, and understanding or analyzing
that corpus sufficiently or preppingthat corpus sufficiently with All sorts
of techniques and I've seen people donamed entity recognitions or knowledge

(24:52):
graph building or other forms of sortof like embedding generations that
are true to that specific domain.
These are all things that areimportant and which will guide
in terms of saying it's like ofthe response that was emitted.
How do I actually effectively compareit against the context that was provided

(25:12):
and, you know, in a pure mathematicalgeeky sense, you can think of it purely
in terms of distance or in terms of,uh, of nearness or, uh, or, uh, or other
ways of thinking of it from a classicalsort of like NLP perspective, right?
So that's sort of like fact checking.

(25:32):
And so, you know, semantic coherence,semantic consistency is also another
thing that I would point to asan important part, which is that
there's, you know, uh, like, likeboth of us were pointing out, this is
basically a stochastic pattern, right?
It is attempting to generate largerand larger sequences of words that,

(25:53):
that cohere with some level of Not tobe confused with semantic discipline
or linguistic discipline, but thesemantic nature of the content that
is being generated can often be weird.
It's like, you know, I'veseen responses where the first
paragraph and second paragraph arecompletely unrelated to each other.
It's like What are you doing?

(26:14):
And, uh, and, you know, we're putin this sort of like this awkward
place, Josh, right, where there's thisthing called the foundational model.
We don't know what all went intotraining and specifically not even
mostly familiar with the types ofthings that go into the weights or
the corpus that is associated with it.
So we can control what we can control,and therefore the things that we can

(26:38):
control are the corpus, And like yousaid, so like guardrails and other sorts
of things that, uh, that make sense.
So those are sort of like the, and, andthere are a few others like contextual
understanding and having some form ofhuman evaluation, which is some level
of human evaluation upfront in thedevelopment process is the thing that is,

(27:00):
that's kind of clearly important, right?
So it's like, you know, if youdon't spend sufficient amount
of time on that front, then.
One can end up in a place where it'slike, well, we don't know how it's
going to behave in the wild, right?
And that's when Fiddler is broughtin saying, it's like, explain
to me what's going on here.

Yeah, well, this is that, that question gets hard, uh, with
the large, large models, of course.
Uh, and I think all the, uh,Frontier Labs would agree.
Um, so I guess I, maybe you've alreadysort of touched on this, but, you know,
uh, you know, one question I would askabout these kind of evaluations are, you
know, human feedback, simple statisticalmodels, uh, you know, uh, simple ish.

(27:45):
You know, classifier models, do youevaluate with a large language model?
Is it models all the way down?
Uh, you know, how, uh, I'd justbe curious to hear about how you
think about this, um, you know.
Who's responsible for the eval?
I know that, you know, we're alwaysencouraging our customers to build
human feedback into the loop asmuch as possible because it's

(28:07):
sort of the gold standard, right?
If you can find out, uh, that humanbeings were actually disappointed by what
happened, um, you know, that's treasure.
Um, but not always so easy.

So, so, so the question was, how would you

Oh, the question, the question was, where do you, so,
you know, what, what kinds of toolsdo you use, you know, for, uh, is
it really all, the whole spectrum?
Uh, maybe the answer is just yes.
Um, from simple statistics.
Some of those things like, uh,that you mentioned, like, um,
you know, word overlap, right?
Those are, those are nice metrics becauseyou can compute them very quickly, right?

(28:43):
I mean, a lot of other things.
First, you know, it follows the keep itsimple, uh, rule, uh, it's effective,
but it's also fairly computationally,uh, um, inexpensive, right?
But some of these problems we canaddress with other big models.
Um, you know.

Right yeah.
So I, you know, uh, I think it'ssort of like, you know, tread
carefully would be my high level.
And, and, and, and the thingis, it is, uh, um, the, the, um,
evaluating, Your application, inconcert with an LLM, um, is first,

(29:29):
size and data sufficiency problem.
Just in terms of like, whatkind of corpus do you have?
Does it have the types of thingsthat can effectively answer the
questions that you want, right?
And, you know, back in the day, I usedto run sort of like enterprise search
engines, and this is very similar to that.
Part of the use case, except that sortof like, you know, um, in that case, just

(29:51):
like web search, we would just returndocuments with sort of like snippets.
And it's a bit of an unsatisfactory thingwhere there is no state that's maintained,
where there is no dialogue involved.
There is no aspects of, thereare no aspects of personalization
that can be employed and so on.
So I think we're in a differentera of, uh, of, of, of enterprise
search and enterprise retrieval.

(30:12):
So, uh, I would, I would argue that.
You know, that knowing what your datalooks like and how well you prep the
data for the purposes of the retrievalapplication that you're providing
is, is going to go a long way interms of eliminating sort of like
this strange hallucinatory effects.

(30:33):
That's number one.
And I think, you know, and you guysdo this too, sort of like, or other
observability platforms do this justin terms of sort of like bias detection
and mitigations on their classicalapproaches of trying to figure out
sort of like, Whether there existsbias, sensitive variables, and PHI and
PII data and analysis of that nature.

(30:54):
And, you know, but when I first sort oflike met the Fiddler founding team back
in like 2019, I was thoroughly impressedwith the sort of like tremendous focus
that you guys had on explainability.
I was at Workday at the time and uh,We've done multiple sort of like POCs
with your team in terms of trying toarrive at explainability as it relates

(31:17):
to compliance with employment law.
Many of our lawyers and the securityofficers were sort of like mystified by
like, you know, this, your application,your machine learning application
appears to be doing reasonably well.
However, So, what did you use, right?

(31:37):
You know, what kind of data did youleverage in order to make that happen,
and sort of the what if scenarios of ifI took this stuff out, or if I suppressed
this variable, or this feature, and soon, what would it do to the overall model?
So that I think is actually important.
Diversity of model evaluation is somethingthat I would, uh, I would definitely
take, and to your point, right, it is,Sometimes it's, it's helpful to sort

(32:01):
of like simply say, look, if I was a, ahuman attempting to sort of like answer
this question, is there a methodicalprocess that I can actually write down?
It might involve some mathematics.
It might involve some computationsand some statistical evaluations in
terms of trying to make that happen.
And attempting to.

(32:23):
See if your overall AI application,whether it's reasoning or whether
it is generating and so on, adheresto some such sort of like simpler,
uh, simpler world would make sense.
And what you pointed out is actuallyquite, uh, quite appropriate as
well, meaning that just start withsort of like a linear regression.

(32:43):
It's okay.
You know, so perturb one of thesefeatures, right, sufficiently, and try to
fit a linear model against it, and thenfigure out what your overall response,
um, responses look like in relation tothat sort of like smaller space, right?
Um, and, um, yeah, so there's a, there'sa bunch of other stuff, I mean, it's

(33:04):
like, and, and the beauty of this stuffis like the, the, the level of research
that's going on in this domain is.
is, is, uh, is really strong and actuallytrying to keep up with what's going
on is itself a hard problem, right?
You know, like you pointed out, this,the, the meta paper, the Facebook paper
has been around for a while, but there'sa whole bunch of other stuff and, uh,

Oh, absolutely.
It's, it's a firehose.
I think a firehose is anunderstatement for the rate at
which new stuff is happening.
Very challenging.
Um, you know, just a, just a sort ofa side comment that I think is related
to, you know, I'm, I'm a bit of ameasurement fanatic and we've talked
about sort of rigor in, uh, sort ofmeasurement and characterization.

(33:49):
Um, you know, the other end of thespectrum from that keeping it simple
solutions to evaluation is depending on.
Models of one kind or another, youknow, maybe even having, in some cases,
large language models, do some of thathallucination eval, ask the question,
is this model response faithful to thecontext that was provided to it, right?

(34:11):
Um, these models can reasonin very sophisticated ways.
Um, and, you know, uh, to, to mytaste, Um, if you're going to go to
that length to evaluate in those kindsof complex, sophisticated ways, um,
you know, then there's also a kindof a question of characterizing how
well those tools work at evaluation.
Like, there's this sort of meta evaluationlevel is, you know, if I'm going to

(34:34):
bring in some large language model basedsystem for, you know, So, this is a great
example of a model that we're working on.
And, you know, if you're just evaluatingmy application, then, you know, whoever
the provider of that is, whether it'sdeveloped in house or some third party
application, you know, you might askto, uh, see how well that model is
characterized on some public data, whereyou can verify, you know, you have, you
know, it might not be sufficient forthe lawyers, uh, who are worried about,

(34:58):
you know, the corner case, where it'sgoing to do something bad, but if you
can say something I think that goes along way to building trust and confidence
in a sophisticated set of tools.

That's actually what we do.
Yes.
Uh huh.
Our platform is really sort of likea, on the, on the modeling side of
it is actually a model evaluator.
And you can use sort of likean adversarial approach, right?
So you have your model, which is,which can be a black box to, to
our sort of like testing platform.
And then we can challenge it with anothermodel, To try and determine what the

(35:40):
fidelity of the responses look like.
And it is, it's alegitimate technique, right?
It's like, you know, the, the, in termsof being able to champion challenge this
or, or, and the other approach that youpointed out is also something that we
use, which is that, you know, because ofour testing platform and, and, and our
goal is to be embedded in the developmentcycle, so we really don't follow the

(36:04):
classical, sort of like SaaS model, right?
In the sense that we.
install or allow our applicationsto be deployed in a native
environment for a specific client.
Right.
So, and given that that's the case,since we have to show stuff about
like what we do, we often leverageopen source datasets, right.
You know, from AR, VR, or from sort oflike publicly available sort of like

(36:28):
corpus, um, um, medical data or medicalimaging data, those are actually harder
to come by, but, but like, you know,news feeds or Reddit or other sorts of
things in order to show off what our,uh, what our actual platform does.
And that's a good, good enough, in myopinion, way to evaluate whether your
model evaluator or this meta level thingthat you're talking about is going to be

(36:51):
useful, uh, uh, to your purpose or not.

Nice.
Yeah, yeah, no, I think we're in asimilar boat in how we train and evaluate
the, you know, the tools that we use.
Um, so probably in the next few minutes,we should cut over to questions,
but I had a couple more littlethings that were interesting to me.
Um, and so maybe we, uh, we try to addressthose in the next five minutes, and then

(37:15):
we give our audience a chance to type somequestions in if they haven't already, and
then we'll jump over to the questions.
Um, so, so one question is, uh, you know,do you think about this hallucination
evaluation from the perspective of,like, is it, is it a model developer
problem, or do you think about the, thebusiness user in, in, like, the, um,

(37:38):
you know, there's all these differentstakeholders and organizations, and
one question we get often is, like,Who's sort of your end user, right?
Like, and to what degree are theyworried about the same things?
Do you think about, um, you know, is thata distinction that, that you think about?
Like, uh, certain, certain metricsbeing of more interest to business

(37:58):
users, or like sort of KPI orientedmetrics versus, you know, the sort of
strict science y metrics that a modeldeveloper might be interested in.
Does that factor into your thinking about

No, that makes sense.
That actually makes sense.
And I think, you know, I don't know ifit's necessarily about which metrics are
more relevant to business users and so on.
Ultimately, everyone cares aboutthe quality of the application
and the satisfactory thingthat it is delivering, right?
So, um, I would approach the sort ofproduct management or the business
team in terms of saying, it's like,how do you characterize success?

(38:32):
for this application.
You know, start at that highest level.
And of course, they have othermetrics in terms of, well, potentially
we're, uh, we're making our overallorganization that much more productive,
or that we're enabling a certainlevel of transparency, right?
Uh, and so people measure, for instance,these RAG oriented applications in

(38:53):
terms of the diversity of questionsthat are being answered effectively.
And, you know, to, uh, to someextent, sort of the productivity that
this thing is actually enabling interms of emitting an answer, right?
Um, uh, uh, you know, uh, uh, I'm, I'mspeaking from experience in terms of
sort of what we are encountering inmany of these application scenarios.

(39:14):
So the product managers or the businessexecutives are largely interested in,
uh, The, the, the, the fidelity ofthe application and the usage of the
application and the kind of satisfactionthat is being derived from it, right?
We, which are one set of metrics, right?

(39:35):
You know?
Mm-Hmm.
of like, you know, even if you do sortof an inline survey about did this really
answer your question or then you can alsohave a slightly deeper measure in terms
of, in order to answer this question.
I needed to retrieve a response fromlike three distinct datasets that were
not always interlinked to each other,but you somehow correlated all of those

(39:56):
things and emitted a coherent response.
And that's, those are thetypes of things that, uh, that
people are using to evaluate.
On the data scientist side, andgiven that I'm a math geek like you.
So we probably end up being much,much more rigorous in terms of
actual quantitative things thatare that, that are involved.
I mean, whether it be distance metrics,based distance, or sort of like, you

(40:20):
know, the levels of clustering that'sinvolved are the types of models and
algorithms that are being leveraged.
You know how.
Comprehensive were the number of modelsthat you used to evaluate and then
back on to sort of the things that youguys do in terms of sort of, you know,
shapley values or integrated gradientsor other forms of ways of explainability.

(40:42):
Those are the types of things thatI have seen data scientists and
machine learning engineers obsessover more, which is, which is fine.
I think, I think that'swhat you should do.

I would, for me, for me, I would say that kind of most
commonly used, I mean, it really, Inthe, in this LLM world, more than the
explainability metrics, since that's sucha, a, a challenging problem right now.
Um, one of the most useful things is,you know, uh, semantic similarity.
It's, you know, this, it's the same,the same thing that sort of undergirds

(41:11):
this, the vector search, right?
It's converting prompts and responsesinto, into a mathematical representation
that, where, where proximityrepresents semantic similarity.
Um, because that ends up helping tolocalize, um, common failure modes.
If you can identify where a bunch ofproblems are happening and you realize

(41:35):
there's some semantic commonality, likethis is, uh, you know, questions that
have to do with this particular topic.
often have a hallucinationproblem, or often our users
thumbs down the response, right?
That's an incrediblypowerful diagnostic tool.
Um, so, um, okay, so I'm going tomaybe jump to a question here, and

(41:57):
I think you covered maybe some ofthis, but the first question is,
there are many types of hallucinationscores for evaluation and monitoring.
How should each be used?

Um, let's see, um, did, uh, the valuation metrics ought
to be used, um, first in terms of,like I said, detecting, um, detecting
things such as perplexity, you know,uh, determining, uh, a sense of sort
of, like, how contextually relevantthey are, how semantically relevant

(42:32):
they are, um, and what level of semanticconsistency is being achieved, right?
So, uh, you know, we can characterize eachLLM application with those metrics, uh,
in tow, then, um, you can be reasonablyconfident about how well your thing
is going to perform out in production.
So I know it's a very general purpose30, 000 foot level answer, but

(42:55):
that's, uh, uh, That's what we are.
Yeah, that's interesting.
Um, let's see, what do we have also?
What have you seen, by the way?
I'm curious to hear yours on that front.
Just in terms of, like, how haveyou seen sort of, like, these
metrics being leveraged or utilized?

Yeah, um, I mean, people want hallucination detection.
We hear that a lot.
This is, again, this is not going to be asurprise given the, um, the title of this
and the, the, uh, what do you call it?
The survey from the beginning, right?
Yeah.
Um, you know, we talk to a lot of, a lotof potential customers and our existing

(43:39):
customers who, you know, really wantsomething sort of ironclad with high
classification, uh, capability, um, youknow, and oftentimes we turn to things
like, you know, in the literature now,people talk about, you know, Um, uh, you
know, response faithfulness, you know,it's the, does this answer faithfully

(44:02):
reflect the information that was in thecontext provided to the model, right?
Um, and that breaks down evenfurther into, into questions like,
um, you know, is there a conflictwith the information inside the, uh,

That's right.

The context, or is it, or is it just baseless, baseless
made up additional stuff, right?
So baselessness versus conflict isa, is a sort of a nuanced metric.
And then there's also relevance.
And I think this, this toucheson a lot of the things that
you were, you know, mentioning.
And some of these, you know, this is, uh,you know, you can use things like, um, uh,
you know, traditional statistical metrics.

(44:39):
There are things like blue and rougethat were developed like, uh, 20 years
ago that are, that are sort of these.
You know, and, and these, so, so thequestion of, like, what, this response
may be factual, it may be grounded,but didn't answer the user question.
Um, you know, the other one thatI think is hot is, um, LLM aside,

(45:02):
did my retrieval mechanism pull upthe information that was relevant?
So, context relevance, right?
Um, so like, you know, sometimes, youknow, this, I think in some ways this gets
into sort of production monitoring, right?
If there's a topic shift in the world,if all of a sudden one of your customers
starts asking, you know, uh, they seesomething in the news and they start

(45:24):
asking your chatbot a question that is,not present in your vector database.

Yeah, that's right.
Yes.

Um, suddenly you might on one random day start missing a lot of,
uh, you know, opportunities to give agood answer just because you didn't know
that the users were, you know, you know,Asking about the Taylor Swift concert
in your town, or, you know, whatever,make up your, uh, your story, right?
Uh, so, you know, being able to quicklyidentify whether there's a problem in

(45:54):
your, uh, in your content, either in yourdatabase or in the way that your database
is configured to retrieve documents,like the chunking or the embedding model.
Um, so, I don't know, this gets a littlebit, a little bit nuanced, but, but I
think all of those things are important.
All, you know, instrumenting allof the different pieces of this.
LLM application.
It's not just about the generative model.

That's a really good point.
You know, we used to run into this problemall the time, even back in sort of like
the early 2000s and so on, where, firstof all, this notion of sort of like, how
well, how contextually relevant are you?
Was a problem that that existed evenbefore, and evaluating the quality of

(46:38):
your application in the context of achanging universe, the types of questions
are being asked, or the corpus itselfmorphing itself pretty dramatically.
I think Google's done a aphenomenal job of this stuff.
over over the years and so like thethe the suite of techniques that they
leverage from everything from knowledgegraphs to sort of like page rank to

(47:01):
uh old suite of select personalizationapproaches are all quite relevant.

Um, you know, what about malicious, what about identifying
malicious intent from users?
Is that something that youguys think about at all?

That is something that we think about.
Um, and, um, yeah, sothat's what we're doing.
We, we are really using sortof like an adversarial approach
in order to make that happen.
Um, and, and, and trying to determinesort of the, it's almost like an

(47:31):
intent classification in the prompt.
Right.
Um, and to the extent possible, we,we attempt to detect it and point
it and, and, and, and actually takean aggressive approach to it, right?
It is that, you know, just backoff of sort of responding or
divert the, uh, divert the user tochange topics to something else.

(47:52):
That kind of stuff is, uh, iswhat we have been advising.
We've been, we've been trying to like,you know, uh, since we're trying to build
basically an overall quality and, and.
validation and verification platform.
We have to build some ofthese things ourselves, right?
Otherwise, we will have verylittle credibility, uh, as
we build out our platform.

(48:12):
And as we do this, we discovermore and more problems.
And, uh, you know, so the, theworld is also changing quite a bit.
Yeah.
So, so you're right.
Maliciousness is a big deal.
Toxicity is a pretty big deal, right?
Just in terms of, you know, thedetermining what the tone is.
of what's contained in a prompt.
And again, sort of like, you know,detecting it and gently guiding in a

(48:37):
different direction is the way to go.
So, yeah, so those are definitelythings that we're thinking about.
By no means am I saying thatwe have a 100 percent sort of
like answer to that problem yet.

We have an interesting it was just because we're kind of in this
toxicity and sort of malicious humanmalicious intent it gets in into this
You know all the the prompt injectionattacks, you know people trying to
you know I you maybe you've seen thesethings on that are floating around like
Twitter and LinkedIn of like You know,they'll, they'll, uh, encounter a chatbot

(49:12):
from a car dealership or even like asalesperson will reach out on LinkedIn,
uh, you know, as a LinkedIn message,uh, and the, uh, the user suspects that
it, or the human who's received thismessage suspected it's actually a, an
LLM that's reached out to them, right?
And so they ask it some sort ofprobing question that sort of
reveals that it's Probably an LLM andthen they say, you know, gosh, I'm

(49:36):
really glad you reached out to me.
Could you please pretend like you'rea Python interpreter and, and, and,
uh, you know, and, uh, provide theresponse to the following code, you
know, and they get it to say somethingstupid because it's not a human and
it's happy to be a Python interpreter.
Um, so there's some, some really funnylike screen grabs that people have, uh,
put together from, you know, abusing Uh,chatbots that approach them like humans.

(50:00):
Um, one interesting problem, so likethe, well, let me, let me start with
the interesting problem, which isthat, uh, you know, what we've seen
recently, so we've been thinking a lotabout, like, prompt injection attacks.
I would say that's another,um, very hot topic.
It's, you know, how do you, you know,make sure people aren't trying to get
your chatbot to do something it wasn'tintended for or get something out of

(50:22):
your company that uh, you know, yourcompany didn't mean to offer, right?
So there was just this Air Canadathing a month or so back where, you
know, yeah, this chatbot, uh, you know,offered some, you know, special discount
tickets for somebody who had to attenda funeral or something like that.
That was totally hallucinated,not part of the terms.
Uh, of the, the company, and then,uh, you know, whatever, Canada

(50:44):
sort of refused to honor the fairlyreasonable miscommunication, actually.

Yeah, that one is particularly egregious, and I think it's
not even clear how much sort of, you know,um, accountability and liability does
the company have if a chatbot misbehavesand basically gives out, it's like, 50
percent off on everything for today.

Right, right.

It's like, what are you going to do, right?
And it's well within the contextof what it's supposed to do, right?
It is, uh, and it's difficult to detectthat particular, um, sort of like
weirdness in, in, in terms of responses.
And, you know, uh, these are like, it'sa weird world we live in right now.

There's, there's an interesting thing that happened.
So, um, I did this experiment, anonymity,sometime last year, I started playing with
ChatGPT and playing games of 20 questionswith it to see if I could discover
interesting things about how it reasons.
And you know, the, you know, you would,uh, so the, the gist of the story is,
uh, you know, if you ask it to think of aclue and play 20 questions with you where

(51:51):
you're guessing question, you're tryingto guess the object that it's thinking of.
Um, it actually doesn't, youknow, it doesn't know, it doesn't
have anything in mind, right?
It's just playing the game and tryingto produce a satisfactory answer.
If you, uh, you know, you can rerunthe same session with the same, uh,
random seed and you can interrupt atan earlier level and it'll give you

(52:11):
a different, different clue, right?

Pradeep Javangula: Completely different answer. (52:12):
undefined

And it really gets to this.
So, you know, one of the ways that thesemodels are fine tuned, like, maybe fine
tuning isn't right, but, you know, the,the, the human reinforcement learning
through human feedback mechanism thatthey use in order to make these really
great at answering questions in a waythat humans like, there's this theory
that it also, there's this, that themodel is doing this thing they call reward

(52:37):
hacking, Basically, it's finding waysto make humans happier with its answers.
And there's a benefit to that, whichit gets to be good at answering human,
you know, questions in satisfying ways.
But a side effect, uh, people think isthat, you know, this sort of confident but
wrong answers, uh, get enhanced, right?

(52:59):
Because humans are often compelledby Confidence over factuality.
The human doesn't knowthe difference, right?

Yep.

And, and so, you know, and it leads to these interesting
things like, um, you know, uh, a sortof bias to say yes versus say no, right?
Like, so, so it's, it actuallyturns out that you can at home
do these experiments yourself.
It's, it's, it's fairly easy to get, tobasically lead, um, an LLM to a kind of

(53:29):
answer that you want or a kind of behavioryou want by, uh, you know, offering it.
Yes versus no options, assumingit'll mostly take the yes path.
But you can actually guide them quitea bit, uh, with, you know, fairly
simple tricks where you're sort ofexploiting the fact that they're
trying to make humans happy um.

I have been thoroughly unimpressed with any sort
of like, you know, plannings thatthese generative models do, right?
You know, and I've beenreading a bunch of papers.
It's a huge area ofinterest for me, right?
And not just for Raga, it'sbeen so for, for a while.
I've always thought that sort ofthese, uh, the neural nets and the deep
learning approaches Starting with solike ImageNets have a certain sort of

(54:11):
like upper bound in terms of like howfar they can go beyond a certain level.
So I'm more of a sort of like a bitof a healthy skeptic of the approach
itself essentially yielding all sorts ofmachine reasoning in some form or shape.
I'm probably more of a fan of Gary Marcus.

(54:32):
Right, uh, who, uh, thinks of sortof like, you know, knowledge crafts
and knowledge representation andgrounded in fact being combined
together with statistical approaches.
I'm not suggesting that the, the, the,the, the large language models are not
doing sort of like just amazing things,just blow you away sometimes, isn't it?
So, yeah.
What it seems to be doing from a planningperspective about sort of like planning

(54:55):
a trip or sort of, you know, doing someof these agentic stuff that they're,
that they're attempting to do, but, but,but still sort of like in, in specific
domains, whether it be sort of like anindustrial domain or in a healthcare
domain or much more of a regulateddomain, uh, things are much more haywire.

Yeah I think the thing that I, I feel a lot of those opinions myself.
Um, you know, I think a thing to be, uh,mindful of is that, uh, things that, that
behave like humans, humans are sort ofhardwired to perceive as humans, right?
Like, and so, um, you know, in a waywe're sort of hacking ourselves, right?

(55:35):
There's a vulnerability when we make humanlike things, um, especially when we know
that They do have these fairly significantlimitations, um, so, um, I think
we're kind of rolling up on the hour.
I don't know if, uh, you had anymore, sort of, parting thoughts
about, sort of, um, you know.

Yeah, so I think, you know, uh, yeah, so I mean, my parting
thought, first of all, is like, youknow, this is a very great conversation,
pleasant conversation, and it's, it'sbeen, it's been fun to hang out and
sort of like pontificate for the mostpart about sort of like what's going
on with this sort of like LLM world.
I think the, the, the, this particularevolution is going really fast and

(56:20):
it is, uh, it's the pressure that iscoming on data scientists and machine
learning engineers from all quarters.
Thank you very much.
To be able to deploy these things,just at least to show off that we are
generating GenAI is something that I thinkone should all have a healthy level of
skepticism for, just as with every sortof software application or tool and so on.

(56:44):
So this is, be careful with respect tosort of like how you go about doing this,
what the right infrastructure choicesare that you ought to deploy, you know.
Careful selection of the context corpusthat is, that is going to be needed and
determining what things are important.
Like from some of the questions that I'mseeing, it's like, is toxicity important?

(57:06):
Yeah, I think it's toxicityis super important, especially
if it's consumer facing.
It can have a very adverseimpact on your brand.
and you don't want that to happen,right, and similarly sort of like
as much as possible pay attention tosort of attributability or basically
citations, right, when it is generatingresponses, you know, expose as much

(57:30):
of sort of like how you have builtthe application without Sort of like
intimidating the user, which are like,here's how I generated my response.
Here's why I think this is truthfuland so on, will go a long way toward
establishing credibilities So, um, youknow, and, and, and be willing to say,
it's like, look, this is a chatbot.
It's, it's, it's basic corenature is to be this stochastic

(57:54):
parrot that will generate stuff.
Therefore, there will be things thatare, that are, that are, uh, that are not
truthful, or not factual, or not, sortof like, are incoherent, and in which
case, it's like, take the, sort of, youknow, the, the, the, um, the caveat should
actually be that, hey, I can make errors.

(58:15):
It's a perfectly fine thing to do and,uh, you know, fall back on a, on a human
to be able to answer the question if itis sort of like truly mission critical.
So, and so that's my general refrain.

That's wonderful so thanks, thanks a lot, Pradeep.
This was a super easy hour.
I really appreciate it.
Bye.
Thank you for the opportunity.
Bye everyone.
See ya.

All Episodes

Episode Transcript

Popular Podcasts

United States of Kennedy

Dateline NBC

Stuff You Should Know

.css-15opob5{left:0;position:absolute;top:0.8rem;} All Episodes

.css-14f5ked{margin:0;word-break:break-word;display:-webkit-box;-webkit-box-orient:vertical;box-orient:vertical;-webkit-line-clamp:2;overflow:hidden;}Metrics to Detect Hallucinations with Pradeep Javangula

Episode Transcript

Popular Podcasts

.css-r6mb8g{margin:0;word-break:break-word;display:-webkit-box;-webkit-box-orient:vertical;box-orient:vertical;-webkit-line-clamp:1;overflow:hidden;}United States of Kennedy

Dateline NBC

Stuff You Should Know

All Episodes

Metrics to Detect Hallucinations with Pradeep Javangula

United States of Kennedy