Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:07):
Good morning.
Good afternoon, everybody.
Welcome to another webinar on AI Explained.
Today's topic is Lessons Learned from Building Agent Systems.
I'm Krishna Gade.
I'm the Founder and CEO of Fiddler AI.
I'll be your host today.
We have a very special guest today.
Jayeeta Putatunda, Director of the AI Center of Excellence at Fitch Group.
(00:32):
Jayeeta is a very accomplished AI leader.
She's a rising star in this field.
Her achievements include the AI100 Award in Generative AI, recognition among the Top 25 Visionary Women in FinTech AI, and selection as one of the 33 global advisors for NVIDIA's Enterprise Platform Advisor program.
(00:53):
Jayeeta, please join us on this program.
Thank you very much.
Thank you for the kind introduction,and thank you for having me here today.
Awesome.
Excited to talk a bit about AI.
Absolutely.
Jayeeta, I think you are in the thick of things in the field
(01:14):
of AI, in generative AI, and now agent systems. As the world is moving into the world of AI agents, what was the biggest reality check when you first moved a proof of concept to a live production agent?
If you can share any interesting insights, that would be great.
(01:35):
Yeah, I mean, that's a great point, right?
Every day when I wake up, I see there are five different processes and three different frameworks.
When we are in this space and in tech, the goal is always to build something that is new, that's better, that works the best. But how?
How do you really measure better?
How do you really measure if what you have in place right
(01:57):
now needs an upgrade?
If it needs an upgrade, what are the baseline metrics that you're really evaluating it against?
So my answer to that would be twofold, right?
First, I think it's always been there in the software development lifecycle: the 80/20 rule.
Most of the time, and I am guilty of this too, because I am a data scientist by profession and I've been in the field for 10 years,
(02:19):
I get excited every time I see a new model: I have to implement it, and implement it with a new framework, and see how it works.
But when you think not only from your developer perspective but more from the business function: how are you really adding value to the business?
Is the ROI worth it?
You kind of do that 80/20 rule again.
(02:39):
80% of the focus should be on all those use cases that are, like they say, low-hanging fruit. That doesn't mean they're not important.
It means they have a bigger impact, and your impact-to-effort ratio is good.
And you can take those on, rather than spending, I don't know, six months just prototyping.
And by that time the framework has already become obsolete. We've been
(03:02):
seeing it, just talking about it, right?
Like some models came up six months ago and now they're basically extinct from the scene.
And we are like, okay, what are we really doing here?
So that's my first take.
And I think the second biggest lesson is that most of the time, and this was true even when we were building predictive machine learning models, though it was easier to handle then because our systems
(03:23):
were deterministic and we had a good set of historical data that we were comparing against, comparing apples to apples.
Now the generative AI systems are giving so much output.
How do you really categorize it?
How do you really define your metrics?
And not just metrics like, oh, it gave me productivity gains.
What productivity gains?
(03:44):
Is it saving dollars? Is it saving your developers' time?
Or is it reducing the cycle of completion?
So, very specific methods of evaluating why you are really building an agentic AI application.
But yeah, I feel like if these two things are nailed down before
(04:07):
you begin your process, the chances of success definitely increase multifold.
Absolutely.
So before we dive in deeper, can you give us an overview of the types of agentic systems and products you have built and deployed recently?
Sure.
What are some of the use cases?
What are these systems trying to solve?
Yeah, I mean, a couple of things, right?
(04:28):
There are two ways of defining what kind of agent solutions you would build: first, based on your use cases, and second, based on the agentic AI frameworks that are available.
By frameworks I mean: is it going to be a simple ReAct model, where it's reasoning and doing some action for you, a one-to-one output? Or is it going to be more of a reflection-based model, where you're also auto-optimizing
(04:49):
on the spot, whatever output you are producing, based on some business rules or some evaluation criteria that you already have from the business side.
The second part is, I think we also have to figure out where that high-ROI agentic pattern is, like, what are the use cases?
Some use cases, especially in finance, I still believe we are not at that space
(05:13):
where agents can really work, and this is my personal opinion, based on what I'm seeing in the industry overall and the kinds of data that I'm handling.
For the highest-autonomy use cases or patterns, we are not ready for that.
I don't think we will ever be ready, because we are also very highly governed, and that's for the good.
(05:33):
I feel like there has to be some layer of credibility, responsibility, and accountability.
The entirely autonomous pipeline, I don't think we are going to go there.
So how do you find a real balance, and find some use cases that are really giving you a lot of time savings, where your analysts are freeing up their time to do some more high-value analysis versus, I don't know, going
(05:56):
and reading a 500-page PDF and spending three hours on it?
If instead you can build a system or an agentic AI pipeline where you are processing the data the right way, and you have an interface for them to really query, converse, and find the exact pieces of information they're looking for, that's much higher value.
(06:17):
So I would say, yeah, use case driven.
Some use cases that definitely come to mind: report generation, having some kind of voice and templatized models ready, and then utilizing some of these agentic AI frameworks to see how you really hone in on the optimization side and output something that
(06:37):
really aligns with what the business wants.
Again, at the end of the day, whatever we are building, or whatever anybody should build, should align with what the business requirements are, and not be just for the sake of it.
So you mentioned two different types of agentic systems, right?
One is more like a workflow-like system, where maybe some determinism is baked into it.
And then there's a fully autonomous agent that uses reflection and so on.
(07:00):
Can you go a little deeper into these two systems, where you have found one better than the other, and maybe give a little bit of internal insight?
Absolutely.
Again, without going too deep or technical, or, you know, revealing a lot of the IP.
I think one of the biggest areas where I feel a little process-oriented
(07:21):
workflow works is, think about some RPA processes, right?
Or maybe software development processes where you were calling three different APIs, trying to gather the data, doing some processing, and then outputting it in some certain way.
We have all had those simpler tasks or simpler processes, right?
Those processes, I do not think, really need a lot of the variability that the
(07:48):
generative AI models usually add.
And it adds a lot of complexity, right?
Like, why would I want to complicate the clean workflow that I had? But I still want to make it better.
I still want to give it the option to call additional tools, or additional memory just in case, so that it remembers the prior conversations, or whatever that memory is.
Right?
So maybe that's a very good use case where you're following a workflow,
(08:11):
but you're also giving a little bit more flexibility in terms of tool access, access to memory, access to different databases, and then giving the user a better end-to-end experience of working through that workflow.
It's a new workflow for the user, but still with, I would say, augmented features and capabilities to really make their life better first.
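The first pattern, a deterministic workflow with optional tool and memory access, could be sketched roughly as follows. This is a toy illustration, not a real framework; every function, tool, and field name here is made up:

```python
# Sketch of a fixed, deterministic workflow whose steps may optionally use
# tools (an API stub, a memory store). The routing never changes; only the
# tool access adds flexibility. All names are illustrative assumptions.

def fetch_data(ctx, tools):
    # Deterministic step: gather inputs via a (stubbed) API call.
    ctx["records"] = tools["api"]("accounts")
    return ctx

def enrich(ctx, tools):
    # Optional tool use: consult memory only if prior context exists.
    prior = tools["memory"].get(ctx.get("user_id"))
    ctx["enriched"] = [r + ("!" if prior else "") for r in ctx["records"]]
    return ctx

def render(ctx, tools):
    # Deterministic step: format the final output.
    ctx["output"] = ", ".join(ctx["enriched"])
    return ctx

def run_workflow(user_id, tools):
    # The step order is fixed; there is no model-driven routing here.
    ctx = {"user_id": user_id}
    for step in (fetch_data, enrich, render):
        ctx = step(ctx, tools)
    return ctx["output"]

tools = {
    "api": lambda name: ["acct-1", "acct-2"],       # stand-in for an API call
    "memory": {"u42": "prior conversation found"},  # stand-in for a memory store
}
```

The point of the sketch: the pipeline stays auditable because its shape is fixed, while memory and tool access provide the "augmented" experience described above.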
(08:34):
And the second one, of course, like we said, is a little bit more where you are really trying to give autonomy to the model, or to the agentic system.
Say you're creating a batch of three agents, and you want them to coordinate with each other and see what the output of the previous agent was.
Say one was an evaluation agent and you have a reflection agent: the
(08:57):
reflection agent is supposed to take the output of the evaluation agent, check it against all the business rules you previously inputted, and see, does this make sense?
Are we abiding by the rules?
Did the output really follow all the processes? Score it on that, and then feed it back again to make sure you're auto-optimizing the output.
So here you have a little bit more autonomy, but again, that autonomy is
(09:20):
driven by the set of rules that came from the business, and you are not just evaluating blindly, if that makes sense.
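The evaluate-then-reflect loop described here might look something like the sketch below, with the LLM calls stubbed out as plain functions. The business rules, scoring scheme, and revision logic are purely illustrative assumptions:

```python
# Minimal sketch of an evaluation agent feeding a reflection agent.
# In practice each agent would call a model; here everything is stubbed.

BUSINESS_RULES = [
    ("no_placeholder", lambda text: "TBD" not in text),   # illustrative rule
    ("has_citation",   lambda text: "[source]" in text),  # illustrative rule
]

def evaluation_agent(draft):
    # Score the draft against each business rule: one point per rule passed.
    results = {name: check(draft) for name, check in BUSINESS_RULES}
    return {
        "score": sum(results.values()),
        "failures": [name for name, ok in results.items() if not ok],
    }

def reflection_agent(draft, report):
    # Revise the draft based on the evaluator's failures (stubbed "fixes";
    # a real system would ask an LLM to rewrite, guided by the rules).
    if "no_placeholder" in report["failures"]:
        draft = draft.replace("TBD", "Q3 revenue")
    if "has_citation" in report["failures"]:
        draft += " [source]"
    return draft

def run_loop(draft, max_rounds=3):
    # Bounded autonomy: stop when all rules pass or rounds run out.
    for _ in range(max_rounds):
        report = evaluation_agent(draft)
        if not report["failures"]:
            break
        draft = reflection_agent(draft, report)
    return draft, evaluation_agent(draft)["score"]
```

Note the cap on rounds: the "autonomy" is bounded by the business rules and a hard iteration limit, matching the point that the loop should not evaluate blindly.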
Right.
So it seems like in the first case, you are constructing the workflow manually, so the routing is almost deterministic, but you are taking advantage of the model calls and tool calls to augment your existing business process automation.
Yep.
In the second case, it seems like the agent itself orchestrates the workflow,
(09:44):
and it is using reflection and evaluation to self-correct, right?
Yeah.
I mean, it's planning a little bit by itself and seeing, at this stage, do I still need to optimize, or am I good to go and just end the process and give the user the output they're waiting for? So that decisioning
(10:04):
is still happening on the agent side, and that's why I say maybe it's a little bit more autonomous. But again, everything is checked end-to-end.
So it's not really autonomous, it's somewhere in the middle.
Somewhere up there.
Yeah.
So in the second case, in the autonomous agents, this concept of reflection is quite interesting, right?
Because in traditional software you don't have this. Could
(10:25):
you elaborate on what reflection is and how it helps autonomous agents?
Absolutely.
So I forget where I read it, I was reading some blog and I can share the link later, about how software is changing, right?
Like how the software development lifecycle as we know it is changing a bit.
(10:46):
Because, like you said, we were not using LLMs prior to this.
LLMs are more non-deterministic.
So what happens when you bring non-determinism into a deterministic workflow pipeline?
How many checkpoints do you need to measure?
What do you need to measure at each checkpoint?
And also, don't over-engineer it, otherwise it's going to be a too-complicated system with five different agents where
(11:07):
you really don't need that many.
Maybe it can just be one trigger agent, and the rest of the workflow remains as-is with the additional capability of tools, like I was saying.
So when you use something like reflection, it's mostly for the LLM to really critique itself.
So the underlying concept is still LLM-as-a-judge. But again, being responsible builders, as I consider myself to be, I really don't want
(11:30):
to give all the autonomy or the decision making to my reflection agent itself.
Because I think there were some studies also where we saw, and I know there are too many papers going around, right, I completely missed the headline, but it basically talked about how
(11:52):
an LLM is a little bit biased towards output from another LLM, and it can figure that out and then say, oh, this is better than something very similar but written by a human, maybe in a different way.
So how do we take all of these into consideration when we are building that system?
So we have smaller checkpoints. Also, we have very specific business
(12:14):
guidelines that the LLM is using, from the historical data as well as the current workflow data that the reflection agent is evaluating, alongside the output from the previous eval component that we were talking about.
Right?
Yep.
So again, I don't think there is
(12:35):
a right way to do it.
It's a little bit of trial and error to see what works with the use case you're handling.
All with the mindset of really making sure that you are building a responsible system that should not be biased towards any particular output, and that is following the end-to-end process workflow the best way it can.
(12:55):
So it seems like self-reflection sometimes could be biased.
And so a third-party evaluation, a third-party reflection, could be interesting, where it can encode all the business rules and, sort of, judge the system, right?
So that brings us down to: how do we test these agentic systems?
Because in traditional software you have deterministic inputs and outputs.
(13:16):
You can test for this; you can write all your tests, test-driven development.
But as you said, agentic systems are inherently non-deterministic.
Mm-hmm.
Have you thought about approaches to testing and validating these systems?
Yeah, so I think it has to be done in stages.
And since we are, again, like I said, highly governed, metrics and
(13:39):
observability are a big thing that we do internally with a lot of priority and a lot of focus, and that is something I really appreciate. But how do you really get started to make sure your eval components will be ready?
You have to pay what I call the data prep tax. I call it a tax because it's a lot of data preparation
(14:00):
work, making data, like they say, AI-ready. It's everywhere, right?
Everybody's talking about how you really make your data AI-ready.
That means using that data for building your systems, as well as keeping that data ready to evaluate your systems.
So there's no magic pill here.
I think the data prep tax, as we like to call it,
(14:22):
really needs to be the focus for any business. And I think it's even more important for bigger, legacy systems where data is really unstructured, or scattered, meaning it's in very different places.
You need to bring it together, make sure the pieces align, and make sure they have some kind of lineage as well as versioning.
And that is how you really track.
(14:44):
I was speaking to somebody in our industry about this concept of versioning: how do you really version prompts, as well as how do you really version evals?
So for eval outputs, treat them like, say, an API.
(15:05):
When you use an API, you version it. You make sure that with every upgraded version you have a different set of test cases ready.
Similarly, anything that goes into your system, maybe the prompts, the data, the system prompts, the business rules,
the voice or style prompts, all of that needs to be versioned. Then you track how the output is changing depending on the model, depending on whether you add an external tool or an
(15:27):
external reflection step. And that has to be an end-to-end view for you, or the developer, and the business leaders.
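One possible shape for that kind of versioning, treating each prompt like a versioned API that carries its own test cases, is sketched below. The registry design, field names, and digest scheme are assumptions for illustration, not a standard:

```python
# Sketch of a prompt registry where every change creates a new immutable
# version, paired with the test cases that version must pass. The digest
# makes silent edits detectable.

import hashlib

class PromptRegistry:
    def __init__(self):
        self._versions = {}  # name -> list of {"text", "digest", "test_cases"}

    def register(self, name, text, test_cases):
        # Version id is the position in history (1-based, like API v1, v2...).
        digest = hashlib.sha256(text.encode()).hexdigest()[:12]
        entry = {"text": text, "digest": digest, "test_cases": list(test_cases)}
        self._versions.setdefault(name, []).append(entry)
        return len(self._versions[name])

    def get(self, name, version=None):
        # Latest version by default; any historical version on request.
        history = self._versions[name]
        return history[-1 if version is None else version - 1]

reg = PromptRegistry()
v1 = reg.register("summarizer", "Summarize in 3 bullets.", ["doc_a", "doc_b"])
v2 = reg.register("summarizer", "Summarize in 3 bullets. Cite sources.",
                  ["doc_a", "doc_b", "doc_with_citations"])
```

The same pattern extends naturally to system prompts, style prompts, and business rules: each gets a name, a history, and an expanding test set, which gives the end-to-end view described above.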
You know, coming from a data science and ML background, we have all seen how to evaluate ML models.
You can create confusion matrices and measure precision, recall, and accuracy scores.
(15:50):
Right.
Now, as we move to the world of agent systems and agentic AI, how does evaluation change?
How do you need to evaluate these agent systems?
Can you give a brief overview of all the types of metrics and things that one needs to measure?
Yeah, absolutely.
So again, I would like to break it down into four or five different components, because I don't think there's one magic evaluation
(16:14):
metric that you need to track. There are a couple, right?
Like I said, traceability.
Do you really have enough logging metrics for all your calls, all your tool calls? Were the outputs of the tool calls really correct?
Is there a way you can compare the links?
Say you're running a deep-research agent in one of your steps, and it's trying to find links from the web and suggested links to the user. Is
(16:37):
there a way for the system to evaluate whether the links are correct as well as highly rated, and not, you know, garbage or low-quality links that came up in the search?
So that's for traceability and logging.
Second is: how are you really using the models?
Because no matter how cheap models get, at the end of
(16:59):
the day, the cost starts compounding.
When you are building a complicated agentic system, it's not about only one model call.
It's about multiple model calls in different layers, and sometimes at the same time.
Because if you're initiating, say, three different agents, each calling two different tools, you are routing all of that through some models to really get the output.
(17:20):
So how do you track that?
Track the token usage and response times.
What were the error rates?
How many times did your process fail because of a model issue, or because it failed to generate a response at all?
Both are important, right?
Otherwise you are not setting up your system to be a really good user experience for your customers, or whoever you are opening it up to.
(17:44):
And there are tons of others, like how do we do drift detection?
All of that still applies, but there are many other components that are now equally important, especially on the infrastructure maintainability and infrastructure observability side.
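The per-call bookkeeping described here, token usage, response time, and error rate per model, could be wrapped around every model call along these lines. The call interface, the whitespace-split "token" count, and the model names are simplified stand-ins:

```python
# Sketch of a tracking wrapper for model calls: counts calls, errors,
# tokens (crudely, by word count), and wall-clock time per model name.

import time
from collections import defaultdict

stats = defaultdict(lambda: {"calls": 0, "errors": 0, "tokens": 0, "seconds": 0.0})

def tracked_call(model_name, fn, *args):
    s = stats[model_name]
    s["calls"] += 1
    start = time.perf_counter()
    try:
        text = fn(*args)                  # stand-in for the real model call
        s["tokens"] += len(text.split())  # crude token proxy for the sketch
        return text
    except Exception:
        s["errors"] += 1                  # the model failed to return a response
        raise
    finally:
        s["seconds"] += time.perf_counter() - start

def error_rate(model_name):
    s = stats[model_name]
    return s["errors"] / s["calls"] if s["calls"] else 0.0

def failing_model(prompt):
    raise RuntimeError("model unavailable")  # simulate a failed call

# One successful call and one failure, both recorded against the same model.
tracked_call("demo-model", lambda q: "a short reply here", "hi")
try:
    tracked_call("demo-model", failing_model, "hi")
except RuntimeError:
    pass  # the error was already recorded before re-raising
```

Because the counters compound per model name, the same wrapper surfaces exactly the compounding-cost effect mentioned above when several agents each make several calls.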
Awesome.
And so when you diagnose these failures, what do
(18:05):
you need to trace through to debug a system failure, so that you can understand whether things are happening because of tool calls or model issues?
Yeah, I don't think I have the definitive answer to it; it's definitely a work in progress. But what I have seen working is having checkpoints at every point.
Like I said, if you're building one agent and that agent has five different
(18:28):
steps it's supposed to take, after each stage there has to be some component of logging: okay, this was my input, this was my output, this is what I called, and this was the response.
Sometimes it might get to be too much information to track, but it's worth the work, at least initially when you're setting up for your use case.
(18:49):
As you become mature and you come to understand a little bit of the workflow that you're building for, you can tone it down a little bit.
But at the end of the day, nobody ever said too much data is harmful.
Everybody said scarcity of data is harmful.
So logging a lot is my way to go.
I log as much as possible, even sometimes when it's not required.
(19:10):
But you never know when you can find some gold mine of an idea, or some kind of thing you're not thinking through, in the data you log.
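A minimal version of that per-step checkpoint logging might look like the sketch below, with the retrieval and generation steps stubbed out. The step names, tool names, and record fields are all illustrative:

```python
# Sketch of checkpoint logging: after every step an agent takes, record
# the input, the output, and what was called, so the logic chain can be
# replayed later for root-cause analysis.

import json

LOG = []

def checkpoint(step_name, called, step_input, step_output):
    # Append a structured record; in production this would go to a log store.
    LOG.append({
        "step": step_name,
        "called": called,     # tool or model invoked at this step
        "input": step_input,
        "output": step_output,
    })

def run_agent(query):
    docs = ["doc1", "doc2"]                   # stub for a retrieval tool call
    checkpoint("retrieve", "vector_db.search", query, docs)
    answer = f"answer from {len(docs)} docs"  # stub for a model call
    checkpoint("generate", "llm.complete", docs, answer)
    return answer

def trace(log):
    # Render the logic chain as one line per step, for debugging.
    return [json.dumps(record, sort_keys=True) for record in log]
```

Toning the logging down later, as described above, then just means dropping fields or steps from `checkpoint`, not re-instrumenting the agent.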
Got it.
And where does observability fit in here?
Because traditionally, observability deals with more deterministic software, where you are measuring
(19:30):
reliability, latency, throughput. Yeah.
You know, server utilization, right?
What do you think about observability in the context of agentic AI?
Yeah, I think that definition still holds true.
I will not take it away at all.
But there are additional angles to how you would define observability.
(19:52):
One thing that I really feel excited about, at least in the finance industry as far as I've seen, is that observability is no longer an afterthought, like it used to be in most of the initial ML spaces that I worked in.
There, you first built the model, you had some pipeline, and then you started measuring and seeing what metrics you wanted to measure.
That's not how it really works, and that's not how you should do it.
(20:15):
You start with the metrics.
You start with the steps of where you are logging. And like you said, all the things you mentioned are still relevant, but there's also: how do you really bring in, effectively, a human in the loop from multiple angles to observe a pattern of output?
Make sure that pattern makes sense, because if you really want to scale a system,
(20:35):
there's no way you can scale it with, say, two or three humans in the loop reviewing everything, say thousands of documents' worth of extracted data.
But there will be patterns when you analyze the extracted data. Just an example, right?
Are there specific indicators that seem to have failed multiple times for the same type of documents?
(20:56):
That's the pattern you're looking for, so that you know exactly whether it's your model that's failing, or maybe the kind of extraction you were doing, just an example workflow, that's failing.
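That kind of pattern-finding over extraction failures can be sketched as a simple aggregation. The `indicator` and `doc_type` fields, the log shape, and the threshold are hypothetical examples:

```python
# Sketch: instead of humans reviewing every document, aggregate extraction
# failures by (indicator, document type) and surface recurring combinations
# for targeted human-in-the-loop review.

from collections import Counter

def failure_patterns(extraction_log, min_count=2):
    # Count failures per (indicator, doc_type); keep only repeat offenders.
    counts = Counter(
        (record["indicator"], record["doc_type"])
        for record in extraction_log if not record["ok"]
    )
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Illustrative log: two repeated failures on the same indicator/doc type,
# plus one isolated failure that should not be flagged by default.
log = [
    {"indicator": "net_leverage", "doc_type": "10-K", "ok": False},
    {"indicator": "net_leverage", "doc_type": "10-K", "ok": False},
    {"indicator": "ebitda",       "doc_type": "10-K", "ok": True},
    {"indicator": "ebitda",       "doc_type": "prospectus", "ok": False},
]
```

Only the recurring pair reaches a reviewer, which is what makes a small review team scale to thousands of documents.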
I can also share a learning that I had recently: some of the huge financial documents you will see are
(21:18):
highly text-driven, but they also have infographics, and tables that look like tables but are actually images.
So how do you really bring together the insights from a text-based extraction, from a table, and from an infographic, and make sure they all align?
And make sure the summary of all that extraction really makes sense
(21:39):
and tells the story correctly, and no specific data is getting lost; you're not losing anything in translation.
So yeah, at every stage you are bringing in the observability factor, be it the human in the loop finding the pattern, or making sure the end-to-end story, or
(22:01):
the solution that you're building, really aligns with the expected output.
And then you build on top of it slowly to really add more color to it. But start simple.
Yeah.
Yeah, makes sense.
Especially as you described this agentic workflow, which mixes structured and unstructured data, right?
Mm-hmm.
And when you are in a financial domain, you
(22:24):
cannot hallucinate on numbers.
You cannot pad an extra zero onto a mortgage rate; there's zero
tolerance there.
How do you deal with it?
Because this is actually a big problem that a lot of us are facing, right?
There's this non-deterministic beast, and you're trying to control it and put a deterministic layer on top of it.
(22:46):
How do we make sure that hallucinations and these things are monitored and observed?
Yeah.
So that's why I said, right, at least for the financial use cases, I do not think the current large language models, the way they are built and how they output, we all know they are predictive, token-based.
(23:08):
They're not really understanding the quantitative numbers, and that's why we've seen, I'm sure you have seen, those funny LinkedIn posts asking whether 9.9 is bigger or 9.11 is bigger, and then you see all the wrong answers.
That's why it's happening, right?
But I think the models are getting better and better as we are making a little bit of infrastructure change, and this is where the business
(23:30):
way of building a solution comes into play.
This is your system design.
It's not necessary that you really apply LLMs for all your steps.
Yes, of course, for things that are highly time-consuming, like really extracting the data, you do a first pass with the LLMs.
But you have your own predictive models that I'm sure all companies have built, especially if they're a legacy company in the space that
(23:53):
was doing this work before the LLMs came onto the scene, right?
So why are we moving completely out of that space and not building a hybrid model that takes advantage of what we have been doing for so long?
Use that as our learning curve.
Use that as, I would say, training material, or,
(24:16):
maybe if I can say it this way, a few-shot learning methodology from what we were doing with our predictive models.
Yeah.
And help leverage and ground our LLM outputs.
Again, there is no one easy way of doing it.
It's a lot of trial and error: figuring out where our systems are failing, which indicators are too complicated for the system to handle, and then bringing in
(24:38):
either SMEs or our previous generation of models to help guide us through that.
So yeah.
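The grounding idea, checking an LLM-extracted number against a classical model's expectation before trusting it, might be sketched like this. The tolerance band, the field names, and the baseline values are illustrative assumptions:

```python
# Sketch of hybrid grounding: an LLM-extracted number is accepted only if it
# falls within a tolerance band around what a classical predictive model (or
# historical baseline) expects; disagreements are routed to human review.

def grounded_check(llm_value, predictive_value, rel_tolerance=0.10):
    # True if the LLM number is within 10% (by default) of the baseline.
    if predictive_value == 0:
        return llm_value == 0
    return abs(llm_value - predictive_value) / abs(predictive_value) <= rel_tolerance

def review_queue(extractions, baseline):
    # Route only the disagreements to SMEs, instead of reviewing everything.
    return [key for key, value in extractions.items()
            if not grounded_check(value, baseline[key])]

# Illustrative values: the default rate looks like a digit-transposition
# error (0.21 vs 0.02), exactly the "extra zero" class of mistake.
llm_out  = {"mortgage_rate": 6.5, "default_rate": 0.21}
baseline = {"mortgage_rate": 6.4, "default_rate": 0.02}  # classical model outputs
```

The predictive model never writes the answer; it only gates which LLM outputs are trusted automatically, which matches the hybrid framing above.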
In other words, you're saying the classical machine learning models still need to exist, and you layer an agentic skin on them and try to ground your generative AI outputs on what the predictive models are suggesting.
There's a lot of value in the sound predictive models that I've seen
(25:01):
the financial industry usually work with.
There's a lot of work going on, and especially now with the power of LLMs, knowledge graphs have come back into the space, right?
They're getting a lot more traction, because now it's easy to stand one up and maintain it with the help of the LLMs, compared to before. There's also a lot of work happening on the causal AI side, which is
(25:22):
a domain I'm highly interested in, and I'm trying to find and do as much work there as possible: how do you really ground your non-deterministic outputs from LLMs with the causal analysis that I am sure you have already done, or your econometrics team or your statistics team has been doing prior to this?
And then utilize that to gauge your level of correctness.
(25:47):
Correctness may not be the accurate word, but: how correct are you, and where are the areas with most of the gaps?
Absolutely.
So there's a question from the audience I'd like to take at this point.
Mm-hmm.
Someone is asking: our agent system works great in dev mode, but we keep hearing production is different.
What specific failure modes should we be preparing for, that, you
(26:10):
know, our testing isn't catching?
Okay, could you repeat that?
Basically, it's the agent system: you evaluate it, it works fine on your test cases, but when you get into production,
then you run into all kinds of noisy inputs, and it seems to be having reliability issues.
Absolutely.
So this happens, right?
Either we are not working, we meaning the engineers and the
(26:35):
technologists, or the developers are not working closely enough with the business to really understand their edge cases, or what their clients are looking for, what kinds of questions or datasets might come in that we have not tested in the system.
That's number one.
I have seen this multiple times in my prior organizations as well.
And I am guilty of it too.
Like I said in the beginning: we really start building the moment we see a new
(26:57):
architecture or a new process, and say, okay, look, I have 25 tested datasets, or even a hundred.
It works absolutely amazingly.
Let's put it into prod.
But before that, did you put it into QA, open it up to your beta testers?
Did you get a handful of sample people to really push it to its edge and figure out where it's breaking?
(27:18):
With all those observability checkpoints that we talked about, that is how you really catch what your product is missing, what maybe your PM hasn't thought about before, or what your developers haven't considered in the edge cases, and you really refine that with your beta tests.
I'm a hundred percent sure, even if you're going down the
(27:38):
predictive analytics route, there is always something that you miss in production, and this is how you really fine-tune, or iteratively tune, your application, whatever application you're building.
So yeah, never release straight into prod.
Make sure you're releasing in QA and dev, and open it up to beta testers.
Ask them to test it and push it.
Push the application to its max, and that's how you really get to know
(28:02):
your application more, and get your stakeholders' buy-in to really support you in that.
Yeah.
So that brings us to a meta point, right?
Where do you see the biggest gap today between current agent system capabilities, as you build on top of these orchestration frameworks out there, and what enterprises actually need for reliable production
(28:25):
deployment and maintenance?
Yeah.
So, multifold, right?
Like I definitely feel, like I said, the frameworks are changing every day.
So having one framework, or access to one model, shouldn't be your moat.
This is the word everybody keeps throwing around: the models
(28:45):
and systems are not your moat.
The application you are building based on your business's input, the ROI you're trying to get, the problem that you're trying to solve, the metrics that you have designed, and the process workflow that you have: all of that comes together. And this ties back to that
(29:06):
great article you shared, Krishna, about compounding AI systems.
Like I said, it's no longer the model's responsibility to get you the right answer, and you cannot blame, oh, the LLMs are hallucinating.
Of course they're hallucinating, because they are built on the entire internet, and they don't know the specific requirements of your task if you do not give them the right directions:
(29:28):
the config files full of system prompts with very specific, nuanced guidelines; a structured workflow to follow, a structured workflow meaning a set of agents, if you're building an agent system, each with a defined scope so that they don't have the tendency to go out of the box.
(29:49):
One funny thing I actually recently readis that somebody asked an agent that
help me with this, and then it startedgarling out like thousands of lines of
code because that's the way the agentwas trying to help the user, right?
But if you tell me that, help me do 1,2, 3, by using 5, 6, 7, these tools,
that is how you really build up.
More aware and context aware, like this,a context aware system that you have a
(30:14):
better potential of evaluating, handling, and observing, rather than just saying, hey, help me solve this problem, and letting it help however it can.
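The scoping idea described here can be sketched in a few lines of plain Python. To be clear, the agent class, tool names, and dispatch logic below are illustrative assumptions, not taken from any particular framework:

```python
# Hypothetical sketch: scope an agent to an explicit, named tool set,
# rather than an open-ended "help me with this" request.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ScopedAgent:
    name: str
    instructions: str                         # narrow, task-specific guidance
    tools: dict[str, Callable] = field(default_factory=dict)

    def call_tool(self, tool_name: str, *args):
        # Refuse anything outside the declared scope instead of improvising.
        if tool_name not in self.tools:
            raise PermissionError(f"{self.name} is not scoped to use '{tool_name}'")
        return self.tools[tool_name](*args)

# Example: a summarizer that may only fetch a ticket and summarize it.
agent = ScopedAgent(
    name="ticket_summarizer",
    instructions="Summarize the ticket in 2 sentences. Do not write code.",
    tools={
        "fetch_ticket": lambda ticket_id: f"ticket {ticket_id}: printer on fire",
        "summarize": lambda text: text.split(":", 1)[1].strip(),
    },
)

ticket = agent.call_tool("fetch_ticket", 42)
print(agent.call_tool("summarize", ticket))   # a scoped, predictable step
```

The point is simply that a tool call outside the declared scope fails loudly, instead of the agent "helping" in ways nobody asked for.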
Absolutely.
Now, now we have all these agentic frameworks being catered to business users, right? Where you can go and ask exactly those open-ended questions and, you know, shoot yourself in the foot.
(30:34):
Yeah.
Yeah.
Awesome.
So let's take another audience question here. Uh, someone was asking: our agent systems generate tons of logs, reasoning, API calls, memory access, planning steps. When something fails, how do you trace through all of that to follow the logic chain? How do you do root cause analysis?
Yeah, yeah, absolutely.
(30:54):
And they can return that, if you write the systems the right way. From some of my, uh, reflection agents, I have returned the context that went into the final reflection chain. I've returned the chain of thought, sometimes a chain of draft. A chain of draft is nothing but a simplified chain of thought that uses fewer tokens, so that you're not spending too much on, you
(31:16):
know, on those processes. Then it also tracks exact match points, depending on whether you have specific metrics; there's contextual relevancy and all that industry-standard stuff. Uh, and then you can also add custom metrics.
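As a rough sketch of the tracing idea, a reflection chain can return its own trace, inputs, intermediate drafts, and final answer, instead of just the answer, so root-cause analysis can follow the logic later. The step functions below are stand-ins for real LLM calls, and the structure is a hypothetical example, not any framework's API:

```python
# Illustrative sketch: return the full trace of a reflection chain,
# not just the final answer, so failures can be traced step by step.

def reflection_chain(question: str) -> dict:
    trace = {"question": question, "steps": []}

    def run(step_name, fn, payload):
        # Record every step's input and output in the trace.
        out = fn(payload)
        trace["steps"].append({"step": step_name, "input": payload, "output": out})
        return out

    draft = run("draft", lambda q: f"draft answer to: {q}", question)
    critique = run("critique", lambda d: f"critique of ({d})", draft)
    final = run("revise", lambda c: "final answer", critique)

    trace["answer"] = final
    return trace   # caller logs the whole trace, not just trace["answer"]

result = reflection_chain("why did the deploy fail?")
print(result["answer"], len(result["steps"]))  # "final answer" plus 3 logged steps
```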
So I, I feel like this is where yourcreativity should come in as a developer.
(31:38):
That's what you really need, and that's why I said start building slowly: have three, four basic components, see where you're missing, go back, refine your evaluation criteria, and then make sure that you're continuously observing, or maybe say continuously monitoring.
And that's how you really build a good-quality, iteratively developed solution,
(31:59):
rather than trying to build it all at once without, like, uh, much thought put into it.
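One possible shape for such a custom metric, assuming you log each response alongside the context it was grounded on. The keyword-overlap score below is a toy stand-in for whatever relevancy check a real use case actually needs:

```python
# Hypothetical custom eval: flag responses that drift away from the
# context they were supposed to be grounded on.

def context_overlap(response: str, context: str) -> float:
    """Fraction of response words that also appear in the retrieved context."""
    resp_words = set(response.lower().split())
    ctx_words = set(context.lower().split())
    return len(resp_words & ctx_words) / max(len(resp_words), 1)

def evaluate(records: list[dict], threshold: float = 0.5) -> list[dict]:
    # Return the records whose responses fall below the relevancy threshold.
    return [r for r in records if context_overlap(r["response"], r["context"]) < threshold]

flagged = evaluate([
    {"response": "the outage began friday", "context": "the outage began friday at 9pm"},
    {"response": "pizza is great", "context": "the outage began friday at 9pm"},
])
print(len(flagged))  # → 1, only the off-context response is flagged
```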
Yeah, that actually brings up a question. You mentioned custom metrics, right? So, like, there is no out-of-the-box way to measure these things, because every application that you're building is slightly different in its use case, and so you may
(32:23):
have to write tests or custom evals to measure those things, right?
Mm-hmm.
So what would that development process look like? For example, let's say you're trying to build a customer support agent that's making a call to a
Mm-hmm.
ticketing platform and summarizing the activity.
Mm-hmm.
What are some of the basic things a developer needs to do to just
(32:45):
get that going in the first place?
Yeah.
So yeah, like I said, if you're using an agent framework and you're using all the components, tool calling, memory, make sure you're logging each of those components. That's baseline. That's industry specific, or, by industry specific I mean industry standard: it has nothing to do with your specific use case.
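That baseline logging can be sketched with a simple decorator; the tool and field names below are illustrative, not from any specific framework:

```python
# Minimal sketch: wrap every tool so each call's name, arguments,
# and latency get recorded, whatever framework sits on top.

import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.tools")

def logged_tool(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Failures are logged with full context before re-raising.
            log.exception("tool=%s args=%r FAILED", fn.__name__, args)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("tool=%s args=%r took=%.1fms", fn.__name__, args, elapsed_ms)
    return wrapper

@logged_tool
def fetch_ticket(ticket_id: int) -> dict:
    # Stand-in for a real ticketing-platform call.
    return {"id": ticket_id, "status": "open"}

print(fetch_ticket(7)["status"])  # → open, with the call logged as a side effect
```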
When you start building your use case with, say, that customer
(33:08):
support agent you mentioned, I'm sure you're working with somebody from marketing or sales, or whichever department it is. The partnership between business-side stakeholders and developers has never been this important.
Mm-hmm.
Uh, now that we are really going into this open sea of non-determinism, there could be so much
(33:31):
negative or noisy feedback that, for me as a developer without much business context, I can maybe weed out the junk on an initial pass, but what could be junk for me may not be for the business side. They could have some internal insights that I'm missing. So working in partnership with the business is your best bet, right from the start.
(33:55):
And that's why I feel everybody I've spoken to in the finance industry who is trying to build these AI systems first talks about how we have to get our stakeholders' buy-in. And your stakeholder is your business counterpart. So they need to believe in the vision of what you're trying to build and how it'll make their life easier. That's how you bring them in.
(34:16):
Describe the entire process. Make them understand what the key risks are, and get their help to really define those key metrics. There's literally no other way you can define these metrics by yourself as a developer, without much business context.
Absolutely.
And so, uh, the question is, you know, the world has moved in
(34:37):
the last five, six years from MLOps to LLMOps to now AgentOps, right?
Yeah.
What are some of the, what are some of the
AgentOps, okay.
Yes.
What are some of the differences or commonalities that you've seen across these worlds? You know, what are some of the learnings that you could carry over from your LLMOps systems to AgentOps now?
(34:59):
And what do we need to do differently?
Yeah.
Yeah, absolutely.
I mean, some concepts still hold exactly the same value. Like, the basic first level is: is your system useful? Is it useful in terms of being accurate, in terms of responding to what the user is asking, and not just giving back
(35:20):
like five different links that the question didn't even ask for.
So yes: relevancy, accuracy, making sure the system is not breaking down and there's no downtime. Those are your baseline metrics, and they roll over for each of the components you just mentioned. Now, on the agent side, there's actually a great paper, and I can share
(35:41):
the link in the chat. I was reading it last night, and I found it really interesting, because it aligns with this.
I'm not sure how I share it in the chat.
Maybe I can quickly
yeah.
Share my,
Yeah, there should be a way, our team can enable you with that.
No worries.
I can quickly share my screen.
(36:01):
Can you see the, can you see like a paper?
Yep.
Okay, so this is actually a great paper called "Why Do Multi-Agent LLM Systems Fail?" And I was just looking through some, uh, failure categories it has identified here. If you see, the biggest ones are around system design and inter-agent coordination.
(36:22):
The hardest part is, how do you really make sure one agent is working with the other agent the right way, as well as task verification. So yes: premature termination, making sure there is no incomplete verification or incorrect verification. These are very new terminologies, or new ways of thinking, for your system.
But, but there's no way you can really avoid it, because if you're building
(36:46):
a multi-agent system with different orchestration processes, these have to be taken care of in one of those chains of logging that you're doing, and thought through.
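One way to picture those verification checkpoints: verify an agent's output explicitly before handing it to the next agent, instead of trusting the handoff. The required-fields rule below is a deliberately simple, hypothetical stand-in for a real verifier:

```python
# Illustrative sketch: catch incomplete handoffs (one of the failure
# modes above) before the next agent ever runs.

def verify_handoff(output: dict, required_fields: set[str]) -> dict:
    missing = required_fields - output.keys()
    if missing:
        # Incomplete verification caught at the boundary between agents.
        raise ValueError(f"handoff rejected, missing fields: {sorted(missing)}")
    return output

research_output = {"summary": "rates held steady", "sources": ["fed minutes"]}
verified = verify_handoff(research_output, {"summary", "sources"})
print(verified["summary"])  # → rates held steady
```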
So yeah.
So it's a very new, I would say newer, domain: how do we really think about these systems? But the baseline still remains the same, right?
(37:07):
Like, is it accurate?
Is it helping your end business user? Is it giving you the right kind of answer that you're looking for?
Absolutely.
So that just tells you that there's no black or white here; there's actually a big gray zone, because that incompleteness gets you into this big gray zone of how to evaluate these systems.
That's amazing. That's really cool.
(37:29):
So I guess, you know, from, let's say, an SDLC-of-agent-development perspective, could you give a simple recipe for a new AI team starting on this journey? Maybe five, six steps: just follow this recipe to deploy your first few agentic apps. What would that be?
Very interesting.
(37:50):
Again, that recipe can be made in so many different ways, but if you ask for, say, the top three biggest components: the first component is, what is the expected output? What is the user problem you're solving, and what is the expected output? From there, you kind of work back to see what you have in your system so far. Is there a data gap?
(38:11):
Uh, if there is a data gap, how do you fill that data gap? If there is a process gap, I think now, with all the frameworks, it's easy to fill in the process gap with open source. If you want to build a workflow and chain a couple of different tools and models, you can easily do it with something open source like LangGraph.
(38:31):
But that's not the point. When you are putting it all together, do you know exactly where to put your checkpoints, and who your SMEs are? Again, all the common themes that I highlighted come back here as your recipe: you need all those people on your team to have your back. Otherwise, if you're building in a silo, I can guarantee you the
(38:54):
product is never going to take off. And there you'll see no adoption, because everybody will be like, hey, did you check in with me? This is not what I really wanted.
So do your market-fit analysis, like they say, really well. Make sure that you're really solving a problem that is a bottleneck, and not something that you just want to build for the sake of building.
(39:14):
Defining the problem, uh, is the most important part, right? Define the problem and the metrics up front, so that you can map back to them at the end of the day, and then you can keep iterating on it and make it more, I would say, defined and colorful.
Right, right.
And then include all the stakeholders. That's awesome.
(39:34):
So I guess, uh, you know, let's actually take a few audience questions. There's one interesting one: in your experience, have you seen a use case that made you go, this is awful for agents? They're looking for a framework to think about what makes a good versus bad use case.
Very interesting.
(39:54):
So I actually did read a recent paper about how human agents and, you know, gen AI or AI agents will come together to solve different kinds of problems. There's this concept of, if it's low risk, or, rather, the theme is,
(40:18):
if the variance tolerated in the output is really, really low, that's not something you would really want to build as an autonomous system.
By that I mean exactly the same point you made: if you are building a financial analyst agent, and it analyzes a trend pattern and gives
(40:39):
you complete garbage output based on the spreadsheet document that it evaluated, do you really want to build an autonomous agent on top of that? Maybe the autonomous agent helps in the initial stages of data extraction, putting things into some templated format, or in any other predictive process you have in place, to expedite that process. But end to end?
(41:03):
Is that a good use case? Maybe we are not there yet. Maybe in two months or three months or six months, I don't know, maybe we will get there, but not yet.
Mm-hmm.
So you really think it through from the risk side, as well as the fault tolerance of your audience, as
(41:24):
your first grounding level, before deciding what something should do.
That's a good point. You know, variance in terms of being able to be happy with the correct output.
Yeah.
Yeah.
Sure, take a Q&A procedure. If there are a couple of wording changes in a line from my Q&A conversational agent, I really don't mind. But if there is a factual error, or a trend-analysis error where
(41:47):
it's supposed to show high and it's showing low or no change, that's catastrophic in terms of reputation, as well as our entire business line.
Yeah, makes sense.
Uh, another audience question here. Um, our leadership asks: how do we know that agents won't randomly break in front of a client?
(42:08):
You know, basically they're building agents, and we can't give leadership accuracy metrics like we used to. How do you convince stakeholders to trust something unpredictable?
This is, I think, an industry-wide problem. You start by educating them. And with everything that I do outside my day-to-day,
(42:28):
even during my day-to-day work, my entire goal is this: if I'm talking to somebody from the non-technical side, or even from the technical side who is more on the predictive analytics or software development side, how do I add a little bit of extra information to any conversation I'm having, so that they get some new perspective? They don't have to agree with me, but maybe they think,
(42:49):
okay, that's a takeaway for me to go and read up on. And then I share some materials, or I ask them, what do you think about it?
So start having these conversations, so that your leadership, or your stakeholders on the other side, technical or non-technical, don't feel like their opinions don't matter and that you're trying to bulldoze and build on top of their systems.
(43:11):
That's not how we're going to make it productive and effective. It should be more really collaborative: trying to understand where they're coming from, where their fears are. And some of those fears are really legit. You really don't know if you can, uh, stop an agent from crashing during a live demo.
It happens all the time, but it happensin traditional software as well.
I've seen so many software demos where the API call just
(43:34):
fails in the middle, or there's some connection error. So it's part of the tech world. It's part of building smaller components and making sure they're tight, evaluated, and, you know, gauged the right way. So yeah, start with maybe friendly education, upskilling, and awareness kinds of discussions, and see
(43:56):
how that bonding goes. There's really no other way.
Uh, yeah.
So I guess, you know, finally, to end this conversation: as the world is moving from deterministic software into this agentic, non-deterministic software,
Mm-hmm.
how important are the things that we talked about, evaluation, observability,
(44:17):
guardrails? You know, how do you sort of think about it from an AI leadership perspective? Also from a developer perspective?
Yeah, I mean, at least with everything that I build, I try to make sure that I build it in a way that I can prove it. If somebody wants to run that process again, I show
(44:37):
them some lineage of information, even if there is some change in stochasticity, or some change in, uh, the extraction process. But the key goal should be to show that, hey, look, this process works. The work that I used to do in, say, I don't know, just to give an example, five hours, I now do in one or one and a half, and my productivity boost has been insane.
(44:59):
And look, I have three more ideas that I've come up with in that extra time that I have in hand. So evaluation is key to doing all of that: making sure that you are building the right way, thinking in a more modular structure, and making sure that each module, and here a module means, say, your data pipeline is one module.
The output from that should be somewherewhere you're checking the output quality,
(45:21):
drift, direction, and all that stuff.
Then your agent components, or the individual business components that you have, should be another segmented module. So yeah, build it in a way that you can break it down easily, and then see where the gaps are. Don't build a monolithic, legacy-style agent system, end to end, five agents
(45:43):
calling three different agents.
I don't think we really need that.
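The modular structure described here might be sketched like this, with each stage carrying its own quality check so a failure points at a named stage. The stages and checks below are illustrative placeholders, not a prescription:

```python
# Hypothetical sketch: a pipeline of named modules, each with its own
# output check, instead of one monolithic end-to-end agent system.

from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any], Callable[[Any], bool]]

def run_pipeline(stages: list[Stage], payload: Any) -> Any:
    for name, fn, check in stages:
        payload = fn(payload)
        if not check(payload):
            # A gap shows up at a named stage, not somewhere in a black box.
            raise ValueError(f"quality check failed at stage '{name}'")
    return payload

stages: list[Stage] = [
    ("data_pipeline", lambda raw: raw.strip().lower(), lambda out: len(out) > 0),
    ("summarizer", lambda text: text.split()[:3], lambda out: len(out) <= 3),
]

print(run_pipeline(stages, "  Quarterly Revenue Rose 12 Percent  "))
```

An empty or malformed input fails at "data_pipeline" rather than producing silent garbage downstream, which is the whole point of the per-module checks.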
Yeah, sometimes people reallycomplicate stuff just because they
want to build complicated systems.
Sometimes I've seen simple agents with two different calls, with one very specific task they want to fulfill, that can do a lot and save a lot of time. So think about the 80/20 rule. Find the use cases that you want to prioritize, the ones that solve the most
(46:04):
amount of problems, and really start there, and make sure you evaluate as you go.
Awesome.
Thank you so much.
Uh, I think, yeah, that sums it up, you know, uh, thanks for
spending the time with us today.
Uh, we'd love to get that paper that you, you know, shared. Maybe you can share the link with us,
Uh, did I do that?
And then we can share it with our,you know, webinar attendees later.
(46:24):
Um, and, um, yeah, absolutely.
This is an exciting time.
And, and yeah, I think, uh, the paper is there for everybody to see. I think it just went as a post to the panelists, but yeah.
Awesome.
Okay.
Uh, that's it for this week folks.
Uh, thanks for joining us on AI Explained, and, uh, we'll come back with another great edition
(46:47):
with another great speaker.
Until then, you know, see you.
Thank you.