
October 8, 2025 53 mins

Most AI agents are built backwards, starting with models instead of system architecture.

Aishwarya Srinivasan, Head of AI Developer Relations at Fireworks AI, joins host Conor Bronsdon to explain the shift required to build reliable agents: stop treating them as model problems and start architecting them as complete software systems. Benchmarks alone won't save you. 

Aish breaks down the evolution from prompt engineering to context engineering, revealing how production agents demand careful orchestration of multiple models, memory systems, and tool calls. She shares battle-tested insights on evaluation-driven development, the rise of open source models like DeepSeek v3, and practical strategies for managing autonomy with human-in-the-loop systems. The conversation addresses critical production challenges, ranging from LLM-as-judge techniques to navigating compliance in regulated environments.

Connect with Aishwarya Srinivasan:

LinkedIn: https://www.linkedin.com/in/aishwarya-srinivasan/

Instagram: https://www.instagram.com/the.datascience.gal/

Connect with Conor: https://www.linkedin.com/in/conorbronsdon/

00:00 Intro — Welcome to Chain of Thought

00:22 Guest Intro — Aish Srinivasan of Fireworks AI

02:37 The Challenge of Responsible AI

05:44 The Hidden Risks of Reward Hacking

07:22 From Prompt to Context Engineering

10:14 Data Quality and Human Feedback

14:43 Quantifying Trust and Observability

20:27 Evaluation-Driven Development

30:10 Open Source Models vs. Proprietary Systems

34:56 Gaps in the Open-Source AI Stack

38:45 When to Use Different Models

45:36 Governance and Compliance in AI Systems

50:11 The Future of AI Builders

56:00 Closing Thoughts & Follow Aish Online

Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash

Transcript

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
You cannot just look at a model benchmark and feel that, hey,
like this is a good enough model for me to use in my workflow and
then just like tie it together with each other.
That's the worst thing that you can do because none of these
models understand, like I can break any model with a bunch of
quirky twisted prompts. So that's where it's important
to understand that, hey, like how do you architect it as a
software? Welcome back to Chain of

(00:26):
Thought, everyone. I am your host, Conor Bronsdon.
And today we're diving into one of the critical challenges
facing our industry, taking AI agents from promising prototypes
to reliable production systems that operate at scale.
And we have the perfect guide. Joining us, Aish Srinivasan is a
renowned AI expert, head of AI Developer Relations at Fireworks

(00:47):
AI, a powerful voice for building AI the right way.
You also may know Aish from her, what, 600,000 followers on
LinkedIn that she has developed and the incredible content that
she shares there. Aish, welcome to the show.
It is so good to see you.
Thank you so much, Conor. I'm very glad to be here.
Yeah, we've had the opportunity to collaborate a couple of times
already, but we haven't actually sat down for a one-on-one

(01:10):
conversation like this that was recorded.
So I think this will be a fantastic way for us to dive
deeper into some of the things that I know we're really excited
about with LLMs and small language models, evaluation-driven
development, so much more.
But let's start with the bedrock of all of this.
It's one thing to build a really cool agent demo.
We've seen a lot of them, I know you have in particular, but it's

(01:31):
another thing entirely to deploy it to potentially millions of
users responsibly. When we talk about responsible
AI in the context of agents at scale, what are the top risks
that you're seeing? I think when we're going from
building a prototype to building something in production, one of
the biggest differences comes when we are seeing LLMs as tools

(01:57):
versus LLMs as part of tools. What I mean by that is since
maybe like 2022, end of 2022-2023, we have seen a lot of
applications which are so-called wrappers around LLMs.
And the reason it's being called such is at the end of the day,
it is a little bit of engineering, but most of it is

(02:20):
LLM calls. That's what's happening.
We're expecting that this LLM is going to be the magic box where
we're going to ask it any sort of questions, follow-ups and
it's going to get us the end product.
And I think that has quickly changed where people have
realized that, hey, like it's way more of a software
engineering challenge rather than just a model building

(02:43):
challenge. That is one part of it, sure.
That is why like all the frontier model labs are working
towards it. But when it comes to building an
application that uses generative AI as a brain for it, things get
way more complicated. So when people talk about
like responsible AI, right, one of the biggest risks or

(03:05):
challenges that I've seen or like one of the issues that gets
tackled is around how much autonomy is the right autonomy.
Because of the pure non-deterministic nature of the
model, it sometimes is very hard to go through all of the edge
cases and understand where a model might fail.

(03:26):
So people are obviously going with eval data sets trying to
understand that, hey, like this is how it performs on my eval
set. But at the end of the day, the
question is how good is the benchmark, why you've like
chosen that model? How good is the eval set that
you've built? Like it's as good as you think
it is, but there is no like gold standard for you to quantify

(03:49):
how good that is. So at the end of the day, it
only gets as good as you're testing it in your development
environment. But as soon as it goes into
production, there is a lot more variability that it encounters.
And it's very simple, right? Like the exact same way how we
have seen chatbots grow from, hey, select between these four
options, to more language-specific chatbots which use

(04:13):
something like Dialogflow, at the end of the day, to what we
have right now, which is a completely autonomous LLM model
that's able to converse without any boundaries to it, right?
As soon as that happens, it's very hard to keep it under a
specific territory. And that's where it is.
It needs to be very well planned on where you add a human in the

(04:36):
loop. So that's one of the top
challenges, right? Like how much autonomy is too
much autonomy? Which all are the places where
we need to have human in the loop.
And there's no right answer to it.
It all depends about like how your user journey looks like
with that particular toolkit, where are the possible points of
risks and which are the places where it makes most sense for

(04:58):
you to have a human in the loop or have a more deterministic
review or checkpoint at that point.
So that's one of the biggest risks that I've seen.
Now, second thing that I've seen is with these kinds of models,
because even the testing is not very rigorous or it's not very
deterministic, it's not very quantified.

(05:20):
People are still going with this vibe eval where it's like, hey,
like, you know, I'm, I'm just, I'm just going to run the model.
I know this particular coding question on top of my mind, so
I'm just going to run it throughthe model and see how it
performs. Oh, it does good.
But how many times can you really check, how rigorously can
you check for these models and how they perform and how
they are going to perform when you land them in your user's

(05:46):
hands? That stress testing is something
which is still a big question mark.
The third thing that I've seen is around reward hacking because in
the systems that we are building, we are using a lot of
reinforcement models. So whether it be for fine tuning
or human in the loop, we are using a lot of reinforcement
models and it does sometimes come back and bite you in the way

(06:11):
that the agents try to find those shortcuts or they reverse
engineer how it can maximize the rewards that you put inside
the system and that can sometimes come back and bite you too.
So these are like the top three things that I have seen,
particularly with respect to like challenges which are
aligned to responsible AI agentic systems.
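To make the autonomy point concrete, here is a minimal sketch of a human-in-the-loop checkpoint like the one described above. It is illustrative only: the risk classifier, the approval step, and the action names are hypothetical, not anything from the episode or from a specific framework.

```python
# Sketch: gate high-risk agent actions behind a human approval step.
# `classify_risk` and `request_human_approval` are hypothetical helpers.

HIGH_RISK_ACTIONS = {"merge_code", "send_payment", "delete_records"}

def classify_risk(action: str) -> str:
    """Toy risk classifier; real systems would use policy rules or a model."""
    return "high" if action in HIGH_RISK_ACTIONS else "low"

def request_human_approval(action: str, payload: dict) -> bool:
    """Stand-in for a review queue, Slack approval, or ticketing step."""
    answer = input(f"Approve {action} with {payload}? [y/N] ")
    return answer.strip().lower() == "y"

def execute_with_checkpoint(action: str, payload: dict, executor) -> dict:
    # High-risk actions require a human sign-off before execution.
    if classify_risk(action) == "high" and not request_human_approval(action, payload):
        return {"status": "blocked", "action": action, "reason": "human rejected"}
    result = executor(payload)
    # Log every decision so failures can be traced back later.
    print(f"executed {action}: {result}")
    return {"status": "done", "action": action, "result": result}
```

The useful design choice here is that the checkpoint sits around the action, not inside the model, which is the "architect it as software" theme that runs through the rest of the conversation.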

(06:32):
You brought up a ton of great points there.
I mean, the whole topic of evaluation we can definitely
dive into. And I think the other piece that
really struck me is this idea of finding the right context for
LLMs. And there's a reason that we're
hearing the phrase context engineering talked about so much
more. You know, folks were talking
about prompt engineering and of course folks are still

(06:54):
adapting prompts, but actually enabling our systems with the
right context can be really challenging.
You have a deep background in data-centric AI.
How are you thinking through the right data approaches that will
actually enable AI systems and agents with the context they
need to be successful and help hopefully address some of these

(07:18):
risks and failure modes before they make it to production?
Absolutely. I think data centric behavior is
not something which is specific to machine learning.
That's what that's what my background is in.
I've been working in machine learning much before, like Gen
AI became cool. So data-centric AI is something

(07:39):
that I've been doing before and it's very well applicable to
what we are calling as context engineering right now.
And let me like maybe for the viewers, if they've not heard
about it or like are not very well versed with what it is, the
entire shift that we have seen from prompt engineering to
context engineering has happened because of the very reason that

(08:02):
I was talking about. When we're building systems,
it's not just a LLM prompt anymore.
It's not an API call anymore where you're just getting the
output and that becomes your product, right?
Like that's not the end result that you're looking for.
You're trying to build a system, which means that you're going to
have a combination of different modalities of models.

(08:23):
You're going to have different combinations of smaller models,
mid-size models, larger models.
Some of them are going to be fine-tuned, some of them are
going to be distilled, some of them are going to be
proprietary, some of them are going to be open source.
And all of this combines with the data that it is getting
access to, the memory systems that you're building, the tool

(08:44):
calls that you're giving it access to.
So all of these combine and formulate a more software
engineering problem again, right?
And that's where this entire context engineering is coming in
picture. Because now that models are
conversing with each other, it's not just about the user prompt
and the system prompt anymore. It's about all the logs and

(09:07):
traces that is going back and forth between each and every
model to model conversation between say a traditional
machine learning model and your large language model.
Anything that's happening between, between the data set
that it's trying to access, between the tools that it has
called and everything in between.
So it's becoming a huge clutter of logs and traces and how

(09:30):
you're going to think about like the fallback for if anything in
the system breaks down. And that's somewhere where it's
very important to think about what data is fueling the back
end of all of these systems and where it's where is it pulling
the information from. So starting from the very first
thing, right? Like even today, like when I'm

(09:50):
talking to my friends working at frontier labs, there are folks
and this is still a super, super manual process when they have to
go and download data manually because they have to sanitize
it. They have to make sure that it
is very high quality. Because at the end of the day,
if you're trying to fine tune a model or you're trying to build

(10:12):
a coding agent, you need to train it with the data that is
really high quality. As soon as you feed in a large
language model, when you're building a foundational model
with bad code, it is going to mess up the output, It's going
to fail on the benchmarks. So this process is still very
manual where they're going in like trying to find the most

(10:33):
sanitized, the cleanest code with like high
programming standards that is used to like train these
foundation models. That's like step one like which
is pre training, right? The same goes on when you're
doing a post training for these models.
Whenever organizations are trying to fine tune them or
distill these models and trying to like customize it for the use
cases that they have, it comes back to how are they gathering

(10:57):
that data? Does it reflect the right kind
of data for the use case that they're
building for? The same thing again propagates
to when you're trying to test it for evals.
If you're testing it with the wrong data or the bad data or
the data which is not diverse enough, you're not going to get

(11:18):
real results of how your actual model is performing.
So again, it all comes back to like what kind of the quality of
data that you're using, and going back again to like something
like reinforcement fine-tuning or reinforcement learning from
human feedback. Not every piece of human feedback is
going to be the right feedback to your model.

(11:39):
How do you distinguish between what's a good feedback that
you've received from the user and what's a bad feedback?
Because if you end up training it with every single feedback,
then you're going to like open it up for vulnerabilities.
So all of these, if we think about like the core of or the
foundation of where all of these problems stem from or what's the

(12:02):
right key to solving all of these problems.
It comes down to how do you get the right quality data?
How do you get the most correctly annotated data that
you can use to train your models?
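As one way to picture the "right quality data" point, here is a small, illustrative pre-fine-tuning filter that drops weak candidate training examples before they ever reach the model. The specific checks and thresholds are assumptions chosen for demonstration, not a recipe from the episode.

```python
# Illustrative data-quality filter applied before fine-tuning.
# The checks and thresholds here are hypothetical examples only.

def looks_clean(example: dict) -> bool:
    prompt = example.get("prompt", "")
    completion = example.get("completion", "")
    if not prompt.strip() or not completion.strip():
        return False                      # drop empty pairs
    if len(completion) > 8_000:
        return False                      # drop suspiciously long outputs
    if "TODO" in completion:
        return False                      # drop unfinished code samples
    return True

def filter_dataset(examples: list[dict]) -> list[dict]:
    kept = [ex for ex in examples if looks_clean(ex)]
    print(f"kept {len(kept)}/{len(examples)} examples")
    return kept
```

The same kind of gate can sit in front of user feedback before it is used for reinforcement fine-tuning, which is the "not every feedback is the right feedback" point above.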
And as teams work through this challenge in different domains,
I think we've seen an obvious spike in one domain here, where in software

(12:23):
engineering, there's so much good quality data out there.
There's also a lot of bad code. Please do not ever train a model
off my GitHub. Not a good idea.
But there's plenty of great code out there that is very well
labeled, very well commented, excellently done, that can
be used to train these models. And it's interesting because we
also have developers and PMs who are working on these AI models

(12:48):
who can act as subject matter experts for that human feedback
you mentioned. And so I think we're seeing that
use case move ahead particularly fast because not only do you
have SMEs who are already on hand, you don't have to go
find a bunch of lawyers, for example, to double check the
data. They're just already working at
the company. But you also have a vast supply

(13:09):
of potentially very strong open source code and many other
projects. And this is where companies can
also leverage their own code bases to assign things like,
hey, these are going to be the right standards we want, this is
the approach we want. But this gets more complicated
as we expand the use cases we're considering, whether that's
something along the lines of, you know, broader writing and

(13:32):
with chatbots versus code-specific work, quite quickly
we see that concerns around fairness, transparency, and safety
start to come in. And with agents making
autonomous decisions, whether that's in customer support or
many other use cases, this is a key area of concern for many

(13:57):
enterprises in particular. You know, maybe you're not
worried about it in a demo, but as you actually start putting
this out to, you know, thousands, hundreds of
thousands, millions of people, this becomes the problem.
I know we've thought about this from our standpoint of, OK, you
know, working with a lot of our customers around providing them
in production guardrails they can apply to block things like
personal identifying information, safety concerns,

(14:20):
and also, of course, customized to their needs.
But beyond simply guardrailing, as useful as that is, what are
the other considerations you think teams should be taking
into their processes as they try to ensure that their agents are
actually delivering on the promise that they've set out to
solve? So the short answer is trying to

(14:42):
quantify anything and everything in that system that you're
building, because that's what comes handy when you're trying
to scale things. And that's what comes handy when
you have to go back and look at where things broke and why
things broke. Part of what we talk about with
tracing or logs or understanding how the agent behavior changes

(15:08):
with different user prompts. It all comes back to, hey, how
do you define what is good or bad?
So it's all about like those metric definitions and all of
those like quantifiable elementsthat really helps you build that
reliability in any particular system.
So one of the examples is when we are looking at something as

(15:33):
simple as a copilot, right? Like a coding copilot.
And we're talking about, hey, what level of autonomy do we
want to give this particular model?
Do we let it write the code for me?
Should I let it also review the code for me?
Should I also let it write unit tests for me?
Should I also let it push the code and like merge the code for

(15:56):
me? For any of these steps, there
are different levels of quantification that you want to
do right? Like at every given step?
Like is it doing right? Is it not doing right at any
given point in time? Do I need a human reviewer?
Or do we need a different model maybe to like judge the
response, like having an actor-critic system right there.

(16:18):
So all of those are very important to understand where
are those guardrails? Where do they need to
sit? And how much autonomy do you
want to build into that system? And as I said, there's no right
answer to like how much autonomy; it is completely dependent on
the use cases. You were talking about a
use case where you have PII. It would be completely different

(16:41):
if in my case there is no PII, right?
The level of autonomy versus the level of human intervention that
you want to have in any particular system is very
subjective to what's the level of risk that you have, What is
the level of scrutiny or like compliance that you're building
that under. So for example, in like going
back to the coding example, right, Like if there is a point

(17:04):
where I have automated the entire pipeline right from where
it understands what I want it to do, it codes it out, it writes
unit tests for me, it does a code review, it pushes and
merges the code into the main branch.
If something breaks, how exactly do I go back and see where it
broke and why it broke? And can I really reverse each

(17:28):
and every step that happened along the way?
And that is really important when you're trying to think
about fall back options. If something breaks, how do you
trace it back? And how do you make it right?
Because if if you've not logged all of those things, if you've
not had like checkpoints along the way for how a model is
performing, how is it responding?

(17:49):
And how is it merging any code into your code base? Then it's
going to be extremely hard for you to go figure out what
broke, why it broke, and how do you prevent it from happening
again. So all through this way, what's
really important is that how do you quantify each and every step
that your model is taking? How do you quantify each and

(18:11):
every step of your system? And how do you have evaluations
running at every given checkpoint?
So that really helps you to build something which is
scalable. And at the same time, you're
not just falling back on subject matter experts

(18:31):
to review it and like see if it's right or not.
And it's something that can be logged.
And it can be something that you can look at on a
dashboard and see like what's going wrong.
So instead of you having to like manually go through all the
logs and like figure out like what went wrong, you have all of
that available in a very quantified manner.
And you're able to pick and choose and you can like really

(18:53):
pinpoint and say why things broke and how do you not let
that happen again. I think you're spot on that we
have to treat observability of our AI systems as fundamental
because we need to be able to unpack and do root cause analysis on
what is actually going on. What's broken?
Do we have an agentic tool error problem or is it a context

(19:14):
window issue? Is it simply an efficiency piece
and we need to improve the actual ability of our LLM to make
calls more rapidly? Or maybe we just simply haven't
paid our API bill. We're getting stuck.
It really depends, and I think This is why we see as we get
past testing, so often, evaluations for agents become

(19:39):
obsolete if they haven't been customized to an organization's
needs. And a lot of engineers are
starting off of either open datasets for their evals and scores
or off of out-of-the-box metrics like ones
we provide on agents, for example, action completion.
And those are all great and useful, but it's once you

(20:01):
actually fine tune those for your use case and customize them
for your use case that I think they really start to shine.
Are you seeing AI builders actually take on
this customization challenge? Or do you feel like we're still
at a point where there's a lot of confusion for builders about
what it means to effectively evaluate AI systems?

(20:21):
I feel it depends on who are we talking about.
I wouldn't really generalize saying that, hey, like everybody
is confused or everybody is like doing it the right way.
I think it's different when it comes to deeply technical
professionals. I think people who have worked
with machine learning operational systems have a good

(20:44):
understanding on how they should be defined, even though there is
a slight difference between how traditional
machine learning models work versus how large language models
work. There's a lot to do with
context, window, memory, etc, which sort of is quite different
and the model nature is also quite different.
But I think people who have had deep expertise running machine

(21:06):
learning models at scale have been able to adapt to that
change well, compared to folks who are sort of coming
from a non-technical machine learning background, now that the
barrier to entry for building anything with LLMs has
become so low. I think that's the area of

(21:29):
vulnerability where I'm seeing where a lot of like early
founders who don't have a deep technical background, who have
not really built systems at scale, who don't have a good
understanding of ML system design when they are coming in.
And they're seeing that, hey, like I have a bunch of these
vibe coding tools out there where I can like, you know, spin
up a system in minutes. And they don't really go through

(21:53):
like that thorough analysis of how original systems used to be
built, like, you know, having solid PRDs before you
build a product, having a solid understanding of what does the
user journey look like, what does each and every step
look like in the process, talking about the flow of

(22:15):
data, which is also like part of context engineering, right?
Like how do we evaluate each and every step of how the data flows
inside a large language model and through the system?
People who have not done that or people who are coming from non-
machine learning backgrounds, I feel primarily are confused, are

(22:35):
figuring out because they have a solid business need of why they
need to build this product, but they lack that engineering
expertise. It sounds like you think not
only from a context engineering perspective, but also perhaps
even evaluation driven development perspective, which
is terminology that really has only come into vogue over the

(22:56):
last year. And I think maybe harkens back
to this idea of test driven development from traditional
software engineering. What's your perspective on the
decision making process for folks who are perhaps newer to
AI and maybe don't have that deep of a machine learning
background? How should they be thinking
through when to add evaluation and observability to their systems?

(23:18):
Is this something that, you know, has to be a P0?
You start doing it from the very start?
Or does it depend on the use case
more so? What's your perspective?
It depends on how soon they want to build something in production.
If it's sooner, then definitely like that should be like your
day one part of how you are planning to like build evals

(23:39):
into your system. But if it's something where
you're like, shoot, I'm building a prototype, I'm going to go pitch
that to a VC and see how things go.
That's something that you can think about later.
So it depends on like how soon do you think your system
can go from your laptop to a cloud hosted environment and

(23:59):
available for people to use. And the risk factor you talked
about earlier too, for sure. Where like, hey, do you have PII
that may be surfaced from that application?
Yeah, you might need some, some guardrails in place.
You may need to, you know, have live evals.
But of course, this monitoring is dependent upon what use case
folks have. And the, I think this also comes

(24:22):
back to your points about responsible AI and considering
as you put it, from first principles, OK, what am I doing
with this? Let me make sure I actually have
this set out. I mean, not just vibe code my way
into potentially a mess, but actually think through architecture
and the goals that I'm setting for whatever agent we're
building. You also mentioned a specific

(24:43):
evaluation technique that is no longer new, but I think it's
very central to how evals are being conducted today.
And it's no longer focused on humans as evaluators, or at
least not first-line evaluators. Most folks are using LLM-as-judge
at least to some extent within their systems.
And you mentioned very kindly in our prep call for this, that

(25:04):
you enjoyed the insights in our e-book on the topic,
Mastering LLM-as-Judge. And I'd love to get your
perspective. Is this an approach that you see
every team should be applying at scale or are there inherent
risks to leveraging LLMs and we need to be considering the

(25:24):
biases and guardrailing them in some way?
There definitely are risks that come with using LLM as a judge
because you're saying that, you know, on one hand you're saying
that hey, I'm using an LLM and it can sometimes not produce
right kind of output. So you are inherently defining

(25:45):
the nature of how the model looks like.
And then you are saying that, hey, with the same exact
characteristics, we're putting another model to, you know, go
and judge how this previous model was doing.
So it's in in a way like if you're trying to say that, hey,
this model is blind, then you're like taking another blind model

(26:05):
to go and check how the previous model is doing.
So that's surely not going to help in some use cases, but it
is great in some other use cases.
So one of the things is LLM as a judge, it's great if you are
trying to measure something which is not very quantifiable.

(26:30):
So the reason I say that is if you have an LLM output a
particular result, and when you're building that system, you
have a good understanding of what that range of output could
look like or what that range of answers could look like and what
a right or what a wrong answer could look like.

(26:52):
If there is like a statefulness in that system, then going with
traditional evals is the best thing to do.
But in a lot of cases, that's not how things are.
A lot of cases. One, one of the examples that I
can talk about right, like something as simple as hey, be

(27:14):
my writing assistant. It is such a subjective use
case. And the reason it is is because
the way I write is very different from how somebody
else writes, or how a third person writes, or how
somebody who's a native English speaker writes or how somebody
who's proficient in Mandarin writes.
It's very, very different. So when the use case is

(27:37):
that undefined or difficult to define where the boundaries are
not very specific, having an LLM as a judge can actually help
because you're using a different model which also has a wide
variety of knowledge base in it to go and judge how the previous

(27:58):
model did. Now obviously, like even for LLM
as a judge, you can fine tune itto a certain extent.
Because if you are using this LLM plus LLM as a judge combo
for say a storyboard writing use case or a poem writing use case
or a co-writing use case, for example, there are still some

(28:21):
boundaries that you want the models to follow.
So in those scenarios, you can fine tune the LLM as a judge
model to also like perform a specific way.
But inherently, like when the output range can be differing
for every single user, that's when it's helpful to like use
LLM as a judge. It does come with its own
challenges the same way an LLM comes with challenges, but

(28:45):
for any of those scenarios, it's rightfully one of the best
ways to deal with it rather than like not having anything at all.
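For readers who want to see the shape of an LLM-as-judge setup like the one discussed here, this is a minimal sketch. The rubric, scale, and `call_llm` placeholder are assumptions for illustration; plug in whatever inference client and criteria fit your use case.

```python
# Minimal LLM-as-judge sketch. `call_llm` is a placeholder for your
# model provider; the rubric and JSON format are illustrative only.
import json

JUDGE_PROMPT = """You are grading a writing assistant's answer.
Rubric: clarity, faithfulness to the request, and tone (1-5 each).
Request:
{request}
Answer:
{answer}
Reply as JSON: {{"clarity": n, "faithfulness": n, "tone": n}}"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider here")

def judge(request: str, answer: str) -> dict:
    # Ask a separate model to score the output, then parse the scores.
    raw = call_llm(JUDGE_PROMPT.format(request=request, answer=answer))
    return json.loads(raw)
```

As noted above, the judge inherits the same failure modes as any LLM, so for well-bounded outputs a deterministic check is still the better first choice.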
Yeah, it's typically better to do something even if it's not
the perfect thing when it comes to evals and observability, having
at least something in place is going to start to give you
telemetry that will prove valuable as you fine tune and

(29:07):
adjust your approach down the line.
But let's shift our focus a little here.
I know something that is extremely important to you and
that you're passionate about is the open source community, both
as an area that you've, you know, spent time professionally,
spent time in your free time. And of course, as we discussed
earlier, it's an incredibly important data set for LLMs to

(29:31):
train off of as well. How do you think the
proliferation of both data and now powerful open source models
has fundamentally changed the game for teams looking to build
AI agents? So funny thing is one of the
questions that I get all the time is I'm very vocal about

(29:54):
open source community. I advocate for it a lot.
And one of the biggest questions that I get is like, if open
source is so good, why do we even need proprietary models?
And it's, it's a very, very valid question.
And it's something that I, I keep asking myself every time
there is a new open source model that comes out.

(30:15):
And to some extent I would say that after the DeepSeek V3 model
came, we did see that inflection point where the performance and
price ratio of what you can really build at a very cost
efficient and a very customized manner, something that you have
full control on with an open source model compared to a

(30:39):
proprietary model. It just went up. Like with DeepSeek
V3, people were like, oh, like we never thought an open source
model could be as performant as this one is.
And then it was followed up withR1 model.
There were different variations of R1 model.
Recently we have had V3.1 and subsequently we have seen like

(31:00):
various other labs also produce their own open source models and
they are one of the top used models.
When I see from an agentic AI behaviour, like when I see the
use cases that companies are building on both enterprises as
well as start-ups, they are heavily invested in using open
source models for certain use cases.

(31:21):
Again, I'm not saying that the industry has completely shifted
from using proprietary model. There's obviously like certain
reasons around developer experience around how easy it is
to like get up and running on proprietary models.
A lot of times it is also aroundmodel quality.
For example, if you're looking at vision capabilities of the
model, proprietary models do actually give you really good

(31:44):
performance. So it's a mix of reasons why
the industry is going and looking at open source models from a
customization perspective, from a cost of use perspective, from
running it on-prem perspective. There are a bunch of reasons
there. And what we are seeing is a lot
of the companies which had started pivoting into pure

(32:08):
proprietary models, including OpenAI, including Meta,
including Google as well. Now they have started like
thinking that, hey, like, no, we need to invest in open source
models as well. That is one of the
edges that they are getting. And rightfully that's why
you're seeing all of the new models coming out like latest

(32:30):
was with OpenAI's GPT-OSS models.
And that's where we're seeing like an equal level of
investment going on in both to some extent.
But the top models if you see are still coming out of the Kimi
models, the Qwen models, the DeepSeek models, etcetera.
And they are like really, really capable.

(32:51):
And we are seeing that shift where people realize that for my
use cases, for certain of the use cases where out-of-the-box
models are not the right solution for me.
Where I need more control over how my models are being trained.
Where I need more control over how my data that I'm
pushing into the system is being used.

(33:13):
Where I care about zero data retention policies, places where I
care about customizing the model to the size and to the speed
that I care about. In all of those use cases,
people are switching to open source models because they just
have that autonomy to them in order to customize the model the

(33:33):
way they want. From your vantage point, where
do you see the biggest gaps in the open source AI stack such as
Vision, which you mentioned earlier is an area where
proprietary models are doing extremely well that if these
gaps were filled would unlock a new wave of open source scaled

(33:54):
applications that are powered off of open source LLMs?
I think vision is definitely like one of the areas where it
is harder for the open source models, at least the ones that
are out there compared to the proprietary models which are
available in the market. Apart from that, yes, there are
some level of performance gap that comes with how open source

(34:20):
models look on the benchmarks that we have publicly
available. That being said, again, like
most of the use cases when I'm seeing enterprises or start-ups
use these open source models, they are not using it
out-of-the-box. They are either like using
supervised fine tuning methods to fine tune it or reinforcement

(34:40):
learning fine tuning or even like if they don't have any of
the data set or they are not likely pulling in the data to
fine tune it, they're using synthetic data to do that.
So at the end of the day, I feellike people have a clear choice
of when they are building a composable system where they
require very, very specific outputs from specific models.

(35:07):
They are choosing a combination of models.
So most of the use cases that I'm seeing, they're never like
running it on one particular model.
Like one of the use cases that I can talk about is with
Notion, we had the head of AI engineering from Notion join us
for our dev day and she was talking about the use case so
that they're running on Fireworks and parts of the

(35:27):
Notion AI where latency was really important to them and
quality of the responses was very, very important to them.
They were not able to get that from the proprietary models.
So in those use cases, they did fine tune the models on
Fireworks AI platform and they were running it with very, very
low latency on there too. Now are they using just one

(35:49):
model? No, they're using a combination
of multiple different models. And that's the concept of like
compound AI systems that we've also been speaking about, which
is integrating multiple modalities of models together, integrating
smaller and larger models together, integrating models
with specialized performance metrics together and like having
them build that end to end system.
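A rough sketch of the compound-system idea being described: route a request to a small, fast (possibly fine-tuned) model when latency matters and the task is narrow, and fall back to a larger general model otherwise. The model names and routing rule are placeholders for illustration, not Notion's or Fireworks' actual setup.

```python
# Toy router for a compound AI system: small fast model for narrow,
# latency-sensitive requests; larger model otherwise. Model names and
# the heuristic are hypothetical.

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your inference provider here")

def route(prompt: str, needs_low_latency: bool) -> str:
    if needs_low_latency and len(prompt) < 2_000:
        return call_model("small-finetuned-model", prompt)
    return call_model("large-general-model", prompt)
```

In a real system the router itself can be another model, a classifier, or fixed rules per workflow step; the point is that no single model serves every request.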

(36:12):
So I don't see a future where, hey,
all the proprietary model shops are
going to shut down because there are open source models.
I think it's a pros and cons of how these systems are being
designed and how the model architecture looks like.
What are the specific nuances that model is good at?

(36:33):
I mean, I can share a personal example, right?
Like I have my preferences on when I use ChatGPT versus
when I use Claude versus when I use Perplexity versus when I use
Gemini. I use all the four of them, but
for slightly different use casesbecause somewhere I like the way
ChatGPT writes certain things for me, whereas I like certain

(36:56):
things about Gemini, the way it writes for me.
So at the end of the day, when the model gets so large and when
we are trying to like just measure it on top of a
benchmark, it's not like an accurate way to
judge the performance of a model because at the end of the day,
it is not an objective task that we are measuring it against.

(37:18):
It is a subjective task. And there's no one right answer.
There's no like 4 right answers.There could be multiple
different right answers. There could be multiple
different ways of framing the same thing, also subjective to
the user who's interfacing with these models.
So that's where like all of these nuances come in play.
And yeah, I mean, like that's why we're seeing that some

(37:41):
companies are focusing on building 600 billion parameter
model versus we're also seeing 20 billion, four billion
variations of the same models. I love that you're bringing up
both model variation and personal model variation, where
I would term that as using different models for different
use cases. Something I will say I
personally do as well. Like for example if I'm needing

(38:05):
help writing or problem solving, I actually prefer Claude over
GPT and I find it's a little more effective at that.
I'm curious, from your perspective, do you mind sharing
an example or two of when you might switch models if you had a
particular set of tasks? So also, like, probably

(38:27):
giving you another example, right?
Like I already spoke about using different tools, which is Chat
GPT versus Claude versus Perplexity versus Gemini.
Depending on like if I'm needing its help to write a technical
blog or something which is a more personable text, or if I'm
needing it to help me write code, or am I asking it to

(38:50):
help me write a blog around the code.
So all of these are like very nuanced specific use cases, but
I want to give you a use case where it's essentially the same
task but different results is what I find when I'm using a
Cursor agent versus when I'm using Claude Code.
So what I've realized is when I have a very, very defined scope

(39:14):
of what I'm trying to build, when I know that in one or two
shots of prompting I am going to get the result that I'm
looking for. That's how defined I have.
That's how defined I've written out the problem.
Cursor agent does a good job for me.
It's able to understand what I'm trying to build and it is able

(39:38):
to do a very good job at it. But at times what I've seen is,
let's say my problem statement is not very well defined.
And let's say I give a mildly ambiguous task to cursor agent
and I have to do a bunch of follow-ups on it.
It starts breaking, it tries to fix one thing, ends up breaking

(39:58):
10 other things, and then I haveto do another prompt to fix that
one thing that it broke. And then it ends up breaking
like a few other things. So it takes a lot of like back
and forth to get a right result.And by the time I do that, it
has hallucinated a bunch of import statements at the top.
So what I've realized is in cases where I probably have a

(40:19):
very, very defined scope of, this is exactly what I want to
build. These are the buttons that I
want in this particular app. This is the tool that I want you
to use the front end for. If I have something that's that
specifically defined, then Cursor or a Cursor agent
works out better for me. But when it's more
collaborative, Claude Code seems to be working better for me.

(40:42):
So again, like going back to the point, right, like it's so
specific about how your users are going to interact with the
model. Somebody who is an engineering
professional would interact with your model differently compared
to how a PM would interact with your model versus how a
completely non-technical professional would interact with

(41:03):
your model. So that's where I keep coming
back to for your use case, how can you define that user
journey? How can you quantify the way the
users are going to interact withyour, with your model?
Can you build a simulation out of it?
Like that's one question that I typically ask.

(41:23):
Like hypothetically, if you're trying to build a particular
product and I ask you that, hey, you don't have time to go gather
user feedback for this, I need you to get this out in the next
couple of hours or next couple of days.
Let's say, for example, being more realistic, can you build a
simulation environment where you can exactly point to how your

(41:45):
users are going to be using or interacting with this particular
model? So if that's something that's
something you can define, that's great because you can like build
test cases around it and you can like think about all the
possible edge cases that could come in that scenario.
But a lot of times people cannot do that because you don't know
how your tool is going to end up being used.

(42:06):
Who's going to use it, how they're going to use it?
Are they going to try to deliberately break it or not?
So all of those like, you know, unknowns come in play and that's
where it becomes harder for the designers to like
architect a system like that. I love that answer and I think
it's very important to double click on in a bit because there

(42:32):
is so much customization and personalization that needs to
happen depending on your use case, depending on the size of
companies you're working with, depending on the type of users
you're working with, that just saying even something as
specific as, oh, I work with AI engineers may not actually get
you where you need to go. You need to understand what

(42:54):
they're trying to achieve. Perhaps you need to understand if these
are new AI engineers or folks who've been doing machine
learning for years, for example, as you brought up earlier,
there's a big difference in how those folks who are newer to
the field versus have been around for a while are
approaching things, you know, fundamentals, evaluations.
And another crucial area you brought up earlier is the idea

(43:17):
of compliance. And you mentioned, you know,
zero data retention policies, for example. With the access
that models have to our data and our systems, including,
you know, making their own decisions about what to do with
that data. Sometimes if they're in an
agentic environment, there's definite risk.
That's why you're seeing risk management teams at large

(43:39):
enterprises have very many sleepless nights.
What do you see as the unique governance and security
challenges that teams need to consider when they make the
choice between building scaled agents on open source models
like we've been talking about here, versus proprietary ones?
Are there pros and cons that you

(44:00):
would encourage them to consider?
With proprietary model, I think one of the biggest challenges
that come is, as far as I understand, again, there could
be like difference in policies across different model
providers, But in most of the cases it isn't a zero data
retention policy. They do like capture the data

(44:21):
that goes in and out of the model in order to make their
models better. So that's, that's one part of
it. Second, it can become harder for
you to really peel the onion and see where things are breaking,
how things are being processed by these models compared to like

(44:46):
what you can do when things are open source.
You can really tweak the model to the T when you're trying to
use an open source model and you're trying to customize that
model for your specific use case now.
So one of the things that, Conor, you were mentioning
is around compliance and like how it would play out in
a regulated environment, right? So one of the very basic

(45:09):
challenges that come in when you're trying to build a system
like this is when users are interacting with your model.
How do you control the access that the model gets every time
it's interacting with a different user?
So that's the traditional role based access management that you
want to build into your model. And how much information does

(45:31):
that model particularly need to save in its short term memory, or
like what's the exact level of anonymization that you want
to happen with particular data being held by a
model? So in all of these scenarios, in
the typical, again like this is the typical best practices that
I've seen across the board is where people try to bring in

(45:55):
multi agent systems. They say that, hey, we have a
primary agent which has access to five different things.
That's the primary agent that interacts with the user.
We're not going to give it access to certain data sets by
default because we know that it could be a point of failure.
So what we're going to do is have it send a call to our

(46:18):
secondary agent that has access to specific data set that you're
trying to pull. And based on the parameters that
are being shared from the user request, plus whatever
processing has happened at the primary agent stage, the
secondary agent, or one of the secondary agents, sees if it
matches the criteria at which it should be pulling in the right

(46:40):
information and sharing it back to the primary agent.
Now, again, it could be something where it does share it
back with the primary agent, but the primary agent does not store
that information either in its short term or long term memory.
It is just something which is it's communicating and then just
like forgetting about. So all of these are nuanced
engineering setups that folks need to make.
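Here is a minimal sketch of that primary/secondary agent pattern: the primary agent never touches the sensitive store directly, the secondary agent enforces a narrow scope, and the primary agent never writes the returned data into its own memory. All names and the data-access stub are hypothetical.

```python
# Sketch of a scoped primary/secondary agent split. Hypothetical names.

def lookup_sensitive_store(user_id: str, field: str) -> str:
    raise NotImplementedError("stand-in for the governed data source")

class SecondaryAgent:
    """Holds credentials for one sensitive dataset and nothing else."""
    def __init__(self, allowed_fields: set[str]):
        self.allowed_fields = allowed_fields

    def fetch(self, user_id: str, fields: list[str]) -> dict:
        requested = set(fields) & self.allowed_fields   # enforce scope
        return {f: lookup_sensitive_store(user_id, f) for f in requested}

class PrimaryAgent:
    def __init__(self, secondary: SecondaryAgent):
        self.secondary = secondary
        self.memory: list[str] = []          # conversational memory only

    def answer(self, user_id: str, question: str) -> str:
        data = self.secondary.fetch(user_id, fields=["account_status"])
        reply = f"Your account status is {data.get('account_status', 'unknown')}."
        self.memory.append(question)         # keep the question...
        return reply                         # ...but never store `data`
```

The sensitive payload passes through the reply and is then forgotten, which is the "communicating and then forgetting about it" behavior described above.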

(47:03):
And that's exactly where I'm, where I go back to that, hey,
when you are talking about building an AI agent as an
architect or as a product manager or as a leader who's
thinking about building a system, I think you need to go
way beyond how a model performs. You cannot just look at a model

(47:25):
benchmark and feel that, hey, like this is a good enough model
for me to use in my workflow and then just like tie it together
with each other. That's the worst thing that you
can do because none of these models understand like I can, I
can break any model with a bunch of quirky twisted prompts.
So that's where it's important to understand that, hey, like
how do you architect it as a software?

(47:47):
It is not a model that you're surfacing to your user.
It is a software. It is a system that you're
surfacing to your user, which requires it to have that nuanced
approach at every single point of communication between the
models, between the model and any tools that it is trying to
access, between the model and the data set that it is trying
to access, etcetera. You have such a unique position

(48:10):
from how central your content creation is and how central you
are to much of the AI movement today.
I know you talked to many leaders and builders on a daily
basis who share their insights with you and are seeking to
share with you. Here are the the new things
we're looking at. And then of course, you're doing
so much important work with fireworks as well, based off of

(48:34):
all that context that you're you're gathering for you, I
guess your personal mental model.
What are you seeing that you think AI builders are either not
paying enough attention to rightnow or on the horizon?
How do you take any particular model that's running in
development, any system that's running in

(48:55):
development, and take it to production level?
And when that happens, it's very different when you're working
with traditional machine learning models.
And this is the same exact challenge that we had when we
used to work with something as simple as a logistic regression
model or deterministic systems to traditional machine learning

(49:16):
models. And the same level of shift is
what we are seeing from traditional machine learning
models to generative AI kind of models.
And it's the level of complexity of how the model performs.
So there is while the cloud computing fundamentals, while
the system design fundamentals stay the same, it is still not

(49:37):
the same because of how the model performs.
It is because of how input goes into the model and how the
output changes every single time that you're just running
inference. So one of those beautiful blogs
that I saw recently was from Thinking Machines Lab and they
talk about this, which is non deterministic nature of large

(50:01):
language models. For the large part, people's
misconception was that an LLM is non-deterministic or it shares
answers differently when you're asking different kinds of
questions. Sure, it does.
If I just say like summarize my own query into something which

(50:22):
is 2 lines shorter, it's going to respond in a different
manner. But even with the same queries,
at times you see the LLM inference engine responding in
a different manner. And that was coming based on how
the plumbing looked around the large language model rather
than just the nature of the model itself.
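A simple way to observe the non-determinism being discussed is to send the exact same prompt several times and count distinct outputs. This is a hedged sketch; `call_llm` is a placeholder for whatever inference client you use, ideally pinned to temperature 0.

```python
# Repeatability check: same prompt N times, count distinct completions.
# `call_llm` is a placeholder for your inference client.
from collections import Counter

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider here")

def repeatability(prompt: str, n: int = 10) -> Counter:
    outputs = Counter(call_llm(prompt) for _ in range(n))
    # A perfectly deterministic stack yields one key; in practice,
    # batching and numeric details in the serving stack can yield several.
    return outputs
```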

(50:43):
So these are some of the things that are still being uncovered
about the performance of the model, about like how the
behaviour changes in different scenarios.
And it is still something which people are discovering and
sharing and it's still somethingthat not a lot of practitioners
understand. So I think it's always a

(51:04):
challenge to architect these systems based on
how we are changing the brains of these systems from something
which is more deterministic to traditional ML to large language
models. So like now that we have a
different kind of brain, it has a different kind of thinking
capability. So the architecture also needs
to change. Architecture also needs to

(51:26):
evolve. Part of it would be objective
evaluation, part of it would be subjective evaluation.
And the way that we define all of these things is one of the
biggest areas that people are still continuing to
work on. Aish, thank you so much for
joining me today and sharing your thoughts on the importance

(51:48):
of evals, open source, and so much more.
It's been such a distinct pleasure having you on the show and
I really appreciate you taking the time.
Where can our listeners go to follow you and the important
work you're doing? Well, LinkedIn is where I'm most
active on. While I did start recently on
Instagram as well because I realized there's only so much
volume of content that I can post on LinkedIn.

(52:10):
And while I have been catering it to a more technical audience,
maybe like mid-level to senior-level professionals, on my
LinkedIn, to help more students and enthusiasts, beginner-level
folks, I've started sharing like tidbits
and like bite-sized videos on Instagram just to get people
started in the field. That's awesome.

(52:30):
I'm going to have to go follow your Instagram.
I'm excited to check out some of those videos.
And we'll be sure to link Aish's Instagram and LinkedIn in the
show notes. Aish, it's been such a pleasure
having you on and a reminder to everyone who's listening.
While you're going to follow Aish on LinkedIn or Instagram, your
platform of choice, please don't forget to subscribe to our new

(52:52):
Chain of Thought newsletter on LinkedIn for more insights on
building with AI. And hey, you know what, if you
haven't already, if you're watching us on YouTube, you
know, like and subscribe there or Apple Podcasts, Spotify,
wherever it is. We always appreciate that as
well. So thank you so much for joining
us, Aish, and thank you for all the insights.
Thank you. Thanks, Conor.
Listeners, that's all for us this week.

(53:12):
Have a good one.