Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
The AI agent gold rush is on. Everyone wants one.
Businesses are racing to use them.
Developers are coding furiously. So why aren't they everywhere
yet? Well, there's still a gap
between the expectation and reality of agents.
And unlocking that reality hinges on solving three fundamental
challenges: observability, reliability, and performance.
And today we're going to discuss all three so that you can deploy
your agent to production with confidence.
(00:21):
Welcome to episode 39 of Tool Use, the weekly conversation
about AI tools and strategies to empower forward-thinking minds,
brought to you by Anetic. I'm Mike Byrd, and this week
we're joined by Josh Protel, the founder and CEO of SYNTH, a
company that automates the development of state-of-the-art
AI agents. Josh, welcome to Tool Use.
It's great to be on. Absolutely happy to have you.
Would you mind giving us a little bit of your background,
how you got into AI? Yeah, for sure.
So I'd say my first exposure to AI was at a cyber security
(00:45):
startup called How War. And I was working there because
I ended up getting acqui-hired from a small startup that I began
with some friends. And there, in 2022, we were
trying out GPT-3 for vulnerability fuzzing.
So what vulnerability fuzzing means is you throw a bunch of
potential inputs at a piece of software and you see if you can
(01:08):
break it. And before AI, people would use
kind of just deterministic methods to generate that data.
You can imagine that if you have an AI, it can kind of get to
more interesting inputs that are more likely to break the software
sooner. And I think at that point, we
weren't quite ready to, you know, hack into everybody's
website with AI. GPT-3 was definitely a beginning point,
(01:30):
but it was clear to me that there was a lot of potential
working with these systems. And so from there I ended up
working at a startup that, you know, does a really good job
automating accounting with AI. That's called Basis.
And in parallel, I did a bunch of research for some folks at
Stanford, culminating in a paper on optimizing agent prompts.
(01:51):
So at that point I had some academic experience kind of
figuring out what works and what doesn't.
Got the hang of that. And then also a lot of hands-on
experience at startups and really thought the timing was
right to just spend the next 10 years of my life working on this
problem for everybody. It's a lot of fun.
(02:11):
Absolutely. And it's, it's funny you brought
up GPT-3 because I remember there was, I think, a press
release from OpenAI saying this is too dangerous for the public.
We cannot release it. And looking back, the outputs were
laughable. But a lot of people are seeing
the parallels with agents today. Where it's like, oh, you can
sometimes book a plane ticket or maybe order your food, but we're
just in the very early stages. So what
convinced you that agents were what you wanted to put your
(02:31):
time and energy into? Yeah.
So I think with agents, there is this kind of irreducible
complexity that will only kind of grow.
The irreducible complexity is like an agent is doing a lot of
things. It's interacting with, with a
lot of, you know, tools and an environment out there.
(02:52):
And maybe some of those interactions can cause problems
or, or they can actually add a lot of value.
There's just a lot going on and,and something that I worked on
in research is like, OK, you have an AI system that's doing a
lot of things. How do you help kind of
coordinate all those subcomponents to play along
nicely and to kind of get the outcome that you want at the end
(03:13):
of the kind of journey. And the cool thing about agents
is as AI gets better, that complexity will only grow,
right? So if you think about the long
run, there are certain problems you can work on in AI that will go
away eventually, right? That's kind of what people call
the bitter lesson. It's like you focus really hard
on kind of making something work with the limitations you have
(03:36):
today. And then a better model comes
out and all your work is kind of useless, right?
That's not really exciting for me.
The thing with agents is like, as AI gets better, they'll be
doing more ambitious projects. They'll be working for longer,
you know, maybe for like days at a time.
They'll be working in kind of more mission critical settings
where they can really screw things up.
And so being able to kind of tune them and rein them in, I
(03:59):
think only becomes more interesting, more important and
a harder task over time. And that's reassuring, right?
It means I'll have something to work on for the next 10 years
and I'm not super worried about all my progress just getting
obliterated by GPT-5. Yeah, totally.
The ability to kind of skate where the puck is going and
(04:20):
playing ahead is super important and it's really impactful work.
It's not like you're just kind of eking out a micro-optimization
of some existing process.
It's a whole new paradigm shift. Could you shed some light just
based on your experience, the type of agent use cases you see
that actually work right now or maybe some that are coming up
soon that are maybe one or two breakthroughs away from being
like economically viable? Yeah.
So in terms of what works now, I would kind of partition it into
(04:42):
maybe low value and high value. And I don't mean that kind of
with prejudice. And maybe you'll see what I mean
with high value. You have the agent working on
some kind of, Claude calls it artifacts, but it's just this
thing that it's working on. At every step, it kind of makes
a small change or it tests it out.
(05:03):
And then at the end of a bunch of steps, it comes back to you
and says, here's what I produced. And so really there it's only
tasked with creating this one thing.
It's a difficult task to do that.
It has all these resources to help it get there.
There's a lot of vertical agent startups maybe in healthcare or
finance or accounting or law that kind of have this set up.
(05:26):
And I would call it high value. The flip side of that is that
it's high cost, so it takes a lot of human effort to kind of
craft an environment where the agent can succeed at creating
this kind of valuable artifact. Over all those steps, because
there are so many steps, it's really easy for the agent to get
derailed, or for it to just, you know, create a reasonable plan
(05:47):
for creating an artifact, but not the plan that really gets it
there. So it's kind of hard for
developers to kind of make this setup, and it costs a lot
of money just to run the setup once, right?
It costs $1 to $10 per run if you're running Sonnet at millions of
tokens for like $10 per million. And that price point hasn't
changed a lot in the last year. The, the kind of the light at
(06:08):
the end of the tunnel is that for some of these verticals,
producing this artifact can be really worthwhile, especially if
the professionals that would, you know, kind of counterfactually do
it are charging, you know, $100 an hour or even more.
And so this is kind of one way, like one type of agent form
factor that I think has really worked out.
And then I would say there's this kind of low value, but then
(06:29):
the flip side, low-cost and kind of easier to ship form factor, where
the agent is watching some, you know, set of components, or it
has to be responsible for something that the customer
cares about. And then something happens and
the agent has to figure out what now, right?
(06:50):
So the kind of perfect example of this is maybe an agent that
is doing customer support. The agent wants to just maintain
a status quo of either every customer is happy or if they're
not happy, there's a ticket that reflects what should be done
to make them happy in the future, right?
That's the status quo they want to protect.
And then a customer comes in, they're like, hey, I'm upset,
(07:12):
right? And it's the agent's
responsibility to figure out like, is there a ticket that
addresses this? If so, I'll just inform them.
Is there not a ticket? OK, then I need to have a
conversation with them to figure out what's the problem.
Maybe they're mistaken. Maybe no ticket needs to be
written because the code actually covers this.
They should just read the docs, right?
In that case, they just have kind of the status quo to protect and
(07:32):
they have to figure out what needs to be done to do that.
The end result of what they do is like not super complicated.
They'll write an e-mail, they'll like write a ticket.
You know, you don't need the agent to do that thing.
It's usually in one step. What you need the agent for is
to review all this context and kind of triage things and, and,
and you know, think about priorities and that's often a
(07:55):
lot easier. That's more of this kind of
horizontal form factor. It probably costs you like $0.50
to a dollar per pop and it's a lot easier to get to market.
So you see a lot of smaller start-ups that aren't kind of
vertically focused using this form factor and it tends to work
out. Yeah, we've had a few people on
the show where they talk about viewing the LLM as just like a
(08:15):
fuzzy function. You don't quite know what's
going on, but then based on that you can kind of piece it into
your pipeline so you can get a more deterministic output.
On that note, have you noticed people having certain issues or
pitfalls when building agents based on either like the high
value or low value form factor? Yeah.
I mean, there are a lot of problems that people can run
into. I would say the first few
(08:36):
problems, and they are, you know, the first steps in
making a really great agent system, but they're actually not
easy to get right and they're important to spend
some time on, is, I would say, making sure that you're
communicating to your AI what you want.
I think people have this idea that, you know, when they talk
to ChatGPT, it tends to infer their preferences as an
(08:57):
individual like pretty well. It kind of gets them, you know,
maybe in some sense. And so like in an enterprise
setting, if I just kind of give it the broad gist of what I
want, it'll, it'll read my mind and it'll kind of guess how to
make all the right trade-offs and, and kind of all the details
and how to dot the i's and cross the t's.
But what I tend to see is that that's not really the case, or
it will be the case half the time.
(09:17):
And then the other half, the agent will make a very
reasonable guess that just doesn't happen to match with
what you wish for in, in kind of the, the inner sanctum of your
heart, right? You need to communicate these
things and tell the agent what you want, and that can be harder
said than done. Sorry, easier said than done
because you probably have a lot of requirements and constraints
(09:38):
in the software that's surrounding this agent, right?
Code needs a lot of constraints, a lot of regularity.
If you want the agent to provide it, you have to communicate what
you want and what the software system needs.
We've had some guests on that just try to emphasize the, the
LLM is a stateless machine, like input/output.
And we'll have some people who, rather than like engage in
a chat back and forth and reply and try to clarify
further, all they do is edit the original
(10:00):
message because that's going to be the most likely way to really
tweak it to get the proper response from the LLM and then get
that first output to be the correct one.
And then everyone just mentions, you know, look at your data; like
the best eval is just ask it something, see what comes out,
and then try to iterate on there.
Have you found going through that process of kind of treating
it with like a one in one out mentality helps at all?
(10:22):
Yeah, so this might get a little abstract, but generally I like
to think of, you know, multi-step as either sharing work that,
you know, needs to be done in kind of multiple steps, or a good
match for when you have information to gain.
And so if the agent needs to learn something about some
(10:45):
external system or even you, the human, then it's great to have
that conversation. If you're just trying to refine
an output that it could produce in one shot, and that's kind of
one single task, then it should be done in one shot.
Because one of the kind of iron laws of LLMs is that they don't
(11:06):
scale well across context, right?
Attention is quadratic, so the longer that context gets with
unnecessary information, the less likely it is that it's
going to produce the outcome you want.
Yeah, and that's actually been not, not to derail the
conversation too much, but that's been one of the qualms
with MCP where people like more servers, more tools thrown at it
and all of a sudden you're really bloating the context
window with these things that aren't relevant.
(11:26):
So trying to find a tool or a solution to kind of orchestrate
and just only give you what you need is kind of like a layer
on top. But I digress; back to, sorry,
the pitfalls. Yeah.
That's actually essentially the gist of what I would say the
second biggest pitfall is, which is they are setting up their
agent for failure, right? So on kind of one end of the
(11:49):
spectrum, you're giving your agent no information about how
to succeed. And then on the other hand,
you're dumping a ton of irrelevant information into one
LLM call, and that's creating problems for you.
So this changes with each model release.
But my kind of rule of thumb is that if you're doing something
really hard with an agent, you probably don't want to put more
(12:12):
than 10,000 tokens of context into the window.
And there are some evals out there, like MRCR, that kind of
capture roughly what that number is.
And maybe you can check how, you know, new models stack up there.
But that's kind of what I found. And what a lot of people do is,
is they'll kind of write an agent harness that provides the
agent a reasonable amount of context most of the time.
(12:35):
But then, you know, maybe one step in 20 or in half the cases,
it'll just dump a ton of context, right?
Because there's some edge case they didn't really think about.
And now it has like 2,000,000 tokens of things to sift
through. And that's going to create a
problem for the agent. It's not going to succeed.
It's going to derail. And kind of frustratingly,
it's really unlikely that a model is going to come out
(12:58):
that's going to succeed at 2,000,000 tokens in an agentic
task. So that probably won't save you.
It's unlikely that you're gonna like prompt engineer your way
out of that. Really, it's that iron law that
these things scale quadratically.
You just need to fix your software.
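A minimal sketch of what "fixing your software" can look like here: a guard that flags any step whose prompt blows past a rough token budget, so the occasional blowup surfaces in your logs instead of silently derailing the agent. The call_llm callable and step_name label are placeholders, not a specific framework, and the ~10,000-token ceiling is just the rule of thumb from above.

```python
import logging

logging.basicConfig(level=logging.WARNING)

TOKEN_BUDGET = 10_000  # rough ceiling for hard agentic steps, per the discussion above


def rough_token_count(text: str) -> int:
    # Crude ~4 characters-per-token heuristic; swap in a real tokenizer if you have one.
    return len(text) // 4


def guarded_call(step_name: str, prompt: str, call_llm):
    """Run one LLM call, but warn loudly when the prompt blows past the budget."""
    tokens = rough_token_count(prompt)
    if tokens > TOKEN_BUDGET:
        # This is the one-step-in-20 context blowup that otherwise goes unnoticed.
        logging.warning("step=%s prompt is ~%d tokens (budget %d)",
                        step_name, tokens, TOKEN_BUDGET)
    return call_llm(prompt)
```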
The hard thing about that is if your software is injecting all
that context one in 10 times, it might take more than one try
(13:20):
right, at like just looking at the data, like you said, to
figure out where those things are happening, where those
problems are coming up.
And so often you won't even know.
You'll just see that the agent failed.
And then you'll kind of ask why.
Yeah, I think we kind of got, not necessarily misled, but the
needle in the haystack test showing, oh, it can fill all
200,000 tokens of context and it still works, really just led us in the
(13:41):
wrong direction for expectations around how much we can actually
feed in there, especially when you deal with a pipeline like
you said. So do you have any strategies or
approaches for helping speed up the debugging process of
trying to identify where those bad LLM calls happen?
Yeah. Yeah.
So one kind of tool that we use internally, that we give our
agents, is just, you know, a verifier or kind of LLM as a
(14:06):
judge as a tool, right? That tends to be really helpful.
It's just like a first pass look at, you know, the first few
trajectories that you have. And it's very easy to have an
LLM tell you if it looks like there's unnecessary context
being provided, right? Or if the LLM is not being given
(14:26):
the directions it needs, especially if it's looking at
the full context. So that's what we use.
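Roughly what such a first-pass verifier can look like, assuming judge_llm is whatever model client you already have; the two checklist questions mirror the ones he mentions, and the JSON shape is illustrative rather than a fixed API.

```python
import json

JUDGE_PROMPT = """You are reviewing one step of an agent trajectory.

Prompt sent to the agent:
{prompt}

Agent output:
{output}

Reply in JSON with boolean keys "unnecessary_context" and "missing_directions",
plus a short "notes" string."""


def judge_step(prompt: str, output: str, judge_llm) -> dict:
    # judge_llm is any callable that takes a prompt string and returns the model's text reply
    return json.loads(judge_llm(JUDGE_PROMPT.format(prompt=prompt, output=output)))


def first_pass(trajectory, judge_llm) -> list[dict]:
    # trajectory: list of (prompt, output) pairs from a single agent run
    return [judge_step(p, o, judge_llm) for p, o in trajectory]
```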
But I've actually talked to a lot of agent builders that maybe
they're really sophisticated or they like to do things
internally that they've done the same thing.
They find a really good way to convert their traces into some,
you know, clean text format and then they'll throw it into o3 or
o1 or what have you. And that really helps them.
(14:49):
So I'd say like whether you use a tool or you, you do it
yourself, a lot of people are having LLMs kind of give
them this first pass on, maybe call it like an eval set of 10
questions. And if you're not screwing up on
those 10, then, you know, that doesn't mean you never screw up, but it
probably means you're in a decent place.
If your agent is not working, you're going to find out pretty
(15:11):
soon, yeah. I was talking to some people at
Anthropic and they said about 10 to 15% of startups actually use
evals and the vast majority don't.
And it's kind of like going back to the days of like, what's your
threshold for unit testing? But it, I think part of the
problem is just how intimidating it is to people, being like, oh,
I got to set up this big eval suite, go through all these
things, implement LLM as a judge.
But really just having something, whether it's basic
(15:32):
asserts that do a deterministic test or being able to pass
certain inputs, creating that golden list of prompts
with desired outputs just to kind of do a random test, is
going to be massively helpful. Just getting something in. Have
you found developing your internal evals has been heavy
lifting, or do you think there's been something like an easy,
easy win, low-hanging fruit, that people can do? Evals
(15:53):
Are very important to me and so I kind of took an approach that
definitely involved some heavy lifting, although, you know, now
it's very nice to have that.
If I were making suggestions to somebody who wanted to kind of
move a bit quicker, I would ask them to kind of think of like
mile posts or kind of mile markers or signposts of success
(16:16):
for their agent, maybe just a handful, and just write evals for
that or write some kind of dev setting where you can see what
the agent is doing. So I'll be, I'll be more
specific. I was working on this one agent
where it was one of those kind of vertical agents.
It needed to produce a work product that was really
valuable. That was actually kind of hard.
(16:37):
And when I first developed the agent, it never produced the
right work product, which is actually kind of expected
because it's like, there's a lot that goes into the right
work product. But what it could do is it could
find like the right context for succeeding 30% of the time, right?
And then just every once in a while, it would like generate a
(16:58):
kind of close enough work product.
And, and so maybe call that like mile marker 1, mile marker 2.
And then there's kind of the final destination. That was it,
like just capturing those three, those 3 evals.
Did it find the right context? Did it get to like a decent
starting place? And then did it get the right
answer? That was really essential
because it was getting 0 on the final result and if I made even
(17:20):
minor improvements, it probably would stay at zero.
I wasn't getting a lot of signal.
I wasn't learning what worked, but that first number, of whether it
got the right context, did go up actually quite soon, right?
Because it's kind of the first mile marker.
So as I made the agent better, I got clear signal.
OK, this is helping, right? And then once that first mile
marker saturated, the second one started going up, right?
And then at the end, once it always got the first starting
(17:42):
place, then I could kind of do that last mile tuning and the
last number started going up. And so is it kind of a pain to
like set up those 3 prompts and then, you know, run them over
the trajectory? Yeah.
But you know, looking back, if I hadn't done that, I think I
probably would have gotten nowhere.
And so sometimes it's a question of like, you do
what you got to do. Now, to kind of close that thought
(18:05):
or to kind of, you know, add some nuance.
Would I personally add like 20 different evals?
So like, is it like nailing this detail, that detail?
I mean, I do because I have all this infrastructure set up.
Would I do it if I didn't have that?
No, I would find like really the bare essentials, you know,
(18:26):
break it down kind of the way I just described and stick with
that. And as long as they provide you
a lot of signal, you don't need more.
If you're bad at thinking about the problem and you're bad at
setting up those evals, then yeah, you might need 20 because
each of them kind of sucks. But if you think through it, I
think you can really get a lot with just a few evals.
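A sketch of the mile-marker idea as three cheap checks per run, scored separately; the RunResult fields and the keyword/equality checks are invented stand-ins for whatever "right context", "decent start", and "right answer" mean for your agent.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    retrieved_context: str
    draft_artifact: str
    final_artifact: str


def marker_1_right_context(run: RunResult, expected_keywords: list[str]) -> bool:
    # Did the agent pull in the context it needed? (keyword check as a stand-in)
    return all(k.lower() in run.retrieved_context.lower() for k in expected_keywords)


def marker_2_decent_start(run: RunResult) -> bool:
    # Did it produce a "close enough" draft at all? (replace with your own notion of close)
    return len(run.draft_artifact.strip()) > 0


def marker_3_final_answer(run: RunResult, reference: str) -> bool:
    # Did it actually land the final work product?
    return run.final_artifact.strip() == reference.strip()


def score(runs: list[RunResult], expected_keywords: list[str], reference: str) -> dict:
    # Track each marker's rate separately so early improvements show up as signal
    n = len(runs)
    return {
        "right_context": sum(marker_1_right_context(r, expected_keywords) for r in runs) / n,
        "decent_start": sum(marker_2_decent_start(r) for r in runs) / n,
        "final_answer": sum(marker_3_final_answer(r, reference) for r in runs) / n,
    }
```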
Yeah, totally agree. Something's better than
nothing, and just being able to iteratively go through it is super
(18:48):
important. I wonder if we can go a little
into detail about that. And I know with the different
aspects there's a different answer, but to move that first
mile marker up to get better reliability and performance out
of it. Did you just do kind of like
random changes, or more methodical A/B testing?
How can someone kind of like put a process in place to improve a
specific eval at, let's say, the prompting level? I think
(19:10):
that the first two problems I identified are good to have just
visibility on. And so the first step I took in improving
that agent that I mentioned was I literally just printed what is
the, like, P50 and P90 token count going into the prompts
throughout the trajectory. And I kind of just whittled
(19:31):
those down. And as I went from like P90
being 40K, which is a lot, to like 8K, the number just
steadily went up. And it was very, very simple and
straightforward, and the progress was very steady.
It's like, why wouldn't I start there?
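The P50/P90 printout he describes can be as simple as something like this; the four-characters-per-token count is a rough stand-in for a real tokenizer, and prompts_by_step is assumed to come from your own trace logs.

```python
import statistics


def rough_token_count(text: str) -> int:
    return len(text) // 4  # rough heuristic; use a real tokenizer if available


def report(prompts_by_step: dict[str, list[str]]) -> None:
    # prompts_by_step maps each step in the trajectory to the prompts it sent across runs
    for step, prompts in prompts_by_step.items():
        counts = sorted(rough_token_count(p) for p in prompts)
        p50 = statistics.median(counts)
        p90 = counts[int(0.9 * (len(counts) - 1))]
        print(f"{step}: P50={p50:.0f} tokens, P90={p90} tokens")
```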
And you could imagine maybe it's a little more difficult, but you
(19:53):
can imagine just putting a simple Gemini call, just saying,
like, does it look like I'm asking the LLM to do something
and I'm not giving it the right directions?
Maybe that could be the kind of the second counterpart.
OK. So once those are kind of dialed
in and you're getting decent performance, the second thing I
did is I would ask an LLM, OK, the
(20:13):
agent screwed up, and you could just copy and paste this into
ChatGPT if you want, or you can write software for it.
Why? And it would spit out an answer,
right? And it would spit out an answer
for each of those runs. And then I, as a human, would be like,
OK, it looks like half my problems are basically all the
same. So let me fix that.
(20:33):
And you fix that by just thinking about how your software
works and then going to the source and trying prompt
changes. And that that kind of basic
approach, I think, really is kind of the right balance.
It does take a little bit of work.
You need to get an AI that's going to tell you that one
second summary of where it went wrong, but having that perspective on
(20:56):
like, here's twenty runs and here's the thing that actually
is screwing me over, or sorry, here's the thing that's actually
giving me trouble. Let me focus on that and just work through it.
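A hedged sketch of that triage loop: one-sentence "why did this fail" summaries per run, tallied so the most common cause floats to the top (in practice you would still eyeball or cluster the summaries). explain_llm is a placeholder for whatever model you use.

```python
from collections import Counter

WHY_PROMPT = """The agent transcript below is from a run that failed its task.
In one short sentence, name the single most likely reason it failed.

Transcript:
{transcript}"""


def triage(failed_transcripts: list[str], explain_llm) -> Counter:
    # explain_llm: any callable that takes a prompt string and returns the model's text
    reasons = [explain_llm(WHY_PROMPT.format(transcript=t)).strip()
               for t in failed_transcripts]
    return Counter(reasons)


# e.g. print(triage(failed_runs, explain_llm).most_common(5)), then fix the top item first
```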
The high level principle here is
that software systems like agents, it's kind of like you
have to think about the rate limiting step.
There's a lot of ingredients that are going into the recipe.
You need good scaffolding. You need good AI, like a strong
(21:20):
enough AI model. You need a good prompt.
You need all these things and you shouldn't try to improve all
of them at once. You should improve the thing
that's holding you back right now.
And once you fix that, there'll be another thing that's holding
you back, right? So you will solve all the
problems, but you do them in the right order. And having the
visibility into, OK, this is the problem that is affecting me 50%
(21:44):
of the time. It helps you do that
prioritization. Absolutely. You gotta, it's, it's the same
rule of entrepreneurship. Find the absolute bottleneck,
solve that and on to the next one.
Yep. For observability,
do you want to share some insight into how people can get
better observability into their flows?
Because I imagine a lot of especially people new to
building, there's input, there's output, and then potentially
black box in the middle. Maybe they want to try to break
(22:05):
it up to get some more points of accessibility to kind of observe
the data. But what are your thoughts
around that? Yeah.
So I think, for agents, the main refinement
to kind of the tried and true OpenTelemetry provider form
factor is: make sure that you're seeing what the tools are doing.
And to caveat that, or to even refine it further:
(22:25):
Sometimes you don't just wanna see what is going into the tool
and coming out. If the tool's interfacing with
some environment that you've set up or a bunch of software in
your back end, you might wanna observe that too.
And I haven't seen a lot of, like, solutions out there;
I'm sure ones that help you do that in production will come.
So that might involve, you know, just some simple logging setup.
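As one possible shape for that simple logging setup, a wrapper that records what goes into a tool, what comes out, and a snapshot of backend state on either side; snapshot_state is a placeholder hook for your own environment inspection.

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)


def logged_tool(tool_fn, snapshot_state=lambda: {}):
    """Wrap a tool so every call records inputs, outputs, and surrounding backend state."""
    def wrapper(*args, **kwargs):
        record = {
            "tool": tool_fn.__name__,
            "args": args,
            "kwargs": kwargs,
            "state_before": snapshot_state(),
            "ts": time.time(),
        }
        result = tool_fn(*args, **kwargs)
        record["result"] = result
        record["state_after"] = snapshot_state()
        logging.info(json.dumps(record, default=str))
        return result
    return wrapper
```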
(22:47):
But if you know what's going into the software that your
agent is calling, and you know what's coming out, and you know a
little bit about what's going on and if that's causing
problems, then that, with the prompts going into the AI
and the messages coming out, is probably all you
need. And so I think people can kind
(23:10):
of get a lot of headway using, maybe you call it, LangSmith or
Langfuse or kind of these providers, and just getting a
sense of what's going on. There are so many
different flavors of tools. It's kind of like whatever
people prefer just to get that insight.
Pulling it back a little bit to the initial point you brought
(23:31):
up, which I thought was super important, was being able to
tell the LLM exactly what you want.
And there have been people who come on and say they do
different strategies for structure in their prompt,
whether it's using XML tags in certain ways.
Some people say that makes no difference whatsoever.
Do you go through a process or is it kind of intuitive now how
you can best most clearly articulate the objective to the
LLM? The way I prompt is... you'd think, having worked
(23:53):
on a paper on prompt
optimization, I would like try to optimize every prompt that I
write. And what I found is that I just
have so many prompts throughout my software because we build AI
for other people's AI, that it's actually easier to build prompts
for humans, for me, right? And actually, it's kind of more
(24:14):
likely that I will communicate improperly, that the message will
not be sent, than it is for the AI to poorly understand what I'm
communicating, right? It's like, usually the error is on
the sender side, not on the receiver side.
So I'd rather just write a prompt that helps me write
really good prompts consistently and maintain them.
(24:35):
And the way I do that, and there is some kind of research, I think,
that was guiding my thinking, I'm not gonna dump that on you,
but hopefully there's some rhyme to the reason:
I break it up into here's what you're doing as an AI.
Here's kind of the premises, ground rules, here's the setting you're
in, here's what I'm going to name things, like, here's an
(24:57):
entity that I'm going to name. Here's the user, and the user is
XYZ, and here's me, the designer, whatever, right?
Just here's what's going on. I call those premises.
Then I outline what I want the agent to do.
So those are kind of like goals, right?
I want you to do XYZ, and kind of here's what's important to me,
here's my preferences, here's what I want. And then at the
(25:18):
end, there's constraints. Don't do these things or make
sure you follow these rules, right?
And you might say, well, that's kind of like goals.
And I just find it to be helpful for me to break up what I want
and what I don't want into the two categories.
And I've just found that that breakdown helps me consistently
articulate what I want to the model.
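One way to encode that premises / goals / constraints breakdown as a reusable template; the section names follow what he describes, and the example content is invented for illustration.

```python
# A simple template reflecting the premises / goals / constraints breakdown
PROMPT_TEMPLATE = """# Premises
{premises}

# Goals
{goals}

# Constraints
{constraints}"""


def build_prompt(premises: str, goals: str, constraints: str) -> str:
    return PROMPT_TEMPLATE.format(premises=premises.strip(),
                                  goals=goals.strip(),
                                  constraints=constraints.strip())


# Hypothetical example content, just to show the shape
example = build_prompt(
    premises="You are a support triage agent. 'Ticket' means an entry in our issue tracker. The user is a paying customer.",
    goals="Decide whether an existing ticket covers the complaint; if not, draft one.",
    constraints="Never promise a release date. Do not edit tickets you did not create.",
)
```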
(25:41):
Nice. So
I've actually done something really similar, which is super
reassuring. Where I actually have in my app,
there's just like a templating system where there's a default
template with these different areas and then like an
abstracted prompt that has to fill out each variable that
gets, you know, compiled into the prompt.
How do you manage prompt organization if you have a whole
lot? Do you have like a prompt folder
with just them, like one per file, or how do you try to
(26:03):
structure that? How?
I write software for, you know, AI systems is, it's very modular.
So we have this kind of custom language where we put in Python
and we put in prompts, and that's really easy for our AI to
work with. So our AI can actually write an
(26:24):
entire agent end to end in just one shot.
It's really hard to do that in what is kind of like
freewheeling, anything-goes Python.
It's a lot easier if you have a structured format to do this in.
So I think it's kind of hard for me to say, because at this point,
like, a lot of that setup is designed for an AI system.
I do like, you know, storing prompts and other artifacts in
(26:45):
some kind of like text or YAML format. From my perspective, like
the prompt isn't code and actually the agent code isn't
code. They're kind of part of this AI
system that you're iterating on and you might optimize, and it's
kind of like a set of weights. They're there to be changed, but
if you only have one or two prompts and you have other
priorities as a software engineer, then, you know, put them in your
Python and it's a different story. I think it might be an
Python And different story. I I think it might be an
exception there. We go, well, it depends on the
use case, depends how complex the system is and what you want
to do. The next question we'd ask is
also very dependent on the system.
But when you're doing like a multi step process, have you
found there's either a consistent approach, or some
situations where doing a complete reset of the context and then
(27:27):
giving like a fresh one to the next step?
So to better explain it: you get the result from the first LLM
call, extract one core piece, and instead of putting the whole
conversation history into the second call, you just take this
piece along with the brand new prompt and kind of like start a
fresh conversation. Yeah, so.
I'll be honest with you, I don't personally build a lot of
conversation-based agents. I have some users that do and
(27:53):
nothing against it, it's just not my preferred approach.
My preferred approach is to control the context completely.
So kind of what you said where information goes into a model
and then comes out, and then code transforms it and decides what
the next call will be. And I, I know exactly what's
(28:13):
going where. And sometimes there's like skip
connections where, like, you know, information from step one
about what happened will go to step 3, but not step 2.
But all of that is decided by software, not just the
conversation history that you have to like compact.
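A toy sketch of that software-controlled flow, including a skip connection where step 1's finding goes straight to step 3 without step 2 ever seeing it; llm is a placeholder callable, not a particular framework.

```python
def run_pipeline(task: str, llm) -> str:
    # Step 1: extract one key fact; only this step sees the raw task framing for it
    finding = llm(f"Step 1: extract the key fact needed for: {task}")

    # Step 2: plan the approach; it never sees step 1's finding
    plan = llm(f"Step 2: outline an approach for: {task}")

    # Skip connection: step 1's finding and step 2's plan both flow into step 3,
    # but no step carries the full conversation history.
    return llm(
        f"Step 3: using this fact: {finding}\n"
        f"and this plan: {plan}\n"
        f"finish the task: {task}"
    )
```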
I do understand why it's really simple to just write a
conversation history because the info is all there.
(28:33):
I think it tends to introduce problems to the context if you
don't have control over what info goes in; you have to give it
everything, and then you get degradation. Do you have any
strategies or mitigations against potential malicious
prompts coming in, or even some of your customers or colleagues
that do more of the conversational ones like I've
seen. Conversational agents, I haven't
seen a lot of people doing like safeguards against some kind of
(28:56):
injection. I think there's actually a lot
of good like open source research and tools for detecting
that sort of thing. And you can use really small
models that are really fast for flagging it.
And so I'd say if it's a priority, like go look there.
But if you're a small startup, it's probably not a priority.
And in that case, like it's just not gonna be on your radar and
(29:17):
just. Being able to do, whether it's an
LLM as a judge, or, one thing I learned, is strip emojis, cuz
plenty of the literature shows that you can embed some hidden
text into the emoji Unicode character.
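A small sketch of that kind of input tidy-up: dropping emoji, variation selectors, and Unicode tag characters (the usual carriers for text hidden inside emoji). The ranges are illustrative rather than an exhaustive sanitizer.

```python
import re

HIDDEN_OR_EMOJI = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # most emoji blocks
    "\u2600-\u27BF"           # misc symbols and dingbats
    "\uFE00-\uFE0F"           # variation selectors
    "\U000E0000-\U000E007F"   # tag characters (can encode hidden text)
    "\U000E0100-\U000E01EF"   # variation selectors supplement
    "]+"
)


def tidy_input(text: str) -> str:
    # Strip emoji and invisible carrier characters before passing input to the agent
    return HIDDEN_OR_EMOJI.sub("", text)
```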
So it's just like little things,just like tidy up your input.
Just try to keep things safe. One thing that I'm kind of
curious about, and this might be more of like a hypothetical, is
how do you envision people interacting with agents?
(29:38):
The chat interface is getting kind of old.
People are saying it's like, you know, the equivalent of like the
terminal back in the day. So how do you foresee the
interface kind of evolving over the next couple years?
I think it definitely depends on how you're using this agent.
I tend to suspect, you know, we mentioned these two form factors
where it's an agent working on something and then there's an
(30:01):
agent kind of responding to some new fact about the world and
kind of getting things in order. I tend to suspect that in a lot of
cases, people will be interacting with the latter.
So there'll be kind of some AI system that is owning some piece
(30:23):
of your organization or something important to you as a
consumer. And it is not going to be
designed for doing a lot of work.
It is going to be designed for maintaining that status quo
really effectively and drawing on other resources to do any
work that needs to be done. And I tend to think that it'll
(30:44):
be AI like that, that we'll be calling these worker agents,
right? So it can get kind of old to
call a lot of these agents one by one, right?
And I've found that internally, we have started
adopting this form factor. We have like 1 agent that kind
of calls these worker agents. And that agent doesn't do a lot
of work. It just kind of manages and
triages. And I think in that case,
(31:07):
interacting with these systems will be updating the status quo
or updating priorities or updating what you want it to
achieve when it responds to the inputs, right?
And so maybe if you have a software engineering agent
interface and all these Claude Codes are kind of humming away,
(31:30):
you'll say, hey, we just did our sprint planning and here's what
we wanna get done. There's some question marks for
you to fill in. You know our priorities, you
know our security posture to make the right trade-offs there.
Go for it and come back to me if there's something I need to
clarify. It's still text, right?
(31:53):
But it's not text that you have to rewrite every day.
It's actually something that evolves over time and you have
memory and you're not losing progress.
And it really becomes a process that isn't this kind of one off
thing in a chat terminal. Yeah,
I'm, I'm super excited for it along with everything getting
(32:13):
intelligence embedded and operating in the background. And to
use the code example: code gets pushed up, and part of the PR process
could call the doc-writing agent that updates the documentation
with the changes. And just like a lot happening in
the background. We had one guest on, Jake, and he
was talking about the idea of having kind of like almost like
a Factorio-type thing where you have all the agents operating
and you kind of get like the 30,000-foot view of it.
(32:33):
So I'm super curious how it willevolve.
I'm not a UX guy by any means, but there's a lot of potential
just for, we can do generative UI, even things popping up based on
what the context is. And they don't have to always
have a text box or always have a graph, but sometimes it's got to
show. Yeah. I'm definitely not design
oriented. I don't know what is going to
feel the most intuitive for the people using these systems, but
(32:55):
what I can feel pretty confident about is that, kind of going back
to what I said before, As these systems do more ambitious tasks,
they're going to need a lot of information about what their
owners want them to do. At some point, we're going to
get to a stage where it's just going to be impossible to
reiterate all that information in a new chat session.
(33:16):
And at that point, I think a really good UX will help the
team build up that context over time and reference it and manage
it so they can do these really ambitious things. You mentioned
a couple of internal systems or internal tools you have, and I
think that's going to be one of the bigger proliferations too.
What are your views on the build versus buy debate?
Do you think people should say, try to bring in one AI engineer
(33:38):
as a new hire and have them start building tooling?
Should they just go and buy off the shelf things?
Do you have any thoughts on that?
Yeah. So one really fun thing about
working in AI is that things move fast, right?
The tools that we have today are just so much more powerful than
the tools we had months ago. What that means is that
(34:00):
providers can move fast too. They can take these new models
and package them in effective UIs.
But it's my personal belief, just an observation, that it often can
take time for a provider to figure out just the right form
factor that works for everybody,right?
OK, o3 came out. So what?
(34:21):
Like, how do we, how do we put this into a product for
everybody in the world? It might take months to even
just do the discovery to be like, OK, here's how they want
it, here's how they want it. Let's combine that.
So, on the flip side, if you have one AI engineer internally, he could just figure
out how you want to use o3 in a weekend.
He doesn't have to balance the whole world of companies kind of
(34:43):
competing priorities and yada, yada.
And he he just has to balance your priorities.
And he might be able to deliver like the kind of o3 software for
you really rapidly, because he has o3 helping him.
And that might set you six or nine months ahead of where you'd
be if you were just waiting for the kind of, you know, off the
shelf solution, which can be really useful.
(35:07):
Right. And that dynamic, it might be
more important now just because of how much progress can be made
over nine months, right? Like maybe five years ago, if
you had to wait nine months for a piece of software, it's like
no big deal. But for some people, in some
context, it might matter a lot for them.
If you don't care about that, like if something is not a
huge priority or you just think that it's good enough and you
(35:30):
can think to yourself like, yeah, if I had to use the
solution that I could have done myself nine months ago,
if that's good enough for you, then just buy an off the
shelf solution in general, right?
I think maybe the one caveat I would add to that, right?
Because you know, as a vendor, hopefully there's a good reason
(35:54):
for people to use me. It is probably going to be hard
for an AI engineer to write this software that we've written.
So that is probably the biggest counterfactual
counterpoint: if the tool is 10,000 lines of code,
(36:14):
DIY is a great solution if it's important to you to be on the
bleeding edge. If it's a million lines for 500K
you obviously can still do DIY, but maybe at that point the
calculus is different.
I really agree with the explanation.
I think it's so dependent on the situation.
Having a hacker in residence or someone that can spin up an
internal tool seems like a no brainer for me.
(36:34):
Anything customer facing you need a bit more reliability
because we both know you can get to 80% overnight, but to get up
to 90, 99, a few nines, it's going to take a lot more effort.
And having people, especially a team of smart people dedicated
to a single problem, can probably solve it better than someone
with the assistance of an AI. But it really just depends on
the situation and what the expectations are.
So I hope more people build their own software.
(36:55):
And I think that's the trend we're going towards because I
don't know, you get tools like Slack, everyone just uses it for
messaging. But if you've got something
hyper specific to your use case that's ingrained in all your
products and you don't have to worry about setting up all these
integrations, it seems like it'll move a lot faster.
One thing I'm curious about is your approach, whether it's any
workflows or strategies for when you want to figure out a new
feature. So you have a general idea of a
(37:17):
product and you want to start expanding it.
How do you figure out like what you should add, what's a waste
of time, anything like that? Yes, this
is kind of what I was saying, where like software has changed
a lot, or AI has changed a lot in the last nine months.
My current approach is I work with O3 to find out kind of what
I want to get out of this feature and then quickly
(37:40):
transition to how should this be done.
Here's like a bunch of context about my code, both front and
back end, yada yada. And here's kind of a document
describing what my company like cares about and whatever.
I'm not an expert at every piece of infrastructure, every like
(38:01):
database provider or whatever. So I'll come up with a plan
that's kind of constrained to what I'm familiar with, but what
do you think is the right approach?
What's the fastest approach? What like gives you leverage?
Are there new tools I should adopt?
And what I find is that it tends to do a really good job at
giving me something that is kind of the shortest path to
(38:22):
getting the outcome. And then I have to like filter
out the unnecessary bits, because AI tends to be kind of
over-eager and it'll add things that aren't necessary.
But I just find that to be so useful to have this resource.
It's almost like an AI consultant.
They could just tell me how to do things right in areas where
(38:44):
it's not my subject matter expertise.
And that's been really fun. Yeah, it's fun.
Having a little brainstorm partner, you mentioned O3, do
you use any other models for certain tasks or what's your
toolkit these days? For like producing a lot of structured
data, I tend to find Sonnet 3.7, which is not a model I love, and a
(39:06):
lot of people notice it's kind of reward-hacky if it's like max
thinking, is so good at producing structured outputs.
And I mentioned we produce agents in one shot.
And so that could be like 30,000because you have a lot of
prompts, you have a lot of code. And it's just incredible to me
that like often it can get there for the first pass, right?
(39:27):
And then we'll refine it. There's really not many models
that could nail that much code in like a custom format and
maybe except for like for me. So I'd say those are like really
good with working with kind of custom rules and structured
outputs. And then for reasoning, o1 Pro,
I mean, I can never afford that in the API, but if you use it
(39:50):
personally it's definitely the place to go.
And O3 is kind of great for reasoning about how to approach
a problem or what have you. In terms of agent harnesses, I
still like Sonnet 3.5. When I do fine-tuning and RL, it's
not on the table, so often I'll do GPT-4o or 4.1 now, or some
(40:12):
of the open Chinese models that you could do RL and preference
tuning on. And, and those are kind of great
in the harness. I haven't tried to build a lot
of agents with o4-mini and o3.
things where like we're almost there.
o1 was like unusable in an agent harness, at least for me.
(40:33):
o3 feels a lot closer. I wouldn't be surprised if o4 just
becomes that form factor, of like RL models just becomes the way
you do agents and everybody uses that approach.
But for now, I'm still sticking with Sonnet,
still sticking with 4.1, testing out o3.
OK, there's a few other models that I find useful.
The Gemini models. I love the Gemini models for
(40:56):
long context. I mentioned that usually you can
only give an agent model a small amount of context.
The ceiling is just so much higher for Gemini models and it
always has been. I actually have a benchmark on
my GitHub from a year ago, and they were crushing it then;
they're just obliterating everybody on handling long
context, and they're still crushing it now.
I don't know what it is. They have some secret in their
(41:18):
post training recipe, but they're really good at handling
a lot of context. They're not good at tool
calling. So I don't use them as a main
agent. But if I need to process a bunch
of information and weed out irrelevant details that the main
agent doesn't need, those are the ones I'll use.
And you know, there's kind of you could really go fractal here
and kind of think about more detailed capabilities, but I'd
(41:39):
say long context, Gemini; agents, use Sonnet or the GPT models;
and then for reasoning, o3 and o4-mini feel really good.
Forget about o1. And I hope that encourages people to
just play more because you had to go through each one to kind
of figure this out. If people have their own custom
eval set at home, all the power to them.
(41:59):
But even just going through some basic tasks, use a tool like
LiteLLM to swap out AIs or LLMs really quickly just to see.
I think it's super important to do, just so everyone can get
their own personal preference as to what works best in certain
situations. But that was awesome insight,
really appreciate it. Quick pivot, could you tell us
about Synth? Who should reach out to you, what
services do you provide, stuff like
that? Yeah, so.
(42:21):
Synth is designed to help people build better AI agents.
I mentioned there's a lot of things that go into making AI
agents and there's kind of this process to improving an AI
system and we can help you at different points in that
process. Probably the most accessible
product that we have, we called it Cuvier when it, you know,
first released; at this point I call it star data scientist,
because the cute name thing kind of wore off, and that's what it
because the cute name thing kindof wore off and that's what it
is. It's our data scientist.
What it does is it reviews what your agent is doing often in
production. But if you're really sensitive
about that, you could just do it on like fake eval questions so
that, you know, no customer data, you know, leaves your
(43:05):
secure set up. And in that case, we'll do a lot
of the work that I mentioned where we'll identify like, OK,
you're not telling your agent the right information.
Oh, you're dumping way too much context in your agent.
Like the token counts are blowing up in these contexts.
And then we'll do the next step,which is here are the patterns,
like here's the biggest problem that you're facing.
Let's fix that. And as information comes in,
(43:29):
maybe we'll identify new problems, like you expanded to a
new customer base and they're really trying to use your agent
in a way that you're not designing for.
They're getting a bad experience.
Some of those problems are subtle.
They'd be like, hey, we think that your agent is not
(43:49):
really making the most use of information
it gained early on. A lot of the problems are very
obvious. So it'll be like the end user,
their customer asked the agent to give an answer in Spanish.
They did a bunch of things. The game of telephone didn't
work out. At the last step, it just didn't;
it forgot about the Spanish thing and it gave it in English,
right. That's actually hard for an
(44:10):
agent to like keep track of all these priorities, but it's super
easy for me. I get to see like, here's what
the customer asked and here's the agent.
OK, that's obviously a problem. And you might think like, well,
I could find that myself. But if you're running 100 agent
instances, 1,000, one time 10,000, which almost broke my bank
before, you know, we made it cheaper for us to run this.
(44:35):
You can't read that many agent runs, but I can.
My AI can, and we can give you that sense of what to
prioritize next, right. So that's one system and then we
have a, a second system, which is like let's improve your
agents. There's a kind of asterisk on
this. OK, so how we can improve
(44:56):
everybody's agent is we can go ahead and change your prompts
right within the scaffolding you have, and we can help you fine-tune
the model or RL the model to some extent, kind of just drop-in
replacement. That is really helpful for some
people, right? And it's really helpful at
certain moments. It does constrain me.
(45:19):
So let's say your problem is not that you haven't fine-tuned your
model. Your problem is that your
scaffolding is horrible and you're dumping tons of context
into your agent. Unfortunately, under those
constraints, I can't help you. No matter what I change the
prompt to, it's not going to help.
And so I would say that those kind of capabilities are really
(45:41):
easy for me to give to somebody and sometimes they really help
and sometimes they don't. Then there's this kind of second
form factor, which is let me iterate on some of those
things for you and I'll find a solution that works.
So maybe we'll change the scaffolding, maybe we'll add a
(46:02):
prompt somewhere in the loop and then we'll come back to you and
say, hey, this, this approach works really well.
That's not something that we've found is like really easy to
package into a self-serve product.
So it's not self-serve. We have been working with, you
know, quite a few agent companies and getting to
(46:23):
self-serve, but it's not quite there yet.
That being said, you can make a lot of headway and there's some
pretty crazy performance gains that we've gotten for a few
customers. Like just for reference, we had
one, they were getting like 70% success in their agent task with
o3. They were getting like 10% with an
(46:43):
unfine-tuned model that was like GPT-4o.
They fine-tuned it themselves in an OK manner and I think they were
getting like 60%. And when I just added one step
and then did fine tuning, the way we're good at fine tuning at
this point, since we've been doing it for a while, we got up to 87%,
(47:04):
felt really good about that, right?
So if you really want really good performance and you're OK
to kind of figure out the UX, you can reach out.
If you think like your agent is already at 80% and it just needs
a little bit of fine tuning or prompt tuning and you don't want
to completely change things, then that's a lot easier for us
(47:26):
to kind of give to people. Josh,
This was a blast. Really appreciate you coming on
before I let you go, anything the audience should know, how
can they keep up with you? They can follow me on Twitter.
I think that's where a lot of AI researchers kind of like to
share notes, and I try to do that and try to share updates on the
product. Yeah, they should follow me on Twitter, and hopefully
(47:48):
we'll be sharing some open source.
I would say that's the biggest thing going forward.
We want to give back; we've been kind of working in the trenches for a
while at OIC, but I worked on, you know, the paper with the DSPy
folks. We open-sourced that software,
something I really love. So hopefully they'll be able to
catch me in the GitHub as well. I'd love to see it.
Thanks for the talk. We'll talk to you soon.
(48:10):
All right. Thanks, Mike.