
October 30, 2025 44 mins

Dan and Chris sit down (again) with Jared Zoneraich, co-founder and CEO of PromptLayer, to discuss how prompt engineering has evolved into context engineering (and while loops with tool calls). Jared shares insights on building flexible AI applications, managing tool calls, testing and versioning prompts, and empowering both technical and non-technical users in AI development. Along the way, they dive into coding agents and the “crawl-walk-run” approach to AI deployment.

Upcoming Events: 

  • Join us at the Midwest AI Summit on November 13 in Indianapolis to hear world-class speakers share how they’ve scaled AI solutions. Don’t miss the AI Engineering Lounge, where you can sit down with experts for hands-on guidance. Reserve your spot today!
  • Register for upcoming webinars here!

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Jerod (00:04):
Welcome to the Practical AI podcast, where we break down the real-world applications of artificial intelligence and how it's shaping the way we live, work, and create. Our goal is to help make AI technology practical, productive, and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're
(00:24):
in the right place. Be sure to connect with us on LinkedIn, X, or Bluesky to stay up to date with episode drops, behind-the-scenes content, and AI insights. You can learn more at practicalai.fm.
Now, onto the show.

Daniel (00:48):
Welcome to another episode of the Practical AI podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I'm joined as always by my cohost, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?

Chris (01:04):
I'm doing just dandy today, Daniel. Getting ready as we record this. We're approaching Halloween, so I'm getting in the spirit.

Daniel (01:12):
All sorts of sweet treats.

Chris (01:13):
Thank goodness the listeners can't see me. But yeah, I'm busy growing out my costume in this week beforehand. So I don't look so good. So, yeah, glad this is audio only. Thank goodness.

Daniel (01:26):
Well, you're in between. Yeah. You're getting ready for the festive season. That's okay.

Chris (01:32):
That's right.

Daniel (01:33):
Not quite to No Shave November, I guess, but you could stretch it out through November and...

Chris (01:41):
I'm starting to become the monster under the bed maybe, you know?

Daniel (01:44):
There you go. Speaking of sweet treats, our guest today has me thinking about cake, because a little while ago we had Jared Zoneraich, who is a cofounder and CEO at PromptLayer, join us. And, of course, I'm always reminded of their nice cake logo, which is making me hungry right now.
(02:05):
I'm wanting some dinner. But welcome back to the show, Jared. It's great to have you.

Jared (02:09):
Thank you. Thank you. I'm excited to be back. And a fun note for you: I'll tell you, we're designing our booths to go to conferences now, and I think we're gonna bring real cakes to the booths.

Daniel (02:21):
That's awesome. Oh, nice. You should. Definitely.
Nice.
You should.

Jared (02:25):
Into it.

Daniel (02:26):
Yeah. Yeah. Just, like, have those little... I see at the airport occasionally, they have those little vending machines where you can get cake out of the vending machine; it's in a little, like, plastic thing you pull open and eat while you're at the airport. So, yeah, I think that's a great idea. But I'm just looking. We last talked to you in March 2024, so episode 261.

(02:50):
Folks, go ahead and loop back and hear from Jared when we had this initial discussion, which was great. At the time, Jared, everyone was, of course, talking about... I think we were still in the days of everyone talking about prompt engineers, and prompting going crazy. Certainly people

(03:11):
are still talking about prompting, but maybe they've shifted focus in some ways. As you've been at some of the center of discussions around prompting and how people are engaging with AI, from your perspective, what has the year been like? How have things changed in people's perceptions

(03:33):
of prompting AI systems, or maybe even your own thoughts around prompting AI systems? And, of course, we can get into agent stuff later, but, yeah, any thoughts?

Jared (03:44):
Well, we're still called PromptLayer, so we haven't changed that yet. I don't think we will. But, yeah. So March 2024, almost a year and a half ago. It would be hard to even list all the big AI things that have happened since then. I mean, the one that immediately comes to mind, though, is reasoning models. So reasoning models... I think OpenAI

(04:07):
released o1. I think I saw a tweet that it was like a year ago last week, or something like that. And that was, I think, the first next generation of prompting. Meaning, before, if you recall our last talk, we talked about chain of thought. And now, I guess, the reasoning model does the chain of thought for you, and the models have just been getting better. But at the

(04:29):
core, the core way to think about LLM applications is an input to an output: the input to the LLM, meaning the prompt and the model, and then the output. That's all the same. I think it's gotten much easier. The models have gotten much better. They're easier to steer. And some of the weirdness of how you persuade

(04:50):
the model and yell at the model has gone away, and it's gotten a little bit more straightforward. But I think, yeah, I think prompting is still at the core of everything. And I will have to say the new word people love to say is context engineering, of course. And, to me, they're the same, but I think the main reason people like this word, context

(05:12):
engineering, is it's not just the prompt. It's not just the text.
It's how much are you putting in the text. Are you putting too much? Is the model getting distracted? Are you using RAG? Are you not using RAG? Are you throwing in a blob? Are you using multiple models? And I think my high-level, one-sentence summary of what's changed would be: we have way more tools at our disposal now, and we

(05:36):
have way more mixing and matching.

Daniel (05:38):
Interesting. Yeah. And I guess, how would you distinguish, or how might your thinking around context engineering... you know, maybe that is a term that's coming up. What do you think is just some jargon that's changing, and what do you think is a more substantive part of what

(06:04):
that means, and maybe why people are shifting language, if there is one?

Jared (06:09):
Right. I think the key around context engineering, and the reason... well, we're a little financially invested in prompt engineering because we made hats that say "prompt engineer" on them. Maybe...

Daniel (06:19):
You gotta

Jared (06:20):
that's where my bias comes from. Yeah. But maybe we'll make both hats. But I think context engineering really gets to this... well, one, context is much longer. So you can send so much more to a model than a year and a half ago. It's almost unlimited from a user perspective. And the question is, how do you send it? How do you

(06:42):
squeeze the juice out of the model? Models, I like to think, get distracted like us humans do. So how do you not lead it down the wrong path? And the core difference, though, in how we build stuff is... because models have gotten better and because we can fit more in the context, we've moved

(07:03):
from, and we've seen this with our customers, moved from complex DAGs or complex workflows, where you say this prompt goes to this prompt and this prompt goes to that prompt, to something a little more autonomous and a little more agentic and a little more of just a while loop with tool calls. Everything's a while loop with tool calls now. But let me...

(07:24):
that's a whole different topic. But let me leave it at that.

Chris (07:28):
I'm curious, just to pick up on that point, though. I'm guessing a lot of our listeners can really feel an association with what you're just talking about, that evolution. I know that in my own job it's not a primary responsibility, but as an ancillary one I often get asked to talk to groups in our company about how to prompt and stuff. And some of the things we had

(07:52):
put out, internal videos that I put out a year and a half ago, are completely obsolete. And so I'll come back to a group and I'll be like, that whole thing I told you before, don't do it that way anymore. It's gotten a little bit more capable, a little bit more sophisticated, drop the structured framework a bit. I'm curious, you know, so I'm kind of backing out of what I

(08:15):
told them a year and a half ago at this point. And I'm sure a year and a half from now, I'll be doing the same thing with what I'm doing today. So how do you address, when you're dealing with end users, how to evolve with the capabilities that the models and the infrastructure are providing these days? Because it's really hard to track that fast, quite honestly.

Jared (08:35):
It takes me many hours a day of scrolling on Twitter to keep in touch with what's going on. I would say... so I guess for context, for everybody, we're a platform for building and testing LLM applications. And people are using us to version their prompts, version their agents, evaluate them, log them, and generally just build an AI system
(08:58):
that's working. So we get askedthis question all the time of,
like, how do we track it? Whatshould we do today?
And, we have the same problem.We we have YouTube videos or
interviews where maybe I rewatchit. I'm like, oh, well, that's
not true anymore. Now you couldjust use this.

Chris (09:14):
Thank goodness it's not just me.

Jared (09:16):
Yeah. I mean, it's everyone. It's OpenAI too. It's Anthropic too. It's Gemini. I think the answer is to lock into the core truths of building LLM systems. And how we look at it is, it's a philosophy thing. I guess there are two competing ideas here, and I'm curious also to hear your opinions on this. But there's

(09:39):
the academic idea of how do you understand LLMs and how do you understand the context and how it works. And then there's the tinkerer, the builder philosophy, which we push people towards, which is: it's a black box. I don't need to understand it.
I just need my input to match to what I want in the output. And I

(10:01):
usually give a really annoying answer to teams who use us, because we get on a lot of calls with our customers. We're not a consulting product, we're software, but we love to help our customers, and we like to share knowledge. So we get on a lot of these calls, and almost always my answer to a specific question of, should I use GPT-5 or should I use

(10:22):
Sonnet 4.5 or should I use Gemini, is that I don't know the answer.
I don't think OpenAI knows the answer. I don't think Anthropic knows the answer. The only person who can know the answer is you, for your own use case, by building tests and checking and seeing what works. Basically, it's a black box. You just have

(10:43):
to try it out and see if it works. And that's the only repeatable motion here. I know it's an annoying answer though.
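
To make that "build tests and see" advice concrete, here is a minimal sketch of a model-comparison harness. It assumes the OpenAI Python SDK; the model names and the checks are placeholders for whatever your own use case needs, not anything specific to PromptLayer.

# Run the same test cases against a few candidate models and count which one
# passes your own checks. The only "eval" here is what the team itself defines.
from openai import OpenAI

client = OpenAI()

TEST_CASES = [
    # (input prompt, a check the team defines for its own use case)
    ("Summarize in one sentence: The meeting moved to 3pm Friday.", lambda out: "Friday" in out),
    ("Extract only the city from: 'Shipping to Austin, TX'", lambda out: "Austin" in out),
]

CANDIDATE_MODELS = ["gpt-4o-mini", "gpt-4o"]  # placeholders for whatever you are comparing

def run_comparison():
    for model in CANDIDATE_MODELS:
        passed = 0
        for prompt, check in TEST_CASES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            output = resp.choices[0].message.content or ""
            passed += int(check(output))
        print(f"{model}: {passed}/{len(TEST_CASES)} checks passed")

if __name__ == "__main__":
    run_comparison()
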

Chris (10:50):
No, it's good. And I think it's an answer that flows with the times, because anything that any of us say today is obsolete tomorrow, with the way this thing accelerates. So yeah, it's not as annoying as you might think.

Jared (11:08):
It's a framework for figuring out the answer instead of relying on us three. And it's, how do you think about testing? Because testing these is really hard, and figuring out if your output is good and if the prompt works is really hard. So how do you build heuristics, and how do you build evals, and how do you even... or how do you skip the eval? Like, just

(11:31):
how do you approach this is the interesting thing, at least to me.

Daniel (11:37):
Yeah. And I think I agree with you, Jared, in terms of, like, the spectrum that you were talking about. The reality is a lot of people don't need to know what, you know, self-attention means, for example, to build cool AI automations. I think there is, like, on the flip side of that,

(11:58):
I guess there would be this kind of... like, you need enough of a mental model to understand maybe the coverage of... to your point, like, I think this is partially why the testing is so hard, because sometimes it may not be clear to you, like, oh, if I, for example, change the formatting of my

(12:18):
prompt without changing any of the words, why would that change anything? And kind of these things around prompt sensitivity and other things, right? So some of that can be built up through tinkering, but maybe it also comes with this kind of mental model, maybe not so much of what the models are, but I find sometimes how they

(12:40):
operate in generating output is often a trigger for people to know, like, oh, well, now it makes sense maybe why things could go off the rails with high temperature or other things like that. So yeah, I don't know if you have any thoughts on that, but that's kind of my thought, that there's, like, a different

(13:01):
kind of mental model or intuition that you need, less so the academic kind and more so, like, how things operate.

Jared (13:08):
Yeah, 100% agree. I think you could have too much academic knowledge of how LLMs work, and that actually might hurt you in this, because you try to understand what you're doing, whereas a lot of these things are pretty hard to understand, and maybe people don't need... it's in the neural net somewhere. But I think what you're referring to, I like to

(13:31):
refer to it as an LLM idiom. So how do you understand the language of LLMs? I think the example you gave was good, about formatting. Like, JSON formatting has been a big topic people talk about: can you just give JSON to the model instead of a prompt, and how will that work? And I think it's a really illustrative example, because asking the model to return JSON or a structured

(13:54):
format will work really well if you wanna return precise numbers or precise values. But if you wanna write a love note, it's probably not gonna work as well, because the model is now in "I'm coding" gear. It's like... I always compare it to a human. I don't think AI is human, but I think there are some things we can use as metaphors. So it's like if

(14:17):
you go into an engineering school and you just give someone a paper and say, write a love note, they're thinking about code. They're not in the brain space of poetry. Maybe they are good at it, maybe they can do it, but you kind of have to step back and say, okay, this is what we're thinking about. In the same way, if you're asking the model to output JSON and output a key and

(14:40):
curly braces, it's probably not gonna write as good creative language in one of those values. And I can't write an academic proof for you on why this is true, but it's like an idiom. It's an intuition. It's a way to talk to the LLM without using so many words, I think, is where I'm kind of settling around.
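
A small sketch of that idiom in practice, assuming the OpenAI Python SDK (the model name is a placeholder): request strict JSON when you need precise values, and leave the prompt freeform when you want creative text.

# Structured output for precise fields vs. a freeform prompt for creative writing.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

# Structured: good for precise numbers and keys.
structured = client.chat.completions.create(
    model=MODEL,
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": "Return JSON with keys 'flight' and 'refund_amount' for flight UA123, refund 150 dollars.",
    }],
)
print(json.loads(structured.choices[0].message.content))

# Freeform: better for the love note, where curly braces would pull the model
# into "coding gear" and flatten the writing.
freeform = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Write a short, warm anniversary note."}],
)
print(freeform.choices[0].message.content)
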

Chris (15:01):
And you may have just identified why my last anniversary with my wife just went off the rails. You know, the JSON output for the love note. I thought I had it, but, you know, clearly not.

Jared (15:12):
Yeah. You gotta be careful with it. But it's really an important thing. I mean, when we're talking as humans, we have idioms where we can say one word, and you're gonna think of a whole different thing. If I mention Beethoven or if I mention Kanye West, you're not thinking of just the person. You're

(15:35):
thinking of everything surrounding them. And I think it's the same thing here. You're putting it in a probability space, and if you wanna be in this part of the probability space, it's gonna be a little bit more challenging.

Daniel (15:48):
So, Jared, I do wanna dig into something that you said a little bit earlier, which is that there has been this progression, maybe, from static workflows or prompt chaining, where you have, like, a DAG, which, for people who aren't familiar, is sort of a one-directional graph of the logic that the chain of calls to the AI system is going through.

(16:11):
You mentioned that shifting more to kind of a while loop with tool calls. So could you break that down? First of all, maybe for those that aren't familiar, what you mean by tool calling, and then what it means to have a while loop of tool calls.

Jared (16:29):
Yeah. So maybe the best way to start is to paint the evolution here. So, you just mentioned DAGs, these one-direction... basically graphs that are just a bunch of nodes that say this node goes into this node, this node goes into that node. The reason we started with that is because models were a little unpredictable because of hallucinations. If you were

(16:52):
at the beginning of this LLM craze, let's say two years ago, three years ago... maybe three years ago is too much, maybe it didn't exist. No, yeah, something like three years ago was ChatGPT. If you're United Airlines and you're making a customer support chatbot, you don't want to accidentally give people free flights. And to avoid doing that, the best way to do it would be to build these structured paths that the LLM can go

(17:15):
down.
So the first question would decide what the user was asking. And if the user was asking for a refund, it would go to the refund prompt. And this is kind of how, let's call them prompt engineers, context engineers, agent engineers, whatever you wanna call it, stopped LLMs from going off the rails. Now, what's changed is two big, let's call them, innovations. One, because

(17:39):
models are better at following instructions, and hallucinations really don't happen that much anymore. And the second thing is models have gotten much better at structured outputs. So before, it was kind of hacky to get the model to return in a way that code can process. Now tool calling, which I'll explain in a second, is baked into all the

(18:03):
main models. So what tool calling is, basically, is you're telling your prompt... you're giving the instructions for the prompt, but you're also telling the prompt it has access to a few different functions. So in the United Airlines example, maybe one of the functions is issue refund. Maybe there's another function, check user status. When you

(18:25):
use chatgpt.com, it has access to search the web. It has access to... like, when you're generating an image, it's probably a generate-image type tool. So this way, as we've built more tool calls, a lot of the models have been built around tool calls and have gotten really good at interacting with them, interpreting their responses, sending another message to them.

(18:47):
And that's why you see so many more autonomous agents, like if you look at Claude Code or you look at Codex. The reason coding agents are actually good now is because of this paradigm, and because they've actually simplified everything and said, instead of this complex DAG where we think through every single step, we're gonna actually kinda give the model a little bit more free rein, let the model run things,

(19:10):
see if it works, and actually fix it, because models have turned out to be really good at fixing their own mistakes. And what that has unlocked is a lot more flexibility for the model. So now if you use Claude Code, or Codex, or any of these coding agents in the command line... we wrote a few blog posts on how it works under the hood.

(19:33):
But the simple way to explain it is it's one loop that says: continue until the AI is done, and then ask the user for input. So I'll say, make my application work. And now it'll start the loop, and it has access to just write in the terminal like a human. So it'll write something, it'll get the output, and then it'll decide: do I wait for a user response, or

(19:55):
do I wanna run another tool call?
And this simple loop is much easier to develop, much easier to debug, and kind of just the way everybody's gone. Of course, it has disadvantages. But that's kind of the way I see it, in terms of where we've gone.
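
For readers who want to see the shape of that loop, here is a minimal sketch of a "while loop with tool calls" using the airline-support example. It assumes the OpenAI Python SDK; the model name, tool definitions, and fake tool bodies are illustrative placeholders, not how any particular coding agent is implemented.

# One loop: call the model, execute any tools it asks for, feed the results
# back, and stop when it answers without requesting more tools.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def check_user_status(user_id: str) -> str:
    # Stand-in for a real lookup.
    return json.dumps({"user_id": user_id, "status": "gold", "refund_eligible": True})

def issue_refund(user_id: str, amount: float) -> str:
    # Stand-in for a real refund call.
    return json.dumps({"user_id": user_id, "refunded": amount})

TOOL_IMPLS = {"check_user_status": check_user_status, "issue_refund": issue_refund}

TOOLS = [
    {"type": "function", "function": {
        "name": "check_user_status",
        "description": "Look up a customer's loyalty status and refund eligibility.",
        "parameters": {"type": "object",
                       "properties": {"user_id": {"type": "string"}},
                       "required": ["user_id"]},
    }},
    {"type": "function", "function": {
        "name": "issue_refund",
        "description": "Issue a refund to a customer.",
        "parameters": {"type": "object",
                       "properties": {"user_id": {"type": "string"},
                                      "amount": {"type": "number"}},
                       "required": ["user_id", "amount"]},
    }},
]

def run_agent(user_message: str) -> str:
    messages = [
        {"role": "system", "content": "You are an airline support agent. Use tools when needed."},
        {"role": "user", "content": user_message},
    ]
    # The while loop: keep going until the model stops asking for tools.
    while True:
        resp = client.chat.completions.create(model=MODEL, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # done; hand control back to the user
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = TOOL_IMPLS[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

print(run_agent("I'm user 42 and my flight was cancelled. Can I get a $150 refund?"))
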

Chris (20:13):
I'm curious, as we talk through this and as we've transitioned into this agentic world, especially since we talked to you last time... over the last year and a half, it's really come on strong versus where our conversation was back then, when that was a very new thing. What kind of incumbencies does this whole new set of

(20:34):
capabilities bring to the user, when they're trying to think... you know, we've talked a little bit already in this conversation about the evolution of prompting context and the rapidity of that. But as we move into this agentic world, any thoughts around what the user's new responsibilities are to be effective in that?

Jared (20:56):
Yeah, the user as in the builder of an AI application or

Chris (21:00):
The user in this case is the human who's listening to our podcast right now and is going to turn to their system at the end of the show and go, I'm gonna go try that.

Jared (21:10):
Totally. Totally. So a little of both, maybe. I think if you're just a user on chatgpt.com, or you're using an AI application that's available, Codex, Cursor, whatever, it's capable of doing much more and working for a much longer time and staying on track more. And it can do

(21:33):
a better job now, AI, of figuring it out, let's call it. So if you want to do a general task or do an exploration, because of this new concept that people are using to build AI applications, these AI agents are able to try something, and if it doesn't work, try something else and do the

(21:55):
exploration that humans would do. And we're really just trying to make them act more like a human. And the DAG way, the old way that we're talking about, where you have a bunch of nodes and you have a structured path, that's not how humans work. When you give an intern a task, you're not giving them the exact flowchart of how to solve it. You're telling them generally what tools they have at their disposal, and they're

(22:17):
gonna keep working and using the tools until they figure it out. Now, for the builder behind these applications, it makes it a little bit harder to test. It makes it a little bit harder to keep things on the rails, but it makes it much quicker to build something that succeeds. We built an agent using Claude Code that updates our docs every day.

(22:40):
Basically, it looks at all the code our team has written over the last twenty-four hours, decides if it should be in the docs, and then updates them. It took me two hours to build, because all I said was: download these repos, read the commits, and then check our docs and see if they should change. And then it gets to figure it out. Now, is this gonna be good for a

(23:00):
production system? Maybe it needs a little bit more work, but for something simple, it opens up a lot of use cases.
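
As a rough sketch of that kind of daily docs agent, one way to run a coding agent headlessly from a scheduled job might look like the following. It assumes Claude Code is installed and that its non-interactive print flag (-p) is available; the prompt, repo list, and docs path are placeholders rather than PromptLayer's actual setup.

# Hand the coding agent one broad instruction and let it loop over its own
# tool calls (shell, file edits) until it decides it is done.
import subprocess

PROMPT = (
    "Clone or pull these repos: <repo URLs go here>. "
    "Read the commits from the last 24 hours, decide whether any of them "
    "should change our docs in ./docs, and update the docs if so."
)

def run_docs_agent():
    # Assumed CLI invocation; adjust to however your agent exposes headless mode.
    subprocess.run(["claude", "-p", PROMPT], check=True)

if __name__ == "__main__":
    run_docs_agent()  # e.g. trigger this once a day from cron or CI
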

Daniel (23:07):
And what would you say, I guess... as you mentioned, you're focused on a platform around building, testing, versioning, and improving AI workflows or agents. With these kind of while-loop tool-calling things, right, if you have a variety of tools that could be called under

(23:30):
the hood, and part of the complexity of the system, I guess, is in the tools that you can call, how does that influence the way that you should version or test these systems, or does it even? In the sense that whereas before it felt like there was kind of a single

(23:53):
function that I'm calling into with a prompt, right? And that function produces output which may or may not be useful. Here, you have this recursive function that feeds into itself and could call any number of things. And so, like, when you look back, if you're just looking at the input and output, I guess is what I'm saying, then any number of

(24:15):
things could have gone wrong in that kind of recursive loop, or in that while loop that you're talking about. So how does that influence the prompts that you would version, how you iterate on those, how you test and improve these systems, from your perspective?

Jared (24:32):
Totally. It makes it interesting. So at the end of the day, the tool calls that are being run, the functions, are still those input-output things that can be unit tested and that can be really thoroughly and rigorously tested. The hard part is this while loop. So I think the core master prompt, the one that runs the loop, that can be tested

(24:53):
in a lot of ways.
You can run sanity checks. You can test against old data. You can see how things changed. But what you're getting at is this really interesting problem that's kind of been created, which is: how do you test a flexible agent where flexibility is kind of one of the keys to it, one of the things that makes it good? I think the heuristic I've developed is

(25:15):
something I want to call, maybe I'll write a blog post about it, agent smell. So if you run an agent, what sanity checks can you run to see if it smells a little funky? Like, is it raising any red flags? And I'll give you some examples. So say I was building an agent to fix errors in my application.

(25:38):
So, like, if I had a database error and I wanted to build an agent that would go and fix the code automatically, if I wanted to test this agent, what I would test for is how many tool calls there are. First, I just wanna surface these statistics. How many tool calls is it running? How many times is it retrying the tool calls? How long does the agent take? And these are kind of surface-level things. They're not the end-

(25:59):
all-be-all, but you first want to start simple. So you first want some sort of smell test where you can say, hey, this new version is behaving very differently than my old version. Maybe better, but maybe worse. And then that's when you go in and break it down, like a state-by-state test. So the most useful tests we see our users doing are individual

(26:23):
states or full conversation simulation. So individual states would basically be saying: here's the conversation history, now run it, and what's the next step? And we're just checking if the next step is the same. Of course, that's only one part of the picture. The other part of the picture is: here's the initial instructions, now simulate the whole conversation and see if the

(26:44):
final output's correct. And then combine that with the smell. It doesn't give you a full picture, but I think the core learning we've had, at least over the past year and a half, is that you don't need to have 100% coverage when you're evaluating these things. If anything, if you're trying to make a perfect test for your agents, you're probably never

(27:06):
gonna ship. You're just never gonna do it. The better thing is to make it good and have heuristics for figuring out when it's regressed before it does.
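
A minimal sketch of that "agent smell" idea: surface coarse statistics from logged agent runs and flag when a new version drifts far from the old one. The trace format, threshold, and helper names here are made up for illustration.

from statistics import mean

# Each logged run is assumed to look like:
# {"duration_s": 41.0, "tool_calls": ["query_db", "query_db", "patch_code"], "retries": 1}
def smell(runs):
    return {
        "avg_tool_calls": mean(len(r["tool_calls"]) for r in runs),
        "avg_retries": mean(r["retries"] for r in runs),
        "avg_duration_s": mean(r["duration_s"] for r in runs),
    }

def compare_versions(old_runs, new_runs, tolerance=0.5):
    # Flag any statistic that moved more than `tolerance` (here 50%) between versions.
    old, new = smell(old_runs), smell(new_runs)
    for key in old:
        change = (new[key] - old[key]) / max(old[key], 1e-9)
        flag = "  <-- smells funky" if abs(change) > tolerance else ""
        print(f"{key}: {old[key]:.1f} -> {new[key]:.1f} ({change:+.0%}){flag}")

# Example usage, where load_runs is whatever pulls traces from your own logging:
# compare_versions(load_runs("prompt-v12"), load_runs("prompt-v13"))
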

Daniel (27:19):
So, Jared, we talked through a little bit of this agentic stuff, and I know you set up some of the conversation around how you as a company are enabling nontechnical users to work on prompts, embed their domain knowledge, and have that kind of nontechnical connection

(27:41):
to the prompts under the hood, which are maybe embedded in various systems or tools and that sort of thing. I'm wondering about your perspective on... obviously, one of the things that's been talked about a lot in recent months is this kind of 95% of AI pilots failing and that sort of thing,

(28:02):
which we've talked about on the show, the report from MIT, and gave some thoughts. But I'm wondering how you think that intersects with the tools that people are using to manage their prompting, their AI systems, maybe the rigor that needs to be there that's not, or maybe, like,

(28:23):
there's another side of this where some of those engineering principles need to be brought into the picture. Yeah. What is your thought, from working with a lot of nontechnical users on a platform like this? You know, what have you learned over time to be those key pieces of making sure that those people coming to a

(28:46):
problem don't end up just wasting a lot of their time working on something that doesn't actually get results?

Jared (28:53):
Right. So maybe 95% of AI pilots fail, but they're not using PromptLayer. That's why it's...

Daniel (29:02):
Yeah. Exactly. That was the setup for the question. Yeah.

Jared (29:07):
Zero percent of AI pilots in PromptLayer fail.

Daniel (29:10):
Well, yeah. Exactly. Same with Prediction Guard. Yeah. Exactly. Exactly. Yeah. They're not using

Jared (29:17):
the right tools. I would say, so as a platform, how we look at it at PromptLayer is, we have a large diversity of teams, ranging from super technical and all engineers to basically no engineers and everything in the middle. And what we are trying

(29:37):
to do is build a rigorous process for building these applications and expose it to the people who know if the outputs are correct or not. So what I mean by this is rigor in terms of versioning and knowing which versions are working. Rigor in terms of being able to test, like we were just talking about, testing agents, testing prompts. And also rigor in terms

(29:58):
of logging and seeing what's happening in production, what's going on. Because you can only test so much in development in AI, and you kind of need to expand the surface area of how people use your product. The reason we focus a lot on getting these domain experts involved in the process is because we

(30:20):
believe, actually, from a business perspective, that's how you win as an AI company. If you're building legal AI, you win by having a lawyer involved. The example I always give, and I love going back to it: I come from a family of psychologists. It skipped me. I'm an engineer, but I have some

(30:41):
familiarity with it. And if you wanna go to...

Daniel (30:44):
But now you're just psychoanalyzing the language that's going into models. So I guess your family can be proud.

Jared (30:53):
I hope so. I hope so. I'm working on it. But if you wanna see a shrink, there's like six on the block, right? I live in New York City. There's a lot of them. And the assumption I'm making here is that there's no global maximum. There's no one correct answer to psychology. You know, you have

(31:13):
different methods.
You have CBT. You have ayahuasca retreats. You have a lot of different ways to treat people. And same with medical doctors, same with education, same with a lot of things. And what's the core differentiator between the different psychologists on the block that my office is on? The taste and how they choose to practice their field. They're

(31:37):
all getting the same education. Maybe some have a little bit more knowledge than others, but it's how they implement their product, let's say. And in the same way, if you're building an AI therapist, how you win as a business is the non-engineering taste that's been put into the AI product and the

(31:57):
way it's using the context that you provided and what you've told it to do. And going back to your question of how do you take that knowledge of needing the nontechnical engineer... like, we think an AI engineer should be
nontechnical.

(32:17):
But how do you bring in those engineering principles so the pilot doesn't fail? A lot of times we see engineering actually owns the product, the AI product. So we're usually talking to a VP of engineering or a CTO or something like that to get PromptLayer installed, because we're all engineers on our team. We're bringing in these principles. We think, even

(32:38):
if it's a nontechnical expertise, you have to do it in an organized and systematic way. And that's why you see the skill of prompting, I almost think, is not quite the same Venn diagram as the skill of coding, because it's really a skill of tinkering. And not all coders are tinkerers, but not all writers are tinkerers either. So

(33:02):
there's some new type of algorithmic thinking that overlaps very highly.

Daniel (33:08):
Yeah. It's almost like being a negotiator. Almost.
Exactly. Yeah.

Chris (33:12):
It is. But also, you know, maybe you have organizations that are leaning one way or the other. You've just kind of described that spectrum of skill and expertise that apply. And so possibly, for whatever problem your organization is trying to solve, it's about trying to find the right place on that spectrum, to bring the right

(33:33):
resources together. And, like, even aside from what we're talking about here, that's an easy place for businesses to fall down anyway. And so in a sense, it may not be that different from many other business problems that companies are trying to face.

Jared (33:48):
Yeah, totally. It's like, what can we learn from the non-AI world to ship these things better? To us, the big mistake people make is they try to boil the whole ocean at the beginning, and they try to do too much. And really, you wanna do the whole crawl, walk, run when you build these

(34:10):
systems. So maybe instead of the pilot being, hey, we're gonna add a billion dollars of revenue with our AI product, you wanna say, alright, we're gonna make a beta version. Maybe we'll only release it internally. Maybe we'll do this. Maybe it won't do everything. Maybe it won't be the while loop with all the tool calls. Maybe it'll just be one tool call. And that's also true with how you

(34:30):
test these things and how you build them. So it's not just what the product does. A lot of teams get stuck trying to build tests because they try to build perfect tests. And like we're saying, that's hard and maybe even
impossible.

Chris (34:44):
And there's a learning process that crawl, walk, run kind of implies: it gives companies a chance to not crater too hard when they're first starting out, you know, as they're trying to get something done. Keep the scope small enough so that they can actually achieve something small but positive, and learn from that and kind of build up

(35:05):
toward what their true aspirations might be.

Jared (35:08):
Exactly. I mean, we're all learning.

Daniel (35:10):
Yeah. And I don't know, when you said crawl, walk, run, Chris, it made me think some of the problem might be that people don't understand what is crawling, what is walking, what is running. Like, it all just looks like AI tasks. Like, it's a big soup of things that you can sort of, quote, do with AI, and it's unclear, you know, how do I pick out the

(35:34):
crawling task? Because I don't know which of them it is. I see that kind of paralysis a little bit. I don't know if you see that as well, Jared, or have any suggestions for how people can think about picking that apart? Because you mentioned those domain experts who are coming into, you know, PromptLayer, and you're

(35:57):
connecting those people with the business knowledge into PromptLayer. You know, how is it that they come upon the knowledge to know what is a kind of crawl task, or a feasible task to, like, start with and play around with? Any
thoughts?

Jared (36:13):
Yeah. It's a good point. I think the most successful AI teams I've seen work in collaboration between engineering and domain experts. So if you're just domain experts or you're just engineers, you can succeed, and I've seen it succeed. But the most common way, and what we recommend, is it should be a joint effort.

(36:35):
The engineers often know how to ship a product and how to do agile or iterative design, and the nontechnical folks understand what makes it good. And you need both of these, I think. As for how to break it down, with the crawl task you almost have to just step back and ask what your heuristic

(36:57):
is here. Let's talk about testing. Just what's the crawl task in testing our AI application? The hard examples are something like summaries, where there is no ground truth.
So what is the crawl task of evaluating an AI notetaker? Well, you kind of have to step back and say, as a human, what is a

(37:17):
good summary? Alright, what's the simplest thing? Maybe the simplest thing is just asking, is the summary in English?
Maybe doing another LLM-as-judge check where we say, does it use markdown? And then maybe another one that says, is it less than a page? And these are all obviously not end-all-be-all tests, but that's the crawl. And then once that's working, you can check for hallucinations. And then maybe you can check for

(37:40):
style.
But it's very use-case dependent. So it's hard to give a one-size-fits-all answer there.
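
A minimal sketch of those crawl-level checks for an AI notetaker, assuming the OpenAI Python SDK for the LLM-as-judge piece; the model name, word limit, and markdown heuristic are placeholders, and these are deliberately not end-all-be-all tests.

# Three cheap "crawl" checks for a generated summary: under a page, uses
# markdown, and is in English (judged by another model).
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "gpt-4o-mini"  # placeholder

def is_under_a_page(summary: str, max_words: int = 400) -> bool:
    return len(summary.split()) <= max_words

def uses_markdown(summary: str) -> bool:
    return any(marker in summary for marker in ("# ", "- ", "* ", "**"))

def judge_is_english(summary: str) -> bool:
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{
            "role": "user",
            "content": "Answer only YES or NO: is the following text written in English?\n\n" + summary,
        }],
    )
    return "YES" in (resp.choices[0].message.content or "").upper()

def crawl_eval(summary: str) -> dict:
    return {
        "english": judge_is_english(summary),
        "markdown": uses_markdown(summary),
        "under_a_page": is_under_a_page(summary),
    }
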

Chris (37:47):
So I want to turn things as we're starting to get closer toward the end, and I want to ask kind of a fun question for you. I know you like Claude Code. And so I wanted to ask about some of the things that you're playing around with, what you're doing, and what's got you excited about it.

Jared (38:02):
Yeah. And I will say, I like Claude Code, I also like Codex. I also like Cursor. I also like Amp. Amp is doing really cool stuff with the free coding agent too. So I switch between all of them. I think I give Claude Code a lot of credit for being the first really useful coding agent that I've used. What am I doing? So, a lot. We redid our whole

(38:25):
engineering philosophy around these coding agents at PromptLayer. Basically, the hard part about building a platform, as you guys likely know, is all these little things and the death by a thousand cuts. So we have to work on big features, but also the UX of, like, this button here doesn't have a loading state, or this is

(38:45):
not draggable here. That's how you could fail, if you don't fix those, and that list piles on. So now we have a new rule in our company: if it takes less than two hours to do using Claude Code or Codex or something, just do it. Don't ask anyone. Don't prioritize it. And it's helped a lot. Honestly, our customers have literally

(39:06):
told us, like, wow, you guys are shipping so much faster now. So I think everyone who says, oh, it actually makes you a slower coder, is just full of it. It's so good. And I use it for nontechnical things too, if I wanna go through a CSV. I'll tell you one thing I did recently that is pretty interesting and nontechnical. I went to an

(39:29):
event, and I won't say which one, because then people will be like, oh, that's why I'm getting spammed. But I went to an event, and everyone was pretty interesting at the event. And I basically copied and pasted the list of users there. Actually, I gave the HTML to Claude Code. I said, make it into a CSV of all the people there who clicked

(39:51):
going and whatever social media they have. And then I put it into PromptLayer, and I actually, like, added new columns. So I did a batch of, like, find where they work, find their whatever. And then you could go back... I went back to Claude Code and did some data processing, saying, like, who should I contact? And just doing, like, random tasks like that and batch

(40:12):
prompting. Like, I combine PromptLayer with it, of course, but random things like sending emails, creating random UIs to, like, understand a company. I use it for everything. I'm constantly on it. I constantly vibe code these days, and we sat down with everybody on our team and Claude Coded with them, just so they can see how good it is, because a lot of people are

(40:33):
skeptical, and it's great.

Daniel (40:37):
That's great, Jared. I love the tie-in to even this sort of, like, combination of things. It's like, when Anthropic released Claude Code, I don't know that they imagined all of these, like, trickle-down effects of the way that people are combining it with other things, in, like,

(40:58):
even the nontechnical things that you mentioned. So it's really cool to see how that plays out. As you kind of look towards the future, as we get to wrapping up here, tell us a little bit about what excites you moving into this next year. And, you know, when we talk again in the next year and a half, or whenever it

(41:21):
is, what are you excited about during that period of time, related to PromptLayer and related to things in general and how the ecosystem's evolving?

Jared (41:30):
I'm excited about a lot. The simplest one is what we were talking about: these coding agents as a headless tool. So using them in your workflows to run things, that's exciting. I'm especially excited about nontechnical uses of these things, I think. Right now they're in a terminal, and you're gonna be using them for so much more. And then, Claude Code aside, I'm very excited about how

(41:54):
the space is evolving to a place where you have a lot of different tools at your disposal. Some models are really good at writing. Some models are really good at coding. And the consumer just has more options than ever to build their product. I think we're not in a world where one model rules them all.

(42:15):
There are a lot of ways to solve a problem. There's a lot of variability in how you build your product. And I think that's a good thing. I think the future is great. I'm excited for AI to take over. I'm not worried about it at all. And, yeah... I know this is a little bit of a shill because this is what our company does, but I'm

(42:38):
very excited about unlocking AI engineering for people who didn't study computer science. And I think this has been something people have talked about for so long: how do we democratize coding and get more people coding. And maybe people aren't going to be coding anymore, but people

(43:00):
have expertise, and now they're going to be able to build AI products around it and do AI engineering around it. And really, anybody's gonna be able to distribute their work to almost infinite levels.
So that's what keeps me up at night, in a good way.

Daniel (43:14):
That's awesome. And yeah, I would definitely recommend... of course, we'll include links to PromptLayer in the show notes, but also, as Jared mentioned, they have a great blog. You know, they've released some excellent articles. They have great learning resources out there. So check out everything that they're

(43:35):
doing.
Really appreciate the way that you all are contributing to the ecosystem, Jared. Definitely keep up the good work, and we'll look forward to talking with you again next time you're on the show.

Jared (43:47):
Amazing. Amazing. Thanks for having me. And anyone can reach out on Twitter or email, or sign up for PromptLayer and get started for free. So, excited to see what people build.

Daniel (43:57):
Yeah. Definitely. Alright. Talk to you soon.

Jared (44:00):
Thank you.

Jerod (44:08):
All right. That's our show for this week. If you haven't checked out our website, head to practicalai.fm, and be sure to connect with us on LinkedIn, X, or Bluesky. You'll see us posting insights related to the latest AI developments, and we would love for you to join the conversation. Thanks to our partner Prediction Guard for providing operational support for the show. Check them out at predictionguard.com. Also,

(44:31):
thanks to Breakmaster Cylinder for the beats, and to you for listening. That's all for now, but you'll hear from us again next week.