
September 10, 2025 54 mins

What’s the first step to building an enterprise-grade AI tool? 

Malte Ubl, CTO of Vercel, joins us this week to share Vercel’s playbook for agents, explaining how agents are a new type of software for solving flexible tasks. He shares how Vercel's developer-first ecosystem, including tools like the AI SDK and AI Gateway, is designed to help teams move from a quick proof-of-concept to a trusted, production-ready application.

Malte explores the practicalities of production AI, from the importance of eval-driven development to debugging chaotic agents with robust tracing. He offers a critical lesson on security, explaining why prompt injection requires a totally different solution (tool constraints) than traditional threats like SQL injection. This episode is a deep dive into the infrastructure and mindset, from sandboxes to specialized SLMs, required to build the next generation of AI tools.


Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash


Follow Today's Guest(s)

Connect with Malte on LinkedIn

Follow Malte on X (formerly Twitter)

Learn more about Vercel


Check out Galileo

Try Galileo

Agent Leaderboard


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
On the agent in particular, this framing that I have where it's a new kind of software, but not software that's like weird, but software that we always wanted to write. Where I'm coming from is that there is a type of software that automates something that we do on a daily basis where there is a little bit of flexibility that's needed, where that's actually quite difficult to model in software

(00:20):
traditionally, because you would have written these like if-this-then-that type of blocks. It's just difficult to make that exhaustive, but agents do really well on that same software. So suddenly it's super easy to write.
Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsdon, and today I'm joined by Malte Ubl, CTO at Vercel. Malte, it's great to see you again.

(00:43):
Great to be here. Super awesome.
Yeah, it's, it's been a lot of fun chatting with you over.
I guess the last year it was thelast time we had a chance to sit
down and kind of dive deep on everything happening with AI.
So it's been fun to watch a couple of your talks as you've
continued to be a prominent voice in the AI space, often
striking a balance between excitement and pragmatism and

(01:07):
the ability to be actionable, which I think is something Vercel does a really good job of. And you've even described yourself as an anti-hype guy at times, something which was on full display at your recent Vercel Ship 2025 talk.
And you had some fantastic takeaways there.
In particular, you talked about the FOMO that people are feeling
right now. And as someone who talks to

(01:28):
leaders like yourself every weekwho are all doing exciting
things in AI, I feel this FOMO vehemently.
Every week I'm going, oh God, here's a new thing that I
haven't experimented enough withyet.
And it's very common, I think, for folks who are building in a
space to experience this. It's almost the default emotion
of AI builders as perpetual FOMObecause there's so much

(01:51):
happening, there's so much hype,there's so many new and exciting
things that are actually being achieved, and the space is
moving faster than I think many of us have ever experienced.
You've posited that AI agents inparticular are a new kind of
software and a new paradigm thatis something we've always wanted

(02:12):
to create but couldn't for economic and, you know, energy
and, frankly, reasoning reasons.Could you expand on how you see
agents in AI more broadly, shifting the paradigm of how
technology is working today? Yeah, totally.
And you know, I think like the way I qualify anti hype guy is
that I'm actually I'm pretty hypey about this stuff, but I

(02:35):
try to ground it in reality. Like, I think there's always this moment when new tech comes out where people kind of feel compelled, like they feel the FOMO, but they don't really quite know what the thing is. And so, like, you don't
really know where to start, right.
And what definitely is different with AI is that there is value, though, because, you know, it felt like almost

(02:59):
the entire Web3 hype cycle was just perpetually there, but you never, like, you know, the thing that actually would be the value was never really discovered. But this is so different, right? Like, you can go try it out and then it actually happens to work. So we're coming from a
different perspective. And the, the, yeah, the, the on,

(03:22):
on the agent in particular, the this, this framing that I have
where it's a new kind of software, but not software
that's like weird, but software that we always wanted to write.
Where I'm coming from is that there is a, a type of software
that automates something that wedo on a daily basis where there

(03:45):
is a little bit of flexibility that's needed, where that's actually quite difficult to model in software, traditionally, because you would have written these like if-this-then-that type of blocks. It's just difficult to make that exhaustive, but agents do really well on that same software. So suddenly it's super easy to

(04:06):
write. I could give a few examples, just from the top of my head. So we've been working on, and I don't love this category name, but a DevSecOps agent. It basically just kind of takes anomaly detection from a firewall and goes fishing for, like, what happened?

(04:27):
Like, and it could be anything. It could be many, many things.
And it can be all of these things, like, combined in novel ways. And so writing software that would do that analysis would be, I mean, at least difficult and, like, a lot of work. And if you just take, like, a
modern frontier model and you give it a few tools to say,

(04:47):
like, here's how you can query our data stream, what happened,
right? Like, it will do a pretty good
job just from the scratch, and then you can do it better over
time. But like it actually is able to
answer that question in a way that is valuable.
And you spend, you know, an afternoon on it and now you have

(05:08):
a piece of software again like that used to be so hard to write
and, and, and you know, the now it's now it's something we can
actually do. I completely agree.
It's, it's interesting too, because it feels like, and maybe
this speaks to your anti hype positioning that we all
overestimate the ability to go from zero to 100% of solving the

(05:30):
problem and maybe underestimate how much effort is required in
that final, you know, 20% of solving it.
Because to your point, it's really easy to have a, you know,
simple agent that's working off a frontier model, you know, take
your pick depending on the task and solve a lot of this problem
right off the bat. And then we maybe are distracted

(05:52):
and don't go through all the needed tweaks to get it to 100%
of our task at the same time that that's incredible.
We in an hour or two have solvedmost of this problem.
And I wonder if part of why there is this bit of a hype gap
there is that while most developers are thinking about

(06:15):
agents, not all of them are actually building them yet.
And you know, I referenced your talk at Vercel's Ship 2025
and one of the first things you asked the audience was how many
of you are actually building with agents today?
How many of you have actually tried it out?
And I believe you said about 5% of that audience actually had,

(06:35):
whereas I know you think it's going to be more like 100% when
you come back and ask that same question next year.
Right, yeah. And, and the reason just is that
I think we like in our finding, like we tried actually quite a
few things and they all work like so, so it's like, yeah,

(06:56):
it's like unreasonably effective.
And so I feel I can generalize it to multiple categories. Now, I will give one caveat, which is in fact the reason why. Obviously we're talking to, like, a software audience, who is using AI agents today in substantial fashion, right? Maybe Cursor, but, I mean, certainly Claude Code fits

(07:19):
that bill and and it's working and it's fascinating.
And so I think the agreed-upon reason why you have these very effective agents in software engineering and you don't have them in other disciplines is that in the software engineering area, you can, like, reinforcement learn on

(07:42):
this very valuable, very open-ended problem, in a way that's more difficult for things that don't have source code that you can validate and run your tests on and so forth, right? So it's a little bit easier to make these agents very good at software engineering than it is to make them really good at, like, anything but that.
But having said that, I think that that is basically just

(08:03):
pointing towards something that that is going to be true more
generally. And that's why I feel that what we see as a signal today in our own discipline most likely generalizes further out. Because I am working on dev tools, I only have dev tools examples, right? But actually, let me give you one example that I
thought was really cool. Please.

(08:24):
So I live in Alameda, specifically on Bay Farm Island, which is the same peninsula that has the Oakland airport. And I don't hear airplanes except for the trijets. And so I don't love them, and, you know, obviously no passenger airline flies them anymore, but the cargo airlines

(08:47):
do. And so I wondered, when are they phasing them out? And the Internet doesn't know that. So I sent out a deep research task, and what ChatGPT did is it looked at all the flight data that the Oakland airport publishes every day and figured out what the over-time change in trijet patterns is.

(09:10):
And like, obviously I could havedone that myself, but I was
like, it's a hell of a lot of work, right?
And so I wouldn't have done thatmyself, right?
And now you have this thing thatdoes it.
And so that just worked as an out-of-the-box deep research task. It created novel data. It was technically there, but obviously it would have been incredibly difficult to surface. And we actually have a product in development that is very similar, which is we have an

(09:33):
agent that does essentially the same as git bisect, for Vercel preview deployments. Because Vercel has immutable deployments, you can essentially bisect them to find out where a regression occurred. And so this bot, you know, you

(09:53):
can teach it how to reproduce a problem and then it just does the bisect, and it's a similar thing, where, I mean, you can do that as a human, but it's a horrible task. Like, we happen to be bad at it even with, like, a bisect helper, like going up, down; humans are so bad at this. It's very easy to program that thing. And so it fits perfectly into an agent.
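To make that concrete, here is a minimal, purely illustrative sketch of the bisect idea: a binary search over an ordered list of immutable preview deployments for the first one where a regression reproduces. The `reproduces` callback (say, an agent re-running the repro steps) and the assumption that the oldest deployment is good are hypothetical, not Vercel's actual implementation.

```ts
// Hypothetical sketch: find the first deployment where a regression reproduces.
// Assumes `deployments` is ordered oldest -> newest and the oldest one is good.
async function bisectDeployments(
  deployments: string[], // e.g. immutable preview deployment URLs
  reproduces: (deployment: string) => Promise<boolean>, // e.g. an agent re-running the repro
): Promise<string | null> {
  let good = 0; // assumed: regression absent here
  let bad = deployments.length - 1;
  if (!(await reproduces(deployments[bad]))) return null; // regression never reproduces
  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    if (await reproduces(deployments[mid])) {
      bad = mid; // regression already present, look earlier
    } else {
      good = mid; // still fine, look later
    }
  }
  return deployments[bad]; // first deployment where the regression shows up
}
```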

(10:17):
And this agent, I mean, this is already different from, like, a Claude Code, right? Because now you're in a world that is kind of outside of coding, and that, again, just works. Like, with relatively small effort,
You have this like incredibly valuable tool that is able to
navigate more uncertainty than like a traditional computer

(10:37):
program would have been able to.Yeah, that's a really great
example. And I know we all think, like, oh, it's fantastic to have a git bisect pattern to search with, and we want to approach things that way. But God, when you have to, you know, bisect things yourself and recreate a problem, it can be tedious.
So I love that as an example. And it sounds like you're

(10:59):
identifying other unlocked opportunities for businesses
that are embracing AI agents today.
You know, obviously coding and and dev tools are clear examples
here where we have a ton of public data to train off of.
We have a lot of solved problems.
We have a ton of people who are working on these problems who
can help, you know, fine tune and improve this.
But what are the other use casesthat you think businesses should

(11:22):
be embracing today for agents? Yeah, I think that the, the, the
clear big one is support agents. That's where we also see kind of this entire ecosystem of, let's call them SaaS businesses, that act specifically in this space, like Decagon, Sierra, etcetera, right?

(11:44):
And I think, generally speaking, because it's now so much cheaper to make software, I wonder how much you're actually going to buy software packages like that versus being able to build these yourself and have the perfectly tailored support experience. So I think that's a big
one. And I mean, I just gave this
deep research example. So deep research as a basically

(12:08):
just a generalized way of, you know, going through data, that's useful. And whatever public AI is out there does not have access to your company's data. So I think one really concrete thing that can probably work for your company is that you build a deep research tool that

(12:32):
has access to your private data,right, that you don't share with
anyone else. How would you suggest that
engineering leaders or product leaders who are currently
evaluating their own AI use cases think through these
opportunities? So it sounds like, one, you know, what can you unlock with your private data? I think that's a great example. Two, you know, what tasks are

(12:57):
simply challenging to solve today that you can at least get
mostly solved much faster with an agent or AI solution.
What other ways would you encourage other leaders to be
thinking about solving problems with AI?
Yeah, I think the, the, I mean, overall it's all about being
creative, right and and knowing your business.

(13:18):
I think, besides the categories that I've been talking about, whenever you have a workflow, it's worth looking into. And often, like, especially in bigger companies, if you have a process, there is some kind of routing phase in there, where it was difficult to write this, like, algorithm for what's next, right?

(13:40):
And that's, I think, where AI models excel, where you basically just write down the business rules in prose and you basically just have the LLM execute your business rules, right? That works really well.
The other pattern I think that is under-discussed, in my opinion, is AI in the loop, because we love to

(14:02):
talk about human in the loop. So with human in the loop, the AI does the work and then you have a human review it, or not.
That's obviously like what usually should happen, right?
It's, it's, it's so obviously true that I don't even think
it's necessary to to discuss this.
But the opposite also works really well.
So if the human does the work, then AI is an incredible

(14:23):
reviewer. I think what we're seeing now is that AI code review, first of all, is amazing. It has this, like, relentlessness; it doesn't forget about your past mistakes. It checks again if you make them a millionth time and stuff like that, right? Like, so I'm a fan. And so I think that is almost
like there's no way that doesn'tgeneralize, right?

(14:45):
Like, I think validating work product was hard because, again, the rules you want to validate against are really difficult to put in an algorithm, right? And, I mean, often with human work product, it's natural language. It's very hard to write, like, lint rules against human output, right?

(15:07):
But you can, you can basically think about AI doing a similar
thing where, like, you know, let's say you're in the marketing department and you have a certain policy around how language should be used. That is something that an LLM can enforce really well. And again, like, it doesn't go to bed, it doesn't wake up tired, right?

(15:29):
Like, so it's good at these very, like, tedious tasks where, you know, humans, the 15th time they do it, kind of go low on quality. So I think those are kind of the areas that are ripe for ideas. And again, these are just examples.
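As a rough illustration of that "AI in the loop" reviewer idea, here is a minimal sketch using the AI SDK's generateObject to check a human-written draft against a written language policy. The model choice, the example rules, and the schema are assumptions for illustration, not anything Vercel or Galileo ships.

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// Hypothetical policy: the kind of business rules you would write down in prose.
const policy = `
- Never call customers "users" in marketing copy.
- Spell product names exactly: "AI SDK", "AI Gateway", "v0".
- No unverifiable superlatives ("the best", "the fastest").
`;

// Ask the model for structured findings so the result is easy to act on.
export async function reviewDraft(draft: string) {
  const { object } = await generateObject({
    model: openai('gpt-4o'), // assumption: any capable model works here
    schema: z.object({
      violations: z.array(
        z.object({
          rule: z.string(),
          excerpt: z.string(),
          suggestion: z.string(),
        }),
      ),
    }),
    prompt: `Review the draft against the policy and list every violation.\n\nPolicy:\n${policy}\n\nDraft:\n${draft}`,
  });
  return object.violations; // an empty array means the draft passed the policy check
}
```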
So you have an idea in your head. The next step, then,

(15:53):
to turn this into something more practical, is what I call the vibe check, where you actually do a non-programming exercise to find out if the LLM is anywhere up to the task. Some of the examples I just gave, I mean, they're almost trivial to check in that way, right? Like, if I have a set of

(16:15):
writing rules and I want to understand if the AI is doing a good job at enforcing them, I can do that right now, right? I go to ChatGPT, I paste my text, I ask it, can you please, you know, check this text against these rules, and see what happens, right? And if it delivers all kinds of false positives, that sucks.

(16:36):
And if it doesn't find the problems, that also sucks. But maybe it does. And, like, you really quickly can understand, am I giving the AI a task that it's just not up to? And maybe I need to wait a year, or maybe I need to do something else. Or you're saying, OK, this actually does seem to kind of work, and great, now I can go and turn it into, like, a program, right?

(16:58):
Because obviously, you know, the next step is that ideally we don't have to paste it into ChatGPT and write a prompt from scratch. But that part is like the easy
part. I, I feel like this is almost a
simpler version of the kind of MVP approach that a lot of folks
are encouraging when it comes to building out new product features,
where they're saying, Oh, well, just, you know, vibe code a

(17:19):
quick version of that feature, see how it looks or, you know,
of, you know, vibe code what youwant the new UI to look like.
And, you know, let's let's have something to work off of.
And you're saying, hey, you know, that's a great step.
But before we even get to that with let's just really simply
check the use case. Is this something that an LLM is
kind of equipped to handle rightnow?

(17:40):
Or are we going to have to add more structure and more context,
or simply wait till it's better set up for it?
Yeah, I think you're right in that it fulfills a similar role in the process.
The like I mean, I'm generally the biggest fan of of rapid
prototyping and like and using like an early version of the app

(18:00):
as a discussion platform rather than the fully blown thing. Like, obviously we're making v0, which is, like, that's the whole point. But I think that's great for when your idea is that you're going to ship this product, right, and it has an experience and you want to experience it. So that's one thing.
The other thing is that, now, if that product has an AI at its

(18:24):
heart, obviously it might not, but if it does, like, you know, because you're building an agent, then it really helps to just kind of prompt your way to seeing if it's working, because it just removes all these steps that are essentially enterprise process

(18:44):
automation, right? You know, and maybe that's like
only 5 minutes in your startup, but if there's an SAP and an
Oracle and you know, whatever, right?
Like it might be a lot of stuff and, and like writing code
against those doesn't hurt. But like you also you don't, you

(19:05):
know, that's, I think we're all kind of confident that we could
do it if we, you know, if the AIat the heart kind of did what we
wanted. What would be your next step?
So you've started to do this iteration.
You've kind of validated that hey, yeah, this is a real use
case we could apply this to. What should builders do next to

(19:27):
turn this into a fully fledged agent or system?
Yeah. So hopefully we we found out
that the AI somehow magically actually does, you know, what we want. Then, yeah, I mean, that's the step where we go and put in the software, right? Like, that's when we attach it to our business systems that have the data that we want to process.

(19:49):
You know, again, like it could be, it could be all kinds of
different, different services depending on, on what we want to
build. But at this stage what we're really doing is effectively just plain old workflow integration, right? So that step, if you're a back-end engineer, you're right at home; you're not learning anything new. That's what you're already doing day-to-day.

(20:12):
Nothing's, nothing's weird. The other stuff is that at this
point you build the actual agentand we actually haven't talked
too much about like what that means.
So maybe now is a good time. Please.
There are different architectures, but the one that feels uncomfortable but works, which is the architecture, which is why, for example, Claude Code works as

(20:33):
magically as it does. And the one I would encourage using is the most simple one, which is where you have an LLM and, you know, obviously you have some kind of trigger for why you start working, some kind of input data and prompt. And you give it a set of tools.
And these tools should be relatively simple.

(20:58):
So, like, for a coding agent it will be read files, list files, edit file, create pull request. So nothing crazy, right? Really just to kind of go figure out what's in my repository. Maybe there's a grep files tool, something like that, right?
If you're if you're building anyform of like deep research

(21:18):
agent and you want to attach it to your company database, maybe you give it a search functionality that hits, you know, Glean or whatever you're using, right? And maybe you give it, like, a generic SQL execution tool that gives it access to your Snowflake, or maybe you're doing something more specific.

(21:39):
But that's the type of tools yougive it.
And you describe to the LLM, like, this is what these tools are good for. This is your goal. And you tell the LLM, OK, you get to make up to 10 turns using a tool towards your goal. And then you just let it cook.
So like you literally tell it like go figure it out, right?

(22:00):
And and then OK, it says like, hey, this is my task.
Hey, OK, I have this tool. OK, that seems useful.
Execute the tool. It gets the response back and
sees, OK, this is what happened. OK, this has helped me, this has not; maybe I need to call the tool again with a different query, or no, this is actually right.
So it figures out this entire sequence of, of kind of
iterating towards the result in a fully autonomous fashion.

(22:24):
And the magic is that that works. And, like, it does feel a little bit uncomfortable. I think what we've been finding is that you actually really want to make these tools somewhat atomic rather than, you know, more high-level, more specific to the task, and that, you know, it just works. Like, you have to,

(22:47):
you have to give it a try, but that's the architecture. And then, you know, at the end that LLM produces some kind of output and there you go.
I mean, if folks are not so familiar with how these LLMs work, you have multiple turns. And technically this is almost like an immutable function,

(23:12):
in the sense that when you call it again, you just give it the previous output. And so it acts like it would be continuing on the previous interactions, on the previous tool calls. Obviously it's processing a new tool return, but really it kind of always goes from scratch and just does the token completion, given the new data, right? And so that actually

(23:35):
works. And yeah, in the end, you know, you get some kind of output. There was something you were going to do with it, right? So you do that. That's where you're back to, like, the plain old workflow automation. And there you have it; that's the whole program.
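To ground that description, here is a minimal sketch of the loop using the AI SDK's generateText with a couple of atomic tools and a bounded number of turns. The file-system tools, the model, and the exact option names (which have shifted between AI SDK versions, for example maxSteps versus a stop condition in newer releases) are assumptions; treat it as the shape of the architecture rather than a definitive implementation.

```ts
import { generateText, tool } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { readFile, readdir } from 'node:fs/promises';

const result = await generateText({
  model: openai('gpt-4o'), // assumption: any capable frontier model
  system:
    'You are a coding agent. Use the tools to inspect the repository, then answer. ' +
    'You may take up to 10 tool turns.',
  prompt: 'Find where the HTTP retry logic lives and summarize how it works.',
  maxSteps: 10, // the loop: tool call -> tool result -> next completion, until done or 10 turns
  tools: {
    listFiles: tool({
      description: 'List the files in a directory of the repository.',
      parameters: z.object({ dir: z.string().describe('Path relative to the repo root') }),
      execute: async ({ dir }) => (await readdir(dir)).join('\n'),
    }),
    readFileContents: tool({
      description: 'Read one file from the repository.',
      parameters: z.object({ path: z.string().describe('Path relative to the repo root') }),
      execute: async ({ path }) => readFile(path, 'utf8'),
    }),
  },
});

console.log(result.text); // the final answer once the model decides it is done
```

Note how the tool descriptions double as prompt engineering: they are the only way the model learns what each tool is for.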

(23:55):
How do you then approach optimization of this to ensure
you move from, you know, 90% of great to 100% or whatever it is
you're looking to truly solve? Yeah, I think, I mean even maybe
slightly before that because I, I, we have this chaotic software
now, right? Like the, the core kind of

(24:19):
business logic was kind of only expressed in words.
And the LLM is figuring out its own path given the rules you gave it. So that is, again, a new kind of software. It's very different. So we do have to put in all kinds of logging and tracing to understand what it does, right? And if, you know, if you're

(24:39):
professional software engineers,we were probably already like
super deep in that topic. But like, if we skipped it for
now, like you need tracing now 'cause like, like you, the trace
tells you what the thing did. Like it's not always the same
trace. It really depends.
So you gotta, you gotta figure out what's going on and, and
from there you can kind of go and, and, and do the, the

(25:03):
optimization. The optimization itself: folks probably have heard the term prompt engineering, which is basically just kind of, like, golfing the prompt towards being more successful. I think one thing that isn't as intuitive is that if you give a tool to the LLM, that is also an exercise in prompt engineering.

(25:27):
So you need to explain to the LLM what the tool is for; I mean, obviously the signature of the arguments, but also, like, let's say you have a string argument, you need to tell it what goes in there, like what the format is, if there's anything that cannot be expressed in the type system. You have to be really expressive. The way I

(25:49):
think about it is, we're essentially past the stage where, like in GPT-3.5 days, you needed to do all these tricks with prompt engineering. It's largely not needed anymore. But what my advice is, is to treat it
like a junior engineer who you know, isn't the smartest kid on

(26:11):
the block, but has incredible patience with you.
So you can like tell it like, you know, like how things work
and, and then it, it always considers it like, it sometimes
ignores a little bit and stuff, but like it.
But it's like, but the, the education that you give it is,

(26:33):
is actually considered every time the program runs from
scratch because it's, you know, it's really stateless.
It's not a learning system for now, but that also means that
it's, it's always top of mind. Like your business rules are
always top of mind, right? Like so like, but yeah, you have
to write them down. And I know as an engineer, you
know, we, we don't love writing docs.
That's what we do now, because you're writing docs for the

(26:55):
machine, but it feels better because the thing, like, listens to you. It doesn't feel like writing documents that, because other engineers also don't love reading them, they don't read. These things, like, actually read them.
Yeah, at least this one listens to you a little.
Yeah. So that's kind of the thing. So that's how you optimize things.

(27:15):
I don't want to talk about fine-tuning models. Obviously it does help with cost, but it's a whole topic of its own. You also, I mean, there's this whole topic of evals, which you do need to dive into to make sure your thing stays good over time and to kind of optimize it at the edges. So, but that's kind of the loop

(27:37):
that you have to do with the right.
Yeah, Initially there were always a low hanging fruit on
the prompts. Then you, you know, tested with
users, with yourself, you see kind of with those special cases
that don't work, there will be some low hanging ones.
And then eventually you'll, you do have to write key vaults both
kind of that show you how it works in production.

(27:58):
So like basically just in production of implementation of
your, of your system. And also like these type of like
they're almost like unit tests where you effectively just run
your AI, give it a known task with known data and you'll
essentially write assertions that it does the right

(28:19):
recommendations at the end, right?
Yeah, this concept of eval driven development is really
starting to become talked about more similar to, you know, test
driven development has been talked about.
And I mean, you referenced it earlier, this idea of using LMS
as judges, I think is obviously very common now within certain
tasks, like writing, for example, and having it be an

(28:41):
editor for you. But in evals in particular,
there's, you know, an entire field and concept around it.
We just wrote a large e-book about it, actually talking about
like basically how to effectively do LM as a judge.
And I'm curious how you're applying AI as a judge to
evaluate outputs during this optimization phase and then also

(29:02):
within your evaluations? Yeah, you can.
You can use it in two ways. So what?
So one way how you can do it is actually not in the eval phase,
but in the actual agent. So I mentioned how, because your agent effectively is a loop, you have to write some kind of exit condition. And what I would start

(29:23):
with is to just give it a numberof turns, but it doesn't need to
exhaust them. Obviously, if the AI says I'm
done here, then great, right? Well, what you could do is that
you could say, OK, my, my exit condition is actually just
another LLM that I give the taskand the output and, and maybe

(29:43):
the intermediate steps or maybe not.
And I basically just ask it like, what do you think about
this? And I think we've all like had
this experience ourselves, that if you use ChatGPT or Claude or some other chat AI like that, that if it tells you something with high confidence and then you ask, are you sure? And it's like, oh, no, actually, in hindsight, right. And, you know, obviously
(30:05):
like that is because they're doing the next token prediction.
And then in hindsight they're literally saying, this all doesn't make any sense. Well, it's always fun, too, to pit, you know, Claude against GPT or something, right? Like, oh yeah, GPT did this, what do you think? Yeah. And so you can put this insight into software, right? That works. And it also, because it's not bullshit, right, it's actually working, as in

(30:28):
the model now has more data and so it can actually make a better call than the next token prediction, which brings forward the error that it made like 100 tokens before, right, where the data wasn't there. It kind of gets biased and now it gives the wrong answer. Whereas if you give it

(30:50):
mistake, right? So you can put that into your core agent loop. But also very similarly, if you write an eval where you say, OK, given this input, produce the output, you can then ask the LLM to give an eval output on whether that's a good answer or a bad answer. And the reason why that's better

(31:11):
than doing it in the main program loop is that you avoid kind of the extra cost and extra latency from the LLM call. And if you get your agent to be so good that it always does the right thing in the initial turn, then you just don't need the LLM as a judge in your primary loop.
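A sketch of what that looks like on the eval side, again assuming the AI SDK's generateObject: run the agent on a fixed, known input, then have a second model act as the judge and assert on its verdict. The task, rubric, and helper names here are hypothetical.

```ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

// LLM-as-a-judge: grade an agent's output against the task it was given.
async function judge(task: string, output: string) {
  const { object } = await generateObject({
    model: openai('gpt-4o'), // assumption: the judge model can differ from the agent's model
    schema: z.object({ pass: z.boolean(), reasoning: z.string() }),
    prompt:
      `Task given to the agent:\n${task}\n\n` +
      `Agent output:\n${output}\n\n` +
      'Did the output fully and correctly accomplish the task? Be strict.',
  });
  return object;
}

// Usage inside whatever test runner you prefer. `agentOutput` would normally come
// from running your agent on a fixed fixture; a literal stands in for it here.
const task = 'Summarize yesterday\'s firewall anomalies and flag likely causes.';
const agentOutput = '...output captured from a known, reproducible agent run...';
const verdict = await judge(task, agentOutput);
if (!verdict.pass) throw new Error(`Eval failed: ${verdict.reasoning}`);
```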

(31:32):
For teams that are maybe not doing concerted AI development yet, or at least don't have a
system that they're applying, let's say they get really
excited. They're like, you know, Malta
has really got this idea down and like these phases.
Where would you advise them to focus within this framework as
they begin to build their first agents?
Where should they be spending the most time and effort?

(31:53):
Yeah, I think the main thing is, so, first of all, it's new software, a new type of software. So it's clear that most of us don't have a good intuition for it. If you have a, you know, slightly larger organization, a subset is going to have maybe anxiety or just overall not feel

(32:16):
sure about it. And the best thing to do against
that is just to try it out in a non pressure setting.
So we, for example, just did an agent hackathon, just a week, all engineers in the company. And, you know, I mean, it actually had pretty good outcomes and pretty awesome stuff. But the bigger outcome is that now everyone in the company

(32:37):
has built one agent in their life.
And when the business task comes, they come at it from a perspective that they've done it before.
We do have, for the AI SDK, which we're building, a set of examples. You referenced my talk; my colleague Nico had a session about building a

(33:00):
coding agent from scratch, which was super well received and which kind of is a tutorial session that walks you through building an agent. And it kind of shows some of the fascinating stuff that I've been mentioning today, where, you know, he's using not the smartest model on the block.
It's giving the thing three tools and it's not as good as

(33:20):
Claude Code, but it's incredibly good. Like, and now you have built that thing yourself. Like, there are no secret ingredients, right? And I think that kind of builds
the confidence that you can apply the same technique to
other other use cases because you're really not, you know, the
thing that seems like magic. And I think I, you know, at

(33:41):
least the magic in me creates this kind of a little bit of
fear response, a little bit of respect for like, I'm sure they
were super smart people, you know, obviously and throwback
and open AI and Google, they hire like the the smartest kids
on the block. But yeah, this turns out to be,
you know, doing pretty well if you, if you, you know, just do a

(34:03):
little bit of coding. Yeah.
Speaking of things that are, I think going well, it seems very
clear that Vercel has done a lot of cool stuff with AI already and is on the cusp of, I think, a lot more. I know plenty of our audience are familiar with your work, but I'd love to maybe give a bit of an overview before we dive a bit

(34:25):
more into the structure and security of agents, of what you see as the next stages of enabling developers around the world to build with AI. What do you see as, you know, kind of where Vercel's AI SDK is going, and what's the plan, I guess? Yeah, no, totally.

(34:47):
I think, so, it definitely is the AI SDK at the core. We're shipping version 5, so it's probably GA when this podcast comes out. There's a very late beta right now already released. You can just search X. It's very well received; people love it. It's, you know, a breaking change, but of

(35:10):
the type that people love, because we really just listened to what people were struggling with and made it better. And it really kind of drives down on making the things that people want to build for agents be easier to model with TypeScript and so forth.
So that's at the core. And again, like, I would recommend using an existing example to kind of go from. We also published the so-called Chat SDK, which is really a full-

(35:34):
blown, ChatGPT-like chat app. So if that's what you're building, right, if you actually have a UI, there's now, I'd say, a bunch of funded startups that are really just forks of this template, and I think we have our first unicorn in the books.

(35:55):
So that's a good starting point if you want to build a chat UI
because that's kind of tedious, and you kind of want them all to look the same so they're familiar, right? So this has all the features built in. What we are working on is, like, more infrastructure to make Vercel great for doing this type of asynchronous programming that

(36:18):
kind of comes up in the core of the agent, because it runs for quite a while, it does all these, like, tool calls. So you need what you would call very reliable, very persistent compute, where obviously if your 15-minute-running agent fails after five, you need to

(36:41):
have a way to continue from your last checkpoint. And so this is something we're heavily investing in. So we're investing in a queue product directly in Vercel and in workflow management, etcetera.
What people should definitely check out, then, is our AI Gateway. That's also directly integrated with the AI SDK. And the idea is that if

(37:04):
you use the AI Gateway, once you have it configured, you can use every AI model from any provider without any API keys. So, like, if, you know, Kimi K2 comes out and you want to give it a vibe check with your testing applications, that's literally just changing the string from whatever

(37:25):
model you have in there right now to a new one.
And you don't have to, you know, go to the Chinese website and figure out how to get an API key. Same with Gemini, it's probably even harder. But, like, you know, so it's really just a convenience, and that's during development. Like, we had this playground, we gave everyone, like, lower rate limits, and we

(37:46):
thought we could just give everyone this for development as
well. Rate limits are super low at
this point, right? Like, but that's also why it's
free, because how much would youever need yourself, right?
But then if you want to go to production it, you know, you can
also access all of the same models and, and we bill you for,
for market rate here. So it's a, it's, it's really
just our way to get people to really try AI in a, in a

(38:09):
frictionless way, because it's so difficult sometimes to get access to these models.
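The "just change the string" point looks roughly like this in practice, assuming an AI SDK setup where the AI Gateway resolves plain "provider/model" identifiers; the specific model ids are examples, not recommendations.

```ts
import { generateText } from 'ai';

// With the AI Gateway configured, the model is just a string, so vibe-checking a
// newly released model is a one-line change and needs no new provider API keys.
const MODEL = process.env.EVAL_MODEL ?? 'openai/gpt-4o';
// e.g. to try a new release: EVAL_MODEL='moonshotai/kimi-k2'

const { text } = await generateText({
  model: MODEL, // resolved through the gateway rather than a per-provider SDK
  prompt: 'Classify this support ticket as bug, billing, or how-to: "My deploy hangs at build."',
});
console.log(text);
```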
And then the final point is that we do work on, like, we have our marketplace.
So, like, there are lots of things that you just want to do, from workflows, memory, browser use, etcetera, that we are not all going to build ourselves, where we make them available through the marketplace. As a

(38:30):
bit of a special case, we also ship the Sandbox.
So this is for the case where you, I mean, either you're
actually building a coding agentor what happens is that your
agent says, well, actually I'm really bad at math, but I just
generated some source code that would solve this problem.
And so you, you need a place to run it because it's like, you
know, might be very insecure code.

(38:52):
So that's what sandboxes are for, so that you can just
essentially run this like one off code in a in a in a safe
manner where there's no access to any of your secrets or any of
that stuff. Speaking of security, there are
plenty of valid concerns around everything from prompt injection
to personal identification leakage.

(39:13):
How do you think developers should be considering security
and privacy as they build AI tools?
Yeah, yeah, I mean, thanks for mentioning it.
It's a, it's a really important topic.
Like, there are several threat vectors. Prompt injection is the most concerning one.

(39:35):
Folks probably have heard about this like it's because it goes
through the news when people like jailbreak ChatGPT or, you
know, make some model do something that it wasn't
designed to. I think these jailbreaking cases, they're usually referring to, like, someone who has access to the model, prompting it themselves, so they have

(39:57):
relatively, you know, full access. That's kind of the last generation. Now, prompt injection happens when you control the AI fully in the back end, but some user-controlled input gets in there.
Right. And what we have to understand is that, like, I talked about giving it these tools,

(40:19):
right? Tools maybe reading from your database, and the responses go into the prompt. And the only thing stopping the model from literally taking that data as gospel for what it's supposed to do is that you kind of tell it not to. But obviously, as we know from jailbreaking, people find ways around that, right? So that's really,

(40:41):
in a way, mind blowing, right? So, like, the responses of my tools become part of the prompt, period, but not in any way jailed off or treated in any special way. From the model's perspective, this is just like the rest of the prompt. And so it's really, like, a similar
situation to SQL injection, but with SQL injection, like,

(41:03):
I think we're in a world where, if you follow best practices, it's not a problem, right? Like, you escape the user inputs and there's zero risk. Now, obviously people get it wrong, and so that's why there are still SQL injection attacks. But, like, in principle it's a solved thing, versus with prompt injection, like, there's nothing you can do.

(41:25):
Like, it just becomes part of the prompt. OK, I shouldn't say there's nothing you can do, but you cannot make it safe, as in "I 100% ensured that I escaped my SQL content," right? It's not like that. And so we unfortunately have to put security at a different layer.

(41:47):
So, like, I mean, you can sanitize the inputs, right? You can check them against malicious stuff. Very difficult, right? Because you can prompt inject in Spanish or whatever, right? Like, it's actually not hard to get around. But yeah, I would still encourage doing it.
But then the other thing is: what is the worst thing that can happen, right? That is an important question.

(42:12):
And so, like, I'll give you an example. If I give my LLM a tool to make a SQL query, and it's really meant to only search within a single tenant in my database, or a single user, right?

(42:33):
But the part where that's enforced is part of the tool, like, it's just an argument where it's supposed to put in the user ID. Then I can trick that tool into making arbitrary queries, right? If instead I give the LLM a tool that has hard-coded kind of what the conditions of the query are, then I cannot escape that, right? And so I need to, like, do stuff like that, where I go and

(42:57):
make my tools so constrained that even if I consider everything that is kind of part of the AI-generated part of the query, if I consider all of that attacker controlled, the attacker shouldn't get anything that they shouldn't get, right?
So I do have to think in these defense-in-depth terms,

(43:19):
rather than being able to say, like, you know, this is trusted software and I can always trust everything it does.
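Here is a rough sketch of that "constrain the tool" idea in AI SDK terms: the tenant is fixed server-side from the authenticated session, so nothing the model writes into the tool arguments can widen the query. The table, the `sql` helper, and the exact option names are assumptions for illustration.

```ts
import { tool } from 'ai';
import { z } from 'zod';
import { sql } from './db'; // hypothetical parameterized-query helper

// Anti-pattern: parameters: z.object({ tenantId: z.string(), ... })  <- injectable via the prompt.
// Instead, close over the tenant from the session so the model never controls it.
export function makeSearchOrdersTool(tenantId: string) {
  return tool({
    description: "Search the current customer's orders by free-text query.",
    parameters: z.object({
      query: z.string().describe('Free-text search over order descriptions'),
      limit: z.number().int().min(1).max(50).default(10),
    }),
    execute: async ({ query, limit }) =>
      // tenant_id comes from the authenticated session, never from the model's arguments
      sql`SELECT id, description, status
          FROM orders
          WHERE tenant_id = ${tenantId}
            AND description ILIKE ${'%' + query + '%'}
          LIMIT ${limit}`,
  });
}
```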
Are there other critical security and privacy
considerations that builders should be prioritizing when
creating and deploying AI agentsor other AI systems, especially
ones that are interacting with sensitive data, like for

(43:40):
example, the customer support agents or performing other
actions within enterprise systems?
Yeah. The other big one is just broadly data exfiltration. So this is where you try to, like, make the model reveal data to a third party that it wasn't intended to. Both GitLab and GitHub actually had

(44:04):
essentially the same vulnerability in short order, and it's actually a good example of what I was describing, right? Basically, people just put up, I think, pull requests to public repositories, which are attacker controlled, right? And into these pull requests, they just put prompt instructions. And because, you know, that now becomes part of what the

(44:26):
AI does, right? And so the next step is, you know, now this AI, let's say it was a coding copilot, it usually can't do that much, and the first thing it can do is make a comment. Well, what they did is they said, OK, the comments are markdown, I get to put images in, and I get to put query strings on the images,

(44:47):
and I get to put the entire source code of the repository into the query string, or whatever, right? It was something like that, and so the lesson is that you have to treat the output of the LLM as untrusted.
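One common mitigation for that exact pattern, sketched here with assumed helper names and an assumed allowlist, is to strip or rewrite image references in model-generated markdown before rendering it, so untrusted output cannot smuggle data out through query strings.

```ts
// Treat LLM output as untrusted: drop markdown images that point at hosts we don't control.
const ALLOWED_IMAGE_HOSTS = new Set(['assets.example.com']); // assumption: your own asset host

export function stripUntrustedImages(markdown: string): string {
  // Matches markdown image syntax: ![alt](url ...)
  return markdown.replace(/!\[([^\]]*)\]\(([^)\s]+)[^)]*\)/g, (match, alt, url) => {
    try {
      return ALLOWED_IMAGE_HOSTS.has(new URL(url).host)
        ? match
        : `[image removed: ${alt || 'untrusted source'}]`;
    } catch {
      return '[image removed]'; // relative or malformed URLs are dropped as well
    }
  });
}
```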
We've had to do exactly this with Galileo, where we have these Luna-2 small language models we developed to enable

(45:09):
real-time guardrailing, where we will actually apply guardrail metrics against both the inputs coming in and then the outputs as well. Because, to your point, you know, if something gets missed, you need to also see, hey, is this leaking personally identifying information? You know, is data being exfiltrated here? And you need to be able to block that risky content on the fly, or else there

(45:31):
are major risks around this and potential compliance vulnerabilities that you can run into if you're, you know, a regulated company. So I love that you're thinking
about this because I think it's really crucial that you, you
know, refine the rules for your AI systems without having to
constantly redeploy code or elseyou put your whole system at

(45:54):
risk. And, you know, the, the magic
and the beauty of an LLM is it can do things you don't expect
and can solve problems without you telling it exactly what to
do. But we have to account for that
in the other direction as well. No, I honestly agree.
It's like it's in, you know it is it's really early days,

(46:15):
right? Like this is going to be, I
mean, and, and across all dimensions, right?
Like we learn, we're learning much better how to build these
programs. We will learn more about its
security properties. I think we, but we know enough
today that it's certainly clear that it's something you have to
pay attention to. What haven't we covered so far?

(46:37):
That is, you know, burning in your mind right now around
agents or AI or, or what builders should be doing.
Yeah, I think we did a good overview. Like, I mean, as a Vercel employee, what I pay a lot of attention to is kind of just what are people building and how can I make that better at scale, right? And the way we do that

(47:02):
is that we go look at the early adopters and get their feedback. And then we try to build abstractions that make doing the thing that people maybe did by hand something that everyone else can do with a few lines of code. And so we're definitely in that
phase where where that's what's going on.
But, but, but that's also that'salways, that's, that's part of

(47:25):
the fun part. And, and in parallel, I think we
like, you know, we're collectingkind of use cases, right?
Like right now, I think we have this phase where things are
pretty advanced in the, in the coding space and other, other
verticals that kind of, you know, interested, but like not

(47:48):
as advanced. And so I'm, I'm, I'm definitely
spending a lot of time finding patterns and, and happy to very
excited to share those in the future.
Are there particular companies that you see doing, Let me
rephrase this, are there particular companies that you
think are doing excellent work or have use cases that are

(48:09):
especially exciting? Yeah, this is this is a good
question. I think the way that the space is, I think, set up right now is that, you know, you have the coding space, which I think obviously is exciting, and I think there's more innovation to come, right? Like, I don't think we're in any way at a steady state.

(48:29):
Now, I already mentioned that I am personally very excited about AI-based code review, and we talked about security; I think that's particularly interesting inside there. So, like, I think we'll live in a world where, like, I don't think there's a winner-takes-all market for a review agent,

(48:52):
because I want, like, the security expert on this topic. And then I want the world-class team on this other topic, and I also want their stuff. And, like, you know, the marginal cost for me of having two is very low, right? So I really see us being in a world where, like, you hire, quote unquote, all these

(49:12):
experts. So that, that's going to be a
thing. Obviously we are quite far along
on the, on the support agent side.
And I think we, we, that's, that's a, you know, certainly
the most mature side on the Internet.
I'm personally super excited andinvesting a lot of time in

(49:33):
agentic commerce, which I think is going to be the first use
case in the, in the kind of consumer side that that's going
to take off beyond the, the chatbots, like the, the, the big
chat bots, right where there, there's just like so much
creativity to, to be unlocked, for example, in, in finding like

(49:56):
in doing coherent clothing styles between various brands
and so forth, where like AI really, really, really can help
in maybe that's just me. I would love to have someone buy
my clothes for me. No, I I think there are some
exciting opportunities in e-commerce, no question.
And Speaking of exciting opportunities with AI, I'd love

(50:17):
to give our audience the opportunity to learn a bit more about Vercel and what you're building. Where can our listeners go to keep up with the truly innovative work that Vercel is doing in the AI space?
Yeah, so we have a few kind of things in the space that I think people should check out. I talked a lot about the AI SDK. That's probably the most

(50:39):
tangible thing, right? And I think, you know, we have the new version coming out. If you start today, go with the beta; if you listen to this podcast when it actually comes out, probably version 5 is in GA. So ai-sdk.dev, that's the place to go. I also mentioned the Chat SDK, which is, like, the specific "I'm going to build a chat bot and I'm starting with one that already has all the features."

(51:01):
So if that's what you're going for, I definitely encourage people to go there. We did drop a mention of v0 a couple of times in this conversation. So v0.dev is our software that knows how to take your idea into a working full-stack app.

(51:22):
And yeah, I think that has been revolutionizing how people collaborate in the early stages of a project. And, you know, for some folks it actually goes all the way into production. But I think what we are really focusing on is that, in this ecosystem of kind of folks who are, quote unquote, tech adjacent and people who are

(51:42):
writing code, there's a really smooth transition between them. And yeah, otherwise you can always find me on the X platform, the everything app. And, you know, on LinkedIn, if that's your vibe. We're trying to be super accessible, answering questions, especially, like, my number one priority is

(52:04):
that you guys are having an easy time deploying agents to Vercel. If that doesn't work, please ping me, DM me, etcetera, because that's our number one priority, that this new type of software actually has a good place to run in the future.
Yeah, I believe it's @cramforce, if I recall correctly.
Yeah, I don't want to. I don't love it, but it's

(52:25):
unfortunately way too late to change.
It's memorable, so that's good. It is memorable. Malte, so good to chat with you. Fantastic to catch up, and thanks so much for joining us on the podcast today. It's been a ton of fun. Thanks for having me, this was super fun. That's a wrap for today's episode. Malte, thanks again for coming on.
And as we continue Season 2 of Chain of Thought, we're

(52:45):
committed to bringing you even more valuable conversations with the brightest minds in AI like Malte, even if they are occasionally throwing a little water on the fire that is our burning enthusiasm. I think that's a really important element of these conversations. And it's one of the reasons why I was particularly excited to have Malte on: not only are Malte and the team at Vercel

(53:08):
building incredible things like their AI SDK and v0 to enable developers and builders and non-technical folks around the world to build with AI. But they're also really thinking
through, you know, what do you need to do to observe, to
evaluate, to improve your AI applications?
And our, our goal is to empower builders around the world,

(53:29):
whether they're full time engineers, whether you're
someone who's learning with the the knowledge and insights you
need to build the future of AI, to build your website, to build
whatever it is you want. Because that is the exciting
part of these AI tools is the creativity that it can unlock
and the speed with which it can enable you to get to that MVP.
And the best way that you can help us reach more leaders like

(53:52):
Malta is by leaving a quick rating or review on your
favorite podcasting app, you know, giving us a like on
YouTube, if you're watching on there.
It really does make a difference.
It helps us to have more incredible conversations like
this one. And with that, thank you so much
for tuning in. And we'll catch you next week
with another episode of Chain of Thought. Malte, thanks again, it's been a pleasure.
Thank you.