
June 24, 2025 • 36 mins

Master the art of building AI agents and powerful AI teammates with Diamond Bishop, Director of Engineering and AI at Datadog. In this deep dive, we explore crucial strategies for creating self-improving agent systems, from establishing robust evaluations (evals) and observability to designing effective human-in-the-loop escape hatches. Learn the secrets to building user trust, deciding between prompt engineering and fine-tuning, and managing data sets for peak performance. Diamond shares his expert insights on architecting agents, using LLM as a judge for quality control, and the future of ambient AI in DevSecOps. If you're looking to build your own AI assistant, this episode provides the essential principles and practical advice you need to get started and create systems that learn and improve over time.


Guest: Diamond Bishop, Director of Engineering and AI at Datadog

Learn more about Bits AI SRE: https://www.datadoghq.com/blog/bits-ai-sre/

Datadog MCP Server for Agents: https://www.datadoghq.com/blog/datadog-remote-mcp-server/


Sign up for A.I. coaching for professionals at: https://www.anetic.co


Get FREE AI tools

pip install tool-use-ai


Connect with us

https://x.com/ToolUseAI

https://x.com/MikeBirdTech

https://x.com/diamondbishop


00:00:00 - Intro

00:03:55 - When To Use an Agent vs a Script

00:05:44 - How to Architect an AI Agent

00:08:07 - Prompt Engineering vs Fine-Tuning

00:11:29 - Building Your First Eval Suite

00:26:06 - The Unsolved Problem in Agent Building

00:31:10 - The Future of Local AI Models & Privacy


Subscribe for more insights on AI tools, productivity, and agents.


Tool Use is a weekly conversation with AI experts brought to you by Anetic.

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
I don't want an agent that's running around and could just take down my system because it's trying to help. I want a self-improving system, a self-improving agent system. Every time someone uses my product or my agent and it works well, I want that to feed back into the data set. Every time it doesn't work well, I want that to feed back into the data set, and then I want to see that automatically improve my system over time.

(00:20):
Welcome to episode 45 of Tool Use, the weekly conversation about AI tools and strategies brought to you by Anetic. I'm Mike Bird, and today we're talking about building AI teammates: the agents, the evals, what you need to know to make yourself an assistant. This week we're joined by Diamond Bishop, the Director of Engineering and AI at Datadog. Diamond, welcome to Tool Use.
Hey, thanks for having me.
Absolutely. So do you mind giving us a little bit of your background?
Sure, I'm a longtime AI and ML guy.

(00:42):
I'll be dating myself a little bit, but I've been in this field for 15-plus years, having worked at one point on the assistant for Windows Phone. That's how long ago this was: on Cortana. Working on Alexa after that for a while, helping get that from beta to GA, and working on a bunch of fun AI products, jumping around.

(01:03):
I worked on PyTorch at Facebook for a little while, ran my own startup that was building AI assistants, and now I work on AI agents and evaluation tools and all the like for Datadog.
Very cool. Would you mind giving us a little insight into what Bits AI is?
Bits AI is the main product that I drive here right now, and Bits AI is really meant to be a set of AI agents that help you get things done across Datadog.

(01:26):
So anything you would use for DevSecOps. You know, if I'm going to be debugging a problem that's coming through my system, Bits will help you debug, run log queries, run trace queries, all sorts of stuff like that. And really our goal in general is to either sit alongside you and save you time, or ideally be able to do tasks for you so that you don't have to. So I always say I'm trying to get people to use Datadog less.

(01:49):
I don't know if my CEO likes me saying it quite that way, but that's generally the goal for our assistant.
What do you consider an important design principle or approach to make sure that you build trust with the users, so they can start putting their agents in these high-value situations?
Yeah, that's very important, especially if you're talking about production systems, right? I don't want an agent that's running around and could just take down my system because it's trying to help.

(02:11):
So a lot of it for us right now, there's kind of two parts. One is making sure that we're understanding the quality of the agents that we're putting out there. This is a bit of that eval aspect. And another part of it is building the right, I call it kind of either escape hatches or double checks, right? Where it's like, I want to make sure that if you're going to take an action, the human checks first.

(02:31):
We keep the human in the loop for a lot of the major actions. Over time we expect to reduce that, kind of like the self-driving world, right? There's a bit of, if we're on the highway and everything's good, you don't have to do very much, but if I need something, I'm going to change it up. I kind of want to make sure that the human's still there and can kind of make their choice. We're not quite at the level of the Waymo situation where we can be completely, you know, hands off the wheel.

(02:53):
And we think that's very important for earning trust: surface the right information to our customers, surface the right expectations, and then make sure the human can do things like check a PR if we're trying to check in code, you know, hit the button to actually spin up some servers if we think that's the best thing to do next to solve a problem. And we won't do that automatically for you yet.
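
To make the escape hatch idea concrete, here is a minimal sketch of an approval gate in an agent's tool-execution loop. It is an illustration only, not the Bits AI implementation: the tool names, the RISKY_TOOLS set, and the helper functions are hypothetical placeholders.

# Minimal sketch of a human-in-the-loop "escape hatch" for risky agent actions.
# Hypothetical example; tool names and the approval channel are placeholders.

RISKY_TOOLS = {"spin_up_servers", "merge_pull_request"}  # actions needing sign-off

def run_tool(name: str, args: dict) -> str:
    # Placeholder for the real tool execution (API call, script, etc.).
    return f"executed {name} with {args}"

def ask_human(name: str, args: dict) -> bool:
    # Placeholder for a Slack/Teams prompt; here we just ask on the console.
    answer = input(f"Agent wants to run {name}({args}). Approve? [y/N] ")
    return answer.strip().lower() == "y"

def execute(name: str, args: dict) -> str:
    """Read-only tools run immediately; risky ones wait for a human."""
    if name in RISKY_TOOLS and not ask_human(name, args):
        return f"{name} skipped: human declined"
    return run_tool(name, args)

print(execute("run_log_query", {"query": "status:error"}))   # runs directly
print(execute("spin_up_servers", {"count": 2}))              # gated on approval
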

(03:14):
Very cool, and I definitely want to dive deep into evals, but just about the escape hatch idea: I've had some people on the show talk about human in the loop and how important it is. What types of modalities or interfaces do you think are appropriate for kind of managing your fleet of agents right now? Is it just going to be a series of pop-ups? Will we get emails or texts? What do you think is a good interface for that?
Our goal in general is to try to be where you work. So for a lot of our customers,

(03:36):
that's like Slack or Teams. So right now Bits will actually talk to you in Slack. And if it needs to show you more information, we have a page that it will bring you to that you can click through to. But for very simple things, we're going to reach out to you. We're going to give you a summary of what Bits is doing. We're going to ask you a question. We're just going to do that in Slack, just like another teammate on your team would do.
Perfect.
Now kind of pulling it back to the building of agents.

(03:58):
Do you have either heuristics or just a general principle for when something should be an agent versus just making a script for it? What are the criteria to kind of bring AI into the mix versus just having deterministic code run through a process?
I think the very first part is: can you do it with deterministic code? If you can, you should; no reason not to. I think it's Satya who said something like, I don't

(04:21):
use my Ferrari to go across the street to get groceries, or something like that. I think he was talking about different-sized models, but I think it's very applicable here too, where you have all this power and this power is useful for certain things. You don't need all that power if it can be solved with a couple lines of code. So the first thing I always ask people is, have you tried writing this in traditional software?

(04:42):
Do you know that the test cases you have can't be solved with that? And then most of the time when it can't be, it's because there's something ambiguous, or there's some long tail of options you'd have to try. So for the SRE agent, the reason it can't just be a simple workflow is that anyone who's debugged a system knows that you can't just follow a few

(05:02):
steps in your runbook, because if you could do that, we would just automate that. Any programmer would have automated that already. So there's a lot of this reasoning and kind of making choices along the way, and you might branch off once you see some data. So I think a lot of it comes down to: will you change the workflow or change your mind on what needs to be done based on the data that you see? And is there a very long tail of

(05:25):
cases that you're not just going to be able to encode in a bunch of if-else statements?
Yeah, no, absolutely. The fuzzy nature of LLMs gives them their power, but as soon as you rely too much on it, that's when hallucinations and mistakes kind of get in the way and derail a process when you're going about building an agent. So we've already determined that there's some ambiguity there, so we need the LLM to kind of take ambiguity and make sense of it. What are some thoughts you have

(05:45):
around architecting an agent? Are there frameworks you follow? Do you try to make sure that there's checks and balances along the way, or how do you architect a new agent?
I think a lot of it in the very beginning is less about the framework and more about you as a dev or your team knowing that you have appropriate observability into what's happening.

(06:07):
So you need to be logging kind of what your agent is doing and have ways to kind of replay the choices that were made. And that's more important than any particular framework to me. And a lot of that is just back to, the first time you build something, there's going to be so many things that go wrong. And because it's a stochastic,

(06:27):
kind of fuzzy system, if you don't have really good logging and debuggability, you're just going to kind of be banging your head against, you know, the door, the wall, whatever it is that you're close to, for quite a while while you're trying to figure out how to actually make your system work well.
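
As a minimal illustration of that kind of logging, here is a sketch that records every step an agent takes as a structured JSONL event so a run can be replayed later. It is not Datadog's implementation; the event fields and file layout are assumptions made for the example.

# Sketch: structured, replayable logging of agent steps (illustrative only).
import json, time, uuid
from pathlib import Path

LOG_PATH = Path("agent_runs.jsonl")  # assumed location for this example

def log_step(run_id: str, step: int, kind: str, payload: dict) -> None:
    """Append one agent decision/tool call/observation as a JSON line."""
    event = {
        "run_id": run_id,
        "step": step,
        "kind": kind,          # e.g. "llm_call", "tool_call", "observation"
        "payload": payload,
        "ts": time.time(),
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(event) + "\n")

def replay(run_id: str) -> list[dict]:
    """Reload every step of a run, in order, for debugging."""
    events = [json.loads(line) for line in LOG_PATH.read_text().splitlines()]
    return sorted((e for e in events if e["run_id"] == run_id), key=lambda e: e["step"])

run_id = str(uuid.uuid4())
log_step(run_id, 1, "llm_call", {"prompt": "What service is failing?"})
log_step(run_id, 2, "tool_call", {"tool": "run_log_query", "query": "status:error"})
print(replay(run_id))
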
And just on that note of observability, is that an area that agents can help us with?
It is. I think one of the things that I'm thinking a lot about right now, and this is not a product that exists yet, is, you know,

(06:49):
who watches the watchmen, in some sense? Like, can I build an agent that's smart enough and knows how to build other agents, or knows how to do this kind of eval process and improvement process, that I can get some guidance from? Because especially if I'm someone who doesn't build these normally, and we have a lot of people who are getting into the field now, I don't always know: what does it mean to build an eval data set? How do I actually go and then

(07:11):
run this eval? How do I go and test new ideas out? How do I go and make sure it's improving? How do I hill climb against it? Should I be fine-tuning? Should I be changing my prompts? Should I be doing a bunch of things like that? I think there are ways to actually get help from another AI system to kind of inform you on what to do better, too.
And it's very similar to what we do with the alert investigation work that happens for the SRE agent, where you might go and look at a bunch of logs and say, hey, here are some errors we're

(07:33):
seeing. You might go cluster a bunch of information and say, we're doing error analysis automatically for you, and by the way, your agent is really bad at answering questions about, you know, what someone's doing in the morning, if you're a calendar agent or something. And then it might not be able to solve it for you automatically, but it could at least surface this information. So I absolutely think that

(07:56):
there's a world where there's almost this secondary agent or AI system that helps improve the one that you're building.
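
A very small piece of that kind of automated error analysis can be as simple as clustering failing cases by a failure signature and surfacing the biggest buckets. The sketch below is a hypothetical illustration, not a Datadog feature; the failure records and the signature function are made up for the example.

# Sketch: surface the biggest clusters of agent failures (illustrative only).
from collections import Counter

# Hypothetical failure records pulled from eval runs or production logs.
failures = [
    {"task": "morning briefing", "error": "missing calendar scope"},
    {"task": "morning briefing", "error": "missing calendar scope"},
    {"task": "log triage", "error": "query timeout"},
    {"task": "morning briefing", "error": "missing calendar scope"},
]

def signature(record: dict) -> str:
    """Crude failure signature: task plus error string."""
    return f'{record["task"]} :: {record["error"]}'

clusters = Counter(signature(r) for r in failures)
for sig, count in clusters.most_common(3):
    print(f"{count}x  {sig}")
# A human (or a secondary agent) reviews the top clusters and decides what to fix.
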
Yeah, I think we're going to start seeing a lot more of that too. I had someone on who talked about the idea of it being almost more of a video game interface rather than a classical text box. One thing you mentioned, and I'd like to hear just your general thoughts on it, is prompt engineering versus fine-tuning. Do you have a line where you choose one or the other, or is it so situational that there are no real rules for it?
I think this is kind of back to: what does your experimentation

(08:19):
framework look like, and what have you tried so far? So always try prompt engineering first. It's cheap, it's easy to do, you can try a bunch of things very fast, and as long as you have a good eval data set, you can try it and see if it works. And if it works, great, maybe you don't need anything else. I think once you start getting to a point where you're not getting any return by doing

(08:40):
prompt engineering, by changing the context, and you have enough data, which "enough data" is fuzzy, but you know, you've got a couple hundred, a couple thousand data points that you think are reasonable, then try fine-tuning. It's not that expensive to try. A lot of the proprietary providers offer APIs for doing that now, so see if you get any gains from it. The big risk with fine-tuning is

(09:02):
kind of the same risk as training your own model. New models are going to come out all the time. You want to be ready to try those out too. You don't want to lock yourself in. So if it's a ton of work to do fine-tuning, don't spend the time. But I think it should be getting easier over time to just say, hey, I'm going to quickly try some fine-tuning compared to my eval. If it doesn't do anything, great, throw it away. If it does something that I think is measurable, yeah, let's

(09:24):
use it. And then when the new model comes out, let's go compare the two again. Let's do some A/B testing. Let's say: does the newest, you know, Anthropic model beat my fine-tuned model? If it does, throw out the fine-tuned model and just feel comfortable with that.
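
As a rough sketch of that comparison step, the snippet below runs two candidate systems, say a fine-tuned model and a newer base model, over the same eval set and reports pass rates. The candidate callables and the grading function are stand-ins you would replace with real model calls and your own metric; nothing here is a specific vendor API.

# Sketch: A/B-compare two candidate systems on the same eval set (illustrative).

eval_set = [  # tiny hypothetical eval set: input plus an expected substring
    {"input": "Summarize: checkout latency spiked after deploy 42.",
     "expect": "deploy 42"},
    {"input": "Summarize: payments errored due to expired TLS cert.",
     "expect": "TLS"},
]

def fine_tuned_model(prompt: str) -> str:
    return "Latency spike traced to deploy 42."           # placeholder output

def new_base_model(prompt: str) -> str:
    return "Errors caused by an expired TLS certificate."  # placeholder output

def passes(output: str, case: dict) -> bool:
    """Simple pass/fail grade; swap in your own custom metric."""
    return case["expect"].lower() in output.lower()

candidates = {"fine_tuned": fine_tuned_model, "new_base": new_base_model}
for name, model in candidates.items():
    score = sum(passes(model(c["input"]), c) for c in eval_set) / len(eval_set)
    print(f"{name}: {score:.0%} pass rate")
# Keep whichever wins; re-run the same comparison when the next model ships.
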
Yep, there's actually a project that's going to have their V3 release soon called Augmentoolkit, and it should really streamline the process of fine-tuning models, which I think will help democratize it. But I agree, prompt engineering

(09:47):
first; try to exhaust that route completely. On the note of data sets for creating these evals and whatnot, do you have a strategy for either automating it, or is having your golden data set so important that it should always be handwritten?
I think it's a mix.
I don't think that it needs to be all handwritten, but I do think that you need to have kind of human alignment in this

(10:09):
aspect. So a lot of it in the beginning is, you don't have any data. Most people, right? You start, you're just building this agent or the system, and you have maybe a few examples, like five examples that you care about. So start with that. Start with whatever you have. That's fine. But then you need to move quickly to either generate a lot more, either synthetically, which is possible, so you can take examples that you have,

(10:30):
even use LLMs to try to produce new examples, or find other ways to perturb your examples and produce a wider set of synthetic data. And then ideally you'll move quickly to get real-world data with your system, and start to have either you or some other customers kind of interacting with your system and logging this, and starting to build up a data set. And a sampling of that should go

(10:51):
to humans. So always take a sampling of that and do your human annotation or human check to see what ground truth should be. But I think if you can, if your system is set up for it, you should be doing LLM as a judge for a chunk of it, and doing what's called alignment to compare human evaluation versus the LLM as a judge, and make sure that you bring those two closer together so you can trust the LLM as a judge.

(11:12):
And then that way you can run and build ground-truth or close-to-ground-truth data sets for a lot larger data than you as a single person are going to be able to do. You can't go and tag or annotate like 1,000 or 2,000 data points, but you can annotate 100, and that's completely reasonable, and you should do that in the beginning.
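
To illustrate the kind of synthetic expansion Diamond describes, here is a small sketch that perturbs a handful of seed examples into more eval cases. The perturbation rules are deliberately trivial and hypothetical; in practice you might have an LLM rewrite the seeds instead.

# Sketch: grow a tiny seed set into more synthetic eval cases (illustrative only).
import itertools, json

seeds = [
    {"input": "Why is the checkout service slow?", "expect": "checkout"},
    {"input": "Why did the payments service return 500s?", "expect": "payments"},
]

# Simple surface-level perturbations; an LLM rewrite step could replace these.
prefixes = ["", "Urgent: ", "Hey team, "]
suffixes = ["", " Please investigate.", " This started an hour ago."]

synthetic = []
for seed, pre, suf in itertools.product(seeds, prefixes, suffixes):
    synthetic.append({"input": f"{pre}{seed['input']}{suf}", "expect": seed["expect"]})

with open("synthetic_eval.jsonl", "w") as f:
    for case in synthetic:
        f.write(json.dumps(case) + "\n")

print(f"{len(seeds)} seeds -> {len(synthetic)} synthetic cases")
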
So on top of collecting the data sets, what other things do

(11:33):
people need to get into place in order to start building these eval suites that they can either run on their project or even have as a personal eval suite?
I think there's a couple things here. One of them is really knowing how you're going to actually measure success. So you need to have your success metric defined, and how you're going to actually do evaluation to say this test case passed or

(11:54):
not. And I recommend, especially in the beginning, that it should basically be a pass or fail for most cases, unless it's very quantifiable. Because if you give people like a Likert scale of one through five, or you give like 10 dimensions that you're trying to measure, it's just too much in the beginning. You're not really thinking enough about what's the core problem you're trying to make

(12:15):
the system solve, I think. And you can add on things later. You can add new dimensions that you care about later, but you really just want a simple, single, probably custom metric for yourself. The other part that I recommend is don't use off-the-shelf metrics. A lot of people will just pick up a library and be like, I'm going to use a hallucination metric. It's like, yeah, that's nice to know, but that's probably not your main thing. If you're building an agent that

(12:37):
you know is made to help people write a report on a particular set of data, you probably have an idea of what the success of that report looks like. And you should be able to measure that yourself in a custom way, rather than saying did it hallucinate or not. So think a lot about that. And then after you kind of have a measure of your evaluation, you need to have a process for

(12:58):
sampling data, which means you need to have a logging process. You need to have a way to get feedback, and you need to have a process for annotating that data. And that looks like usually grabbing Label Studio or something, having a human process for doing some labeling, and then setting up some sort of LLM as a judge alongside that. So I think those are kind of like the early steps.
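
Here is a minimal sketch of that idea: one custom pass/fail check for your specific task, run over an eval set, instead of a generic off-the-shelf metric. The report-writing check, the stubbed agent, and the cases are assumptions made up for the example.

# Sketch: a single custom pass/fail metric for a specific task (illustrative only).

eval_cases = [  # hypothetical eval cases for an "incident report" task
    {"input": "Write an incident report for the checkout latency spike.",
     "service": "checkout"},
    {"input": "Write an incident report for the payments 500 errors.",
     "service": "payments"},
]

def report_passes(report: str, case: dict) -> bool:
    """Custom check: report must name the affected service and stay under 120 words."""
    return case["service"].lower() in report.lower() and len(report.split()) <= 120

def generate_report(prompt: str) -> str:
    return "The checkout service saw elevated latency after the 14:00 deploy."  # stub agent

results = [report_passes(generate_report(c["input"]), c) for c in eval_cases]
for case, ok in zip(eval_cases, results):
    print("PASS" if ok else "FAIL", "-", case["input"])
print(f"pass rate: {sum(results) / len(results):.0%}")
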
Excellent. And on the note of LLM as a judge, especially aligning it with human judgment, is

(13:20):
that a very manual process? Because I imagine you have to kind of run the LLM as a judge against a certain data set and then look at it yourself and just say, do these match, or is there any way to kind of expedite that process?
It's pretty manual in the beginning. I would say the main way to expedite it is just make sure that you have experts you trust. That's either you or someone else on your team, or people you're working with, maybe your

(13:40):
customers, sometimes, if you're doing an early design partnership with someone. Because the people who actually care about that output and write it normally are going to be the ones who can most quickly tell you if it's a good result or not.
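
A simple way to quantify that alignment check is to compute how often the LLM judge agrees with the human labels on the same sampled cases. The sketch below is generic; the label lists and the 90% threshold are made-up placeholders, not a recommendation from the episode.

# Sketch: measure agreement between an LLM judge and human labels (illustrative).

# Hypothetical verdicts on the same 10 sampled cases (True = pass, False = fail).
human_labels = [True, True, False, True, False, True, True, False, True, True]
judge_labels = [True, True, False, False, False, True, True, True, True, True]

agree = sum(h == j for h, j in zip(human_labels, judge_labels))
agreement = agree / len(human_labels)
print(f"judge/human agreement: {agreement:.0%}")

# Rough rule of thumb: only lean on the judge for large-scale grading once
# agreement on a human-checked sample is high enough for your use case.
if agreement < 0.9:
    print("Judge not yet trustworthy; refine the judge prompt or rubric and re-check.")
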
Pulling it back to building the AI agents: when someone's getting started, they hear about all the promise and they want to start investigating, putting one together themselves. Have you noticed a trend in issues that people will face from their first build that,

(14:02):
if they had a little bit of advice, they might make fewer mistakes?
I think the big one is thinking about how you're going to evaluate it from the beginning. There is a natural instinct, and we run into this ourselves all the time within Datadog, to build something quickly that looks cool. You're like, oh yeah, it demos well, looks great, let's ship it. You know, and you don't realize you're going to be

(14:22):
spending the next six months just doing eval and understanding if it worked, how it works, and what you're going to do. And if you don't think about that from the beginning, you sometimes miss things that you could have done for cheaper.
So a good example of this is, if I'm building an interface with an agent, I should very simply have a way for a customer to give us explicit feedback, or I should have a way to get implicit feedback from a customer.

(14:44):
So that's a thumbs up, thumbs down. That's knowing if they click through on the thing they want. And if I don't build that in from the beginning, it becomes hard for me to get that understanding of how well it's working. And I think that's also true for, back to the human-in-the-loop aspect, you have to assume your agent is not going to be right every time. These systems are stochastic. These systems are wrong

(15:05):
frequently. And you need to have designed for, or at least start to think about, what does it mean when it's wrong, and how do I build a good process that doesn't require it to either be 100% correct or we throw it out, because it's never going to be 100% correct.
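
As a concrete sketch of building that feedback capture in from day one, the snippet below records explicit thumbs up/down and implicit click-through signals against each agent response. The event schema and storage are placeholders, not any particular product's design.

# Sketch: capture explicit and implicit feedback per agent response (illustrative).
import json, time
from pathlib import Path

FEEDBACK_LOG = Path("feedback.jsonl")  # assumed sink; could be a DB or analytics pipe

def record_feedback(response_id: str, kind: str, value) -> None:
    """kind is 'thumbs' (explicit) or 'click_through' (implicit)."""
    event = {"response_id": response_id, "kind": kind, "value": value, "ts": time.time()}
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

# Explicit signal: the user presses thumbs up/down on response "r-123".
record_feedback("r-123", "thumbs", "up")

# Implicit signal: the user clicked the runbook link the agent suggested.
record_feedback("r-123", "click_through", {"target": "runbook_link"})

# Later, sampled events feed the eval data set and the human annotation queue.
print(FEEDBACK_LOG.read_text())
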
And as people start building the intuition as to what metrics to gather, do you have any idea of what the hello world for AI agents is? Is there something that people could do to kind of get into the

(15:26):
routine of building something from the early stages?
I think a lot of it is just, pick something that you do yourself relatively repetitively that you would be the expert on, and build something small. That could be gathering data from a few different data sources that you normally do every Monday morning when you're getting into work and want to, you know, get caught up on something. You know, seeing what

(15:47):
PRs went out over the last week, or, you know, what's kind of your update, pulling together a status update. That could be just doing something very basic that you don't actually want the agent to automate, but you at least know how to check to see if it worked well, and just try something out. Like, I think, you know, right now it's such early days that we should all just be hacking on little things here and there.
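
For that kind of hello-world project, here is a hedged sketch that pulls last week's merged PRs from a GitHub repo as raw material for a Monday status update; you would then hand the list to an LLM to summarize and check the output yourself. The repo name is a placeholder and the summarization step is left as a stubbed comment.

# Sketch: gather last week's merged PRs for a Monday status update (illustrative).
import datetime as dt
import requests  # assumes the 'requests' package is installed

REPO = "your-org/your-repo"  # placeholder
URL = f"https://api.github.com/repos/{REPO}/pulls"

one_week_ago = dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=7)
resp = requests.get(URL, params={"state": "closed", "sort": "updated",
                                 "direction": "desc", "per_page": 50})
resp.raise_for_status()

merged = [
    pr for pr in resp.json()
    if pr.get("merged_at")
    and dt.datetime.fromisoformat(pr["merged_at"].replace("Z", "+00:00")) > one_week_ago
]

bullet_list = "\n".join(f"- {pr['title']} (#{pr['number']})" for pr in merged)
print(bullet_list)
# Next step (stubbed): feed bullet_list to an LLM with a "write my Monday update" prompt,
# then check the result yourself -- you're the expert on what a good update looks like.
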

(16:09):
And what's your tool stack right now? Do you use Cursor, Windsurf, anything like that?
Personally, I kind of dual-wield. I use Cursor and I use Claude Code, and the reason that I use the two, and I actually even use Codex occasionally right now, is because I think there are certain projects where I'm just trying to get something out, like I'm trying to get the bare bones out. I'm starting on a new project. I'm OK to YOLO-mode it, you

(16:30):
know, I'm like, hey, let's just try something like Claude Code, go for it, let's see what happens, right? I'm trying to create a demo. I'm trying to show what I mean rather than writing a spec. That's one of the things I really like doing now with these products: sometimes I'm trying to explain something and I don't want to have to go and write a whole document. I just want to get a demo out that's like, this is kind of what I'm talking about. And I can do that far faster now than writing, which is kind of

(16:51):
wild to think about. And then I use Cursor if I'm more pointed, like we're working in an existing code base, we're trying to just add a feature or make some fixes here and there, or ask questions about a code base. I find Cursor to be more useful for that.
Very cool. And are there any other AI tools in your general workflow, something like voice input or something to help with your writing, that you use on a regular basis?

(17:11):
I pretty consistently use Gemini directly, and in particular, I actually use NotebookLM a lot. I write an update for the company, really for just anyone who's interested, as kind of a weekly newsletter thing every Sunday night. That's: here's what's happening in the AI space that probably

(17:32):
matters to us at Datadog. And it's mostly for people who aren't, you know, living on Twitter or always looking at every new update that comes out, because there are so many new things coming out. And I'm trying to just be a little bit pointed for our product teams, our engineers, everyone like that. I'm like, hey, here's something interesting that we might want to try out this week. And I will throw all that data

(17:53):
into NotebookLM every Sunday night and create a podcast out of it and create some little kind of Q&A things out of it. And I actually really love that. I've been doing that for, I think I'm on issue #40 where I've been using it. So I've been using it every week for quite a while.
That's so cool.
Yeah. And I live quite rural, so everything's a long drive away from me, so being able to take anything written and turn it into an audio source is massive for me. And just having something read

(18:13):
out loud doesn't quite do it as much as a podcast-style conversation.
It's still a little light, but, you know, we have people like it because for us, it's the subway, right? A lot of us are in New York, so same thing: we're just hopping on the subway for 30 minutes and then, you know, yeah, listen to something.
Have you noticed any hesitation, whether it's internal or external, towards adopting AI tools? Are people still kind of wanting to make sure they hang on to the old way of doing things, or has it been pretty universal that these tools are

(18:36):
going to improve things, so may as well start using them now?
There's definitely a mix still, and I
expect that for a long time. We have a lot of people who are just very excited about the concept and the potential, who are hopping in and just trying anything. And I'd love to see that, and I think that's how you need to be. You need to be curious, you need to be trying things out. But we definitely have a contingent of folks who maybe tried something

(18:57):
two years ago or five years ago or ten years ago and it just didn't work well for them then, and they're very skeptical of it now. And that's fair. And I think it's really hard if you don't take the time to get to know the tools and try them out. Because if you just use Cursor or Claude Code once without really learning the process, you're going to create something that's probably not great.

(19:18):
You're going to get kind of a garbage-in, garbage-out situation, and then that'll turn you off of it for good if you're not a curious person who wants to try a bunch of things out. So we definitely see some of that skepticism. I'd say the way that I work against it is very much, let's just keep showing people what's possible. I think the way to do it is you kind of drum up interest and, you know, you kind of

(19:38):
prove by using the tools and showing them yourself what's working well and what's not.
And because we're starting to see a wider range of tools being adopted, do you have any insight or thoughts on the principles for what makes a great AI tool?
I think a lot of it is related to: did they really design for the cases where you need to have a human in the loop? Did you really design for

(20:01):
allowing customization? I forget who wrote this recently, I think maybe Stevie, who talked about making it possible for people to adjust the prompts yourself. I think that's terribly important, because, Horseless Carriages, that's the name of the article. Because if you don't actually go and make it so that anyone who's a power user, who wants to learn it or who wants to change it, can

(20:23):
customize it to themselves, you're going to get things like, you know, Gmail writing in a style that's not you, and then you kind of get turned off of it, right? Or, you know, the coding style not listening to the set of rules that you have provided it. And I think people like Cursor get that; they have rules that you can provide. Claude has rules you can provide. A lot of the tools that are really good out there are ones

(20:44):
that can work well out of the box, but also be customized to how you work every day.
I fully agree, and as someone who tends to be a power user of most of the tools I use, I think it's invaluable. Do you think there's a line, whether it's in the capability of the tool or even just the interface, where giving users this type of customizability could actually let them shoot themselves in the foot, because the prompt will be either poorly styled or it just kind of steers the model in the

(21:05):
wrong way? Is there too much customizability that could actually negatively impact these tools?
I think it's definitely possible, but I'm still all for giving it to folks. I just think you have to have smart defaults, though, because most people, we know, don't actually change a lot of the settings or customize anything like that. But I think it's about giving them the potential to do so, even though they don't have to. The biggest worry to me about

(21:26):
giving a lot of options and a lot of customization is that sometimes the people developing the tools become lazy then on what the defaults are, because they say, oh, someone's going to just change it to whatever they want. You should still have a very clear golden-path kind of workflow that works well out of the box, where people can try it out without having to do any customization.
Is there an approach where you can allow this type of highly

(21:46):
customized user data to be fed back in to contribute to the eval system? So I'm thinking, if we test for the golden path, everything looks good, and then the user decides to change things and it just catastrophically impacts performance, is there a way we could catch that ahead of time?
I don't know if we can catch it ahead of time, but I do think that if you log and test the kinds of things you're seeing real users doing, you should be able to find cases where these are failing. So that's a lot of, OK, if

(22:09):
they change the prompt this way, it doesn't pass evals anymore, and maybe we should surface that to them in some way. Or maybe we should provide some guidance, some guardrails, that start to kind of keep them on track. Or maybe we're OK with that, because we make it very clear to them: if you change this, it's not going to pass what we think is working, but as long as you like it, then that's fine. I think there are levels that you can do to kind

(22:29):
of close that loop. I don't think it's one of those things where you just kind of ignore it completely. I definitely think, especially if you're building a SaaS tool, you can log this information and come back and kind of get that data flywheel going anyway.
If I do discover this bad input that leads to this thing, and I add it to my eval set, would you recommend just going with the production data, or should I kind of multiply it by doing synthetic copies of it to add a bit more variety to my

(22:53):
eval suite?
It depends a lot on your particular use case and how large of a data set you have. So if you're seeing examples that you don't have in your existing eval data set, synthetic data is very useful for that. So this is very common if you have, what is it, like

(23:13):
one positive example for like 99 negative examples, right? Which is pretty common for people who do things like fraud detection. And they don't just sit there and train only on this unbalanced data set. They either do resampling and duplication, or they use synthetic data, or they do any number of things to kind of rebalance their data. And I think if you're in those

(23:35):
kinds of cases where it's very, very unbalanced, you should find options there. But if the distribution of your data set looks a lot like the distribution of your users, and it's not terribly unbalanced, I think it's OK just to stick with your production data.
OK.
And how would you manage, say you're trying to build a big data set partially for evals but also partially for training a model, would you kind of keep it the same? Do you separate them because they're different purposes, or

(23:56):
how do you do your data set management?
That's a good question. I think again, it kind of depends on how much of the training you are doing. Many people working on agents today are at most doing fine-tuning. They're not doing heavy training of a model. If you are doing complete training of a model, then you probably do want a different data set that is representative, and you're making sure it is well distributed and fits the

(24:20):
real use cases very well. But I think if you're just doing eval, it's OK to have many eval data sets that you've created. So you might say, hey, here's an eval data set that just represents our power users. I'm not going to train just on that, but I will do eval on that as kind of a tranche of data that I want to look at. I might have another eval data set that's purely synthetic data, or purely my golden-path data set, or I might have another eval data set that is just a sampling across the

(24:42):
board. I think it's very reasonable to have a number of data sets that you're kind of experimenting and testing with. But when it comes to training, that's different, because you do need to make a choice about how you put this all together to train or fine-tune a model. In which case, yeah, you probably want to make sure that it's appropriately sampled and the distribution looks good, and you might want to keep that as a separate thing. There's also some worry about

(25:05):
overfitting. You probably do want some sort of blind test set, especially if you're training a model, where you're not just going to train on everything that you saw. You want to keep a chunk of data separate. That is kind of best practice here.
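
As a tiny illustration of keeping a blind chunk aside, here is a generic train/validation/test split; the ratios, the random seed, and the placeholder data are arbitrary choices for the example, not a recommendation from the episode.

# Sketch: hold out a blind test set before any training or tuning (illustrative).
import json, random

examples = [{"id": i, "input": f"case {i}"} for i in range(1000)]  # placeholder data

random.Random(42).shuffle(examples)          # fixed seed so the split is reproducible
n = len(examples)
train = examples[: int(0.8 * n)]             # used for fine-tuning
val = examples[int(0.8 * n): int(0.9 * n)]   # used while iterating on prompts/tuning
test = examples[int(0.9 * n):]               # blind set: only touched for final checks

for name, split in [("train", train), ("val", val), ("test", test)]:
    with open(f"{name}.jsonl", "w") as f:
        f.writelines(json.dumps(e) + "\n" for e in split)
    print(name, len(split))
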
OK. And one thing, I actually haven't heard it before, but it makes perfect sense: having those multiple data sets for evals along different lines. Do you manage it with Git? Is there a certain program you'd recommend?

(25:26):
Or do you just keep JSONL files? Like, how would you have these different data sets in your code bases?
Right now we define all that in code ourselves. So it's just like in Git, like you're saying. It's like, hey, here's the data set that matches to this particular set of samples. And then we keep all the actual data itself in just data

(25:47):
storage, right? It's just an S3-ish kind of thing. I think different people do it a little differently. If you're doing large-scale training, you might be using something like Weights & Biases. But if you're just doing eval with a kind of homegrown eval data set, it's completely reasonable just to kind of write it this way yourself and, you know, just keep track of it in simple Git, in a simple file.
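
Here is a hedged sketch of what "define the data sets in code, keep the data in object storage" can look like: a small registry mapping eval set names to S3 locations plus a JSONL loader. The bucket, keys, and boto3 usage are illustrative assumptions, not Datadog's setup.

# Sketch: eval data sets declared in code, data stored in S3 (illustrative only).
import json
import boto3  # assumes boto3 is installed and AWS credentials are configured

EVAL_SETS = {  # checked into Git; the names and keys are placeholders
    "power_users": {"bucket": "my-eval-data", "key": "evals/power_users.jsonl"},
    "golden_path": {"bucket": "my-eval-data", "key": "evals/golden_path.jsonl"},
    "synthetic_v1": {"bucket": "my-eval-data", "key": "evals/synthetic_v1.jsonl"},
}

def load_eval_set(name: str) -> list[dict]:
    """Fetch one registered eval set from S3 and parse it as JSONL."""
    spec = EVAL_SETS[name]
    body = boto3.client("s3").get_object(Bucket=spec["bucket"], Key=spec["key"])["Body"]
    return [json.loads(line) for line in body.read().decode().splitlines() if line]

if __name__ == "__main__":
    cases = load_eval_set("golden_path")
    print(f"loaded {len(cases)} golden-path cases")
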

(26:08):
What is the next big unsolved problem for agent building? If someone wants to come up and spin up their own agent, where do you see gaps currently, that could help accelerate people's development of these agents?
For me, it's very much this: how do you improve an agent once you've built it? I think it's kind of back to the eval discussion. I know I talk about evals a lot, but the reason I think

(26:29):
about it a lot is it is somewhat unsolved: how do you automatically do that? I want a self-improving system, a self-improving agent system. Every time someone uses my product or my agent and it works well, I want that to feed back into the data set. Every time it doesn't work well, I want that to feed back into the data set. And then I want to see that automatically improve my system over time. That is not solved yet.

(26:52):
I'm actively working on some parts of that. I know other people are working on different parts of this. I think it's a very hard problem, but I think it's the thing that's going to help, you know, a million companies go out there and build agents. I want the one- and two-person companies to be able to build these kinds of systems, and you can't do that if you require a full-time applied scientist who's sitting there just sifting through data all the time.

(27:12):
We should be building the tools and applications that make it easier for one or two people to just build these applications and kind of hit go and see them improve every day.
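
None of that loop is solved off the shelf, but a crude sketch of the plumbing could look like the snippet below: logged outcomes get appended to the eval set and the eval is re-run so you can see whether the system is trending up. Everything here, the outcome records, the file name, the stubbed agent, is hypothetical.

# Sketch: a crude "data flywheel" loop -- outcomes feed the eval set, eval re-runs.
import json
from pathlib import Path

EVAL_FILE = Path("eval_set.jsonl")  # placeholder location

def add_outcome(case_input: str, expected: str, worked: bool) -> None:
    """Append a real interaction (good or bad) as a new eval case."""
    record = {"input": case_input, "expect": expected, "from_production": True,
              "originally_worked": worked}
    with EVAL_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")

def run_eval(agent) -> float:
    cases = [json.loads(line) for line in EVAL_FILE.read_text().splitlines()]
    passed = sum(c["expect"] in agent(c["input"]) for c in cases)
    return passed / len(cases)

def stub_agent(prompt: str) -> str:
    return "checkout latency traced to deploy 42"  # stand-in for the real agent

add_outcome("Why is checkout slow?", "deploy 42", worked=True)
add_outcome("Why are payments failing?", "TLS", worked=False)  # a miss to learn from
print(f"pass rate on the growing eval set: {run_eval(stub_agent):.0%}")
# A self-improving setup would now try prompt/tuning changes and keep whichever
# variant raises this number without regressing older cases.
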
I would love a tool like that. Do you foresee that being something that's an offering from either a big research lab company or even an open source project?
Yes, I expect that parts of this will probably come in from open source. I think open source is always a

(27:33):
good place to look to see which parts of these tools emerge. I don't expect to see a full end-to-end system that does that in open source, because it is a very complex and relatively expensive thing to do. So I think you're going to see parts of this coming from the open, from the big labs. And then I think there'll be people like us, who build more at the application level, who will probably build more of an end-to-end system that pulls together these different things. So hooks into the APIs for fine-tuning for OpenAI and Anthropic,

(27:57):
or Fireworks, or something like that. And then ideally it can combine things that we know about your observability data and how customers are using your product with the annotated data that you might have gotten from a different system. I think there's kind of this orchestration and interaction layer that's missing right now.
Do you see any projects that excite you, or any possibility for the ambient agent

(28:20):
future, where instead of me having to go to a chat box and tell it what I want, it can pick up on cues or signals from my other workflows and then just start automatically running a process?
Absolutely. Some of our agents do this already. So the AI SRE, as you'd imagine, you don't want to have to go and tell it when something's wrong. So right now we are triggering

(28:41):
it off of an alert going off. So if you would normally get paged, instead of paging you, maybe your AI SRE should be kicked off first. And that's just a natural workflow that's occurring that you want to run in the background before you even get to your computer. And then when you get there, you can start to interact with it. I expect more and more of those kinds of proactive or asynchronous agents to exist out there.

(29:02):
You know, an email comes in. I've heard of a few companies working on this at an early stage where they're just triggering workflows and agents to work off of just emails coming through, or your texts coming through, or phone calls coming through, or anything like that. These are all things where I think we can easily place agents.
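
As a toy version of that alert-triggered pattern, here is a sketch of a webhook endpoint that kicks off an investigation when a monitoring alert fires. It uses Flask only for brevity; the payload fields and the start_investigation function are hypothetical, and this is not the Bits AI SRE integration.

# Sketch: kick off an agent when a monitoring alert webhook arrives (illustrative).
from flask import Flask, request, jsonify  # assumes Flask is installed

app = Flask(__name__)

def start_investigation(alert: dict) -> str:
    """Placeholder for launching the agent run (queue a job, call an API, etc.)."""
    print(f"investigating: {alert.get('title')} on {alert.get('service')}")
    return "run-001"  # hypothetical run id

@app.route("/alerts", methods=["POST"])
def on_alert():
    alert = request.get_json(force=True)  # e.g. {"title": ..., "service": ..., "severity": ...}
    if alert.get("severity") in {"critical", "error"}:
        run_id = start_investigation(alert)
        return jsonify({"status": "investigating", "run_id": run_id}), 202
    return jsonify({"status": "ignored"}), 200

if __name__ == "__main__":
    app.run(port=8080)  # point your monitor's webhook at http://localhost:8080/alerts
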
And those are very clear avenues of, data comes in, it initiates a process. What are your thoughts on the

(29:25):
more devices we're starting to see that do passive audio collection, where you kind of have a necklace walking around and it just picks up things? Do you see that as something that's going to be a valuable input towards agents, or do you think it's more of a privacy concern which will probably get squashed, because public opinion on having always-on recording devices isn't exactly there?
I think it's an interesting space. I'm still not sold that they're going to be a common occurrence yet.

(29:48):
But we have seen a change in how people expect these products to live, like products like the AI pendants, or Rewind's product, or anyone like that, or what was the other one? Friend, friend.ai or friend.com. I think people are open to them more than they used to be. We're not in the Glasshole days anymore, if you recall, you know, you

(30:09):
wear Google Glass and all of a sudden you're banned from everywhere. I don't think we're in that. And I am seeing more and more people wear these kinds of pendants and try them out. I think a lot of it just comes down to, do you get enough value out of it? People will be OK giving up some aspect of privacy as long as, one, you kind of earn their trust by showing that you are keeping this data private as well,

(30:30):
and two, you're getting value out of it. If you're not getting value out of it, that's very different. No one wants to just go around, record their day, and then not actually get anything out of it. But I like the idea of having infinite memory, within reason. I do think it's a little awkward if you don't have one and someone else you're talking to has one; then you might kind of be careful about what you're saying to them, and maybe that becomes kind of unnatural itself.

(30:52):
Yeah, I'm actually noticing it too. With online video calls, people are running Granola in the background, so they transcribe the whole thing without really getting consent. It's the way things are evolving. One area that's been a big focus of mine is privacy and security. And with AI, privacy is pretty much the polar opposite of getting all the data to make the systems run adequately. Do you see local, private, offline

(31:13):
models being able to compete with these bigger models, whether it's like an orchestrated fleet of highly specialized ones, or do you think the gap between the superclusters, the data centers, and what people are running locally is going to be too vast, so that they'll never really be an alternative to these large centralized models?
Well, this is where I think the task you're doing matters a lot. I don't think, there's no way anytime soon

(31:34):
that we're going to have local models that are better than the big models at everything. But if you're just doing speech recognition, transcription, yeah, local is fine. Actually, we've brought word error rates down on these smaller models so much nowadays that you can easily run a speech recognition model that sits on your computer, your phone, whatever, and it's just as good, and your audio doesn't need to go to the cloud.

(31:57):
Same for very simple tasks. I think there's a lot of, if I'm going to go and do classification, you know, tagging my photos or classifying the text notes that I'm writing, you can probably have a small LLM. You know, I guess that's kind of counterintuitive, they're calling them SLMs now, a small language model, not a large language model, that's sitting on your machine, that can do that, knows you, is

(32:20):
customized for you, with maybe a distilled version of a much larger, smarter model, that is able to work maybe even in conjunction with a model in the cloud. I really like the hybrid systems. I think that could help a lot with privacy, where you do some work locally, and maybe some of the local work anonymizes some of the input that you have, and then you send the more complex kind of problems up to the larger model in the cloud that's

(32:41):
used by a lot of people nowadays too.
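
A toy version of that hybrid, privacy-preserving split could look like the sketch below: a small local step redacts obvious identifiers before anything is sent to a cloud model. The regexes are simplistic placeholders and the cloud call is a stub; a real system would use a proper PII detector and an actual API client.

# Sketch: redact obvious identifiers locally before calling a cloud model (illustrative).
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IP": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Crude local anonymization; a real system would use a proper PII model."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def cloud_model(prompt: str) -> str:
    return f"(cloud model would answer: {prompt!r})"  # stub for a remote API call

raw = "User jane@example.com on 10.0.0.12 reports checkout errors, call +1 555 123 4567."
print(cloud_model(redact(raw)))
# -> the cloud side only ever sees "<EMAIL>", "<IP>", "<PHONE>"
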
Yeah, I saw a paper recently, I forget who did it, where they were able to get full encryption, so you can run things locally and when it needs help from the bigger model, it sends it out. I'm also very optimistic about that. You briefly mentioned voice recognition running locally. I use Superwhisper all the time. I'd say 90% of my prompts are spoken rather than typed. And because I kind of use Open Interpreter for some tasks, I

(33:01):
can actually control a fair bit of my computer with my voice. Do you use speech to text at all?
A little bit less than I used to. I used to use it a lot more when I would drive, but I take the subway now and you don't want to be that annoying guy on the subway. And then all of us are in open offices nowadays too. So I think those are the two downsides. Like, I love speech to text. As I said, I worked on Alexa, I worked on Cortana, I worked on

(33:22):
the Kinect for Xbox a long time ago too. I love speech modalities. I think that really is the direction we're going over time. But there is an aspect of this that's, well, I don't want to be annoying in a public space. So I do a lot of typing.
Yeah, I've seen some people come out with little whisper devices, not Whisper the model, but you can just whisper to it. You know, I think if you're in a situation where you feel socially obligated to

(33:43):
whisper, you probably should be using voice to text anyways.
Precisely. We'll see how society evolves.
I work from home, use it all the time. It's been great.
Yeah, at home I do use it, as long as I'm not trying to wake up my kid. If I'm yelling at the computer, I don't want to be waking him up next door.
Yeah, absolutely. Got to be responsible.

(34:04):
Do you have any other general advice or insights on evals, so when people do build their agents, they're able to kind of have a little checklist or just some best practices in the back of their mind, so they can make sure that they cover the foundation to get good performance?
Yeah, I think there's kind of a few things. One is basically, as I said, make sure you actually know how to evaluate your particular custom task. Two is have a really good process for error analysis, basically, or digging through your data.

(34:28):
This is kind of the second part of just look at your data. You shouldn't be just logging this information and automating everything. You should actually feel pretty comfortable going and looking at single instances of data, going and looking at clusters of data, and just digging in, and have a good tool for doing that. That might be something you build custom yourself. That might be something that exists out there that you just, you know, know how to

(34:50):
query well. And then three is just know how to close that data loop. OK, you get this information, what am I going to do with it? Know how to experiment, know how to run those experiments. If you have that kind of loop going, everything else will be extra.
All right. Last question for me today. What does techno-optimism mean to you?
For me, it's a lot of being

(35:12):
optimistic about improvements we can bring with the current technology that's out there and where things are going. I think that there is an aspect of folks sometimes who don't want things to change, and that's the kind of pessimistic view to me. I think we can do a lot more every day. We can do a lot more for people.

(35:32):
We can do a lot more for, you know, the environment. We can do a lot more for our children, a lot more for where things are going in the country over the next 50-plus years, if we lean into using technology to solve a lot of our problems. That is, you know, doing more with AI, that is, you know, improving things like

(35:54):
nuclear power, improving things like battery storage for our solar systems, and leaning into that rather than being worried
about it.
Yep. Diamond, this has been super informative. Really appreciate you coming on and sharing your knowledge with all of us. Before I let you go, is there anything you want them to know? Where can they keep up with you?
Find me on Twitter. I'm Diamond Bishop on Twitter, or X, I guess, is what we're supposed to call it now. I'm looking

(36:16):
for people who are building agents, especially early in that development cycle, just looking to get more feedback. To me, it's very important to kind of talk to people who are early in the stage, to see what problems they're hitting, how we can help you do eval, how we can help you improve these things automatically. So if you are someone doing that, reach out to me. Happy to chat.
We'll link everything down below. And Diamond, thanks again.

(36:37):
I'll talk to you soon. Thank you.
See you.