Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
The most important thing that I like to ask people is if I use
your app and I complain, how much effort is it for you to
take that complaint and turn it into an eval?
And for the best teams, it's like one or two clicks.
If not, the teams that are NGMI, they haven't even thought about
the answer to that, right? And so I think that's really
the goal. Like, can you get to a point
(00:20):
where it's two clicks from someone complaining to something
ending up in your evals? You're listening to Founder
(00:41):
Mode, the podcast where builders share
what's really working instead of just pitching the dream.
Today we're talking about something that sounds easy but
turns out to be brutally hard, and that's going from your AI
demo to production. Yeah, our guest today has spent
the last decade building the infrastructure that makes
scaling actually possible, and now he's applying it to help
(01:04):
companies make their AI reliable.
Yeah, I mean, we've all seen the hype, the amazing 30-second demo
that really hits, but then six months later it's still not
live. That's why this episode really
matters. It's about the systems, feedback loops, and boring details that make AI work for real teams.
Kevin, we're seeing a lot of failure in AI implementation.
(01:26):
The MIT report that just came out said 95% of enterprise AI
adoption has driven zero ROI. What's your take?
Yeah. I mean, you know, everybody's striving to now be in that 5%, because they've all heard about the 95%, and they're all looking at themselves in the mirror and being like, which side am I on? And knowing that, more than likely, they're not on the good side. And so I think we've been
(01:48):
fortunate that we found a couple of the five-percenters in our work.
And I think what, you know, stood out to me when I read that is, like, what were the things that made you show up in that case? And typically it's a lot deeper work, a lot deeper integration, a lot more hands-on, sort of, like, understanding the use case and then really thinking about it as a business, and not saying, oh, I've got this cool UI widget
(02:11):
or AI widget that I can apply to, like, make something see or
something happen. And so for us, it's really been
about finding something that's like materially hard and then
spending a lot of time on it to deeply understand it and how the
business sees it. Because today it's so easy to,
you know, jump into ChatGPT and grab a doc, a piece of paper or
just some data and ask it a question and it will give you an
(02:32):
answer. Yeah.
But then seeing that show up day in, day out, that's where the hard stuff really begins. So, yeah.
So I think the real trick there is just figuring out like
where the business value is and like being sure that you can
build the thing that matters to sort of like actually achieve
and land in that 5%, which is where you want to be.
Yeah. I think it makes a ton of sense.
And it's not about just participating in the AI hype
(02:54):
cycle, right? It's doing things that drive
business value. Totally.
Ankur's built across, like, multiple generations of infrastructure, I think, from structured data to unstructured search to AI agents, and, you know, done it quietly
but at significant scale. Yeah, he's also one of the
clearest thinkers, yeah, in the space.
And he has no interest in selling you hype.
So should we, should we bring Ankur on?
Let's do it. Ankur, welcome to Founder Mode.
(03:26):
You've gone from databases to ML to AI, and now you're building what's honestly synonymous with evals with Braintrust.
Great to have you here. Excited to be here, thanks for
having me. Yeah, Ankur, we, you know, we do a little prep here. We kind of look at all of our guests' backgrounds and the things that they've built and done. Super impressive, your experience and your history.
You've built across so many different layers of the stack,
(03:47):
right? Like search, databases, AI, and now AI reliability.
What's the common thread that kind of brought you here?
Honestly, it's databases. No matter what, I feel like I can't escape them, or, to me, everything looks like a database problem. Even at Braintrust, we already built our own database called Brainstore.
(04:08):
But you know, I think I've always been fascinated by this idea that if you help someone kind of create a really good system that lets them capture data, work with it, and then iterate with it, then they can build really good stuff. You know, back in the day when I was working at MemSQL, it's now called SingleStore, we helped a bunch of
(04:29):
enterprise companies that were generating, you know, tons of
data, like financial companies who if they could make trading
decisions in 30 seconds instead of like 30 hours, they would be
able to make a lot more money. Or, you know, industrial
companies that were trying to get data from sensors and make,
you know, better predictions about when things would fail.
And it's kind of the same thing, I think, in AI: the more
(04:51):
effectively and the more quickly you can use data to improve the quality of your product, the better a product you build.
Was there, like, a particular pain from your past work where you were like, wow, evals is the thing that's going to be the one for AI, that's going to fix it? And that kind of triggered what is now Braintrust?
Yeah, you know, I almost
didn't want to start working on Braintrust because I was so
(05:13):
fatigued from this problem. So at my last startup, Impira, we built AI-based document extraction, and it was really, really hard pre-ChatGPT. Now it's a much easier problem.
But you know, back then it was a really hard technical problem
and we supported a bunch of different document types for
example like invoices, bank statements, insurance documents,
(05:35):
etcetera. And we realized after BERT came out that we could train one transformer model to do really well at all of these things rather than train different models for each of them. But what we found is that every time we improved one document type, we ran the risk of
regressing another one. So we might like improve the
invoice performance and then break bank statements.
(05:57):
Now you probably see where this is going.
Our financial customers were not particularly excited about that.
And so we had to get really goodat doing evals.
And as soon as we built evals, we went from, I think, arguing about model decisions for, like, three or four months at a time to shipping a lot faster, and I saw that really
(06:18):
transform how we built stuff. But then it became really hard
to get data to do evals. And that's when we started
figuring out, OK, how do we link together our observability stack
to getting data to play with offline?
That felt very weird at the time.
And we had to write a bunch of scripts and build a bunch of
stuff, but I didn't think too much of it.
And then Figma acquired us, and I led the AI team there.
(06:38):
And, you know, that was like right after ChatGPT came out.
So we dropped everything that we were doing and we started thinking about, OK, how do we use LLMs to make Figma's product
better? And you know, funny enough, we
ran into exactly the same problems.
Like we kept building stuff and then another team would start
playing with it and break it. And, you know, we, we ended up
basically building the same tooling again.
(07:01):
And after doing that twice, I was like really sick of this
problem. So much so that I didn't really
want to talk about it.
And then you started a company to solve the problem.
Yeah. And, you know, this brilliant guy, Elad, was like, hey, you know, you kind of built the same tools at both of these companies. But now it's not just you in a corner working on AI stuff.
There's a lot of companies like Instacart who are building AI
(07:23):
products. Maybe other people have this
problem. And I was like, oh, man, I'm so
sick of, you know, working on this internal tool.
But we talked to a bunch of companies including, you know,
Instacart, Notion, Zapier, Airtable, Stripe, a bunch of our early customers, and they were like, yeah, you know, we're just starting to figure this out. And that's how Braintrust kind
(07:44):
of came about. That's amazing.
So you said that going from prototype to production in AI is kind of an uphill battle. What do most teams
underestimate? I think, you know, once you ship
your product, you start to get feedback from users that you
didn't expect. Honestly, that's true with any
product. But in AI, what happens is, you
(08:05):
know, someone uses your product. Let's say you're building, we'll keep using the Instacart example, let's say you're building a tool to help people construct their grocery cart for the week.
What happens is people will find things that don't work, they'll complain to you, and then you'll realize, like, oh wow, you know, the user is right. I need to go and edit my prompts, or I need to go and, like, change, you know, the model I'm using, or
(08:27):
maybe try decomposing the problem, whatever it is.
And so you try doing that, and you play with whatever the user was trying to do, and it improves that user scenario.
And then you ship it, and then another user complains and says, hey, yesterday I was able to create, you know, my shopping cart, but now it's not adding bananas to it.
You're like, oh my God, what did I do?
(08:49):
And basically, you know, when you're prototyping, you tend to
have this laser focus on just a few examples or maybe one thing at a time that you're really looking at.
And that's a really great way to make progress and iterate on an
AI product. But when you actually ship
something and you have a lot of users, suddenly you need to take
into account kind of the collective wisdom that you've
gathered over time. And I think that's that's where
(09:12):
it gets really, really challenging.
I think it can feel a lot like, you know, literally like playing
whack-a-mole. Like you improve one user's thing,
another thing breaks, you improve another thing, etcetera.
And I think that's where teams start to see the benefit of evals.
You know, early on we had this heuristic that if someone hadn't shipped their product at least
(09:33):
three months ago, it wasn't worth talking to them about Braintrust. And we would often meet teams
and they were like, hey, we don't need evals.
Our product is great. My mom uses it.
She really likes it. The models are great.
You know, they're getting better.
I don't need this. And then kind of like 3 months
later, they'd come back to us a little bit hungover.
Hey, are you guys still there? And I think that timeline has
(09:57):
accelerated. We all now understand that evals are quite useful, but the general thing is, you know, still there, the phenomenon.
How do you help teams kind of move faster without losing trust? Like, what's kind of the key thing? Like, if you were going to, you know, pick the one or two things that a team should do to say, you know, I want to move faster, like, increase my velocity without losing
(10:18):
trust. And sort of, like you said, what you built yesterday is not going to break tomorrow, kind of thing.
Yeah, I think there's really two things that you should get good at. The first is you should have a really good environment for doing iteration.
We have something called the Playground in our product that's really simple. You can test multiple models and prompts, or agents, you know, more complex things if you want, side by side with a bunch of data points.
(10:41):
And you don't need to start by computing fancy numbers and scores. Just running your prompt on a bunch of examples and looking at the output with your eyeballs is a pretty good way to start. Whereas if you're, like, sitting in your terminal and playing around with something, you're only working on one example at a time.
So I think just getting some basic tool in place that helps
(11:02):
you do that. And then of course the 10
examples that you start with will suddenly turn into like
10,000. And then it's really useful to
actually have some numbers to help you sift through
everything. But having some kind of like
iteration mechanism, I think that's really important.
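To make that concrete, here is a minimal sketch of that kind of side-by-side, eyeball-first iteration, not tied to any particular product. It assumes the OpenAI Python SDK purely for illustration; the model names, prompt, and examples are placeholders.

```python
# Hypothetical sketch: run one prompt over a handful of examples and eyeball the outputs.
# Assumes the OpenAI Python SDK; model names and examples are placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = "Suggest a weekly grocery cart for this request: {request}"
EXAMPLES = [
    "family of four, vegetarian, $120 budget",
    "single person, high protein, no dairy",
    "meal prep for two, loves pasta",
]

for model in ["gpt-4o-mini", "gpt-4o"]:  # placeholder model names
    print(f"=== {model} ===")
    for request in EXAMPLES:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(request=request)}],
        )
        # No scores yet: just print the output and look at it with your eyeballs.
        print(f"[{request}]\n{response.choices[0].message.content}\n")
```

Once the ten examples turn into ten thousand, this same loop is where scores start to matter.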
And then the other thing is having a really strong
connection between what happens in production and what you're
able to test. And I think this is why we've
(11:25):
had this really sort of end to end experience from logging and
tracing all the way back to evals in Braintrust from
basically the beginning. It's funny, like one of our
first customers was Airtable and we didn't have logging when we
started. And there was an engineer at
Airtable who basically created an eval.
And then you see where this is going.
(11:45):
He just kept adding stuff to it.
And soon it became, you know, the biggest eval we had ever
seen. And we were like, hey, what's
going on? He's like, yeah, these are my
logs from production, and I'm just capturing them in Braintrust so that they're in the same format as my evals.
And that was a huge, you know, it's like, OK, yeah, that actually makes a lot of sense. Basically logging.
Yeah. And so we, you know, we
(12:06):
turned what he was doing essentially into logging. But having that feedback loop, I think, is just incredibly important.
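A rough sketch of the feedback loop being described: production logs and eval cases share one format, so a flagged log can be promoted in a step or two. The file names and the Interaction shape here are hypothetical, for illustration only, not any particular tool's API.

```python
# Hypothetical sketch of the log-to-eval loop: production traces and eval cases share
# one format, so a flagged log can be appended straight to the offline dataset.
import json
from dataclasses import dataclass, asdict

@dataclass
class Interaction:
    input: str     # what the user asked
    output: str    # what the model produced
    feedback: str  # e.g. "thumbs_down", "complaint: ..."

def log_interaction(interaction: Interaction, log_path: str = "production_logs.jsonl") -> None:
    """Append every production interaction to a log file (stand-in for real tracing)."""
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(interaction)) + "\n")

def promote_to_eval(interaction: Interaction, dataset_path: str = "eval_dataset.jsonl") -> None:
    """The 'one or two clicks': copy a problematic log into the eval dataset,
    keeping the bad output around as a reference for what to improve."""
    case = {"input": interaction.input, "bad_output": interaction.output,
            "note": interaction.feedback}
    with open(dataset_path, "a") as f:
        f.write(json.dumps(case) + "\n")

# Usage: a user complains, and the same record flows from logs into evals.
bad = Interaction(input="build my weekly cart", output="(no bananas added)",
                  feedback="complaint: missing bananas")
log_interaction(bad)
promote_to_eval(bad)
```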
So you've been public about comparing models, like OpenAI's and Anthropic's. What does good model selection actually look like today?
I think, you know, there are two analogies for what models might be. One is they might be like CPUs.
(12:30):
And by that I mean, like, you can take the same set of instructions and pipe them through a CPU and, you know, you ought to get exactly the same result, but different CPUs have different performance profiles. Like, if you run, you know, an Intel
CPU from three years ago, it might weirdly be better at
certain things, but likely not other things, etcetera.
(12:51):
And then I think another analogy is databases.
Again, I think a lot about databases. Every database speaks SQL, but they speak slightly different dialects of SQL. And some databases are dramatically better at certain use cases than others.
And I actually think LLMs are a little bit more like databases than they are like CPUs. Like, the same English string
sent to one LLM might produce like really different things in
(13:13):
another. And kind of like databases, I
think you need to do a bit of, you know, benchmarking and
diligence when you decide which model or family of models is
your best bet to solve a particular use case for some
period of time. But then you need to focus
almost all of your effort into making that model work really
well. And I think the best teams that
(13:36):
we work with, basically, if you were building a SaaS product, let's say, like, 10 years ago, you might re-evaluate your relational database every three years, or, if you're very, you know, advanced, every year or something like that. If you think about AI, everything is compressed. So the best teams that we work with re-evaluate their model choices maybe once every one or
(13:57):
two months. And they re-run a bunch of benchmarks and sort of understand, OK, well, it seems like now, in GPT-5, GPT-5 Nano is, like, weirdly fast. And so I can suddenly unlock this user experience I didn't think I could have done before. Let me go, like, play with that for another month or two.
We do see teams that never re-evaluate their model choices, and,
(14:18):
you know, I think they often lose product-market fit.
And then we also see teams that spend all their time comparing
models side by side and they never ship anything.
So we kind of think this in-between point is what works best.
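A minimal sketch of that periodic re-benchmark, assuming a stand-in complete() function rather than any real client; the eval set, scorer, and model names are placeholders.

```python
# Hypothetical sketch of the monthly-or-so re-benchmark: the same eval set,
# two candidate models, one simple scorer. Swap complete() for your actual client.
from typing import Callable

def complete(model: str, prompt: str) -> str:
    # Stand-in for the real model call; replace with an actual API request.
    return f"[{model}] response to: {prompt}"

EVAL_SET = [
    {"input": "Add bananas to my cart", "must_contain": "banana"},
    {"input": "Plan a vegetarian week", "must_contain": "vegetarian"},
]

def contains_expected(output: str, case: dict) -> bool:
    # Trivial scorer: did the output mention the thing the user asked for?
    return case["must_contain"].lower() in output.lower()

def benchmark(model: str, scorer: Callable[[str, dict], bool]) -> float:
    passed = sum(scorer(complete(model, case["input"]), case) for case in EVAL_SET)
    return passed / len(EVAL_SET)

# Re-run whenever a new model ships, then commit to the winner for a while.
for candidate in ["current-production-model", "shiny-new-model"]:  # placeholder names
    print(candidate, benchmark(candidate, contains_expected))
```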
How about, like, actual eval fatigue? Like, you know, once you get to scale, you're going to get, you know, obviously lots of evals and lots of data coming back. Like, what's sort of the best
(14:40):
practice to sort of understand what to look at, how deep to go, and to make sure that you're balancing this sort of tuning versus shipping and moving forward?
Yeah, I think the short answer to your question is rm -rf.
Don't be afraid to delete things.
The long answer to your question is, I think my framework for evals is not to think of them as benchmarks, it's to think of
(15:02):
them as time savers. So as humans, I think when we do
evals, our job is to influence what an LLM is doing to better
match the expectations that we have for what a user wants.
That's it, right? That's kind of our role in this weird world that we live in. And let's say that we have, like,
(15:24):
15 minutes a day to spend working on that.
If you have, let's say you have one user and that user uses your
product five times a day, so you generate five data points, then your eval might be you spending 15 minutes looking at what that user did five times, remembering what they did yesterday, and then maybe tweaking something so that you have a better shot at
(15:45):
improving the five times they use the product tomorrow.
Now let's say you have, you know, 500 users and they each
use the product 100 times a day. You know, now you have 50,000
data points and you just can't look at all 50,000 interactions.
And I think the right way to think about evals is OK, I have
these 50,000 examples. What are the five to 10 that are
(16:07):
actually worth my time looking at in more detail?
And I think if you think of evals as kind of like a
prioritization function, then the answer to your question
becomes kind of obvious. Like if you have too many evals,
that means that you're just generating too many things which
seem like a priority. And so maybe you more
aggressively prioritize within that or maybe you remove things.
I think the mistake is when people start to collect like
(16:31):
50,000 evals and then they generate a bar chart instead of
actually looking at the data. And they say, OK, now I no longer have time, you know, to look at the examples, so I'm just going to look at the bar chart. And I think, like with anything, right, with any example of anyone using data, that is not the right thing to do, right?
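A small sketch of evals as a prioritization function: surfacing the handful of worst-scoring interactions instead of a single bar chart. The scored-log shape and the scores here are made up for illustration.

```python
# Hypothetical sketch: from 50,000 scored interactions, pick the few worth a human look.
from typing import TypedDict

class ScoredLog(TypedDict):
    id: str
    input: str
    output: str
    score: float  # from an automatic or LLM-based scorer; 0.0 = bad, 1.0 = good

def worth_a_human_look(logs: list[ScoredLog], budget: int = 10) -> list[ScoredLog]:
    """Return the `budget` lowest-scoring interactions, the five to ten worth 15 minutes."""
    return sorted(logs, key=lambda log: log["score"])[:budget]

# Usage: 50,000 scored logs in, ten out; everything else can wait (or be deleted).
logs: list[ScoredLog] = [
    {"id": "a1", "input": "weekly cart", "output": "no bananas", "score": 0.2},
    {"id": "b2", "input": "weekly cart", "output": "looks right", "score": 0.9},
]
for log in worth_a_human_look(logs):
    print(log["id"], log["score"], log["input"])
```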
(16:51):
You posted this great line, and I'll make sure I get it right: all agents converge on while loops with tools. What actually makes that design work?
I think it is inherently what LLMs do, right?
Like if you think about stripping away all software
abstractions and you know, really just thinking, imagining
(17:14):
the most simple architecture that if it worked would be
extremely simple. And if it doesn't work, it's
only because the LLM is not yet smart enough.
That would be a while loop of tools.
And I think because LLMs are improving so quickly, and because the people who are improving the LLMs are sort of adopting and building with this mindset, you can look at how
(17:36):
Claude Code is built. You can look at the OpenAI Agents SDK, you can look at how ChatGPT works when you actually
interact with it. It's all the same, you know,
while loop with tools idea. That means that this
architecture is the only one that really withstands the
iterations that models make. Now of course the reality is a
bit more complicated, like sometimes you need to think
about state, sometimes you need to have one LLM delegate a task
(18:00):
to another simpler LLM. But I think that arriving on
these more complex designs with a purpose like hey, I used a
while loop of tools and I got stuck because I need to delegate
this task to another LLM for cost reasons.
I think if you arrive on that from first principles by
actually experiencing the problem, then it justifies
(18:21):
investing in a more complex design.
And then I think, you know, if you arrive on it with that, like
from first principles like that,if the model becomes better or
cheaper, then you'll also have no ego or no attachment to that,
right? So if suddenly the, the, the
assumption, which is that the model is too expensive to do a
(18:41):
while loop of tools, goes away, then you'll just revert back to the simpler design. So that's just what we see, I think, some of the best teams do.
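A minimal sketch of the while-loop-with-tools shape described above, with a placeholder call_llm and toy tools rather than any specific SDK.

```python
# Hypothetical sketch of the "while loop with tools" shape: ask the model what to do,
# run the tool it picks, feed the result back, and stop when it answers directly.
import json

def call_llm(messages: list[dict]) -> dict:
    # Stand-in for a real chat completion call that can return either a final
    # answer or a tool request like {"tool": "search", "args": {...}}.
    return {"answer": "done"}

TOOLS = {
    "search": lambda query: f"results for {query}",
    "add_to_cart": lambda item: f"added {item}",
}

def run_agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # the while loop (bounded, so it always terminates)
        decision = call_llm(messages)
        if "answer" in decision:
            return decision["answer"]
        # Otherwise the model asked for a tool: run it and append the result.
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool",
                         "content": json.dumps({"tool": decision["tool"], "result": result})})
    return "gave up after max_steps"

print(run_agent("add bananas to my weekly cart"))
```

The point of keeping the loop this plain is that it is easy to delete the extra structure later, once a smarter or cheaper model removes the reason it was added.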
It was interesting, we were having lunch a couple of weeks ago and we were talking about how we were solving one of our problems and kind of went back to a more deterministic solution. And you were like, I think you guys are going to flip on that a bunch.
And I think this reminds me of that.
And like, how have you sort of seen that?
(19:03):
And, you know, I'm sure there's stuff in your guys' world where you're looking at that and you sort of say, oh, this time, you know, the LLM's not smart enough, I need to kind of, like, coach it or make more rules-based kind of code. And then you end up saying, oh, now they got smarter.
How have you, you know, do you advise teams to just constantly test? Is it like you have an eval framework that sits on top of that, or how would you sort of approach that problem in a more, like I said, first-principles way?
(19:24):
Yeah, I think there's really two things you can do.
So the most important thing you can do is you should capture the
thing that's not working as an eval.
And the reason that's really important is anytime a new model
comes out or anytime you have some kind of idea about how to
improve things, you want to be able to re-evaluate the
assumption as efficiently as possible.
(19:45):
So let's say, like, you're trying to build something in the healthcare space, and you want to have a model look at patients' notes, and you found that if the patient's notes are more than three pages long, the model gets totally stuck. I'm just making this up.
And so you're like, OK, what do I do about this?
Well, the first thing you should do is capture several examples
(20:07):
of the model getting stuck so that, let's say, GPT-6 comes out
tomorrow. You can click a button and then
find out if it's still stuck. So make sure you do that now.
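A hedged sketch of capturing those stuck cases so a new model can be re-checked in one call. The file path, the summarize_notes stand-in, and the notion of "stuck" are all assumptions for illustration.

```python
# Hypothetical sketch of "capture the failure as an eval": save the cases where the
# model gets stuck, so that when a new model ships you can re-check them in one call.
import json

STUCK_CASES_PATH = "stuck_long_notes.jsonl"

def record_stuck_case(notes: str, observed_output: str) -> None:
    """Called whenever the model gets stuck, e.g. on notes over three pages."""
    with open(STUCK_CASES_PATH, "a") as f:
        f.write(json.dumps({"notes": notes, "observed_output": observed_output}) + "\n")

def summarize_notes(model: str, notes: str) -> str:
    # Stand-in for the real model call being evaluated.
    return f"[{model}] summary of {len(notes)} chars"

def still_stuck(model: str) -> list[str]:
    """The 'click a button when GPT-6 comes out' check: re-run every saved failure."""
    failures = []
    with open(STUCK_CASES_PATH) as f:
        for line in f:
            case = json.loads(line)
            output = summarize_notes(model, case["notes"])
            if not output.strip():  # placeholder notion of "stuck": empty output
                failures.append(case["notes"][:60])
    return failures

# Usage: when a new model appears, one call tells you whether the old failure persists.
record_stuck_case("page 1 ... page 4 of dense notes", "")
print(still_stuck("hypothetical-new-model"))
```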
To actually work around it, I think there's like really two
things you can do. The first is you can hack and
hacking works. And the second thing, which I've
seen a lot of people start to do, and I think it's actually a
(20:30):
pretty powerful method, is just to reshape the user experience
around what the models can do today, with an eye towards
enabling more as the models get better.
If you think about some of the most popular AI applications,
they are simply allowing the user to do things that from the
outside, if you look at something like Cursor, you're
(20:51):
like, Oh my God, they got so lucky because the models are
really good at coding. Well, no one actually knew that
until Cursor really became popular.
And I think the Cursor team is pretty smart, right?
They iterated like crazy and figured out like, oh, these
things actually work really well.
Let's actually figure out how to craft a user experience which highlights these things. So I think that mentality is
(21:13):
also really powerful. What does good observability look like in AI apps, and where are most teams kind of blind right now?
Yeah, I think observability is a bit of an overloaded term. And the reason is, I think it's
really important to think about the business outcome that you're
trying to drive by investing in observability.
It's also expensive. And so it's important to think
(21:35):
about like what am I actually trying to solve?
So in the traditional software world, I think a really good
product is Datadog. And the problem that you're
trying to solve by investing in observability with Datadog is
uptime, or, you know, uptime in performance.
So, you know, we use Datadog, and let's say someone is having, you
(21:55):
know, the Braintrust UI is slow.
We rely on Datadog to go and figure that out.
And I think, OK, you know, uptime is really important to us. Therefore we invest in observability and pay Datadog a ton of money, and, you know, it helps us achieve that goal.
And in AI, I think the purpose of observability is fundamentally
different. It's quality.
(22:16):
So an investment in observability is an investment
in the quality of your application itself.
And if you work backwards from that, you know, you ask, OK, how
does observability contribute to quality? I think the most important thing is that it lets you build a feedback loop between what users are experiencing and what your
developers and subject matter experts, product managers are
(22:38):
looking at regularly. So the most important thing that
I like to ask people is, if I use your app and I complain, how
much effort is it for you to take that complaint and turn it
into an eval? And for the best teams, it's like one or two clicks. If not, the teams that are NGMI,
(22:59):
you know, they haven't even thought about the answer to that, right? And so I think that's really
the goal. Like can you get to a point
where it's two clicks from someone complaining to something
ending up in your evals? We need to start a not-going-to-make-it, name-and-shame segment.
I mean, to go into that one a little deeper.
(23:20):
Interesting. Like so when you say like one or
two clicks, and that's like, hey, they report a bug because they
shake the app or they put, you know, they click something on
the UI to report a bug. And then behind the scenes
they're sort of capturing that. Or they're like, are there teams
that are actually taking user feedback and just saying every
bug sort of captures the input or the sort of change and then
just sort of makes a new eval for that?
Yeah, I think there are some people who go overboard.
(23:43):
So you asked this really good question about like what happens
when I have too many evals and how do I interpret that?
I think that one of the lessons I learned pretty early in
traditional ML is that while you can be very liberal with the
data that you use for training, you have to be very careful
about the data you use for validation.
(24:03):
And I think that evals are very much an exercise in applying
human judgement and taste to an LLM, you know, and what it generates. And so I think at some point it
is quite valuable to have a human curate the set of things
that end up in your evals. Now I think scale, like, dictates
(24:24):
everything. If you have one user and your
user uses the product five times a day, then you should look at every log and then say, oh, this one is fine, but this one didn't look so good, let me add it to my evals.
As you start to scale, there are a lot of different things you
can do. The thing that has worked really
well for our customers is using online scoring.
(24:45):
Online scoring lets you write code or use LLMs to look at your data and then figure out what is likely to be, you know, potentially bad, and then save it for evals.
And there's a lot of, you know, really interesting stuff you can
do there. And then, of course, the other signal is user feedback.
So you can capture, you know, explicit user feedback like
(25:05):
thumbs up or thumbs down or implicit, like are they shaking
the phone or whatever. And that tends to be a good
signal, but it's very, very noisy.
So I think, you know, using LLMs to post-process stuff, applying curation with some human insight, all of that is very important.
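A rough sketch of online scoring as described: a cheap judge plus noisy user feedback flagging logs for curation. The judge, threshold, and log shape are placeholders, not a real scoring API.

```python
# Hypothetical sketch of online scoring: a cheap LLM judge skims production logs,
# flags the ones that look bad, and queues them for the eval set.
def judge_score(user_input: str, model_output: str) -> float:
    # Stand-in for an LLM-as-judge call returning 0.0 (bad) to 1.0 (good),
    # e.g. by asking a small model "did this output satisfy the request?".
    return 0.5

def score_logs_online(logs: list[dict], threshold: float = 0.4) -> list[dict]:
    """Attach a score to each log and return the likely-bad ones for human curation."""
    flagged = []
    for log in logs:
        score = judge_score(log["input"], log["output"])
        log["online_score"] = score
        # Noisy signals like a thumbs-down make a log more interesting, not automatically bad.
        if score < threshold or log.get("feedback") == "thumbs_down":
            flagged.append(log)
    return flagged

# Usage: flagged logs go to a human (or an agent) to curate into the eval dataset.
logs = [
    {"input": "weekly cart", "output": "no bananas", "feedback": "thumbs_down"},
    {"input": "weekly cart", "output": "cart with bananas", "feedback": None},
]
print(score_logs_online(logs))
```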
What's next for evals, agents, and AI infrastructure over the next few years?
You know, so one of the weird things over the past couple
(25:28):
years of building Braintrust is that we work with teams that are
building like the coolest AI products in the world, but our
product has had like almost no AI in it.
And you know, I feel like we're building, like, a previous-generation product in some ways.
And the reason is that until recently, models have not
been really good at looking at each other's work.
(25:50):
I think, like, a good metaphor for this is a dog looking at itself in the mirror. Literally, like, if you give an LLM some of the work of another LLM, it will get confused and sort of have a hard time understanding what is it versus, you know, the other LLM. Luckily for all of us, I think
(26:10):
Claude 4, GPT-5, and not yet Gemini 2.5, but hopefully.
Listen up, Google. Yes, please, Logan.
Yeah, everyone, please. The model is clearly smart enough, but it's just not good at calling tools. Anyway, the most recent vintage of models, you know, they have
enough of an identity, if you will, that they're able to, to
(26:31):
decompose this. And I think with advancements
around reasoning and so on, we've just seen in our
evals that they're able to now actually meaningfully improve
the work of another LLM. And so we've started unleashing
this in our product. We just released something
called Loop, which is an agent that's built into Braintrust. It's now accessible as an MCP as well.
(26:52):
And so you can use it, you know, in whatever medium makes the
most sense. But it is incredibly good at
actually looking at the work of another LLM and improving
things. And we've seen great performance
for things like automatically improving prompts, looking at a
data set and figuring out, OK, what should I add to this data
set to make it better? You know, looking at logs and
(27:14):
trying to figure out what should I pull from these logs that Kevin's not already testing, and help him.
Yeah, exactly. And so I, you know, I, I think
again, if you go back to what evals fundamentally are, right,
it's like, as a human, I'm helping to guide a model to do something that aligns with my or users' expectations.
(27:37):
The implementation of that until recently was very manual.
Like you look at a dashboard with your eyeballs and then you
click around and you're like, OK, let me like type 65
characters to see if I can tell the model that I'm going to lose
my job if I, you know, whatever and see if it changes things.
And I think that part of it is really going to change over the next year. The implementation of evals
(27:58):
and creating the feedback loop from logs to evals is going to
be very, very LLM driven. But I think we're still going to
be guiding the taste and judgement that sort of goes into
that. Where can people get involved
and follow your work? So I'm on X slash Twitter.
My username is ANKRGYL. As an Indian person, it's kind
of hard to get your name, so I discovered when I was like 11
(28:20):
that if I removed the vowels, that was available.
And, you know, it's actually worked well, it's universally worked for me across every platform.
So none of the other, you know, million people with my exact name figured it out.
Yeah, exactly. Yeah.
And what's the Braintrust website? Braintrust dot dev.
Awesome. And then, any offer or anything you
(28:41):
want to share with Founder Mode listeners for Braintrust?
Yeah, if you listen to the podcast and you try out
Braintrust and you want a sweet deal, hit me up.
You can reach out at ankur at braintrust dot dev or hit us up on Discord.
Love it. All right, that was super
clear and just like such a refreshing take on what's
(29:01):
actually working in AI. So thank you for being here,
Ankur. Thanks for having me, a lot of
fun. That was one of those interviews
where the answers really can help us and me tomorrow.
So I think, you know, it's not just theory, but it's on the
road map. Yeah, you're, I know you're
going to call a couple people after this.
What stood out to me was how Ankur really broke down the way the models have evolved and how that has impacted his work at
(29:21):
Braintrust. And now he's solved this problem
multiple times. It's been amazing.
What were your top five, Kevin? Yeah.
So I think databases are everything.
I think this idea that evals need to become, you know,
mandatory once you start to see regressions, the idea that you
ship fast but then have this closed feedback loop, LLMs sort of need to be evaluated monthly, right?
You know, in the old days you'd look at infrastructure every
(29:43):
year or two. And now it's like LLMs are
moving so quickly. You need to sort of like take a
quicker look on a more, you know, frequent basis.
And then finally this idea that, as long as you're not using Gemini, models can improve models, and I think Gemini will get there. Google, you know, they're grinding hard. But I think those, you know, were the big ones I saw today.
Yeah, if this helped clarify your thinking on AI development,
(30:04):
please share it with your team, especially the people that are
thinking beyond the prototype. And if you're enjoying Founder
Mode, like subscribe, drop us a comment, give us a review, and
let us know your feedback. And that's a wrap.
Because building AI is hard, but shipping it doesn't have to be.