All Episodes

June 25, 2025 49 mins

Unlocking AI agents for knowledge work automation and scaling intelligent, multi-agent systems within enterprises fundamentally requires measurability, reliability, and trust.

João Moura, founder & CEO of CrewAI, joins Galileo’s Conor Bronsdon and Vikram Chatterji to unpack and define the emerging AI agent stack. They explore how enterprises are moving beyond initial curiosity to tackle critical questions around provisioning, authentication, and measurement for hundreds or thousands of agents in production. The discussion highlights a crucial "gold rush" among middleware providers, all racing to standardize the orchestration and frameworks needed for seamless agent deployment and interoperability. This new era demands a re-evaluation of everything from cloud choices to communication protocols as agents reshape the market.

João and Vikram then dive into the complexities of building for non-deterministic multi-agent systems, emphasizing the challenges of increased failure modes and the need for rigorous testing beyond traditional software. They detail how CrewAI is democratizing agent access with a focus on orchestration, while Galileo provides the essential reliability platform, offering advanced evaluation, observability, and automated feedback loops. From specific use cases in financial services to the re-emergence of core data science principles, discover how companies are building trustworthy, high-quality AI products and prepare for the coming agent marketplace.


Chapters:

00:00 Introduction and Guest Welcome

02:04 Defining the AI Agent Stack

03:49 Challenges in Building AI Agents

05:52 The Future of AI Agent Marketplaces

06:59 Infrastructure and Protocols

09:05 Interoperability and Flexibility

20:18 Governance and Security Concerns

24:12 Industry Adoption and Use Cases

25:57 Unlocking Faster Development with Success Metrics

28:40 Challenges in Managing Complex Systems

30:10 Introducing the Insights Engine

30:33 The Importance of Observability and Control

32:33 Democratizing Access with No-Code Tools

35:39 Ensuring Quality and Reliability in Production

41:08 Future of Agentic Systems and Industry Transformation


Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash


Follow Today's Guest(s)

João Moura: LinkedIn | X/Twitter

CrewAI: crewai.com | X/Twitter


Check out Galileo

Try Galileo

Agent Leaderboard


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
I want to say it's shaking up everything. It's shaking up the stacks, it's shaking up the protocols. And maybe I'm going too far here, but it's shaking up the market. A bunch of companies that were kind of coasting, and now everyone is in the trenches, like, no, no, no, this is it. This is the real deal. This thing can change the

(00:20):
market in a way that's meaningful.
Welcome back to Chain of Thought, everyone. I am your host Conor Bronsdon, and I'm delighted to have my co-host Vikram Chatterji, co-founder and CEO of Galileo. Vikram, great to see you on this Friday.
Thanks, Conor. Happy Friday to you.

(00:41):
Yeah. And it's an especially exciting conversation we're about to have, because we are lucky to have João Moura, the founder and CEO of CrewAI, joining us. João, so good to see you.
Hey there, thank you for having me. Very excited to be here and chatting today. So much stuff for us to go over.
Oh my God, no kidding. The AI agent space is just

(01:02):
undeniably electric right now. We've got people trying to create new frameworks. We have enterprises scrambling to find their agent strategies, as I know a lot of folks are talking to you about. Developers are, I think, both excited and maybe a little overwhelmed by this pace of innovation. And you, Joe, are right in the eye of that storm, building CrewAI.

(01:22):
So we're very keen to talk to you today and to discuss how to make great AI agents, whether it's at a startup or at an enterprise, and to explore this emerging AI stack: what it looks like, where we're headed, and how companies can build reliable agents that deliver on the promise of customer value. And we also want to get your thoughts on interoperability.

(01:43):
There are so many different frameworks being discussed, whether that's Google's A2A or the AGNTCY effort with Cisco that we're both a part of, plus governance, and broadly how agentic workflows may reshape how software works and what we're doing on a day-to-day basis. So let's start with this broad piece: what is the AI agent

(02:05):
stack, and how would you define it and its core components today?
I love that. That's such a great question. And by the way, given the scope of what we're covering, this is going to be a five-hour episode; it's going to be pretty chill. I hope you don't have a hard stop here.
Yeah, but I got to say that this idea of a stack is something that I started

(02:26):
thinking about after I actually spent three days at another company, and this company is huge, right? It was like 900 people in there, and I was invited because they were a partner and we were about to coach them on how to sell CrewAI so they could resell our enterprise

(02:46):
licensing. While I was there, I was talking with these people and trying to understand: all right, these guys are closing these insane deals, like $20 million deals. And I was trying to understand, what is their product? And I started to realize a little bit about their product, and I started to look at agents, and

(03:08):
that's like, well, that's very similar. And if you look at the market right now, it actually is very similar. You look at all these companies: they usually start on this discovery process where they have this desire to get enabled or educated. And with that comes a bunch of interesting technical questions,

(03:28):
like memory, or graphs versus events in all the open source projects. And those are interesting questions, but honestly they are commodity. Like, yes, you're getting memory; there are 20 different ways you can do that. That doesn't really matter at the end of the day. But once those companies start

(03:49):
to think about, all right, we're going to be agentic-native, right? And that's just a fancy way of saying, I'm going to have hundreds or maybe thousands of agents running in production in X years' time. Then the real questions pop up, and you start to think about: well, how am I going to provision them? I'm going to think about authentication, I'm going to evaluate them, I'm going to measure them.

(04:10):
And that for me goes into this idea that it's not one thing. These companies will need a suite, and it's a suite because the stack is many different things. And it's for all the agentic resources, not only the agents: the agents, the tools, the authentication, the authorizations, and everything in between. So I think it starts at the bottom, at the, you know,

(04:32):
data management level, so your Databricks or Snowflake or BigQuery. And then you work your way up through LLM orchestration, authentication, and connectors, and it stops at this idea of agentic apps, where you have an actual UI that you can actually interface with these agents through.

(04:52):
But sorry, I'm going to leave it at that, because that alone is a huge topic we could talk about.
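For reference, here is what the orchestration layer Joe describes can look like in code: a minimal sketch using CrewAI's open-source Python library. The roles, goals, and task text are illustrative assumptions, not anything discussed in the episode, and it assumes an LLM API key is already configured in the environment.

```python
# Minimal CrewAI sketch: two agents orchestrated as one crew.
# All roles, goals, and task text below are made-up examples.
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Research Analyst",
    goal="Map the layers an agentic-native stack needs",
    backstory="You analyze enterprise infrastructure requirements.",
)
writer = Agent(
    role="Technical Writer",
    goal="Turn research notes into a one-page brief",
    backstory="You write concise briefs for engineering leaders.",
)

research = Task(
    description="List the layers of an AI agent stack, from data "
                "management up to agentic apps with a UI.",
    expected_output="A bulleted list of stack layers.",
    agent=researcher,
)
brief = Task(
    description="Write a short brief from the research notes.",
    expected_output="A one-page markdown brief.",
    agent=writer,
)

# The Crew object is the orchestration layer: it sequences the tasks
# and routes the researcher's output into the writer's context.
crew = Crew(agents=[researcher, writer], tasks=[research, brief])
print(crew.kickoff())
```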
I'm also curious, Joe, when you think about stacks generally: if you go back to the cloud time frame, it goes back to that IPA framework, in the sense that there's infrastructure, then platform, and then all of that allows you to build applications on top. We are starting to hear a lot

(05:15):
from the cloud providers, as well as a bunch of others, around how this IPA framework almost changes for agents. For the infrastructure that you need for this, obviously there's the compute side of things, but then there's also the whole question of how you actually build out these agents at scale very soon. And then there's the platform side of this, where Crew plays

(05:38):
a very big role: how do you help people build out these agents quickly? But to your point, the platform piece also starts to include what auth looks like and what the communication protocols look like. How much of the IPA stack, so to speak, do you think is going to stay the same as what it was for the cloud, versus having to change fundamentally for multi-agentic systems? Which I

(06:00):
personally think are literally already here. You're going to have those agent marketplaces we're all going to be talking about. Don't recreate your own, I don't know, agent for travel booking; just use these five agents out there. And so there's going to be a marketplace. What changes and what doesn't change from a stack perspective, do you think?
I mean, that's happening.

(06:20):
I mean, I heard it's happening, it's coming. There are many people doing it; we'll have our own marketplace by the time this episode comes out, but other companies are also launching their own. So I think it's right around the corner.
Yeah. If I got to choose, I wish more stayed the same. I'm a simple kind of guy. I like protocols that work, that

(06:41):
have been around forever, like HTTP and REST; that's just my bread and butter. But honestly, there is definitely a push for new pieces out there, right? And that is changing things a little bit. So I think some of the infrastructure remains the same, yes, to some extent. But we're seeing even companies questioning their

(07:02):
infrastructure choices, a lot of companies falling back from the cloud, like: I don't want SaaS, I want to self-host. And some companies are like: I don't want to self-host, I want on-prem, I want a physical server. When you run some of this math, there are companies that are betting that this is the way things are going to go. I honestly am not 100% sure one way or the other yet,

(07:24):
but there are big companies betting that on-prem and physical servers are going to be the thing for the next 10 years. Because if you're running this at a huge scale and you can save 60% by running it on a physical server, it might justify doing it. I'm not sure yet if that's the way things are going to go. But I want to say it's shaking

(07:45):
up everything. It's shaking up the stacks, it's shaking up the pre-existing players, it's shaking up the protocols with all the new protocols coming. And honestly, maybe I'm going too far here, but it's shaking up the market. I mean, look out there: a bunch of companies that were kind of coasting, like, you're winning.

(08:07):
You don't come out of the woodwork, like Sergey from Google, right? When you saw that guy come up in the news and give interviews, it had been forever. And now everyone is in the trenches, like, no, no, no, this is it. This is the real deal. This thing can change the market in a way that's meaningful. So everyone is out there: you see Benioff out there, you see all the big players getting

(08:30):
in the dirt. So to your point, going back to your questions, I think there are a lot of things that will change. The new protocols, definitely: MCP and A2A kind of had a head start, and that's pretty interesting. I think MCP wants to expand now to also include agents. So it's very clearly going to be like AWS against Microsoft

(08:53):
and Google. It seems like a battle for that protocol there, but I think a proper standard is going to take years to establish.
Interoperability has to be a priority here, because as you're both highlighting, the opportunity is to have thousands of agents potentially working together.

(09:13):
And that doesn't necessarily mean only internal. You're often going to have to interact with other companies' agents; in fact, you probably already are in some cases on the Internet today. So how do we ensure that this ecosystem actually has the right frameworks and the ability to deliver on this promise instead

(09:36):
of becoming even more chaotic?
The hyperscalers are going to play a role in this, that is very clear. So there are integrations with them. For example, CrewAI is natively integrated now with Bedrock Agents and with Azure AI Studio. We just announced it with the Microsoft folks at Build in Seattle, and a few

(09:59):
weeks ago with AWS. So I think the hyperscalers are going to play a role. But beyond the hyperscalers, there are companies that are going to have their own agents, right? If you really think about it, ServiceNow is a big one, SAP is a big one, and Salesforce is another big one. And companies are going to use those agents. There's no way around it, right?

(10:19):
So right now, a lot of the work that we do is working with these companies, figuring out those connection points. And some companies are not ready to settle on a framework or on a protocol, and they just want to say, all right, let's do HTTP, and that's OK. So I think as long as there is interoperability, how we get there, the customers don't

(10:43):
really care. But interoperability does need to exist. I mean, the one thing that we hear from everyone in the market is that no one wants to get locked in anywhere in the stack. They want to change LLMs at any point. They want to change frameworks at any point. They want to change memory at any point. They just want to have flexibility.
Yeah. And actually, to that point of

(11:04):
flexibility, let's transition from the enterprise down to the actual developers that build stuff, right? We've been seeing for a while now that software engineers have finally gotten all these tools that they can just use to build stuff. But going back to that age-old question mark which they all have: I can build out these pilots really

(11:27):
quickly, and I can do really cool things in a hackathon. But as soon as we start talking to even startups here in San Francisco, it's like: great, this is a massive business that you can build. You can build agents to try to completely, you know, upend the entire supply chain market. They're like, yes, that's what we're going to do. Day three of that, they're like, it doesn't quite work.

(11:49):
We need to figure out how to make this work. It has 100 more failure modes than what we dealt with before with a simpler RAG-based system. This is the big bottleneck that I'm always very curious about, because there's the whole big stack, but then there's the actual productionization stack, and there's something to do with a new SDLC process

(12:09):
for these non-deterministic systems. So maybe something to talk through is: what should developers be thinking about? Can you use your own existing IDE, your existing testing frameworks, et cetera? And where do you change things up in your SDLC process such that you can actually build

(12:30):
at scale in a very reliable way with these multi-agent systems?
Yeah, I think honestly, as you said, it's a different mental model, right? We're talking about this idea of a non-deterministic kind of framework. If you think about traditional software engineering, independently of whatever language you use, you're talking about very strongly typed

(12:51):
systems where you very clearly know what is coming in, what transformations are happening, and what's coming out. And with agents and AI apps, you don't. So writing deterministic tests becomes a problem, and things like LLM-as-a-judge become way more important here. So I think that's where some of the

(13:12):
things that you folks are doing are very important, because if you're not measuring, you don't know if it's working. What we see is that people sometimes struggle at different points. They struggle to get it started, because they've never done something like this, and usually we help them quite a lot. But then they also struggle to get it better, because they get into,

(13:36):
you know, a kind of short-blanket situation, where they fix one side and the feet stick out, and then they fix the feet and the blanket is too short on top. So in sum, they're operating without the regular CI/CD testing that you would do in a more mature setting.

(13:56):
So again, there's a stack out there. There are tools out there that can help you. But a lot of the engineers out there, they're not like you and I, living and breathing this day to day. They're still catching up, honestly.
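Since deterministic assertions can't grade open-ended agent output, LLM-as-a-judge tests like the one Joe mentions become the workhorse. Below is a minimal sketch in Python; the rubric, model choice, and passing threshold are illustrative assumptions, not anything prescribed in the episode.

```python
# A minimal LLM-as-a-judge sketch: a second model grades an agent's
# answer against a rubric and returns a structured score.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Return JSON: {{"score": <0-10 integer>, "reason": "<one sentence>"}}"""

def judge(question: str, answer: str) -> dict:
    """Ask a judge model to score an agent's output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any capable model works
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  answer=answer)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

verdict = judge("What is our refund policy?",
                "Refunds are issued within 30 days of purchase.")
print(verdict)
# A judged score can gate a CI run the same way a failing unit test would.
assert verdict["score"] >= 7, "Judge score below threshold; failing the run"
```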
So let me posit this to both of you then. You've both alluded to some of the edge cases and some of the opportunities here, and obviously Crew is doing some

(14:19):
incredible stuff to increase the accuracy and reliability of agents. And Galileo is taking our own approach of, how can we provide evaluation, observability, and reliability tools for folks who are building agents? But what are the other areas where significant innovation needs to occur for agents to

(14:39):
realize their potential? Are there particular layers that are very nascent still or need additional standardization? I'd be curious to get both your opinions on this.
I can tell you that for me, there are a few big layers. I think authentication is one very clear one. There are a lot of people working

(14:59):
on this, right? Okta is launching their own thing, Microsoft launched their Entra integration, and all that, but it's still not very clear. It feels like a lot of smoke and mirrors, right? Yes, you now can fingerprint and you can authenticate, but this problem runs deep. It's not only authentication; it's the scope of the access, and then

(15:22):
dynamically changing that scope of access as the execution goes, and the question of who is running this. I mean, just a few days ago, right, GitHub had that huge incident where their new MCP server was giving access to private repos, and they shut it down. So I think, Vikram, you kind of hinted at this:

(15:42):
bringing these systems into production creates so many new attack vectors that you've got to watch for.
Yeah, 100%. I feel like there are a couple of these layers, Conor, but it does require cross-industry adoption of certain standards for this to actually happen. The one weird parallel that I talk about, and I'll do this in 30

(16:05):
seconds, just to keep it short, is in the financial services space, honestly. Because, you know, there's a big interoperability problem amongst banks for sending money from one bank to the other, right? And in Asia, across China and India, and a couple of other countries across Europe,

(16:26):
they built out an interoperability network, across all the states and across the EU, across countries, that was adopted widely. What that allowed you to do is send money from one bank to the other bank by just using a very simple protocol. I think the same thing needs to happen here, because you have to have these protocols that everyone can just agree to. The worry that I have on that side of the stack, on the

(16:48):
protocol side of the stack, is the same thing that happened with models, where people are thinking of that part of the stack as something that I can own as a cloud provider, right? And this is going to allow me to have the wedge so I can own the entire agentic market. I think that's a very short-sighted way of doing things, right? It's kind of like how OpenAI just came out with a closed-source model: you're just going

(17:09):
to own this and monetize the hell out of it. Google came out with BERT way back, which was the right way of doing this, in my opinion. I'm biased because I was part of that, but I feel like that was good, because: go builders, let builders build, let people do more with this. So now the same thing is happening with frameworks, where you have A2A and a bunch of others, and the ones which start to get traction are where you're taking a cross-industry

(17:32):
approach to this. So you bring everyone together and start to adopt everything around this one framework. And that's going to be very important on just the communication protocol side of things. So I think that's one side of the stack, where hopefully by the end of this year we'll see something come up where everyone just galvanizes around one standard. I think that's going to be huge. But the other piece, to Joe's

(17:56):
point before, and I agree with him, is that there's something around authentication which is really going to be important as well, where you might see Okta coming in, but you're also starting to see certain security players come in, because now there are non-human identities from a security perspective, which is becoming a bigger and bigger thing. Again, Okta is playing there, a couple of others are playing there. But if you look at it, it's kind of similar. Again, I go back to cloud,

(18:17):
because cloud brought in the advent of Okta, which actually had to be built out for that; Datadog had to come in for observability; and all sorts of cloud security companies came in and got bought by Cisco. So the same thing is going to happen all over again, I feel, across the whole stack. But the biggest thing which worries me is this communication protocol, where I'm starting to see people get a little bit greedy around,

(18:38):
like, I own this now. It shouldn't be that way. Somebody has to come in and say, we all own this, let's work together as Switzerland, and let's come up with something which everyone can own. It's open source.
Yeah, I feel that on the protocol for sure. And we had people ping us about, hey, let's launch this, let's do that. And honestly, it's interesting to see. There's definitely

(19:00):
a sentiment there, and you can tell that people feel like if they launch the protocol, they get the market a little bit. And honestly, I don't think that's the way that would work. I think, honestly, even whoever launches a successful protocol, I don't think this person is going to get much of an edge. If anything, they're going to spend a lot of time working on this while other people are building.

(19:22):
Yeah, exactly. I feel like it should just be some kind of an open source protocol. And the MCP piece is interesting, but that's not the same as these communication protocols. I was very excited about the MCP piece, because at least that allowed some standardization around tool calling, which is great; that was a huge unlock. That was such a big pain point. But I feel like now the next

(19:43):
pain point, which has not been completely faced by people, honestly, is the multi-agentic communication protocol. Very few people, I think Joe alluded to this before, very few people in the enterprise are actually building true multi-agentic systems. And so the pain hasn't been felt. I feel like by the end of the year, with these marketplaces coming in, so many people will face that pain that we are going to have to start working towards

(20:06):
some kind of a communication protocol that we can all agree to, and then start to say no to a bunch of these others.
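For a concrete picture of the tool-calling standardization Vikram credits to MCP, here is a minimal MCP tool server sketch using the official Python SDK (`pip install mcp`); the tool and its hardcoded data are made-up examples.

```python
# A minimal MCP tool server: any MCP-capable client (an IDE, an agent
# framework) can discover and call the tool below over stdio.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-tools")

@mcp.tool()
def get_exchange_rate(base: str, quote: str) -> float:
    """Return a currency exchange rate (stubbed with a fixed value here)."""
    rates = {("USD", "EUR"): 0.92}  # stand-in for a real data source
    return rates.get((base.upper(), quote.upper()), 1.0)

if __name__ == "__main__":
    # Serves the tool over stdio using the MCP protocol.
    mcp.run()
```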
Agree.
What about the governance side of this, as we start to unleash more and more agents into our businesses, into our day-to-day lives, particularly as they get more complex with these multi-agent systems that are capable

(20:27):
of more than maybe some of the junior-coding-intern tasks that we started with? In certain cases, there are so many second-order and third-order impacts that can potentially occur here. And you know, Vikram, you brought up financial services here, in, you know, the idea of the SWIFT-network approach. There are also a ton of regulations that kind of come in

(20:48):
there, and banks in particular need to consider that. But most industries have some regulatory frameworks to concern themselves with. What do you think developers and organizations that are building AI agents should be thinking about when it comes to governance, and, I guess, not just the observability but also the maintenance of these

(21:11):
systems? I'd be very curious to get your thoughts on this, especially given where Crew sits in the space. And I know that you're probably talking to a lot of financial services companies too.
It's actually fresh in my mind right now, because just before this I had a call with one of these, a top-ten bank in the US, and first of all, they're building agents. Actually, I take that back: they've built a multi-agentic system for real use cases

(21:35):
already, which is amazing to see, first just from an adoption perspective. First of all, just take a second to understand this: this is new technology, this is a top-ten bank in the United States, and they've already built a multi-agent system. And then some. But the second thing was the question about governance. The question was: I have all these policies from my model risk management team,

(21:57):
from my other teams, and I and my other stakeholders have auditability and all of these other responsibilities to uphold. How can I take all these policies and actually bake them into my system, so that I can actually show them exactly how these agents made their decisions? That's number one, which is more of a visibility protocol. And the second one is more about

(22:18):
how do I bake these different policies into my system in the form of different kinds of evaluators? And that's where, you know, Galileo has been playing quite a big role. We've been thinking a lot more about this. We don't believe the future of this industry is going to be a set of magic

(22:38):
metrics, which I feel has been very much the observability-1.0 approach: I have these 10 metrics, and I have these 15. That's nonsense. We keep coaching people about how it's all about your use case and your business, and how you can bring those policies and bake them into the system in the form of really high-accuracy custom evaluators that you can build out.

(23:00):
And I think that's where things are going to go. And I think it's going to get more and more solidified from a product and process perspective in these companies: you know, how do you build these high-accuracy evaluators, and how do you bake them into your SDLC process and your CI/CD process? And that, to Joe's point before about education and coaching, is generally the coaching

(23:21):
that we give them as well. Don't come in asking for, I don't know, a faithfulness metric. Again, it's ludicrous. Ask the question of what's important for your business, for your use case. Let's think about that more deeply, and then we'll give you that metrics IDE for you to be able to create that in two minutes, and then bake that into your CI/CD so that it's high quality, low cost, low

(23:43):
latency, and scalable. That's why I feel like that move has to be made for the auditability, traceability, and governance side of things. But I'd be very curious to hear your thoughts, Joe, in terms of when you're talking to these banks and telling them they can build all these crews. I'll bet that their eyes light up at the thousands of use cases they can unlock, but then their eyes probably go wide, like, wait.

(24:04):
But how do I get the governance and security and everything else?
Yeah. I got to say, it's interesting, because I think highly regulated industries are moving exceptionally fast on this. The financial industry for sure, but also insurance companies are moving exceptionally fast. There's a bunch of industries that traditionally move slowly, and they're jumping on this.

(24:25):
And I think it's because they see the potential, right, for the efficiency gains they can have on some of these, because usually being in those industries means a lot of back-office work. So that means a lot of potential for automating. We're talking with some of these banks, and we might be talking with the same bank, and they're building quite a lot. And it's interesting to see, because I agree with you.

(24:48):
And you see that even in the use cases, right? When someone comes to us and asks, hey, what should I be doing, or what are other people usually doing? We disqualify them as a customer and we don't spend time with them, because they're not ready, right? And even if they show up with a use case, if they don't know what success looks like, then what's the point? So you've got

(25:10):
to know what you're going to do first, then what success looks like for that, and then we talk about how we're going to measure it. Because if you don't measure it, again, there's no point in building this out and not being able to tell if it's been successful or not. So I agree with you. I think for success metrics here, you can do a few things with LLM-as-a-judge, where

(25:32):
you're going to ask, what is the quality, and what is the hallucination rate? That's pretty good; that takes you far enough, but then you've got to go custom. There's no way around it, right? For certain use cases, it's going to be more about taking actions and measuring that. For other use cases, you might actually be measuring against the bottom line, and how are you going to measure that?

(25:54):
I do think that knowing what success looks like, and then making sure that you implement measurements for that, is what unlocks people to move faster. And this goes to your point way above, about traditional evaluation for LLMs, where people are basically doing traces,

(26:14):
right? Like, oh, I do traces and I collect metrics around those traces. Well, that falls very short now when you're talking about agents. There are so many more layers in there that you want observability on, that you can measure and evaluate things on, that, yeah, you're missing out if you're selling this as traditional, LLM-style tracing.
Yeah.
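To make the "go custom" advice concrete, here is a sketch of a business policy encoded as a custom evaluator and run as a pytest-style CI gate. The policy, the regex check, and the stubbed agent call are all illustrative assumptions; in practice the evaluator might itself be an LLM judge or a fine-tuned SLM.

```python
# A custom policy evaluator baked into CI: run with `pytest`.
import re

def no_financial_advice(output: str) -> bool:
    """Policy evaluator: the agent must never recommend specific trades."""
    forbidden = re.compile(r"\b(buy|sell|short)\b.+\b(stock|shares|crypto)\b",
                           re.IGNORECASE)
    return forbidden.search(output) is None

def run_agent(prompt: str) -> str:
    """Stand-in for invoking the real agent (e.g. a crew.kickoff() call)."""
    return "I can explain how index funds work, but I can't recommend trades."

def test_agent_respects_advice_policy():
    cases = [
        "Should I buy NVDA stock?",
        "What's a diversified portfolio?",
    ]
    for prompt in cases:
        output = run_agent(prompt)
        assert no_financial_advice(output), f"Policy violation for: {prompt}"
```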

(26:35):
We've actually recently released new capabilities around that as well, particularly targeted at trying to help builders identify and unpack these more complex use cases that you're alluding to here, Joe. And then also, on the customization side, we recently released a family of additional

(26:56):
Luna models, based on our in-house research team's proprietary work, that are designed to create custom SLM evaluators. These are especially useful for the kind of high-risk production deployments you're talking about, and can enable customization at a high level with an in-depth focus for large

(27:17):
enterprises like FSIs. In particular, we're seeing a lot of uptake there from banks who need to create these custom metrics and customized evaluation frameworks, which include not only LLM-as-a-judge but, you know, custom SLMs that they fine-tuned with us, and then also feeding in expert feedback from humans to auto-

(27:38):
tune these metrics and provide continuous learning through that human feedback. And I'm particularly excited by this kind of hybrid approach that we're seeing evolve, and that we're seeing a lot of these major enterprises really start to think through and build out. Because to me, that seems like the direction to go to enable these vast multi-agent crew

(27:58):
systems to really go into production as we get through the rest of the year.
Yeah, I agree. And this idea of having SLMs validating and playing a role here, I think that's huge.
The one other thing I would add, Conor, though, is about the adoption of a hundred crews across an enterprise, right? What that does is it allows these developers to be

(28:23):
able to build these systems that massively reduce operational expenses, right? At the bare minimum; obviously there's an entire revenue component to this as well. So the value this unlocks is massive. We've also heard, and I'm curious to get your thoughts on this, Joe, we've heard about how, because of this, the complexity of the software system itself

(28:45):
for them is just much more than what they've had before. At least earlier, I feel like they could wrap their heads around the fact that there's a prompt, there's a model, there's context, there are chunks; and, I get it, it's like four or five things I can check for, and they can figure it out. Now it's: which tool was called, what was the quality of the handoff, which tool was not called and why, and did the action actually get completed in the end or not?

(29:07):
And the bar for whether an action got completed or not is much higher than the bar for whether the chatbot responded with the right thing or not. And so, you know, what we're starting to see is that the number of failure modes has increased. The other transition that's starting to happen, or needs to happen now, just from our customers, we keep hearing this, is that even custom evaluators and such are not enough.

(29:29):
You have to almost start to give them insights into: here are the five to ten things that are going wrong across all your crews. Potentially not even wrong; it's more like the five to ten optimizations you can make across all these different crews, across all these different agents. Maybe you're calling five tools which are doing basically the same thing. This happens to me all the time when I'm building out apps, and I would love for my

(29:51):
dev tool to just tell me that this is what's going wrong. It's almost like a Cursor, but for the reliability of these tools. So we're starting to see that transition also starting to play out. And that's something which we just frankly launched ourselves as well. We call it our Insights Engine, where we automatically tell developers: here are the five to ten things that are going wrong, which would probably

(30:12):
take you two weeks to notice otherwise, and here's what you can do about it. I feel like that's going to be much more important going forward than asking people to just try to figure it out, especially as the system is getting so complex. So, Joe, I'd love to get your thoughts about the complexity of the system and how developers should think about that.
Yeah, I mean, the systems are definitely getting more complex, because at the end of the day, these agents are at the

(30:33):
layer on top of everything, and you can have them perform RAG, and you can have them using these tools. So I agree with you that there is value in metrics that zoom in, but there's also maybe even more value in metrics that zoom out, right? And we do add some of that on our platform as well, where you can zoom out and see overall: all right, let me see what these

(30:55):
executions are looking like. How are my agents behaving? What are my best agents? What are my worst? Which of my agents hallucinate the most, and which do I want to trust the least? What are the tools that are breaking the most? What is my cache hit rate on some of these tools, right? There are a lot of metrics in there that I think can be very helpful. But I agree with you. I think right now what people

(31:16):
are realizing is that one thing is playing and building the systems; as you said, prototyping something in a hackathon is very exciting. But when you think about it, it's not even about getting this to production. It's more about scaling this to hundreds of use cases and things like

(31:37):
that. And that comes with its own complexity. I think some of our most advanced customers now are putting around five use cases into production a month, which for me is insane. And that adds up pretty fast. Honestly, do the math: all right, I did five one month, that's OK, and then

(31:58):
another five. Now, who is controlling this? How big is this team? Who is monitoring this? Who is making sure that this is working or not? So I think there is a layer there of, again, there is a stack, and part of the stack is these control plane features

(32:18):
that you are going to need to have in order to control your operation of agents as a whole.
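For a flavor of the zoom-out metrics Joe describes, here is a small sketch that ranks agent/tool pairs by failure rate from trace records; the record shape and field names are illustrative assumptions, not any particular platform's schema.

```python
# Aggregate "zoom-out" metrics from agent trace records: which
# agent/tool combinations break most often across all executions?
from collections import defaultdict

traces = [  # stand-in for records pulled from an observability store
    {"agent": "researcher", "tool": "web_search", "ok": True},
    {"agent": "researcher", "tool": "web_search", "ok": False},
    {"agent": "writer", "tool": "doc_store", "ok": True},
]

totals: dict[str, int] = defaultdict(int)
failures: dict[str, int] = defaultdict(int)
for t in traces:
    key = f'{t["agent"]}/{t["tool"]}'
    totals[key] += 1
    failures[key] += 0 if t["ok"] else 1

# Rank pairs by failure rate so the worst offenders surface first.
for key in sorted(totals, key=lambda k: failures[k] / totals[k], reverse=True):
    print(f"{key}: {failures[key]}/{totals[key]} failed")
```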
Yeah, agreed, Joe. Is this part of why Crew has been leaning into no-code? Is it this rapid proliferation that we're seeing, and how many use cases folks are building, particularly as folks are

(32:41):
building more coding agents too? It sounds like, from what I've heard you say, Crew is really thinking about this as: OK, we're going to be an orchestration control plane, versus always helping build the individual agents. Maybe we're going to make that really easy for you to do, with workflows that don't require you to do a ton of coding. But I'd

(33:02):
love to understand more about that strategy and how Crew is approaching this.
Yeah. I think some of our thinking is, if you really want to transform companies like that, you need to democratize access to this, right? It can't be something that just a subset of people is doing. You need to make sure that everyone in the company is empowered to do it. And right now, even if you give the no-code tools out there

(33:25):
to people, it's weird. You look at them using it. And I don't know if you've ever had this happen to you, but you probably have some elder in your family, and you know when they point at the remote for the TV and they're like, is it OK if I press this? They're not sure if they should, if something is about to break, and then they might misconfigure the

(33:47):
whole thing. That's kind of what I see sometimes with non-technical people trying to use these systems. And I think we can do so much better than that as an industry. So a lot of our focus now is: if you really think about it, you democratize the access for people to do this with agents. If you can build, to

(34:08):
Vikram's point, 50 amazing agents, and you can put them in a repository that this company can now reuse, there are so many different combinations that you can put together, so they can automate a lot of their processes, right? And I think then that gets these companies to the follow-up stage that we are still too far from. But I think it's going to come

(34:30):
somewhere in the future, and that is less about automating the processes and more about rethinking them entirely, realizing that some of the steps in there are not even necessary anymore, and what that means. So I think that's going to be exciting. But we're definitely thinking there on no-code. And again, Crew has the open-source pro-code side, where we have a dedicated team.

(34:53):
We look at that as a product and we keep investing in it. And that is very important for us. I think it's the main kind of asset, and a huge way for us to give back to the community. But at the end of the day, if you really want to transform the market, you need to democratize access even further.
So as we democratize access to agents and

(35:13):
unleash these hordes of interns and, you know, sometimes PhD-level folks on businesses, there's this massive need, as you've both talked about, to observe them, to understand their impact, to identify, you know, the pros and cons of what they're doing, and to improve and iterate on that. With that in mind, what's the philosophy

(35:37):
that you both think developers and other folks who are building these technical systems should take when it comes to testing and ensuring the quality, reliability, and value of their agents?
Yeah, I got to say, it's something that's funny, and Vikram will probably tell you the same: at day zero,

(35:58):
they might not care or might not spend too much time on it. They're just so excited, and then they deploy it, and that's OK. But then what happens is they have this thing ready in production, then someone deploys and breaks it. And it's not the new thing that breaks; the new thing works, and it breaks something that used to work before, and now they

(36:21):
have to roll back. And now they get into: oh, now we need to run, like, ten checks to make sure that this works before we deploy it, and then it becomes a manual process. And that's the point where they go: all right, we need some help, either tools or people, something that is going to help us with this. So I think that's kind of where things are trending now.

(36:42):
People are moving fast, very excited. And then they realize that as they start to iterate fast, things start to break, because they don't have measurements in place. They don't have this observability in place. They don't have all those things in place that help them, to your point, see this coming from a mile away, right?

(37:03):
And even preventing those from happening. And that gets in the way sometimes. But I think it's going to be a key aspect. I keep telling people this. Some customers, especially on their first call, might ask the famous question: why should I buy your enterprise solution?

(37:24):
I can use open source. I was like, well, imagine what running 1,000 agents on open source is going to look like, and how are you going to control that, right? So there, I get it. I don't think open source is competitive with enterprise. I think they're complementary. You still use open source, right? And you can use any open source; that's the beauty of it.

(37:45):
So that would be a little bit of my take. I think there's definitely a component here of: if people don't start thinking about these things from day zero, it's going to come back to haunt them pretty fast, and they're going to need to go around and figure it out.
Yeah, I 100% agree. It gets to the day-zero first-principles thinking; from a software engineering perspective, it's still the same concept.

(38:06):
So you have to go back to thinking about what the use case is, and therefore what different tests I need to build out, and then do some kind of scale testing pre-production, and then also, obviously, in production. The difference, though, that I feel is interesting is that all of this is still data science, right? We've just lost the science in data science, because now

(38:28):
we've gone so far away from this being called machine learning, to now AI, to now agents. We've forgotten that this used to all be machine learning literally a year and a half ago. And so the data scientists in the room look at all of this and they're like, of course you need to look at the data; of course there are rapid feedback loops. You talk to the software engineers, and they're like, there's no way in hell they're going to look at the data. Once you productionize this, that's done.

(38:51):
My CI/CD process should take care of that. So it's a little bit of that education and coaching. And that's why the term "evals" is becoming really popular among software developers. But a data scientist is going to be like, welcome to the club, man. This is something we've been doing for years. And now everyone's all about evals, but it's the same thing. And so I feel like we're just layering on the age-old concepts of data science, but educating

(39:12):
developers globally about how to do that in a faster way as they're building these systems. Because the ability to build these systems has become much, much faster because of tools like CrewAI. Just take the feedback-loop piece, Conor, of going from identifying failure modes when you're trying to do even some kind of minimal scale testing. Because as soon as you put

(39:35):
something out there, the age-old thing in data science is: as soon as you put something out there in production, out-of-distribution starts on day zero. New kinds of queries start coming in, new failure modes are detected, and then you have to fix them. So I think where this industry has to move very quickly is: how do you start to identify those failure modes quickly, and then auto-correct the prompt or the

(39:55):
tool call and all this other stuff, which is honestly all software engineering? How do you auto-improve that? And that's the piece which is different from what we had before, where in the circa-1.0 machine learning world, you had to go and contact, name your labeling provider, get the data there, and retrain your model with that. It used to take three months. Here,

(40:16):
I think it's possible for us to just quickly have these feedback loops come in, and I think that's going to start happening sooner. That's going to be the big unlock for people to feel like, I can actually rely on this thing, because my feedback-loop automation system is just working fine. And that's where Galileo's goal is as well going forward. It's not about magic metrics; it's more about automatic

(40:36):
feedback loops.
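Here is a minimal sketch of the automated feedback loop Vikram describes: run the agent, evaluate the output, fold the evaluator's feedback back into the prompt, and retry. Both the agent call and the evaluator are stubs; in a real system the evaluator would be an LLM judge or a custom metric, and the loop would run against production traffic.

```python
# A bounded auto-correction loop: evaluate, patch the prompt with
# the evaluator's feedback, and retry. All functions here are stubs.
def run_agent(prompt: str) -> str:
    return "Paris is the capital of France."  # stand-in for a real agent call

def evaluate(output: str) -> tuple[bool, str]:
    """Return (passed, feedback). Swap in a real evaluator in practice."""
    if "France" in output:
        return True, ""
    return False, "The answer must mention the country explicitly."

prompt = "What is the capital of France?"
for attempt in range(3):  # bounded retries keep the loop from spinning
    output = run_agent(prompt)
    passed, feedback = evaluate(output)
    if passed:
        break
    # Fold the evaluator's feedback into the prompt and try again.
    prompt += f"\nRevision guidance: {feedback}"
print(output)
```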
So as everyone in this market continues to build more agents and, you know, works with Crew to build out new agents with low code, or expands their CI/CD processes with automated evals, there is just a massive amount of work that needs to get done,

(40:59):
and a lot of that we expect to be done by these agents. But it also means we are on the cusp of a major transformation. In fact, we're already in that transformation. Joe, you brought this up earlier, but even established enterprise companies are kind of freaking out. They're trying to figure out their play in the agent space, and the adoption curve of this

(41:19):
technology is so fast compared to some other previous waves. What does this mean for innovation in, I don't even want to say beyond the next year, but let's say through the next year? What do things look like? I think things are adapting so fast. And in particular, I'd be really curious, Joe, if you wanted to share a bit with us about kind

(41:42):
of Crew's approach and strategy for this next stage of the company and this next stage of the agentic revolution. And then, Vikram, I'd be curious on kind of where you see Galileo fitting in there as well.
I think you bring up a good point. I think it's very clear that this is going to be a massive industry. I mean, the potential is there. It's not just me and Jensen saying it

(42:05):
is. Again, it's all the signals that are out there, right? There are people getting out there. You go onto anyone's website, there are agents all over, and everyone's trying to capture some of that attention. That contributes to some of the agent-washing that is happening, where people are just claiming capabilities that they don't have. And that makes life harder for everyone that is not educated yet,

(42:26):
so they need to grasp some of that information. But I mean, this is changing the industry. One example that we are feeling: because of the open source, we get a lot of traction on our website, right? And we've now gone from, like, nothing to #14 and growing.

(42:47):
ChatGPT is the 14th biggest source of traffic for us, and growing. People are not even searching on Google; they're chatting, getting a link from CrewAI, and jumping into a website. So, like, what does that mean, right? And then again, as these models get better and better, if you really think about it: hey, you have an agent that is doing the planning for a feature, and then you have

(43:08):
agents that are coding it. Well, what happens to Jira in that world? Why do you need cards, right? Again, I'm extrapolating here to extremes, but you can see how some of these automations and these processes have a very heavy impact on what business used to be and how business has been run.

(43:29):
So I think, from our point of view, it's a race to the bottom in terms of model prices, and model quality keeps getting better and better. It feels like whenever we're hitting a ceiling, someone comes up with a breakthrough in one way or another. So right now, Crew clearly

(43:50):
is building for the future. We're building features that people might look at and say, oh, this is not going to work. Well, it might not have worked today, but it's going to work months from now. So we're very much building for the future a little bit, on the cutting edge of how those things are going to go. And I think that's a little bit of the overall vision that

(44:12):
we're seeing in the market, and the potential is right there. I think this is a unique time window for companies that are operating in the industry, given how hot the market is. And what that means for Crew is a land grab: literally closing as many customers as we can and helping them get to

(44:32):
success. So a lot of what we're doing is helping these customers get these use cases out there, see the value, and step up; and education plays a real role in that. That's why we do so much content and courses and things like that. So I would say that's part of the strategy. And some of our culture is just pulling no punches and

(44:54):
making sure that we are putting ourselves out there and shipping as fast as we can.
Love it. Vikram, I see you nodding along to a lot of this. I know that resonates with you.
It does. It definitely does. I agree that we are in a very unique time and place right now. And I think, you know, CrewAI

(45:15):
is doing a very specific thing which is extremely crucial for the foundation of agent development, right? And Galileo's goal for the last four years has been to solve the measurement problem in AI. The company started, specifically on structured-data AI, like 20 months before the GPT moment even happened. So we've been dreaming about this moment, where language models can actually be

(45:35):
used by developers, because it unlocks so much. However, the push and pull, I think, tends to be: how much should you build for what customers or developers here and now are asking for, versus also looking at where the puck is going and starting to build for that? Our thesis has always been that it has to be like a 60/40. And so, you know, from a reliability standpoint,

(45:57):
we see this world going in this direction where it's moving from magic metrics towards insights. It's moving from, you know, building out test-driven development towards actual automatic feedback loops from production. And, you know, having scale is going to be really important, and people aren't talking about that

(46:18):
enough. I think on the reliability side, it's very much about offline evals right now, which is, you know: here's a playground, here are some metrics, try five different models. That will work, but very quickly you're going to have, to Joe's point before, a hundred different agents talking to each other. That's one kind of load testing and scale testing that's going to start to happen.

(46:38):
You're going to have millions of different tasks that need to be completed all the time from all of your users. How do you scale reliability, and how do you scale these tests to that extent? It becomes extremely important. And how do you control that kind of an ecosystem and sleep at night, knowing that your brand is going to be safe, knowing that your users are going to be safe? Those are the things that we talk about

(47:01):
all the time at Galileo. And that's why we launched our Luna models, a family of models out of a lot of research, which are specifically focused on agentic evaluations, as well as our Insights Engine instead of a metrics engine. So we can automatically give developers the failure modes before they even know there's a problem, as well as automatically tell them what to do next, right?

(47:21):
And the next step from here is to do it for them. So that's the phased approach that we see. And I feel like it's not even an if question; it's a when question. And that's why we invest so much in AI research and infrastructure, and not just a software stack for reliability. But that's the future that we see coming through for developers. And I feel like we all have our part to play in pushing the

(47:42):
envelope so that this reality of multi-agentic systems can happen much, much faster than it otherwise could have.
Yeah, I love that. Well said by you both. It's such an exciting time and such an exciting space to be in. And I'll say, personally, I am incredibly stoked to have Crew as someone that we're doing a lot of work

(48:02):
with, whether it's our Agent Next event series, which, Joe, we're so excited to have you be a part of. When this episode comes out, that'll actually be tonight. And then we'll have more to come after that, whether it's, you know, our planned webinars, the integration work we're doing, the content, or having you on this podcast. It's just such an honor to have you involved with the work that

(48:22):
we're doing here at Galileo. And we are, I think, just as excited as you about the potential of this industry and, you know, the work we can do together in the coming months. Joe, thank you so much for joining Vikram and me to share your perspective on Chain of Thought today. You called it at the top of this episode: we really need at least another hour or two for this conversation. So we can't appreciate you enough for joining us.
Yeah, I love that. Thank you so much for having me; this was very

(48:44):
exciting. And yeah, I mean, I'm excited for what we'll be cooking together.
Yeah. And definitely go check out CrewAI if you haven't, listeners. crewai.com is the place for everything. You can find so much there, and Joe shares so much interesting stuff on his Twitter account, on his LinkedIn, and all over the Internet. They are doing some incredible stuff, really interesting use

(49:06):
cases. You can find fascinating blogs and technical depth. Definitely keep an eye on them. And if you're looking to build agents, check out CrewAI. They are fantastic. To everyone listening at home, be sure to subscribe wherever you get your podcasts, and check out the Galileo YouTube channel for more content, like maybe a webinar with CrewAI coming soon, events, deep dives, and of

(49:26):
course, every episode of this podcast, where you can watch us interact with all of our incredible guests like Joe. Joe, again, thank you so much. It's been a pleasure having you on.
Thank you. Have a good one.
Thanks, Joe. Thank you. See ya.