
July 16, 2025 44 mins

The age of ubiquitous AI agents is here, bringing immense potential - and unprecedented risk.

Hosts Conor Bronsdon and Vikram Chatterji open the episode by discussing the urgent need for building trust and reliability into next-generation AI agents. Vikram unveils Galileo's free AI reliability platform for agents, featuring Luna 2 SLMs for real-time guardrails and its Insights Engine for automatic failure mode analysis. This platform enables cost-effective, low-latency production evaluations, significantly transforming debugging. Achieving trustworthy AI agents demands rigorous testing, continuous feedback, and robust guardrailing—complex challenges requiring powerful solutions from partners like Elastic.

Conor welcomes Philipp Krenn, Director of Developer Relations at Elastic, to discuss their collaboration in ensuring AI agent reliability, including how Elastic leverages Galileo's platform for evaluation. Philipp details Elastic's evolution from a search powerhouse to a key AI enabler, transforming data access with Retrieval-Augmented Generation (RAG) and new interaction modes. He discusses Elastic's investment in SLMs for efficient re-ranking and embeddings, emphasizing robust evaluation and observability for production. This collaborative effort aims to equip developers to build reliable, high-performing AI systems for every enterprise.


Chapters:

00:00 Introduction 

01:09 Galileo's AI Reliability Platform

01:43 Challenges in AI Agent Reliability

06:17 Insights Engine and Its Importance

11:00 Luna 2: Small Language Models

14:42 Custom Metrics and Agent Leaderboard

19:16 Galileo's Integrations and Partnerships

21:04 Philipp Krenn from Elastic

24:47 Optimizing LLM Responses 

25:41 Galileo and Elastic: A Powerful Partnership

28:20 Challenges in AI Production and Trust

30:02 Guardrails and Reliability in AI Systems

32:17 The Future of AI in Customer Interaction


Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash


Follow Today's Guest(s)

Connect with Philipp on LinkedIn

Learn more about Elastic


Check out Galileo

Try Galileo

Agent Leaderboard


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:04):
Welcome back to Chain of Thought, everyone. I am your host, Conor Bronsdon, and today we're tackling one of the most urgent challenges in our industry: how we build trust and reliability into the next generation of AI agents. This is a topic you've probably heard me talk about, and you've probably heard Galileo CEO Vikram Chatterji, who's joining us in a second, talk about.

(00:24):
And it's definitely something we're going to cover in our conversation and interview that will air in the back half of this episode with Philipp Krenn, Director of Developer Relations at Elastic, as this topic is very close to all of our hearts. But first, to set the stage: Vikram, it's great to see you. Welcome back to Chain of Thought.
Thank you so much, Conor. It's great to be back here again.

(00:44):
And I mean, our listeners here aren't able to access this part, but I have to say I've been really enjoying the internal chains of thought that you've been putting out for Galileo team members. Maybe we'll leak a couple of minutes of those at some point, but it's been a lot of fun hearing your insights and your perspective. And I think that's what I value so much about these

(01:05):
conversations: obviously, the team at Galileo has been hard at work. We've just launched our free AI reliability platform for improving agents. Go check it out at Galileo dot AI; there's so much more information there. We think there are incredible tools that every developer and AI builder is going to love, but we want to highlight a few of the biggest announcements and talk about why they matter, because we're using these terms, reliability and trust, to talk

(01:29):
about why our suite of Luna 2 small language models matters for real-time guardrails and scalable evaluations, and about our new Insights Engine, and what exactly this all means for engineering teams everywhere. So at the core of that conversation is agent reliability. Vikram, why is Galileo, and why are you, so laser-focused on

(01:51):
agent reliability? Overall, we're focused on the concept of reliability, right? Because that's the number one problem that's plaguing most enterprise AI teams today. It's easy to get from zero to POC; going from POC to production has always been an age-old problem in data science, and it's no different now, especially with developers and data science teams kind of

(02:14):
merging together and becoming this new concept of an AI engineer. The need to understand how reliability can actually happen, so that they can sleep at night knowing things are going to be safe for these non-deterministic systems, is extremely important. The way we look at this, when it comes to agent reliability, it's two things. One is that the reliability problem

(02:35):
just becomes 100x more crucial because agents are more complex as a system. We've talked to developers in these organizations that have gone through, you know, prompt engineering two years ago, and that being the be-all and end-all of application development, to, you know, RAG-based architectures, and then

(02:55):
more complex RAG-based architectures with multi-turn chat systems, to now, you know, using these models as essentially smart routers to plan tasks and call different kinds of models and perform actions. That has done something really interesting in the industry: you move from chat completion to task completion.

(03:16):
And essentially because of that, the ideas that organizations have around what the use cases can be have just exploded. They've gone out of the realm of just a chat interface now to literally completing any task. So there are billions of dollars of OpEx at stake here to reduce within organizations, right? And then, of course, there's the whole other side of revenue

(03:37):
increase. So given the criticality of these agentic systems when it comes to the future of running businesses, as well as the importance of the reliability problem overall, when you put these things together, agent reliability, we think, is going to be an extremely crucial part of the future of AI.

(03:57):
That's why we've been focused on it for the last year and a half, frankly, since some of the earliest developers who were working on agentic systems when they weren't even called that. They were just shocked, I don't think I can name them here, but they were just shocked by the idea that, hey, we just used this model to call the right kind of model; we used this model as a router. There was no concept of an MCP-style protocol for calling tools in the right

(04:19):
way. There was no concept of tools that we're all galvanized around. But as that started to mature more and more, we've gotten to this point where we feel like 2025 is the year where the protocols and everything else are going to come into place. By the end of the year and early next, the cat's out of the bag and everyone's going to be building out agents everywhere. So it's an extremely, extremely critical problem to solve.
Completely agreed. And I think we see this need for

(04:41):
reliability in AI everywhere, both from the standpoint of building trust with customers and users and also in major examples. I mean, we just saw last week Grok 4 ran wild right after new alignment changes were made, and they had to shut the whole thing down and kind of reassess. And I mean, that's exactly the kind of problem that we're looking to solve.

(05:02):
And that paradigm needs to change for AI builders.
That's right. And if you talk to anybody who's been building out applications where there are massive amounts of tool calling and different kinds of paths that the application can take, which tends to happen as soon as you're creating any kind of multi-turn task, what happens there, when you

(05:25):
talk to the developers, is they'll tell you very quickly that debugging, understanding where things went wrong, and understanding what the failure modes are is extremely hard. We've all faced this the first time we're building out applications, but when you're doing this in production and the stakes are pretty high, you can't make mistakes. So yes, what happened to Grok is really an interesting tidbit. And then, you know, when

(05:47):
you look at this at scale across these organizations, where they have to adopt AI and they have to adopt agentic systems and they're going all in, they can't do that unless they have reliability protocols in place.
Completely agreed. And Galileo's previous work developing a data and measurement layer for enterprise-grade evaluations at companies like Comcast and Reddit and

(06:09):
many others has really enabled a lot of our new announcements here, such as the Insights Engine, which we were super excited to unveil to the public yesterday. Can you tell us a bit about that Insights layer?
Yeah. So at the core of what we do at Galileo, we've always been focused on the measurement problem of AI. We've not built the world's 101st orchestration system and, you know, all of that stuff;

(06:30):
the focus for us has always been on solving the measurement problem. At the core of that lies the developer experience, where they don't necessarily want to get to a UI where they're flowing through hundreds of metrics to make sense of it all, right? So when you talk to developers, they want to shift left as much as possible.

(06:52):
They want to move towards not having to go through each and every single node in order to understand what's going wrong. Instead, they want to figure out what the failure modes are early on so that they can try to fix them. Further shifting left from there is taking all of these different insights and then actually putting them inside of the code itself and inside of the browser, and I feel like we're all going to start moving in that direction.

(07:12):
And that's what RCA for this kind of agent application development is going to look like. So this is one big step in that direction, where we figured out that, hey, if you want to eventually build out self-learning or self-healing agents, you need to have some kind of engine that's constantly observing all the logs and constantly understanding all the different metrics, custom or

(07:35):
otherwise, that have been built out, and really figuring out from there, using a reasoning engine and reasoning model, what the potential failure modes are, and giving that back to the developer. So that's essentially what we've done. We've taken all the logs and all the data that we sit on, and we've built out this reasoning-engine-based Insights Engine, which helps with automatically

(07:57):
surfacing these kinds of failure modes. And what we've seen in the early beta is that it really quickly becomes very hard to live without, because now you don't have to go into the weeds of every single node. Instead, you're just getting a bunch of insights out of the box. And the game of whack-a-mole becomes much easier, because now you know exactly what's going wrong, we can start to fix it

(08:19):
and see whether that actually worked or not. The main components of the Insights Engine are: number one, automatically providing failure mode analysis and giving that back to the developer; number two, letting them know how often that occurred and whether it's a one-off or a frequently occurring problem; and number three, not just telling them here is a problem, but actually telling them what the solution is and what they need to specifically fix as a next step.

(08:42):
So it's extremely action-oriented.
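To make those three components concrete, here is a minimal sketch of what one surfaced insight could look like as a data structure. The field names and the recurrence threshold are illustrative assumptions, not Galileo's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FailureModeInsight:
    """Illustrative shape of one automatically surfaced insight (hypothetical, not Galileo's schema)."""
    failure_mode: str          # 1) what went wrong, e.g. "tool 'lookup_balance' called with malformed arguments"
    occurrences: int           # 2) how often this was observed in the analyzed traces...
    total_traces: int          #    ...so the developer can judge one-off vs. recurring
    suggested_fix: str         # 3) the recommended next step, not just the problem statement
    example_trace_ids: List[str] = field(default_factory=list)  # pointers back into the logs

    @property
    def is_recurring(self) -> bool:
        # Simple illustrative threshold: more than 1% of traces affected counts as recurring.
        return self.total_traces > 0 and self.occurrences / self.total_traces > 0.01
```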
I completely think that is the right direction.
As we've thought about this, it's really clear that, yes,
like we have added new agent metrics and we can talk about
that. Those are, I think, really
exciting. They're helpful.
Yes, we've added new observability views where you
can interface with what's happening with your agent and

(09:03):
your AI systems in different ways.
Those are important too. But it's this automation to the
insights layer, the insights engine that we've built that now
can take those inputs and just automatically identify failure
points, help you with that root cause analysis, let you debug
faster, and just give you improvement opportunities that
you can then apply and then watch for these failures in

(09:24):
production with guardrails through our Luna 2 SLMs.
I think this is where we see this holistic reliability story
coming together. And to me, it seems super clear
that the Insights engine is what's going to power our next
level of like, OK, now we have evaluation agents that are
working for you full time and we're moving away from the
static evaluation layer. Yep, that's exactly right.

(09:46):
It's a natural evolution, I think, in the world of evaluations. And this is something which has been very core to us, because our DNA has always been very much about the applied data science side of things, the infrastructure side of things. And this fits exactly in that bucket, right? It's not just about the software layer and having a UI where you can see graphs and charts.

(10:08):
That's really not what developers are looking for. They want the algorithms, they want the infrastructure aspect of this abstracted away. Like, providing these insights with millisecond latencies inside of our UI or through our API takes a lot of work from an infrastructure perspective. So that's exactly what we've been working on, and we're super excited to launch this, and this

(10:29):
is the first step. It's a step in the direction of making sure that all these agents can be self-healing and self-improving over time. 100%. We want developers to be able to
spend more time improving the product and less time digging
through logs. Yeah, exactly.
And they don't want to do that. They don't want to dig through the logs all the time. So this engine is great: it provides real-time analytics, helps you identify root causes,

(10:53):
evaluations at scale. But it requires some serious horsepower to do this as you try to bring it to enterprise systems. We've built Luna 2, a family of small language models, to help power that, to help power real-time guardrails. Yep. What makes these small language models, these SLMs, the right choice for this job versus using an LLM like GPT-4 for every

(11:15):
evaluation? So just to tie the two things together, Insights and Luna sit at different parts of the SDLC, right? There's the building side of things and the observability side of things, which is where the main pain point that developers have is: what's going wrong? What should I do next? And when I tried to fix it, did it fix the problem? That's where they sit, and that's where the Insights

(11:36):
Engine is trying to set them up for success, by making that entire process 10x faster. Now, once you actually do identify, you know, some kind of failure mode, say the Insights Engine says that there are three different tools that are kind of doing the same thing, and especially this one tool over here seems to be messing up in this specific way. That's an indicator to

(11:59):
the developer that they might want to quickly create a bespoke metric to detect that, to see if it's happening over and over again. Let's say they've done that.
Now what happens when you're actually going to production at that scale? You want to be able to see if that's happening at scale. And if that exact tool malfunctions in that exact way, which the Insights Engine has already told you it did, you want to take an action on

(12:19):
that. And that's where Galileo is now in the mix of the user experience, because before the task gets completed, if the tool malfunctions, you want to be able to block that action completely or take a different kind of action. Now, in order to do this, the metric computation and inference has to happen with millisecond latencies and at a

(12:41):
fraction, a massive fraction, of the cost of a large language model. And then we surveyed the
ecosystem about a year and a half ago, when we embarked on our Luna journey, and we realized that what people were doing in production was just using an OpenAI model or another very large model to do this. And the issue simply was: this is not sustainable. It's going to cost you millions

(13:02):
of dollars at scale, as long as your product is at that scale, and it's going to be super, super slow. And that's where our research came in, around using some of the smaller language models. But also, Conor, we realized that just using, say, a Llama model or something like that (we experimented with a bunch of them) wasn't going to be good enough, right?

(13:23):
Because they're built for reasoning, it's kind of like using a really large tank where you could just use a small pistol. And so instead of just using these models out of the box, we had to dramatically rework some of these open source models to make them bespoke, single-token-generating, evaluation-focused

(13:43):
versions of themselves, which can be easily fine-tuned with data and which are built for evaluations, such that they can be extremely adaptive to any kind of use case. And that's the Luna 2 set of models that we've now come out to market with, which is much more generalizable than Luna 1, which is based on BERT models at its very

(14:05):
core. And I think an important point
of this, and what you're alluding to when you talk about stripping this model down and focusing it, is how much cheaper it becomes. This is something we've heard from major customers who are looking to scale to millions of traces a day. That gets really expensive if you're using a GPT model for all your evals, whereas Luna can not only enable the custom metrics you need and the

(14:27):
lower latencies that you need, but also do it at much reduced cost, so you can actually have evaluations and guardrails in production. And it's also enabling us to create new sophisticated agent metrics like flow adherence, action advancement, and action completion.
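As a rough sketch of the guardrail pattern described above, here is what gating a tool call on a fast, low-latency metric could look like in application code. The function names, scoring heuristic, and threshold are hypothetical stand-ins, not Galileo's SDK; in practice the evaluator would be an inference call to a small model such as Luna 2.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class GuardrailVerdict:
    score: float       # 0.0 (bad) to 1.0 (good), e.g. how well-formed and on-task the tool call looks
    latency_ms: float  # the check should only add milliseconds, which is why a small model is used

def evaluate_tool_call(tool_name: str, arguments: Dict[str, Any]) -> GuardrailVerdict:
    # Stand-in for a small evaluator model; a real system would call an SLM here, not a heuristic.
    looks_valid = bool(arguments) and all(v is not None for v in arguments.values())
    return GuardrailVerdict(score=1.0 if looks_valid else 0.2, latency_ms=8.0)

def guarded_tool_call(tool_name: str,
                      arguments: Dict[str, Any],
                      execute: Callable[[Dict[str, Any]], Any],
                      threshold: float = 0.7) -> Any:
    """Run the guardrail metric before the action, and block instead of executing blindly."""
    verdict = evaluate_tool_call(tool_name, arguments)
    if verdict.score < threshold:
        # Block the action before the task completes; a real agent might instead retry,
        # fall back to a different tool, or escalate to a human.
        raise RuntimeError(f"Guardrail blocked '{tool_name}' (score={verdict.score:.2f})")
    return execute(arguments)
```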
That's right.
Why do you think both out-of-the-box and custom metrics are so critical for ensuring agent reliability as we go into these larger multi-

(14:53):
agent systems? The way we think of the out-of-the-box metrics is: imagine you're driving a car and the car stops functioning at some point. There are certain attributes of the car that are just common across most cars. They'll always have an engine, they'll always have windshields, et cetera. So at the very minimum, at any given point in time, you need to

(15:13):
know the health of those specific pieces, and if it starts going south, you want to be able to know what's gone wrong and why. You would rather not have to build this yourself. So the reason we did this is because, out of the box, you need to know what's going wrong. As a developer, you need to know what's going wrong in an unsupervised way, because you have no ground truth at that point, right? So that zero-to-one problem is really, really key, and that's why we built these.

(15:34):
The second reason we did this is because, going back to the car analogy, not everyone's a mechanic. Not everyone can go in and actually build the world's best, you know, engine quality metric for the car and then start detecting issues and build the right sensors and all that stuff. So we realized that when it comes to RAG systems and even agentic systems,

(15:56):
there are certain things that you would want as a developer which can be really, really cool to just have out of the box in an unsupervised way. And the reason we also did this, and own it ourselves versus just, you know, farming it off to a third-party metric system, is because we've realized that out of the box, the quality and accuracy of these metrics have to be really high. And so, as our listeners

(16:17):
probably know at this point, we have a very large applied data science team, and we always have since day one, and they've been focused on the problem of measurement. Their whole role is focused on the idea of: how do you make these out-of-the-box metrics even more high-performing? Should we use different kinds of models? Should we use different kinds of guardrails? We've published research papers on this, and that's the reason we keep pushing on out-of-the-box

(16:40):
metrics. But at the same time, you don't just have the engine problems in the car; you also might have some use-case-specific problems. You're driving at night, driving in the wintertime, and you might want some extra precautionary measures which we have no idea about. It's very bespoke to your use case. And so that's where we've come up with this idea, this notion of making it very easy for any of our developer users or product manager users or

(17:01):
subject matter expert users to build their own custom metrics for agents as well. And I think this idea of, look, everyone still has a car, we can apply certain of these metrics across domains, is something that's really showing well and is demonstrated effectively in our agent leaderboard.
And I'll say, for anyone who's listening to this episode on the day of release, there's a little sneak peek here.

(17:24):
Our new Agent Leaderboard v2 is actually live at Galileo dot AI slash agent leaderboard, and it includes average action completion as well as tool selection quality metrics across multiple domains. So you can actually slice and dice based off of banking, healthcare, insurance, investment, telecom. We're going to add more domains over time, and we're going to keep adding more of our out-of-the-box metrics.

(17:45):
But I'll give a sneak peek and say that a GPT model did top that chart for action completion, and Qwen actually made the top four. So super interesting data there. We're going to be releasing a lot more about that in the following days and weeks.
Go check that out and learn more about how we approach our agent metrics and how different LLMs actually

(18:06):
evaluate with those metrics and how they perform in real-world agentic scenarios. And we're going to keep pushing on this, because, as you point out, Vikram, there's stuff we can do out of the box. There are things we can evaluate to make sure we actually understand the effectiveness of these different models, these different systems, especially across domains, which I think is really crucial for companies listening that are maybe in

(18:27):
healthcare and they say, hey, I need something that actually works really well in this domain specifically. But if we just take a cookie-cutter approach, we're never truly going to solve the problems of enterprises that have unique customer sets, unique data sets. And that's where I think the customization element and the scaling element enabled by Luna 2 really come in for

(18:49):
us, and why it's such a crucial piece of how we're approaching agent and AI reliability more broadly.
Exactly. I'm excited about the results from the agent leaderboard. And it's also, frankly, been extremely crucial and important for a lot of our enterprise customers just to have a Switzerland that's looking at this and has high-quality data to look at. And all the data and the

(19:10):
measurement, everything else, is out there on GitHub for people to see. So nothing's behind closed doors for this. Yeah.
And for anyone who wants to start leveraging Galileo models or our platform for observability and evaluations, I'll say another key piece of the story is our integrations approach, where we want to be agnostic of whatever type of agent you may be building, whether it's CrewAI, whether you're

(19:32):
doing something with LangChain, whether you're working with LlamaIndex, or whether you're bringing in RAG components from Pinecone or Weaviate or MongoDB or Elastic, we can help. We work with all these companies. We're partners with folks like NVIDIA and many of the people I just named. And our goal is to be truly agnostic of whatever you want to bring to the table and help you

(19:54):
customize your system in whatever ways you need. Vikram, are there any other closing thoughts you have about our agent reliability announcements that you want to
share with the audience? I'll just say we're just getting started on this, and a lot of it is very exciting. It's a culmination of a lot of work that's been going on for the last many, many months. But there's a lot of other very exciting stuff in the hopper right now that

(20:14):
we're going to be coming out to market with. So I'm very excited for developers to just try Galileo out for free. If you're building out agents, no matter who you are or where you are, go to Galileo dot AI, sign up for free, and you can start using our platform. If you want help with anything, please reach out at any point in time; we'd love to hear from you. But there's a lot more coming that no one has even seen before, so I'm really excited about

(20:35):
that. We make some bold moves all the time, and so we're excited about what's coming up soon, and hopefully developers love it, and we'd love feedback. 100% agree. Definitely give us that feedback. We'd love to hear from you on social media, whether it's about this episode, about the features, or anything else. Vikram, thank you so much for setting the stage and sharing the vision behind the Agent Reliability Platform. Of course. Thank you. Thanks, Conor.

(20:55):
And when we come back, you'll hear my conversation with Philipp Krenn, head of Developer Advocacy at Elastic and a key partner of ours in that platform. Stay with us. For decades, the search box was our primary window to information. Now, AI agents are becoming our proactive partners in discovery and action. But for an agent to be trusted and accurate as a partner, it

(21:19):
must be reliable. The link between the information an agent retrieves and the actions it takes is where the promise of AI meets the risk of production. And that's why I'm excited to have Philipp Krenn, Director of Developer Relations at Elastic, here with me today. Philipp, thank you so much for joining me. Thanks for having me, Conor. Yeah, I'm excited to have a
conversation about how Elastic is enabling incredible

(21:42):
innovation in the AI space, how we're working together with Galileo, and so much more about building, observing, and absolutely nailing these agentic and other AI systems. Let's start with, you know, who is Elastic? What are y'all doing in the AI space? People may know you as a company that's been public since 2018. What are you up to now? So we've been doing search for a

(22:05):
long time. So if you search anywhere on the Internet, there's a good chance that you use Elasticsearch in the background without even knowing. My classic examples are: if you search on Wikipedia or Stack Overflow, behind the search box we're doing the search for you. If you do anything on GitHub, almost everything on GitHub is cached or powered by Elasticsearch in the background. So people use this very heavily and don't know it, and many,

(22:26):
many other places don't know either. I feel like all the excitement has shifted over to AI, where data is still important. You need to bring your private data, or you need to bring more up-to-date data. So we're very happy to also participate in that and bring you your data like we did before. I love it.
What's next for Elastic as you continue to grow in this AI

(22:50):
space? Right. So I feel like for a long time we've been kind of like the data in the background, but with AI the interaction mode almost changes. Like, one of the topics right now is MCP. Historically we've always had REST APIs, but now you don't want to work against one specific REST API of your data

(23:11):
store or whatever other systems you have. Instead, you let your LLMs just figure it out and say: this is the MCP connection that you need, get the right data, see what actions you can run. So just switching that interaction mode is a big thing for us right now, having a proper MCP server. And it's not just fetching data as the first step, but also interacting with it.

(23:33):
So you could just say, like, build me a Kibana dashboard that looks a certain way, things like that; it's more descriptive. You don't need to know the tool inside out to write the queries yourself; you can just talk to it, or also automate away little stupid problems like generating test data. Now you can just tell

(23:53):
the LLM: oh, look at this mapping, generate me 100 documents with this type of data in there. I think there are a lot of little things that we can do today to make this actually much better overall, even though each one of those might not look like a huge improvement. And it's all of these combined that really move us forward, plus everything that we've been doing with RAG lately;

(24:14):
search is just a very different game than it used to be. Let's focus on that.
If I am a builder who is creating something with AI today, whether that's an agent, some sort of compound system, or a workflow, anything else, how would I leverage Elastic? So we are generally best used to keep your data for your LLM, like a classic RAG application where

(24:38):
you do the retrieval first, you get the right context, and then you generate the output with your LLM. We could also be used, for example, to cache results: if somebody has a very similar question, you might be able to skip regenerating with the LLM and just get a faster and cheaper answer by reusing the answer to a similar question from before. And then, of course, there is the

(24:59):
evaluation side. Since we are also doing OpenTelemetry, and the classic ELK stack has been doing logging, and it's doing a lot more than classic logging nowadays, with OpenTelemetry we can collect all of the data. So we go from keeping your data and driving your answers to then potentially seeing how they are performing in terms of quality, but also

(25:20):
performance and cost, all of those. And developers, if you are using the OpenTelemetry standard to work with Elastic, guess what? You can also use it to work with Galileo and bring a lot of that observability data in and begin to evaluate it. Which is why I'm super excited that we are partnered and integrated with Elastic. It's been so fun getting to know your team and working with you over the last couple of months.
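Since both sides of that integration hang off the OpenTelemetry standard, here is a minimal sketch of what that instrumentation looks like in practice: a RAG-style answer function wrapped in spans and exported over OTLP. The collector endpoint, cluster URL, index name, and attribute names are placeholders, the retrieval query is simplified, and the LLM call is stubbed; this is not the documented setup for either product.

```python
# pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http elasticsearch
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from elasticsearch import Elasticsearch

# One OTLP pipeline; the same trace data can be shipped to any OTLP-compatible backend.
provider = TracerProvider(resource=Resource.create({"service.name": "rag-demo"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))  # placeholder collector
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-demo")

es = Elasticsearch("http://localhost:9200")  # placeholder cluster

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.answer") as span:
        span.set_attribute("rag.question", question)
        # 1. Retrieval: get the right context out of Elasticsearch first.
        with tracer.start_as_current_span("rag.retrieve"):
            hits = es.search(index="docs", query={"match": {"content": question}}, size=3)
            context = [h["_source"]["content"] for h in hits["hits"]["hits"]]
        span.set_attribute("rag.context_docs", len(context))
        # 2. Generation: stubbed here; in a real app this is the LLM call.
        with tracer.start_as_current_span("rag.generate"):
            return f"(answer to '{question}' grounded in {len(context)} documents)"
```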

(25:42):
How do Galileo and Elastic work together?
Yeah, it's great to see that you're bringing that evaluation side, and we're happy to be the data store to keep all of that, pulling out the information the right way and then using OTLP, the wire protocol for OpenTelemetry, to integrate all of that. That's great. And that's kind of, I feel

(26:04):
like maybe it is a bad comparison, but MCP is kind of like the protocol for AI, and OpenTelemetry is the protocol for all telemetry data, more or less. So in a similar way, we standardize on these protocols, and then it just allows us and everybody else to partner much better together. And how do metrics for agents
with Galileo work with agents that are leveraging Elastic as

(26:29):
their, you know, RAG store and search option? Galileo helps developers be the guardrails for your data, for your LLM-generated data. So we're happy to store the results of that and then be part of the evaluation of how things have been going, but we're not doing the evaluation itself. We're happy to leave that

(26:49):
up to Galileo to figure out.
What are you excited about in this new world of AI agents and the onrush of all these AI applications? Yeah, I always need to preface this: I'm European, so we are maybe slightly less excited, but I think... You're laughing. You're excited. Yeah, we are

(27:09):
excited. No, no, we are excited. AI is exciting. I think it's the classic problem of being over-focused on today, and it's a classic story. My example is always self-driving cars: they were a big thing, everybody had high hopes a couple of years ago, and it didn't quite happen back then. But nowadays they're on the street; I'm riding in a Waymo pretty frequently.

(27:31):
It's just doing its thing. I think AI is a similar idea: we have a lot of work to do while it's still early days, but there will be a lot coming out of it, even though it might take a bit longer until we get to that production, this-is-what-happens-in-reality phase. We're trying to do a lot of things; maybe it takes a little longer to actually get to the end goal, but we're on the journey.

(27:53):
So it's definitely exciting to be part of that. But I'm also sometimes cautioning people that it might just take a little longer to get to that final result, or even if you don't see it today and it's more like, oh, this is an interesting thing, it might not happen in reality yet, there will be some major changes coming.
Yeah. What about Galileo and Elastic

(28:13):
together? How can we continue to build out what we're already doing? What can we grow into, do you think? Yeah, so I think it's a longer journey of LLMs becoming a more integral part of the workflow and everything. I think the evaluation side is about making sure you don't have too many hallucinations, or being able to figure out how good the quality is.

(28:34):
It will also have other aspects, like performance. Initially we were very forgiving to LLMs, like, oh, it takes five seconds to generate an answer. Maybe that will change; people will require faster answers. Cost is always a question, though. I think the cost of LLMs is actually decreasing quite rapidly, so the initial

(28:55):
fear that it would be hugely expensive has become a bit less of a fear, but cost management is still a thing. Totally. For performance, my favorite story is that I remember, a couple of years ago, when people were using Elasticsearch and running a query, they would say, oh, it takes 200 milliseconds to get the response back, that's unacceptably slow. And now it's oftentimes, oh, the LLM takes five seconds to

(29:17):
give me an answer, that's perfectly fine. So I think we're still in this early excitement; we're a bit more forgiving, but that will also change. Having fast answers and faster response times, having it more real-time, I think that will become more and more important. And then the evaluations are great for that, yeah.

(29:37):
I think for me, we're seeing a lot of the same stuff here,
where the challenge now is more about how do we create enough
trust for enterprises to go to production with their AI.
Right, because at first you have the novelty and everything is new and exciting, but as it becomes the standard in the workplace, or in your private life, the standards

(29:58):
will rise, or you need to be able to rely on it, otherwise people will not accept it. Yeah, and that's why we're talking about this phrase, reliability, or, you know, predictability, because obviously the non-determinism is the magic here. But we need to have an understanding of the bounds of when certain things will happen, particularly with agents and multi-agent systems.

(30:19):
There's so much opportunity to solve knowledge work problems, to solve physical problems when linked with robots, but if it goes completely off task, there are risks to that. And it's hard for major customers like financial service providers, or other very consequential use cases, to go to production with AI when we're not able to provide them

(30:41):
guardrails, when we're not able to help them have a more reliable system. That's a big part of why I know we're focused on reliability, and why Elastic is focused on data reliability as well, on the back end for all of us.
And I feel like we have almost gone through different cycles with that already. I feel like initially, when LLMs came out, many companies put them on their website as, oh, you have this chat interaction, and then everybody

(31:03):
could talk the LLM into saying something stupid. Like, oh yeah, you can cancel your flight for free. I think there was even a court case in Canada where the airline lost, because it was their agent, so the result kind of had to stick. And I feel like then, in the next phase, a lot of companies took the LLM back and didn't let

(31:24):
it have these interactions, because it could just be talked into doing the wrong stuff too easily. So I think today a lot of the interactions are often more inside companies, because there you don't have malicious actors; you have your employees, who hopefully try to just do their job and not talk the LLM into doing something weird. But I think if we have the right guardrails in place, the LLMs will take a more front-and-

(31:48):
center, customer-facing seat again, if they can be trusted. So let's, I mean, let's assume that Galileo and Elastic together will be able to help these LLMs be trusted and guardrailed and set up right.
Oh yeah, we'll get back to that. But let's assume we do. What does that picture of the future look like? What does it look like when we have reliable, production-grade

(32:10):
AI agents and systems in place? How do you think our day-to-day will change? I hope I will never have to call any hotlines again. Oh, please. Or have to wait on some chat system or some agent to answer my questions. So I think the interaction mode will become nicer and easier for

(32:34):
everybody. And we don't have to have people on some phone hotline getting abused, taking one angry customer call after the other. So I think there is something for everybody to win, and I hope to free these people up to actually, I don't know, solve the problem in the background rather than just taking the call and then forwarding it.
Some people are very pessimistic and are like, oh,

(32:57):
people will lose their jobs and things will change. I feel like we've always had that since the industrial revolution. People have been saying, oh, automation is bad and everybody will lose their job. Doesn't seem to have happened yet; I don't think it will happen now. I just hope we can shift our work a bit. Initially you had to do a lot of manual labor, and that

(33:17):
shifted. I think now there are a lot of very repetitive tasks that can be automated away, but there is still a lot of other work to do in the background. So I'm not afraid of people losing all their jobs, but rather of moving us to the next way of interacting and how things are going. Speaking of background

(33:37):
work. A lot of that is happening through things we already talked a bit about, like guardrailing, evaluations, observability, so that you can actually ensure the reliability and trustworthiness of these systems that everyone's building. What would be your advice to developers who are building with agent frameworks, who are experimenting with AI, who are trying to go to production, about how to approach all the work that goes into making that

(34:00):
magic happen? Yeah, it's interesting, because I feel like it's very easy to get started and build, but really getting it to production is a hard problem. Yeah. And I think guardrails are a very important part. And you really need that evaluation side of saying, OK, it is actually doing what I wanted it to do.

(34:21):
It is also within the expectations of cost and performance, because nobody wants to put out a system that gives you a bad user experience. Totally. So I think going to production means you need to have the right tools. Having been in the observability space for a while, it's like, oftentimes it's an afterthought. It would be the same for AI, but

(34:42):
people will figure out sooner or later that it shouldn't be an afterthought, and then you need to actively work on that. Maybe AI even puts you in a better position there, because with hallucinations and wrong answers you have an even stronger business impact, or business drive, to give the right answers.
Are there particular areas about what's coming with AI, or what's

(35:04):
happening today, that you think people aren't paying enough attention to? I think evaluations have been one of these areas that have been underserved. So, sorry, last week there was AI dot engineer, a big conference around everything AI, and they had a track on evaluations, and it was one of the areas where people were excited and very interested.

(35:26):
I would say just because we've been building for a while, and the building initially was great, but now it's really about how things are improving or changing. It's the same for search: with vector search and with the LLMs generating the output, you can do a lot of things, but is it really improving the answers overall? And oftentimes people

(35:46):
don't have any system in place to figure that out. You just throw something out, you hope it's better, but you don't really know. And as we're progressing, I think that feedback loop will need to become better, because you have so many options. But having options doesn't force you to make the right choice, and finding or making better decisions is an important part

(36:07):
of that. I've been passionate about technology for a long time. I think it's now an exciting phase being at Elastic, or working on Elasticsearch. Search was oftentimes, not an afterthought, but not a fancy problem anymore, because search was never solved, but it wasn't as central as it is

(36:30):
nowadays. Search is suddenly interesting again and also has budget, which almost surprised us, or no, it doesn't surprise us, but we always thought it should have the budget that it suddenly now has. But people have this expectation that you can get much more out of search again and out of the data that you have. So that's a good change, and I think, just as AI gets into other areas, it has a

(36:53):
rippling effect into many other things, like observability. Just interacting with your data and figuring out what you have in all the observability data that you have collected gives you great new options. And I always hope that we can get out of some of the more boring work and figure out more of the exciting work.
It's like, I think there's always the saying in data

(37:14):
science, you spend 80% of your time cleaning up data.
I think LLMs have actually added some interesting tools to make that a lot better and faster, so we can focus more on the exciting stuff rather than the boring stuff. And I hope that the same will happen to observability data in general, or that it just gives you better tools to do things that were hard before, or not necessarily even hard but

(37:36):
boring. Well, you brought this up earlier, which is this idea of change and how constant it is, and the shift from so many folks working in farming and fields and hard physical labor to where we are today. And we're seeing this new paradigm shift occur, but there are always little paradigm changes. So I'm with you in that I

(37:59):
think this broad systemic change is going to continue to happen and we have to continue to adapt. And like, that's just a constant; we can't expect the job will stay the same. I'm also optimistic that many of the drudgery tasks are going to be automated away. I mean, you brought up one earlier, which is, I don't really write tests anymore. I might edit them, but I don't love writing tests. I'm also, let's be honest, I've

(38:22):
said this on the show before, not the best dev out there. There's a reason I talk about things for a living and I'm not always building things for a living and working in production environments all the time. So it's nice to not have to spend as much time on that.
And I think there's plenty of other examples of this that
folks are using. Oh yeah, I think there have been many boring tasks that were hard to get rid of in the past. But it's like, yeah, before we had the right machines, we had

(38:44):
to do the manual labor; there was no way around it. Now we hopefully can do more of the interesting stuff. I think this is a great example of the partnership that Elastic and Galileo have: what models does a company build? At Elastic, we're not building large language models. The models that we build are about inference and re-ranking.

(39:04):
And then we rely on our partners to provide either a large language model to generate the answer, or, in Galileo's case, the new Luna models on the evaluation side. So I think that is one of the interesting aspects of these partnerships: you can build so many things. And as we're building more, small language models are

(39:25):
actually an exciting space, because of cost and latency, where you want to build more of these. I think there is a big area that is a bit underserved right now, or that we're just exploring more: how can we go from these huge general-purpose models to something that is more specialized, easier and cheaper to run, and has a much lower

(39:47):
latency, to get you to the results that you want to have? Because, yeah, a while ago people were very forgiving on latency because of the novelty, but that might not be the case anymore. And of course, we want fast answers. So building more specialized models that give you what you want, this is a great space to explore.
I completely agree, and we're

(40:07):
really excited about launching our Luna 2 models to enable real-time guardrailing and evaluations for things like private information leakage, toxicity, and prompt injection attacks. This real-time guardrailing is so crucial to help create reliable, trustworthy AI systems, and it's really exciting to see Elastic also investing in small

(40:30):
language models. The clear opportunity for these to dovetail and work together in building reliable multi-agent AI systems is very exciting. Can you tell us more about Elastic's small language models and how developers can leverage them? Yeah.
So we have built something for Elastic re-ranking. You retrieve a larger

(40:50):
set of data, and then you have a model that is more expensive than the regular ranking, which you use to re-rank a subset. So we can run something a bit more expensive on a smaller subset of all the data. Let's assume you have a million documents; you retrieve in the first step the first thousand, and then you run the re-ranker, which is more expensive and a bit slower per document, on those top 1,000 documents.

(41:11):
That is a use case, because that was not really possible before. But now, with a fast and efficient small language model for re-ranking, we give you that option.
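As a rough sketch of that first-pass-plus-rerank pattern, here is what a two-stage query could look like with Elasticsearch's text_similarity_reranker retriever (available in recent 8.x releases). The cluster URL, index name, field, and inference endpoint ID are placeholders, and the exact parameters depend on your Elasticsearch and client versions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster

# First pass: a cheap query over the whole index pulls back a window of candidates.
# Second pass: a more expensive, per-document re-ranking model re-orders only that window.
# Assumes Elasticsearch 8.14+ and a configured rerank inference endpoint.
resp = es.search(
    index="docs",  # placeholder index
    retriever={
        "text_similarity_reranker": {
            "retriever": {
                "standard": {
                    "query": {"match": {"content": "how do I reset my password"}}
                }
            },
            "field": "content",
            "inference_id": "my-rerank-endpoint",  # placeholder inference endpoint ID
            "rank_window_size": 100,  # only the top 100 first-pass hits get re-ranked
        }
    },
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["content"][:80])
```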
The same thing goes for creating the embeddings, or doing the inference. Right now we heavily build on E5, which is a multilingual model, to do that in the dense vector space. We also have a custom model for

(41:32):
the sparse vector space, which is basically keyword expansion to find related keywords. So there's a lot of excitement about doing this, and you don't need a general-purpose large language model; it would be too expensive and too slow to do that anyway. So you need to find the right models. It's actually almost
surprising. I feel like we haven't found

(41:53):
that much in the more tuned or specific area for language models. For example, if you have observability data, I remember people were already talking a year ago about, oh, we'll have something that evaluates that observability data and gives you the right output for that. I think to a large part, companies are still using general-purpose large language models, because they can do that and that's what's available.

(42:16):
I guess we'll have to see when people build more specialized models, or maybe it's just too expensive to build more specialized models and we keep relying on the large ones. But that's an exciting space. For something like the evaluation side, your latency requirements and your cost requirements will be pretty constrained.
I guess that's why you need to build your own models to do

(42:39):
that. Totally. And we've worked with a lot of enterprise customers on this, and that's exactly what they come to us for. They say, hey, look, you know this LLM-as-a-judge concept? We're using it. It doesn't necessarily scale to millions of traces; it's really expensive, and the latency becomes a real issue if we want to go into production. Yeah. So I love that Elastic is also investing in small language models to help with cost-effectiveness and key tasks, I think, in our specific area. And

(43:02):
that's where all the partnerships kind of grow.
I feel like I always describe this as almost like a lasagna. You have these little layers in the lasagna, and everybody kind of partners with the other layers. So we're happy to do the storing of the data and the retrieval there, but then you need the others,

(43:22):
maybe for the evaluation side, or the agent frameworks. And there are so many layers in this AI lasagna right now; there are a lot of components. So it's great to be able to build that lasagna, or cake. Amazing.
Well, Philipp, thank you so much for joining us today. I really enjoyed chatting with you. It's always good catching up with you, and I'm so excited to continue to build with Elastic.

(43:45):
Everyone out there, you should check out our documentation, see some examples we have built out, and give it a try with Galileo's new free AI reliability platform. There's so much you can build using Elastic for your search and your RAG, there's a huge opportunity ahead, and we can't wait to build more together. That was great.
Thanks a lot.