Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
These generic metrics are most of the time not helpful at all.
They're way too generic. They don't necessarily correlate
with actual failures in your AI product and they don't actually
mean anything. And it can be extremely destructive because, like, you know, people get seduced into
this, like, OK, I could just plug in this dashboard to my system, I can get this dashboard of
metrics and it can tell me how I'm doing.
(00:20):
Then you kind of have this illusion of, oh, like I checked
the box, I'm doing evals and I am monitoring my system.
In reality, like you're not monitoring anything.
Just wasted a whole bunch of time.
Welcome back to Chain of Thought, everyone.
I'm your host, Conor Bronsdon, and joining me is my co-host
(00:41):
Atindriyo Sanyal, co-founder and CTO of Galileo. Atin, as always, great to have you behind the mic with me.
Always great to be here. Yeah, I'm excited for this
conversation because it's going to be on a few topics you and I
are particularly passionate about, and which, I hate to tell our audience, they've probably heard us opine about a few times. Because we have a special guest
joining us who's been at the forefront of applied AI, helping
(01:04):
over 30 companies navigate the complexities of building and
productionizing their products. Hamel Husain is an independent AI consultant, a luminary in the eval space, and has worked with
innovative companies such as Airbnb and GitHub, which included early LLM research used by OpenAI for code
understanding. Hamel, great to see you. Welcome to the show.
(01:25):
Thank you for having me. Yeah, it's absolutely a pleasure
because we've chatted a couple times about product philosophy,
how to approach AI products today, and you're obviously well
known for your blogs, your courses, and a particular
favorite of mine is your field guide to rapidly improving AI products. You get right to the core of
(01:46):
what actually makes AI products successful in the real world,
and that feels like the perfect place for us to start our
conversation. You open that guide with this
concept of a tools trap that many companies are falling into.
Can you start by giving our audience a
bit of an explanation of this idea and why so many smart AI
teams are falling into this trap.
(02:07):
Yeah. So a lot of times when people
think about evals or measuring, you know, the reliability or
performance of their LLMs in terms of, is it doing the right
thing for the user? The first thing that a lot of
people's minds go to is like, OK, what tools can I use?
Can I just like abstract this entire thing away to some tools?
(02:30):
Can I just buy tools? Like, can I make it not my
problem? You know, is there some
abstraction or something that I can use to like have it so that
I don't have to worry about the accuracy of my AI product or the
performance of it or if it's doing the right thing, like
(02:51):
maybe something can just figure it out for me.
And, you know, until we get to like AGI or something like that,
I don't think it's possible. But, you know, it's where people's mind goes towards. And so I think the number one
question I get is, hey, what are the tools?
And that's the wrong question. The question should be like,
(03:12):
hey, what's the right process? Like, how do you evaluate AI, you know, and get into tools later?
But what is the right process to go through?
Because no matter what tool you use, you have to go through a
certain process to evaluate AI correctly.
(03:33):
Yeah. And Atin, I know you have a
ton of thoughts about that process.
And I've seen both of you discuss this idea of generic metrics not being enough for many companies, you know, fancy dashboards not being a panacea but a Band-Aid solution, not something that actually solves everyone's problems.
(03:54):
And this idea in your guide, as you put it, Hamel, of creating a false sense of measurement and progress, or as you describe it, Atin, you know, AI has a measurement problem.
Hamel, could you give us an example of how you think vanity metrics have led teams astray? Yeah, so it's really tempting if you're building AI tools, and, you know, Atin probably, yeah,
(04:17):
can provide more color around this.
But I've seen this with other vendors.
Not trying to pick on Galileo or really anyone, to be honest. It's when you go into a pitch meeting around, hey, we can help you with your evals. You know, people want to see a
(04:37):
solution, and, you know, it's easy to kind of present a dashboard. It's convincing to some extent,
if you don't know any better, to present a dashboard with a whole bunch of generic metrics. Hallucination score, toxicity
score, conciseness score, you name it.
It just so happens that these generic metrics are most of the
(05:00):
time not helpful at all. You know, they're, they're way
too generic. They don't necessarily correlate
with actual failures in your AI product and they don't actually
mean anything. And it can be extremely
destructive because like, you know, people get kind of seduced
into this. Like, OK, I could just again,
going back to the tools discussion, I can plug in this
(05:23):
dashboard to my system. I can get this dashboard of metrics and it can tell me how I'm doing.
And then you kind of have this illusion of, oh, like I checked
the box, I'm doing evals and I am monitoring my system.
But in reality, like you're not monitoring anything.
You just wasted a whole bunch of time. You don't really know what your failures are, or what the most important things are that you should be focusing on.
(05:46):
And so it just creates a lot of churn.
And you know, I think people are getting a lot better at recognizing that now. But I think, you know,
especially six months ago, it was the cause of almost all of
my consulting business with people that were confused and
hit kind of a roadblock or a wall in terms of, OK, we plugged
(06:12):
in this tool, we got this generic dashboard thing and we
don't really know what to do now. No, I absolutely second what Hamel is saying.
In fact, I would go to the extent of saying that this
generic metrics problem has existed in machine learning even
before Gen. AI.
You know, even erstwhile ML workflows, you usually have a
(06:32):
held out test set, measure F1 scores and you know, just say
the model is good or bad based on that score.
Those approaches as well are akin to generic approaches where
they treat all kinds of errors the same way. In some situations they're necessary, but they're in no way sufficient. And these were also some of the
(06:52):
realizations that I personally had as well before even starting
the company. When we built Michelangelo at
Uber, there was no one stop metric that would be the panacea
for your problems. And the same patterns are
emerging again. And I'm curious to ask you,
(07:13):
Hamel, what kind of patterns you're seeing.
But basically, just to take an example with agents, customized
architectures are kind of the way to go.
You can build agentic architectures in a million
different ways and customized architectures need customized
personalized evals, which also need to evolve as your
(07:34):
application grows and evolves and meets new kinds of
data. So one good question to ask, I think, for a practitioner, for a developer, is rather than, oh, what metrics do I need from a buffet of metrics, what are the pains or potential risks in the workflows of my app?
Let's list them down and then author evals which are
(07:58):
customized to those pains, and then constantly monitor those pains, because those pains will also evolve (pains as in potential risks and pitfalls in your application). And then accordingly update the set of evals that you're using based on those evolving pitfalls.
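A minimal sketch of the pattern described here: list specific pains or risks, author a small custom eval for each, and run them over logged traces. The trace fields, failure modes, and the traces.jsonl file are hypothetical placeholders, not any particular product's schema.

```python
# Sketch: custom evals keyed to listed "pains" rather than generic scores.
# Field names, failure modes, and the log file are hypothetical.
import json

def missing_citation(trace: dict) -> bool:
    # Pain: answers that use retrieved documents without citing them.
    return bool(trace["retrieved_docs"]) and "[source:" not in trace["answer"]

def wrong_tool_choice(trace: dict) -> bool:
    # Pain: agent calls the calculator tool for non-numeric questions.
    return trace["tool_called"] == "calculator" and not any(
        ch.isdigit() for ch in trace["user_query"]
    )

CUSTOM_EVALS = {
    "missing_citation": missing_citation,
    "wrong_tool_choice": wrong_tool_choice,
}

def run_evals(path: str) -> dict:
    counts = {name: 0 for name in CUSTOM_EVALS}
    with open(path) as f:
        for line in f:
            trace = json.loads(line)
            for name, check in CUSTOM_EVALS.items():
                if check(trace):
                    counts[name] += 1
    return counts

if __name__ == "__main__":
    print(run_evals("traces.jsonl"))  # hypothetical trace log
```

As the pains evolve, new check functions get added to CUSTOM_EVALS and stale ones get retired, which is the "update the set of evals" step described above.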
But I'm curious to know if you've seen similar patterns, Hamel. Yeah.
(08:19):
So one of the things that's really important to do with
evals is to ground it in your failures.
So how do you know what your failures are?
And like the thing that we harp on a lot, and what we teach in evals and what I write in my blogs constantly, is look at your
data. But what does look at your data
mean? Look at your data.
(08:42):
So what's behind look at your data is this process called
error analysis. And error analysis has been around for a really long time, even before machine learning.
It's been around in social sciences. I recently learned that; I thought, you know, the first
time I was exposed to it was machine learning, of course.
But it is kind of this process where you go through and you
(09:04):
look at data and you take notes about what is going wrong.
And then you use those notes and you kind of categorize them.
You say, OK, like what kinds of errors am I seeing?
And you can start very simple, like counting those
categories and seeing like, OK, what types of errors are
happening the most. And then you make a decision,
(09:26):
like what to prioritize from there.
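A minimal sketch of the counting step described here: read the free-form notes taken while reviewing traces, bucket them by category, and surface the most frequent failure modes. The CSV file and its columns are hypothetical.

```python
# Sketch of the error-analysis tally: notes taken per trace -> category counts.
from collections import Counter
import csv

def top_failure_categories(notes_csv: str) -> list[tuple[str, int]]:
    counts = Counter()
    with open(notes_csv, newline="") as f:
        for row in csv.DictReader(f):  # expected columns: trace_id, note, category
            counts[row["category"]] += 1
    return counts.most_common()

if __name__ == "__main__":
    for category, n in top_failure_categories("error_notes.csv"):
        print(f"{n:4d}  {category}")
```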
And it's a very powerful technique that most people don't
do because no one has taught them to do it, I think.
And it's very simple. It's like the most simple kind
of thing. Like, you know, we're talking
about like opening a trace viewer and then like writing
(09:46):
notes and going through a bunch of traces.
And like, you know, the same questions always come up, like how many traces should I look at? And so on and so forth. And there's some useful
heuristics. There's this concept from social
sciences called theoretical saturation, which means, like,
hey, keep looking at traces until you're not learning
(10:06):
anything new. So what we teach is like, try to
look at at least 100 traces just as a heuristic to get people
started because they have a lot of anxiety.
If you just say theoretical saturation, they don't even begin. They just get scared of the whole process. The 100 is like a concrete number so people can have a goal.
And then like, you know, after you begin, you don't really care
(10:30):
about the 100. You're like, oh, I'm learning so
much. I think that's the
counterintuitive part: like, going through individual data points and reading what is happening in a focused session provides immense value, and people don't know that until they do it. And they're very surprised at the amount of value that it
(10:51):
provides. And so that can inform all of your evals activity, you know, it'll motivate everything, like what you should focus on, what you should write an eval for, etcetera. Is this error analysis really even bucketed in this activity of
(11:12):
evals? It's not even evals. It's just development, so
I'll just stop there. That's super fascinating.
Actually, as you were talking, I was drawing parallels to certain sort of opinions that we make on the Galileo platform itself, because we are an evals and observability platform.
(11:34):
There's this new notion of quantitative insights, or metrics, and qualitative ones. And the qualitative bit to me
sounded very similar to the theoretical saturation workflow that you're describing, which is the error analysis process where
it's less about numbers between zero and one measuring low and high, and it's more about something more
(11:57):
abstract. It's at a much more abstract level: are you achieving what you set out to do, and along the way, what pitfalls or errors
are you seeing? Something we do in Galileo is
kind of drive the developer or the user to using what we call log stream insights, and log stream insights are more
(12:18):
qualitative insights on hordes of your data, like segments of
your, you know, long running sessions, whether it's like a
chat bot session or any kind of long running agent.
We would analyse data in bulk and give you qualitative
insights and then try to correlate them to potentially
(12:40):
having you build some quantitative measures based on
those qualitative insights. And hopefully the more qualitative insights you find, you reach that theoretical saturation that you're talking about.
So I can draw a lot of parallels, and it's very fascinating to hear kind of the theoretical sort of side of error analysis
(13:02):
and the practice of it being much beyond AI and machine
learning. I'm curious if the two of you
think that part of the reason this approach to error analysis
hasn't really truly been popularized in current AI
development circles is because we've seen this change in
persona, where most of the people who were doing machine learning
(13:26):
work. Like, yes, there were engineers
involved, but it's a lot of datascientists who have kind of more
classically been trained on someof these error analysis
techniques, whereas software debugging is a different
approach often. And we're now seeing kind of the
marrying of these two approaches, with engineers who are now becoming AI engineers and working very differently, and having to transform both how they think
(13:48):
about the software they create from deterministic to non-deterministic, and also having to think about their
approaches in different ways. Is that what's driving this kind
of gap, you think? Or is it something different?
I think so, yes. I mean, I, I would say the first
epoch or phase of AI engineeringwas very much focused on, OK,
(14:10):
like we need to build stuff. We need to get, go to 0 to one
really fast and let's see what'spossible in a rough sense.
And, you know, now that you know, and so it was very much
the narrative and, you know, also the truth.
Like, you know, one of the most important skill to get started
was software engineering, you know, in that like, you need to,
(14:32):
you know, glue together a lot ofthings, use AP is, you know,
kind of full stack engineering really important.
And when it comes to OK, like, how do you know that this
stochastic system is reliable? That's a whole different skill
set that takes time to learn. And, you know, there's a very
(14:54):
large intersection between machine learning, data science,
and the skills you need to do evals often.
And the reason, you know, I tried to actually see how how
much I could get get away with in terms of like teaching
(15:17):
engineers evals without data science background or, you know,
the requisite, let's say background.
And you do hit a limitation really fast.
And like, for example, you know, Shreya and I are teaching a class
on evals. We've taught over 700 students
so far, of all different kinds of backgrounds.
(15:37):
And you know, like, for example, when we get into building LLM as
a judge, what we teach people islike, OK, one of the things
that's important with LLM as a judge is that you can trust the
LLM as a judge. And to trust the LLM as a judge,
you have to compare it to some human labels.
And there's questions that always come up, such
(16:02):
as, hey, like, why is it OK to sample data?
How can you know? And we show people
like, OK, if you want to know how much noise there is in your
judge, you can do stuff like bootstrap sampling.
People don't understand that. They're like, why is it OK to, like, sample a whole bunch of times from a data set to get the distribution? And so we found that
(16:26):
like we almost have to go back to classic statistics and teach
people that, which is not super tractable, to be honest, like,
you know, not in the format of, OK, let me teach you evals real
quick. You can teach fundamentals and
you don't necessarily need all that stuff to get started, but you do need like a fair amount of data literacy.
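A minimal sketch of the bootstrap idea mentioned here: resample the human-labeled set with replacement to see how noisy the LLM judge's agreement estimate is. The labels below are toy placeholders, not real data.

```python
# Sketch: bootstrap the judge-vs-human agreement rate to get a rough interval.
import random

def bootstrap_agreement(judge, human, n_boot=1000, seed=0):
    rng = random.Random(seed)
    n = len(judge)
    rates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        agree = sum(judge[i] == human[i] for i in idx)
        rates.append(agree / n)
    rates.sort()
    # Median estimate plus a rough 95% interval from the resampled rates.
    return rates[n_boot // 2], rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]

if __name__ == "__main__":
    judge_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # toy LLM-as-judge verdicts
    human_labels = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]   # toy trusted human labels
    mid, lo, hi = bootstrap_agreement(judge_labels, human_labels)
    print(f"agreement ~{mid:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```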
(16:50):
That's one side of the equation, the statistics, but also
it's all the analytical tools, right?
So like, how do you dig into data? You know, like, let's say we were talking about traces earlier, and clustering those traces or navigating them or analysing
them. Like you want to be able to like
really pick at data really fast and just do open-ended
(17:11):
exploratory analysis on it. And a lot of those data skills
come into play again when it comes to like digging into a
problem. And so you very quickly arrive at a very similar skill set to a machine learning engineer or a data scientist. You don't necessarily need to be training models, but I would argue that you shouldn't
(17:36):
be spending most of your time training models anyway.
Like, you know, you should be looking at data, doing a lot of error analysis and debugging and whatnot.
So that's my spicy hot take, perhaps, on this podcast.
I don't think it's a hot take at all. I think it's a very, very legit take on just the distinction
(17:56):
between software engineers and data scientists, and answering that key question in the new world of sort of meshed roles and the AI engineer, where, you know, most technical people are kind of undergoing this minor identity crisis. And the answer kind of lies in
(18:16):
what you said, which is if you were to cherry pick one skill
that's needed for the software engineer to become the AI
engineer or, you know, to be efficient in the modern era is
really just the skill of understanding data and knowing
the difference between good and bad data or how to take bad data
(18:37):
and step by step move it to good data.
And just data literacy is how you put it.
I think that is the main skill because there's the other skills
which are, you know, knowing the semantics of a decision tree, which is totally commoditized, and you don't even need to know how to train models or fine-tune them.
But to be able to understand this basic process of comparing
(19:01):
an output with a pre-generated ground truth, which is either human-labeled or synthetic, and just knowing the goods and bads of the practices, that is what data literacy is. And if this skill is adopted by a software engineer, I think they've set themselves up
for the future. Definitely.
(19:22):
And there's a lot of like related skills as well, like
designing metrics and the list goes on.
Like, you know, how to tell stories with data, how to have a sense of like when your metrics are leading you astray, all the way down to like having good product sense and having that be aligned with metrics, you know, potentially
(19:43):
doing A/B tests, the whole suite of things is important.
My friends and I joke that we might have a new job title
coming called AI Scientist, but I try not to be the one who is coining it. Wait a second, we're talking about AI PMs, AI engineers, AI scientists
(20:04):
now. Oh man.
You know, there's always, every time there's a technological
shift of some kind, there is kind of this sort of gravitation
towards the idea of a unicorn. We saw it actually, like, many times. Like, you know, the most recent time we've seen it is actually in data science itself where, you know, initially at the outset of the data
(20:28):
scientist, we had the person that did everything, software
engineering, statistics, DevOps, so on. And so far I think, like, people realized there's a little bit too much surface area, honestly, and then kind of split it into different kinds of sub-disciplines.
Maybe we may be seeing that with the AI engineer, if I were to
predict. Speaking of AI engineers, I know
(20:49):
one of the recommendations that you've made, Hamel, has been
that when teams are making AI investments, particularly when
AI engineers are helping make their decisions here, it's
really important just to have a customized way of viewing their
data, not necessarily a complex dashboard, so that they can
approach this debugging as error analysis in the right way, so
(21:11):
they can make decisions in the right way.
Because, as I think Atin and I have certainly experienced
working with folks, it's very easy to overwhelm teams with too
much data instead of enabling focus.
Why do you think giving everyone an easy way to see what their AI
system is doing is more impactful than some of the
(21:31):
sophisticated analytics that I think often we're trying to
reach for? Yeah.
So the guidance there is like, OK, there's a lot of tools out
there that provide a good way to get started, like Galileo. Like, you know, you have a way that you can plug in your AI application and see your traces in a stream and kind of go
(21:53):
through that. A lot of times in your applications, there's a lot of domain-specific things going on.
Like, you might have widgets that you're rendering, your application might be writing emails, you might have external data sources that you need to reference to evaluate a
(22:16):
particular trace. You might want to view the trace in the exact way the user is
seeing it. For example, you might have
things in your trace that by default are usually not helpful,
but that take up a lot of space in terms of tokens, all kinds of
(22:39):
like little nuances. So what you want to do is really
dial in the data viewing experience so that you can do
this error analysis and, like, review lots of data really
fast in a way that is very customized to you, that is very
contextualized to how you want to see data, all the data you
need to see in one place rendered in exactly the right
(23:01):
way. And so the reason that's our advice is just because of AI, because with AI you can vibe code. So, you know, AI is really good at producing simple applications that can render data, like, you know, simple web applications that
(23:22):
render data where you have, like, input fields and stuff like that. That's something that is probably, you know, below the bar where AI can clear those tasks very well. And so because of that reality, we recommend that people, in a lot of cases, create their own data annotation apps, because there's just way too much value
(23:45):
to be had relative to the cost of doing so.
It isn't the case, like, 100% of the time, but it's the case a lot of times.
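A minimal sketch of the kind of home-grown annotation app being described, here using Streamlit purely as an illustration; the traces.jsonl input, the category list, and the annotations.jsonl output are hypothetical, and a real version would render domain-specific widgets instead of raw JSON.

```python
# Sketch of a small trace-annotation app. Run with: streamlit run annotate.py
import json
import streamlit as st

TRACES = [json.loads(line) for line in open("traces.jsonl")]  # hypothetical log

if "i" not in st.session_state:
    st.session_state.i = 0

trace = TRACES[st.session_state.i]
st.write(f"Trace {st.session_state.i + 1} of {len(TRACES)}")
st.json(trace)  # render however makes sense for your domain

category = st.selectbox(
    "Failure category",
    ["none", "missing_citation", "wrong_tool_choice", "other"],
)
note = st.text_area("Free-form note")

if st.button("Save and next"):
    with open("annotations.jsonl", "a") as f:
        f.write(json.dumps({"trace_id": trace.get("id"),
                            "category": category,
                            "note": note}) + "\n")
    st.session_state.i = (st.session_state.i + 1) % len(TRACES)
    st.rerun()
```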
Atin, I know a big part of our recent product philosophy at
Galileo has been to give people more simplified views, whether
it's, you know, the graph view or timeline view, which we kind
(24:06):
of designed with the idea of like, OK, let's give them other
options to debug agents in particular as we look at these
kind of more complex systems, as well as other views that, you
know, may or may not be live by the time that this podcast
launches. And I know this is something you're thinking a lot about too, because, as I alluded to, we've
kind of had conversations together with AI engineers, I
(24:27):
think just like Hamel has, who are going, hey, I need help
focusing here. I'm not really sure what to look
at necessarily. I'm not sure where to spend my
time at error analysis. What's your philosophy on how to
approach this, I guess, observability and focus layer that Hamel is talking about.
Yeah. I think beyond the graph views,
(24:49):
which is a feature that we offer, features like graph views
kind of tend to point to the broader philosophy of giving the
right abstractions to the user to be able to kind of do the
segmented root-causing of, you know, this ever-growing
(25:10):
sophistication in systems which have evolved from simple RAG to agentic RAG to multi-agent.
You want to give the users the right abstractions so that they
can shine the torch in the right areas. And that's where views like the graph view, session views, interaction views, these come in, to be able to give the tools to
(25:32):
the user to just be able to root-cause effectively.
And what that means, what that entails is you run your
application end to end and each request may sort of touch
certain parts of your application and light up the
nodes there. And each request will run
(25:53):
through a different sort of path in your application, which you can visualize as a DAG or a workflow.
The first step is to be able to spot the anomaly where kind of
the ground level customization on the metrics as well as the
qualitative insights come in. But then these right
(26:13):
abstractions and the right views to be able to make sense of, yeah, what's going on. And then there's the data that's associated with it, because all this really is, is just data flowing through a bunch of nodes and edges.
So once you spot the anomaly, you want to look at the data and
what you know went wrong with that.
So simplifying the views around the data is kind of the next
(26:37):
step from there. And just to be clear, like what
I described is not at odds with these things in tools. They're just, like, supplementary. Like, I also always want a trace viewer, like the ones in Galileo, because it can be a lot faster to search through that and just look at that. Sometimes I'm looking
(26:59):
for something that maybe by accident wasn't in the
annotation tool or something else.
So it is really useful. And also, like, a lot of these platforms like Galileo have APIs where you can connect your annotation tool to, and, you know, write data back and forth
to it. So, you know, that's just
something I think about. Yeah.
And I think we all agree that's a a great best practice is to
(27:21):
leverage the AP is of whatever evaluation tool you're
leveraging. Obviously we, we hope that's
Galileo, but whichever eval tool you're using, like using that
API to bring that data into other places where you can look
at it, look at it in different ways and kind of consume that
information and highlight it to business users, I think is a
fantastic thing to do. And Hamel, I know you've talked
(27:44):
about this idea of empowering domain experts who may not be in
an eval product every day to add their insights and help improve
these non-deterministic systems. How do you think about, you
know, writing and iterating on prompts with domain experts
versus with engineers? Yeah.
(28:06):
So one of the biggest failure modes I see is also one of the
biggest drivers of my consulting business: it's people outsourcing evaluations to developers, which is fine if you're building a developer tool where the developer is a domain expert, but usually they're not. And kind of the symptom
(28:30):
there, or the root cause of people outsourcing evals to developers, is that they're thinking of AI like software engineering. They're like, oh, AI development is a software engineering task. And, you know, the moment you say anything about the AI development process, they're like, just outsource it to developers. That always goes
(28:51):
really badly because, yeah, like, you're only guessing,
you know, and the developers don't have enough context.
So you want to involve the domain expert.
It's like, you know, if you're building something for lawyers, you want to involve the lawyer, you want to involve the legal expert at some point. And so, you know, when it comes
(29:13):
time to doing things like iterating on prompts, you
shouldn't have the prompt so removed from the domain expert.
The whole point of LLMs is that humans can talk to computers.
And so if you obfuscate everything so much that the
domain expert can't talk to the computer, then you're kind of
burning the whole, you know, the value proposition of AI to begin
(29:38):
with, 'cause you want a direct, you know, line of communication between your domain expert and what's going into the AI in terms of prompting.
And so what I described in that blog post, a good pattern that I've seen work really well, is if you have a
user-facing application, you know, have like an admin view
(30:00):
where you expose the prompt and allow the person to change the
prompt. Even if you don't want the user
to change the prompt, for your internal purposes you have an admin view that allows the domain expert to change the prompt and fiddle with it.
It gives them like a more direct connection to what exactly is happening, rather than having abstract
(30:21):
conversations about AI. And it should do this and it
should do that. It's really important that they
get in there and they are like experimenting.
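A minimal sketch of the admin-view pattern being described, assuming a FastAPI app; the endpoints, the in-memory prompt store, and the stubbed-out LLM call are hypothetical illustrations rather than a specific product's API.

```python
# Sketch: expose the live prompt so a domain expert can edit it and
# immediately re-run the real application against it.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
PROMPT = {"text": "You are a helpful legal assistant. Answer using the provided context."}

class TextPayload(BaseModel):
    text: str

@app.get("/admin/prompt")
def get_prompt():
    # Admin view: the domain expert sees the exact prompt in use.
    return PROMPT

@app.put("/admin/prompt")
def set_prompt(update: TextPayload):
    # Domain expert edits the prompt here instead of filing a ticket.
    PROMPT["text"] = update.text
    return PROMPT

@app.post("/answer")
def answer(question: TextPayload):
    # The user-facing path reads the same prompt the expert just edited,
    # so their experiments show up in the real application immediately.
    full_prompt = f"{PROMPT['text']}\n\nQuestion: {question.text}"
    return {"prompt_used": full_prompt}  # call your LLM here
```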
Yeah. And I think it very much aligns
to what Galileo has done with our continuous learning through
human feedback feature because we feel the same way.
You need to leverage this domain expert feedback.
(30:41):
You can't simply have it just be the engineers, who may be, depending on your business, you know, divorced from the bare metal of what the product's doing. Like, hopefully they are very aligned to that, but sometimes they have business users who are translating, you know, key pieces of that for them, or domain experts who bring a lot of context.
And I know it's part of why, especially when we're looking at
(31:02):
custom metrics, but all of our metrics, we leverage, you know,
feedback from SMEs, you know, whatever type they may be. Maybe you can go in and say, OK, let me get feedback on these 10 traces and say, hey, this metric feels a little
off. Actually, this is pretty
accurate. Or, you know, here's a little
contextual feedback, and then use a judge to translate that and apply it and, you know, retune the metrics, something
(31:24):
we're finding a lot of success with.
But I, I think there's a lot more opportunity to go deeper
here. To your point, like it feels
like too often, even in highly customized evaluation systems
for enterprises, we are just scratching the surface of the
human context that we can bring in.
I mean, it's a very common problem for many organizations
(31:45):
that there is too much tribal knowledge that's not living in
documentation, that's not necessarily making its way into
systems. And to your point, it's so
necessary that we bring that human knowledge into our AI
systems because they perform best when they have the data
they need. And it can be as simple as, you
know, friction between technical teams and the understanding that
(32:08):
domain experts have of some of the jargon of your AI systems.
Like, you gave this great example in one of your pieces about translating RAG to just making sure the model has the right
context and really saying, hey, like, let's just put this in a
term that anyone can understand, even if they're not deep in AI.
What's your advice for AI teams who are looking to bridge that
(32:28):
gap and really bring their domain experts into the fold so
that they can be part of improving their AI systems and
their AI data? Yeah, let me clarify that last point with some concrete failure modes to look out for. One is, OK, there's an aspect of like a prompt store or
(32:50):
like a centralized place that you could put prompts, which is
fine. But a lot of times what happens
is folks don't build properly enough around that. They don't build like an
experimentation environment. And so like you have to change
the prompt there and then like commit it and then wait and then
(33:10):
like go somewhere else and like try something.
And that's like way too much friction.
So that is kind of, you know, that prevents the domain expert
from experimenting. A lot of tools have prompt
playgrounds, which are great. It's a good place to get
started. However, most prompt
playgrounds, they don't have access to your tools and your
infrastructure and your application code.
(33:30):
So they can't perform RAG and they can't call
tools and they can't do all the things that your application is
doing. And so you know, you can't
necessarily rely on that either. That's why you need this, like, integrated thing. I forgot what I called it in the blog post, something like an integrated prompting environment; I tried to make up a name for it.
(33:51):
Basically, you need to be able to play with the prompt in your user-facing application directly.
That's the only pattern, at least that I've seen, that's
worked reliably in terms of bringing the domain experts in.
Yeah, I'll just add a couple of points here.
First, of course, the need for sort of easy-to-use
(34:13):
human feedback is critical. And, yeah, like to your point, Conor, some of our human feedback features must go much beyond, you know, just offering binary signals, thumbs up, thumbs down, and the ability to create your own kinds of feedback becomes important.
(34:34):
But to Hamel's other point about just managing the prompts
and offering the subject matter experts the ability to tweak the
prompts and to interact with the app, I think engineering-wise the matter gets a little bit tricky, especially for more sophisticated applications like multi-agents, where things are
(34:56):
not necessarily driven by one prompt; you might have a series of prompts which are triggered one after the other. You don't have control over many of them.
But more often than not, it is driven by a kind of a seed
query, which is kind of the natural language interface to
any Gen. AI app.
So the engineering challenge kind of becomes how do you
(35:19):
abstract the entire application and make it available in front
of the user through a natural language interface, The user
being the subject matter expert,not the developer, but being
able to actually run the developer's app seamlessly.
So that to the SME, it's all about, here's my input.
(35:39):
I have pure knowledge about my input and the expected output,
but all the machinery in the middle you should be able to
abstract out for me. So the trickiness kind of comes
in the fact that, I guess, the challenge is around how do you use our APIs and the SDKs and of course all the, you know,
(36:00):
containerization technology to be able to kind of simulate a
version of the app, which may be a distributed app; it might be
running on, you know, two different availability zones for
that matter. It's just software.
So I think that's where the challenge comes. And we're kind of at a point where it's doable to
simulate, you know, sort of singular monolithic applications
(36:23):
and make this workflow available to the SME. But it gets challenging when the app itself becomes distributed. And that's where kind of a lot of engineering innovation is going.
Yeah, it's really non-trivial. Like, you have to think, you often can't expose everything to the SME. You have to say, is there a high-value thing I can expose?
(36:45):
And, you know, if nothing else, it just helps give them intuition so they don't think RAG is a very
abstract concept or prompt is even an abstract concept.
You know, you'll be surprised like how many people think
prompt is an abstract concept because they say something in a
meeting and the expectation is a developer's going to write the
(37:06):
prompt. That's the worst thing that can
possibly happen. So in whatever way possible, you need to get away from that. And so what I'd love to close
the conversation with. And Hamel, thank you again so
much for joining us. It's been a distinct pleasure
having you. It is just some advice: what would be your summation,
your advice to a team that is looking to build their eval
(37:28):
system, that is looking to improve their AI product?
What would you tell them? Yeah.
So the two biggest kinds of things that I can think of: one is error analysis, also known as look at your data.
It solves so many problems; like
(37:50):
maybe 90% of the whole evals process is looking at your data. You find so much even before writing evals; you'll just find so many bugs, so many things to change for improvement, so on and so forth.
And then the next thing that I can think of that makes a huge
difference is having an experimentation mindset.
(38:13):
And this one you have to cultivate a little bit.
There's some talks that I can point you to about how you
might, you know, reframe your thinking.
I mean, this is something that's innate to machine learning folks and data science folks: like, you know, you don't have this waterfall chart of how to build a machine learning
(38:34):
system. Like, you have
an idea of like different experiments you want to try.
You don't even know it's going to work.
But what you do have is a hypothesis of like, hey, like
this might work, this might not work.
Let's try this. Let's look at this afterwards.
And so you have to reorient a lot of things in order to do
(38:57):
that. You have to kind of, you know,
have a different language that you talk about within your teams
and sort of make sure that you don't have those
rigid approaches when it comes to this.
It's hard. Yeah, that's probably another
podcast, but those are my thoughts.
I mean, we can definitely have you back for another
(39:18):
conversation because I think there is so much more we can go
into here. Atin, how about you?
Any closing thoughts from your side of the house?
Yeah, I would say that, you know, erstwhile, before LLMs, AI was considered garbage in, garbage out, and now with LLMs, AI has become software. So Software 3.0 is AI, and now software is garbage in, garbage out.
(39:40):
So, to Hamel's point, do look at your data, because of
garbage in, garbage out. And secondly, I would say that
there's three specific things that I've learned as kind of the
layers of AI reliability. The bottom-most layer is kind of the brass tacks: set up basic monitoring and traceability.
(40:02):
That's just stuff that we've solved, you know, before AI
happened. Traditional observability is a
partially solved problem, and there are certain things that are done well there; adopt those practices.
The second layer of the three is: set up your prompts and your metrics and consider them your evaluation assets;
(40:23):
they are your first-class citizens, they will evolve over time, so have disciplined versioning and lineage around them and set up a good system there. And the third is the insights layer, which is the whole qualitative insights turned into customized quantitative insights. So if you practice these three things and kind of consider them the three pillars of your AI
(40:44):
reliability, you've built a good 360 evaluations and
observability layer in your software.
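A minimal sketch of the second layer's idea of treating prompts and metrics as versioned evaluation assets with lineage; the JSONL store and field names are hypothetical, not a specific product feature.

```python
# Sketch: append-only version log for evaluation assets (prompts and metrics).
import json
import time

def save_asset_version(path: str, name: str, kind: str, body: str, parent: int | None):
    record = {
        "name": name,               # e.g. "groundedness_judge"
        "kind": kind,               # "prompt" or "metric"
        "version": (parent or 0) + 1,
        "parent_version": parent,   # lineage: which version this evolved from
        "body": body,
        "saved_at": time.time(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["version"]

if __name__ == "__main__":
    v1 = save_asset_version("eval_assets.jsonl", "groundedness_judge", "metric",
                            "Judge: is every claim supported by the context?", None)
    v2 = save_asset_version("eval_assets.jsonl", "groundedness_judge", "metric",
                            "Judge: is every claim supported? Cite the supporting span.", v1)
```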
And I'll add one more thing: take my course. It's a shameless plug, but take the evals course.
It'd be a good way to learn about how to get set up with
evals. And I'll second that and
(41:04):
say also check out Hamel on X and on LinkedIn, where he shares
a lot of fantastic content. We will certainly link both
those in the show notes. And yeah, Hamel's blog as well is a great place to go learn.
Hamel, thank you so much for joining us
on the show. It's been a pleasure.
Yeah. Thank you.
And to our listeners, if you want more fantastic content from
(41:28):
Hamel and many other thought leaders, make sure you subscribe
to the podcast because we share information from industry
experts, perspectives from AI luminaries and hot takes, plus
much more, both in the podcasting app of your choice
and on YouTube. So whether you want to watch the
conversation, listen in, or check out any of our other
(41:48):
content from Galileo, you can find us all over the
Internet. We appreciate your support and
Hamel, Atin, thank you again for joining me today.
Thank you. Thank you.