
June 11, 2025 45 mins

What if the most valuable data in your enterprise—the key to your AI future—is sitting dormant in your backups, treated like an insurance policy you hope to never use?

Join Conor Bronsdon with Greg Statton, VP of AI Solutions at Cohesity, for an inside look at how they are turning this passive data into an active asset to power generative AI applications. Greg details Cohesity’s evolution from an infinitely scalable file system built for backups into a data intelligence powerhouse, managing hundreds of exabytes of enterprise data globally. He recounts how early successes in using this data for security and anomaly detection paved the way for more advanced AI applications. This foundational work was crucial in preparing Cohesity to meet the new demands of generative AI.

Greg offers a candid look at the real-world challenges enterprises face, arguing that establishing data hygiene and a cross-functional governance model is the most critical step before building reliable AI applications. He shares the compelling story of how Cohesity's focus on generative AI was sparked by an internal RAG experiment he built to solve a "semantic divide" in team communication, which quickly grew into a company-wide initiative. He also provides essential advice for data professionals, emphasizing the need to focus on solving core business problems.




Follow the hosts

Follow Atin

Follow Conor

Follow Vikram

Follow Yash


Follow Today's Guest(s)

Company Website: cohesity.com

LinkedIn: Gregory Statton


Check out Galileo

Try Galileo

Agent Leaderboard


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:01):
It's not my opinion of it. Are you pulling the host card?
Conor, you. Just you.
You made it too easy for me. It could have gone either way,
but it didn't. Yeah, yeah.
We'll go there. We'll go there.
Welcome back to Chain of Thought, everyone.

(00:23):
I'm your host, Conor Bronsdon, on the road at Microsoft Build.
And today I am delighted to have Greg Statton joining me.
Greg serves in the Office of the CTO and as Vice President of AI
Solutions at Cohesity. Greg, welcome to the show.
Great to see you. Thanks a lot, Conor.
I'm really excited to have this conversation with you.
Yeah, I'm excited to chat with you.
It's really appreciated that youmade time in between your busy

(00:46):
travels. You were just in the UK, I know
you've been all over the place, talking to customers, talking to
folks in the AI space. And so we're delighted to be
able to dive into the critical challenge that is facing nearly
every enterprise looking to build with AI, harnessing the
vast, often siloed stores of internal data they have.
Cohesity started by tackling the massive scale of enterprise data

(01:07):
management and backup, and is now at the forefront of helping
organizations unlock that data for the AI revolution.
Greg, Cohesity manages staggering amounts of data, hundreds of
exabytes globally, much of which traditionally sat as a kind of
insurance policy in backups. But you've embarked on a

(01:27):
fascinating journey to turn that passive data into an active
asset, especially for AI. Let's start there.
You've mentioned that Cohesity started with building an
infinitely scalable distributed file system, but focused
initially on those backups we mentioned.

(01:47):
Can you walk us through the key evolutionary steps that Cohesity
has taken, moving from this initial focus towards enabling
broader data intelligence and now AI?
Yeah, absolutely. And I think you, you said a
pretty critical word there. Fascinating.
It, it is really a fascinating story.
I think especially looking outside in, but, but from, from

(02:10):
the inside it, it's kind of been a destination where we've, we've
been heading towards or a stop along the journey ever since we,
we founded the company. Like, like you said, we started
off building this infinitely scalable distributed file
system. And a lot of us came from a
pedigree of, of hyper Conversion.
So now we had this file system, we wanted to bring workloads
onto it. And we took a step back.

(02:31):
We said, hey, you know, the world of, of data management,
data protection hadn't been, youknow, enhanced or revolutionized
in, in a very long time. So we said, hey, this is
something that companies all have to do.
They have to backup and protect that data to ensure the
integrity and the resiliency of that data last resort.
But if it's stored on an intelligent file system, there's

(02:53):
a lot more that we can do with it.
And, and I think that's kind of where we started to coin this,
this phrase of enabling our customers to re-leverage their
data for operational efficiencies.
So we built this, this backup suite of tools that ran right on
top of the file system. It could connect to every single
major enterprise application both on Prem and in the cloud.

(03:14):
And we started off by enabling areally easy means for customers
to to back that data up and thensend it to wherever it needed to
be. If they needed to replicate to
another site, if they needed to archive for long-term retention,
or they wanted another copy in the cloud, we made that extremely
seamless and extremely efficient in the way that we transport
and save that, that data. And I think one interesting

(03:36):
anecdote around this this time is we, we built the backup
software and we really wanted tojump towards this, you know,
enabling customers to gain, you know, additional insights from
the data. And we started off if, if you've
been a Cohesity customer since the beginning, you would
remember this. But we had this analytics
workbench where we enabled customers to go out and use the

(04:01):
clustered file system with all of the CPU and memory in there,
plus the distributed data to be able to run MapReduce queries on
the data. And us being a whole bunch of
nerds, we said, Oh yeah, everyone's going to love to do
MapReduce queries on the data. And it's super simple.
You just write your own custom mapper and reducer in Java and
upload the JAR files to the cluster, point it at files,

(04:22):
and away it goes. Shocking, I think, to literally nobody in the
enterprise, but shocking to us was, was that
that was hard to do. So we kind of pushed the pause
button on that and focused on, on kind of the next evolutionary
step of, of backup, then security, then accessing as, as
a file system. And then now kind of in this, in

(04:43):
this new phase of, of generative AI, being able to, to bring
generative AI to that data to help unlock that data's future
potential. So I really like that Cohesity
and you, Greg are both talking about how do we re leverage this
data that we already have, but maybe isn't being used.
As you start thinking beyond backup use cases into test

(05:08):
automation, security, how did you pave the way for turning passive
data into active data? Yeah, it it really started at
its core. You know, I think traditionally
a lot of the data management, data protection companies before
us focused on the cheapest way to store data in a proprietary

(05:30):
fashion on the cheapest medium possible.
And there'd been a lot of advancements in, in compute and
in storage and in memory. So and we wanted to be able to
kind of tap into that and, and harness that.
So what we, what we did is we said, hey, you know, this data
is sitting there and our, our file system is this snapshot

(05:50):
based file system. And we can very easily by
leveraging the way that we structure our metadata within
the file system, instantly create kind of redirect-on-write
clones of that data. So we can go in and say, hey,
you have these backups of these, you know, hundreds or thousands
of VMs or this NetApp filer or this S3 bucket in, in AWS.
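For readers who want the shape of this idea in code, here is a minimal conceptual sketch of redirect-on-write cloning (not Cohesity's implementation, just the general technique): a clone copies only metadata, and writes land on the clone's own entries, so the original backup stays untouched.

```python
from copy import copy

class Snapshot:
    """Toy redirect-on-write snapshot: data blocks are shared until written."""
    def __init__(self, block_map: dict[int, bytes]):
        self.block_map = block_map             # logical block -> stored bytes

    def clone(self) -> "Snapshot":
        # Copies the metadata map only; a real system would share it structurally.
        return Snapshot(copy(self.block_map))

    def write(self, block: int, data: bytes) -> None:
        self.block_map[block] = data           # redirected write: clone-local

backup = Snapshot({0: b"ledger-v1", 1: b"config"})
dev_copy = backup.clone()                      # "instantly re-provision" the backup
dev_copy.write(0, b"ledger-v2")                # the original backup is untouched
assert backup.block_map[0] == b"ledger-v1"
```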

(06:12):
You know, how can we, how can we instantly re-provision that
backup, not only for recovery. That was kind of the initial
phase for this is like, hey, howcan we help our customers not
only back up but recover quickly? It's how we do these clonings of
it, but then also that allows access to the data.
And I think that was kind of a little bit of an aha moment is

(06:33):
like, hey, if we can back this up and if we can then re
provision this back out, it becomes extremely useful.
And, you know, companies are dealing with these, these
literal truckloads of, of data and trying to shift around these
raw bits, these ones and zeros. But data in and of itself isn't
necessarily extremely useful. You know, I think within IT, our

(06:57):
job is to be able to apply context to data, to turn it into
information for the business. And then the business can take
their understanding of that information and turn it into,
into knowledge. And so by being able to help in
the context collection through, you know, traditional metadata,
advanced metadata, it allows us to be able to help customers

(07:19):
not only find the raw, raw information that's useful to
them, but the correct version, the correct point in time, you
know, and be able to kind of help enable this governance
layer on top of it to more efficiently move that data for
use with, with other applications.

(07:39):
Let's kind of unravel this point.
You made it just now about efficient use of data.
What do you mean when you say that?
I spent a lot of time in, in, in, in IT earlier on in my
career and we were responsible for, for caring for and feeding
for the data that, that drove the business.
And there's lots of it. And I think if you talk to

(08:01):
anybody in the enterprise, no matter the size, if you point to
any given file in anywhere in their environment, chances are
there's 3, 4, 5 copies of that piece of data, that file that, that,
that object. And there's going to be
different permission structures for all of them.
There's going to be different information inside of it from,

(08:23):
from different versions. And so when you're looking
forward now in today's world, trying to hop on and, and
leverage AI efficiently in your, in your organization, you have
some like really tough questions to be able to answer for.
And it's not very efficient to back up a truck load, you know,
hundreds of petabytes of data into it virtually or physically

(08:46):
and ship it off to like AWS or Azure or GCP.
Now the cloud providers will love that, but it it's extremely
costly. It takes a lot of time because
physics haven't changed, you know, physics we're still bound
by by laws of, of physics. So the time it takes then you
you're extending that time to tovalue.

(09:06):
So the more that you can do in preparation of that data for
your end goal. So again, like, you know, by
bringing all that data onto a single platform.
So you can start munging it to make up a word maybe or to use a
word differently here, munging it.
Yeah, I love it. So you're going to munge that
data, you're going to be able to sift through traditional

(09:27):
metadata, and then through Cohesity we can kind of help
create some more advanced metadata on top of that, but that
allows the enterprise to, to have a governance view into this data
that they've never been able to easily have before.
And it's not like going and buying another piece of software
and then creating another copy of this; this data is being
reused. Its, its primary purpose is for

(09:50):
backup and, and with backup, you hope you never have to use the
data again. If you kind of think about it,
it's, it's pretty wild to spend money on something that's
holding something very importantto you that you hope you never
have to use. So that's why we, we decided,
hey, why not, when it's not being needed for, you know, a
full data center recovery or your boss needs to recover an

(10:11):
e-mail. Let's start doing more with it
and help the enterprise become more efficient with managing
their data. One of the key ways that
Cohesity is leveraging this backup data on behalf of their
customers is with this shift towards security intelligence,
detecting malware or ransomware within backup streams using

(10:32):
anomaly detection. How has successfully applying
machine learning for security influenced the roadmap that
you've been building for more advanced AI applications?
Yeah, Yeah. I think it, it touches a little
bit on, on what I was talking about before around being able
to collect more information about that raw data, applying

(10:55):
more context to it. So we launched a suite of of of
security tools because again, this is, this is data that
touches everything in your enterprise.
So this is every application, every endpoint at times across
all different locations, physically and, and virtually.
And so while this isn't set up to replace, you

(11:18):
know, all of a company's, you know, security tool sets, it can
help augment and provide richer insights to help them make
decisions. Like for instance, we're backing
up this data and this, this datachanges over time.
So you know, the first time you,you go in and do a backup, it's
a full backup. So it's a copy of everything

(11:38):
from that point in time. But then going forward, we're
looking at the change in data between time, which provides a
really interesting insight. Being able to look at and model
the change of data over time allows us to create fingerprints
of this. And so one of the first things
that that we did in the machine learning space was like, hey, we

(12:00):
can actually put in some anomalydetection into the backup stream
as we're ingesting the data to be able to start helping our
customers flag data that could be potentially infected by
malware. And this is a very fast, inline
method of being able to look, like, hey, does this data
dramatically fall outside of this, of this linear regression

(12:21):
line? And, and maybe this is something
that, that a customer should go look at, and can more quickly
alert upstream, you know, SecOps tools or, or security teams
or SOCs that, hey, this application now, you know, the,
the entropy has changed dramatically.
And when there's a large entropy change, it more often means
that there's some bad actors in there, you know, encrypting all

(12:45):
of the, all of the data or, or wiping out the data.
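As a rough illustration of the inline check Greg describes, the sketch below fits a linear trend to historical change rates and flags a new backup whose change rate falls far outside it. The threshold and the toy numbers are assumptions, not Cohesity's actual model.

```python
import numpy as np

def change_rate_anomaly(rates: list[float], threshold: float = 3.0) -> bool:
    """Flag the newest backup if its change rate falls far off the linear trend."""
    y = np.asarray(rates, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x[:-1], y[:-1], 1)   # fit the trend on history
    residuals = y[:-1] - (slope * x[:-1] + intercept)
    sigma = residuals.std() or 1e-9                    # guard against zero spread
    predicted = slope * x[-1] + intercept
    return abs(y[-1] - predicted) > threshold * sigma  # ransomware-scale jump?

history = [0.02, 0.03, 0.02, 0.04, 0.03, 0.62]  # fraction of data changed per run
print(change_rate_anomaly(history))             # True: an entropy spike worth alerting on
```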
And we've seen that be used very successfully by our
customers. There's been a couple of cases
where our anomaly detection engine because of when the
backup was running actually caught the malware before the
SecOps tools did and a SOC was alerted.

(13:06):
And so they received the alert from Cohesity and they were able
to start implementing their security procedures to help
safeguard the company. But again, this is all now about
collecting more metadata, more meta information about this data
and how it's changing over time. So this, this then now
provides a lot more context to help organizations or or AI turn

(13:29):
that information into knowledge. Let's
drill down there. As orgs are looking to, as you put it, turn
this information into knowledge. Great phrase.
What's the next step look like?
What's the next thing that you think should be built?
I'm I'm so curious to understandmore about what the road map
looks like for you. Yeah.

(13:49):
So right now, today, customers go out and buy a Cohesity.
Hopefully they buy a lot of Cohesity, but what, but they,
they'll buy some Cohesity and, and they're often times, you
know, optimizing their, their backups and they're, they're
fully integrating the security suite and they're starting to
remove some of these extra applications and fees and, and
contracts that they have because now they can consolidate that,

(14:11):
that on there. It was about two years ago when,
when we launched our first generative AI applications.
Hey, let's, let's kind of tap into this and provide some
retrieval-augmented generation, or RAG, pipelines to this data.
You know, we focus on the information retrieval pipeline,
creating a semantic index of that data.
Our customers can then open thisup to, to other business lines

(14:34):
of business with their organization and they can use
that own data to, to apply context to this.
But then if you go back a littlebit further in the conversation
we had, this is additional meta information, like I can't even
call it metadata. It's not like a file size or
ownership. We're now capturing semantic
information. We're capturing topic and theme

(14:54):
analysis of this data. This is yet more information
applying it as context. And so, you know, as we kind of
look further forward, one of the key areas that, that we want to
be able to, to solve for, and it's what we're, we're building
now, is how can we help companies either build micro-SaaS AI

(15:15):
solutions for their industry or for their line of business
through very simple API access to, to data?
How can it help empower ML teams?
You know, I think like you, you probably know if, if you, if
you're newly joining an ML, AI, or data science program within an
organization and you need data for your, for your models,

(15:36):
there's not really any great cataloguing of this.
You go find the person with the most 10 year in that
organization and ask them where they can find this data.
And that's madness. And so I think as we're looking
forward is like, hey, how can we make it just stupid simple to
find the exact piece of information and data you are
looking for to help you solve your business problem.
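A hedged sketch of the "semantic metadata" idea in play here: layering machine-generated topic and theme tags onto a catalog entry next to its traditional metadata. The zero-shot model, label set, and schema below are illustrative choices, not anything Cohesity is stated to use.

```python
from transformers import pipeline

tagger = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

TOPICS = ["contracts", "engineering design", "financial reporting", "HR policy"]

def enrich(entry: dict, text: str, threshold: float = 0.5) -> dict:
    """Attach topic/theme tags (semantic metadata) alongside traditional metadata."""
    result = tagger(text, candidate_labels=TOPICS, multi_label=True)
    entry["topics"] = [label for label, score
                       in zip(result["labels"], result["scores"]) if score > threshold]
    return entry

doc = {"path": "/backups/q3/review.docx", "size": 48213, "owner": "finance"}
print(enrich(doc, "Quarterly revenue grew 12% while operating costs held flat."))
```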

(15:58):
And so think of it like a, a global data catalog that that
encompasses traditional metadata, kind of augmented
traditional metadata with security enhancements and then
now a new world of semantic metadata and that change over
time. And this can empower lines of
business. This can empower security teams to be able to do

(16:19):
like deeper forensic analysis. And I think again, it's using
that single data fabric where all that data has already been
collapsed into. Yeah, I, I think it's really
interesting to to think about this idea of where data flows
and where you have these reuse opportunities.

(16:40):
And I, I like that everything you're describing thinks kind of
through this stage 1 of like, OK, let's leverage the data here
and then, OK, here's how we evolve it to next.
And my understanding of that is that essentially Cohesity's use
of Gen AI started by tackling an
internal problem, actually using GPT-3 and Cohesity data;

(17:04):
you essentially created an early RAG system before it was widely
termed. Could you share more about how
that initial experiment and the aha moment that you experienced
helped you realize the potential
for leveraging backup data this way?
Yeah, no, that that is a really fun story.
So I'm, I'm a tinkerer. I always love tinkering with,

(17:28):
with new things. And, and while, you know, I, I
don't hold a PhD in, in, in machine learning or, or AI, I've
always been curious the last 15-20 years in, in this space, and
this was probably about three years ago now, I was running a
team of global field experts. So these were deep technical

(17:48):
experts in our product that helped our, our sales teams win
deals with large customers and then help those large customers
be successful. And because this team was very
knowledgeable, all of the new sales engineers and new people
coming into the company would always ask the folks on my team
questions over and over again. And often times it's the same

(18:11):
question. And it was hard for them.
While they always wanted to say yes and they always wanted to
help, it was tough to then constantly context switch,
answer a question they probably answered 10 times already and
then go forward. And when I started questioning a
handful of the folks in the team, I said, well, did did you
tell the person to read the documentation?
Yeah. Yeah.
Greg, we told them to read the documentation.

(18:31):
They still asked the question. Yeah, they still asked the
question. Did you tell them to read the
frequently asked questions that that we constantly spend time
updating? Yes, Greg, we told them to, told them
to read the FAQ.
Did they still have a question? Yep,
they still have a question. Then it dawned on me.
I was like, hey, everybody's going to ask a question slightly
different. And the way that
they're asking the question is the way, the way they

(18:53):
understand the information that they know about today, trying to
fill in gaps. And so there's a semantic divide
between what we think are frequently asked questions and
what a new person actually has. And it was right around this
time when I saw, I think, probably a Reddit post of OpenAI
providing access to GPT-3.
I said, well, this is pretty interesting.

(19:16):
I hadn't really spent too much time with NLP because words are
hard times. And, and I said, well, this is,
this is interesting. So I signed up and I got a bunch
of free credits. I was like, oh, this is sweet.
Let me play. And with the, with the model, I could
send it text and it could send me new text back.
This is interesting. Transformers are really cool.

(19:37):
I already knew that coming into it.
But what I was what I started doing is like, well, hey, if I
send it some text and then I also pack in some more tokens of
like some additional context, I can help frame the response from
the model, even though the model wasn't trained on this data.
And so I, I quickly whipped together a little web UI where

(19:59):
I, I created like a, a semantic index.
I just loaded it in memory using TF-IDF vectorization and cosine
similarity to find, you know, paragraphs of text or chunks of
text from our internal docs that were semantically relevant to
the, to the, the user's question. Pass it all to GPT-3 and it

(20:22):
answered. I was like, it was an aha moment
for me. I said, well, this is this is
great. So then we, we, we kind of
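For the curious, here is a minimal sketch of the kind of prototype Greg describes, using scikit-learn for the TF-IDF index and cosine similarity. The document chunks and prompt wording are made-up stand-ins, and the final GPT-3-style completion call is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [  # stand-ins for paragraphs pulled out of internal docs
    "Snapshots can be replicated to a second cluster for disaster recovery.",
    "Archive policies move cold backup data to cloud object storage.",
    "Anomaly detection flags backups whose change rate spikes unexpectedly.",
]

vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(chunks)          # the in-memory semantic index

def retrieve(question: str, k: int = 2) -> list[str]:
    scores = cosine_similarity(vectorizer.transform([question]), index).ravel()
    return [chunks[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"

# The assembled prompt would then be sent to a completion model such as GPT-3.
print(build_prompt("How do I protect backups against ransomware?"))
```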
hosted our little internal prototype after running on my
laptop and, and started opening it up to, to some SEs, and they were
like, well, this is, this is great.
Like, you know, it's not 100% accurate all the time, but like
this is this is great. And and then I started providing

(20:43):
reference links back to the files that we used to answer the
question. And they said, this is great.
I can get my answer. I can go verify it with the
docs. And it was at that point in time
our founder and our new CEO caught wind of what I
built and they said, can you show it to us?
Yeah, of course showed him. He said we're going to move you
into our R&D organization and and we need to build this into

(21:05):
the product. I said, well, yeah, it makes
total sense because all the datawe were using was being stored
on Koici to begin with. And if, if we had this problem,
if I had this problem, my team had this problem, more likely,
more than likely there's a lot more people that are suffering
the same problems. And so we kind of started to
unlock this knowledge discovery problem to solve internal

(21:27):
problems that that we're also now helping our external
customers with. I don't think I realized that
you were on the SE side of things at that point when you
joined Cohesity. That's such an
interesting journey. I have had several jobs, is the
way I like to describe myself, because it's, it's my ten-year
anniversary coming up at Cohesity and I think I've worked in
almost every single department except for finance.

(21:51):
And I think we're a better company for that decision for
the company not put me in the finance department ever, but I
was, I've been on the marketing side, I've been on the this, the
sales side, both in kind of an architect role, an SE role, a
global resource and now in core R&D.

(22:11):
Love it. Yeah.
It's it's really cool to look atyour career journey.
I think folks who feel pigeonholed, maybe go check out
Greg's LinkedIn, which we'll certainly link in the show notes
because it's very inspiring to see how you've, you know,
continue to expand your technical skills and then also
applied them in different domains, whether it's like
starting as an application developer to now being a VP of AI

(22:35):
solutions today and doing core R&D.
So very, very cool to see. So obviously mastering retrieval
is so critical before even getting to the generation part
because as you mentioned earlier, you know, like it's
great to
have something that works most of the time, but the more
accurate you can get it, the better.
And particularly in large, complex organizations that have

(22:56):
lots of data to unlock and to leverage, there's huge
opportunities here. So what's your thought process
around how organizations should be approaching retrieval for
their AI solutions today? A lot of companies will come and
ask questions like, well, Greg, you know, we want to start
leveraging Gen. AI internally.

(23:17):
We're, we're reading all the blogs and they're saying that
I'm going to save, you know, 10X my investment in just 90 days
or, or I can create this new application with just five lines
of Python code and it scales to Infinity.
You know, how, how do I get that tomorrow?
And I said, well, let's let's hold on, put a pin in that for a
second. I said there's, there's a lot of

(23:39):
hype around this, but there's also a, there's a ton of truth
to, to what people are, are talking about.
There's a ton of value to be, tobe gained, but I feel like a lot
of people skip over the hard part.
It's it's, it's less fun, it's less exciting, it's less sexy.
But I think for those of those that have been in the in the
world of machine learning and and data science, know that it

(24:01):
kind of it starts with with data.
And so I'll talk to a lot of, of folks and say, what you need
to do first, and this is a non-technical thing, but get to
your peers cross-functionally within an organization and, and
all agree on a governance model. So you need to first figure out
where all your data lives. Now you got this mapped out

(24:23):
globally. You say, all right, what version
of, of this data is going to be important to us to, to be able
to kind of get to our end state. Now that you've got that
identified, it's like, well, of this data, there's probably
some, you know, data that some people shouldn't be able to see.
There's some data that people should see.

(24:45):
There's probably some data that you just don't want ever
interacting with an AI model internally or externally.
It just could be way too sensitive for you right now.
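As a sketch of what that classification step might look like once written down (field names and values here are illustrative assumptions, not a Cohesity schema):

```python
from dataclasses import dataclass, field

@dataclass
class DataSource:
    location: str                  # where the data lives
    version: str                   # which point-in-time copy matters
    sensitivity: str               # e.g. "public", "internal", "restricted"
    ai_allowed: bool               # may this data ever reach a model?
    readers: list[str] = field(default_factory=list)  # roles with access

catalog = [
    DataSource("s3://hr-exports/2025-05", "latest", "restricted", False, ["hr"]),
    DataSource("nas://eng/design-docs", "latest", "internal", True, ["eng", "pm"]),
]

def usable_for_rag(src: DataSource, role: str) -> bool:
    """Gate retrieval: only AI-approved sources visible to the caller's role."""
    return src.ai_allowed and role in src.readers

print([s.location for s in catalog if usable_for_rag(s, "eng")])
```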
So you need to be able to then go and say, all right, now
here's the data that we that we want to be able to use.
Here's the access controls we want to be able to put on top of
that data. And this is, this is kind of a,

(25:07):
a stepped phase of, of data readiness, data preparedness,
data hygiene. It's something that that I think
we as an organization or industry have put aside for
far too long because it's hard. It starts with consolidating the
data into one place, then applying that governance layer
that you've hopefully talked about with all of your peers cross-

(25:27):
functionally. Which, no shocker, nobody
has.
when I say, oh, if if you talk to, you know, your marketing,
your HR, your engineering, you guys have a governance.
No, start with that. Start with a governance model
around the data. Greg, I hear you can just ask
you that since you've been in all those departments.
I can and I just said give me the access to the data, but

(25:50):
it's, it's, it, it is by far less exciting than saying, well, I
want to go test out, you know, the, the top of the leaderboard
language model, or I want to then play with this new agentic flow.
You can, you can spend all of your time doing that, but it's
still going to give you garbage if you give it garbage.
And so that adage of garbage in, garbage out is still totally

(26:13):
true. But once you've got that
foundation, then then there's like some great steps that we
spent a lot of time thinking about around, you know, how do I
extract, how do I extract the raw data, raw text, image,
video, whatever out of that payload of, of, of raw data, You
know, how do I embed this? Like we had some, some serious

(26:33):
challenges around just simple, you know, embedding of, of, of
this data at massive scale. You know, how can I re rank this
data to to kind of ensure that that my, I'm getting the kind of
the best blend of precision at and and and recall across my
across my data state to help answer this question.
And then at the very end, you say, well, what, what's going to

(26:55):
be the best language model for this?
But there's a lot to get to before you get to play with the
the LLM on on the other end. I feel like one of the themes of
this show, and anyone who's listened to a few episodes will
probably notice this, is this discussion of the magic bullet
versus the actual infrastructure work that needs to be done.
Because usually AI is marketed as it's magic bullet.

(27:19):
It's going to solve your problems.
You just apply it and bang off the races.
And there's a little bit of that.
There is some magic there, no question.
But anyone who's played D&D knows that if you're a wizard,
you're trying to harness the power of the universe.
You have a lot of studying to do, you've got a lot of work to
do. And if you just decide to go to
sorcerer route and make some pact with some entity as a

(27:42):
warlock or something like that, that comes with risks.
And to to apply this to AI here.If I think about like, hey,
let's just, let's just throw an LLM at it and think it'll figure
itself out. There are risks to that, and you
are not going to get the accuracy you want all the time.
You need to test, you need to evaluate, you need to understand
your data pipelines, you need tohave the infrastructure in place
and do the work in order to get that fully constructed wizard

(28:04):
spell that you want. And then yes, it is magic and
you can apply it, but it it comes with all the upside and
all the downside if you don't actually think through your
approach to data. So I appreciate you talking
about how Cohesity and how you have have thought through this
approach because so often enterprises have data spread
across various sources, various formats and unifying that access

(28:29):
or providing a consistent way to query and retrieve relevant
information for AI regardless of where or how it was originally
stored. I can imagine it's quite
challenging. It, it, it really is.
No, I was chuckling as you were saying that because, because,
yeah, I think, and, and I think me being in the industry on the
vendor side, we have to take a lot of the blame here.

(28:49):
But our, our enterprise customers are starting to
believe that it's magic. And, and it's funny, I, whenever
I give talks, I always ask the, the audience and it's mostly IT
folks. Now we're getting some of the
data folks. But I think can, can anybody
give me a simple definition of, of artificial intelligence,
machine learning or generative and just the simplest?

(29:10):
And then it's like, it's usually like dead silence.
I'm like, hey, it's OK. Tell me what?
Just tell me what you think it is and people will try to give
me these complex definitions, of, of trying to, in their words,
describe neural nets or, or deep learning.
I was like, no, no, no, simpler. I was like, it's, it's not
magic, it's statistics at scale. It's all math.

(29:31):
It's using information from the past to predict a future event.
And that future event could be the next most probable token
or word in the sequence, or now tokens,
now we're doing multiple token predictions. Or it can be a
forecasting model, linear regression; like, linear
regression still works great today for, for certain tasks.
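A tiny worked example of that framing, using ordinary least squares to predict the next point in a series; the numbers are arbitrary.

```python
import numpy as np

history = [112, 118, 123, 131, 137]            # any past signal
x = np.arange(len(history))
slope, intercept = np.polyfit(x, history, 1)   # fit the trend on the past
print(slope * len(history) + intercept)        # predict the next step (~143)
```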
And, and I think we're in a world now where, where, where

(29:53):
people are also throwing LLMs at, at everything.
Let's say, you know, I want to do, I'm just going to put an LLM
with this data, it'll figure it out.
Or I'm going to put an LLM, you know, in my life sciences or, or
within, in the hands of my doctors and, and it'll figure it
out. And there's a couple of major
problems with this too, like, like you touched that, like we

(30:13):
had talked about the data problem on the other side, but
there's probably certain problems that, that you don't
want to use an, an LLM for, or, or sometimes you don't even want to
use AI for. And you have to understand like
as a, as a, as a product owner in the enterprise OR, or, or
creating products, you have to understand the end use and, and

(30:33):
what like a sufficient F1 score is going to be like, is it OK
that it's going to get it, potentially get it wrong 15% of
the time? Or do you want it 2% of the
time? You know, and, and so being able
to disclose those, those evaluation scores as well as
like understand the use case, you know, what's going to be
acceptable is, is kind of also something that we're not

(30:56):
necessarily talking about. It's a great point to take that
product owner mindset of what dowe actually need to deliver
here, because it can really vary.
And what's acceptable on an internal document retrieval use
case is extremely different fromwhat a financial services
provider needs in order to put a customer service bot into

(31:16):
action. And, and that's my, you know, we
obviously work on this a lot at Galileo and have dealt with a
lot of our enterprise customers. And it's a, it's a very
complicated, customizable deep field.
And I, I know that you're dealing with that with a lot of
your customers. And I've heard you kind of
critique this common approach of, of using one model to

(31:37):
evaluate another as turtles all the way down.
LLM-as-a-judge, I think, has a lot of pros, but there are definitely
cons. Why do you view this approach as
insufficient, especially for high-stakes enterprise use
cases? I think it's tough.
I although, you know, I've been,I've been thinking a lot about
this, because I say it a lot, like, it's turtles all the way down.
You should never do this. I don't know if you should never

(31:58):
do this. I think if it goes back to what
I was talking about getting evaluations for this, like do we
have proper evaluations on the evaluator model?
Like how do we know that that it's giving us a good result?
There may be some use cases where where this is this is
completely valid, but I think weinitially started doing this
pretty heavily. If I look at a lot of, like, the

(32:18):
birth of some of the open source projects around evaluations and,
and using an LLM, is because the cognitive load on
the humans to evaluate was hard. You know, 5 pages of text is a
lot. Especially when you kind of look at it, you know, on the
opposite end of the spectrum, you know, we collect behavioral

(32:40):
metrics to, to help retrain AI models all the time.
Either like I put a shirt in my shopping cart or I don't, I buy
something or I don't, I give a thumbs up or I don't.
That's a very low cognitive loadthat needs to be applied to be
able to get good, good feedback or good evaluations.
But if I, you know, word vomit up, you know, a 10,000 word

(33:02):
response to a simple question, it takes an expert in that.
It takes somebody reading through it, and an LLM might be
good at it, but I think it's also disingenuous for us to
throw away, you know, 20 years of of NLP evaluations.
There's probably a good mix, mixof both and I think there's uses

(33:22):
for LLMs in this. I think about it, LLMs are great
at summarization and kind of like first draft generations.
And if I'm, if I start to use the LLM for what it's good at in
these pipelines, like you start to get some great results.
Like recently I was doing, I kind of had this, this notion or
an idea, and I'm sure it's been done out there, but it's saying,

(33:45):
you know, a lot of people are looking at, yeah, at evals with,
with LLMs or RAGs, saying the answer is either wholly good or
bad. And but again, because that's an
easy path and we're as humans, we're, we're like really lazy,
simple. Yeah.
Yeah, it's simple. But you know, when we're having
a conversation, you know, an an individual will question a

(34:08):
single statement or single thought.
You know, as humans, we, we decompose whatever is being said
to us into kind of facts or claims or assertions and we try
to evaluate internally is this, do I believe this?
Do I have evidence for this? So like I've been doing a lot of
work in taking the generations of an LLM, decomposing it using
an LLM into the claims, because again, you know, being able to

(34:32):
identify parts of speech or phrases and extracting them is
a, is an NLP task that LLMs can do pretty well.
And then I can use some other models to like classify a claim
or assertion, along with retrieved context, to say like,
hey, is this, is this valid? Do I support this?
Do I not support this? Or do I not have enough
information? So I think there are great uses

(34:53):
for LLMs in evals, but it can't be, it can't be
exclusive, and it can't be the model that you use to generate
the response. Oh yeah, I see this a lot where
I think a lot of the folks are having success when they work
with enterprise customers today is like, hey, we're applying
multiple judges that are different LLMs, weighting the

(35:15):
responses and then using those to flag things.
Then you go have humans look at because yeah, at a certain scale
we can't have humans look at everything just where we are
with automation today. But like, it doesn't mean humans
shouldn't be involved in the process and that human
validation shouldn't occur. We call it continuous learning
through human feedback, which is kind of how we've integrated it in
our platform. And that is huge, being able to

(35:39):
get human feedback into the process.
And you're right, humans are notgoing to be able to get
everything. It's same with like with like
synthetic data generation. If you start with something that
humans have created, it's a goodfoundation.
If you then, you know, randomize your distribution of responses
and start spot checking, that could be
reinforced into the model to then say, oh, like, let's improve

(36:00):
how we approach this. Yeah, and being able to code in
reward functions for, for correctly identifying
these are great practices. But I've, I've seen some people
say, well, I used GPT-4.5 to, to, to generate it.
I'll just use GPT-4.5 to say, is it good or not?
And that's, you know, that's just, if that, that's the

(36:20):
inmates running the asylum. It's like walking into a room of
thieves as a cop and saying, you a thief?
No thief's going to say, no, I'm not a thief.
No, of course I'm not. One of my favorite prompting
techniques, I don't even know ifyou call this technique is to
like, if I'm just writing some stupid copy and I, I, you know,
I use GPT-4 for it first, just to go to Claude and be like, GPT-4

(36:42):
came up with this. And honestly, I think it's kind
of crap. Like, you know, I, I really need
you to make it punchier. And Claude could be like, yeah,
I'll do it. I got you.
And I think it's the same comparison where you want to
almost pit. I don't want to say pit.
It's, it's adversarial, no. And, and I think that, you know,
using, using adversarial techniques is becoming more and

(37:05):
more popular in this because you, you especially if you're
going to go then use reinforcement learning to, to
retrain you, You don't want to overfit towards like a, you
know, a, a open AIG PT4 bias because then everything it says
is going to be perfect and you're going to have F1 scores
of one across the board. And you know that's not going to
be right. Yeah.

(37:26):
And I think honestly, we, we maybe don't make these
comparisons enough to how you think of human, fully human
organizations versus human plus what I think of as like async
digital employees, aka, yeah, yeah, LLMs.
And I'm like, it's the same way if I have a coding intern who
I'm working with, like when I'm using Cursor to vibe code,

(37:51):
it's like, hey, yeah, they're, they're very enthusiastic.
They'll go do things. You have to check their work.
You don't have time to check all their work, but you have to get
the feedback, you have to let them learn, You have to give
them rules and guardrails to approach it.
And this can change depending on the sophistication of different
AI systems of how much feedback they need, how much can be
handled by other async AI employees.

(38:16):
But it it like, just like you would for, you know, training
any team member, you can't ignore that work and just assume
that the team member has come toyou fully trained, fully
coached. And I, I think that folks who
are great managers of people aregoing to have skill sets that
are valuable for managing LLM systems.

(38:38):
At least that's my opinion. I was literally just going to
say, you know, I think all of ushave, have had experiences with
great managers and terrible managers.
And if you think of it the same way, like, and, and, and those
of us who are people leaders outthere, there, there are times
where we feel like we're doing really, really well.

(39:00):
And it's a lot of that like feedback.
It's, it's being able to look at, at, at the work and provide,
you know, great feedback or areas for, for improvement.
But it's, it's our job to be able to inspect, flag if something
doesn't look right, and then help coach and guide people.
Or autonomous async employees. I kind of like that, an async

(39:23):
employee. That's how I've been like
thinking about it in my head, cuz I was trying to frame, like, what
is an agent to me when it's, like, accurately and well applied.
And to me it's like, it's probably a junior employee.
I, I, I don't necessarily, I, I know there are some folks who
are getting really incredible results and they're saying, oh,
this is PhD level research happening here.
And in some cases I think they're getting specific

(39:45):
results, but I, I don't think it's consistent necessarily all
the time. Yet.
There's, there's a lot of hype around agentic AI systems.
And while there's a ton of potential, I know a lot of us
are expressing caution about teams jumping in too quickly
without robust evaluation, without robust testing, without
observability. Because it does feel like you're

(40:05):
letting a horde of junior employees loose and they can do
a lot of great things and they can also make mistakes that are
crucial. Yeah, I mean, we're, we're still
working on, on trying to get evals right for like a simple
RAG pipeline. And now you, you open it up to
a, to an agent that could have access to 15-20 different tools
that it's calling, get it right early and then you're going to

(40:25):
be able to, I mean, it's, it's just like we were talking about
with, with data. It's garbage in, garbage out.
You can build these really complex and beautiful agentic
systems, but if you can't trust it, how useful is it?
I feel like I have to shout out Galileo's own agentic
evaluations and our reliability platform building for agents
today and say, hey, check it out.
galileo.ai; sign up to try it out.

(40:48):
There's more information on there.
I don't spend too much time on it, but yeah, there are a lot of
foundational steps, to your point, around data preparation, data
lineage that need to occur. And as we come to the close of this
conversation, I'd love to get your thoughts, Greg, on, on kind
of where you see the needs for data professionals and, you
know, engineers in the next 6 months, year.

(41:11):
Like what do you think they should be focusing on?
Is it data lineage and is it evaluations?
Is it, you know, the rigor they're applying?
Where should their heads be? Yes, everywhere and everything
all at once. No, but in, in, in all
seriousness, I think there's, there's kind of two different
camps and, and I think I'm, you know, I'm, I'm very fortunate to

(41:34):
be able to take the career path that I, that I have and, and
I've had the, the, the, the trust for my leadership to kind
of let me go, go explore. But I, I, I think it's, it's for
one very specific reason. We'll get here, get to a second,
but I think if you're sitting there today, is a, is a, is a
data professional or an AIML professional, You know, it's,
it's being able to, to have a better grasp of the, of the

(41:58):
data, demand more from yourselves, your organization
and the industry to help ease the access to the correct data
more rapidly. A lot of these folks, you know,
they want to increase performance in their models.
You need more data. More data equals more better,
but it's really hard to find that data today.

(42:19):
So I think as an industry we need we need to, to step that
up. But I think on the other side of
the coin, and this is the tough thing I think for a lot of us to
kind of start grappling with is you're, you're building a tool
to solve a problem. Try to understand the problem
that you're solving for, you know, if you can spend time in

(42:41):
that particular industry or that particular job role or function,
whether it be actually doing the job, you know, interviewing
tons of people that, that work in that function, or, or
bring those people. I think more importantly, bring
those people into the fold and you both can kind of co-learn
from each other on this. I think if, if we can get more

(43:01):
people thinking about the business problem that they're
solving for, we're going to we're going to really rapidly
increase the pace of, of innovation and problem solving.
I love that. I think that is a great
indicator of empowered product and R&D teams everywhere is they
think obsessively about the problem they're solving.
They talk to their users, they talk to their customers, and

(43:21):
they bring that feedback in. And absolutely, we have to apply
it to AI. And I think it's really easy
when we're using this incredible technology, just get excited, as
you said, like I want to just build the newest thing, let's
try the newest one. And there's nothing wrong with
doing some of that. But to really get into
production with customers takes effort, takes work.
And I definitely recommend everyone who is listening to
check out Greg's LinkedIn for the incredible insights he shares

(43:45):
and cohesity.com for everything Cohesity does.
There's so much opportunity in the AI data space, and Cohesity is
definitely leading the way. Greg, thank you so much for
joining me today. It's been a ton of fun.
Thank you so much, it's been a blast chatting with you.
Yeah, absolutely agreed. I feel like we could go on for
so much longer, especially once we start getting into soccer.

(44:05):
Though I I really don't want to talk about how good LAFC is
right now, because my Sounders are getting killed.
So we're, we're going to actually skip that one, hold out for
another episode where hopefully we'll have a
good show for it, for another MLS After Dark conversation.
God. Yeah, after dark, indeed.
It's lovely. It's not my opinion of it.

(44:27):
I'm gonna. Are you pulling the host card?
Dang it. Conor, you just, you, you, you
made it too easy for me. It could have gone either way,
but it didn't. Yeah, yeah.
We'll go there. We'll go there.
It was great. Thank you so much, man.
This is a ton of fun. We'll link everything you've
talked about in the show notes. Any parting words for our
audience? Always be curious, you know, I

(44:49):
think always, always ask yourself, well, how does it do
that? Why does it do that?
Should it do that? Can it do something else?
Being curious is is what life's all about.
I love that that is a fantastic approach to life and I can very
much see it in your career and your mindset and, and how you've

(45:10):
explored AI. It makes things a lot more fun and it gives you a lot
of opportunities. So thank you so much, Greg, for
sharing your insights and wisdomwith us.
And that's all for this episode of Chain of Thought.
Everyone, don't forget to subscribe wherever you are
listening, wherever you get your podcasts.
We're on YouTube as well, and you can check out our YouTube
for so many more deep dives on building with AI, our webinars,

(45:32):
and much more. So be sure to subscribe.
Greg, thanks again. It's been a ton of fun.
Thanks, Conor.