
August 22, 2025 • 50 mins

In this episode of Unsupervised Learning, I sit down with Michael Brown, Principal Security Engineer at Trail of Bits, to dive deep into the design and lessons learned from the AI Cyber Challenge (AIxCC). Michael led the team behind Buttercup, an AI-driven system that secured 2nd place overall.

We discuss:

- The design philosophy behind Buttercup and how it blended deterministic systems with AI/ML
- Why modular architectures and “best of both worlds” approaches outperform pure LLM-heavy designs
- How large language models performed in patch generation and fuzzing support
- The risks of compounding errors in AI pipelines, and how to avoid them
- Broader lessons for applying AI in cybersecurity and beyond

If you’re interested in AI, security engineering, or system design at scale, this conversation breaks down what worked, what didn’t, and where the field is heading.

Subscribe to the newsletter at:
https://danielmiessler.com/subscribe

Join the UL community at:
https://danielmiessler.com/upgrade

Follow on X:
https://x.com/danielmiessler

Follow on LinkedIn:
https://www.linkedin.com/in/danielmiessler

Become a Member: https://danielmiessler.com/upgrade

See omnystudio.com/listener for privacy information.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
S1 (00:19):
All right, Michael, welcome to Unsupervised Learning.

S2 (00:22):
Hey, it's great to be here. Thanks for having me.

S1 (00:25):
Yeah. So, uh, lots to talk about here. Uh, can
you give a quick intro on yourself?

S2 (00:32):
Yeah, sure. So, uh, my name is Michael Brown. I'm a principal security engineer at Trail of Bits. I lead our company's AI and ML security research group. We really focus on two kinds of, uh, intersections between AI/ML and security. It's primarily using AI/ML technologies to solve traditional cybersecurity problems that are really hairy and really kind of sticky, and

(00:54):
conventional methods have kind of failed to address. And then we also, uh, to a smaller degree, look at, um, the security of AI/ML-based systems. So, um, I was also the lead designer and team lead for, um, Trail of Bits' team that entered the AI Cyber Challenge. Uh, we built the tool called Buttercup, which took second place

(01:16):
overall in the AIxCC. And, um, yeah, that's about it.

S1 (01:22):
Yeah. That's perfect. And that's exactly what I'd like to
chat about. Um, so I guess, um, I guess the
thing I'm most interested in is, uh, just the design
of the system, and, um, I guess overall, what you
know about the designs of the other systems. So design

(01:44):
versus design, system versus system. Whatever you want to share or can share. What are your thoughts
on that? Um, I guess everyone releases open source. So
maybe you've had a chance to look at some of
the other offerings. Maybe you've heard them talking, maybe you know,
the teams. Uh, so I guess what kind of intel
do you have on what everyone else was doing versus

(02:06):
what you guys were doing? And how do you think
that went?

S2 (02:12):
Yeah. Well, um, yeah, I guess I can answer that
last part pretty easily. It went pretty well for us. Um,
so we took second place. Uh, the team that finished in first, Team Atlanta, um, they had a pretty similar setup to ours. Um, they had more components, more moving parts, uh, more pieces. They had more hands, um, a larger team, to be able to kind of implement more, um, but ultimately

(02:34):
they had a really similar kind of set of design principles, um, to the ones that worked out for us. The third-place finishing team, Theori, they, um, had a bit of a deviation in terms of, like, their conceptual, uh, principles that guided how they built their system.
But I can get into that in a bit. Um,
I guess I can first start off by talking a
little bit about our concept. So it's interesting. Um, you know,

(02:55):
the concept for Buttercup changed quite a bit over the course of the AI Cyber Challenge. So this got announced, um, a couple years back, and there was a period of about 4 or 5 months, um, after the cyber challenge was announced but before DARPA had really released any rules. So we didn't really know exactly how the competition was going to be structured.

(03:15):
We just knew that we would have to build a fully autonomous, AI-driven system that could find and patch vulnerabilities, um, with a high degree of accuracy. Um, so the concept that I drew up along with my co-creator, Ian Smith, um, was originally really ambitious. Lots of moving parts, lots of static analysis, dynamic analysis, lots of, um, conventional techniques, lots

(03:39):
of AIML based techniques. But ultimately, once the rules came out,
it kind of got pared down quite a bit. Um, some of the things that we wanted to do were marked as, like, out of scope. Some of the stuff we wanted to do was marked as against the rules, um, just for the tractability of the competition.

S1 (03:54):
So is that because they were, they would have been
too expensive. Didn't you have budgets you had to stay under?

S2 (04:00):
Yeah. So some of it was definitely, um, budgetary and
some stuff was just, you know, flat out against the rules.
We looked at fine tuning a large language model, um,
with information about lots of open source software. And, um,
there ended up being a rule against pre-baking models. So, okay, really, kudos to DARPA for making sure that, you know, competitors
didn't have the ability to kind of, um, skew the

(04:21):
systems that they built for the test, which is, you know, finding and patching vulnerabilities in open source software. Um, so, yeah, there was a lot of stuff that got cut down. But ultimately, the design of our system was basically a pipeline. We kind of broke the problem down and realized we had to do basically 4 or 5 things really well to win this competition. We had to be able to find vulnerabilities. And not only that,

(04:42):
we had to be able to prove they exist. So
it wasn't enough just to, you know, use a static
analysis scanner and say, hey, this thing thinks there's a vulnerability on line 50 of, you know, whatever. Uh, you actually had to have a crashing test case for the first round of the competition, the semifinals. In the finals, they relaxed this requirement, but the pathway

(05:05):
to getting lots of points basically still required one. Um,
so you had to find vulnerabilities and also prove
they exist with a crashing input, or an input that
would trigger a sanitizer in the target function. Um, you
had to be able to contextualize and draw additional information
about this vulnerability. Otherwise, patching was doomed to fail. Um,

(05:25):
and then you had to actually patch the vulnerability. Um, so this is a highly complex, uh, problem that conventional approaches to software analysis have really kind of not addressed well, in my opinion, and it was a great area to use AI. And then, finally, we had to orchestrate all of these functions and do really

(05:47):
high quality engineering around all of them so that the
system would stay up and running for several days. Um,
so based on those kind of 4 or 5, depending
on how you chop them up, core principles or core
tasks that we had to do, um, we kind of
decided on an approach that we kind of call the
best of both worlds, which was, you know, we knew
that conventional software analysis, whether it's dynamic, static, hybrid, whatever, um,

(06:08):
it really excels at certain subproblems within this pipeline, and it really struggles with other ones. And AI/ML, and specifically generative AI (the competition was kind of heavily skewed towards generative AI), does really well at certain types of subproblems in this pipeline, but also really struggles with others. So our approach was pretty straightforward. We were

(06:29):
going to merge the best-in-class capability for each part of this pipeline, uh, stitch them together with high-uptime, high-reliability engineering code, um, and then focus on doing really, really well for the largest number of possible targets that we could.
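
For a concrete picture of the pipeline Michael describes here, a minimal sketch in Python (hypothetical; these are not Buttercup's actual components or names, just an illustration of deterministic orchestration with AI slotted into one stage):

```python
# A minimal, hypothetical sketch of the "best of both worlds" idea:
# deterministic code owns the pipeline; each stage is swappable.
# None of these names are Buttercup's real components.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    location: str          # e.g. "parser.c:152"
    crash_input: bytes     # proof that the bug is real

def find_and_prove(target: str) -> list[Finding]:
    # Conventional stage: fuzzing / static analysis produces findings
    # that are only kept if a crashing input reproduces them.
    return []  # stub

def gather_context(target: str, finding: Finding) -> str:
    # Conventional stage: pull the enclosing function, callers, sanitizer
    # report, etc., so the patcher gets a tightly constrained prompt.
    return f"context for {finding.location}"  # stub

def generate_patch(finding: Finding, context: str) -> Optional[str]:
    # AI/ML stage: a constrained "given this code and this proven bug,
    # propose a fix" request to an LLM. Stubbed out here.
    return None

def validate_patch(target: str, patch: str, finding: Finding) -> bool:
    # Conventional stage: rebuild, replay the crashing input, run tests.
    return False  # stub

def run_pipeline(target: str) -> list[dict]:
    accepted = []
    for finding in find_and_prove(target):
        context = gather_context(target, finding)
        patch = generate_patch(finding, context)
        if patch and validate_patch(target, patch, finding):
            accepted.append({"finding": finding, "patch": patch})
    return accepted
```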

S1 (06:51):
Okay. Yeah. Interesting. So would you say that, um, basically, those things that you described in the beginning, those are like modules, and they should almost, like, kind of work independently, so you can, like, hand a task to each of them? Is that kind of the system

(07:11):
design idea?

S2 (07:12):
Yeah. Yeah. So, um, part of this was just surviving a really rapid development cycle. This wasn't really advertised all that well, but we actually only had about three months to develop the first version of Buttercup for the semi-finals. Um, and we only had about six months to develop, um, the final version of Buttercup, or Buttercup 2.0, which took second place in the finals. Um, and that was

(07:35):
because even though each round of the competition ran for
a year, it took DARPA a while to solicit feedback
from competitors, other stakeholders, and actually solidify the rules. Um,
and until the rules were solidified, it was really risky to do any kind of development on the system. Also, certain things like the technical specifics of their competition API weren't available until later in these cycles. Um,

(07:58):
so part of the reason why we modularized each component was so that we could take smaller subteams within my larger team of about ten engineers, um, all working some degree of part-time on this system, and keep them kind of separate. You know, it gives us this integration problem that we have to deal with at the end, where we have to put everything together and make sure that it runs well. Um, but it was kind

(08:19):
of a necessity because we had to work on developing
everything independently. We couldn't afford to just do the first block and have it end up like that meme of the horse drawing, where there's a really finely defined head and then, as it gets towards the back parts of the animal, it turns into a rough sketch. That was what was going to happen if we didn't modularize this. Um, but it also helped because

(08:40):
as we decided to change out strategies or play with
different strategies, it made it really easy to kind of plug
and play different parts to see what would work later on.

S1 (08:49):
Yeah, that makes sense. So I keep having this debate
with a whole bunch of people. It's kind of around, um,
let the model do the work because the model is smarter. Um,
and it just understands what to do. And then there's uh,
the other argument, which is, um, build a robust system

(09:10):
and you have the model kind of just be the
intelligence that helps guide the system or moves things through
the system, or maybe routes, uh, across the system or whatever.
But the system itself should be set up really well,
and it's kind of like functioning as a router. And
then when the model gets updated, it makes the system better. Um,

(09:33):
but the counter to that is basically that we're just
going to design bad systems. So we should stop trying
to be rigid there and just use the model. Like
where do you guys fall on that?

S2 (09:45):
Uh, I think it was probably closest to the second one, and maybe more like an undescribed third thing, so I'll kind of go over it. Um, you know, we've been, and me in particular, I've been doing research on, like, applied AI for security problems since before, uh, the large language model became the predominant form of the technology.

(10:06):
Back to, you know, 2018, 2019 time frame. Um, and uh, realistically,
like large language models are great at a good number
of things. Um, but they really struggle with certain things.
And particularly in a challenge like this where you have
to do multiple things right in sequence in order to
be successful, you have to worry about errors that start

(10:27):
off in early stages of an LLM-heavy pipeline and compound over time, until eventually you get to the point where it kind of collapses. Um, so our philosophy on using AI, uh, specifically within the AI Cyber Challenge and also kind of more broadly, um, is to use it for, um, tightly constrained, highly contextualized problems where, um,

(10:49):
the models are set up for success. Um, so this
is actually kind of an interesting anecdote. Um, during the first round of the AI Cyber Challenge, um, the whole concept of, like, multi-agent systems, systems that have, like, tools available to them, um, didn't really exist. It was, like, in a couple of papers

(11:10):
on arXiv, and ultimately, um, the way we built our architecture for the semi-finals and for the finals is now reflective of how LLM-driven systems are just built today. So it's actually really vindicating. So, like, our patcher is like a multi-agent system. It's got multiple large language models, each with different roles to play within

(11:30):
this process, that collaborate to generate a patch and then validate it to make sure that it will, one, actually compile; two, actually fix the vulnerability that we've discovered; and three, not break other functionality within the program. So we found that trying to ask one large language model to do all of that didn't really work out. Also, in the semi-finals, the reasoning models, um, or the

(11:52):
thinking models, depending on the branding, they didn't exist, they weren't available. They weren't even available to us to use as, like, um, early-adopter models in the AIxCC. So we were dealing with simple, you know, back-and-forth, um, style chat models. Um, so we actually had to build in a lot of this reasoning as part of this, like, multi-agent architecture. We had to build

(12:14):
in a lot of like reliability and engineering code around
maintaining the pipeline. Um, fortunately, the process for um, discovering
artifacts and submitting them was pretty rigid, so it didn't really affect us that much; we didn't have to, like, put a lot of really complex reasoning in. Um, but actually, even by the end of the finals, we didn't use a

(12:36):
reasoning or a thinking model, um, in Buttercup, because we'd
actually built it in, it was part of the circuitry
or part of like the, um, the Python code, part
of our orchestration code. Um, so we had the opportunity
in the finals to take that out and let the
model do the work. We kind of explored it a
little bit, but ultimately we decided against it because the
best case scenario was that the model would kind of

(12:58):
figure out on its own how to break the problem
down and how to do individual things, and what tools
to call in sequence. Uh, but we were already subject
matter experts who did it exactly the way it should
be done. So the best-case scenario was that the model was able to replicate what we'd done, only at a higher cost per call, um, or a higher volume of tokens. Um, so,

(13:21):
we did upgrade our models. We went from the GPT-3 series, um, and the Claude 3, uh, series of models and moved up to, um, basically the gen-four versions of models for the final. So we upgraded the underlying models, but we very much, um, kept the problems very small for the

(13:42):
AI models, so that
we would avoid this issue where you have compounding errors. You have to worry about, like, these errors of, you know, deciding to do the wrong thing in sequence, and that actually turns out to, uh, penalize you heavily in these long systems. Because, you know, when a system decides, you know, hey, I've got to

(14:03):
do A, B, C, and D, and it does C before B, all of that information involved with dealing with this, like, out-of-sequence task stays in the context window. And it, for lack of a better term, kind of pollutes the model's ability to reorder those tasks and do them correctly. It has a hard time forgetting information until it rolls out of the context window. So it's a really long way to

(14:24):
say we probably did the latter version. But, um, one
thing I do want to say is, for the actual processing of artifacts through the system, we didn't rely on the AI to figure out, okay, I've got a vulnerability, now I should patch it. That was also all orchestrated, um, by our larger pipeline.
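
As a rough illustration of the multi-agent patching approach described above, here is a hedged, hypothetical sketch; the role prompts, helper callables, and loop are invented for illustration and are not Buttercup's code. The idea it tries to capture is that the LLM roles are small and contextualized, while acceptance is decided by deterministic checks:

```python
# Hypothetical sketch of a multi-agent patching loop: separate LLM "roles"
# with small, contextualized prompts, and deterministic code (not a model)
# deciding when a patch is accepted. `complete` stands in for any
# chat-completion call; it is not a real API.

from typing import Callable, Optional

def patch_with_agents(
    vulnerable_code: str,
    crash_report: str,
    complete: Callable[[str], str],        # e.g. a wrapper around your LLM client
    compiles: Callable[[str], bool],       # deterministic: does the patched code build?
    still_crashes: Callable[[str], bool],  # deterministic: replay the crashing input
    tests_pass: Callable[[str], bool],     # deterministic: run the existing test suite
    max_attempts: int = 3,
) -> Optional[str]:
    feedback = ""
    for _ in range(max_attempts):
        # Role 1: patch author, given ground truth (the proven bug), not asked to find it.
        patch = complete(
            "There IS a vulnerability here; do not question that.\n"
            f"Crash report:\n{crash_report}\n\nCode:\n{vulnerable_code}\n"
            f"{feedback}\nReturn only the fixed code."
        )
        # Role 2: reviewer with a narrow brief (does the fix address the crash?).
        review = complete(
            "Does this change plausibly fix the crash below? Answer YES or NO.\n"
            f"Crash report:\n{crash_report}\n\nPatched code:\n{patch}"
        )
        if not review.strip().upper().startswith("YES"):
            feedback = "A reviewer rejected the previous attempt; try a different fix."
            continue
        # Final word goes to deterministic checks, never to a model.
        if compiles(patch) and not still_crashes(patch) and tests_pass(patch):
            return patch
        feedback = "The previous attempt failed compilation, crash replay, or tests."
    return None
```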

S1 (14:42):
Okay. Okay. So yeah, I've seen this a lot as well.
I mean, I feel like this is a general concept
that people are coming to, which is, um, I don't
want to say legacy tech. Traditional tech is just like, deterministic. So, like,
that's the tech that you want to use to, like,
do things that matter, and then you kind of want

(15:05):
to use like AI for like a, um, I don't know,
like a router maybe, or like a, um, something intelligent
about choosing which standard tech to use, but not making, like, choices necessarily. Um, I don't know. I'm trying to figure
out how to articulate that, but it's like.

S2 (15:24):
Yeah, well, it's actually funny you bring this up. I've
had to kind of get good at articulating this, um,
over the last couple of years. So the way I've
explained this to people is that certain problems, particularly in computer science, though this kind of generalizes everywhere, lend themselves to prescriptive solutions. A prescriptive solution is something that we do when we write an algorithm to solve

(15:44):
a problem. This could be like coming up with an
answer for the traveling salesman problem. You know, we know
it's a really difficult problem to solve, but there's greedy
algorithms that do a pretty good job and for the
most part, will get you a good answer. Maybe not
the best answer, but they'll get you a good one.
So for these types of problems, you can prescribe a
set of steps to the computer and let it execute them.
Now other problems are really, really challenging to prescribe a

(16:09):
solution for. So these types of problems lend themselves to
AI or ML techniques because you can use a descriptive
instead of prescriptive solution. So a good example of this
is like image recognition. So it's really really hard to
take a picture of a cat and write a computer
program that will say, okay, based on the pixel colors
of this pixel and this position, this is going to

(16:30):
be a cat, because a cat can be in a
million different contortions. It can have different hair, the face
can be half obscured. But what we can do is
we can describe to an AI ML model what a
cat looks like with millions of pictures, because we have
millions of pictures of cats. And then it can do
a good job of solving that problem. Now it might
make mistakes, but this is better than the option that

(16:51):
you had with the traditional approach, because that approach was
awful to begin with. So a good example of a
corollary for this in Buttercup is patch generation. There's a
lot of synthetic code generation tools and a lot of
research in this area. But in terms of like automatically
generating patches to fix bugs, unless your bug is like
dead obvious, like it's missing a bounds check and it's

(17:13):
really easy to apply some sort of pattern matching to
figure out what the lower bound is, or the upper
bound is that needs to be checked. Um, tools to generate patches for weird bugs, like, they just don't exist. So this is a great place for AI/ML to help us out. And it actually turns out, um, you know, this was really proven true by the AI Cyber Challenge and by Buttercup more specifically, um, LLMs are great at

(17:35):
generating code, um, because it's one of the biggest value
propositions right now for the technology. So, um, generating patches
for bugs is tightly constrained. It's not asking it to generate all of the code that is necessary to build this entire system that I've got a spec sheet for. I'm only asking it: given this code, and given what we know about this vulnerability, how would you change it

(17:56):
to fix it? The large language models have already internalized
internalize large numbers of incremental commits to open source code
repositories that fix bugs, so they actually have a really
good track record with, um, more than I expected, even
when we started this, uh, with generating patches. So this
is a great example of where generating a patch is
something that lends itself towards a descriptive solution and a

(18:18):
descriptive algorithm, uh, or an AI/ML algorithm, versus something that's prescriptive, um, which is fuzzing. Fuzzing is a good example of a prescriptive solution. If you need to find a vulnerability and you need a crashing input, um, you have to be able to prove that it exists. It's really, really hard to get an LLM to do that, because of LLMs' underlying reasoning. They don't have, like, data feedforward. Um,

(18:42):
They basically look at source code like they look at natural language, and natural language doesn't describe the activities of an underlying state machine that runs on hardware after it passes through a compiler. So, like, you know, models look at source code in a really shallow way. Um, so when we want to find, you know, a crashing input, a

(19:04):
fuzzer is a great way, because we can prescribe a solution, which is: try everything, brute-force it. Um, just come up with different inputs, throw them in there, and then if it crashes, well, there you go, you've proven it. So that's why we used fuzzing heavily early on for one type of problem, and patching heavily for another.
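
A toy example of the "prescriptive" fuzzing approach described above (purely illustrative; real systems use coverage-guided fuzzers like AFL or libFuzzer rather than this random loop):

```python
# A toy illustration of the "prescriptive" approach: a brute-force fuzzing
# loop that mutates inputs and keeps anything that crashes. Real fuzzers add
# coverage feedback, but the prescription is the same: try inputs, keep proof.

import random

def mutate(data: bytes) -> bytes:
    if not data:
        return bytes([random.randrange(256)])
    b = bytearray(data)
    b[random.randrange(len(b))] = random.randrange(256)
    return bytes(b)

def fuzz(target, seeds: list[bytes], iterations: int = 100_000) -> list[bytes]:
    crashing_inputs = []
    corpus = list(seeds) or [b""]
    for _ in range(iterations):
        candidate = mutate(random.choice(corpus))
        try:
            target(candidate)                      # the code under test
        except Exception:
            crashing_inputs.append(candidate)      # a concrete, replayable proof
    return crashing_inputs

# Example: a contrived parser that crashes on a specific byte pattern.
def toy_parser(data: bytes) -> None:
    if data[:2] == b"\xde\xad":
        raise ValueError("boom")

if __name__ == "__main__":
    print(len(fuzz(toy_parser, seeds=[b"\xde\x00", b"hello"], iterations=50_000)))
```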

S1 (19:20):
Yeah, that makes sense. And the other problem with, um, finding vulns with, um, AI, it also seems to me that they want to please; they're heavily biased to be like, this is it, this is one, yeah, well, this is definitely a hit or whatever. And you look at it

(19:40):
and it's actually not. So I guess the intelligence is in deciding to use the fuzzer; it could help make that decision that a fuzzer should be used, right?

S2 (19:52):
Yeah. Yeah. So it's funny you bring that up. Large language models really struggle to solve problems that aren't rooted in some kind of ground truth. Um, it turns out there's a huge difference there. We have some internal research that we haven't published, but anybody could reproduce it. Um, it turns out, if you have a bit of source code and you ask the model to tell you where the vulnerability is, um, it will absolutely

(20:15):
hallucinate a vulnerability, because it wants to please you. Uh, one of our principal researchers, Artem, he's a great guy, he, um, downloaded the, um, formally proven correct portions of, uh, of Linux and asked a large language model several hundred times: um, here's a snippet of code,

(20:35):
it has a vulnerability, where is it? And every single time, it would manufacture a vulnerability, because it wants to find the answer. So it turns out, when we started asking it, is there a vulnerability?, um, it messed up a little less, but it would still assume that, because you're asking, there's something to find, and it would still mess up quite a bit. So that's why, when we're using, um,

(20:57):
large language models for generating patches. It's great because we
know there's a vulnerability because we found it and we
proved it, and we can collect additional information.

S1 (21:04):
Yeah.

S2 (21:05):
So now I don't have to worry about asking the model, hey, do you think there's a vulnerability, and if so, patch it. I say, no, there is a vulnerability, it's here, this is extra information about the code that touches it, now generate a patch. And the model is very good at doing that, because it takes away the decision-making, or the judgment call, that large language models are really, really bad at, because they don't actually model judgment calls underneath

(21:27):
their architecture. They model, you know, sequencing information, sequencing tokens, and when you write code, you're writing a sequence of tokens. So these problems tend to be, um, a lot more suitable than other problems where you're asking it to find the ground truth for you; those are bad problems for LLMs. Asking it to take ground truth and expand upon it: great applications for LLMs.

S1 (21:48):
Oh man, I love that. And this also goes to
your previous point of not wanting to pollute the context
for the current task at hand, which is building that patch,
because if you have like some history of like there
were previous decisions made or previous questions asked or whatever
it might get like diverted, you know?

S2 (22:06):
Yeah, absolutely. It's um, it's a, it's a big challenge particularly, um,
I don't know, it's funny. I've, I've been kind of
trying to sing this gospel internally, uh, at Trail of
Bits and to other people who will listen that, um,
the increasing size of context windows is not always your friend. I mean, if you think about how the large language model works

(22:27):
under the hood, it's using these contexts to attune the
model to certain parts of its training data that are
going to be highly relevant to solving your particular problem.
And the more words and the more tokens you put
into the context window, the more you are kind of
nulling out or, um, numbing the attention mechanism. You're forcing
it to become more and more general, because now there

(22:48):
are more tokens that are affecting these attuned probabilities. So
you actually are better off using less. Now, a big context window is great, because if you need, let's say, a million tokens in your context window to constrain the problem, then use a million tokens. But if
you can do it for 1000 or 10,000, you're going
to get better results because you're more likely to focus

(23:09):
that model where it needs to be.

S1 (23:12):
Yeah, I love this. Like, by the way, this is great. Um, I'm going to create a lot of content out of this, um, because it's really crystallizing, like, something is starting to form in my mind. I'd love to work with you on it. Um, essentially,

(23:34):
what I'm trying to think of is, um, what are
some general statements that we could make? Um, one that
I'm sort of heading in the direction of, you tell me if I'm wrong, and this might be overstating it, but, like, the system itself should be highly modular and, as much as possible, made up

(23:56):
of traditional and deterministic tech. And then the way that
you use the AI is for the specific type of problem,
which we're going to articulate the way you articulated it
for those types of problems where routing is needed to
the traditional tech. Um, and it's like, don't just go

(24:18):
crazy with AI. Don't ask it questions that the traditional
tech should be answering. Um, it's something like that. And
then ultimately you have like this dependable deterministic system with
the minimum amount of AI that is required to move
appropriately through that system.

S2 (24:40):
Yeah. So, yeah, really it comes down to problem formulation. And this is part of the reason why you see such a huge overlap in interest between people from a computer science background and people from, like, data science backgrounds here, because, you know, one of the basic things you learn in computer science, like when you get to the graduate level, is problem formulation. It's how to recognize

(25:02):
your problem as a derivative, or maybe, like, a dressed-up version of some other problem. So, you know, right away, um, okay,
I have this problem of, okay, I've got to manage
this delivery system. How do I make this delivery system, um,
for Amazon efficient? You can recognize this right away as, oh,
this is traveling salesman. There's no good way to do this.

(25:22):
But what I can do is get a good answer. I just have to accept that my answer is going to be imprecise or not
necessarily optimal. Um, and in applying AI and ML to
security problems or any problem in general, the first step
is very much like problem formulation. It's understanding what kind
of model is going to work best for this problem,
because is this a problem that will work well with

(25:45):
a time series model, because my data is coming in
over time, or is this a problem that's going to work well with, um, let's say, like, linear regression, because there is some true underlying probability for how the data is distributed that I'm trying to learn? One of, like, the kind of curses of large language models

(26:05):
is that they have abstracted all of these good data science practices away. And now, it's great because it democratizes it: anybody can use AI, anybody can use an LLM, and all you have to do is be able to articulate your problem. The problem is that it also abstracts away problem formulation, and now we're starting to use LLMs, because they're accessible, for certain

(26:26):
types of problems that they're really not well formulated for. Um.

S1 (26:30):
Yeah.

S2 (26:31):
So this is kind of where we get to the issue. The good news is we don't have to just, like, say, okay, well, I can't do problem formulation with an LLM, so I'll just throw it away, not use it, and go back to, you know, TensorFlow and writing my own models and stuff. What we really have to do is get to what you were describing, which is: rather than throw the LLM at a large problem, we take it a step further. We break the problem down.

(26:52):
Are there subproblems that are highly amenable to AI solutions?
I have a litmus test that I pass, um, you know, problems through, and I try to encourage my team members to use, um, which is, you know, basically like a check to see whether a problem is good for AI/ML. And it's usually: one, do you have enough data that you can train the model? In this case, it now becomes, does the

(27:13):
LLM have examples of this on the internet that it can draw from? Or are you asking it to do something like reverse engineering, you know, firmware code on some obscure chipset where, like, there are no examples on the internet, so it won't have anything to draw from? Number two, um, is there some probabilistic nature to the underlying data? This actually

(27:36):
makes large language models really bad for a lot of
security problems, because they're what we call non-differentiable, meaning that
they don't have, like, this nice curved space that you can use stochastic gradient descent, or virtually any optimization function, to try and climb and find a good answer. It actually exists more as, like, this kind of cloud with dots of answers all over the place, if you were to try and imagine the answers to security questions

(27:59):
in like a mathematical graph.

S1 (28:01):
Okay, what's an example of one of those? I'm trying to think of what that space might look like.

S2 (28:08):
Yeah. So a good example of, like, a problem that is differentiable is, like, housing prices. So housing prices vary by, you know, size, square footage, number of rooms, zip code, quality of the schools. So when you plot these all out, you get something that you can do linear regression on. You can see, like...

S1 (28:24):
A.

S2 (28:25):
Little loop. And that's called a differentiable function because it's
a continuous line that you can draw through the data
that more or less minimizes the error of those points
along the line.
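
A quick illustration of that differentiable case, using invented housing numbers and a closed-form least-squares fit:

```python
# A tiny illustration of the "differentiable" case: made-up housing data where
# price varies smoothly with square footage, so an ordinary least-squares line
# fits well. (Numbers are invented for illustration only.)

# Square footage and sale price (in $1000s) for a handful of fictional homes.
sqft  = [1100, 1400, 1650, 2000, 2300, 2800]
price = [210, 255, 300, 360, 410, 495]

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n

# Closed-form least squares: slope = cov(x, y) / var(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price)) / \
        sum((x - mean_x) ** 2 for x in sqft)
intercept = mean_y - slope * mean_x

print(f"price ~ {intercept:.1f} + {slope:.3f} * sqft")
print(f"predicted price for 1800 sqft: {intercept + slope * 1800:.0f}k")
```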

S1 (28:34):
Yep.

S2 (28:35):
But if we want to think about, um, let's say
now optimizing a program, we can take a look at
how ordering certain steps or changing the way we implement
certain functions changes the speed of a program up and down, and that becomes kind of pseudo-differentiable. It's more like a step function, where you have kind of, like, little lines where, if I change this one thing,

(28:56):
it jumps up a little bit, it's more jagged, but it's still, um, close to differentiable, because I can kind of map deterministically how it behaves if I run it, you know, with this set of compiler optimizations or that one. It's definitely not differentiable, but it's closer. Security is just
wild because the flaws in computer programs can come from

(29:17):
one of a million different sources. It can be a
logic bug, it can be a misimplemented function, it can be the use of an unsafe function, which is easy to find. There's no way for us to take, um, root causes for vulnerabilities in software, and solutions to them, and plot them on a graph, because they come from unquantifiable sources. Some of them, like, you know,

(29:39):
Spectre and Meltdown and stuff, they're resident in hardware and the implementation there. Some are purely in software, like XSS-type vulnerabilities. It's, um, it's not even apples and oranges. It's like trying to compare apples and fighter jets. Um.

S1 (29:56):
Is it a matter of, like, the tensor size, or, um, I think that's called tensor size, I can't remember, the number of dimensions in the space? Because when you're looking at square footage and price, that's what you have, right? Is the problem in security that there are just so many dimensions that, um,

(30:18):
when you try to plot it and simplify it, it just becomes garbage?

S2 (30:22):
Well, it's a matter of common dimensions. So if you
build a house, every house has square footage.

S1 (30:28):
There you go.

S2 (30:29):
And you can calculate the space underneath. But a cross-site request forgery vulnerability in a, um, you know, piece of JavaScript code that exists on the web has almost nothing in common with a memory corruption vulnerability in a C program running on a router in your home. They are implemented at different levels of abstraction. You know,

(30:51):
like even the program representations are different because some of
the vulnerabilities might exist only in binary code after it's
been compiled versus other vulnerabilities that are resident in source
code that's interpreted by a web browser. Um, so really what it is, is it's like trying to plot, you know, the prices of homes along with the prices of, um, I don't know, oranges in a

(31:14):
particular year. You know, there's very little in common between
a house and an orange other than maybe some, like,
you know, global macro effects that might show some correlation,
you know, economic factors like inflation.

S1 (31:28):
Or like the beating of a whale's heart to determine
whether or not it's healthy. It's like completely different. Uh, yeah.
Completely different sports. Yeah. Yeah, yeah.

S2 (31:39):
Yeah. So, really, it's a lack of common dimensions in cybersecurity, which is why, you know, if we were trying to model, like, what the data would look like, if we could visualize it, it would just be a bunch of points of presence out there, um, within this, like, kind of large cloud. Um, and even then, there's another problem that kind of makes cybersecurity really hard to model with

(31:59):
AI/ML, which is that there is really comparatively little data, um, in terms of, like, the volume of data. There are tons of vulnerabilities out there, but if you're trying to make a model that's really, really good at, let's say, detecting, um, buffer overflows in embedded device code, um, you're going to find some data for that, but there's not that much. You have to rely on, like, POC write-ups

(32:21):
on the internet from practitioners who put them out there for fun. Um, but there aren't a million examples of that like there are if you want to say, I want to train a model to write the Great American novel. There, you can take every
novel ever written, throw it in there and then see
what the model comes up with. If you prompt it
with like a general plot line, it's going to do
a lot better at that because, you know, that data

(32:42):
fills in that space a lot more. Um, so, yeah. Like, the challenges in problem formulation are really big, and, um, yeah, that's why I kind of encourage people, when they look at these, like, okay, I want to build an AI/ML-driven system: um, take a look at what subproblems are actually suitable for AI/ML. Use it there. And I think you'll also find that

(33:03):
a lot of the time, we have a tendency to, like, say, okay, let's just kind of throw large language models at some of these problems that we know we could really solve with regular code. Um, and that's really bad because of this compounding error problem. So, you know, if I've got five steps in sequence that I've got to do, and step three is good for AI/ML and step four is good for AI/ML, it's like, okay, well, look, almost half of this problem is,

(33:25):
you know, is something I'm going to ask the model to do anyway. I'll just ask it to do one, two, and five too. Well, the problem is, it can make a mistake in one, it can make a mistake in two, and those compound before you get to three and four. So you're better off, you know, implementing one and two in code. And then maybe you ask the model just to finish it off and do step five, because it's the final step. It's had ground truth rooted in steps one and two, and steps

(33:46):
three and four. If they're well-contextualized problems, maybe the false positive rate is low enough that you can afford to just let the model kind of finish it up for you. But that's the biggest jump I would take. Usually that step five is, like, validation or correctness checking, and that's not something you want to ask the model to do, because

(34:07):
it has the tendency to, um, one, want to kind of, like, please itself and say, oh yeah, it looks great to me, or, two, um, depending on how you phrase it, find something that doesn't exist. And validation is a problem that typically is, uh, pretty amenable to, like, deterministic code.

S1 (34:27):
So I really love this. Um, where this is taking me is designing, like, a, uh, a general problem solver. And I'm imagining, like, the smartest model that you have, you know, Opus, whatever, or, like, the best Gemini or

(34:47):
whatever, or whatever the best model is. But then
what you do is you say, okay, uh, the problem
is we need to design a system that, uh, you know, properly,
deterministically solves this problem with a high level of accuracy.
For example, the vulnerability problem that you guys worked on.
And then what I love is the idea of you

(35:11):
present to the model all these different AI models and
all these different deterministic technologies, all as solutions. And then
you do what you said, which is you, um, break
down the problems that need to be solved at every
level of the subpieces. Right. And then you match each

(35:33):
of those little problems to either one or many of these AIs, which are bigger or smaller, have different weaknesses or whatever, or even ML, not LLM-based, versus deterministic, with the rule of, like, look,
use the appropriate one for this problem type. And then

(35:56):
maybe you have a whole bunch of training about problem
types and solution types. And then it picks which one
to use for each step. I mean, is that...

S2 (36:08):
You mentioned this. I think this is what some of, like, the large, you know, third-party ML-as-a-service providers like OpenAI and Anthropic are kind of trying to do, if you've heard of, like, this concept of mixture of experts models, um, it's, uh...

S1 (36:19):
That's true.

S2 (36:20):
Yeah. It's this concept where, you know, like, the actual interface we have to, maybe, GPT-5... and I haven't looked at the source code, I don't work at OpenAI, so I have no idea if this is how it works underneath the hood, but it's been kind of theorized, and it's even been mentioned, you know, a bit by people who've kind of looked at the models a little bit closer, that, you know, when we fine-tune a model to make it really good or

(36:41):
really suitable for a particular purpose that's amenable to AI/ML, it can still be challenging to, um, have it interface with the user in the way that, like, a high-quality chatbot would. So, yeah, a mixture of experts model suggests, like, having an interface, like a bot that interacts with the user, but that then recognizes certain classes of problems and routes them to the right expert. So, oh,

(37:04):
they're asking me about cyber, I'll ask, you know, um, cyber GPT to handle this one. Oh, they're asking about, you know, mental health, I'll ask, you know, mental health GPT to help out here. Um, so, you know, this kind of, like, concept, I think it's trying to be realized, or at least it's been

(37:24):
thought of, um, in terms of using, like, all AI/ML solutions. But yeah, I agree, like, the way forward is to have, um, you know, for, like, rapid prototype development, components that do certain things well. Um, and honestly, it's, like, reflected in software: we have libraries for sorting, we have libraries for cryptography. Nobody should be writing their

(37:47):
own cryptography code; use a library. Um, you know, the more we have these high-quality libraries and, um, fine-tuned ML applications, or ML models for certain types of subproblems, the closer we get to being able to kind of compose all these together. And the good thing is that an LLM is probably pretty good at writing the glue code to sequence all this stuff together.

S1 (38:06):
Yeah, yeah. Because that's the trick for me, because inside of a mixture of experts, you're already inside the LLM. What I'm thinking of with this higher-level model is, like, look, we're doing, um, matrix math over here, we're doing multiplication over here. Um, guess what? This problem space is not associated with an AI. We don't even

(38:26):
know if AI will ever touch this. We hand it to
our fastest and best, you know, deterministic addition function or whatever,
you know, and it's like maybe 95% of the whole
app ends up being traditional tech that doesn't involve AI,
other than the routing to get there.
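
A hypothetical sketch of that routing idea, with a trivial keyword router standing in for whatever intelligence (possibly an LLM) makes the choice; the tools themselves stay deterministic:

```python
# Hypothetical sketch of routing: the "intelligence" only decides which
# deterministic tool handles a subproblem; the tools themselves are ordinary
# code. Here the router is a trivial keyword match standing in for whatever
# classifier (or LLM call) you'd actually use.

import math

DETERMINISTIC_TOOLS = {
    "matrix_multiply": lambda a, b: [[sum(x * y for x, y in zip(row, col))
                                      for col in zip(*b)] for row in a],
    "add": lambda *nums: sum(nums),
    "sqrt": lambda x: math.sqrt(x),
}

def route(task_description: str) -> str:
    # In a real system this decision might come from an LLM; the point is
    # that its only job is picking a tool, not producing the answer.
    text = task_description.lower()
    if "matrix" in text:
        return "matrix_multiply"
    if "square root" in text:
        return "sqrt"
    return "add"

def solve(task_description: str, *args):
    tool_name = route(task_description)
    return tool_name, DETERMINISTIC_TOOLS[tool_name](*args)

print(solve("add these expenses", 120, 45, 60))
print(solve("multiply these two matrices", [[1, 2]], [[3], [4]]))
print(solve("what's the square root of this?", 1764))
```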

S2 (38:43):
Yeah, I mean, that would be ideal. I mean, anything you can route... yeah, I don't know. It's funny. Really, what it comes down to with using large language models and, like, solving large problems, is that it becomes a conditional probability problem. And even if you get the right answer 99% of the time, um, over and over and over again,

(39:08):
you still have a high likelihood of failure by the
time you compute all the conditional probability out. It's kind
of funny. Like, I kind of learned this lesson in a completely different walk of life. Um, after I got my bachelor's degree in CS, I worked for, like, a year doing, um, software engineering and kind of found it to be dull, so I did

(39:30):
something completely different. I joined the Army and I started
flying helicopters. Um, that picture there, that's actually, you know, me up at Camp Dwyer in RC Southwest in Afghanistan. It's, um, a picture that was taken of our aircraft on the flight line, and one of my
jobs as a pilot was to educate our junior pilots
on this concept of, like, mission survivability. Um, and that's

(39:52):
the idea of, um, you know, understanding what's called, like, the kill chain. The kill chain has been pretty popularized in security as well. But, you know, basically, for a compromise, whether it's shooting down an aircraft or
breaching a database, like a lot of things have to
happen and they all have some sort of probability. And
your goal in breaking the kill chain or breaking the
exploitation chain is to reduce any one probability down to zero,

(40:13):
because then the compound, or conditional, probability becomes zero. Um,
but the probabilities can be really weird. I used to
talk to my junior pilots and ask them like, hey,
what do you think is like the acceptable loss rate
on any of the missions that we fly here in theater?
And they would usually give me answers like they were
pretty close. They'd say like 90% or 95% or even 99%.

(40:35):
So I would actually walk them through the math problem. I'd get out the whiteboard and I'd say, okay, let's assume it's 99%. I'd say, okay, how many aircraft are we flying a day? Okay, you know, we have ten total aircraft, we go on five missions a day, so that's five aircraft going out there. And let's say there's only a 1% chance that each one of them gets shot down. Okay, so that's five aircraft a day. But we're going to be in theater for nine months. We'll round it off.

(40:58):
We'll make it a year; we're going to be here for 365 days. So now, if I take 365 and multiply it by five, that's the number of missions we're flying in the entire time we're here. This number comes out to be pretty high. And now, all of a sudden, if I lose one aircraft for every 100 missions, you realize that I actually run out of aircraft in the first two months of being in theater.

(41:20):
And now, all of a sudden, the troops don't have helicopters to fly.
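
The back-of-the-envelope math here is easy to check; treating the anecdote's numbers purely as an illustration:

```python
# Rough arithmetic behind the whiteboard example, using the numbers from the
# anecdote purely illustratively: a 1% per-sortie loss rate against a
# 10-aircraft fleet does not survive a year of operations.

sorties_per_day = 5
days_in_theater = 365
per_sortie_loss_chance = 0.01
fleet_size = 10

total_sorties = sorties_per_day * days_in_theater         # 1825
expected_losses = total_sorties * per_sortie_loss_chance  # ~18, nearly twice the fleet

# Probability the fleet takes zero losses across every sortie:
p_no_losses = (1 - per_sortie_loss_chance) ** total_sorties

print(f"total sorties: {total_sorties}")
print(f"expected losses: {expected_losses:.1f} (fleet size is {fleet_size})")
print(f"chance of zero losses all year: {p_no_losses:.2e}")   # roughly 1e-8
```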

S1 (41:23):
Yeah.

S2 (41:24):
So I said, actually, believe it or not, our required mission survival rate is something more like 99.99999%. Um, we can almost never lose an aircraft, or rather, we can almost never accept any type of probability that means we have even a remote chance of losing an aircraft, because we will deplete them. It's a limited resource. Um, solving problems with LLMs is the same way. If you ask

(41:46):
them to solve 15 problems in a row, even if it's got a 99% chance, which would be amazing if any LLM could get anywhere close to that, even if it has a 99% chance of answering every single problem right, over the course of a year it's probably going to give you answers that are wrong almost 80% of the time, if that chain is long enough and

(42:07):
if you have enough problems that you feed through it.
So that's one thing I try to, um, help people conceptualize about over-relying on large language models, and try to help them understand this, like, compounding error problem. It's really a compounding conditional probability problem, and your tolerance for false positives is actually zero. So anywhere

(42:28):
in this chain that you can... we have to think about this differently now, because I can't reduce anything to zero. But what I can do is take certain parts of the chain and bump them up to 100%, meaning my chances of getting something right when I use a deterministic algorithm are 100%. So now I no longer have some sort of fractional probability there. So in this 15-step problem, now let's say 12 steps

(42:49):
I do deterministically. Now I only have a three-step chain, and now that 99%, I've got to get it right only three times. You simplify this problem. Now I might be able to make it through a year's worth of operations at, you know, 100 examples of the problem a day. I might be able to make it through that with a false positive rate of... I don't know what the math is in my head, I'd have to punch it out.

(43:09):
But that false positive rate might be a lot more
survivable in an operational world than, you know, 15 conditional
probability problems that are all 99%.
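
To make the compounding concrete, a quick sketch with the numbers from the conversation (per-step accuracy compounds multiplicatively, so reducing the number of LLM-dependent steps matters more than a small accuracy bump):

```python
# The compounding-error math in miniature: per-step accuracy compounds
# multiplicatively, so shrinking the number of LLM-dependent steps (by doing
# the rest deterministically) has an outsized effect.

def chain_success(per_step_accuracy: float, llm_steps: int) -> float:
    # Deterministic steps contribute a factor of 1.0, so only LLM steps count.
    return per_step_accuracy ** llm_steps

print(f"15 LLM steps at 99%: {chain_success(0.99, 15):.1%} per run")   # ~86.0%
print(f" 3 LLM steps at 99%: {chain_success(0.99, 3):.1%} per run")    # ~97.0%

# Over a year at 100 runs/day, expected number of runs with at least one error:
runs = 100 * 365
for steps in (15, 3):
    p_fail = 1 - chain_success(0.99, steps)
    print(f"{steps} steps: ~{p_fail * runs:,.0f} of {runs:,} runs contain an error")
```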

S1 (43:18):
Yeah, yeah, I love that. The way I describe it is, um,
what's 1% of 100 metric tons of problems.

S2 (43:27):
A metric.

S1 (43:29):
A metric ton of problems?

S2 (43:31):
Yeah, I love that. I love that.

S1 (43:34):
Yeah. Yeah. Um, so, uh, we share this in common, actually. So, um, I was also Army, and I was at Fort Campbell, so I was air assault, so I had to do all the helicopter stuff.

S2 (43:48):
Uh, right on, man. Hell, yeah. Brother.

S1 (43:50):
Yeah. That's cool. Airborne air assault. Right? Um, yeah.

S2 (43:55):
No, yeah. I was, um, uh... this picture was taken when we were doing, uh, medevac chase. Uh, we did security for those guys over there, but I was in an air assault battalion, so we literally did nothing but fly you guys around, so.

S1 (44:07):
Oh.

S2 (44:08):
Nice man. Small world. Dude.

S1 (44:10):
Yeah, yeah.

S2 (44:12):
Yeah, so you were over at Fort Campbell. I was at, um, I was at Fort Riley, uh, in the 1st CAB, and then, um, I PCS'd from there after I went to Afghanistan and went to the 82nd. Um, so I never quite got to Campbell, which, like, would have been great, because I live here in Ohio, in Cincinnati, it's like where I was from. So I was always trying to get to Campbell, because it was only, like, 4 or 5 hours from home

(44:33):
and I'd be able to see family a lot easier. But I ended up, like, 12 and nine hours away, respectively, so, uh...

S1 (44:39):
Yeah. Well, that's super cool. Yeah, well, we need to chat some more, man. This is, like, really, really cool stuff. Um, what you guys did on the team is cool, but I'm even more excited just about the way you think about these things. Um, I'm, uh, happy that, um,
the way you're thinking about it is similar to the

(45:00):
way I'm thinking about it. You've taught me a lot just during this thing. We should definitely chat more after this. Um, anything else you want to share about the competition or, um, lessons learned? Um...

S2 (45:15):
So I think one of the things that came out of the competition, um, was a lot of vindication. Sorry, I nudged my mouse. Oh. So, um, I'll just

(45:36):
I'll just go right into the answer; I assume you can edit this later or something, but yeah. Um, so, yeah, one of the things that came out of the competition was honestly a lot of vindication. Um, like I had mentioned before, you know, when we started off this process, um, this was two years ago, which has been two lifetimes in the development of, like, AI-enabled systems for any problem, much less cybersecurity. Um, so

(46:00):
a lot of the things that we did, like tool enabling, um, and multi-agent systems, were things that we did before things like MCP or, um, complicated libraries for supporting this existed. Like, we used early versions of, um, LangChain, uh, for some of our multi-agent stuff, but we actually ended up having to write and implement a lot of our own glue code for this. Um, so

(46:23):
it's really vindicating to see those techniques, while we were doing the competition, become not only, one, commonplace, and, two, supported by the major large language model providers, but also be adopted and used generally by the community. Um, you know, it was really great that we came in second, and that the first-place finisher also used, um, problem-solving techniques that are

(46:46):
well suited for the problem, this approach of, yeah, don't use AI everywhere. Um, the third-place finisher, Theori, they were a little bit more LLM-forward, but they still had a lot of, like, traditional components.
I don't think any team really went after this, like, all-LLM approach and tried to just do everything within the LLM. Um.

S1 (47:05):
I bet a lot started that way, and they fell back from it. Yeah.

S2 (47:10):
Yeah. Yeah, I think at least one of them, um, at least one team, I think All You Need Is a Fuzzing Brain, I think in the semi-finals their approach, um, tried to just use an LLM to augment a fuzzer to find vulnerabilities, and I don't think they really had much of, like, a solution for patching, but it was enough to get them to the finals. They had a more well-rounded system, I believe, uh,

(47:31):
in the finals. Um, so yeah, it was kind of
vindicating to also see that all these other bright minds
out there were similarly of the mindset to do this. But, um, one of the biggest takeaways I have, one that I'll say was different from what I expected, because it's really easy to pat myself on the back and say, oh yeah, the plan I came up with worked great, that's awesome... but, um, I will say that I was

(47:52):
really surprised at how good large language models eventually became at helping us generate patches and also helping us generate seed inputs to improve fuzzer performance. Those were areas where
I didn't really give the LLM a lot of credit
up front, but I had to build an autonomous system,
so I had no choice. They really outperformed my expectations.
So I kind of came out of this with, um,

(48:13):
a bit of a healthier respect for the capabilities of
AI models. Once again, these are still highly constrained.

S1 (48:20):
And yeah, yeah.

S2 (48:21):
Very context rich problems that we ask them to do,
but they still did way better than I thought they
were going to do. Um.

S1 (48:27):
Yeah. And also context-constrained, not polluted, like a very controlled context for that thing, like you were talking about before, right?

S2 (48:37):
Yeah. Yeah. Um, yeah, I think that's about it. Unfortunately, I do have to jump off; I've got another call at 12:30. But, um, yeah, I'd love to chat more and talk more with you at some point, if you want to do a follow-up episode or, I don't know, you just want to chat about other stuff. Um, you know, we've got a couple of friends in common, uh, between Clint and Keith, and it's, uh,

(48:58):
you know, I've run into you a couple of places on various calls and stuff that we've been on, but, um, it was good to get a chance to talk with you one-on-one. I feel like we've been kind of, like, circling around in the same circles for a while, but I hadn't had a chance to, like, actually just chat, the two of us.

S1 (49:11):
Yeah, absolutely. Well, thanks. Thanks for the, uh, the input.
This is just, uh, fantastic stuff. And, uh, let's definitely
catch up soon.

S2 (49:20):
Yeah. Sounds good man. Take care of yourself.

S1 (49:21):
All right. Take care.