
October 7, 2025 47 mins

Join the Tool Use Discord: https://discord.gg/PnEGyXpjaX


Coding agents are revolutionizing software development, but are you getting the most out of them? Many AI coding tools struggle with complex, long-running tasks, often getting lost in the middle of a project. In this episode of Tool Use, we sit down with Kiran, one of the core contributors to Slate, a research-driven coding agent built for the toughest production-level challenges.


Kiran shares a deep dive into how to properly set up your dev environment to maximize the performance of any coding agent. We discuss the critical role of feedback loops, clear development pipelines, and consistent codebase patterns in getting correct, reliable code from LLMs. You'll learn why models are great at following patterns but terrible at creating them, and how to structure your projects for AI success. We also explore what makes Slate different, including its unique approach to context management that keeps it on track during complex tasks, preventing the performance degradation common in other agents.


Find Slate here:

https://randomlabs.ai/


Connect with us 

https://x.com/ToolUsePodcast

https://x.com/MikeBirdTech


00:00:00 - Intro

00:00:54 - Setting Up Your Dev Environment for AI

00:15:18 - Is Software Engineering Dead?

00:18:20 - Why Coding Agents? The Origin of Slate

00:21:52 - What Makes Slate Different?

00:33:19 - The Bottleneck: LLMs vs Tooling

00:44:30 - Why LLMs Get "Lost in the Middle"


Subscribe for more insights on AI tools, productivity, and coding agents.


Tool Use is a weekly conversation with the top AI experts, brought to you by ToolHive.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
Your goal with your whole dev setup is that you have a
pipeline that is very clear about did this work or did this
not work? Models like to role play a lot,
and so if you give them a strong narrative, then they're really
effective. And if you give them a weak
narrative, then they're not as effective.
You can actually get like completely correct code as long
as the code base has very consistent rules baked into it.

(00:22):
What you're asking for is like a factory for software, but you still have to design the thing.
Coding agents are taking over. Every research lab seems to be putting out their own CLI tool for coding agents. IDEs have AI baked in now, but is this the best way of doing it? On episode 60 of Tool Use, brought to you by ToolHive, we're talking to Kiran, one of the core contributors to Slate. Slate is a research driven

(00:43):
approach to coding agents that does things a little differently. So today we're going to learn about what makes Slate different and how to set up your environment to make the most of coding agents.
So please enjoy this conversation with Kiran.
In order to get the best out of any of these tools, you really
want to make sure that your dev environment is set up in a way
that the model will actually benefit from.

(01:05):
And so at this point, it's like, I think, a known thing out in the open, but all of the models that are trained now need to kind of have some sort of feedback mechanism about the correctness of the things they're doing. Otherwise they end up

(01:26):
making pretty stupid decisions.
And so an easy way to do this is to provide it things like, well, to give it awareness of things like the build commands or the, you know, dev server commands or practices for how it should do
certain types of debugging, etcetera.

(01:47):
And so you have the ability to do this through things like
rules files or you can just structure your code base in a
way that it like logically makes sense.
But rules files will probably be your best friend there.
I think. Like, let's say in your rules
file, you describe different debugging approaches and you
describe how you can build it, how you can run it, how you can

(02:10):
install everything, how you can set up the environment.
All of those things are really, really important because you're
taking like essentially all the reasoning that it would have to
do around setting up the environment and caching it
somewhere for it to use. And what that means is that
the model then has a like quasi tool, right?

(02:32):
Like if it can run those commands, if it can set up the environment in that way with its existing tools, you've essentially given it basically another tool that it can use for validating its work. And what we found pretty
consistently is that if you get something like Slate in a loop

(02:52):
where it gets feedback from its environment about like what it
did and whether or not what it did was correct, then you
actually can get like way better performance.
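To make that concrete, a minimal sketch of the kind of rules file being described might look something like this. The file name, commands, and log path here are illustrative assumptions, not details from the episode:

    # AGENTS.md — rules for coding agents working in this repo
    ## Environment
    - Install dependencies: npm install
    - Start the dev server: npm run dev
    - Type-check / build: npm run compile
    ## Verification
    - Run the test suite before declaring a task done: npm test
    - A change only counts as done when compile and tests both pass.
    ## Debugging
    - Reproduce the failure with a small script or test before editing code.
    - Check the dev server output in logs/dev.log before guessing at causes.

The point is simply to cache the environment-setup reasoning somewhere the model can always see it, which is the "quasi tool" idea described above.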
And this should be obvious, right?
It's like, OK, yeah, you give a person a task and they go write
some code. You know, like some top

(03:14):
percentage of engineers are going to be able to write that
code completely correctly just from the mental model they have
in their head of everything. But most engineers are going to
need something like the ability to do npm run dev or, you know, npm run compile. And that I think is super
important. The same thing goes for tests.
So you want to have probably a good testing set up so that you

(03:35):
can ensure that nothing broke when you add new things.
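As a small, hedged example of what that can look like in a JavaScript project (TypeScript and Vitest are assumed here purely for illustration), the npm scripts can be written so the agent gets an unambiguous pass/fail exit code:

    {
      "scripts": {
        "compile": "tsc --noEmit",
        "test": "vitest run",
        "test:watch": "vitest"
      }
    }

Pointing the agent at npm test (the one-shot run that exits non-zero on failure) rather than the watch-mode script keeps the feedback loop machine-checkable.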
It does get weird when you ask the model to write the tests for
you, because you then have to verify that the tests themselves
are correct, and the only way you can really do that is via

(03:55):
inspection. So you literally as a person
have to walk through the tests it wrote or you have to write
them yourself, in which case you should probably actually focus
on making your testing infrastructure as good as it
possibly can be and as complete as it possibly can be.
So that the model has this like hill that it can actively hill

(04:16):
climb, because it's just trained to do that.
So you may as well just take advantage of the like natural behavior that it wants to exhibit and just let it do that.
You just need to put the guardrails in place so that the feedback it gets from your system is designed in such a way that it makes it easy for it to recognize where its

(04:39):
code failed. And another thing here is a lot of people have talked about how Sonnet was very reward hacky, and what this manifests as is it's like deleting tests, deleting code, right? Oh yeah, it just deleted my entire code base and now all the tests pass, cool, great, right?
But it's because there's no tests.

(05:03):
And so that's the one thing that you have to kind of be careful
about here. And so you can also do things
like ask it to verify, you know,like whether or not there were
changes to the X file or Y file, or you can put in pre-commit
hooks to make sure that there were no changes to that file.
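A minimal sketch of that kind of pre-commit hook, assuming a plain git setup with the test suite under a tests/ directory (both assumptions for illustration):

    #!/bin/sh
    # .git/hooks/pre-commit (must be executable: chmod +x .git/hooks/pre-commit)
    # Refuse commits that touch the test suite; a human reviewer bypasses this
    # deliberately with: git commit --no-verify
    if git diff --cached --name-only | grep -q '^tests/'; then
      echo "Refusing to commit changes under tests/ (use --no-verify to override)." >&2
      exit 1
    fi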

(05:26):
And so something you could do is like you could take your agent and you could have it like do all of its work, run all of its tests to make sure everything builds and do a commit, and have a pre-commit hook that you manually bypass as a user where
you can use that as a mechanism to check to see like, OK, like

(05:48):
the agent didn't mess anything up, right?
It's just another form of code review.
It's just easier. And so then you could probably
do things in the form of like two commits where you do like
one commit where you do the test-related stuff and then you do
another commit where the agent does all the code.
So that's an example. Another thing is that if you are running multiple of these in parallel, you kind of

(06:12):
need, you can't really do locking that well over files.
Like if two agents are writing to the same file, you can get
conflicts there. And so really what you want to
do is probably something like git worktrees, or you want to use
like some remote agent that has like an isolated container or

(06:35):
a VM or something, right? And then you can get like the
best of both worlds where you can get the change sets for
thing A and the change sets for thing B.
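For reference, a minimal sketch of that worktree setup, with placeholder branch and directory names:

    # One isolated checkout per agent, all sharing the same repository
    git worktree add ../agent-task-a -b task-a
    git worktree add ../agent-task-b -b task-b
    # ...each agent works in its own directory; diff or merge the branches afterwards
    git worktree remove ../agent-task-a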
Where it gets interesting is how do you, how do you do things
sequentially, right? Like how do you get these?
How do you get agents to do large tasks sequentially that

(06:58):
are dependent on each other? How do you get them to do them well? And this is, I think, actually an open problem still to some degree.
I think the Anthropic models are pretty good at this. But the main problem is that code contains quite a bit of, yeah, code.

(07:24):
Code contains quite a few implicit decisions about what
you should and shouldn't do. And because the decisions are
implied, you need to have the code itself in the context.
And so if the model doesn't get all the code in, in its context
when it's trying to implement something, you'll end up with a

(07:44):
bunch of like issues: the model will reimplement something, or it will use the wrong decorator, or, you know, it'll use the wrong parent class or it'll extend the wrong type interface. And then it'll use that
type interface in the wrong spot.
Because, you know, when it searched, it was like there

(08:04):
were two, you know, type interfaces that came up and it
picked the wrong one. And then it was like, OK, well,
now I'm going to just use this for the rest of my execution.
And then you go and you look at it and you're like, well, my
future task is dependent on this thing and this thing is broken.
What do I do? And in that case, you're kind of
stuck, which is why you want to make sure that your code base

(08:26):
patterns are very clear. You want probably as few types
as possible. You want as little abstraction
as possible. You want everything to be as
modular as it can be without being overly modular.
Because as you start to break things down, even like to a
certain degree, you can actually increase the coupling by trying

(08:46):
to break things apart when they're actually, it's like code
that's supposed to sit together,right?
So let's say you have a big function where, like, half of the variables start being used halfway through the function and the rest of them are used at the top.
And there's a very clean line where you could split it into
two functions. So you have a trade off to make

(09:08):
when you're writing your code. You can either say, oh, actually
I'm going to output this type and I'm going to make a second function that takes in this type.
But really what you're doing is you're just increasing the
complexity of your code base, because you are creating two black boxes, and you're creating two points where you have to track and maintain structure. I guess like you have to

(09:33):
remember the correct type structure and what each thing
means rather than one. And so really again, it's like
a couple of guiding principles which are kind of standard, or
like if you have to repeat something more than twice,
consider refactoring it. If you know, code that like
changes together should stay together, etcetera, etcetera,

(09:56):
general architectural and like good code organization
principles will apply. Something that is important to
note though is file lengths should be shorter.
And that's simply because when you go and have the model read a file, the length of the file will introduce context, right?

(10:21):
And you don't necessarily care about all of that context.
And so if you can afford to break up the thing like you
don't have just one giant class that is like 2000 lines long and
it's actually just like a bunch of different things.
If you can afford to break that up, you should probably break
that up. And you should probably move all
your utils to a utils folder, import them, etcetera, etcetera.
Same thing with building good components and doing good state

(10:43):
management. If you, if you get your state
management right or your event handling right manually and you
move all of the like kind of system guidelines and complexity
into one spot and you create patterns around these things, models are really good at propagating those patterns, but
they're really, really bad at creating them themselves.

(11:03):
And so if you have poor state management initially, the model's just going to write, like, well, maybe it'll write better code than you, and in that case you're good to go.
But assuming that you're doing like prod work at a company, if
you, you know, are doing poor state management on your front

(11:24):
end, the model will just like kind of replicate the quality
that you have. It's, it's not going to really
save you. And so you need to establish
like, like I was saying, there's implicit decisions in the code.
If the model sees these kind of patterns, it'll just choose to
replicate them. And so your, your whole goal is

(11:44):
to set the patterns up in a way where it's easy for the model to work with, and then the agent builder, so like in this case us, it's our job to surface those patterns to the model so that it makes good decisions.
All that makes a lot of sense for the architecture of it.
I'm curious, when we get to the usage phase, what type of

(12:07):
constraints do we need to give agents when we're prompting
them? Agent rules about like how they
can verify their work, architecture decisions that make
it easy for them to do certain categories of work, I think
works really well. So let's say you have a certain
pattern in your code base that allows you to extend the
functionality of different components in your code base.

(12:28):
That's really useful because the models will be really good at
extending those patterns or repeating them for different
cases. But the like constraints, I
think exist more in how you choose to, you know, accept the
code, how many times you review it, how you want to review
it. A lot of people say that like

(12:49):
GPT-5 and GPT-5 Codex are good code reviewers, and that like you can build your own workflow where you, you know, have like a git hook and you have GPT-5 do like some review task and you have, you know, Slate, Claude Code, whatever, do

(13:18):
the actual implementation work. And then you can just run that
workflow in parallel. So like ideally what you, what
you want to think about is like,how do I build a pipeline here
where I get all of the information I need to make a
decision about whether this code is good or bad, right?
And that's, that's like your goal with your whole dev setup
is that you have a pipeline that is very clear about did this

(13:40):
work or did this not work? So that's number one.
Number two is then also creating the guardrails or patterns or guidelines that make the code more likely to work.
But first you have to be able to tell if it works or not.
So again, like tests, tests, tests are really important.
To be honest, when I do things, I actually don't rely

(14:05):
too much on tests because we have to do a lot of end to end
work. So my personal workflow is
something like actually making sure that the model is aware of
the right files. And then this is why I was
saying patterns are super important is because you can
actually get like completely correct code as long as the code

(14:29):
base has very consistent like rules baked into it.
And you, you can just get basically perfect code without
having to have tests. I think the thing is that most people don't write code in a way that it's like very, what's the

(14:50):
best way to say this? Symmetric, where it's like kind of consistent across how things are written.
And the more consistency you have, the more likely the model is going to follow that pattern. And therefore, like, if that pattern is correct and is useful for your current use case, the more likely the code is going to be correct and the happier

(15:12):
you're going to be. And so it's, it's important to
figure out like, how do you want this code base to be built?
And this actually leads me into a different and more interesting
point, which is like, is software engineering dead?
And do new grads have jobs? And I actually think that the
answer is yes, simply because you have to make decisions about

(15:37):
how you want the code base to be built.
And that requires a lot of skill and it requires a lot of
practice. And you're going to constantly
need people who are able to understand, you know, like how
to actually make good decisions about code organization, code

(15:57):
structure, implementation specifics, abstraction decisions, even like build lifecycle or CI/CD decisions, all that stuff, like all of that stuff is super necessary to make. What you're essentially asking for is like a factory for software, but you

(16:20):
still have to design the thing. All we have right now is we have, like, the robotic arms and we have the conveyor belts and whatnot.
We have all of these individual things, but you still have to make good decisions about your factory, because what happens if your factory is like shaped like this?

(16:42):
Why would you do that? Most factories are squared.
Why would you do this? But the thing is, most people
with their code bases, if they like, don't think about it too
much, they actually end up creating a code base that looks
like this and is not actually designed to be a factory.
And this is also true of like code bases before AI, right?
You, you write a good code base and it means that you can extend
it very easily. You can change its purpose very

(17:03):
easily. If you need to adapt something,
you can do that. If you need to rip something out and reuse it in a completely different project, that's also super easy to do, as long as you push the complexity around in a way where it is contained, in a way that you can
understand it well. So just like generally good

(17:25):
software engineering principles, I think, make it easier to use AI tools, and the less you know, the harder it is.
There is also something interesting here which is people
who are not super strong developers but still use these
tools also get a lot of benefit,but it seems to me like they
spend a lot more money working through these issues.

(17:50):
Coding agents are able to take real work off our plates and
deliver real value. In order to maximize that value,
you have to give them access to real data and systems using MCP.
And that's really kind of scary,which is why I've been testing
out Tool Hive. Tool Hive makes it simple and
secure to use MCP. It includes a registry of trusted MCP servers. It lets me containerize any
other server with a single command.
I can start in a client in seconds.

(18:11):
And secret protection and network isolation are built in.
You can try ToolHive as well. It's free and it's open source, and you can learn more at toolhive.dev. Now, back to the conversation with Kiran: why coding agents?
This goes all the way back to when ChatGPT came out and Stable Diffusion came out, and I got really interested in why models are unable to reason

(18:40):
about code bases. So if you were using GPT-3 or GPT-3.5 at the time, right, and you asked it to do like pretty
simple tasks coding wise, it would generally fail even if
given all of the context. And I thought that was really

(19:03):
interesting. And so like something kind of
weird happened when Stable Diffusion came out because the
diffusion models are able to actually understand the like
kind of structural relationships between things.
So if you think about like an image scene, right, an image

(19:26):
scene actually has like a bunch of relationships between
different things modelled. Like, a classic one was the mug in the grass versus the grass in the mug. And models for a while had a hard time distinguishing between those types of relationships in
language, but they could actually represent them in
images. And I thought that was really

(19:47):
weird because code is much more structured than language is.
And in theory, you should be able to teach a model to like
understand those things. And so I went down a rabbit hole there for a while, got really, really interested, read a bunch

(20:08):
of research papers. Then I ended up building some like
early agents based off of the Voyager paper, right?
That was a really, really cool paper to see.
And then at, at some point we were like, well, we know how to

(20:34):
make these agents and we were working on an infrastructure
project at the time. And we were like, why?
Why don't we just make one? Because Devin had just launched.
So why don't we just like open source Devin?
And so we did that and it actually worked pretty well.
We launched a CLI in early 2024, sort of as an open source thing.

(21:04):
Not, not really early, like March 2024 approximately.
And that was that worked out pretty well.
Got a bunch of GitHub stars, a bunch of people used it, and we ended up
going down this giant rabbit hole of like, how do you build a
product around this stuff? And then we went off into the

(21:27):
coding like bunker and just like built a bunch of different
things for a while. And recently we decided that we
wanted to just launch the agent. So we launched the agent, and that, I think, basically catches us up to
today. So do you mind telling me a bit
about Slate? Because there's so many different coding CLIs

(21:50):
coming out now, every major player dropping them. What
makes Slate different? Why should we check it out?
Just a little insight there. Yeah.
So Slate is primarily for engineers who are working on
like more production code bases. It's designed for longer,
harder, more edge case tasks. I think I mentioned earlier that
you can do like systems debugging.

(22:11):
We recently posted a blog about how you can do like partial
migrations using Slate and it's designed to work with you
through these problems. So something kind of interesting
with respect to that blog post is like we purposefully didn't
give it that much information to just see how it would perform if

(22:33):
we made it as autonomous as possible in this scenario where I was interacting with the user. A lot of learnings were distilled from a lot of experiments in making
long running successful agents. So I've run, you know,
experiments on things like how long can a model propagate the

(22:57):
state of an automata or a set of automata over a finite tape. So how many states can it simulate forward in like a
set of automata? Another example is like I was
building some reasoning benchmarks, and one of them was like for continuous learning, and it was: how can I get this

(23:17):
thing to navigate a maze without being aware of the surroundings? Like just based off of knowing its position?
How do you get it to actually map out like the state of the
environment that it's in? Right.
And there's a bunch of really interesting problems like this
that you can use to model failure modes in more complex
environments like coding. And so a lot of the learnings

(23:43):
around general reasoning in agents are baked into Slate.
Yeah, they're generally baked into Slate, but it's kind of funny.
The end result is actually just something that is more simple.
So, for example, to do lists, right, to do lists are actually

(24:05):
a really, really effective way of getting a model to think
through a problem. Because you have this reasoning problem where the model kind of has to predict what issues it's going to run into, and like what the shape of, you know, its solution or approach

(24:28):
is going to look like ahead of time and it actually doesn't
know what is going to work right.
So assume, like, the golden path: it explores, in this case, your code
base and it is able to, you know, gather enough context and
then create a good plan. And, and most of the time what
happens is it'll create its plan, you'll get like 5 bullet

(24:50):
points and then it'll go do something.
And this happens with almost every coding agent.
And part of that is because the model providers have trained
this behaviour into the models. There was a period of time where
getting them to track things like task state was actually
really, really hard. And that has since kind of been

(25:12):
as far as issues go, that has since been kind of solved to
some degree, especially like, you know, Anthropic's talking
about their memory tool and all that.
So, so this state tracking behavior is really, really
important because if the plan does not go according to plan,
right? Implementing your plan for a
solution in a code base doesn't go according to plan.

(25:33):
And you learn something. Let's say you run a shell
command and something is broken in the environment and
now you have to go off course. And instead of just going on
this like happy path, you actually have to go this way,
right? And solve the environment issue.
How do you go down that path and come back to the original like

(25:54):
golden path once you've solved that environment problem?
The same thing goes for like whether or not you know certain
code already exists in the code base.
It's like if the model misses that, then what happens?
And so those are the cases that you have to build for.
And so I've tried like, I don't know how many different types of
solutions for a lot of these problems, and generally the

(26:18):
simplest and most generic solutions seem to be what works
best, which is really annoying because you know you'll try to
build like a memory system, right?
And almost all compressive memory systems that I have tried
or seen generally fail. And I think people saw this a

(26:40):
lot with like the compaction in Claude Code.
It like it just kills the session.
So a lot of people will just kill the session and start a new
session. And so when you like run into
these problems, you end up finding solutions that are

(27:05):
probably more scalable. And so the way that we do
context management, I think, is probably better. Like, our requests generally don't go over like 50 to 70,000
tokens, I think on average, at least that's what I've seen
historically. And that seems to be like a
sweet spot for context. It's once you get past that

(27:29):
point, a lot of models start to degrade.
So I think anecdotally, what I've seen is something around
like 120K tokens seems to degrade a lot of the
performance. And the same thing goes for
like, you know, how do you do context engineering?
How do you pull in the right snippets at the right time?

(27:51):
What do those snippets look like?
How much is enough context for the model to like latch onto and
move on to the next thing? All of these things are like
really subtle small things, but they add up to make the
performance on any given task like way better or way worse.
And so when you're, you know, you're like when, when the kind

(28:14):
of average AI like friendly person goes to implement the
coding agent, what they usually find is that it like works OK,
but that's just because of the model.
And there's actually like, if this is like the baseline
performance of the model, even with just a shell tool, you can
actually get like a lot more performance out of the model

(28:35):
simply by providing it the right framing and environment for it
to, to work in. So I guess that's like a summary
of the benefits of, of using Slate or really just like the
learnings that we put into it. I think why you should use it.
It's specifically designed to handle kind of three types of

(28:59):
issues in agents. So one is completeness, one is correctness, and one is just like general intelligence.
And the tools that we've provided, the way that we've
built them, the way that we provide context, all of that
stuff is designed in a way that it like makes it easy for the
model to be as smart as it can be.
A few things to unpack there. I totally see what you mean with

(29:19):
people not liking the compact feature.
Every guest I've had on who's mentioned it talks about how
they just ignore it. They do their own compaction
because that type of control tends to benefit from view
insurance. One part of my workflow is I'll
make my own To Do List. So where I just kind of set up a
list. If I need the AI to update it, I
explicitly tell it to do so. And sometimes I rely on the

(29:41):
built in To Do List. But I find just having that
level of control, where do you see Slate kind of alleviating
the need for human intervention versus the other models?
Because I'm telling people, if you're using Claude Code, slash new, slash clear very frequently, refer to a markdown file.
So there's a lot of human intervention.
But with Slate, if I'm understanding correctly, it's
better for long running tasks like the big work.

(30:02):
Could you elaborate on that a little bit?
Slate generally won't get bogged down by context in the same way that Claude Code or Codex will. Now you have a secondary trade
off here, which is we, the way that we built it and the way
that we have to like probably improve it is in its memory and

(30:23):
its like kind of ability to retain that high-level goal.
However, what it's really good at is just it'll keep working.
It'll continuously like work through your task.
It doesn't really stop until its, like, task list is done.
And this was super intentional. I think you, if you keep things

(30:44):
simple, you kind of have this trade off of like, OK, you can
either make it so that it is very, very agentic and it'll
just do a lot of stuff, or you can make it so that it retains a
lot of context and then starts to degrade.
And we wanted to keep the performance very stable.
So Slate is a lot more like a function than I would say other

(31:09):
tools. And I think that's really
important and that's probably what's going to allow us to
scale up the task length reliably and scale up the other abilities of Slate, because it is much more like a function than I think other

(31:32):
agents are. More like a function in what regard?
It's more guaranteed to execute in a specific way, pretty exclusively.
It will, like, as long as it's given like a complicated enough
task, you will get your To Do List.
It will work through the whole thing.
It won't. It's not going to stop.
It's just going to keep going and eventually it'll declare

(31:57):
victory, which may or may not be correct.
But by that point, it like has done the exploration, it has
gathered the context, it has maintained it. The way that we do, like, tool and environment interaction allows it to be more
efficient. And so it can actually get there
faster. And also likely more completely

(32:23):
because it's like it's a little bit more prone to exploration, I
would say. But then at the same time, that
gets you the added benefit of it having more context about like
what you're trying to do when you want to do it.
And so for anybody who's going to go use it, I think they'll end up feeling like it feels different, and

(32:44):
it's going to be hard to articulate why it feels
different. But the actual reason is that it
is just more like, it's almost, it feels almost more like
autonomous or programmatic. I think that's what I mean by
function. It's like it's more autonomous.
It's not this like back and forth thing that's like really
kind of flimsy. It's like very just, it'll just

(33:05):
go do the stuff that you want it to do.
And then obviously, the intelligence is limited by like
the underlying model, but we're actually like also doing stuff
to alleviate that right now. Would you say with the current
state of things, the bottleneck is the LLM's capability or the
framework, the tooling around it?

(33:26):
Like do you think we can squeeze more out of the existing LLMs or do
we need another step change to get to that next level of
capability? Something that I like to say a
lot is all the cards are already on the table and we just don't
know how to use them. And I'm like pretty confident in
that statement as far as model intelligence goes, because if

(33:47):
you like look at the way that people design tools for language
models, they don't seem to be that effective.
I think you can get much further with better framing of the environment and better tools that provide

(34:11):
the model more capabilities without actually increasing the
combinations. So what I mean by this is like
people, you know, build like an MCP and it'll have however many tools, or like you'll have tons and

(34:34):
tons of things individually that the model can do.
And that's probably not a great pattern because that's a lot of
things for the model to keep track of.
What you actually want is you want likely more powerful tools.
And so I think a great example of this is like providing bash or like a shell tool. A shell tool is super powerful.
You can do most of the things that you need with it, right?

(34:55):
And because of that, you know, the models are are trained to be
aware of all the tools that are going to be available in the
command line. But because of that, you end up
with like really, really powerful agents based off of
just the shell tool. And when somebody goes and
designs a new tool for an agent, they'll overload the number of

(35:18):
like combinations that the agent has access to in terms of like
the parameters and things like that.
And that usually degrades performance.
So instead, like the goal is to design the context and actually
specifically like design the message history in a way that

(35:41):
the model can create a coherent understanding of the world that
it's in. And I think that's really
important. And I think that most of the
time people don't do that. And so another thing with Slate
is that you'll see that it is actually probably more coherent
and aware of what's going on in the environment than other

(36:03):
systems. And and part of this is like
downstream of these principles that I'm kind of going over
right now. So an example is like Slate can,
you know, it can use tmux and go over SSH, like spin up an SSH session into like a remote box and debug

(36:25):
something on a remote box. It can debug more complex
systems and like multiple things in parallel.
And these are kind of more complex issues that most agents
can't do. And the reason is that I think
they don't really do a good job of tracking like the context

(36:46):
that the model is working in. But I, I also don't think that
it's about, you know, stuffing or stuffing the context or
shuffling things around because let's say you dump all the docs
for something into the context, that's like, that's actually not helpful at all. And what you really want is you
want the model to be able to make its own decisions about
like essentially snipping pieces of the context.

(37:07):
So let's say you have this giant document, you want the model to
be able to extract just this small section, move it into its
own context and dump the rest of the document.
And so you should build tools that allow you to do that,
instead of building a tool that just gives it access to the
whole document and sticks it in context.
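As a toy sketch of that idea, a tool could hand the model just the section it asked for instead of the whole file; the document path and heading below are made up for illustration:

    # Print only the "Authentication" section of a long markdown doc,
    # rather than loading the entire file into the model's context
    awk '/^## Authentication/{p=1; print; next} /^## /{p=0} p' docs/api.md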
Yeah, because I always thought it would steer it in the wrong
direction, just giving too much and just dumping everything in.

(37:29):
And I've been a big proponent of telling people: use the LLMs as little as possible. Like if you create a pipeline, have deterministic code as much as possible, and the LLM is just
kind of like the fuzzy match or the bit in the middle.
One thing that stood out to me, just the ability to do all these
different things where there's like read the docs, capture
little parts or go into remote servers.
It feels like that's just creating a lot of data, a lot of

(37:50):
information that goes into the context.
Is your approach to prune the context?
Or do you create artifacts that it can reference to pull in?
Like for example like creating the To Do List as a markdown file?
Or maybe it creates a copy of the docs as a TXT and then it
can just read that and extract it?
Or do you leave things in the context?
Yeah, right. So right now actually, we just

(38:11):
leave things in the context. I've done a lot of experiments
with things like pruning and I am not I'm convinced that there
is a good way to do it. I'm not convinced that there is
an amazing way to do it. And so if, if you, yeah, if you

(38:37):
look at like the actual performance of things, models
tend to degrade in performance when the surrounding context
about why they got that information disappears.
So if they don't actually see a coherent narrative about like,
oh, hey, this piece of context ended up in my history because I

(39:01):
did this thing, then, kind of like I was saying earlier, the coherence of the system, the agent, the model decreases because it doesn't actually know where that thing came from.
It doesn't know why it's there. And you really, you really have
to explain like why the model is seeing the things that

(39:26):
it's seeing, or it has to understand that
intrinsically. And so if you do these things
like compression and pruning and whatnot, you actually create an incoherent narrative. If you do it in like a naive or
not even naive, even if you try to do it well, you still create

(39:47):
some amount of an incoherent narrative because the model is
missing information that should be surrounding the information
that it cares about. So that is actually that is a
big problem. And this is actually surprising
that I haven't heard this addressed too often, because it
makes perfect sense when you have an entity processing
language. If the language is incomplete,

(40:08):
it could scramble it and just go incorrectly.
Can this be resolved by things like having XML tags as like
section headers, or is that just kind of a Band-Aid solution
versus giving it proper reference to where the context
came from? I, I think that's, I think
that's a Band-Aid, to be honest. Like, there's obviously you

(40:29):
can, you know, throw in your own tokens and like your own delimiters, right, but without the like correct history
reconstruction, the model will just end up like doing weird

(40:49):
stuff, or at least they used to. So this, what I'm describing, is some behavior that we saw a lot on Sonnet 3.5 and 3.7. And then we like fixed our
systems and everything started working.
And so we haven't seen that problem since May.
But prior to that it was a big issue because the, the

(41:11):
model would see, you know, we were, I was testing out some
pruning stuff and the model would see the piece of
information that it was supposed to see.
And it'd be like, oh, this could be important.
I think the correct way to think about it is if you do any of
this stuff, you do summarization, you do pruning,
you do whatever, it needs to just be treated as like

(41:32):
background information and not actually information that is
like the current part of the current trajectory.
Because without the cause and effect relationships that are
being like shown throughout the tool call history, you kind of
lose, like I was saying, you lose the context as to why that
thing exists. And so the model will see it,

(41:52):
but it doesn't really have a strong narrative around it.
And something that's pretty important is that like models
like to role play a lot. And so if you give them a strong
narrative, then they're really effective.
And if you give them a weak narrative, then they're not as
effective. And this is just an example of, this is just an instantiation of that problem of like, OK, here's

(42:13):
a bunch of information that seems completely irrelevant or
like tangentially relevant in comparison to all of these
recent tool calls I just made. Well, I'm going to focus on
these tool calls. I'm not going to focus on the
thing that was just stuffed into my context.
And so even if you do it, there's a couple approaches that
you can take, but you'd really have to think about how the

(42:36):
context degrades. Essentially, if you're starting
the context over here and you're ending the context here, the
context and the attention to that context kind of goes like
this, right? So it attends really strong,
like the attention mechanism attends really strongly to the
end and really strongly to the beginning.
And anything else ends up being like noise.

(43:01):
So you can, sure, you can like introduce noise in the form of
like valuable context and it'll act as almost like just a bias towards the behavior, but it won't actually
really improve it more. So what you want to do probably

(43:24):
is like, pull things, guide the model with its like To Do List to behave better, right? Like make it actually behave more correctly rather than trying to patch up these issues. A lot of the times when people
have some type of chat interface, they'll just keep

(43:47):
appending to the conversation, appending to the messages to
show the conversation. Is, is there any difference in
your experience from doing something like that where you
have a clear like system message, user assistant, user
assistant back and forth? Versus taking all of that and
creating like a few-shot example where you kind of
compress everything into a single message and maybe remove
the user aspect, the assistant aspect of it, and just

(44:08):
kind of have like one message or conversation versus many
messages of a back and forth dialogue.
Yeah, I think, you know, the models are trained on the chat
format. So like use it to your advantage
is like the pretty straightforward answer there.
It's like definitely use that. You can use that to your
advantage in terms of making it perform better.
So do that. The other question I had is

(44:29):
around the attention curve, where it pays more attention to the beginning and the end. Do you know why that is?
Because my understanding of Transformers was it should look
at everything equally. How come it's evolved in such a
way? Or maybe a structure?
Just, I'm misunderstanding why it gives more attention to the
start and the end because I've heard that from multiple people.
I just I don't know why. I have not trained a model at a
frontier lab, right? So I can't say for certain.

(44:53):
And I actually, I'm not sure if they can say for certain either,
to be honest. I think it's just generally
related to the data. Like, if you think about having a conversation with a person or a bot, right, the
initial context is really important for the rest of the
conversation usually, unless it shifts into something completely
different. And then the most recent

(45:14):
question or turn is also the most important.
It's like, OK, yeah, we've had this conversation for 20
minutes. Whatever we started with is
probably really important and whatever we're talking about
right now is also probably really important.
And if you look at your own chats, like something that comes
to mind for me right now is if I look at like chat history on

(45:34):
something, I usually only care about the last two responses.
Personally, I don't really care about the response that happened
five responses ago. And so the best
response from a model is going to be the one that actually
focuses on the things that I care about and not the things
that I don't care about. And so that's an example of like

(45:55):
a recency bias that can be trained into the data, but also
the bias towards like that initial, initial state.
I also think that because everything is downstream of
those initial tokens that there could be some form of impact.

(46:17):
But the models are probably large enough that the like kind
of path dependence of the rest of the conversation washes out
with enough data diversity. So that's probably not true, but
it could be. And I think that would be an
interesting thing if somebody could prove that one way or
another. So you can go check out Slate at

(46:37):
randomlabs.ai, and during the payment flow you'll be able to get a month of free credits to just try it out. And if it works, let me know, let us know at team@randomlabs.ai, and then you'll be all set. Hopefully it works out.
Thank you for listening to our conversation with Kiran.

(46:59):
I thought it was great learning about Slate and what really
makes a good coding agent. I also think it's important that
we keep in mind that you shouldn't just download these
tools to use them blindly, but if we work on optimizing our
environments to get the absolute most out of them, we're going to be more productive, more efficient, and be able to accomplish more.
I want to give a quick shout out to ToolHive, which is supporting the show so I can have conversations like this, and I'll
see you next week.