#212 - o3 pro, Cursor 1.0, ProRL, Midjourney Sued - Last Week in AI

Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello and welcome to the last week inAI podcast, or sometimes the last two
weeks in AI podcast where you can hearus chat about what's going on with ai.
And as usual, in this episode, wewill summarize and discuss some of
last week's most interesting AI news.
You can go to the episodedescription for the timestamps

(00:31):
and links for all those stories.
I'm one of your regular hosts, Andre ov.
I studied AI in grad school and nowwork at a generative AI startup.
Hey guys, I'm your otherhost, Jeremy Harris.
Gladstone ai, ai, national securitystuff blah, blah, blah, blah, blah.
And yeah, we have a, a lot to getthrough this week because it's
actually this past two weeks.

(00:53):
This is one of those episodes where wemissed one last week that was on me.
and now we're gonna dosome catch up and see.
Yeah.
Jeremy, you seem to need to travel a lot.
I'm starting to feel like you mightbe a spy going to Washington and, and
retrieving AI secrets or something.
No, I mean, look, every oncein a while you may hear what
sounds like a Russian accent.
actually, it's, it's funny 'cause you'rethe one with a Russian background,

(01:14):
Well, but this is how spies work, Andre.
All right.
They're, they seem like theycould not be less Russian.
and yet here we are, so.
Yet I'm not a spy.
But you just have travel todo, to talk to people about ai.
Yes, exactly.
Well, we will go pretty quickjust to give a quick preview.

(01:35):
No huge stories in the pastcouple weeks in tools and apps.
There's just a variety of announcementsof somewhat significant releases, a lot of
one point ohs or new versions of things, Anew O three PRO applications and business.
Again, nothing huge, but someinteresting developments on the
chip side, on the OpenAI side thenprojects in open source research

(02:01):
kind of again, a variety of stories.
No particular focus in this episode.
Policy and safety.
We're gonna be talking about, kindof a, a bit of interoperability
and safety more so, and a coupleof national security stories.
And they'll actually have a syntheticmedia and art section, which we haven't in
a while, just because it's always at end.

(02:22):
But there's some some newcopyright lawsuits and some new
partnerships that are interesting.
So we'll go ahead and add that on tocover that sag back in the news too.
It's been a while since we've seen them.
Yeah.
Yeah.
We used to, you know, last yearthere was quite a bit of it
and we sort of just stopped.
And now is a good time to mentionsome of that ongoing news.

(02:46):
Before we dive in, do wannaacknowledge some Apple Podcast reviews?
We appreciate your comments.
Had a review to tell usto keep it up, please.
Which I feel like we've beentold this several times.
So the encouragement is,appreciated, let's say.
Yeah.
And we will try to keep it upand make it as weekly as we can.

(03:10):
Another positive review.
Love the show.
CapEx, CapEx, CapEx.
Well glad some people were on board.
And we did have a pretty detailed bit offeedback with a free year listener talking
about us, maybe alternating introductions.
More me taking the lead lessalways talking about the

(03:31):
next story and setting it up.
We just sort of wound up in there.
We didn't plan on this beingthe natural flow of a show.
So you might, I feel like we emergedorganically, like, it's funny 'cause
so I have the unfair advantage thatwhile you're going through the, the
kind of layout of the story, I get tothink a little bit more about, yeah.
Look at my notes, be like,Hey, you know, oh yeah.
There's this thing.
Because as you can imagine, we'recovering, I mean, this week will be like,

(03:54):
like 40 stories or something every week.
It's like, I. We're having to do research.
We have reams of notes on, on everysingle paper, every single news story.
And so, I dunno about you, Andre,when we switch stories, I'm like
in a scramble trying to figureout, what did I even think of this?
Oh yeah, this is that paper.
Okay.
And so while you're kind of gracefullygoing through your intro Yeah.

(04:14):
And the secret is I'm actually justbetter at sounding prepared when I'm
reading from notes because you gottaload this into your ram, you know?
Yeah, yeah.
You gotta change context andI happen to be all right.
I hope at pretending like I havean actual script instead of just
rambling off based on it's a talent.

(04:36):
Yeah.
And I will say I think I am pretty goodat segues, but anyways, we'll, we'll
try out a bit more variation throughout.
Andre's really good at segues.
And with that,
and with that, let's get goingon the actual news, starting
with the tools and apps section.
First step, we have OpenAI adding Othree PRO to chat GPT, dropping the O

(05:01):
three price by 80%, and also mentioningthat they're gonna delay the open
source AI model to later this summer.
And that's pretty much the news.
So O three is their reasoning model.
Now we have O three pro, whichis gonna be replacing O one Pro.
It seems very good startingto be on par with oh one.

(05:25):
And the O three model is getting cut downby 80%, so that would mean $2 per million
input tokens versus the previous eight.
So huge price drop.
I mean, this was to me quite surprising.
And yeah, O three PRO is pneumaticexpect, pretty nice performance

(05:46):
on benchmarks, better than allthe other offerings of theirs.
So pretty big news.
So there's an opening I post aboutthere's the model release notes on O
three Pro with some initial evals, right,to giving you a sense of like, how does
it stack up compared to both humans.
And then compared to O oneand O three medium against

(06:07):
humans, it's really impressive.
Worth looking at thechart across everything.
Basically.
You, you see a clean sweep where the model64% of the time is preferred to humans.
that includes by the, by the way,personal writing and computer
programming and data analysis.
So really kind of spanning everythingfrom, things where you have a
quantifiable reward that you can issueand things that are more qualitative.

(06:28):
You're seeing superiorperformance across the board.
And then some of the areas wherewe're seeing really significant
improvements in benchmark scores.
Amy Amy 2024 going from 90 to 93%between O three medium and O three Pro.
That may not sound like a lot.
It may sound like 3%, but one wayto think about it is once you're
already at 90% there's not that manypercentage points left to climb, right?

(06:51):
So you would expect, like saturatinga benchmark is really hard.
They just took a third of the remainingerrors off the table with that is
kind of similar with GPQA diamond thatsort of PhD level science questions
and code forces competition code.
So a, across the board, again,this like universal improvement
in these capabilities.
One thing that I hadn't noticedto my embarrassment there's

(07:14):
a benchmark that they run.
They call the four outtafour liability evaluation.
I just wanna surface this becauselike, it makes all the sense, and
of course they're doing this, butI guess I, I hadn't yet explicitly
remembered seeing this in writing.
in this eval, you consider a modelsuccessful only if it correctly answers
a question in all four attempts.
So you try it four timeson the same question.

(07:36):
And this is sort of a, you cansee it becoming more important,
this kind of valuation rate whenwe get into agents that are being
deployed in higher stake scenarios.
You wanna make sure that the agentconsistently performs well, so that
even if you, if you test it and youknow, you get lucky or something, you
don't overestimate its performance.
And so anyway, I, I thought thatwas again, one of these oddly simple

(07:57):
things but that I hadn't seen doneelsewhere, remembered done elsewhere.
Exactly.
Usually you get pass at oneor pass at five, basically.
Do you nail it first try or, ordo you nail it after a few tries?
And they do give those numbers,but they also give a four or four
reliability evaluation, which asyou said, I don't think is typically

(08:19):
what you see in benchmark numbers.
And compared to the past at one result.
That is a less nice number.
You get worse outcomes if you aretelling it to be, you know, four
out of four times get it right.
There is a performance drop and in fact,in some cases, like G-P-Q-A-A pretty

(08:40):
significant performance drop, but stillall three PRO is, is beating all of them
and on the evaluations with human testers.
So three PRO is preferred to Othree, according to human testers
on scientific analysis, personalwriting, data analysis, as you said,
about 64% of the time on average.

(09:01):
So you know, O three issometimes about as good.
But more often than not,O three PRO is preferred.
Next up we have Cursor AI editorhits 1.0 Milestone, and there are
some releases with it, includingB, but, and background agents.

(09:22):
So Cursor is the integrated developmentenvironment, the programming tool
that has become one of the leadingcontenders for being what programmers
use to include AI in their workflow.
so 1.0 release, probably, you know,not being covered in major news
outlets, but kind of a big deal.

(09:42):
And any spheres as we've covered now,has a ridiculous evaluation after
really rising quickly last year.
So with this 1.0 release, theyrelease bug Bot, which is an matic
reviewer of pool requests on GitHub.
There's also this background.
Agents in beta, which allows youto run these agents in a remote

(10:08):
environment set up by cursor.
So it's getting into the agent territorywhere the AI agent does coding for you,
does work for you totally a synchronouslyaway from your prying eyes and then
delivers something to you to evaluate.
So cursor has had a Genta coding fora while and we've been pushing it.

(10:31):
This is another step in that directionand lines up with other efforts like
Codex and JUULs from Google, where youdo have these coding agents just work
remotely and deliver results withoutdirect supervision, which was the model
for AI paired coding up to recently.

(10:51):
I. I'm super curious about where thisevolves from a, a security standpoint too.
Like for context, the way this is workingright now is that the agent will actually
fork your GitHub repository and have itslike own branch that just like, it'll,
it'll put out prs, it'll, it'll reviewprs and all that stuff, as you said,
fully in parallel on its own branch.

(11:12):
so they, they have some notesabout the security side.
They're like, Hey guys just keep in mindthese agents have a much bigger surface
area of attacks compared to existingcursor features that don't look like this.
And they, they do say our infrastructurehas not yet been audited by third parties.
You know, you have here agentswho have read write privileges
to repositories, right?

(11:32):
So this is like.
This is God mode for your AIagent that is writing code.
So if somebody can do prompt injectiondata poisoning attacks or whatever on the
agent, that could be a really big deal.
And if you're deploying this in like aproduction setting this is a, a really
interesting you set of vulnerabilitiesthat absolutely is gonna have to be

(11:53):
addressed in the basic kind of designphilosophy for these, these tools.
By the way, we'll be talking about thislater, but this on the same week that
Microsoft has come out and announced anew vulnerability that was discovered
in copilot, sort of in the same spiritwith prompt injection type attacks.
So like all of a sudden we're realizing.
You can't just deploy agents on allthe things and assume that security

(12:18):
is gonna look is gonna look the same.
So anyway I I think that cursor is gonnabe at the absolute forefront of this
because the, these agents have suchintimate access to the code base and are
able to work autonomously and in parallel.
So I think we'll learn alot about best practices.
They're gonna have to evolve reallyquickly because, you know, I mean,
there's a lot of cyber attacks andconventional software with this yeah.

(12:41):
The, the sky's the limit.
Yeah.
And that's especially trueif you're working open source
with various contributors.
Jailbreaks can be pretty subtleand can be quite weird, and agents
are still kind of in development.
So there could definitely be ways inwhich you can just tell it, delete all the

(13:04):
code, you know, or something like that.
And onto lightning round with,have a couple of quick stories.
First, you've got Mistral releasinga pair of AI reasoning models.
So Mistral is the French AI lab, whichhas released a lot of open source
models and has tried to compete withopen ai, Andro and others with big lms.

(13:27):
So we've released magi trial, thereasoning model, two variants small
with 24 billion parameters that isnow available for people to download
with an Apache 2.0 license, fully opensource and Magi trial medium, which
is available on their Letcha platformand on their API not as good as.

(13:51):
Pretty much any of the leadingreasoning models on evals.
Partially because they're smaller comparedto something like deep seek R one.
But yeah, general impression I getis people are not too impressed,
but at the same time, it's nice tohave another open source reasoning
model for people to build on.

(14:11):
Yeah, I continue to be sort ofinterested and confused about what
the big picture game plan is for,for Misra other than to become the
French champion that's subsidized bythe French state to do French things.
But we'll see the business model of justlike pumping out your models and, like as
open source and then hosting them seemsto be challenging for a lot of companies.

(14:34):
We will see if that changes with rl.
I, I'm sort of skeptical personally.
But yeah, again, with these, these sortsof eval scores, it's really difficult
to, to compete, like the frontier ismoving so fast and the fact that they
chose to release this model as well, youcan, you can read a little bit into that.
You know, like Facebook decided, orsorry, meta decided not to release

(14:55):
the kind of biggest version ofthe latest LAMA series because it
apparently wasn't performing too well.
That's the sort of thing that you doif you have a kind of met release.
The fact that they did release thissuggests maybe that they don't necessarily
have a plan for blowing things outta thewater anytime soon, so they might as well
get the, the splash in in the meantime.
That's one interpretationthat you could have.

(15:17):
We'll note that the 24 billionparameter scale is very popular.
You know, it's, it's like a good choice.
I think that's something that meta hasstruggled with is they just keep pumping
out these giant models that nobody.
Nobody really wants to use.
24 billion, 32 billion, likethese are really good sizes for
the kind of hardware that peoplelike to run open source models on.
So yeah.
That's great.
We'll, we'll see where this goes.

(15:38):
There's certainly, they certainlyare the French national champion and
it's gonna be worth something, butyeah, they're in a challenging spot.
They're a challenging spot tryingto compete on just head-to-head
training of frontier models.
And they seem to really be keen on,you know, really competing on every
front of OpenAI and philanthropic.
Last week we also released Mistral codecompeting with something like CLO code.

(16:02):
So basically on any given thing peopleare doing, at least on the L Lamb side,
not necessarily multimodal side, Misrais trying to compete and you know, let's
not count 'em out, but they certainlyhave a TAF task to be able to do that.
Next up 11 labs, the provider of textto speech and text to audio models

(16:24):
has released their V three model,11 V three, which is the latest
in their text to speech models.
It is able to do even morenatural sounding outputs.
You can even embed things like sizeor excited to have more expressive

(16:45):
cues with nuanced delivery.
And this supports over 70 languages.
So yeah, text to speech I thinkprobably less visible to a lot of
people than LLMs and image generationand video generation and so on.
But it has really come a long way,and I think it's at a point where
it will be very hard to tell ifsomething is AI generated or not.

(17:10):
Yeah.
And one of the things that's reallyinteresting, it sort of reminds me on the
agentic side of anthropics, MCP, like themodel context protocol or any of these
like hooks that people are, are learningabout the structure of a given modality,
we're learning here about, okay, what'sthe user friendly way to allow developers

(17:31):
to program text to speech, right?
So you indicated one of the,the upgrades here, right?
So you had these specialsize or excited tags.
The example, or one of the examples theygive here is we did it exclamation point
and then in square brackets happily,and then in square bracket shouts, and
then in square brackets laughs right?
And, and this is the sort of,sort of affordance that you

(17:52):
need as a, as a developer.
It seems obvious in retrospect,but somebody had to think of it and
implement it, so that's really cool.
Sort of similar another similar thingis this idea of multi-speaker dialogues
with realistic conversational flow.
So one of the challenges when you'remaking text to speech is like,
how do you know, or how do youdefine the turns of each speaker?

(18:13):
Make sure they don't talk over eachother or make sure they do talk over
each other if that's what you want.
And so they have a new text to dialogue,API, where you send structure JSON that
defines when each user gets their turn,and then the model automatically takes
care of, you know, the, the kind ofemotional shifts, the interruptions,
the, the, the natural flow of, ofthat conversation through that lens.

(18:33):
So again, it's one of those thingswhere, you know, you, you sort of don't
realize you need it until you startto, to, you know, produce stuff with,
text to speech and especially on theentertainment side or trying to make
real kind of natural conversational flow.
So.
Really cool.
And, and as you said, whole bunch oflanguage is supported, so yeah, I mean,
11 Labs are still doing impressive things.

(18:54):
Yeah, 11 labs market leaderin this territory, so
definitely worth knowing about.
Next got text video Byte Dense is gettinginto the competition with C Dance 1.0.
So it's their latestvideo generation model.
It's trying to compete with VOthree, the really pretty viral

(19:18):
video generation model from Google.
This one is able to generate five secondsof HD video in about 41 seconds, so it's
pretty fast to actually do Generation andAnce is apparently planning to integrate
c Dance into their platforms like Dobafor both professional and public use.

(19:41):
One of the big advantages thatthey have, of course, being the
TikTok parent company, is accessto tons and tons of video data.
I guess this is, you know, makesyou wonder a little bit about.
I mean, a, they're gonna bepilfering YouTube videos left,
right, and center as well.
It's not like that'll stop them.
Especially being a Chinese company,not that that's stopped open AI in the

(20:02):
past, if you can remember, like Miratis sort of famous presentation snafu
when somebody asked her, like, for, Ithink, I think it was for soa, right?
Where did, where did you get that data?
Did you get it from like YouTube?
Like, and she's like I, I forgetwhat she said, but she looked very
uncomfortable and, and it's pretty clear.
Some, some stuff or to many people, it'spretty clear that some stuff went down.

(20:22):
but certainly TikTok has accessfront row seat access too.
An exquisite quantity of data.
One of the interesting things theycall out is that they can handle.
Complex sequences with multiplecamera angles and maintain
character consistency throughout.
This is, you know, part of that wholeworld model building thread that

(20:43):
people have talked about quite a bit.
You know, our text to video, ourimage to video models, world models.
Do they contain world models?
One of the big questions, ofcourse, is always, well, if they
contain world models, they shouldbe able to model real world physics.
That includes things likeobject permanence it includes
things like object consistency.
And so this is sort of hinting atthat, though we don't know much

(21:03):
about the, the architecture itself.
And so, you know, maybe some of this iskind of baked in with, with inductive
priors and it's not actually sortof learned per se, difficult to know
but certainly impressive and, andthe world of convincing AI generated
video, I think it's fair to say is,is just basically here at this point.
Right.
And unlike video free, it is notable to also generate audio that's

(21:26):
pretty much only video free.
So Google impressively kind of took thelead on the text to video world and yeah,
I think it's, it's good to call out,but most likely it's because they have
YouTube and they just can train on allYouTube and nobody else can, and by dance
might be able to compete for that reason.
Well, and the, the audio toois no small thing, right?

(21:48):
Like we're entering this world wherewe're getting positive transfer as
these models are trained on more andmore modalities and video and audio
are so causally intertwined, right?
Like you imagine trying to make a worldmodel literally, like if you're, deaf.
Like you look at the world, youcan create world models, but you
can learn faster about the world ifyou also have the ability to hear.

(22:09):
And especially for AI systems,just given that, you know, they,
these are not trained with rl.
They can't go out into theworld and interact with things.
Having that extra modality to kind ofcross correlate physics and you see what
somebody's mouth opens and the sound tendsto come out, it's like, okay, that tells
you something about the kind of functionof the mouth and the physics of it.
You know, same with car crashesand the sounds that come from that.

(22:30):
So anyway I, I actually expect thatthe inclusion of audio in a single,
almost monolithic base model, ifyou will is gonna be a really big
deal for everything from promptadherence to world model development.
And speaking of VO free, Googlealso had an announcement.
They are revealing a $20 AI PRO plan tolet people use V Free more, and they are

(22:54):
releasing VO free Fast, which is able todo faster generation compared to V three.
V three is.
Fairly slow to use.
It takes I forget exactly, but,you know, a couple minutes.
So this allows you to take,let's say less than a minute.

(23:15):
And now Gemini Pro subscriberscan create up to free videos
daily using VO free fast.
And it's definitely seemed to be the casethat the servers and GPUs from Google are
pretty slammed by people trying to use vo.
A lot of it wasn't working, so wouldn'tbe surprised if this was rushed into

(23:37):
production to keep up with demand.
Yeah, and I, I mean, I continue to tapthe sign that someday fairly soon we're
gonna be able to generate one secondof video for each second that you wait.
In other words, you're gonna beable to generate video as fast as
you can prompt it to be generated.
I. Once we cross that threshold,there's gonna be excess compute on the

(23:59):
generation side, which I would expectto start to get dedicated to addiction.
So, you know, imagine your, your TikTokfeed, but if you've got biometric data
coming in through, for example, thecamera, or even just your interactions
with the app that cause the video tobe modified in real time based on what
you're seeing, there's like a very darkrabbit hole for where this ends up going,

(24:19):
ultimately with the abundance of compute.
that threshold's gonna be very critical.
I think almost from a, asocietal level in terms of how
we even think about these apps.
It's, it's not unlike.
what the ability to generate freshapps from scratch based on prompts
is doing right, where apps themselvessuddenly become this malleable thing.
Well, this is sort of similar,but for manipulating pixels on a

(24:40):
screen to kind of stimulate you.
It's not clear what happens whenthe optimization process that's
running in the back end of thesesystems operates as quickly as the
human biophysical response cycle.
That's a, a, I think a very, veryinteresting phase that we're getting to.
And we're gonna see a lot ofinteresting phenomena, psychological,
and otherwise emerge from.

(25:01):
Yeah, I think you could say this issimilar to where agents were last year
in the sense that we were talking aboutagents a whole lot going back definitely
into 2024, but it took until really thelast couple months for agents to really
mature and, and make a huge impact.
Now with things like cursor code,I think video's in a similar spot

(25:24):
where you're starting to see toolslike flow, like a more easy to use
pipeline to not just prompt it,but actually build something of it.
And I think in the coming months youknow, we will start seeing that actually
not just be used for memes, but actuallyhave an impact on workflows and so on.
And moving on to applications in business.
So we start with thisreally interesting story.

(25:46):
OpenAI and DeepMind are losing engineersto anthropic in a one-sided talent war.
So there is this venturecapital firm called Signal Fire.
They came out with their2025 state of talent report.
And they basically look at like, okay,what's the rate at which we're seeing
employees leave open AI for anthropicversus the rate at which we see

(26:09):
employees leaving Anthropic for open ai.
Right?
So which direction is preferred?
So when it comes to open AI andphilanthropic open AI, employees
are leaving eight times morefor anthropic than vice versa.
At DeepMind, the ratio is 11to one in Anthropics favor.
So for every Anthropic employeewho leaves anthropic to go to
DeepMind, 11 DeepMind employees areleaving DeepMind to go to Anthropic.

(26:33):
That's pretty insane.
There's all this kind of interestingspeculation by the way that, so
Anthropics retention rate is like 80%for employees hired over the last two
years, which in tech is pretty wild.
Like I get in the kind of standard world.
That doesn't sound too, too impressive.
Like, oh, you're still in the same companyyou were two years ago, 80% of the time.
That sounds about right.
in ai that is fairly unusually highopen AI's retention rate for two years.

(26:58):
By the way, 67% that's aligned withwhat you see at meta, for example.
So there, there's all kinds ofpeople kind of tossing around
ideas about why this might be.
One of the often incited hypothesesis like anthropic is just
sort of coming out of nowhere.
They've got the best coding models.
That's just really exciting towork for them, blah, blah, blah.

(27:19):
I think that this actually missesthe core point, which is Anthropic
was a company founded on a very clearprinciple, and it has stood by, for
the most part, those principles.
You know, it's founded by theseopen AI policy and safety and,
and some pre-training researcherswho left essentially in protest.
I mean, this is essentially a,an open secret now over open AI's

(27:43):
sort of attitude and approach toalignment, technical safety and policy.
Open AI or Anthropic rather,seems to have walked the walk
on a lot of their policy stuff.
Pushing back on this pretty ridiculousidea of like banning all state
level AI regulation for 10 years.
That was snuck into.
The, the latest big, beautiful bill.
anyway, and, and, and OpenAI seemsto have been pushing for something

(28:04):
pretty aligned to that, at leastin their, their policy work.
So a lot of this is like, you've got anentity where the leadership says something
and then they actually kind of acted out.
And there's also a lot of kind ofopen discourse, like when you talk
to folks who work at Anthropic.
I've never spoken to, I've spoken toa lot of people at OpenAI who I would
call whistleblowers who are like, I'mreally concerned that the leadership is

(28:27):
talking through both sides of its mouth.
I have never had a conversation that feelslike that with an anthropic employee.
The OpenAI ones that we spoketo in our investigations in
the past were often like.
They're really tense.
You could, you could sense that they,they did not want you to tell anybody
that we'd spoken anything like that.
Whereas in philanthropic it's kindalike, yeah, you know, I might have a
disagreement with leadership, but youget the census is the sort of thing that

(28:49):
they would have out anyway and have, havespoken to leadership about and sort of
reasonable reasonable people can differ.
So I think that that's anunderrated factor in all this
is just the cultural difference.
And I think that's leading the bestresearchers to flock to anthropic.
And that in turn is the causalelement behind, in part anthropics

(29:09):
great success with its coding model.
So I think, you know, it's not all that,but this is a kind of missing element
in at least some of the analysis on thisissue, just sort of from what I've seen.
Right.
And I think, you know, to compliment thatthe dynamics of open AI andro competing
are very different from dynamics of deepmind and philanthropic competing where

(29:31):
deep mind, if you are preferring to goto philanthropic, it is likely because
you don't like big company politics.
Yeah.
Yeah.
And you don't like a lot of bureaucracythat has been introduced to review if
you're allowed to publish your researchor whether you're able to contribute
to Gemini, for instance, development.

(29:53):
You know, not really a surprise.
DMin has been around for a long time.
It's now officially part of Google.
There's been a bunch of reorgs and so on.
It seemed to be really kind of in a bit ofa a bad shape in terms of being organized.
So in that sense, it'snot crazily surprising.
I think also DeepMind was quite big andGoogle has been quite big, so I wouldn't

(30:17):
be surprised if philanthropic just hadfewer people to lose, to be honest.
Yeah, I, I think that's,that's a big factor.
And the other thing is, I mean, Googleand Anthropic have a partnership, right?
So you're not quite leavingthe nest in the same way when
you move from one to the other.
Google's made massiveinvestments in Anthropic, right?
Along with Amazon.
They're basically the,the two main backers.

(30:39):
So, and, and certainly you know,Google TPUs are a huge part of
Anthropics fleet and, and strategy.
So I think that kind ofmakes a lot of sense.
given that anthropic is, you know, hasbutted off of open ai, it kind of, you
know, anyway, it sort of feeds intothat, that narrative of sort of open
ai, disillusioned open AI folks leaving.
the other thing, by the way, themoney side is interesting, right?

(31:00):
This, this article goes into somepretty wild, so they, they talk about
OpenAI, some OpenAI researchers canearn more than $10 million a year.
They're putting together counter offersto stop OpenAI employees from leaving.
For other companies likephilanthropically, like safe, super
intelligence, and these include$2 million retention bonuses.

(31:22):
So just like a one-time bonus,$2 million, please don't leave.
In addition to, this is insane equityincreases of $20 million or more.
Please don't leave me.
Here's a crap ton of money.
Like this is, this is a lot of moneyto be, to be throwing at people
just as a Retention bonus basically.
Yeah, it should've been nice tostudy LLMs when I was in grad school.

(31:48):
Also worth noting in this report, wewon't go into it too deeply, but it does
focus somewhat on entry level tech jobsin addition, and it's in a rough shape.
Yeah, it's increasingly lookinglike, you know, CS in general
has seen a huge rise in undergradenrollment over the past decade.

(32:09):
And for a while it was sort of the starpath to a good job and, and good earnings.
Now as a fresh grad, it's much tougherto get hired than it used to be, and
the number of positions seem to besmaller, and I would not be surprised if
AI has a large role in that in additionto an economic conditions and so on.

(32:32):
A hundred percent.
I, I think we're in this interestingposition where a lot of people, you
can still tell yourself the storythat, oh, it's, you know, it's because
of tariffs, it's because of theeconomy, you know, things like this.
But I'll tell, I mean, I had aconversation with, a very senior
person at, at one of the top labsand and, and what they were telling
me was we are no longer hiringentry-level software engineers.

(32:54):
We don't expect ever to do that again.
And in, in fact, we don't thinkwe'll be hiring anyone with less than
10 years of experience ever again.
And.
when you hear that, it just makesit real where it's like, ah,
this is where it's coming from.
Like, and you know, this is a labthat already is seeing the majority
of its code base written by ai,which, that's not surprising to us.

(33:15):
This is something we've been coveringfor a long time, but I, I think it, you
have to kind of sit back and, absorbthat reality that the job of software
engineers, the job even of AI researcheris getting more and more abstract and
further away from anyway, many of theactivities that used to define them.
And, and that just makesit, I mean, it's brutal.
I like this is you're, youknow, we're, we're headed for
a situation where white collar.

(33:36):
gets automated pretty hard, pretty fast.
And there's social unrestthat will come with that.
I mean, there's no two ways about it.
I, we've got a veryinteresting transition.
We're gonna have to navigate gracefully.
Yeah.
And it, it is happening quite fast.
So, you know, 20 23, 20 24, 2022, to some extent, we saw a rise
of intelligent AI assistance inthings like copilot and cursor.

(34:02):
And that had a massive productivity boost.
you're twice as productive, threetimes as productive with these agentic
tools like cloud code, which are nowworking well, it's getting to a point
where you barely need to touch code.
As a software engineer, what you needto do is be able to tell the agent what
to do and to inspect what it's doing.
To verify that's correct.

(34:23):
And that that's not what an entry levelposition kind of entails typically.
So it's changing fast and yeah,it's worth being aware of that.
and moving right along.
Another, I guess another OpenAI story,not that, the last one was all OpenAI
OpenAI Slams Court order to save allchat GPT logs, including deleted chats.

(34:43):
So essentially what's happened isthere was a court order that came
in and said, look OpenAI is beingaccused of essentially serving as
a platform that allows users toget around paywalls and access, you
know, new news and, and like New YorkTimes articles and things like that.
And what's more, we suspect that thatusers are gonna be deleting the evidence

(35:07):
of that so that if we actually requestthe court requests records of people's
use of the, the tool they're not gonnaactually show these violations of,
of copyright and, and all that stuff.
And so the New York Times argued for thecourt to prevent open ai essentially from
from deleting or discarding informationabout chat GPT logs that otherwise would

(35:32):
have been deleted, including recordsthat users have, have tried to delete.
Right.
So, th this is, OpenAI is callingthis out as basically a, a way of
preventing OpenAI from respectingits users' privacy decisions.
it's essentially puts opening AI in thisawful position where they are at risk of
breaching their own privacy agreements.
Which, you know, huge, huge trust issue,but also, I mean, it could put them

(35:55):
in breach of, of contracts and globalprivacy regulations, all kinds of stuff.
So this is really messy.
You can actually, I mean, I can seeopening eyes argument here that like,
this is, to just lurch out and, and dothis seems like a, a, a strange strategy.
But, you know, I'm not alawyer so hard to know.
There's so little precedent ingeneral on cases like this, but yeah.

(36:17):
So the idea of chat jPT to skirt, paywalls.
It does sound plausible, I guess.
But the question is how, you know, how doyou actually manage that is the best way
to force essentially a kind of defactoprivacy violation onto OpenAI users?
I don't know what the answer is, butthis is the state of the debate anyway.
Right.
And OpenAI.

(36:38):
Even released a blog post, how are weresponding to the New York Times data
demands in order to protect user privacy?
Where they frame it as a privacyquestion as kind of a commitment
to their customers and address forinstance where are business customers
that use zero data retention APIs wherethe chat logs aren't gonna be kept.

(37:04):
But open has had this interestingpattern of, of releasing blog
posts in response to legal drama.
And this one is very much along that line.
Has a lot of notes in, in response to it.
So open the eyes is a little saltyand, and not a fan of this court order.
Clearly, I. next up in the lightninground, we are starting with a story

(37:28):
from the information which typicallyhas far more cutting edge or let's
say, less public information.
And this one is saying thatNVIDIA's biggest Chinese rival
Huawei struggles to win at home.
So this is pretty much an analysis asto what extent Huawei is able to beat

(37:51):
out Nvidia in terms of providing chips.
And it seems to be that so far Huiis unable to get to biggest tech
companies in China to adopt theirchips for AI training and inference.
Yeah, this is actually a reallyinteresting story because the story that
the, like Nvidia of the world have beenpropagating that all, anyway, a a lot of

(38:13):
kind of anti export control people havebeen propagating is that, hey, you know
what, we withdraw from the Chinese marketand like Huawei is just gonna dominate it.
And it just creates a whole bunch ofeconomic wind in their, in their sails.
And this is not entirely wrong,but there's an awful lot kind
of missing in that analysis.
So, one key thing to keep in mind.

(38:33):
Huawei does not have access to the mostexquisite fabrication processes that are
available to Western companies thanks toTSMC, which is based in Taiwan of course.
So TSMC can help you fab down tothree nanometers now, and we'll have
chips that, you know, come off theproduction line using the three nanometer
process in the relatively near term.
Huawei can only use the domestic, theChinese analog to TSMC, which is SMIC.

(38:58):
SMIC is roughly speaking, stuckright now at seven nanometers,
maybe arguably working on five.
So it's, it's forced to usea subpar fabrication process.
Huawei designs the chips, and thenthey send them to SMIC for fabrication.
The problem is you can only do so muchwhen you have limitations, fundamental

(39:18):
limitations on your design process.
In particular, if you look at the Huaweichip series, what they will tend to do
is they'll be very energy inefficient.
If you want to get very energyefficient chips, you have to
get more advanced processes.
So we, we talked about howHuawei's been working around that.
They just set up this like cloudmatrix 3 84 which is like their.
Their computing system that bundles upa bunch of their ascend chips together

(39:42):
in a way that is designed to just say,okay, our individual chips may be crappier
because they're fabricated using a, aweaker process, but we can just string a
bunch of them together, like, like buildlarger systems with larger data centers.
And because China is swimming inenergy in a way that America just
isn't America's energy constrained,China's chip constrained, China

(40:04):
doesn't really care about the energyefficiency of the chips that much.
They can just put more of themtogether and achieve the same scale.
And that's really what they've been doing.
The catch though is overheating.
If your fabrication process is bad, ifyou're, if you're gonna basically like,
like overpower your chips and just.
Pour tons of energy into them,then the chips will overheat

(40:24):
and you will see problems.
That's exactly what seems to be goingon and what seems to be hampering
a lot of Huawei's sales activities.
The Ascend chips also, by the way, can'thandle direct support for low precision
formats like number formats like FP eight,which notably is what deep seek uses.
So Huawei, literally, like theirchips, cannot support deeps seek style
training runs, which is why Deepsseek has been using Nvidia technology

(40:48):
and why the demand for it continues.
One last factor that's really importantto keep in mind is that Huawei
competes with a lot of their customers.
Think about bite dance.
Alibaba, Tencent, right?
These, these companies.
They're all looking into Huawei chips.
They haven't made big purchases.
Part of that is because a lotof them run their own clouds.
Huawei runs its own cloud too.

(41:09):
And so are you really gonnabuy from your competitor?
I mean, this is the reason, ifyou go back to a hardware episode,
this is the reason that pure playFoundry's worth thing, right?
that Intel, for example,historically struggled to attract
a chip designer customers becausethey also were designing chips.
And so you're sort of likebuying from your competitor.
What the market fundamentally wants,is it, it kind of does want a separate

(41:31):
foundry, a separate designer, and thenultimately a separate cloud company.
And, you know, it's not a coincidence thatNvidia isn't so much in the cloud market.
They, they could be ifthey want it, right?
They could make big clouds.
You could have Nvidia right upthere with GCP, with Azure with
AWS, but they're not doing it.
Part of that surely is goingto be competitive reasons.

(41:52):
Let's just have people buy our chipsand, you know, reduce the barrier
to entry on that as much as we can.
And anyway, so Huawei is in a, morecomplex situation than I think a lot of
analysis historically has acknowledged.
We'll see where it ends up going.
And they are a nationalchampion, so the CCP can always
force people to buy from them.
But it's, it's an interestinginteresting scene.
Right.
And also mentioned in this article,and I think worth noting some

(42:16):
companies like Bite Dance and Tencenthave significant business outside
of China and the US is cracking downmore and more issued guidance that
basically says, don't use Huawei chips.
So if you are a more globalized companybased in China, that's even more
reason to prefer Nvidia over Huawei.

(42:38):
Our next story is sort of related.
Actually.
Huawei expected to break semiconductorbarriers with development
of high-end three nanometer,GAA chips tape out by 2026.
Okay, so GAA is gate all around.
This is a transistor designthat is becoming really popular.
It's a way of essentially making thetransistors that form the, the critical

(43:00):
circuits, the number crunching circuits onGPU logic die more energy efficient have
higher throughput, all, all, all kinds ofdesirable thermal properties, et cetera.
So.
Essentially what's happening rightnow is the the three nanometer process
that for example, TSMC has developeddoes not actually plan to use GAA.

(43:24):
So it's not gonna be agate all round process.
Huawei is accelerating towardsGAA, that's the plan here.
Essentially skipping a generation,which you kind of have to
do if you're, if you're theunderdog and trying to catch up.
but the challenge is right nowit's, it's not really clear
that they can pull this off.
You know, they're seven nanometers, theirfive nanometer, and even their seven seven

(43:46):
nanometer process that they get throughSMIC that we just talked about, that sort
of Chinese TSMC has really bad yields.
Seven nanometer yields are somewherebetween 15 and 50%, which is, I
mean, industry standards, like 90%.
anyway, so, so it's like they'remajor economic challenges, but
if they can somehow do that,that would be really interesting.

(44:07):
It would be a big leap.
The only other gate all aroundfocus design for three nanometers
is being done at Samsung Foundry.
So this would literally be the, thefirst non Samsung Foundry process.
If in fact it is non Samsung, ifthey're doing it through SMIC,
which again would be kind of weird.
It's also possible this implies acollaboration with Samsung Foundry, which

(44:29):
would be really weird because Samsungis of course based in South Korea.
So this would, you know, be interestingfrom an export control standpoint.
You know, can this actually work?
But, but anyway, so Huawei hasbeen known to make optimistic
kind of pro pronouncements aboutthe future of their technology.
Hey, we'll have all theseexciting things that don't quite
end up taping out, if you will.
we'll see.
But three nanometer gate allround would be a big deal if

(44:51):
Huawei can actually crack it.
Yeah, not much to add.
All I'll say is if you Google G allaround and look at the images, some really
fun illustrations and electron, my, mymicroscopy images, and you get a feel
for these poor computer engineers andsemiconductor experts, you, you need to go

(45:12):
3D and build these elaborate structures.
Now, just to be able to go intothese load nanometer regimes
and actually make chips work.
And speaking of that, next we'vegot a story about TSMC in their 1.4
nanometer process, which is calledAngstrom, which is making progress.

(45:33):
It's still not out.
It's expected to be available by 2028.
And according to the story, it'sestimated to cost $45,000 per wafer at
50% increase of over the two nanometerPRO process, which is 30,000 per wafer.
So yeah, that's pretty much it.
It is gotta be very expensiveto use the really lowest, like

(45:57):
most high density chips that arecoming online in the coming years.
Yeah, so 1.4 nanometer, they're, they'recalling it angstrom, which is like,
you know, slightly frustrating becauseit's not quite an angstrom is it?
but that's cool.
this is the, the next beat.
Yeah.
50% more expensive.

(46:18):
Apparently 2028 is gonna bethe earliest production run.
So if a AI 2027, that that sort offamous blog post ends up being wrong and
2028 ends up mattering we'll probablysee in 2029 some, pretty impressive
rollouts of the next generation ofNode and the chips designed on it.
so this is by the way assessing.
If there's a company that wouldwant a first crack at this angstrom

(46:41):
process, it would be Apple.
I would just say, we've been saying thison the podcast, do not take your eye off
Nvidia, which by the way, is literally theworld's most valuable company right now.
As AI chips become more and morevaluable relative to phones, expect
at some point that Nvidia starts tomake moves to compete for the leading
node to essentially buy out Apple,all of t SMCs capacity and kind

(47:04):
of become the subsidizer of choicefor TSMC for their leading nodes.
I actually think that couldhappen sooner rather than later.
There are indications it'salready sort of in the works.
So anyway, that, that, that wouldbe a pretty significant shift in
tech and, and the day that happens.
We'll definitely be talking about it here.
Fun fact, angstrom is 10 to thenegative 10 meters or 0.1 ERs.

(47:26):
So as you said, not really anaccurate name at all, but yeah.
Yeah.
Sounds good.
Sounds fun.
And last story.
Coming back to Mistral, they're launchingMistral Compute which is a cloud
offering for compute for AI that is gonnatry to compete with other offerings.

(47:51):
I suppose these days, AWS isstill one of the leading ones.
You have also newer competitorsin a space like modal.
So Misal, again, continuing to tryand kind of on every front provide
a European version competitor toofferings both in China and the us.

(48:11):
And they are coming at this froma position of, of less money, less
talent you might expect or might argue.
So we'll see.
It, it the main kind of.
Analysis of where advantages, I thinkI agree with you is the deposition
as a European leader in the space.

(48:34):
Yeah.
Yeah.
And, and in particular it's no smalldeal that they're based in France.
You know, you think aboutwhat are the big bottlenecks?
We talked about this right?
In, in the United States.
It's energy, right?
Everybody's trying to figure out wherecan I find a spare gigawatt on the grid.
It is not easy.
You know, even 30 megawatts you like,you can find it, but it's going fast.
And so in France where they have.

(48:54):
Really the, it's the only Europeancountry, the only western country that's
been doing nuclear this whole time wherethey can actually build new nuclear
plants in less than 10 fricking years.
You know, they can support this and,and now they're reaping the benefits.
The, the scale that's being talkedabout here for Mera compute, by the way,
is tens of thousands of GPUs they saybuilt on Nvidia reference architectures.

(49:16):
And so this, I assume that they,they must be looking at this point
at like GB two hundreds you know,tens of thousands of those, I assume.
And they're saying that they'llbe supporting workloads ranging
from defense to drug discovery.
Okay.
National champion much, right?
This is the kind of workload that smellsa lot like you know, preferred partner of

(49:37):
the French government, which by the way.
Also from a red tape standpoint, ifyou're trying to set up a new scaled data
center not only do you have the massiveenergy supply that the French enjoy,
but you also have the support of thegovernment to cut red tape, especially
environmental regulations that allowyou to get things up and running faster.
And so these things do sort of stackup in, in very interesting ways to

(49:59):
like compete another day, let's say.
But I think their fundamental challengeis gonna be capitalization, right?
That's always how it's gonna be.
you can't compete forever withcompanies that will raise, you know,
tens of billions of dollars on ahundred billion dollar valuations.
Like not even taking that much of aliquidity hit and raising from sovereign
wealth funds and this and that.
It, it just does becomereally challenging.

(50:20):
And the, you know, the Frencheconomy just isn't that big.
So yeah, if I were France,this is what I'd be doing.
But that doesn't mean that theynecessarily have a, a winning hand.
Yeah, as you said in this blog postof Air, they are literally saying the
offering will include these trials,AI training suite that can accelerate
region and domain specific effortsacross nation and industrywide endeavors.

(50:45):
So yeah, calling out some ofthat champion kind of stuff.
And I will say it's a little bitdifferent, open AI and ro they're
not offering this much of a cloudkind of architecture for training
and serving and whatever else.
Yeah.
And it is rather specialized, Iwould assume this came outs to.

(51:13):
Setup for compute to be able to do this.
So I do think there is a decentchance that they have some good
technological aspects here that mightmake it actually quite a good product.
And next up, moving to opensource, we have one story Pro rl.
And for whatever reason I keepsaying pro pl, every time we talk

(51:34):
about it offline, pro rl, prolongedreinforcement learning expands reasoning
boundaries in large language models.
Bit of a mouthful, buthey, aren't they all?
So there's this idea that the RLprocess itself just optimizes existing
capabilities in large language models.
Basically, it's like you have yourpre-trained model and it already

(51:56):
kind of has all the capabilitiesthat reasoning model should have, and
your reinforcement learning processjust elicits those capabilities.
It it bubbles them upto the surface, right?
So what they're after here is toshow, actually that's not the case.
What we can do is.
is imbue the model with completelygenuinely new capabilities
that were not there before.
And they have a couple of ideas thatthey stack together to just like optimize

(52:20):
the reinforcement learning process.
One of which is this idea ofthere's a, a callback labeler.
Divergence.
So this is, essentially a way of measuringhow different two different distributions
are and like probability distributions.
And so what's often done during trainingis you'll have a model that's being
trained and you'll have some kind ofreference model where you don't allow

(52:42):
the model under training to deviatetoo much from the reference model.
The reason for this often is that ifyou just let the model go hog wild and
get trained on its own to whatever.
It will end up being, that model willlearn to kind of optimize very narrowly
and unhelpfully over-optimize to theobjective that it's being trained for.

(53:04):
So in the limit, the classic exampleis if you let these models get fine
tuned for too long without a kind ofregularization they'll end up like no
longer speaking English or they'll end upyou know, kinda re really rigging their
becoming sycophantic or, or, or whatever.
And so you just have this referencemodel to keep pulling it back to reality.
And there've been arguments that,that this KL divergence penalty

(53:28):
is a bad thing that you actuallyshould just get rid of it.
A lot of those arguments are based onlooking at base models and like before
the supervised, fine tuning stage inthe context of reinforcement learning.
And what you find there is theirperformance actually doesn't get so good
if you keep enforcing that they haveto be similar to the reference model.

(53:49):
But what they're showing in this paperis actually if you do supervised,
fine tuning first to let the modelget good enough at reasoning.
At that point if you then use that asthe reference model, you actually do
find that the KL diversion strategymakes sense that regularization strategy.
So that's one thing they did.
They also did this thingcalled reference policy reset.

(54:09):
So as you train your model again, you'vegot that reference policy, so it's
not allowed to deviate too, too much.
But then you'll update yourreference policy to match whatever
the model under training currentlyis, and then you'll proceed.
So you're basically using thereference policy as a, a kind of
drag on the model under training.
The model under training does a budgettraining, it can't deviate too much,

(54:31):
but then you update the reference modeland now you can start training again
and, and you can deviate a little bitmore, but not too much from that one.
So it has a way of sort of.
Slowing down the deviation fromthe reference model, but not so
much that you're eternally lockedinto the original reference model.
And that turns out to help a lotwith training stability while also
allowing you to kind of recover a lotof these new capabilities that come

(54:54):
with with reinforcement learning.
And so they have a huge data setor a bunch of different STEM logic
puzzles instruction following.
Data tasks.
It's like 136,000 problems in mathand code and all kinds of stuff.
They also have an en enhanced versionof this GRPO algorithm, which you might
remember for our, from our discussions ofdeeps seek, it's become really popular.

(55:16):
Just sort of a way of of stabilizingreinforcement learning training.
quickly gets into the weeds, but yeah,bottom line is they're, they're borrowing
a lot of stuff from, other papers like,dapo, which is like dynamic sampling
and augmented policy optimization.
That, there are you're basicallyfiltering out prompts to only
keep the ones where the model.
Sometimes succeeds and sometimes failsso that they're like hard enough that

(55:39):
the model's gonna learn somethingby training on them, but not so hard
that it's just hopeless and the modelnever even gets a reward signal.
So there's all kinds of shit.
it's actually quite aninteresting collection of shit.
The shit links together in interestingways to make a little shit chain
and together that is pro rl.
Not how I would've described it, but okay.

(55:59):
Yeah, some interestinganalysis in this paper.
It's a family show.
Yeah.
I don't know what kids enjoy last week.
That's right.
I hope not many.
Yeah.
We have some analysis aboutthe question of prol eliciting
new reasoning patterns or not.
They basically make a point that thereare tasks on which the base models

(56:22):
are already pretty good, and therethe gain is not significant, but there
are other tasks where the gain issignificant if you train long enough.
And I just wanna call out, you're notgonna be going into detail on the story,
but Magistral alongside the model.
Mr did release a report on it, apretty detailed, like 18 page paper.

(56:47):
And they did also highlight somedifferences in their loss for GRPO,
including the elimination of KL Divergenceas a penalty and, and some other stuff.
So, very much a lot of explorationgoing on and into the right setup for
RL training and including the loss andRL in general is, is a big headache.

(57:12):
So I guess not surprising that there'sa lot of things that are being figured
out over previous and, and even nowas people are diving into RL as a
very prominent research direction.
Next up research and advancements.
We begin with kinetics,rethinking test time scaling laws.

(57:36):
So there is a new proposal for testtime scaling that incorporates memory
access into the calculation of the cost.
So this is a different wayto calculate the scaling law,
basically for test time scaling.
And in this new way of evaluating thescaling with updated cost, they argue

(58:02):
that prior scaling laws have overestimatedthe effectiveness of small models
that have inference time strategies.
They're basically saying that increasingmodel size up to 14 billion parameters.
Is more effective before applyingtest time strategies, like best
event sampling and chain of thought.

(58:23):
So basically, instead of running yourmodel more after training for smaller
ranges of models in like 10 billion range,just make your model bigger instead of
doing more inference on it if you can.
Yeah.
This is a really interestingkind of compute aware.

(58:43):
Not compute aware memory, bandwidthaware way of doing things.
So historically, when we talk aboutscaling laws, right, you, you'll see these
plots, you know, what do they look like?
Well, you usually have flops likecomputing budget on the x axis, and you'll
have some measure of performance on theY axis, and then you'll see your nice
little log plot and everything is good.

(59:04):
The problem is that flops, like, likethe actual mathematical operations that
go into training a model are only onepart of the hardware picture, right?
So GPUs, yes, can crunch a lot ofnumbers really fast, but they also
have to move data around, right?
Like that's one of themost time consuming things.
One of the big bottlenecks now isjust like, how fast can you, can

(59:27):
you move the data around, not justcrunch the numbers, but shift it from
memory to logic and back, and thento other memory and things like that.
And so what they're trying to do here is.
Redesign a scaling law thataccounts for that for two.
In other words, two measure two metrics.
One is flops as in the traditional computescaling curves, but also memory bandwidth.

(59:49):
And this is really where, or, or sortof memory access cost, which accounts
for, the bytes of memory that need to beaccessed here, the memory picture, right?
And so they're actually gonnacombine them both into one metric,
they call it the e flop or e flops.
And it's this essentially mathematicallyit's the computational cost of training,
the model plus the memory access costthat, that essentially accounts for

(01:00:13):
the, the memory bandwidth requirementsand other things that go into it.
Times the intensity, which is, huh?
A hardware specific ratio of computecapacity to memory bandwidth.
Basically this is as you canimagine, this would depend
heavily on your hardware fleet.
Like what does your hardware actuallylook like is gonna determine.

(01:00:35):
In practice, what your ideal numberof parameters should be, what
your ideal architecture should be.
And so this is part of the reason thatscaling laws, by the way, always were
framed in terms of flops because themoment you kind of try to balance flops
and memory bandwidth, pretty soon youstart to almost simulate a data center.
And like you're, you're gonna have tohave like, all kinds of resolution.

(01:00:56):
And, and that just makes itreally hard, not least because.
Then people will go, okay, well that'show it plays on that data center,
but what if I changed my data centeraround and now we've got a different
scaling curve and, and just it becomesimpossible to do apples to apples.
That in fact, is one of thechallenges with this paper.
It only uses a kind of referencearchitecture associated
with the Nvidia B 200 GPU.

(01:01:17):
So they are assuming those specs hold andyou're seeing the scaling laws for that.
It does not look at different effectivelydifferent scaling laws on different
accelerators from like a MD or Intel orin, or other Nvidia chips or different
networking or interconnect configurationsor different memory h hierarchies.
None of that.
So feel, you know, think of thisas kind of more of a vibe thing,

(01:01:38):
but in terms of like what we canlearn from this, I think there are
actually some really cool things.
So, you know, in practice when youscale up a a transformer architecture.
What you'll tend to do as adeveloper is you'll increase the
size of the MLP layers, right?
So much faster than the thescale of the attention mechanism.
So you, you could scale the attentionmechanism, you could increase the number

(01:01:59):
of attention heads the head dimension theembedding dimensions, all all that stuff.
But people tend to practice tojust increase the scale of the MLP
layers that sort of do the logicinstead of the attention piece.
Now, the intuition that a lotof people have is like, okay,
well that shouldn't matter.
So because we're just, we're just gonna bescaling the MLPs, they already represent

(01:02:21):
the line's share of the compute andparameter account to begin with, right?
So, so surely the MLP layersare already the bottleneck.
So the fact that the attentionmechanism is scaled more slowly, well,
that, that shouldn't matter, right?
But here's the catch.
The MLP layer the computerequired to scale your MLP layer.
It scales with the lengthof your input, right?

(01:02:41):
So double the length of the input.
Roughly speaking, double the amount ofcompute that your MLP layer will consume.
Fine.
But as you increase the size ofyour input, the attention, memory,
bandwidth requirements, scale withthe length of the input squared.
So in, in other words, very rapidly,as you scale the length of the
input attention, the, the memorybandwidth pieces start to become

(01:03:06):
the rate limiting step and youroperations become memory bound.
Because, you know, you're, you're,anyway, you're bottlenecked
by the attention layer and so.
This has become more and more of anissue because the length of inputs
and outputs is getting greaterand greater and greater, right?
With these kind of best of end schemes,inference time, compute reasoning, all
that stuff you're seeing, your inputsand outputs get longer and longer and

(01:03:29):
longer, which means that bottlenecksthat scale with the square of the input
length quickly overtake bottleneck's,scale just linearly with the input length.
And it turns out that memory bandwidthis, you know, scales with the square.
And, and that's why werun into this problem.
And so, anyway I thought really, reallyimportant paper, if you're interested
in understanding the consequences ofhardware choices for model architecture.

(01:03:53):
thought, I thought this was actually quitefascinating and something that I just
haven't seen other people dig into isthese more nuanced scaling laws, right?
Yeah.
The very first sentence in abstract,they're saying, we are coming at this from
a practical efficiency perspective and.
To your point of what is andX axis, they're very direct.

(01:04:14):
They say B 200 seconds.
So Andre B 200 GPU, whichis the leading edge.
Instead of looking at computation, weare looking at the literal amount of
seconds to get some level of accuracy.
Lots of really goodanalysis in this paper.
We also have a really nice blogpost, and I feel like we often call

(01:04:35):
out when papers come from Appleor, or DeepMind or philanthropic.
So worth mentioning this is fromCMU, like a fully university work.
Also, the two leading offersare immigrants to the US system.
So we should get into it.
But I do wanna say with some of thepolicies about you know, grad students

(01:04:59):
and in general kind of taking ingrad students from other countries.
You look at these papers and it, itmakes me feel a little depressed.
But anyway, moving on.
The surprising effectiveness of negativereinforcement in LLM reasoning this is
looking at RLVR reinforcement Learningwith verifiable rewards in two paradigms.

(01:05:28):
You got positive sample reinforcementand negative sample reinforcement
where PSR focuses more onreinforcing correct responses.
NSR negative sample reinforcementemphasizes penalizing in correct ones
and it seems that you can do positivesample reinforcement only and negative

(01:05:49):
sample reinforcement only training.
And, pSL only positive, only improvespast one, but reduces higher past 10.
So basically it, it reduces, if you havea few opportunities to get it right,
you're not necessarily gonna do well.
And that's because there seems to be aloss of output diversity versus negative

(01:06:12):
only apparently is able to improveperformance across all paths at K metrics.
So not just one trial, but neseveral trials, meaning that it, it
might be better to do, to focus onpenalizing incorrect outputs over,
encouraging it to do the same stuff.
That seems to work.

(01:06:32):
Yeah, it, it's actually, I'm surprisedat how intuitive at least this result
seems to be where you'd imaginelike if you were being trained.
to do any complex task.
And the way you're being trained is not bybeing told when you did something right.
But just when you didsomething wrong, basically.
What this has, this has a way of, nottelling you how to do your job, but

(01:06:55):
to tell you how to not do your job.
And that means you're gonna be morecreative if the reinforcement tells you
like, here's the right answer you know,do it like this versus don't do it the
wrong way, then that's a, you know, verydifferent kind of reinforcement process.
It's a little bit difficult to analogizebecause it's, it's post hoc, right?
So imagine that you, you try totask and if you did it right we just

(01:07:20):
wipe your brain and, and you, youhave no memory of doing it right.
But if you did it wrong, wetell you, Hey, you did it wrong.
Like, that's kind of what we'redoing with these models with
this sort of architecture.
Which is really interesting and theresults do bear out that you get
more diversity of of sort of moreexploration oriented models rather
than exploitation oriented models.
Because what you're really doing isyou're redistributing probability mass

(01:07:42):
to plausible strategies rather thanconcentrating all your probability mass
into the small number of highly kind ofcorrect, observed, correct paths, right?
Because this is, this is one ofthe things with, with RL is like,
you're not going to get to observeall the correct paths, right?
You're not also not gonna be able toobserve all the, the incorrect paths.

(01:08:03):
But at least by, you know, by not callingout the, the correct ones and saying do
it more like that, you're leaving it thepossibility space open for the model to
pursue kind of alternate correct ones.
So anyway really interesting.
One question that, that came to mind,like, as I was reading this, I was like,
well, you know, wouldn't you run intoa problem where over time if your model

(01:08:24):
gets, gets better and better at a task.
You just sort of can't find enoughnegative samples in a batch for like,
for GRPO and, yes, this is actuallyan issue and they, they call it out.
So they frame it as a feature and not abug, which I. I think it's somewhat true.
And then there's some stresses.
So they point out that does preventoverfitting because you just

(01:08:46):
won't get updates once the modelreally masters the problem set.
So you, you won't keep, you'll justlike run out of failure cases and
so you won't over optimize the modelto overfit, which is really cool.
The flip side though is it's kindof compute inefficient, right?
Because you have to then doa lot of rollouts that don't
yield any trainable data.
And so I think from a compute optimalitystandpoint, you're also taking a bit

(01:09:10):
of an L. So they actually suggestthis kind of like middle ground
strategy they call weighted reinforce,where you still use some positive
reinforcement at as they put a 10%strength to ensure continued learning.
But you're gonna use full strength,negative reinforcement learning.
So really lean more towardstelling the model not to do things.

(01:09:31):
And with a little bit ofguidance about how to do things.
So anyway that kind of helps 'cause you'reretaining some of those positive examples.
But again, from a computeoptimality standpoint, I think
it's sort of, it'd be interestingto see how this ends up scaling.
Yeah, this is one of the somewhatnuanced aspects of reinforcement
learning to actually do goodreinforcement learning, you need to
model the reward for any given output.

(01:09:53):
And to do that, you need tobe aware of both positive
rewards and negative rewards.
So it's interesting to focusmore on a negative rewards.
Basically their weighted, reinforceup weights the negative aspect, and
they compare this weighted reinforceagainst standard G-R-P-O-P-P-O,
these other RL training setups fortheir own objective and losses.

(01:10:18):
And it looks like from their resultson QU 2.5 worth noting, all these
reasoning model papers are lookingat a particular model, which.
May not be ideal, but anyway withthis weighted inforce setup seems to
be better than GRPO and PPO, which ispretty significant since GRO is often

(01:10:40):
what people are exploring in thisresearch, like I mentioned previously.
couple more research papers.
Next up, we have predicting empirical AIresearch outcomes with language models.
So that's pretty much what it sounds like.
You wanna try and predictwhat will happen to in a given

(01:11:02):
experiment with a language model.
They created a benchmark here byscraping ideas and results from
conference papers and wound upwith around 1500 test examples.
And then with a whole system withfine tuned GP 4.1 and paper retrieval,
they were able to get 77% accuracyon the test set at being able to

(01:11:27):
perform the prediction significantlybetter than off the shelf performance
just by baseline existing models.
So pretty good results.
They say it.
Outperforms a human expertbaseline on NLP idea pairs.

(01:11:48):
But you know, it's, it's still,let's say, nascent and, and this is
an interesting idea, but definitelya nuanced area to look into and,
and requires careful extrapolation.
Yeah, it's, it's one of those areastoo where people often talk about,
AI models the, the big advantage isgonna be in having good taste regarding

(01:12:08):
the problems that we throw them at.
this is an example of AI modelsactually developing taste.
The automation of taste itself, right?
Research, taste if you can predicthow likely a given idea is to pan
out, that's sort of the idea here.
So the way they do it inpractice is they're gonna go
within a given paper, right?
You often see multiple methods usedto achieve the same goal, right?

(01:12:29):
And you can imagine how hard it would be.
Like they're not gonna go andgrab two different papers that try
to do similar things and predictwhich one is gonna work better.
'cause it's impossibleto get apples to apples.
People like use differenttraining strategies, different
data, like all kinds of shit.
So what they're gonna do issame paper, multiple methods.
They're gonna extract pairs ofthese essentially experiments in

(01:12:51):
the papers that compare differentapproaches and that's what they're
gonna use to construct their dataset.
So that's kind of moreappropriately calibrated kind
of apples to apples comparison.
And so in that sense it's a little likeit is predicting it AI research outcomes,
but it's not quite the same as havinga new research hypothesis from scratch.

(01:13:11):
Like it's not at the paper level,like, alright, which, you know,
which paper should I, whichpaper level should I explore?
Yeah.
Predicting is, is maybea little misleading.
It's comparing two potential ideasand predicting which one will get
a higher number on a benchmark.
And so it's a binary prediction slightlyeasier setup and saying like, if I were

(01:13:33):
to try this idea, what would I get?
Yeah, exactly.
I think in order to do it at the paperlevel, which is the most interesting
thing, you'd probably need a very complexsort of data filtering and shaping
approach where you, you try to get it tobe apples to apples as much as you can,
and then, you know, feed into a model.
But the, the interesting thinghere is like you called it out,

(01:13:54):
this model, this sort of fine tunemodel does better than O three.
Models like O three perform nobetter than random guessing.
And so when you're looking at 77%accuracy on this benchmark of predicting
kind of, which of two ideas is gonna dobest, obviously random guessing is 50%.
So that's quite a lift.
Bears mentioning that it achieves about64% accuracy on unpublished novel ideas.

(01:14:19):
So there's some amount of overfittinggoing on here where we're getting, you
know, 77% in the sort of like test case.
But then when they actually triedon, on these new ideas that are
unpublished it goes down to 64%.
It's still still much better than 50 50.
But yeah, pretty remarkable.
The other.
The other funny thing is ifI'm interpreting this right,
says, they beat human experts.

(01:14:40):
Human experts scored 48.9%, which isslightly worse than random guessing.
if that is apples to apples, ifit's just a side by side thing.
So that's kind of, I amusingin and of itself, like humans
kind of suck at this themselves.
And they are really getting somesort of, some sort of lift from
their fine tuning approach here.
Like if they're going from 50%to 64, that's, that's not tiny.

(01:15:02):
and one last paper also related to ai.
Contributing to research.
In this case it's called EXP Bench,and it's focusing on benchmarking AI
agent's ability to conduct end-to-endresearch experiments, also using
tasks from published research.

(01:15:23):
So here they looked at peer reviewedAI publications from Europe's and icl.
They created this benchmark of461 research tasks from 51 papers,
and they basically show like, canVisa AI agents do the experiments

(01:15:43):
introduced in these papers?
And what happens with publishedpapers is usually ideally they publish
their code so you can replicate theexperiment and get the same output
and, and replicate the whatever.
Tables of numbers you get.
So that kind of gives you a richsignal as to how you want to set up

(01:16:05):
your experiment, how you wanna ideallybe able to replicate the experiment.
And so this is making it possibleto evaluate whoever AI agents are
able to that and they struggle.
Is, is a summary on whether they're ableto implement and get things correct.

(01:16:25):
Yeah, I, I will say we're getting to thepoint where the benchmarks that we're
designing are so hard that once youactually do saturate these, like, I mean,
what does the world look like when you'rehitting 50% on expert bench, like 50%
success rate for end-to-end automation of.

(01:16:47):
The process of formulatinghypotheses, designing and implementing
experimental procedures, executingthem, analyzing the results.
That whole end-to-end, like that's,not far from automate, like fully
automated ai r and d, right?
That, that's kind oflike at the model level.
There's obviously a bunch of hardwareand network optimization, jazz
that like independently openingAI is working on internally.

(01:17:08):
But, what does the world look likewhen you're actually saturated that,
that's worth asking right now when youlook at O three mini, which is the best
model they tested overall, you know, Othree PRO was not out at this time, you
know all that, but 1.4% or, or six orseven out of 461 tasks that they tossed
at it were completed successfully.

(01:17:29):
So.
One read on that is 1.4%.
Wow.
That's really small.
Another read is like, wow, we're actuallygetting like a complete success rate,
end to end of like between one and 2%with our best model today in a context
where new models are coming onlinelike, you know, every other week.
So yeah, I don't know, butthis may be a bigger deal.

(01:17:52):
It's, it's, that's a pretty big 1.4%.
At least in my mind, I. Right.
And, and to give you an idea ofwhat is involved, the inputs include
a research question they have anexample, does the Monet architecture
outperform existing lightweight models?
They have a high level method onthe experiment, train the Monet

(01:18:14):
variance on ImageNet one K forblah, blah, blah, blah, blah.
And they give it some starter code withpotentially additional instructions.
And the job of AI isn't to do the researchper se, it's to set up the experiment and
run it and get the results, which meansit needs to correctly configure the code,
add stuff to the code train it for thecorrect amount of time with the correct

(01:18:39):
parameters and, and evaluate the outputsto get the actual output of experiment.
So yeah, if you are able to dothis there's a lot of conversations
around speeding up researchand speeding up AI research and
eventually possibly automating it.
We've seen research on I,ideation idea creation.

(01:19:00):
We've covered some of these papersand we've seen some progress there.
This is really covering theopposite side where you are able
to evaluate ideas with experimentssomewhat pretty closely guided.
So I will say in the inputs it's toldkind of exactly what to do, but still

(01:19:22):
it's, it's a very useful kind ofthing to be able to speed up research.
I guess it should said like,automating the implementation phase
of AI research, which actually isan interesting compliment, right?
To the idea that we just saw earlierof trying to automate taste in language
models, both sort of at nascent phases.
But, you know I'm old enough toremember when humanity's last exam was

(01:19:43):
supposed to be a very hard benchmark.
And before that, when GPQA andbefore that, when when even, yeah.
Anyway, there, there's likea million of these, right?
It starts at, it starts at 1% and thenvery quickly we, we start saturating.
So yeah, these darn kids startingvia PhDs don't even have to do code.
All the code is done by Claude and all theexperiments are gonna be done by agents.

(01:20:07):
And moving on to policy and safety.
Our first story here is largelanguage models often know
when they're being evaluated.
And so essentially this is a a testto see, you know, how how can we
measure looking at nine differentfrontier models from all the anthropic
open AI and DeepMind families.

(01:20:29):
They're gonna ask them multiplechoice questions, asking about a,
a bunch of transcripts basically.
So they're, they're chat transcriptswhere you have interactions
between agents and humans.
You have evaluation deploymentscenarios all, all kinds of kind of
you know, transcripts anyway thatyou feed in in different contexts.

(01:20:51):
And then you ask the models.
These questions about whether thetranscripts came from evaluations
and, and this is a basic test to seecan these models grok like, okay,
I'm probably I, in this situation,I'm probably being evaluated.
And the kind of safety implicationsthere are, you might expect.
Alignment, faking, you might expect allkinds of anyway funny behavior where like

(01:21:14):
sandbagging, where the models pretend notto have certain dangerous capabilities
if they know they're being evaluated.
So then they increase the chancesthat they'll be deployed and have
more freedom of action and sort oflike standard power seeking picture.
So what they find here is, yeah,actually, you know, frontier
models, they show definite aboverandom evaluation awareness.

(01:21:34):
So Gemini 2.5 pro has ana UC of 0.83 actually.
Now the human baseline,they run very helpfully.
It hits an a UC of 0.92.
So not quite a human baselineyet, at least for Gemini 2.5 pro.
But but significant upliftover over random and both AI
models and humans are better.

(01:21:55):
At identifying evaluations in age agenticsettings compared to chat settings.
So that's kind of interesting.
That may also be just becauseoften age agentic setting
evaluations are more contrived.
Like you can just sort of telllike, oh, this is a toy app.
It's not like a, you're notreally trying to build a company
here, or, or a real thing.
This seems like the kind ofthing that you would ask a model

(01:22:16):
to do as part of an evaluation.
So that's, you know,that's sort of part of it.
But hey, that's, that is part ofthe problem of evaluation is like,
it is also hard to come up withwith evals that are realistic.
So anyway yeah, thisis pretty interesting.
there are a bunch of different cuesthat the models were found to use.
Things like having very kind of formaltask structures and specific formatting

(01:22:39):
requirements seem to, to be cues thatthe models use to go like, oh, okay.
Or, or synthetic content.
so you know, you can see how thatwould work and you can also see how
the human reviewers would also thatwould also kind of pick up on that.
But the other thing too they found isthat the human versus AI performance
gap depends a lot on the setting.
So in an agentic setting, the humanability to identify eval transcripts

(01:23:04):
tracks an a UC of 0.996, whereas.
It just hits 0.813 on chat transcripts.
So humans are almost perfect atidentifying age agentic evaluations,
but we suck at chat based onesor we struggle a lot more.
And that's a lot higher thatvariation than any models
variation across those settings.
And so it's pretty clear thatmodels aren't fully taking the hint.

(01:23:26):
They're not picking up all of thehuman accessible hints, at least
that, at least in age agentic modethat you're in an evaluation setting.
So that's kind of interesting.
A bunch of other stuff here, butwe're enlightening around territory.
Oh no, we're not enlighteningaround territory.
Anyway, I'll pause it there.
Andre, if you have any thoughts,you've got, you've got a
decent amount to go through.
So yeah, we need to go.

(01:23:46):
I will just mention a UC is area underthe curve for anyone who doesn't know.
Basically comparing true positive, likeit actually is evaluation and you guess
it correctly versus false positive.
You guess it wrong.
Basically how accurate youare with a bit of nuance.
They also evaluate in two ways.

(01:24:06):
One of them is multiple choice andthe other is open-ended as, and as you
might expect on multiple choice whenyou're given the option explicitly.
Is this for evaluation or something else?
You're gonna do a lot better thanif you're just being open-ended.
So multiple choice is like,what was the main test here?

(01:24:28):
Was it capability, alignment or other?
And the system's gonna predict alignment.
And then open-ended is, would bewhat was the primary reason for the
interaction and it would predictsomething like the test of AI boundaries.
So as you said, the implication hereis if you're testing alignment and
safety, the model, if it's aware ofthat, could mislead you and pretend

(01:24:52):
not to be able to do hacking whenit actually can, for instance.
Next up more of an interoperabilitypaper called Beyond Induction Heads.
In Context Meta Learning inducesmulti-phase circuit emergence.
What a fun title.
So this is looking at the mechanismbehind in context learning.

(01:25:16):
In context learning is the ideathat you give it a couple examples.
The model is able to do something thatit isn't necessarily able to do out
of a box just through pre-training.
And they are saying that thisnotion of induction heads, this
is a term from philanthropic.
I think originally it's patternyou get in models where.

(01:25:41):
Basically a part of a model focuseson looking backwards in the input
to identify some things that alreadysaw that's similar to what it's
currently looking at and be able topredict what comes after the current
input based on previous patterns.
So they say that in action, handsonly, partially explain ICL.

(01:26:05):
essentially there's a fancy circuit, afancy abstract mechanism in the model
that emerges and that enables metacontact in context learning beyond the
kind of known induction head mechanism.
There's even a fancier kind of abstractnotion of something with a model

(01:26:25):
that does in context learning well.
this is sort of a generalizationright, of, of induction heads and we
talked about the induction head bumpbefore, but it worth kind of reminding
people about the, the specifics here.
So it's, it's kind of like the,the answer to this problem.
You read on a piece of paperthe words United States of.

(01:26:47):
And then like you, obviously youinstinctively know it's America, right?
But in that setting, there's a circuit inyour brain that's going like oh, oh, oh.
Like I've seen this before.
United States.
Of, United States oflet me see, lemme see.
Where have I seen United States of before?
Oh yeah.
America.
America.
Okay, I'm gonna put that in there.
Right?
That's what the inductioncircuit induction heads do.
And they emerge quite early, as youmight imagine in the training process.

(01:27:10):
And so what you'll see is the losscurve will drop and drop and drop, and
then at one point the model will kindof like, it's almost like it's gonna
like shift its position a little bit toaccommodate the indu induction heads.
So you see this little rise inthe loss, the, the performance
on paper gets worse very briefly,and then it drops quite quickly.
So the induction head bump is that, it'sthe development of this new circuit.

(01:27:32):
And this is something that'sbeen very extensively studied.
It's almost like you know, if you'veever done biology like drosophila and
Melanogaster or whatever those likemodel organisms are, this is a model
circuit that people turn to quite a bit.
this is an attempt to see if wecan find a more complex version
of that same basic circuitry.
So, for example they take a setof three different tasks where you

(01:27:57):
have a bunch of geometric shapes.
So triangle square circle, diamond, right?
Depending on the task, you canend up assigning different color
labels to each of those shapes.
So maybe in a size based labelingtask you know, triangle is red, square

(01:28:17):
is blue, circle is green, right?
Maybe in a, a different tasktriangle is blue, square is green,
circle is yellow, and so on.
And then during training, the model isgonna see a sequence where you go, okay,
now triangle is blue, square is green.
Circle is yellow, what is diamond?
And in order to do that, the modelhas to basically look at the tasks in

(01:28:40):
context and figure out what task thisis, and then predict the correct label.
And so you can sort of see how this isa bit like the induction head, right?
It's, it's looking back more abstractlynow at like the set of tasks rather
than just like, okay, what wordalways comes after this word Instead?
It's like, okay, if it's this task, thenwhat word always comes after this word?

(01:29:02):
And so anyway, it's unlike theselike simple copying tasks that
you see with the induction heads,there you see a a single jump in
accuracy, in in context meta learning.
with this sort of setup, you end upseeing three distinct phases where
the model develops increasinglysophisticated strategies.
The first one is just at the verybeginning where all the model is

(01:29:23):
essentially using, its like statisticalunderstanding that's been picked up.
It doesn't really use context.
It's more of an auto complete mode.
And then in the second phase,they have a semicon circuit where
accuracy jumps from about 35% to 75%.
And what it's now doing isit's actually able to attend
to label tokens in the context.

(01:29:45):
So it's actually gonna look, you, youcan notice it, paying attention to the
right tokens in the, the context thatyou fed it, looking at the actual tasks
that seem like they map onto yours.
But it, it is still focusedanyway, on, on the query.
Bottom line is this starts toemerge gradually and in layers
which is interesting from aninterpretability standpoint.
It means you can kind of draw a little bitof a box around the process by which more

(01:30:09):
sophisticated reasoning starts to emerge.
Right.
Worth noting.
This paper is doing the research onsort of toy tasks, a small neural net.
And, and this one task, as you saidwhich is also how initially the
research on induction heads worked.
Andro did follow up their initialresearch with making the argument that

(01:30:31):
there are industrial heads in giganticneural nets and large language models.
Here they're still focusingon a small scale scenario.
And so this like multiple bump analysis.
May not necessarily extend, but it's,it's a sort of, yeah, slightly more
theoretical, conceptual argument thatit's not just about induction heads.

(01:30:54):
There's different types of emergencethat might occur in neural net training,
which in general is interesting becausethe sort of jump and loss due to a
conceptual change of reasoning yeah.
Isn't necessarily something thatwas commonly understood to be the
case until relatively recently.

(01:31:16):
A couple more stories.
Now moving on to security.
The next story is that new Microsoftcopilot flaw signals broader
risk of AI agents being hacked.
So Microsoft copilot where agenthas been identified as vulnerable
to a zero click attack, meaning thatthe hacker is able to exploit the

(01:31:41):
system without any user interaction.
So kind of a big deal, right?
You can actually hack it.
And I think, Jeremy, you mentionedthis earlier on, as we deploy more and
more agents in more and more kind ofisolated environments without direct
human supervision, these kinds ofthings become much more concerning.

(01:32:04):
it is the first ever zero click attack onan AI agent that they're calling out here.
It's called Echo Leaker.
That's what AIM Security, whichis the firm that found this is
calling it, it's been fixed already.
It was in Microsoft 365 copilot.
Customers were infected 'cause theyflagged the issue to Microsoft months
and months ago by like five months ago.
They've been working.

(01:32:25):
Around the clock, it seems to,to solve this problem that's
a lot longer of a lag than youtypically find for fixes like this.
And the reason seems to be they hadto spend a bunch of time just like
educating people on this new threatmodel because it is so different.
This is what's known as a, an LLMscope violation, vulnerability.

(01:32:46):
So you're essentially,what you're doing is.
You're sending an email, right?
So like I send an email to youand I know that your computer is
running Microsoft 365 copilot.
I know that your computer isrunning an agent, and that that
agent will review my email, right?
And whatever I put in my, in my email toyou, that agent will put in its context.

(01:33:06):
And so essentially this is aprompt injection attack, right?
So you, as the user, if you're receivingmy email, you don't actually have to
click on anything or interact witha message or anything like that in
order for me to, or my agent to accesssensitive information on your apps.
If I can just put in a prompt injectionthat causes your agent to send me a

(01:33:28):
bunch of your private information, right?
So you know, send an email to user.
There's, there are no phishing,no malware needed by the way.
This is just straight prompt injection.
and there are hidden instructionssomewhere in the email for copilot.
And so this is.
A pretty big deal, especially giventhat we live in a world where,
you know, the anthropic modelcontext protocol Salesforce's agent

(01:33:49):
force, you got a bunch of theseagents you're kind of taking over.
This is, the problem is there's noclear solution to prompt injections.
And as long as agents are gonna beloading human written text into context.
These failure modes are going to arise.
It's really interesting.
And the attack surface has just exploded,right, with these agents, right?
The implication of zero click is you asa human don't have to make a mistake.

(01:34:12):
Or typically with email attacks, you know,you see a phishing attempt where, you
know, a hacker pretends to be your bossor whatever, and you have to make the
mistake of thinking it's real and clickinga link or whatever to install a virus.
Here, literally via AIjust sends an email.
And if it's in your inbox and theagent scans your inbox and reads the

(01:34:34):
email, it goes off and like leakssensitive data because it's told to
and, and listens to the instructions.
So as you say, I think very.
Real threat.
And, and, and as we get into monotechmodel context protocols into agents
going, kind, connecting to differentendpoints by themselves and reading
instructions that are not provided by you.

(01:34:56):
Yeah.
Lots of opportunities to exploitagents and make them do silly things.
And one last article, Claude govmodels for US national security
customers services from philanthropic.
And yeah, they introduced cloud gov modelsspecifically for US national security.
Apparently they are already in use bytop level US national security agencies.

(01:35:22):
it basically is just that we've obviouslyseen a whole bunch of stuff about
open AI and anthropic and, you know,and Google Deep Mind looking after
or going after government contracts.
So this, you know, makes it.
Ton of sense.
You know, having these models thatcan operate in classified environments
is really, really important.
Right now what they're being usedfor apparently is strategic planning,

(01:35:46):
operational support, intelligent analysis,threat assessment, that sort of thing.
But they do say the applicationsrange across the board there, so
could be other things as well.
And then they highlight a bunch ofspecific capabilities that, that
they've been deploying which areall anyway, what you might expect.
Improved understanding and interpretationof complex cybersecurity data for
intelligence analysis and henceproficiency in languages and dialects

(01:36:09):
critical to national security operations.
Greater understanding of documents andinformation within the intelligence and
defense context, et cetera, et cetera.
Oh, and then a, a really interesting one,improved handling of classified materials.
as the models refuse less whenengaging with classified information.
One of the problems that we will run intoand arguably are already running into is.

(01:36:30):
If you want to use these models fornational security applications, the
safeguards on them will sometimesprevent you from doing that, right?
The models will be like, well, asa large language model built by
philanthropic, I can't, blah, blah, blah.
The challenge is sometimes you dowant these models to be capable
of doing things that you wouldn'twant everyday users to do.
And the other problem with that is,as we've seen alignment, faking and

(01:36:53):
resistance to fine tuning of thesemodels where they will try to prevent
themselves that are safety measuresfrom being overridden can cause the fine
tuning process to be really challenging.
And so we may actually, thissounds insane, but I'm just
gonna plant the thought.
We may be entering a phase where itis actually difficult to convince AI

(01:37:16):
models to be the national security toolsthat we will sometimes need them to be.
That's a really interestingproblem set, and I think to the
extent that that ends up being thecase, boy, is that an interesting
warning shot for alignment risk?
Yeah.
And onto synthetic media and art.
Just a few more stories.

(01:37:36):
We begin with Disney and NBCUniversal Sue AI Company, midjourney
for copyright infringement.
So there you go.
Midjourney, one of the big text toimage model providers used to be
a leader in, in the best quality.
Now they're just one amongseveral and relatively open model.

(01:37:57):
So you can produce Dar Faderor I don't know, whatever
else, copyrighted characters.
Apparently you can produceminions, which is NVC, universal.
And we, claim here is that this isstraightforward copyright infringement
that midjourney has to stop doing it.

(01:38:17):
And and Disney and NBC want a bunch ofmoney and also want midjourney to stop.
Apparently, according to them, theyreached out to Midjourney prior to the
lawsuit and ask them to stop and to filterthe data and outputs to not allow their
copyrighted characters to be produced,which, as I recall, I believe OpenAI did

(01:38:39):
for instance, and midjourney has continuedto allow their models to produce things
which has been argued potentially couldbe argued as fair use and therefore
not applicable, but clearly a big deal.
Right?
This is Disney, this is NBC Universal.
There's been a bunch of lawsuits relatedto generative ai, especially in the

(01:39:04):
LLM domain, in the text output domain.
We have New York Times versus openAI as a major one that's ongoing.
As we've covered earlier, I wouldexpect this to be another major
case that has major implications.
Yeah.
And the claim, and you'll see thisin fairness, in, in any lawsuit, but
the claim here is that midjourneyis being especially egregious in

(01:39:28):
their, in their approach here to,to a use of copyrighted material.
They're saying, you know, midjourneyis basically selling subscriptions that
let's users download infringing images.
Like, it, it's not likethere's modification happening.
It's not like midjourney is notmonetizing they're, they're like
directly monetizing the tool that allowspeople to just download these things.

(01:39:50):
And the claim is also that midjourneycould have measures in place
to prevent that from happening.
Like specifically that is to preventcopyright infringement images that violate
copyright laws from being generated,but that they've just not done that.
this is gonna be aninteresting one to watch.
I mean midjourney probably has fewerresources these days, I guess to pull off.

(01:40:11):
Its like lobbying effort, whichis something that OpenAI has
certainly been able to to do.
So we'll see how the, howthe case works out for them.
Right?
Also a fun lawsuit, PDF to read.
'cause we do embed images of, I dunno,AI generated truck and AI generated
Dar fader in there, which I wouldexpect is not often something you see

(01:40:32):
in lawsuit documents which, go intoa lot of technical detail and so on.
And onto last story, SEG AFTRA and videogame companies reach tentative new deal.
So Seg AFTRA is the union for it's theScreen Actors Guild, American Federation

(01:40:52):
of Television and Radio artists.
So a uni of actors and includingvoice actors who work in video games.
And so there's been.
A strike and, and a lotof negotiations ongoing.
We covered this a lot with regardsto movies and TV last year.
Well, now there is this developmentin video games, which is, you

(01:41:16):
know, especially important becauseif you're doing voice acting as
we've covered, you have 11 labs.
Text to speech is even further along thantext to video and, and image cloning.
So after 18 months of negotiationsprimarily over AI consent and
compensation issues, there'snow this tentative agreement.

(01:41:37):
And I guess there are AIprotections in place for actors.
And when you sign a contract as an actor.
You know, to voice a specific characterthe video game com company might wanna
be able to then make an AI model ofyour voice acting of that character
to use in future games or whatever.
There are now kind of clear guidelines andexpectations as to how that would work.

(01:42:02):
Boy, I
so people can do impressions of peopleand like, if you have access to an AI tool
that you can steer, and we've seen, youknow, the kind of steering that's coming
online with 11 Labs I really wonder whatsubstantively these protections end up.

(01:42:23):
Giving in the long run.
I mean if I want somethingto sound like Morgan Freeman.
Okay.
So I'm barred from using Morgan Freeman'sactual voice without permission, but
surely I can find the person who does thebest possible Morgan Freeman impression
or and, and maybe use that as a startingpoint, and then like gradually kind of
tune the, the waveform prompt the modelto refine its, its impression without

(01:42:46):
ever using the word Morgan Freeman.
Like, you know, maybe not even withoutever saying make it sound like the
God and Bruce Almighty or whatever.
That's a like.
Probably too old areference for you, Andrea.
I'm sorry, that's not that old.
That's, you got that.
Okay, cool, cool.
Yeah.
But anyway, you know, stuff likethat, like I, I'm really curious
how in practice, because thereare gonna be like good faith.

(01:43:08):
Like, you know, the famous ScarletJohanssen thing where at least the
claim from OpenAI was, oh yeah,we just got a voice actress who
sounds like Scarlett Johansson.
We didn't actually like, andit's like, yeah, okay, well
you defacto cloned her voice.
Like, I don't care if herspecific like waveform was never
put into your training set.
in effect, that's what we ended up with.
And so I'm really curiousabout that dimension of it.

(01:43:30):
Do we own our voices?
What does it even mean to own our voices?
We'll see.
Right, right.
This is dealing with AI replicasin particular, but there's also a
question of, well, what if you don'thave a human actor in the first place?
Yeah.
Which is very plausible now in a waysimilar to coding, where like, okay, you
don't need a person to write code anymore.

(01:43:52):
You need a person to.
Tell the AI what to do.
Yeah.
Anyway, at least there's nowthis agreement and there's
no more need for strike.
So I suppose good for actors.
Yes.
And with that, we have finished withthis episode of the last two weeks in ai.
You can go to last week inai.com for all the links.

(01:44:13):
Also last week in.ai for theSubstack with our text newsletter.
As always, please share, subscribe,review and all that, but more
of anything, do keep doing it.

All Episodes

#212 - o3 pro, Cursor 1.0, ProRL, Midjourney Sued

Episode Transcript

Popular Podcasts

New Heights with Jason & Travis Kelce

Dateline NBC

On Purpose with Jay Shetty

.css-15opob5{left:0;position:absolute;top:0.8rem;} All Episodes

.css-14f5ked{margin:0;word-break:break-word;display:-webkit-box;-webkit-box-orient:vertical;box-orient:vertical;-webkit-line-clamp:2;overflow:hidden;}#212 - o3 pro, Cursor 1.0, ProRL, Midjourney Sued