Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello and welcome to the last weekAI podcast where you can hear us
chat about what's going on with ai.
As usual, in this episode, you willsummarize and discuss some of the
last week's most interesting AI news.
And as always, you could relativelyepisode description for the timestamps
and links for those articles, and alsoto last week in ai.com where you can
(00:34):
go and browse the web and and so on.
I am one of yourregular hosts, Andre ov.
I studied AI at Stanford and I nowwork at a generative AI startup Astro.
And hi everyone, I'm Jeremy.
I'm your other regular co-host.
Sort of been, I guess, been inand out the last couple weeks.
But yeah, great to be here andyeah co-founder of Gladstone,
(00:56):
ai, ai, national security stuff.
You know the deal if you've beenlistening to the podcast and this
week we were talking about this,this happens every once in a while.
Not that often these days, but we'relike looking at our roster and we're
like, man, this is a light week.
And I guess it's gonna be ashort podcast for that reason.
But like hot air that expandsto fill the entire volume
that is available to it.
(01:16):
I'm sure we'll find a way tomake this a two hour podcast.
Nonetheless, we, it's a problem, it's askill, you know, we really are capable
of talking a lot when we have the time.
but to give a quick previewof what we'll be talking about
in tools and apps, there's avariety of sort of smaller tools.
Launching one kind of major onefrom OpenAI, but the rest are maybe
(01:38):
less notable and, but kind of variedand interesting applications and
business, as is often the case.
You're gonna be talking abouta lot of hardware OpenAI
spending much money on it.
Some developments from Huawei.
And a couple of business deals as well.
Projects in open source.
There's a couple new models out toGemma Free and ones from Sesame.
(02:01):
Pretty exciting ones.
Research and advancements.
We got Gemini Robotics, which is kindof, I don't know, unexpected from me
and pretty exciting and interestingpaper about test and compute.
And finally, policy and safety.
Our usual mix.
We got one paper onunderstanding and alignment.
And then we have a lot of storiesabout China, US relations, which
(02:23):
seems to be big deal these days.
Yeah, it's going great.
It's going great.
Well, let us just go ahead and jumpin, starting with tools and apps.
The first story is OpenAIlaunching new tools to help
businesses build AI agents.
So there's now a new responses,API, which allows you to make custom
(02:47):
A agents that involve things likeweb searching and file scanning,
and that is meant to be even more.
Yeah, autonomous capabilities.
So it also allows you to usethe computer using agent model,
which can control much more variedtypes of things on your device.
(03:07):
And apparently enterprises can runthat computer using agent model
locally, although on the consumerversion, you're gonna have to
only use that for web actions.
So I think kind of not toosurprising, we saw philanthropic
launch, also a computer used API insort of, early release a while ago.
(03:30):
And it would be very interestingfor me to see if this will be part
of a trend of kind of the next waveof automation after just playing l
LMS seems to be something like this.
It's, it's a really interestingmoment for the agent, like age
agentic business model if you want.
Like, here's OpenAI essentiallylooking at how do we unbundle the
(03:53):
package that we've offered historicallythrough the the agentic systems that
we, we've offered basically, right?
Like we built the agents, you usethem, this is them saying, no,
no, well, we'll give you accessto the underlying tooling and
you can build your own agents.
You know, like this comes with, forexample, like a file search utility
that can scan across files in yourdatabase and Train models on those.
(04:15):
So that's at least inprinciple, the guarantee there.
But yeah, there's a whole bunch ofother, you know, you mentioned the,
the QA model, the, the computerusing agent model behind operator.
And that by the way, you know, generatesmouse and keyboard actions that's
actually like about computer use itself.
But essentially, yeah, the unbundlingof this gives customers a lot of options
for how to create their own agents.
(04:35):
Eventually the sense though, everyindication from OpenAI is they
intend to have one experience to rulethem all at least on offer, right?
So they are looking tobuild an integrated solution
where it's not unbundled.
But this is, it's kind of theother side of the coin, right?
You're either building tooling toempower your users to build their
own agentic systems, or you'rebuilding one agentic system that,
(04:57):
you know, you imagine most kind ofconsumers would tend to use directly.
So they don't have towrangle their own thing.
They're, they're trying to sortof have it both ways here, right?
So unbundle and bundled products, whichI would expect anybody with sufficient
scale to do this who can afford tofocus on two things at once is gonna
have to do at some point because.
It's like, it's along the spectrumtowards open sourcing, right?
(05:19):
When you, you start unbundling the toolsand letting people mess, mess around
with them and, and see what they build.
So that itself is, iskind of interesting.
And then OpenAI can learn from how thosetools are being used and unbundled in
the same way that, you know, meta learnsfrom the way people play with Lama in
the open source world to then integratethose learnings into their own kind
of all fully packaged agent systems.
(05:39):
So kind of interesting, Ithink a, a good strategic play.
They're opening up a whole bunchof toolkits including an open source
toolkit called the agents SDK.
And that gives you a bunch offree tools to integrate models
with your internal systems, addin safeguards and do monitoring
stuff for your agents as well.
So, pretty interesting.
It's again, this sort of likestraddling the line between what
(06:00):
do we open source, what do we not?
and I think a, a nice sort of middleground that open OpenAI spotted there.
Expect philanthropic to do the same.
Expect ultimately xai to do the same.
I think a lot of this stuff isgonna get picked up, but it does
seem like a good strategic play.
Yeah, and I think also it kind of pointsto something I suspect we don't know the
details, but I would imagine that theAPI is where the real money's at, right?
(06:26):
We have a consumer version.
You can pay a subscription of $20per month for now, $200 per month
if you are a total power user.
And presumably enterprises arepaying for that for their workers.
But a bunch of companies, includingto some extent our company is using
the API to create their own thing ontop of a Chad GPT or on top of Claude.
(06:50):
And I would imagine long term, that'show open AI philanthropic will be
making the majority of their money.
And certainly philanthropic istargeting enterprise explicitly.
So this also plays into that I think.
Absolutely.
Yeah.
The, the money as ever is always in B2B.
Right.
And it's interesting that in a wayyou can view the history of chat
(07:12):
GPT as having been just to useB2C business to consumer to build
brand recognition for the B2B playthat inevitably was going to be, I.
Most of the value created here.
I think the one caveat is when yougo really long term with this stuff,
eventually, if you talk about superintelligence, if OpenAI plans to start
to centralize more and more of, ofthe activity that's happening, the
(07:35):
economically productive activity, whichyou have to imagine they would despite
everything that they say publicly about,you know, we want to empower creators
eventually, like, like Amazon, right?
With you know, what do they call it?
Amazon Basics or whatever, youknow, they spot those products that
are selling really well and boom,they'll, you know, snap up those
opportunities, expect the economicsto look the same for open ai.
When that happens, essentially they'recutting out the business middleman
(07:59):
and going straight to consumerand internalizing all that value.
In some verticals, not all, but in some.
So I think there's gonna be thisinteresting evolution to your point a
transient where B2B is where all themoney's to be made, but then because
AI has the ability to eat the worldit's gonna be really interesting
to see do they ultimately become.
As much a B2C company as a B2B company.
(08:20):
Right now, by the way, anthropichas a massive advantage, relatively
speaking in terms of the balanceof, of consumer versus business.
Anthropic much more focusedon the business side.
And that's reflected in just the bettercoding abilities of, you know, cloud
3.7 sonet cloud 3.5, sonet new and allthat stuff, even over and above some of
the age agentic models from remote ai.
But I think that's a great point.
it's so interesting, right?
Like these companies areinventing new business models.
(08:42):
No one really knows what's gonnawork, how they'll evolve over time.
The one thing that's guaranteed is theywill evolve and we will be surprised.
Next up we have a story from Google.
They are now releasingvariability in Gemini two flash
to have native image output.
So that allows you to do conversationalimage editing in the chat flow as
(09:06):
you do with chat and others, youare able to ask it to generate an
image to my understanding here,basically this is different because
it's not calling on another tool.
It's built in to Gemini to flashitself as a multimodal model.
And so the results can bequite impressive in multi turn
(09:29):
conversation taking where one ofthe key sort of limitations of
image generation or challengesis if you wanna edit an image.
You wanna also preserve, let'ssay the characters the people,
you know, various aspects of theimage while still changing it.
And that's one thing you get sort offor free here because Gemini two Flash
(09:51):
has the context of a conversation, bothin terms of the text and the images.
And that means that it can really doa pretty stellar job from examples
I've seen of maintaining consistencyand also being very kind of
generalized and able to do all sortsof stuff directed by your output.
(10:13):
So initially this was announced,it was available to some testers.
Now this is rolling out tousers and, and even developers.
Yeah, and it's you know, the, the,the small handful of things that you
tend to turn to and you look at thesedays, you know, okay, how good is
this new in the generator text, right?
It handles text really well.
They, at least based on the, the demosthat they show, which you never know,
(10:35):
but they have it create an old detailedvintage, 35 millimeter photograph from
a front view of a computer monitor.
And then they say like, have this textdisplayed on the monitor, and it's about
three lines and it, it captures every,every word as far as I can tell here.
So, you know, that's a failuremode that we've seen before.
And it's the combination, you know, onceyou get into text that you want to have
(10:58):
faithfully represented in the image.
And also you wanna get into thisback and forth editing of the image.
A lot of these things, as you stackthem together, that's where you run
into a lot of the problems, and atleast based on what they're deciding
to show us here in the demo, it, itdoes look pretty remarkably good.
So, as ever, I am wondering what isthe next step in, in image generation?
(11:19):
I'm sure there will be some, butfor, for those of us who are just
sort of like lowly consumers of imagegeneration, tech I think we're pretty
close to approaching saturation point.
I mean, Andre, you're obviouslymore plugged in on the gaming side.
I'm, I'm guessing that, you know,you have specific things that you
might look for just 'cause of, youknow, generating visual artifacts,
avatars, things like that, or.
Yeah, I mean in our kind of testingwe found that if you have a very
(11:43):
particular use case, these aregeneral purpose models and they
reflect their training data sets.
So they're really good at generatingkind of stuff you would find
on the web if you have a veryspecific set of requirements.
Typically these models aren't ideal.
So, being able to be very good atinstruction, following down to very,
(12:04):
my new details is very importantto be able to use them kind of zero
shot for whatever you're doing.
So that could be one of the,powers or benefits here.
Another interesting aspect, justlooking at their blog post is if
you look at the examples they givein the multi-term conversational
image editing, you have to wait likeupwards of 10 seconds for a response.
(12:26):
Mm-hmm.
And I would imagine that's one ofthe limitations when you are doing
this kind of native image output,when you have a multimodal model that
does image and text and audio it canoutput images for you and, and have
very flexible kind of reasoning andaccuracy, but then it is way slower
(12:49):
than the sorts of image generators, textimage, generators we see on the market.
So I think that's an interestingtrade off we haven't seen necessarily.
Yeah.
There's almost the kind ofuse case issue there where.
Until compute is so cheap thatfor most practical purposes, you
can get instantaneous generationof outputs for multimodal models.
(13:10):
Yeah.
There's probably gonna be a needfor, you know, still specific, you
know, high specificity models thatonly have one modality or another,
and maybe router models that routeyour queries or, you know, whatever
modality to whatever modality.
But we're definitely not there yetthat we can have a, you know, a
single gato like model that justdoes everything for everything.
Yeah, people are still using plentyof Lauras out there, that's for sure.
(13:32):
Moving on to the lightning ground,we have a few smaller stories.
First up, one of my favoritetopics, apparently since I keep
bringing it up, it's Waymo andthey are yet again expanding.
They're now offering robot ridesin a few more cities in the Bay
Area, including Mon View, PaloAlto, Los Altos, and parts of
(13:53):
Sunnyvale, which is exciting tome because I work in Los Altos.
So now I'll get to potentiallyuse it in my commute sometimes
just because that's fun.
So this is part of our rollout.
It seems like they are reallytrying to expand a lot.
This year.
They've expanded to Phoenix tooffer their robot services there.
(14:15):
They're trying to expandto la and they're also are
planning to go to Atlanta.
So, yeah, it seems like we're.
They feel ready to expand.
And for me, the main question is,can they do it more rapidly than
they have for the past year or so?
I mean, they were in San Franciscofor now, quite a while, maybe two
(14:37):
years, and they're now moving kind ofto the suburbs south of San Francisco
with some of these smaller citiesin their backyard, so to speak.
So still moving a little slow on theexpansion front, but it seems like
they haven't had any kind of bigcrashes or anything of that sort as
they've expanded, which is promising.
(14:59):
Yeah.
It's also it's strategicallyinteresting, right?
'cause the, one of the big things that'scome out from way more recently, of
course, and which we covered, is theirpartnership with, with Uber in Austin.
And that seems to be expandingnow to Atlanta, or at least
it will later this year.
it does make me think a little bit like.
Uber is at some risk here, right?
Because essentially the the coreplatform that they're using for a
(15:22):
lot of these rides, where essentiallyas an Uber customer, now you can,
you know, if you're in Austin, youget matched with a Waymo Robo Taxi.
Same thing will be true in Atlanta.
yes, you're a marketplace.
Like that's the value ofUber in this context, right?
It's discovery of supply and demand.
But at a certain point, if people getkind of comfortable riding in Waymo
cabs, you've got the brand established.
(15:42):
If Waymo just comes out with anapp and then undercuts Uber, which
presumably they may be able todo, if only for a transient period
to onboard people like Uber's gotsome, some platform risk here.
And so this is, it's not acoincidence, right, that Uber had
made previously through Uber, a TG.
The self-driving piece,a priority for them.
They've since ditched that just becauseit's, it's too capital intensive.
(16:04):
They weren't making enough progress.
But that was because they sawthis eventuality potentially
coming and huge amounts of,of platform risk on the table.
So, yeah, I mean, I, I don'tlove Uber's positioning here.
I I, I think, you know, they have greatsoftware, but when you're riding off
a, a kind of hardware platform wherethere's a lot of efficiency to be gained
(16:25):
from vertical integration potentiallyin this space, 'cause the margins are so
limited I wonder what they're thinkingand how that ends up playing out.
But we'll get some earlyindications anyway.
So with these rollouts in, youknow, Austin, Phoenix and elsewhere.
Right.
Exactly.
And, to that point, Waymo alreadyhas an app, a standalone app you can
use for example in San Francisco.
(16:45):
So they're kinda ready to getrid of Uber whenever they can.
I guess the benefit for Uber andso on is, is just their scale.
They're everywhere,obviously, across the globe.
So it'll take quite a whilefor Robotaxis to be, just have
enough hardware in the firstplace to even be able to compete.
And the big question as ever is,is Tesla gonna be able to catch up?
(17:09):
'cause right now it seems like Waymoat this point is the only player
in town on the robot taxi business.
Next up we have a new video generator.
There's a startup called MoonValley that has released a
video generating model they callMarley, with the pitch that it's
trained on licensed content only.
(17:31):
So not using any copyright data thiswas done in incubation with Asteria
and AI animation Studio, and is alsomeant to seemingly be for more let's
say cinematic or I media production.
Type of roles.
It allows you to customize camera andmotion controls, for example, and in
(17:54):
scene movements and things like that.
And this allows you to producehigh resolution clips for up to 30
seconds long with, again, the pitchbeing where it is low legal risk.
So, this is certainly you know,it's kind of been a little quiet
on the text to video front.
We had a big kind of momentthere of Sora being released.
(18:15):
A while ago.
We saw Adobe, I believe Idon't remember if it's released
already, but we have announced,we have video generation model.
So it's continuing kind of arollout even as the focus has
definitely shifted to reasoning.
Yeah, it's, it's also, they'reapparently, so starting
off with more kind of opensource, openly licensed stuff.
In this first release, they areapparently working with partners
(18:37):
to handle licensing agreementsand packaging videos into data
sets that they can then purchase.
Which is a lot like whatAdobe is doing, right?
So we, you know, we see them kindof doing the same thing with their
big they were, I think the firstcompany, the first company certainly
that we covered that was doingthe indemnification guarantee.
You know, if you get sued for usingour, you know, image, video outputs,
whatever, we will indemnify, wewill kind of defend you in court
(19:01):
if you're using our, our softwareas it's intended to be used.
The interesting thing with Moon Valleytoo is like, I'm, I'm not tracking how
much they've raised, but it's certainly,you know, not gonna be a huge amount.
It's not gonna be in the orbit of,you know, what OpenAI has on hand.
And so when you think about asmall company like that, trying
to do this stuff through licensingagreements and, you know, purchasing
(19:22):
video content from other companies,that's a much taller order.
But strategically, and, and this isjust speculation there's actually
kind of an interesting symbioticrelation here, a relationship here
potentially between companies thatput out licensable video content.
I imagine if, I'm a, a company that'spumping out videos that might be
used for training for these models,I might actually want to partner
(19:44):
with a company like Moon Valley,give them very, very cheap access.
but still sell to them, access to thelicenses for these videos, if only to
set the precedent so that opening AIthen feels pressure to come in and buy.
And once you get the big companiesto come to you, then you charge the
full amount, if that makes sense.
So like, I don't know, I don'tknow the, the legalities of
(20:05):
how that, that would play out.
You know, if there's, there's an issuehere with kind of like selective pricing
with different players like that.
But there's a kind of interestingpotential partnership here with
up and coming companies for thesecontent creation platforms to license
stuff for cheap, just to get thatflywheel going, set the precedent,
and then, you know, cave thebigger companies that can afford it.
It is sort of interesting and not sayingthat's part of this, but it, it kind
(20:28):
of makes me think in that direction.
When you, when you look at this.
Right.
And to a point aboutfunding, I just looked it up.
They got a seed round of 70million back in late 2024.
That was when theyannounced it at least.
So, significant amount,not a huge amount.
And that's another part of thestory, I think, is it turns out
you can get pretty good videomodels these days for not Yeah.
(20:53):
A ton of money.
Not like, you know, hundredsof millions of dollars.
We'll get to that also in theopen source section when the
cost of compute collapses.
Right.
10 x every year, you know, a 70 thou,a $70 million raise is effectively
a 7,000 million dollars raise.
At least, you know, if you'recomparing CapEx to CapEx.
Yep.
Hmm.
Next up we have Snapchat andthey are introducing AI video
(21:16):
lenses that use their own.
Model that is built in house.
So if you are a Snapchat platinumsubscriber, I'm sure we have
many listeners who use Snapchat.
you can pay $60 per month to beable to use these basically filters.
Not quite filters, I guess.
It's kind of like, video editingwhere they have free AI video
(21:39):
lenses currently, raccoon Fox andSpring flowers, which basically takes
your video and adds in a raccoon oradds in a fox or adds in flowers.
And there's some sample videos.
You know, I dunno, they, they look fun.
This feature for so long, I can't,I, you know, if I have to use
another fucking image editingplatform that doesn't have.
(22:03):
An editable fox or raccoon feature.
I'm gonna lose it.
Andre.
You know, I didn't know that thisis such a highly desired feature.
I'm gonna be honest with you.
I didn't know.
Raccoons are such abig deal on Snapchat.
Apparently they are.
Oh.
But anyway, I think interestingto see Snapchat investing in, in,
in, in-house generative AI model.
(22:24):
And it's a real question mark as towhether this would be an incentive
to actually pay $16 per month.
But yeah, I, I dunno much aboutSnapchat, so I dunno if users are
big on video filters of this sort.
Yeah.
I've never felt moredisconnected from like.
the median consumer of theproduct, and I'm looking like
(22:47):
16 bucks a month for the, okay.
I mean, I can see other uses, butI, there, there's other stuff too,
obviously that will come with thisand I'm sure they'll roll it out.
It is interesting that theychose to go in-house with this.
Maybe not too surprising given thesheer volume of data that they have.
And also the fact that when you lookat Snapchat video not gonna lie,
it's been a while for me, but it doeshave like a certain aspect ratio.
(23:08):
It has a certain pe you know, peopletend to frame their shots a certain way.
There's a whole culturearound the use of the app.
And so you might expect that you know,having a fine tune model, but may maybe,
I guess even a pre-trained model likethis that's done all in-house, you
know, can actually make sense there.
So, they're presumably also trainingon, on other open source at a minimum
open source video data, I would assume.
(23:29):
But certainly when you have such a hugeamount of data in-house, it kind of
causes you to lean in that direction,especially if you've got the funds.
And one more story, and this oneis kind of one I have a soft spot,
not one that is covered in a lot ofmedia, but I think it's kind of neat.
The headline is Pseudo RightLaunches Muse AI model that can
(23:50):
generate narrative driven fiction.
So pseudo right is a platform with theintent to basically yeah, have an AI
assistant for writing generally fictionand, and potentially also blog posts.
And they've been aroundfor years and years.
It was one of the tools that I usedgoing back a few years ago when I
was playing around with these things.
(24:12):
So they were kind of early onthe LLM train and now they have
this news model that they say.
Is actually capable of producing abetter you know, literature so to
speak, can better assist you in writing.
And, and that's kind of oneof the things to highlight.
Pseudo write is meant to be as anassistant, you're type, and then
(24:33):
a suggestions ability to suggeststructure, characters and so on.
another kind of slightly interestingidea here where we know that on the
one hand things like G-S-G-B-T canwrite a whole entire short story that
is kind of logical and you can read.
On the other hand it's generally truethat if you just ask an LLM to write
(24:58):
you something, it's gonna be prettygeneric and pretty just LA to read.
So I could plausibly see that thereis space to, with some data, get to a
model that is much better by default atproducing good suggestions for writing
that aren't, let's say cliche or justyeah, un unusable if you're trying
(25:24):
to write something a little more outthere than what you see typically.
Yeah.
Well, and I don't think we have astory for this specifically 'cause
it's just sort of rumors and, and preannouncements, but opening eye right
sort of came out I think yesterday as oftime of recording to say, Hey, we have
this new model that we're working on.
It's really good at creative writing.
Sam a tweeted about it or Xed about it.
(25:46):
it's definitely something thatpeople have, thought would be sort
of struggling point for LLMs, right?
It's easy to, to train them,especially age agentic models.
But even just sort of generalpre-trained lms, it's easy to train 'em
to do coding and things that you canobjectively quantify and, and evaluate.
But, you know, the creativewriting stuff is harder, so.
You know, maybe we'll see moreof a push in this direction.
(26:06):
I'm sort of curious, you know, if youcompare the pseudo right model and
the upcoming OpenAI model in terms ofthe performance, but also in terms of
like, what does that training processlook like to get more creative outputs?
'cause it's not obvious to me atleast how you would, other than just
curating your data more carefully,which is sort of the obvious thing here.
Or maybe just putting more weighton it, you know, changing the, the,
(26:27):
the order in which you train andmaking sure it's at the very end of
your training that you're putting insort of the highest quality sources.
These are all kind of, you know,standard tricks that are used.
But yeah, I'm curious to see howthis differs both in performance
and, and in training procedure.
Yeah.
Unfortunately they didn't releasetoo many technical details as to
what is actually involved here.
I guess we, the cool.
(26:49):
Story or what would be neat if thiscame out as a result of sort of
the actual platform pser, right?
Having such a particular use casewhere you have actual people using
it and rejecting suggestions oraccepting suggestions rewriting
parts of the output you know,picking from various suggestions.
(27:11):
So that is like a gold mineof data for this particular
use case that nobody else has.
If that was part of how they didthis and they did also say that they
talked to s of users of a platformyou know, could be an interesting
example of a more niche use caseplatform that is then able to become
(27:33):
a leader for that application.
Yeah, opening for like kind ofprocess and a kind of pseudo process
reward or even RL type stuff you getinto just based on like, like you
said, the editing history would bean interesting thing to get into.
Not just the outputs, but yeah.
And onto applications and business.
We have a story about OpenAI.
(27:54):
They are in a $12 billion agreement,a five year agreement with cloud
service provider core reef.
So that is.
Partially investment.
OpenAI is getting 350 millionin equity from Core Weave.
And that will, you know, presumably playinto their need for infrastructure open.
(28:20):
AI's need for infrastructure.
Infrastructure.
Core Weave has an AI specificcloud service with 32 data centers
and over 250,000 Nvidia GPUs.
Microsoft is a big userof Core Reef actually.
And it seems that OpenAI now isalso planning to have additional
(28:41):
compute providers outside ofMicrosoft who is presumably still
their number one source of compute.
Yeah, this is a, a really interestingstory in a couple different ways.
I think we covered last week.
Core Weave is planningan IPO that's coming up.
One of the concerns there is that,yeah, Microsoft is the line share
of Core weaves revenue, right?
62%. Given that, you know, that's asource of pretty significant risk for
(29:03):
Core Weave, this deal with OpenAI is aprobably very refreshing injection of,
of funds and potential for partnership.
So kind of, diversifying a little bitthe portfolio of customers at scale.
Core Weave is, by theway, backed by Nvidia.
That's how they've been able toaccess GPUs, so many GPUs so soon.
They're now in the processof adding Blackwells already.
so that's a big dealfor compute capacity.
(29:25):
But the other dimension of thistoo, besides just the, the IPO
and, and how this sort of helpsCore Weave strategically is that
partnership between Microsoft,no AI that you referenced.
So there is a bit of, sort ofdeteriorating relationship there, right?
I mean, we talked about a few weeks backin the context of the, the big Stargate
builds that are happening right now.
Right?
Big partnership betweenOpenAI and not Microsoft.
(29:49):
It's supposed to be Microsoft,but instead Oracle and Cruso.
So, so Cruso being a big data centercompany, and then Oracle being
the sort of hydration partner toprovide a lot of the GPUs that's
great deal for those companies.
But it does mean that open AI iskind of breaking off this reliance
that it had with Microsoft.
The story there seems to be thatMicrosoft's sort of risk appetite
(30:11):
to keep on with the Stargate build.
Has been more limited than open ais.
I think that's somewhat overblown.
My understanding is that behind thescenes Microsoft is actually a major
funder of the Stargate Initiative.
It's just not a sort ofpublicly recognized thing.
but still this is OpenAI kind offinding even more diversity in
(30:32):
their supplier and their vendorportfolio, if you will, for compute.
And that gives 'em a bitmore leverage over Microsoft.
So you see each company kind oftrying to position itself, right?
Microsoft is going off and makingtheir own reasoning models, and
OpenAI is going off and findingtheir own compute partners.
And there's this very uneasyequilibrium between Microsoft and
open AI that kind of seems to befalling apart a little bit, kind
of, death by a thousand cut style.
(30:53):
Next story.
We are getting into chips being madein China or possibly China actually
as we'll get to the story is thatHuawei has now a new chip line,
ascend 910 Sea, which is seeminglyentering production and would be
(31:13):
the leading AI accelerator productthat is made by a Chinese company.
So this is a part of theirchip line via Ascend nine 10.
They are, there's various analysis,I guess we don't know exactly, but
from one person commenting LeonardHam, it seems that this will be
(31:38):
sort of along the lines of an H 100.
So not near NVIDIA'sflagship chips now of B 200.
It's probably.
You know, a fraction, one third ofa computational performance much
less computational memory and so on.
But still, this is a domesticallymade product for AI acceleration
(32:02):
that is moving closer to Nvidia.
At least this is comparableto an A 100 GPU for instance.
Yeah, this is a, it's a really, by theway, Leonard Heim, great guy to follow.
If you're interested in anything,kind of China chip related a
lot of export control stuff.
He has this great tweetstorm that'sworth checking out too on this.
(32:23):
But yeah, it, it is a big story, right?
So one of the, the big dimensionsof this is that China's been
able to get its hands on the.
Ascend nine 10 Bs.
So just for context, the nine 10 Cis the kind of next generation Huawei
chip that's just entering production.
That's the one that's like, dependingon how you calculated about 80% of
(32:45):
the performance of an Nvidia H 100.
So, you know, the H 100 I, I think,sort of came off or started production,
you know, like three-ish years ago.
So 80% of a chip that came outthree years ago, presumably.
I mean there, there's alittle fog of war here.
A lot of these chips seem to havebeen sourced illicitly from TSMC.
(33:06):
I've seen some differing accountsthere, both from the CSIS post.
There's a was it tech Radararticle or something like people
are disagreeing over this alittle bit, but it definitely
seems like there are a lot of.
Illicitly acquired chips and includingpotentially these nine 10 Bs that
were produced presumably by TSMC.
So two of these nine 10 B dyes.
(33:26):
These are the logic dyes.
If you go back to our hardwareepisode, logic dyes actually
run the, the computations.
They're distinct from thememory, the highend with
memory that sits on the chip.
But these 2 9, 10 B dyes need to bepackaged together to make a nine 10 C.
And so apparently they seem tohave gotten their hands on about
2 million of these nine, 10 Bsenough to make around a million.
(33:48):
Nine, 10 Cs.
And essentially around a millionH 100 equivalents this year
seems to be well within reach.
There are all kinds of caveatsaround dotted eye bandwidth and,
and how the packaging processthat Huawei actually has available
to it sucks compared to T SMCs.
But you know, this is a lot ofstockpiling like the China has done
(34:08):
and Huawei has done an amazing jobof stockpiling a whole bunch of
chips ahead of export controls.
This doesn't just include logiclogic dyes like the nine tens,
but also high bandwidth memory.
So HBM two E that they sourcedfrom Samsung, apparently they
have enough for about 1.4million nine, 10 C accelerators.
So even though these chipsare now controlled, they
(34:30):
were stockpiled previously.
A similar thing, by the way, happenedwith the actual Nvidia chips.
We talked about that.
Back in the day when Nvidia.
Essentially had new export controlsthat they knew the US government
was gonna bring in to preventthem, for example, from shipping.
at the time I guess hopperchips and, ampire chips, like
a one hundreds for instance.
And what they do is they, they knowthat the export controls gonna come
(34:52):
in and so they try to throw as many ofthese chips into the Chinese market.
In fact preferring Chinese customersover American ones for the purpose
because they know they'll stillbe able to sell to America once
the export controls come in.
And so this is sort of Huaweiand Nvidia in a way, like
through incentives, essentially.
Being pushed into partnering togetherto jack up Huawei's supply of everything
(35:17):
from supply of, of ready to go chips.
So the stockpiling strategy issomething that happens all the time.
China's really good at it.
It compliments the illicitacquisition after the export
controls come in as well.
And the net result is they endup with a lot of these chips.
You shouldn't think of exportcontrols as being meant to get a
perfect block on Huawei or SMICor any of these companies being
(35:38):
able to acquire technology.
It's, it's a matter of slowing themdown and it's something that compounds
over time to create more of a gap.
But anyway, sort of really interestingthat, you know, when you get to, a
million H 100 equivalents in the, theChinese market just from Huawei one
thing you gotta think about is theCCP is really good at centralizing.
Their compute.
So if they wanna do a big, big training,run a national scale project, they can
(36:02):
do that much more easily than the us.
So even though we're producing way morechips, ours are spread out across like
a whole bunch of different hyperscalersand smaller companies in principle,
China can just step in and say,Hey, you know, we are commandeering
these chips, throwing them together,doing a, a really big training run.
So that's an important distinction.
And why those 1 million h 100equivalents that we could see again
in just 2025 from just Huawei is,is pretty remarkable and important.
(36:26):
And we have a follow up tothat story as you mentioned.
The other article kind of along withthis that came out is that Huawei
did reportedly acquire 2 millionAscend nine 10 AI ships from TSMC
last year through Shell companies.
So a little bit more on that.
They acquired seemingly more than 2million Ascend nine, 10 B Logic dies.
(36:51):
Nine 10 C is.
Kind of a combination of 2, 9,10 Bs from what I could tell.
And they did go kind ofaround, they didn't directly
get the product from TSMC.
TSMC Sly caught this happeningand halted the shipments after
(37:12):
an internal investigation.
And so this kind of was unintendedpartnership, you could say.
And from what I also saw this, itseems like a seven nanometer process.
So it's not the most advancedtechnology TSMC has, which
is probably not surprising.
We, you know, other customersare using up all that capacity.
(37:33):
But it does demonstrate that as weknow, export controls are in place
and we've seen a lot of leakagethrough the export controls.
And this is seeminglya pretty big example.
Yeah, absolutely.
And, and the fact, as you, as youalluded right, the seven nanometer
process that they're using,that's something that variants of
that with debatable yields and,and other properties are things
(37:55):
that SMIC can do domestically.
So, SMIC being China's, TSMC whichalso was founded based on very clearly
TSMC derived ip, that was effectivelystolen a pretty, a pretty cool story.
I will say SMIC.
There's some interesting people fromTSMC who it's a real story, let's say.
(38:15):
Yeah, yeah, yeah.
It's, it's sort of a classic Chinesecorporate espionage story, right?
Like poached some of the most seniorfigures at the company have them come
over and, and I think, I mean, theystood up SMIC and got it to reasonable
production on close to leading node inlike suspiciously close to record time.
I think it might have been 12 monthsor something outrageous like that.
So like an unheard of speed of,of takeoff, which is part of what
(38:39):
triggered the whole suspicionfrom TSMC that this is going on
in lawsuits galore and all that.
So yeah.
Yeah, I mean it's,it's essentially like.
China has a pretty solid domesticcapacity to produce chips.
Huawei is roughly their Nvidia,SMIC is roughly their TSMC.
But they're both under the same roof,if you will, under the CCP umbrella,
so you can see it potentially moreintegrated partnership there as they
(39:02):
form one kind of Huawei SMIC complex,a bit tighter than Nvidia TSMC.
But one important difference is SMICdoes not have access to EUV machines,
and so you're pretty much bottleneckedat the seven nanometer node.
Maybe they can push to five nanometerswith multi patterning, but it seems,
(39:24):
you know, you're like, you're gonnarun outta steam pretty quickly.
And the other interesting thing toois if you look at TSMC we've talked
about this a lot, but their leadingnode is all, you know, iPhones.
So that means that it's the, the nodeup right now, the five nanometer, four
nanometer node that goes off to GPUs.
Well, you know, you don't havethat with S-M-I-C-S-M-I-C is having
to balance their leading node atseven nanometers between, you know,
(39:46):
Huawei smartphones and well othersmartphones and the GPU supply.
So there's kind of a lot ofinteresting stuff going on here.
Yields seem to kind of be shit at SMIC.
Last I checked, I is like around 75%or so, which is not economically great.
But when you have you know, the Chinesegovernment subsidizing you to blazes,
you don't necessarily need the sameyields that you might to ha to be
(40:08):
economically viable if you are TSMC.
So anyway yeah, interestingsituation and definitely they've
gotten their hands on an awfullot of these illicit chips.
We've covered a lot of stories likethis in the past, and, and this
honestly hasn't come as too much ofa surprise to a lot of the people I
talk to on the export control side.
But anyway, it's all,all kind of locked in.
And now let's take a shortbreak from talking about chips.
(40:30):
We'll get right back to it.
But quick story about investing.
We now know that Google has asubstantial investment in anthropic.
It seems that Google owns 14%of philanthropic, although
that doesn't come with anysort of control mechanisms.
They don't have voting rights orboard seats or board observer rights.
(40:52):
Something that Microsoft has at openai, for example in total Google has
invested over $3 billion in anthropic.
So kind of an interesting note, maybeit seemed like a tropic is aligned
with Amazon as a primary, let'ssay Ally sort of an to open AI and,
and Microsoft and Google investingthis much in sort of a rival is, at
(41:16):
least to me, a bit of a surprise.
Yeah, I mean they're, you know,they're gonna be hedging their bets.
I guess if you're Google, one of thethings you think about, you look at
search GPT, you look at perplexity,obviously the search space is changing,
and sooner or later, someone's gonnado something that goes after your
search market share in a big way.
And because Google owns so muchof the search market, like well
(41:39):
over 90% that, and because thesearch market represents such a big
fraction of Google's overall revenue.
They kind of have no choice but to makesure that they own a piece of the search
pie wherever it goes in the future.
But still, like, yeah, I mean,it's an interesting play.
One of the consequences of theGoogle and Amazon investments in
anthropic is anthropics relianceincreasingly on TPUs and train
(42:03):
chips, training being what Amazonhas and TPUs being what Google has.
and that's, we've covered stories, youknow, that involve that and, and some
of the, the challenges associated withtraining on that kind of infrastructure.
yeah, I mean, it's, it is interesting.
It is also something that we're learningbecause of the antitrust case that's
been brought on Google in this case.
(42:23):
Yeah.
On Google in this case, looking intobasically, you know, are, are you
controlling too much of the market?
one thing that we do know is thatGoogle is now being required to
notify antitrust enforcers beforeinvesting in any more AI companies.
This is based on a, a revisedjustice department proposal.
That was filed Friday.
So if that holds this is a, aninteresting requirement to kind of
(42:43):
pre-register what they're gonna do.
There was an initial proposal by theway, that would have required Google
to fully unwind its investments incompanies like Anthropic, but that's
longer apparently on the table.
Would've been a big orvery big deal, right?
So as you can imagine Googlepushing back really hard.
There's also a whole bunch of stuffthat the DOJ has proposed, including
a forced sale of the Chrome webbrowser that is still on the table.
(43:08):
So the claim here is that the governmentis quote concerned about Google's
potential to use its sizable capitalto exercise influence in AI companies.
So, yeah, no surprise there.
but still forcing a little bit moredisclosure than normally would have.
And kind of interesting to notethe, the percentage ownerships
and, and the board structure.
As promised, going right back to chips.
(43:29):
And now to meta, we got the story thatthey reportedly testing an in-house
chip designed for AI training.
This is the one that was made incollaboration with TSMC and is in the
initial testing, small deployment phase.
Meta has used custom chips forinference, but not for training
(43:54):
and we have reported on them doingthis kind of development and,
and trying to essentially havesomething to compete with TPUs.
So, seems interesting, like, I don'tknow what sort of timeline is safe
to project for this kind of project.
I would imagine it would take yearsof engineering, so I don't know
(44:16):
if them doing, testing in-houseis indicative of very rapid
progress or wherever they're at.
Y Yeah, I mean, so one, oneinteresting thing is I haven't seen
any word of collaboration betweenmeta and a separate entity that
would help them with chip design.
So they are like, you know, openingAI partnered famously with, with
(44:37):
Broadcom, so did Google right tomake the TPU we're not seeing any
indication of that here with Meta.
So it does seem like they arefully going in-house with this.
They do seem to think of theirinference ships as having
been a, a big success case.
Those are only being used forrecommender systems right now.
So obviously that's a hugepart of Meta's business, right?
Having to serve up ads and content.
(44:58):
And so the recommender systemsat scale have have specific
requirements that Meta is tracking.
Where we are now with this, to speakto your, your question about timelines.
So, right now meta is finished what'scalled their first tape out of the chip.
This is basically a, a thresholdwhere you send an initial design
through a chip factory and,this is super costly, right?
(45:19):
Tens of millions of dollars, well,not super costly for meta, but anyway,
tens of millions of dollars and it cantake three to six months to complete.
No guarantee that things will actuallysucceed at the end of the day.
And if there is a failure.
Which by the way, has happenedbefore for meta at this very stage.
They've gotten to the stage before withprevious chips meant for the same thing.
If that happens, then they haveto go back, diagnose the problem,
(45:42):
repeat the tape out step andsets them back, presumably that
additional three to six months.
And so, the in-house custom inferencechip that they built previously
had flopped just before the sortof small scale test deployment.
That's the stage we're at right now.
Meta wants to do this little,little deployment, kind of see
how things work in practice.
and the interesting thing is likeafter they did that, after the
(46:02):
first time that they had this,this, this kind of in-house design
blow up in their face, they.
Had to figure out analternative strategy, right?
Like we need to now have the compute todo what we're gonna do with these chips.
And so they were forced to place amulti-billion dollar GPE order with
Nvidia which then, you know, givesthem a later start on that as well.
So it's this interesting balance betweenhow do we hedge our downside and make
(46:24):
sure that we have standing orderswith Nvidia, say, but also we want to
have independence from Nvidia, so wekind of need to double dip and have an
investment in, in-house chip design.
So, yeah, I mean we're still seeingas well in the, in the case of
this particular chip a recommendersystem focus, though, the plan
is eventually they do wanna startusing their own chips for training.
(46:45):
And that would be,they think around 2026.
And one more story abouthardware this time Data centers.
It's about XAI and they have boughta 1 million square feet site for
a second Memphis data center.
So this is an $80 million acquisition.
they are seemingly aiming forthis new data center to be able
(47:10):
to support up to 30 350,000 GPUs.
Up from the a hundredthousand, is it 200,000?
I've lost track in theirexisting Memphis facility.
Yeah.
the initial rollout was a hundredthousand, and now they have Yeah.
A plan exactly.
to double it.
So you're a hundred or200,000 is exactly right.
(47:33):
It's, it depends on when and, and yeah.
The interesting thing with this newfacility that they're standing up,
it's next to the south Haven CombinedCycle natural gas power plant.
And that generates about780 megawatts of power.
So, then there's thequestion of, okay, well.
Sure, but how much of thatpower is already in use?
For context, when you think about onemegawatt of power, order of magnitude
(47:57):
gets you to about a thousand GPUs.
So one GPU is around, it's a little,it's a little over a kilowatt
that you're gonna consume, whichis about the power consumption
of like one home, right?
An average home in America.
So if you look at 780 megawattsof power, if you had that all
available, which of course youdo not because there's, you know,
industry and there, there arehouses using this and all that.
(48:19):
But if, if you just pour that Into GPUs.
You could already kind of get tothe, several hundred thousand GPUs
that are powered by that, which isexactly where you're getting that, you
know, 350,000 presumably blackwellsthat they'd be looking at there.
The other thing is that we know is thatapparently Memphis Light Gas and water,
which is a local energy company, saidthat XAI has requested a system impact
(48:42):
study for up to 260 megawatts of power.
So presumably that means thatthey only expect, at least in the
near term, to use 260 megawatts.
If that's the case, then you'relooking more at like maybe 200 thousand
GPUs, depending on Yeah, power usage,efficiency, and a whole bunch of other
factors that go into your, data center.
(49:02):
But generally speaking,this is another big build.
The data center apparently willbe home to what they claim will be
the world's largest deployment ofTesla mega pack batteries as well.
Which is something that,so when your power goes in.
Sometimes you get power spikes.
This is actually a, a, a big problemwith Nvidia GPUs that they're
trying to sort out right now.
But essentially when you like start thetraining process, you get like a massive
(49:24):
burst of, of power consumption whichcan be on the order like 30% or so.
That's a real problem becauseyour power infrastructure may not
actually be able to handle that.
For that reason.
You often want like batteries thatcan be connected to your data center
that can deal with that spike in powerdemand, and that's where the Tesla
mega packs become important, right?
It's a source of kind of capacitance.
(49:46):
It's also there for when youknow, the grid load is just too
high or you just need to injectmore power for whatever reason.
So yeah, really big buildXI continuing to be at the
forefront of this stuff, right?
Like we haven't heard the big Stargateannouncement, but when you look at the
numbers, they're like, they're actuallyup there with open AI right now.
And onto our last story.
It's not a business section ifyou don't have a hundred million
(50:09):
dollars startup as ever in ai.
So we've got a new one, and yet again,it is founded by DeepMind X researchers.
So they are launching Reflection ai.
They are two former Google DMI researchers and they have 130
million in early stage funding.
(50:32):
Not too many.
Details that I was able to find here.
It's the co-founders are MichelleLaskin and Jonas an, I don't know,
that said they worked on Geminitraining systems and now they're aiming
to develop super intelligence muchlike SSI and others in that space.
(50:55):
Seemingly they're starting by buildingan autonomous programming tool.
It's $130 million in funding.
Not a ton of money.
It sort of makes me think of thinkingmachines, you know, that Mirati
startup, you're seeing a lot of superintelligence companies that aren't
raising the giant amounts that.
The scaling laws suggest at leastnominally you would need to compete.
(51:16):
But you know, these aresmart people, so we'll see.
By the way, the cap tableis pretty wild, right?
So funding round for theirseed was $25 million.
This is now all by the way,being announced at the same time.
So they're coming out and saying, Hey.
it's not just one fundraiseat one 30 million.
We actually had raiseda seed at 25 million.
We raised a 105 millionseries A after that.
(51:36):
But the seed round was ledby Sequoia so like basically
the best VC on planet Earth.
Then the Series A was led byLightspeed Venture Partners.
So really, really solid vc.
And then there are other investors, ReedHoffman, right, LinkedIn co-founder.
Alex Wang from scale ai.
That's really interesting.
'cause he's a, he's been historically akind of AI safety focused guy, including
(51:57):
sort of loss of control alignment stuff.
And then there's SV Angel, andimportantly the VC arm of Nvidia.
So that's, you know, a really,really big partner to have on hand.
We saw how it moved the needlefor, for core Weave in terms
of getting allocation for GPUs.
So that's, that's kind of interesting.
Latest valuation, half abillion dollars not half bad.
I would take that.
(52:18):
anyway, we'll see.
They have paying customers by theway, so that's at least different
from LEAs safe super intelligence.
So there, there's that.
But as you said, it's, it's reallyunclear what exactly they're doing,
just working with apparently.
Fields that have large codingteams, such as in the financial
services and technology sector.
So very specific, right?
Yeah.
(52:38):
In their blog post theymentioned, for instance, imagine
autonomous coding agents workingtirelessly in the background and
workloads will slow steam down.
So even though they say theirworking on super intelligence, it
does seem like in practice they'reproducing a product that is more of
a tool to be used and similar, forinstance, to cloud code from Atropic
(53:01):
that has just released coding seemslike increasingly the new frontier
and letting an agent do its thing.
So, could even, yeah, could see thiscoming out to a product relatively soon.
And onto projects and open source.
We have a couple exciting new modelsstarting with Gemma free from Google.
So Gemma is kind of a littlesibling of Gemini, and this is a
(53:27):
Multimodel addition to that family.
They have this at scales fromone to 27 billion parameters.
The new thing here is vision,vision, understanding capabilities.
They also cover more languages 35languages and a longer context.
128,000. Tokens and some kind ofarchitectural changes to be able to
(53:53):
leverage that context effectively.
They mentioned that this was done viadistillation and are, you know, as you
might expect much better than Gemma tofor both pre-training and fine tuning.
So Gemma, you know, another one ofthese small to medium scale models,
(54:16):
increasingly you are seeing these 20,20, 27, 12 billion parameter models and
you see them being pretty performance.
So Gemma 3 27 B they say has an ELOscore on chatbot arena that is higher
than deep Seeq V three, for instance.
And also higher than LAMA three,400 five B. actually they have a, a
(54:41):
nice handy illustration too on thatchatbot arena ELO score chart where
they show the, the estimated GPUs thatyou need to run each of these models.
And I think that's actually reallyI. A really important dimension that,
that Google is articulating betterthan I've seen it articulated before.
The referring to this as the world'sbest single accelerator model.
(55:03):
In other words, single GPU model.
And when we talk about I think we'dpreviously been referring to it
as the koff scale of model size.
Basically you'd have these like,how small does a large language
model have to be, or beforeit's not called large anymore.
Anyway there's been this debate,right, that a model is not truly
open source if it requires so muchhardware to run that you might as well
(55:27):
be a big company to run it, right?
So if you open source a as youknow, Deepsea did with R one,
if you open source a model thatrequires like dozens of GPUs to run
like yeah, you've open sourced it.
The code's out there, that's great, but.
This model has, you know, 671billion parameters, you just
can't fit that on one GPU.
And so is it reallythere for the people?
(55:48):
You know, can the peoplereally use this meaningfully?
And the answer is kind of, I mean,you can have companies that run
their own instance of it, and thatis a kind of open sourcing, but
everything's along a spectrum.
The, the big flashy point hereis one GPU to rule them all one
GPU to run this entire model.
I will say there is kind of animportant distinction when we talk
about what is impressive and what'snot impressive in this space.
(56:10):
So you might look at, you know, deepCCAR one and go, oh my god, that's
almost 700 billion parameters.
Compare that to Gemma three27 billion parameters.
So a fraction of that this must,mean that Google just like knows
what they're doing a lot better.
The answer is that these aredifferent use cases fundamentally.
You know, so deep cq R onewhen you actually like.
(56:32):
Push a token throughit and do inference.
You're only using about 35 billionodd parameters in the model.
So not all parameters get activated.
It is a mixture of experts.
Model.
Moes tend to have far more parametersrelative to their performance.
That's just how they're set up.
So you go with an MOE typically whenyou care more about the performance
than Then the infrastructurecosts of hosting the model.
(56:55):
Whereas you might go with a mo asmaller, kinda like monolithic model
if you wanna just compress it, have itrunning like on the edge or something.
with sort of as highperformances you can.
So that's part of the trade off here.
These are just differentthings to care about.
But certainly if you care about,like, I wanna be able to run this
locally on, you know, my own hardware.
Yeah.
Like this is a, a big step forward.
(57:16):
And again, I like this, this line, theworld's best single accelerator model.
It's better than some of the previousstatements we've heard of, like the
world's best 7 billion parameter modelor, you know, 22 billion per 'cause.
Those just seem too specific.
This ties it to somethingthat we actually care about.
And that's a moving target too, right?
Accelerators are improvingall the time, but still feels
more concrete and, and useful.
(57:37):
Right.
And as with other releases ofthis flavor you know, you can
get the weights, the code to runit on hugging face for instance.
And you have a proprietarylicense that allows quote,
responsible commercial use.
So they have a bunch of thingsYou're not allowed to, you know,
(57:58):
let's say broadly do bad things.
A lot of restrictions unlikesomething like rock rock two.
So, yeah, there's so many of thesegood small, smaller models these days,
and I think that really showcasesthe power of distillation when you
train a giant model like Gemini two.
Next story.
We have a, a model from sesame, andwe just covered this pretty recently.
(58:24):
Their virtual assistant Maya, this kindof pretty impressive conversational
AI that you could talk to and,and have a sort of naturalistic
interaction that we had demoedfrom, let's say OpenAI for example.
So we have now released the base modelCSM one B, that powers that model of,
(58:49):
sorry, that personal assistant Maya.
So 1 billion parameters in sizeavailable under the Apache 2.0
license, meaning that you can useit for commercial applications.
You have.
Pretty relatively few restrictions here.
And they have of course a finetuned version of CSM one B for Maya.
So this isn't the exact same modelthey use for that demo that they
(59:14):
launched a couple weeks ago, butyou can use this as a very powerful
startup model to do, for instance,voice cloning with relatively little
data and it's capable of producingreally pretty naturalistic speech.
So this is an area where youdon't have too many models capable
to produce really good outputs,idea generation in general.
(59:34):
There's less open source stuffthere than LLMs, and this is
certainly like a pretty excitingnew model for that application.
It's also you know, not aa monolithic model, but it's
a model that comes in parts.
So you have a, a base llama model, whichis kind of the backbone, and then it's
paired with a, a decoder component.
that's what they finetune to build Maya anyway.
(59:56):
So, is kind of interesting.
They talk about the data as being.
Not revealed by sesame, so we don'tactually know what went into it.
They say it's capable of producing avariety of voices, but has not been
fine tuned on any specific voice.
And they say also that the model's beensome capacity for non-English languages
due to data contamination in thetraining data, but likely won't do well.
(01:00:18):
They are also not includingany real safeguards.
They have an honor system and theyjust urge developers and users
not to use the model to mimic aperson's voice without consent.
So cool.
that's good shit.
the, um, journalist I guess who wrotethis said he tried the, the demo.
And apparently cloning your voice onthe system takes less than a minute.
(01:00:39):
And from there, it was easy togenerate speech to my heart's
desire, including on controversialtopics like the election.
So that's fun.
Sesame by the way, is a well-funded orat least, I shouldn't say well-funded.
It's a, well, their, their markettheir cap table looks really good.
So they've got AndreessenHorowitz, spark Capital.
So really like top line VCs.
And yeah, interesting as you say.
(01:01:00):
I mean, it is a differentiated space.
We'll see how long itremains differentiated.
If we end up with like, theworld of multimodal models
that eat everything, then maybethey get gobbled up by OpenAI.
But certainly for now, haven'tseen that many of these models.
And just a couple more stories.
Jumping back to the smaller largelanguage model side, we have a new
(01:01:21):
model from Recka ai Recka Flashthree, which they say is a general
purpose reasoning model trainedfrom scratch using a combination
of public and synthetic data sets.
And in their comparison,it's able to go head to head
with O one Mini and QWQ 42 B.
(01:01:44):
And this is a 21billion parameter model.
You know, overall not.
Super strong.
It has a relatively shortcontext length of 32,000.
It has also a budget forcingmechanism which allows you to
limit the model's thinking.
And it, it's similar to Gemma threeis possible to run on a single device.
(01:02:09):
So not, you know, super lead ofa pack in the open source front.
But another useful model, smallermodel for other people to build upon.
Yeah, as ever with these models,especially for the smaller developers,
you can't just can't afford tokeep up with frontier capabilities.
it's all about what yourcomparison points are, right?
Who you, who you chooseas your opponent.
(01:02:30):
And so in, in all their kind ofheadline figures they are comparing
mostly to, so like Quinn withquestions three, 2 billion, right?
So making the argument that, hey,almost all of these benchmarks, it's
actually behind qu with questions 32 B.
But it is a smaller model.
So I guess the case is like, hey, we'realmost as good as a 32 billion parameter
model with a 21 billion parameter model.
(01:02:52):
Not obvious to me how much thatbuys in terms of impressiveness.
One thing I will note, like theygenerally, I mean, I would call
this model on par with oh one mini,which dropped, I mean like, when
was that, you know, late last year.
So I. When you look at like howfast open source has come, it's
like four or five months behind.
Granted, Owen Mini is somethingthat OpenAI had been sitting
(01:03:15):
on for a while, right?
And it's also, it was theirmini, it didn't represent the
kind of true cutting edge of, ofwhat they could do at the time.
But still, you know,like open source being.
Six months, seven monthsbehind something like that.
That is a closing of the gap.
And we've seen that trendhold for some time now.
So not just Chinese open sourcemodels, but now sort of like western
(01:03:36):
open source models in all forms.
So there you go.
The, the rising tide ofreasoning models, right?
And all to a point reasoning model.
They do have some indicators of whatyou would hope reasoning models that
as you use more test time computes, youget better accuracy on tough benchmarks.
Like Amy, they have that here.
(01:03:58):
If you go up to, you know, 20,000tokens, you're able to do substantially
better when with fewer outputs.
So I guess it's, it's significantin the sense that also there's
not too many reasoning models.
Of course now we have R one,which is pretty big one certainly,
but with, as a smaller reasoningmodel is so significant.
(01:04:19):
And one more story on the open front.
We actually have a paper we'renot gonna be doing a super deep
dive on, but still worth noting.
The report is open source A twotraining, a commercial level video
generation model in $200,000.
So this report shows how theytrained a pretty performance model,
(01:04:41):
not up to the level of soa, butrelatively close and, and better than.
Other models like Cog video and Ionvideo and there's essentially a whole
large bag of tricks they go through.
They have, you know, data creation,training strategies, AI infrastructure,
(01:05:01):
lots and lots of details that allowyou to train this model for, you know,
what is relatively cheap $200,000.
So, I think interesting or excitingalways to see these in-depth and
go reports that really showcasethe nitty gritty of how to be
able to train this kind of model.
(01:05:24):
Yeah.
And, and it really is, it isalways like nitty gritty when
you look under the hood, right?
The, the age of just coming up with anew architecture and oh, it's beautiful.
Arguably never really existed, but, andnow it really, really doesn't exist.
Every time we do get an open sourcerelease, we get to look under the hood.
It's all these, you know, like optimizertweaks and, and, you know, batching
(01:05:46):
and, and like finding ways to getto get your accelerators to work.
Efficiently and overlapcommunication and computation.
Like all these things, theengineering is the outcome
over, like, over and over again.
So, yeah, I mean it's another exampleof that now with with SOA type models.
And, and you can see likefor on a 200 K budget, right?
That's training of, if you look at thebenchmarks, yeah, it's pretty solid
(01:06:11):
pretty solid set of win rates relativeto, to other models that are comparable.
And yeah, so I mean, what this meansfor people's ability to not just
train, but also fine tune their ownvideo models and, and eventually
inference super cheaply, likethat's, pretty, uh, significant.
And onto research and advancements.
We begin with anannouncement from DeepMind.
(01:06:32):
They are announcing whatthey're calling Gemini Robotics.
Models that are optimized for roboticcontrol in a general purpose way.
So these are, I guess, built ontop of Gemini two and are meant to
really focus on the reasoning sideof things like physics prediction
(01:06:56):
3D space, all that sort of stuff.
So they published a verymeaty technical report.
Lots of details, a lot of focuson the perception side, on the
various kind of general capabilitiesyou're seeing that are very useful
in the context of embodiment.
With respect to the general purposenature of the robotic control, they
(01:07:18):
collected a bunch of data over thelast 12 months with these Aloha two
robots that are like two arms, andthey then are able to give it an image,
give it an instruction with text, andthe model plans and outputs code, and
the code is then executed via, I'mnot even sure what it is, so it's not
(01:07:43):
quite the same as what I think figureone x one of those two announced
with a dedicated model for control.
This is kind of a planningmodel, then an execution model.
That's not as far as I can tell, notreal time, but either way really,
you know, deep investment in roboticspace they compare to things there.
(01:08:07):
We implementation of PI zero where PIZero was also meant to be this general
purpose robotics foundation model.
And they show that this is capable ofdoing a whole bunch of manipulation
tasks with, without fine tuningnecessarily, where you can just
give it objects and instructionsand it is able to pull things off.
(01:08:30):
Yeah.
And Gemini Robotics itself youalluded to it, it's like it's two
component architecture, so theyhave the VLA backbone which is.
A distilled version ofGemini Robotics, er, right?
So they have this model, essentiallyGemini Robotics, er, its job is to
kind of understand the world andreason about, you know, kind of spatial
(01:08:51):
information and things like that.
But then the action modelitself is tacked on sort of like
local action decoder that runs.
the VLA backbone, so, so the Geminirobotics, er, the distilled version
of that, that runs on the cloud,and then the thing that's local to
the robot is the action decoder.
So it, it's on the onboardcomputer smaller model and,
and therefore low latency.
(01:09:12):
So it is optimizedfor real-time control.
They have apparently latency query toresponse latency that has been reduced
from seconds for previous iterationsto under 160 milliseconds for the
backbone end to end latency if youinclude the action decoder is closer to
a quarter of a second, 250 milliseconds.
So you know, that's pretty quick, like,getting into the, the domain where you
(01:09:34):
can see interactions with these systemsand, and importantly, you know, 250
milliseconds means you're able then torespond to what you see and touch right
in a relatively reasonable time period.
I actually don't remember how longit takes humans to kind of, you know,
respond to stimuli, the environment.
I, I wouldn't be surprisedif it was in that ballpark.
But this is, by the way, asupervised, fine tuned model.
(01:09:56):
So you're looking at the collection of,of huge amount of data, thousands of
hours of, of expert demonstrations thatthey say they collected over 12 months.
A whole bunch of diverse tasksand, and non-action data as well.
So that itself is interesting.
I mean, not surprising justbecause you don't usually expect
RL to be used in this context.
'cause RL is just.
(01:10:16):
You know, super sample inefficient andalso for safety reasons unless you've
got a really, really high fidelitysim to real transfer, you're gonna be
doing some of this stuff potentiallyin the real world if you are, there
are safety issues with just using rl.
But in any case, yeah, they saythey collected, you know, a bunch
of teleoperated robot actioninformation on their Aloha two robots.
(01:10:37):
And that's what was usedfor supervised fine tuning.
And the result is, Imean, pretty impressive.
kind of feels like it follows in thattradition of gato, the sort of truly
multimodal models that can do anythingincluding control robots, including
understand video, like just trying totry to stuff it all in there to get
to true a GI it is a very kind ofGoogle Deep Mind paper in that sense.
(01:11:00):
Right.
We have some fun examplesof things you could do.
It can make an origami fox packa lunchbox, and to give you an
example an idea of aloha to RV's.
Little two gripper hands with some,you know, relatively less expensive
hardware as far as you can tell,but still pretty capable in terms
(01:11:20):
of what they're able to pull off.
And, yeah, I guess I should correctmyself a little bit this not, this isn't
like an end-to-end image to controlmodel as far as I can tell, because
of the intermediate code output, butfor the execution of individual code
commands, I, I guess we are usingthat Gemini robotics vision language.
Action model to execute things likegrasping, for instance grasping
(01:11:44):
objects or moving to a particular pose,which the planning stage is output.
So, don't believe this is released yet.
I need to double check,but I have a way, you know,
we saw this surprise zero.
This is pretty much comparableto that in claiming to be
a robot foundation model.
And I think also figure N one X areworking on similar things with a
(01:12:09):
general purpose capable robotics model.
So it really does seem like there's apretty significant amount of progress
in the space, and it is prettyconceivable that we can have pretty
broadly capable embodied AI agentsin the coming years, which has pretty
significant implications for, let'ssay, economic impacts and so on.
(01:12:30):
For sure.
I mean there, I, I haven't read it yet.
There's a, post where they'retalking about, implications for
China of a lot of the breakthroughsin robotics and, I mean, this sort
of makes me think of that, right?
To the extent that we're buildingreally, really good software, you
know, that, that controlling robotsis a software problem pretty quickly.
Then if China replicates that,which they will your ability to just
manufacture robots at scale is one key,key determinant of, national power.
(01:12:57):
And China has this just ismopping the floor with us on that.
So, that'll be an interestingthread to follow for sure.
Yeah.
And, and there's even more of a, kind ofa, a lot bundled with this announcement.
They also, as a fun side detail,released the MOV benchmark.
So there's another paper calledthe Generating Robot Constitutions
(01:13:19):
and Benchmarks for Semantic Safety.
They say that this MOV benchmark isa comprehensive collection of, for
data sets for evaluating and improvingsemantic safety, a foundation and
models serving as robot brains.
So that's pretty fun, I guess.
They are focusing on the safety sideto not get Skynet to happen, I guess.
(01:13:44):
They also have another dataset they release as part
of this I think it's ER qa.
I. It's a, it's a visualreasoning dataset with a
focus on embodied reasoning.
And so a lot of you know, variationand means of benchmarking for
embodied applications in particular.
(01:14:04):
Next we move away from robotics backto the favorite topic of recent months.
Test time, compute andreasoning with the paper.
Optimizing test time compute viameta reinforcement, fine tuning.
So the focus here is not on enablingtest time compute and reasoning.
(01:14:25):
It's on making test timecompute usage efficient.
So, that has been kindof a known problem.
And we covered a fake, a paper for this,this idea of overthinking where you use
more test time, compute when necessary.
They have an interesting figure in thepaper actually, that it seems that in
many cases, just doing a majority vote.
(01:14:48):
So making many outputs.
And then just from among the manyoutputs choosing what most of the
models, output doesn't actuallyoutperform test time compute scaling.
So in fact, test time compute is maybenot as efficient as just doing inference
with shorter compute a bunch of times.
(01:15:09):
So in any case, they are introducinga method to optimize test time compute
efficiency via meta reinforcement.
Fine tuning.
So reinforcement learning isa learning to optimize your
reward for a given task.
Meta reinforcement learning orfine tuning is being able to
(01:15:31):
adjust quickly to a given task.
So it is kind of a metal layer,right, where you're learning to
achieve good rewards efficientlywith not much feedback.
And I do that by providing a densereward during the reasoning process.
So.
Similar as opposed to processreasoning models as well.
(01:15:54):
They chunk the test time, computereasoning steps into episodes.
They provide reward for each episodethat indicates how much progress it
represents towards solving of a problem.
And then they are able totrain the model to, pretty much
(01:16:14):
minimize regret, optimize, beingable to make rapid progress.
And they say that they're ableto show two to three times
improvement in performance formath reasoning compared to just
outcome based reward reinforcementlearning, where you are getting
just zero one rewards at the end.
So that's a pretty significant,you know, one to also at a
(01:16:38):
1.5 gain in token efficiency.
Yeah, it, it sort of reflects thischallenge of, this is sort of classic
exploration, exploitation, trade offthat you see in reinforcement learning.
I mean, I would argue basicallyanywhere, but reinforcement learning
is the place where it's most obviousin its mathematical formulation.
Essentially at any given step.
If you are a language model and you'retrying to reason through some reasoning
(01:17:00):
trajectory to solve a problem, you couldkind of choose like, do I just generate.
An output right now, whichis very compute efficient.
Like, I'm not gonna spend alot of flops, I'm just gonna
do a, you know, quick inferenceand, and the job is done.
Or do I spend more time doing discovery,doing exploration, testing out
different possible solutions, right?
(01:17:21):
And certainly thatexploration, but is important.
We know that because when we lookat models like R one zero that
are just trained by reinforcementlearning to reason as much as they
want to get to their solutions,what you see is the reasoning traces
get longer and longer and longer,and those traces correlate directly
with higher and higher performance.
So there certainly is value tojust longer reasoning traces
(01:17:43):
that allow for more discovery.
But the question is like.
is all of that discoveryactually worth it?
Like at a certain point are you justsort of mulling over the unknowable?
Are you, are you kicking adead horse or whatever that
quite literally overthinking?
Yeah, exactly.
Yeah.
Quite literally overthinking.
Not necessarily because it'll,it'll kind of like, you know,
give you a worse result.
(01:18:04):
Though potentially that's a thing.
But also just becauseit's wasteful, right?
That's the, the big thing.
And when you think about the humanbrain, the way that our brains were
sort of evolved, yes, there is apressure evolutionarily to make good
predictions so that we make smart.
Calls, but also there's a huge, hugepressure to be compute efficient
or energy efficient, right?
The human brain runs on, I forgethow many watts, but it's like a
(01:18:25):
shockingly small amount of energy.
And that's often a, a distinction anywaybetween AI systems and computers where
we just face different constraints.
And so this is going to be aneffort to say can we measure, as
you said, the progress that eachof reasoning in a long reasoning
thread, in a long chain of thought.
Each chunk of reasoning.
(01:18:47):
How much does it contribute tothe accuracy of our final answer?
And it might seem weird, like how doyou even measure that in, in practice?
There are a couple different ways.
The sort of most straightforward is thatthey just they look at, okay, after this
chunk of reasoning, they, by the way, inreinforcement learning terminology, they
refer to, to these chunks of reasoningwithin the chain of thought as episodes.
(01:19:10):
So an episode in, in RL parlance islike a 1 play of a game or something.
you know, these episodes or essentiallychunks of reasoning, what they're
gonna do is after episode number, sayfive they will, instead of letting the
model move on to episode six, in otherwords, try another chunk of reasoning.
They'll sometimes just likeforce it to give an answer.
(01:19:30):
I. Directly from say, episode numberfive, and they'll do that like 20
times and get 20 different answers.
Let's say that 20% of thoseanswers that jump straight
from episode five are correct.
Then you let the modelgo on to episode six.
You repeat the process, have it generate20 answers straight from episode six.
Let's say that those 20 answersyou get, I don't know, 60% correct.
(01:19:53):
Well, now it's like, damn, episodesix really kind of made a difference.
So it's worth it to keep reasoningup to that point, but maybe
episode seven, you know, onceyou repeat it there, it plateaus.
That kind of gives you a sense that,hey, we're not actually making more
progress here as we add more of theseepisodes as we do more in context or
yeah, I guess chain of thought stuff.
And so maybe it's worthcutting our losses, right?
(01:20:14):
So this is quite kind of gonna be howthey, they instantiate this in practice.
They're going to train 'emtwo different models to take
advantage of that information.
One of them, they'll trainthrough supervised, fine tuning.
This one will involve basically justgenerating a whole bunch of reasoning
traces for a whole bunch of problems.
Segmenting those reasoningtraces into episodes, right?
These like logical thought segmentsevaluating, again, each episode the,
(01:20:39):
the progress that's made by eachepisode by forcing the model, just
give its best answer at that point.
And then ultimately filter fortraces that achieve maximum progress.
So they, they make steady improvements.
Each episode builds considerablyon the previous one and then
eventually reach the correct answer.
And they're just gonna keepthose reasoning traces.
(01:20:59):
And what this means is, like,if you think about it, this
is filtering for cases where.
You're adding more and more reasoningsteps, like more and more episodes, and
each one corresponds to a significantbit of forward progress, right?
Your ability to get to like kindof one shot the right answer
after that stage keeps goingup and up and up consistently,
which is not always the case.
Sometimes, you know, the, you'll getan episode that is incoherent or,
(01:21:23):
hallucinates something or whatever,and it'll actually make your, outputs
less likely from there to be correct.
So they're filtering those outand getting this really kind of.
Pristine dataset with correct answers,but also correct, or let's say
constructive progress in between.
And then they're just gonna fine tunea model on those reasoning traces.
That's the sort of asupervised fine tuning version.
(01:21:43):
They also do a reinforcement learningversion where at each of these episodes,
they'll actually do not just a, a oneshot output, but partial rollouts.
So, so do multiple continuationssome of those continuations will
like terminate reasoning immediatelythey'll produce a final answer.
Others will keep going.
And in either case, what theydo is they issue a reward and.
(01:22:06):
Both process rewards based onthat same metric of like, how much
progress are you making and alsofinal outcome rewards for correctness.
So anyway, these are twodifferent approaches.
Fundamentally based on, mean, I wouldargue the, the main insight here is
this idea of like, can we chunk thingsup into an episodic format that more
closely resembles the traditional, kindalike reinforcement learning paradigm,
(01:22:28):
and then find a way to evaluate progresstowards towards an outcome here.
So I think an interesting paper.
Yeah.
And, and they do compare to othertechniques like, GRPO, VIT used for
deep Seq R one and show that notonly do you get better performance
by a pretty, let's say, not a hugeamount, but a consistent improvement.
(01:22:53):
You do get that improvementin efficiency as well as you'd
want from this technique.
okay.
Moving on to reviewing a couplethings we didn't get to last week.
We have some system cards.
First up is deep researchsystem card from OpenAI.
(01:23:13):
They introduced deep research asthe multi-step technique where you
can ask OpenAI or chat GBT to answersome query, do some research for you.
And that will go off and do a bunchof searching, do a bunch of analysis,
and compile essentially a report.
(01:23:34):
So they did conduct rigoroussafety training, preparedness
evaluations, and governance review.
And the focus of this new evaluationis in particular on privacy protections
around stuff that's published onlineand the training model to resist
malicious instructions that it may comeacross while searching the internet.
(01:23:57):
So this is basically.
Now that you're letting ULLM go offand just, you know, browse the web
on its own, you probably have newkind of risks and ways the model
might be exploited or jailbroken.
As we've seen, there's beenpapers on jailbreaking via
like website manipulation.
(01:24:17):
They go into that kinda stuff.
And also their usual preparednessframework in the system card.
Yeah, the, the preparedness frameworkstuff is, is kind of interesting.
So, you know, often when thesemodels come out there isn't really
a move relative to the frontier onthe preparedness framework stuff.
So, this time we have that for context.
(01:24:38):
So opening Eyes preparedness frameworkbasically says, look, we have I think
it's four now, different stages ofrisk or classifications of, of risk
for these models, for each of the, thekind of standard threat domains cyber.
Seaburn.
So chem, bio radiologicaland nuclear capabilities.
And then autonomy.
So they have low, medium, andhigh risk classifications.
(01:25:01):
Once you cross over from mediuminto high, they say they will not
release the model in that form.
They're gonna do mitigations untilthe model's capabilities or or
risk associated with that modelget back down to medium level.
Interestingly, this is thefirst time, so deep seek.
Sorry, deep seek, deep researchis the first model that's
been classified as medium incybersecurity, which is significant.
(01:25:23):
They've seen big improvementsin capture the flag challenges.
So these are cases where youwanna get your model to solve a
problem that involves getting somesequence of numbers or characters
outta some container or some,some sort of software environment
that involves cracking into it.
So apparently it solved 82% of highschool level, 55% of collegiate
(01:25:44):
level, and 47% of professionallevel capture the flag challenges
with without browsing, by theway, without internet access.
So, so pretty impressive cyberoffense capabilities here,
even if these don't involve.
You know, finding new zero days orwhatever the reality is that this
massively increases what you cando as just a, a sort of, from a, a
(01:26:06):
cheapness standpoint, you know, youcan, you can increase the number of
people who can effectively carry outcyber attacks if they can outsource
it to something like deep research.
And then you can also justincrease significantly the amount
of work that one cyber attackercan do thanks to these tools.
So we're also seeing a medium riskclassification on the seaburn side.
Again, that's chem, biological,radiological nuclear risk.
(01:26:27):
they say that deep research can helpexperts with operational planning for
reproducing known biological threats.
So that's a, a needle mover.
And then they also flag that severalof their bio evals are indicating
that these models are quotes onthe cusp, on the cusp of being
able to meaningfully help novices.
(01:26:47):
Create known biological threats.
And so that would crosstheir high risk threshold.
That's a pretty wild thresholdif you think about it, right?
Being able to meaningfully help novicescreate known biological threats, like
not new ones, but you're a novice.
Like if you don't have any experienceand you want to know how to like, I
don't know, perform an anthrax attackor something, I mean, presumably this
is what this is getting at, right?
(01:27:09):
So anyway, they also say that wellhere's a quote, current trends
of rapidly increasing capability.
Continue and for models to,they expect models to cross this
threshold in the near future,into the, the high risk threshold.
That is similar results bythe way, in autonomy, which I
found particularly interesting.
Opening eyes, big play.
And really right now,this is everybody's play.
It's pretty obvious.
You make models that areincreasingly good at tasks
(01:27:32):
associated with AI research.
So you can automate AI research itself.
That leads to the risk ofmodels figuring out how to
self-improve and loss of control.
This is something that theylook at in, or a subset of what
they look at in their autonomyevals, which they do by the way.
In collaboration with.
With meter, which is one of thekind of big AI evals companies.
They say medium risk here forautonomous capabilities flagging
(01:27:55):
self-improvement potential.
So improved performanceon SW bench, which is this
software engineering benchmark.
SW bench verified is the version ofthat that OpenAI has kind of cleaned up.
And they see greater potentialfor self-improvement and
AI research acceleration.
This is essentially just likea coded way of saying recursive
self-improvement at the lab level.
(01:28:16):
So, you know, helping one AIresearcher do the work of.
More than one AI researcher.
And to the extent that happens,that means you can accelerate
the speed at which you buildthe next generation of model.
That generation of model then presumablyis even better at accelerating
your research within the lab.
And eventually thatcontinues out and finite.
Boom, you get a singularity.
That is kind of the explicittarget of open ai of, all
(01:28:38):
the frontier labs with variouscommitments to safety along the way.
But that's kind of the,the framework here.
So pretty remarkable that, we'reat the point where the medium
to high risk threshold is we'resort of flirting with that.
I think if and when you crossthat threshold you're gonna see
some really, really big changes inthe national security situation.
I think it's, it starts to becomeimpossible to deny that loss of
(01:28:59):
control is a risk that recursiveself-improvement is a risk.
And they do say they're working onmitigations that include quote, threat
model development for self exfiltrationand self-improvement risks in
preparation for age agentic models withheightened capabilities like this is.
Being invested in right nowat OpenAI, pretty, pretty
remarkable time to be alive.
(01:29:19):
2025. And next up we actuallyhave a second system card this
time about Claude 3.7 sonnet.
Also a little while ago released,but wharf covering in relation
to actually another one.
We will get to just a sec.
So the Claude 3.7 sonnet system cardis about, you know, cloud 3.7 and it's
(01:29:44):
modeled both as just an LM and alsoin particular as a reasoning model.
It was released in February I don't knowexactly, maybe a week two weeks ago.
And as with the usual reports we getfrom philanthropic, very detailed
dozens of pages to go through withvarious evaluations and so on.
(01:30:05):
I think the, the interesting thingto highlight for me from this model
card is the new things introducedby the thinking aspect as sound 3.7.
So.
They do discuss how the chain ofthought, the reasoning process can
be a new tool for alignment where youcan look through it and basically have
(01:30:30):
hints or be able to verify alignment.
They also evaluate concerningthought processes and the idea
of chain of thought faithfulness.
So, it may be the case that the chainof thought is not enough as a signal
(01:30:50):
to monitor, to be able to tell ifa alum is doing something wrong.
And we've seen some other research alsoalong these fronts, so they do some
preliminary investigation of the chainof thought faithfulness show that there
is some amount of possible misalignment.
So, yeah, lots to cover in, in this one,and I'm sure on topic will have follow
(01:31:13):
ups that go deeper on the particularideas around chain of thought outputs.
I'm also tracking that we havelike 10 minutes for the episode.
Man.
We are good at filling the time.
I'll just share onething really quickly.
Unfortunately.
I mean, great to go into moredetail broadly, I will say the
results of this are, are in linewith with the deep research.
(01:31:34):
System card, that opening I putout, you know, obviously differences
here and there 'cause of thefocus on age ancientness versus
kind of the deep research side.
But in any case they do see upliftin human performance, on bio weapon
acquisition trials pretty significantly.
They, it was 2.1 x improvementcompared to people without
the assistance of the model.
I remember when people were talkingabout OpenAI study or sorry, there was
(01:31:57):
a RAND study that was put out a littlewhile ago that seemed to suggest that,
hey, you know, there's no meaningfuluplift in then current models.
was then I. Disproven by OpenAI, whoused the fact that they had access to
the base version of G PT four without,without the, the safeties on it.
And even something like thatcan make a big difference.
Here's a quote from the paper that'sinteresting and in line with stuff I've
(01:32:18):
heard from folks at Anthropic recently.
There is a substantial probabilitythat our next model may
require a SL three safeguards.
A L three essentially it's,it's an AI security level.
Three.
So this is anthropics equivalentto opening ai, sort of like low,
medium, and high risk categories.
They have, you know, aSL 1, 2, 3, and four.
So a SL three, you're, you're startingto look at systems that are showing
(01:32:41):
early signs of autonomy and so on.
But a again, it's sort of like crossingto that high risk for the sein side.
So again, very consistent, right?
We are on the cusp to useopen AI's language of, of a
lot of these very significantdevelopments on the risk side.
And moving on to policy and safety.
We begin with very much relatedsafety paper from OpenAI.
(01:33:02):
It's titled Detecting Misbehaviorin Frontier Reasoning models or
sorry, the paper itself, I think we,a blog post detecting misbehavior.
The paper is started monitoringreasoning models for misbehavior and
the risks of promoting a obfuscation.
So the idea is once you have thesereasoning models, they output a
(01:33:26):
bunch of its reasoning via thesechain of thought traces and.
Fundamentally chain of thoughtmonitoring can be very effective
at being able to tell whether theLLM is going off and doing things
you don't want, and prevent itfrom you, kind of catch it while
it's going off course, so to speak.
(01:33:47):
So they first, in this paper showthat COT monitoring can be effective
and possible to intervene, but alsothat if you integrate the monitoring
into the reinforcement learningreward, that could lead at first to
better agents that are more aligned.
(01:34:09):
But if you optimize too much witha chain of thought monitoring, the
agents do learn obfuscated rewardhacking, which I think we covered
something similar a week ago wherebasically not only do the models
find wrong way to get a good reward.
They also obfuscate the fact we aredoing that you know, making it seem
(01:34:30):
like we're not in the chain of thought.
So they have in this paper an idea ofa mo mo ability tax by allowing the
chain of thought output to maybe besuboptimal as a, a method of making
it kind of easy to tell what themodel is doing so it doesn't optimize
(01:34:50):
itself into being able to cheat.
Right?
Yeah.
It's essentially like the, themodel at first learns to tell
you that it's planning on, doingsomething dangerously creative to
achieve its programmed objective.
And if you then go, oh, don't do thateventually if you do that enough,
it will learn to just lie to youabout the fact that it's going to use
(01:35:11):
the dangerously creative strategy.
this is something that's beenpredicted by the way, forever.
It's something that I will saya lot of people claimed was like
science fictiony and implausible.
I think this is another unfortunatemoment where people on the alignment
side in the alignment community.
Have called it like years andyears ahead of time and have turned
out to unfortunately be right.
(01:35:32):
Didn't have to be thisway, but it, apparently is.
And so, yeah, this isgood heart's law, right?
Anytime you apply enough optimizationpressure to an objective, you will find
dangerously creative strategies thatexploit the gaps between your intent
and that objective that you didn't tellit not to pursue 'cause you never even
thought of in the first place, right?
That, that is good heart's law andyou're just seeing it play out.
It's, it's really interesting thatspecific way that this manifests is
(01:35:55):
apparently you'll get this like veryinnocent looking reasoning trace.
And then suddenly it'lljust like do the bad thing.
Like it'll do the thing that youwould have told it not to do.
It just won't tell you thatit's reasoning its way there.
Which is sort of interesting.
I like, to me that suggests themodel must be Paying its own tax.
'cause it, it takes extra computeto think about being deceptive,
(01:36:18):
but then it's just generatingan output that's completely
decoupled from the reasoning trace.
And it makes me wonder if there issome spectography eventually that would
go on there where not only would it.
Find a way to generate an innocentlooking reasoning trace so that
it doesn't get kind of blocked,but also that innocent, ostensibly
innocent looking reasoning trace mightactually contain encoded within it.
(01:36:42):
Some thought tokens that allow it toreason in the kind of evil way, if you
will, that that you don't want it to,just because those are the two pressures
you're putting on it, get the rightanswer, but also make me think that
you're reasoning innocently anyway.
They have this really interestingplot that shows how these hacks,
these kind of dangerously creativehacks evolve in the model over time.
(01:37:02):
and they show manual interventionswhere hack exploitation basically
drops to zero with one kind ofhack because they catch it, but
then another pops up, right?
You're playing whack-a-mole withthese systems and ultimately if
you just pour more compute intoyour problem, this is a, a game of
whack-a-mole you are destined to lose.
it's really interesting.
they check, you know, baseline agentswhere you don't exert any optimization
(01:37:24):
pressure on the chain of thought.
You find that quite often, notalways, but around 95% of the time,
a weaker supervising model likeGPT-4 oh can catch these attempted
hacks and, and knock 'em down.
Over time the number of hacksthat are attempted, by the way,
just goes up and up and up.
And so that just reflects the factthat hacks are almost by definition
(01:37:45):
things that are easier to implementthan the rightful solution.
And so there is anactual outright pressure.
It's not just that these are equally.
These are equal solutions.
And sometimes it goes with theannoying hack and sometimes it
goes with the solution you want.
No, over time the models willconverge on the dangerous solution
because it is more efficient, it'sless compute intensive or reflects
(01:38:08):
other pressures that the model faces.
And so this is at its core, by theway, the argument for loss of control
like it's very, very difficult lookingat, these kinds of empirical results
to conclude that we're not headed fora situation where that could be a, a
live risk in at least some context.
And so I think this is great greatresearch, great to see OpenAI publishing
(01:38:29):
this being very open about it.
Especially given some of theother qualms that people have had
with, with their alignment program.
Yeah.
And by the way, the hacking and questionin this case they focus on examples
where the AI needs to pass unit tests.
And so in the s Python context, itcould do various kind of wrong ways,
(01:38:51):
basically hack the environment itselfto be able to get passing grades in
without actually doing the task itself.
And the next one is about Chinatelling its AI leaders to avoid
US travel over security concernsaccording to the Wall Street Journal.
So that's pretty much the story isapparently the Chinese leadership
(01:39:15):
is worried that AI experts travelingabroad could divulge confidential
information and that they could bedetained and used as a bargaining chip.
And I've gotta say, I mean, thishas been outgoing for a while, but.
In the context of AI, where thereis a lot of researchers in China,
(01:39:37):
a lot of you know, conferencesand so on, it's starting to get a
little intense maybe when you'reseeing these kinds of stories.
Y Yeah, for sure.
And it, there's a bit of projectionI think, going on here 'cause this is
exactly the kind of game plan that,you know, China has used in the past.
So there was famously a case whereyou know, China held back, I guess it
(01:39:58):
was two Canadians Canadian citizens.
And it was part of theirnegotiation with, with Canada for,
I forget what and then there was.
In parallel with that, a Huaweiexecutive held in Canada,
Washington's request, this was duringthe first Trump administration.
I think this was oh God.
May something I, I forget.
But anyway, she was ultimately,I think, sent back to China.
So in this case you know,can you say wartime footing?
(01:40:22):
I mean, this is really whatyou, what you expect to see.
when countries are starting to lookat AI as a, essentially a proto
WMD, certainly a, a key strategicnational security technology.
And the details here are that, so youhave, authorities are discouraging
executives at leading local companiesin AI and other strategically sensitive
industries such as robotics fromtraveling to the us and to us allies,
(01:40:43):
unless it's urgent, if they must they'reinstructed to report their plans before.
Leaving and upon returningto brief authorities on what
they did and whom they met.
So, we see apparently in retrospect thismay have been tied to Liang won Wfg.
So the deep seek founder who turneddown an invitation to attend the AI
(01:41:03):
summit in Paris back in February.
So speculation about whether thatmay have been prompted by essentially
the same muscle movement that ledto this announcement by the Chinese.
A lot of similar thingsthat, that have come up.
And apparently Xi Jinping told a bunchof at, at this gathering and he with
famous gathering that involved LeonWang and a whole bunch of other folks
in the AI community in China he toldthem to uphold a quote sense of national
(01:41:26):
duty as they develop their technology.
That's pretty interesting kind ofproto nationalization language,
though it is consistent with,with the CCPs broader tone when,
when they address these things.
So there, there is an interpretationof this that says, well, China
seems to be gearing up for some,big adversarial race on ai.
It certainly is the exact opposite ofwhat China's historically done, right?
(01:41:49):
They've historically said,no, go forth, go to the us,
absorb everything they know.
Bring it back here.
This now suggests you know, maybea slightly different attitude.
It could also be, and, and,you know, it could be a yes.
And here that there's been an issuewith a brain drain from China.
A lot of wealthy Chinese have movedoverseas in recent years especially,
and that includes researchers.
(01:42:10):
So you might worry about that piece.
But both things can be trueat the same time and it's
pretty, pretty ominous sign.
And I think frankly, under reported,like this is the sort of thing you
see in the runup to a major sort ofinternational clash on a key technology.
Right.
And kind of a related note, OpenAIdid submit a policy proposal to the
(01:42:32):
Office of Science and Technologypolicy basically saying that they
should heavily consider banningtechnology from what they're saying.
Step state subsidized orstate controlled entities may
be referring to deep seek.
So, opening eye seemingly positioningitself at increasingly also as a hawk
against China, I suppose I. Yeah.
(01:42:54):
And you see all thepredictable pushback on that.
Or, you know, you have OpenAIpushing back against the, an
open source model on the groundsthat it's okay state subsidized.
But of course the, the wholepremise of OpenAI was, well, you,
you wanna open things up so thatanybody can use the ai, it doesn't
all get kind of centralized.
Yeah.
So similar skepticism about, I thinkOpen AI came out and, and used deep
(01:43:17):
seek as a reason to say, Hey, weneed to give AI companies in the
US this fair use exemption so theycan train on, proprietary data.
because we have this competitionwith China, anyway, it's, it's a
whole hairball, but This by theway, is a submission to the request
for information that came out asthe Trump administration is putting
together their AI action plan.
So, you know, we'll be seeing that.
I think that's David Sachs, who'snominally in charge of that effort.
(01:43:40):
And that is it for the episode.
Not quite two hours, but we didcertainly use up a lot of time
given that there isn't any huge.
Thank you so much for listening.
As always, feel free to go to lastweek in AI for our text newsletter with
even more articles we haven't covered.
And as always do subscribe.
If you aren't, please do shareand review and give us feedback
(01:44:04):
to let us know how we can approve.
But more than anything,be sure to keep tuning in.