Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:10):
Hello and welcome to the last weekin AI podcast where you can hear us
chat about what's going on with ai.
As usual, in this episode, we willbe summarizing and discussing some of
last week's most interesting AI news.
You can go to the episode descriptionfor all the links and timestamps,
and also to last week in ai.comon your laptop to be able to read
(00:33):
those articles yourself as well.
As always, I'm one ofyour hosts, Andre Ov.
I studied AI in grad school and I nowwork at the generative AI startup Asate.
And I'm your other host, Jeremy Harris.
I'm with Gladstone ai, AI NationalSecurity Company, which you know
about if you listen to the podcast.
You also know about aade.
Now a bunch.
(00:53):
If you listen, you know about allof this, you know about all this.
What you don't know though is thatthis morning at the early hour, I think
it was like three or something in themorning I discovered that I have bats
in my house, which is fun, which isreally fun, especially when you have
like a six month old and you have bats.
And then you start Googling things.
So anyway, we had pest control come in.
(01:14):
That's why.
Wow, my hair looks likeCosmo Kramer right now.
I've just been running my, running myfingers through it for for quite a bit.
So anyway we got everything on forShowtime though, because the show go on.
Yeah, if you, but if you get anydetails wrong, you know, it's, it's
the shock, residual shock of bats you.
I'll be on the lookout.
(01:36):
Well, let's do a quick preview of whatwe'll be talking about in this episode.
It's gonna be a bit of a relaxed one.
There's nothing too sort of worldshattering, but a variety of pretty
interesting stories, tools, and apps.
We have some new impressivemodels out of China.
Some new stuff from OpenAI as well.
(01:58):
Google, philanthropic,everyone launched some stuff.
Applications andbusiness as we often do.
Gonna be talking a lotabout hardware and GPUs.
A little bit about fundraising aswell, projects and open source.
We'll be talking about the modelcontext protocol, which has been all
over rage in the AI community recently,and a couple new models as usual.
(02:21):
Research and advancements.
We gotta talk about reasoning techniquesinference time, scaling techniques
but also some new kind of developmentsin the space of how you implement
your models, policy and safety.
We have some more analysis ofwhat's going on with China, US
national security, things like that.
(02:42):
And finally, we will actuallytalk a little bit about the
world of art and entertainmentwith some news about copyright.
So let's just get straightinto it in tools and apps.
The first story is about Baidulaunching two new versions of the Ernie
model, Ernie 4.5, and Ernie X one.
(03:04):
So Ernie initially releasedtwo years ago, and now we
have Ernie 4.5, presumably.
I don't know, it sounds like kind ofto coincide with four with GPD 4.5.
And then Ernie X one is thereasoning variant of Ernie that
Baidu says is on par with deepseek R one, but at half the price.
(03:27):
And both of these models are multimodal.
They can process videos,images, and audio as well.
They also say earn your 4.5 iskind of emotionally intelligent.
They can understand the memesand satire, which is interesting.
So I think.
We, we don't have like a greatsense of the tool landscape
(03:52):
in China is my impression.
I, I really wish I would know,like if you are a user of a
chat bot, I. We go to chat GTor Claude to give out queries.
I think it seems likely that Ernieis sort of filling that role and
the fact that there's new modelsand the fact that they're really
competitive, pricewise is a big deal.
(04:13):
the like number one downloaded appin China just switched to a new
AI chat bot that is not deep seek.
So things are definitely moving.
the big advantage here with the,this launch seems to be cost.
At least that's what they'releaning into with a lot of
the discussion around this.
So the goal that Baidu has, which,you know, Baidu of course is China's.
(04:34):
Roughly China's Google, right?
They, they own search there.
Their goal is to progressivelyintegrate Ernie 4.5 and their X
one reasoning model into all theirproduct ecosystem including Baidu
search, which is sort of interesting.
So we'll see a, a rollout ofthe generative AI capabilities
in, in that context.
Yeah, so ultimately itdoes come down to price.
(04:54):
A lot of it.
So for context, there's a reallyhandy table in one of the articles
that looked at this comparing GPT four point fives per token cost
to deep seek V three to Bernie.
Sorry, Bernie to, to Bernie, 4.5.
it's quite interesting, right?
So input costs for input tokens,75 bucks for a million tokens.
(05:15):
This is for GPT-4 0.5.
Deep seek V three thatdrops to basically 30 cents.
Ernie 4.5 is about 60 centsor so per 1 million tokens.
So, you know, you're talkingorders of magnitude less.
Also the case that thesemodels are less performant.
So that's sort of the trade offthere, but where things really start.
(05:35):
Yeah, I think just to give a, abit of a perspective deep CCP free
is more comparable to somethinglike GPT-4 oh in open eyes slate
models or free mini, for instance,where aggression isn't that crazy?
It's maybe I forget, $1 ish.
Per million tokens,so they're comparable.
G PT 4.5 is just crazy, crazypricing compared to everything else.
(05:59):
And that's the thing, right?
It's the, the way to think about 4.5I think we touched on this a couple
episodes ago, but it, it's a basemodel, but it's not a base model for
let's say mass production, right?
These are high, high quality tokens.
Probably best used to create thingslike synthetic data sets or to answer
very specific kinds of questions.
But you're not looking at thisas something that needs to
(06:20):
be, that you wanna productizejust 'cause you're right.
I mean, it's, two orders of magnitudemore expensive then other base models
where you actually see the lift hereespecially for Ernie X one, right?
With this is the reasoning modelis on the reasoning side, right?
So open AI's O one is.
roughly 50 times moreexpensive than Ernie X one.
(06:42):
Ernie X one is about half thecost of R one for input tokens.
And and actually that'salso true for output tokens.
So it, it's a like quitesignificant, especially again,
relative to O one and shows you.
One of two things.
Either Chinese engineering isactually really, really, really that
good, or there's some state subsidything going on in the background.
(07:04):
I think the latter is somewhatless plausible at this point,
though I wouldn't rule it out.
Certainly there's some amazingengineering making these these
margins possible and it's, that's apretty remarkable thing here, right?
I mean, the cost justcollapsing for reasoning.
This implies that there's some reasoningspecific engineering going on in the
background and you know, you shouldexpect that to apply to training
(07:25):
as well as infants going forward.
Yeah, and it's kindof, I. Funny in a way.
There is a parallel here between Baiduand Google, where Google likewise has
quite competitive pricing, especiallyfor Gemini Tube flash thinking.
So I could also see it being you knowjust a company strategy kind of thing.
(07:47):
Baidu is gigantic.
They're printing money with search,so they could also kind of eat
the additional cost to undermine.
Something like deep seek, which is astartup right, to lock in the market.
But either way, exciting news.
And I guess if you're in China, Idon't believe you can use Chad GBT.
(08:07):
So if nothing else, it's good thatthere are comparable tools for
people to use and not miss out onthe fund of these advanced lms.
Yeah, I I I will say I don't know that,that Baidu would be subsidizing at the
level of, at least their base modelbecause they are actually more expensive
than Deep Seq V three on Ernie 4.5.
That's true.
Yeah.
Where, where do you see thatflip is with the, the reasoning
(08:29):
models, which itself is, yeah,that's kind of interesting, right?
I mean, to me at least, that seemsto imply something about reasoning
like engineering for, for the, likecomputer architecture behind reasoning
or more token efficiency and thereforecompute efficiency at the, at the
reason, I shouldn't say therefore maybealternatively compute efficiency at
the reasoning stage, but, you're right.
(08:49):
There's all kinds of things start tomuddy the waters when you start thinking
about the economics of these things.
as they represent a larger and largerfraction of the corporate bottom
line, even for big companies likeBaidu, like Google these companies
are gonna be forced to show ustheir hand in a sense, right?
They're gonna have to sell these tokensfor a profit, and we will eventually
learn what their actual margins are.
(09:11):
It's debatable whether we'relearning that just yet.
Yeah, I don't think we are.
It's, it's very much unknown.
And I haven't seen any kind ofstrong analysis to explain it.
There's, you know, yeah, it'sjust mystery what kind of
tricks people are pulling.
But I would also kind of betthat the margins aren't great.
The one thing we do know deep seekclaimed at least that they were
(09:34):
making a profit and, and had apositive margin on their models.
And I could see thatnot being the case for.
You know for instance, OpenAIwhere their revenue is in the
billions, but real question is,are they actually making a profit?
last thought on this too.
On the economic side, like whenwe think about what it means for
(09:55):
deep seek to claim that they'regenerating positive returns I think
there's a, an important questionhere about whether that's operating
expenses or CapEx factored in, right?
So we saw in their paper thatthey famously talked about how
they, trained V three on $6million of compute infrastructure.
Now, I. Or sorry, on a $6million compute budget.
(10:16):
That was, it seems in retrospectthe actual, the operating expenses
of running that compute, not thecapital expenses associated with the
Tens of millions of dollars as itwould've been of compute hardware.
So it's always hard to know,like, what do you amortize?
How do you factor inwhat's apples to apples?
Yeah.
It's hard to say, like deep seek isprofitable, but on a per token basis,
(10:38):
just for inference, I believe the claimis we're making money, which yeah.
In itself is on an opex basis.
Yeah.
Interesting.
Yeah.
Moving right along.
Next we have OpenAI and they arereleasing some new audio models.
So there are new two new speechto text models, GP four oh
transcribe and GP four oh MiniTranscribe, which are basically
(11:03):
replacing their whisper models.
Open has already had this asa service for quite a while.
The exciting new thing here is thetext to speech model GP four oh mini
dash TTS, which is more along thelines of 11 labs where you can produce
(11:24):
very natural human sounding speech.
And along with announcement of themodels, OpenAI has also launched a
new cite OpenAI fm, which is a, a demosite where you can go and mess around
and, and kind of hear the outputs and.
This is kind of a fun trend, Igotta say, where these companies
(11:44):
increasingly are launching theselittle fun toys to get a sense for
what these models are capable of.
one last thing, again, we probablyshould comment on pricing.
The pricing is very competitive.
The transcription for GT four oh,it's 0.60 cents per minute 0.60 cents.
(12:05):
So like $0.01, I guess.
And GT four oh meaning ETTS is1.50 cents per minute, which
is mod lower than a competitor,like 11 labs, for instance.
So yeah, I think it's interesting to seeopening, expanding their model suite to
(12:26):
these new domains where they're sort ofless, focused we've seen them kinda move
away from text to image, for instance,Dali hasn't had an update in forever.
Yeah.
And so I guess this makes a lot ofsense that they have very competitive
things to offer given their investmentin the advanced voice mode in chat GPT.
(12:46):
it's sort of reminiscent of the,the problem that meta faces, right?
Where they're, you know, they,they reach like, whatever, 3
billion people around the world.
At a certain point, when yourmarket penetration is so deep, one
of the only things you can do tokeep growing is to grow the market.
And so meta invests, for example,in getting more people on the
internet in, you know, othercountries like in, in countries
(13:07):
that don't have internet accesstypically, or, or have less of it.
And so they're literally just tryingto like, grow the pool of people.
They can, they can tap for this.
In the same way, I think there's alens on this that's similar, right?
So you are only able to interact withchat GBT, through certain modalities
or, or with open AI products,through certain mod modalities.
And by achieving greater ubiquity,by reaching into your life more and,
(13:30):
and making more of the conversationaltooling available to you that that
really does effectively increasetheir, their market, right?
Like, you don't have to be infront of a computer necessarily,
or in the same way, or engaged inthe same way to use the product.
And obviously they've had other, othervoice products before, but it's sort
of part of, if I'm open ai, I'm reallythinking about multimodality, but.
(13:52):
From the standpoint of increasing thenumber of contexts, life context in
which I can reach you, and it, youknow, text to image still requires you
to be in front of a screen, same as,you know, writing text on chat GPT,
whereas audio is just this like, youknow, greater reach for modality wise.
So I think strategically it'san interesting play for them.
(14:12):
Ethically, all kinds of, of issues.
I mean, you know, you think aboutthe, the modality of audio as being
one that is much more intimateto humans and an easier way to
plug into your, inner world.
And that's, I think something, you know,when, when you look at like what Recca
did to people just through text, right?
The suicidal ideation, the actualsuicides, the, the rec subreddit
when people had their, you know, AIboyfriends or girlfriends taken away
(14:33):
from them, you know, that sort of thing.
When you, when you tie an audio, I thinkit's, it's gonna be a, an interesting PR
challenge if nothing else for open ai.
There is one figure, bythe way, in the article.
At least the, the,we're linking to here.
And it's just a, a piece of researchlooking at the, the word error
rate comparisons across leadingmodels for different languages as,
as part of this kind of tooling.
(14:55):
I just, I, I find it reallyinteresting, like Arabic and Hindi
there, there's a lot of struggle there.
There, those are some of the,the worst performing languages.
Obviously English, one ofthe better performing ones.
I'd love to see an overlay of thisrelative to the amount of data
that was used to train the modelso that you can see in relative
terms, like which languages arein a sense, like harder for AI to
print out to kind of, to speak.
(15:18):
I think there, there'ssomething anyway linguistically
just fascinating about that.
If nothing else.
So anyway, overall interesting launch.
And I think we're gonna seemore and more of this, right?
It's gonna be more expected to havevery high quality audio models and
linking them specifically to agents,sort of Star Trek computer style.
Yeah, I, I guess one thing worth notingon the kind of ethics side is I don't
(15:40):
believe they're offering voice cloningtechnology, which is where you can
really get into trouble very easily.
So I think open AI is being alittle careful these days in
general to not court controversy.
Part of why it took them foreverto release SOA potentially.
And in this API, this demo, theyare releasing something like a dozen
(16:04):
voices you can use with names likeAlloy Ash, echo, fable, Onyx, Nova
kind of, I don't know, human, I guess.
They're not even trying tomake 'em human sounding.
And you can also assign them a vibe inthis demo, like cowboy auctioneer, all
timey, serene with a lot of this kindof steering what you can do as well.
(16:26):
So yeah, I think it's,it's pretty exciting.
And as ever with a release of newAPIs, this really enables downstream of
OpenAI and Visa companies for others tobuild exciting new applications of ai.
And onto a few more quick stories.
Next up, also OpenAI.
(16:46):
They have released oh onePro into their developer API.
So it's actually limited to developerswho spent at least $5 on this, and it
costs 150 per million tokens for inputand $600 per million tokens generated.
So that's very, very high.
(17:08):
Prices, obviously, that's aswe've said, GT 4.5 was $75
for 1 million output tokens.
And that's yeah, two, two ordersof magnitude easily above what
you would typically charge.
yeah, I'm trying to think if it'stwo or three orders of magnitude.
It might be approaching threeorders of magnitude, actually.
So, yeah.
(17:30):
Interesting strategy here from OpenAI.
We haven't seen any other companies.
Release these very expensive productsyet and open, I increasingly doing
that with Che Pro, their $200 permonth subscription with QB 4.5.
With this, it makes me wonder ifthis is an attempt to become more
(17:50):
profitable or if this is them sortof testing waters where it could
be various readings, I suppose.
Yeah.
it's also, I mean, it's interestingto note this is not an order of
magnitude larger than what GPTthree's original pricing was.
I was just looking it up on, inthe background here to check.
'cause I, I seem to remember it, itbeing, you know, on a, back then it
(18:13):
was priced per, per a thousand tokens.
With reasoning models, you tend to seemore per million tokens just because
of the number of tokens generated.
But sort of reminds me, you know,in the military or the history of
the military, there's often this.
This restriction where it's like peoplecan only carry, I forget what it is
60 pounds or something, of equipment.
And so over time you tend to seelike the amount of equipment that a
(18:36):
soldier carries doesn't tend to changeor the weight of it, but of course
the, the kind of equipment they carryjust changes to reflect technology.
This sort of seems similar, right?
There's like almost a pato frontierof pricing, at least for the, the,
the people who are willing to reachfor the most intelligent products.
you know, you're constantlyreaching for it though.
This is a push forward even relative tothe g PT three frontier back in the day.
(18:57):
So kind of interesting there's allkinds of, feedback people have been
getting, there's complaints about,oh, this model struggled with Sudoku
puzzles apparently, and opticalillusions and things like that.
People say, you know, I, at a certainpoint, anything you launch at a high
price point especially if you'reopening eye, people will complain
that it's not like super intelligence.
And so Yeah, there's also an interestingparallel here where O one Pro, just
(19:22):
in terms of benchmarks, and I thinkin general in terms of the vibe of
what people think is, is that it'snot sort of significantly better
than O one and that parallels GT 4.5.
You know, it's better, butit's not sort of a huge leap.
So there is an interesting kindof I dunno, demonstration of.
(19:43):
Probably it's harder to get, you know,huge leaps performance and people
are gonna be more critical now of ifyou are not offering something that's
like, you know, really leap betweenGP four 3.5 and four, for instance.
Yeah, I mean, I think it's, it'squite use case specific too, right?
So as we've seen, you know, the, thekinds of issues people are running
(20:06):
into optical illusions you know,Sudoku puzzles, this sort of thing
are pretty far from the standard,you know, the actual workloads
that open AI is targeting, right?
Their focus is can we buildsomething that helps us automate
AI research as quickly as possible?
those sorts of benchmarks.
Yeah, we are seeing needle moving there.
there's also some interestingstuff that we'll talk about from
(20:27):
meter suggesting that in fact,that is what's happening here.
That on those particular kindsof tasks we're seeing pretty
significant acceleration with scale.
But, but you're right, it, right?
It's this funny, uneven surface,just like how humans are,
are funny and uneven, right?
Like, you have a, a really talentedartist who can't write a line of
code to save their lives, right?
And, and vice versa.
So another instance of, theparadox of what's hard for AI
(20:50):
isn't necessarily hard for humans.
And moving away from OpenAI to Google.
We now have another feature anotherinstance of Canvas this time on Gemini.
And they're also adding audio overview.
So I don't know why they do this.
Why Visa lms, just copyeach other's names.
(21:11):
We had deep research showingup in, in multiple variants.
Now we have a canvas,which is also on Chad GPT.
And I think on Tropicit's called Artifacts.
Basically the same idea where nowas you're working on something
like code for instance or like,you know, a web app for instance,
you can have a side panel showingthis living document rendering
(21:36):
of it with a chatbot to the left.
So you can essentiallyinteractively work and see a
preview of what you're getting.
And you also have audiooverviews, which is.
Pretty much something like Notebook lmyou can upload documents and get this
podcast style conversation going on.
(21:58):
So nothing sort of conceptuallynew going on here, but, I think
an interesting convergence acrossthe board of all these tools.
Everyone has canvas,everyone has deep research.
Everyone seems to have kind of the sameapproach to implementing LLM interfaces.
(22:18):
speaking of that, in fact the next storyis about philanthropic and them adding
web search capabilities to Claude.
So that is now in preview forpaid users in the US and that
will basically work the same as itdoes in Chad, BT and other models.
You can enable it to work with cloud 3.7and then it'll be able to provide direct
(22:43):
citations from web sourced information.
So yeah, it's not much else to say.
We are getting web search for cloud,which will enable it to be more useful.
it's interesting 'cause it's likethe tee up to this is philanthropic
being a little bit more shy thanother companies to roll up the,
the web search product into their.
(23:04):
Into their agents and, and I mean,this is consistent with the threat
models that they take seriously, right?
Things like loss of control, right?
Which typically involve, you know, anAI model going out onto the internet,
maybe replicating its weight somehow.
And, and internet access is kind ofcentral to a lot of these things.
I don't know if that was part of this,it at least is consistent with it.
So the result is that theremay be a little bit later
(23:25):
at the party than others.
Apparently according to theseinitial tests, you don't always
see web search used for currentevents related questions.
but when that happens, youdo get these nice inline
citations pulled from sources.
It does look at social media and,and then of course news sources like
NPR, like Reuters, they, they citein the, in the examples they show.
(23:47):
So, you know, pretty, prettystandard product and the inline
citation approach that you seewith deep research, for example.
Certainly making an appearance here.
And last up again along the lines ofthese stories, we have XAI, launching a
new API, this one for generating images.
So we have a new model called Rock twoImage 1212, and you can now query it.
(24:12):
For now, it's quite limited.
You can only generate 10 imagesper request and you are limited
to five requests per second.
The cost there is seven cents perimage, which is slightly above what for
instance, black Forest Labs charges.
They are the developers ofFlux and competitive with
(24:33):
another offering from idea.
So I think, yeah, interestingto see X AI expanding there.
APIs once again, they released theirown image generation back in December.
And it's kind of looked competitivewith something like, Google's latest
generation of where we focus has reallyshifted towards careful instruction
(24:56):
following in your image generation.
So yeah.
XAI is is as ever trying tocatch up or moving quite rapidly
to expand their offerings.
Yeah, they really are.
And you, I think when we first coveredBlack Forest Labs partnership with XAI,
one of the first things that we said waslike, Hey, this is, you know, because
I think they raised a big round righton, on the back of the incredible.
(25:19):
Distribution that they were going toget through XI and, and the kind of vote
of confidence that reflected from Elon.
But at the time we were talkingabout, Hey, you know, this is a
pretty strategically dicey positionfor Black Forest Labs because the one
thing we've consistently seen from,from all the AI companies is I. Once
they, you know, start getting youin for chat, eventually they start
(25:40):
rolling out multimodal features.
And it's not clear that thosearen't best built in-house
for any number of reasons.
Not just including the fact thatyou wanna kind of internalize
all the revenues you can fromthe whole from the whole stack.
But also once you have a good reasoningmodel or rather a a, a good foundation
model that foundation model can bemined for multimodality post hoc, and
(26:01):
you just kind of get to amortize yourinvestment across more modalities.
And so it's just this natural move tokind of keep crawling into or creeping
into adjacent markets like imagegeneration, video generation, which is
also something that XAI is looking at.
So, yeah, I mean, kind of interesting.
For Black Forest Labs, this probablyis gonna be a big challenge for them.
I don't know how extensive theirpartnership continues to be at this
(26:24):
point, but it's a, it's a dicey timeto be, to be one of these companies.
And onto applications and business.
We begin with someannouncements from nvidia.
There's a preview of theirplans in 2026 and 2027.
They have the Reuben familyof GPUs coming in 2026 and
(26:45):
then Reuben Ultra in 2027.
So that will, also come along witha new I guess server layout with
ability to combine 576 GPUs per rack.
(27:06):
Which, you know, I guess it's, it'svery much following in a tracks of
very, very like crazy enhancement tocomputing that Nvidia has been able to
continue creating with you know, B 200,I believe it's now, and, and now this is
their plans for the next couple years.
I. Yeah, there's, there's a, alot going on with this update.
(27:30):
It's, it's actually pretty,pretty interesting and quite
significant, especially on the, thedata center side in terms of the
infrastructure that'll be requiredto accommodate these new chips.
couple things here, right?
So there is this configuration of theBlackwell called the NVL 72 is the
sort of name of this configuration.
This is where you have so, okay,imagine a tray that you're gonna
(27:52):
slot into rack, a server rack, right?
So on that tray, you'regonna have four GPUs.
Alright, so each tray contains fourGPUs, and in total, in that whole
rack, you're gonna have 72 I'm sorry,you're actually gonna, you're gonna
have 144 GPUs total, but because twoof those GPUs show up on the same
(28:14):
motherboard, God, so each frigging,each frigging tray that you slot into
the rack has two motherboards on it.
Each of those motherboardshas two GPUs, two B, 200 GPUs.
So in total you're puttingin four GPUs per tray.
But they're kind of divided intotwo motherboards each with two GPUs.
Anyway, this led to the, the thingbeing called the NVL 72, when in
(28:37):
reality there's 144 GPUs on there.
At least Jensen Huangsays it would've been more
appropriate to call it the NVL.
1 44 L. Okay.
What's actually interestingin this setup, they're calling
the Reuben NVL 1 44 Rack.
There's not more GPUs there.
It's not that there's twice asmany GPUs as the nv L 72 with the
Blackwells, it's just that they'recounting them differently now.
(28:59):
So they're saying, actually,we're gonna count all the GPUs.
So if I think back in the day wedid talk about the NVL 72 setup.
This is basically justthe same number of GPUs.
Nothing has changed eventhough the number has changed.
If that didn't make any sense,just delete it from your mind.
Let's focus on the thingsthat are actually interesting.
The story is, it's, it's comparableand the number of GPUs to the
(29:20):
current set of top line GPUs.
So they're kind of pitchingit as you can slot it into
your existing infrastructuremore or less, and just to.
Jump into numbers a little bit,you're getting roughly three times.
The inference and trainingperformance in terms of just
raw compute memory is faster.
(29:42):
by close to two-ishor, multiplier of two.
Kind of like, yeah, you're seeingmultipliers on top of the current one.
So quite significant change inperformance if you do upgrade.
So, so when it comes to Ruben,right, which is the, the sort
of next generation coming onlineat FP four you're seeing, yeah,
three x more flops, right?
(30:03):
Three times more more logic capacity.
Now the on the memory side, thingsactually do get somewhat interesting.
The memory capacity is going tobe 288 gigabytes per GPU, right?
That is the same as the B 300.
So no actual change in terms of the.
Like per GPU memory capacity.
(30:25):
We will get back to why that mattersa bit less in a, in a second.
But, but that's kindof part of the idea.
The memory bandwidth is improving.
it's almost doubling or maybe,yeah, it's short of doubling.
So the memory bandwidth isreally, really key, especially
when you look at inference.
So that's one of the reasons whythis is really being focused on.
But there's also a, a bunch of thingslike the, so the cables that connect
(30:48):
GPUs together on roughly speaking on onerack, if you wanna imagine it that way.
Those are called NV link cables.
Super, super high bandwidth.
Those are doubling in throughput.
so that's, you know, really big advance.
There's also stuff happening onthe, the networking side, but
we don't need to touch that.
Bottom line is.
Env link cables used to be the way youconnected GPUs across different trays in
(31:14):
the same rack and maybe, maybe adjacentracks depending on the configuration.
But it's very local, very, very tight,very high bandwidth communication.
What's happening here is each ofthese motherboards that you know, that
you're slotting into your, your rack.
They have A CPU and two GPUs,and we talked about this
in the hardware episode.
You know, as to why that is.
(31:34):
The CPU is like the orchestra conductor.
The GPUs are like the instrumentsthat you know, that are actually doing
the hard work and the heavy lifting.
Typically the CPU would be connected tothe GPUs through A-A-P-C-I-E connection.
So this is a relatively, relativelylow bandwidth compared to NVLink.
Now they're moving over toNVLink as well for the CPU two
(31:58):
GPU connection, that's actuallya really re really big deal.
It comes with a core to core interface.
So right now.
The GPUs and CPUs are going toshare a common memory space.
So essentially directly accessingeach other's memory, whatever's
in memory on the CPU, the GPU canaccess right away and vice versa.
(32:18):
That's a really, really big change.
It used to not be, the case used tohave independent CPU and GPU memory.
The GPUs themselves would sharea common memory space if they
were connected via NV link.
And in fact, that's kind of,that's part of the idea here.
That's what makes them acoherent wad of compute.
And it's also part of the reasonwhy the memory capacity on those
(32:39):
GPUs matters a bit, a bit less.
'cause you, you're, you'readding, you're kind of combining
all your GPUs together and theyhave a shared memory space.
So if you can just add to the number ofGPUs you, you have, you're effectively
adding to your memory capacity.
So that's kind of a, animportant difference there.
So anyway last thing I'll mention.
They say that apparentlyReen Ultra is gonna come out.
(33:00):
This is, so there's gonnabe Reen and then Reen Ultra.
Reuben Ultra is coming outthe second half of 2027.
It'll come with a Reen, GPU and a VeraCPU, like Nvidia tends to do, right?
They name the first namethe CPU, so it's Vera Reen.
And, and so Vera is theCPU Ruben is the GPU.
Apparently the full rack isgonna be replaced by this 576
(33:23):
GPU setup, A massive number.
That is essentially so they don'tspecify the power consumption,
but it's clear from other kind ofindustry products that are coming out.
We're tracking for one megawatt perrack, and I, I just worth emphasizing
that's a thousand kilowatts.
That is a thousand homes worthof power going to a single rack.
(33:46):
In a server in, in a data center.
That's insane.
Right?
So the, the power density required forthis is going through the roof, the
cooling requirements, all this stuff.
It, it's, it's all really cool.
And, and anyway, this isa very, very big motion.
just to dive a little bit intothe numbers, just to fun, right?
So the compute numbers are interms of flops, which is floating
(34:07):
point operations per second.
Basically multiplication persecond or additions per second.
And the numbers we get with theseannounced upcoming things like rein
are now for inference 3.6 exo flops.
Of inference, SOA is, is quintillion.
(34:30):
It's 10 to the 18 quintillionis the one after quintillion.
So I can't even imagine how many zero.
I mean, I guess I know how many zerosthere is, but it's, it's very hard
to imagine a number of that long.
And that's just where we are at.
Also worth mentioning so thisis the plans for 20 26, 20 27.
(34:50):
They also did announce for later thisyear the coming of B 300, which is
also, you know, OB is an improvementin performance of about 1.5.
They also did announce the.
Ultra variants of Blackwell,both 200 and and 300.
(35:11):
And the emphasis we are starting to add,I think is more on the inference side.
They definitely are sayingthat these are good models
for the age of reasoning.
So we're capable of outputtingthings fast in addition to training.
Well, and that's very important forreasoning of course, because the
(35:32):
whole idea is you're using up moretokens to get better performance.
So they're giving some numbers.
Like for instance, Blackwell Ultrawill be able to deliver up to 1000
tokens per second on deep Seeq, R one.
And that's you know, comparable.
Usually you would be seeing somethinglike 100, 200 tokens per second.
(35:53):
A a thousand tokens persecond is very fast.
And then the, the inference focus isreflected too in the fact that they're,
they're looking at, you know, fp fourflops denominated performance, right?
So when you go to inference, oftenyou're, you're inferencing quantized
models, inferencing at FP four.
So lower resolution and also this,the, the memory bandwidth side
(36:14):
becomes really important for inferencedisproportionately relative to training,
at least on the current paradigm.
So that's kind of, you know, part ofthe reason that you're seeing those
big big lifts at that, that end ofthings is because of the inference.
And next story is also about someabsurd sounding numbers with hardware.
This one is from Apple.
They have launched a Mac studiooffering, and the top line configuration
(36:40):
where you can use the M free Ultra.
Chip with 32 CPUs 80 core GPU thatcan even run the Deep Seq R one model.
That's the 671 billionparameter AI model.
(37:00):
Fewer at inference.
You're using about 37 billionper output, I believe.
But still this is, you know, hundredsof gigabytes of memory necessary to be
able to run it and just fit it in there.
Yeah.
Apple's also doing this weirdthing where they're not designing
(37:21):
like GPUs for their data centers,including for AI workloads.
They seem to be basically like doingsouped up CPUs kind of like this with
just like gargantuan amounts of VAM.
that again have a, this verylarge kinda shared pool of memory.
Right.
We talked about like coherentmemory on the Blackwell side right.
And on the Ruben side, just the ideathat if you have a shared memory space,
(37:44):
you can pool these things together.
Well, they're not as good at theshared memory space between CPUs.
What they do is they havedisgusting amounts of RAM one GPU.
Right.
So like 512 gigs is.
It is just wild.
Like it's, anyway, for,for a, for a CPU at least.
It, it's, and, and we are talkinghere about when you say memory, we
(38:05):
mean really something like ram, right?
Yeah.
And, and so if you have a laptop, right?
If you buy a Mac for instance,typically you're getting eight
gigabytes, maybe 16 gigabytesof ram, the fast type of memory.
Read.
What, what is it?
Read something memory.
Random access.
Memory.
Yeah, random access memory, right?
As opposed to the slower memory oflet's say an SSD or things where you
(38:30):
can easily get terabytes, to get thatcrazy amount of random access memory
is insane when you consider that.
Typically it's like eight 16gigabytes and you know, yeah,
this is expensive memory.
It's stupid exp expense.
It's also like yeah, there's differentkinds of ram and we, we talked
about that in our hardware episode.
(38:51):
This is a, a combinedC-P-U-G-P-U setup, by the way.
So 32 core CPU 80 core GPU but sharedmemory across the board So VAM is
like really close to the logic, right?
So this is like the most, as yousaid, exquisitely expensive kind of
memory you can put on these things.
they're opting to go in this directionfor, very interesting reasons, I guess.
(39:13):
I mean, it does mean that they'redisadvantaged in terms of being
able to scale their data centeryou know, infrastructure, their,
their builds because of networkingat least as far as I can tell.
it's a very interestingstandalone, standalone machine.
I mean, this is pretty wild specs.
Right.
Yeah, exactly.
If, if you go to the top lineofferings and, and this is, you
(39:34):
know, a physical product you canbuy as a, yeah, it's a Mac, right?
Yeah, it's a Mac.
Yeah, it's a Mac.
It's like a big kind of cube ish thing.
And if you go to the topline configuration, it's
something like $10,000.
Don't quote me on that, but it's,you know, crazy expensive as well.
It does come with other options.
(39:55):
For whatever reason, M fourmax CPU and, and GPU is less
powerful than M three Ultra.
But anyway, very kind of beefy offering.
Now from Apple.
Next we have something abit more forward looking.
Intel is apparently reachingan exciting milestone for
the 18 a 1.8 nanometer class.
(40:19):
Wafers with a firstrun at the Arizona Fab.
So this is apparently ahead of schedule.
They have these Arizonafabs fab 52 and Fab 62.
Fab, as we've covered before, iswhere you try to make your chips
and 1.8 nanometer is the next kindof frontier in terms of scaling
(40:41):
down resolution of the densityof logic you can get on a chip.
So the fact that they're running thesetest wafers, they're ensuring that you
can transfer the fabrication processto these new Arizona facilities.
I guess the big deal there is partiallythat these are located within the
us, within Arizona, and they areseemingly getting some success and,
(41:06):
and are ahead of schedule, as you said.
And that's impressive becausefabs are and absurdly.
Complex engineering project.
Yeah.
In Intel is in just thisincredibly fragile space right now.
As has been widely reported, andwe've talked about that a fair bit.
I mean, they need to blow itoutta the water with, with
18 a and their future notes.
I mean, this is likea make or break stuff.
(41:27):
So forward progress yeah, theyhad their, their test facility
in, in Hillsborough, Oregon, whothat was doing 18 a production
as you said, on a test basis.
And they're now successfullygetting the first test wafers
in their new Arizona FAB out.
So that's great.
But yeah, it'll eventually haveto be, start running actual
chips for commercial products.
the big kind of distinction here isthey're actually manufacturing with 18
(41:49):
a these gate all around transistors.
think we talked about thisin the hardware episode.
We won't go into too much detail.
Basically, this is a specificgeometry of transistor that allows
you to have better control overthe flow of electrons through
your transistor, essentially.
It's a, a big, big challengepeople have had in making
transistor smaller and smaller.
(42:10):
You get all kinds of current leakage.
the current by the way is sort oflike the, the thing that carries
information in your computer.
And so you wanna make sure thatyou don't have current leakage to
kind of have one's become zerosor let's say operation like, you
know, a certain kind of gate turninto a, the wrong kind of gait.
that's the idea here.
So it's, it's the skate all roundtransistor based on a ribbon fit design.
(42:32):
And yeah, so, so we're seeingthat come to Market Gate all
round is, is something TSMCis moving towards as well.
And, you know, it's just gonnabe the, the next, essentially
the next beat of production.
So here we have 18, a kindof early signs of progress.
And now moving away fromhardware more to businessy stuff.
Xai has acquired agenerative AI startup.
(42:53):
They acquired Hotshot, which is focusedon text to video, similar to soa.
They also have AI powered orinitially they worked on ai, powered
for the tools and then pivoted.
So I suppose unsurprising ina way that they are working
on text to video as well.
(43:15):
They just want to have all thecapabilities at Xai, and this
presumably will make that easier to do.
Yeah, and one of the founders had somequote, I think it might've been on
X, I'm not sure, but he said, we'reexcited to continue scaling these
efforts on the largest cluster inthe world, Colossus as part of XAI.
So it seems like they'll be givenaccess to Colossus as part of this.
Maybe not shocking, but kindof an interesting subnote.
(43:37):
They were backed by somereally impressive VCs as well.
So Alexis Hanian as sort of likefamous for like being the co-founder,
Reddit, of course, and, and doinghis own VC stuff, and SV Angel too.
So pretty interesting acquisition and,and a nice soft landing too for folks
in a space that otherwise, you know,I mean, they're either gonna acquire
you or they're gonna eat your lunch.
So I think that's probably the bestoutcome for people working on the,
(43:58):
the different modalities, at least,at least on my view of the market.
Yeah, and, and I guessacquisition makes sense.
The startup has been around for overtwo years and they have already trained
multiple video models hotshot, Excel,and hotshot, and they do produce
quite good looking videos, so makessome sense for opening AI to acquire
(44:19):
them if only for the kind of brainpower and expertise in, in that space.
Man, they're old.
They've been around forlike two years, right?
Like yeah.
That was what, pre SOA or likearound the time SOA came out?
Yeah, yeah, yeah.
It's funny, it's just funny howthe AI business cycle is so short.
Like, like these guys have beenaround for all the 24 months.
(44:40):
Very experts, very, youknow, that's veterans.
And the last story, Tencentis reportedly making massive
Nvidia H 20 chip purchases.
So they are supposedly meant tosupport the integration of deep seek
into WeChat, which kind of remindsme of Meta, where Meta has this
(45:03):
somewhat interesting drive to let youuse llama everywhere and Instagram
and and all their messaging tools.
So this would seem to be in a waysimilar where Tencent would allow
you to use deep seek within WeChat.
Yeah.
part of what's going on here too isthe standard stockpiling that you
see China do and Chinese companies doahead of an anticipated crackdown from
(45:28):
the United States on export controls.
And in this case, the age 20 hasbeen kind of identified as one of
those chips that's likely to beshut down for the Chinese market
in the relatively near term.
So it makes all the sense inthe world that they would be
stockpiling for that purpose.
but it is also the case that you've gotR one that has increased dramatically.
the demand for for access to hardware.
(45:51):
It, it's, it's sort of funny how,how quickly we pivoted from oh no, R
one came out and, and so Nvidia stockyou know, crashes to, oh, actually
R one is great news for Nvidiaanyway, I, I think it's, it's the
turnaround that we sort of expected.
we talked about this earlierand there has been apparently
a short-term supply shortage inChina regarding these H 20 chips.
(46:12):
So like there's so much demand comingin from Tencent that it's sort of
like rate limiting for Nvidia to getH twenties into the market there.
So kind of interesting they'vepreviously placed orders on the
order of, you know, hundreds ofthousands between them and bike dance.
back, I think last year was almost aquarter million of these, these GPUs.
So yeah, pretty big customers.
And onto projects and open source.
(46:34):
We begin with a story from theinformation titled on Ros, not so Secret
Weapon that's giving agents a boost.
I would say kind of a weird spinon this whole story, but anyway,
that's the one we link it to andit covers the notion of MCP model
context protocol, which philanthropicreleased all the way back in November.
(46:57):
We hopefully covered it.
I guess we, we dunno.
I think I actually did.
I was trying to remember.
Yeah, yeah, yeah.
I think we did.
And the reason we're covering itnow is that it sort of blew up over
the last couple weeks if you're inthe AI developer space or you see
people hacking on AI that it hasbeen the talk of a town, so to speak.
(47:18):
So model, context, protocol, broadlyspeaking is something like an API like
a standardized way to build, portsor mechanisms for AI agents or ai,
I guess, models to call on services.
(47:40):
So it standardizes the way youcan provide things like tools.
So there's alreadymany, many integrations.
following the standard for thingslike Slack, perplexity notion,
et cetera, where if you adopt aprotocol and you provide an MCP
compatible kind of opening you canthen have an MCP client, which is
(48:04):
your AI model call upon this service.
And it's, it's very much like anAPI for a website where you can, you
know, have a particular URL to go toparticular kind of parameters, and
you get something back in some format.
Here, the difference is that, of course,this is more specialized for AI models
(48:29):
in particular, so it provides tools,it provides like a prompt to explain
the situation, things like that.
Personally, I'm.
In the camp of people who are a bitconfused and kind of think that this is
an API for an API kind of situation, buteither way, it has gotten very popular.
That that is exactly what it is.
(48:50):
Yeah.
It's an API for an API.
It's also, i, I guess a transition pointor, or could be viewed that way, you
know, in the sense that eventually youwould expect models to just kinda like,
figure it out you know, and, and haveenough context and ability to uncover.
Whatever information is alreadyon the website to be able
(49:11):
to use tools appropriately.
But there are edge cases where youexpect this to be worthwhile still.
This is gonna reduce things likehallucination of, of tools and, and all
kinds of issues that when you talk aboutagents, right, like one failure anywhere
in a reasoning chain or in an executionchain can, can cause you to fumble.
And so, you know, this is structurallya way to address that and, and
(49:31):
quite important in that sense.
It is also distinct from a lot of thetooling that opening eyes come out
with, but that sounds similar, likethe agent's API where they're focused
more on chaining tool uses together,whereas MCP as you said, is more about
helping make sure that each individualinstance of tool use goes well, right?
That the, the agent haswhat it needs to kind of.
(49:51):
Ping the tool properly and interact withit and, and find the right tool rather
than necessarily chaining them together.
So there you go.
you know, MCP is, is a nicekind of clean, open source
play for Anthropic too.
They are going after thatkind of more startup founder
and, and business ecosystem.
so pretty important from a, amarketing standpoint for them too.
(50:12):
Right.
Yeah, exactly.
So back in November they announced us,they introduced us as an open standard
and they also released open sourcerepositories with, some example of
model context product called servers.
And as well they specificationand like a development toolkit.
(50:32):
So I honestly haven't been ableto track exactly how this blew up.
I believe there was some sortof tutorial given at some sort
of convention, like the AIengineer convention or something,
and then in kind of took off.
And everyone is very excitedabout the idea of model
context protocols right now.
Moving on to new models.
(50:52):
We have Mistral dropping a new opensource model that is comparable
to GP four oh mini and is smaller.
So they have Mistral small 3.1, whichis seemingly better than similar models,
but only has 24 billion parameters.
Also can take on more input tokens,128,000 tokens and is fairly
(51:20):
speedy at 150 tokens per second.
And this is being released underVE Apache two license, meaning that
you can use it for whatever you wantbusiness implications, et cetera.
I don't think there's too much tosay here other than like, like a kind
of a nitpick here, but they say itoutperforms comparable models like
(51:40):
Gemini three, GT four oh mini whiledelivering inference speeds, as
you said, of 150 tokens per second.
But like, you can't just say that shit.
Like it doesn't mean anything to say.
Yeah.
It depends on whatinfrastructure you're using.
Yeah.
What's the stack dude like?
Yeah.
Yeah.
You know, like I can move at ahundred, like at, at a hundred
miles an hour if I'm in a Tesla.
(52:02):
that's what makes me, anyway.
they do give that information,but it's like buried like
in the literal like grain.
Yeah.
This is from where blog post where we,we get these numbers, so I guess we.
as with any of these modelannouncements, you go to a company
blog, you get a bunch of numbers onbenchmarks showing that it's the best.
You have comparisons to Gemma freefrom Google to cohere ia, GP four
(52:28):
oh, mini Cloud 3.5 Haiku, and onall of these things like MMLU, human
Eval math it typically is better.
Although, you know, I would say itdoesn't seem to be that much better than
Gemma at least, and in many cases isnot better than 3.5 Haiku and GP four oh
(52:48):
mini, but yeah, and it still quite good.
one 50 tokens per second.
Two for context is a batchsize 16 on four H one hundreds.
They actually, like, even inthe technical post, they write
while delivering inferencespeeds of 150 tokens per second
without further qualifying.
But it's in, like, it's in this likesmall gray text underneath an image
(53:10):
that you have to look for whereyou actually find that context.
So don't expect this to runat 150 tokens per second on
your, your laptop, right?
That's, that's just not gonna happen.
cause you know, four H 100 is, islike, that's quite a lot of horsepower.
Still yeah, it's anincremental improvement.
More, more open source coming from,from mistrial and Apache 2.0 license.
So, you know, highly permissive.
(53:30):
And one more model.
We have Exon Deep Reasoningenhanced language models
coming from LG AI research.
So these are new models newfamily of models, 2.4 billion, 7.8
billion and 32 billion parameters.
These are optimized for reasoningtasks and seemingly are you
(53:54):
know, on par or out outperformingvariations of R one where R one
is the giant one at 671 billion.
There is distilled versions ofthose models at comparable sizes.
And in the short technical reportthat they provide, they are showing
(54:16):
that it seems to be kind of along thelines of what you can get with those
distilled R one models and, similaror better than open ai oh, one mini.
Yeah, it, it's also, it's kind ofinteresting, it, it, it seems to
as described, again, not, not a lotof detail in the, in the paper, so
it makes it hard to reconstruct,but it does seem to be at odds with
(54:38):
some of the things that, learn inthe deep CR one paper, for example.
So there's they start with, itseems an instruction tuned model.
Base model, the XO one, 3.5 instructmodels, and then they add onto that.
A bunch of fine tuning.
they do supervised, fine tuning.
Presumably.
This is for like the reasoningstructure and then DPO, so standards
(55:00):
or like RL stuff and online rl.
So, you know, this is you know, quitea bit of supervised fine tuning, of
trying to teach the model how to solveproblems in the way you want it to
solve them, rather than just like givingit a reinforcement learning signal
and reward signal and, and like kindof have at it like R one zero did.
So yeah, kind of aninteresting alternative.
More, let's say, more inductiveprior laden approach and first
(55:26):
time as well that I think we'vecovered anything from LG ai.
I think so, yeah, Exxon thesemodels appear to have already.
Been existing and being released.
Exxon 3.5.
We're back from December, whichwe somehow missed at the time.
Yeah, a fun fact.
Exxon stands for Expert AI for everyone.
(55:46):
You gotta love when people come up withthese acronyms in a very creative way.
And they are open sourcing it onhugging face with some restrictions.
This is primarily for research usage.
Onto research and advancements.
We begin with a paper sample, scrutinizeand scale effective inference time
search by scaling verification.
(56:06):
It is coming from Google andlet me just double check.
Also you see Berkeley and it isquite exciting I think as a paper.
It basically is making the case orpresenting the idea that to do scaling
inference, time, scaling with theidea of inference, time, scaling
(56:28):
being basically, you know, once youalready train your model, once you stop
updating your weights, can you use yourmodel more to get smarter if you just
kind of do more outputs in some way.
And what he's seen in recent months isinterest time scaling via reasoning.
Where you have a model, you know,output a long chain of tokens where
(56:51):
it does various kinds of strategiesto, do better at a given complicated
task like planning, substeps, likeverification, backtracking, these
things we've already covered.
Well this paper is saying instead ofthat sort of scaling the output in terms
(57:13):
of a chain, an extended chain of tokens,you can instead sample many potential
outputs like just do a bunch of outputsfrom scratch and kind of varied up so
you get ma many possible solutions.
And then if you have a verifier andyou can kind of compare and, and
(57:36):
combine these different outputs.
You can be as effective or even insome cases more effective than the kind
of traditional reasoning, traditionalinference, time scaling paradigm.
So again, yeah, quite interestingin the paper they have a table one
where they're giving this exampleof if you sample a bunch of outcomes
(57:59):
and you have a verifier that is good,you can actually outperform oh one
preview and many other techniques.
You can get better, numbers on thehard reasoning benchmarks like Amy,
where you can solve eight out of the15 problems now, which is insane.
Yeah.
But that's where we are.
And on math and live benchmath and live bench reasoning.
(58:22):
So yeah, very interesting idea tosample a bunch of solutions and
then just compare them and kind ofcombine them into one final output.
I think this is one of thosekey papers that again, I mean we
see this happen over and over.
Scaling is often a bit morecomplex than people assume, right?
(58:44):
So at first, famously we hadpre-training, scaling, scaling,
pre-training, compute thatbrought us from, you know, GT
two, GP three to g PT four.
Now we're in the inference timecompute paradigm where, you
know, this analogy that I liketo use, like you have 30 hours to
dedicate to doing well on a test.
You get to choose how much of thattime you dedicate to studying,
(59:06):
how much you dedicate to actuallyspending writing the test.
And what we've been learning is,you know, scaling, pre-training,
compute is basically all study time.
And so if you just do that and youeffectively give yourself like one
second to write the test, well there'sonly so well you can do, right?
You eventually, youstart, start to saturate.
If you just keep growingpre-training and you don't grow
inference, time compute, you don'tinvest more time at test time.
(59:28):
So essentially you havetwo dimensional scaling.
If you wanna get the sort ofindefinite returns, if you want
the, the curves to just keep goingup and not saturate, you have to
scale two things at the same time.
This is a case like that, right?
This is a case of a scaling lawthat would be hidden to a naive
observer just looking at this asa kind of a one variable problem.
(59:50):
When in reality it's amultivariate problem.
And suddenly when you account forthat, you go, oh, wow, there's a
pretty robust scaling trend here.
And so what are these two variables?
Well, the first is scaling thenumber of sampled responses.
So just the number of shots ongoal, the number of attempts
that your model's going to makeat solving a given problem.
But you have to improve verificationcapabilities at the same time, right?
(01:00:12):
So they're asking the question,what test time scaling trends
come up as you scale both thenumber of sampled responses and
your verification capabilities?
One of the things crucially, that theyfind though, is that you might naively
think if you have a, a verifier, right?
And, and I wanna emphasizethis is not a verifier that has
access to ground truth, right?
This is like, and I think, Ithink they use like Claude 3.7
(01:00:34):
for this 3.7 sonnet for this.
But basically this is amodel that's gonna look at.
Say the 20 different possible samples,in other words that you get from
your model to try to solve a problem.
And this, verifier model is going tocontrast them and just determine based
on what it knows, not based on accessto actual ground truth or any kind
of symbolic system like a calculator.
(01:00:56):
It's just going to use its ownknowledge in the moment to determine
which of those 20 differentpossible answers is the right one.
And what you find is as you scalethe number of possible answers that
you have your verifier look at,you might naively think, well, with
just so much Gump in the system.
Probably the verifiers performance isgonna start to drop over time, right?
(01:01:20):
It's just like, it's so much harderfor it to kind of you know, remember
which of these was that good and, andwhich was bad and, and all that stuff.
And so eventually you would expectthe, the kind of trend to, to be
that, you know, your performancewould saturate, maybe even drop.
But what they find is the opposite.
The performance actuallykeeps improving and improving.
And the reason for that seems to bethat as you increase the number of
(01:01:43):
samples, the number of attempts tosolve the problem, the probability
that you get a truly exquisite answerthat is so much better than the others,
like that contrasts so much with themedian other answer that it's really
easy to pick out increases, and thatactually makes the verifiers job easier.
And so they refer to this asan instance of what they call.
(01:02:03):
Implicit, I think it was implicitscaling was the term Yeah.
Implicit scaling.
Right.
So essentially the yeah, this ideathat, you're more likely to get one
like exquisite outlier that favorablycontrast with the crappy median samples.
And so in this sense, I mean,I, I feel like the, the term
verifier is maybe not the best one.
Really what they havehere is a contracter.
(01:02:25):
When I hear verifier, I tendto think of ground truth.
I tend to think of something that isactually kind of like, you know, giving
us, you know, checking, let's say codeand seeing if it compiles properly.
Yeah.
It's, it's mark into stufflike a selector or compiler
you could say, where it takesa bunch of possible outputs.
And then from all of that picks outthe best kind of guess to answer.
(01:02:49):
Exactly.
Yeah.
And this is why, I don't know if it's,it's not a term that, that people
use, but if it was like contrast or isreally what you're doing here, right?
You're like, you're sort of doing ina way, a kind of contrast of learning.
Well, not learning necessarily,it's all happening at inference
time, but yeah, so the, theperformance is really impressive.
there are implications of this for the,the design of these systems as well.
(01:03:10):
So they're, you know, trying to findways to wrangle problems into a shape
where you can take advantage of implicitscaling, where you can have your
model pump out a bunch of responses inthe hopes that you're gonna get, you
know, one crazy outlier that makes iteasier for the verifier to do its job.
So yeah, again, you know, I, I think.
Really interesting case of, ofmultidimensional scaling laws
(01:03:32):
essentially that are otherwise, youknow, like easily missed if you don't
invest in, in both verified performanceand sampling at the same time.
Exactly.
And, and this is, I think, importantcontext to provide the idea of sampling
many answers and just kind of pickingout the answer that occurred the
most times in all these samples is.
(01:03:53):
A well known idea.
There's also, I meana selfs consistency.
Selfs consistency, exactly like amajority vote essentially is, is
one well established technique toget better, better performance.
And there's, you know, even in morecomplex things you could do also
with like mixture of, I forget theterm that exists, but the general
(01:04:14):
idea of paralleled generationof outputs and, and generating
multiple outputs, potentiallymultiple models is well known.
And, and real insight, as you saidhere, is that you need a strong verifier
to be able to really leverage it.
So in the stable one, they show that,if you compare, like you do get better
performance quite a bit, if you justdo consistency, if you just sample 200
(01:04:38):
responses and pick out the majoritycompared to the thing where you don't
do any scaling, you are getting fourproblems out of 15 on Amy as opposed
to one, you're getting, you know,a significant jump in performance.
But after 200, if you go to 1000,you basically stop getting better.
(01:04:58):
But if you have a strong verifier,basically a strong selector among
the things, instead of majorityvoting, you have some sort of
intelligent way to combine theanswers, you get a, a huge jump, a huge
difference between just consistencyat 200 and verification at 200.
And one reason this is very importantis first way also highlight the fact
that verification is a bit understudied,think in the space of LLMs, LLMs are
(01:05:24):
not usually out the box very good atit, and they even introduce a benchmark
specifically for a verification.
The other reason this is verynotable is that if you're doing
sampling based techniques,you can paralyze the sampling.
And that is different fromextending your reasoning or your
(01:05:44):
search because, reasoning viamore tokens is sequential, right?
You can't paralyze that.
It's gonna take more time.
Versus if you're scaling via sampling,you can paralyze all the samples and
then just combine them, which meansthat you can get very strong reasoning
at comparable timescales to if youjust take one output, for instance.
(01:06:10):
So that's a very big deal.
Yeah.
another, key reason why this works too.
We covered a paper, I think it wasmonths ago that was pointing out
that if you look at all the alignmenttechniques that people use to get more
value out of their, their models, right?
They, they kind of assume this,one query one output picture, like
let's align within that context.
Whereas in reality, what you'reoften doing, especially with agents
(01:06:33):
and infant inference time compute,is you actually are sampling like a
large number of outputs and you don'tcare about how shitty the average.
Generated sample is, generated solutionis what you care about is, is there
one exquisite one in this batch?
And what they did in that alignmentpaper is they found a way to
kind of like up weight, justthe most successful outcomes.
(01:06:57):
and use that for a reward signal orsome kind of gradient update signal.
But this is sort of philosophicallyaligned with that, right?
It's, it's saying you know, likeyou said, self consistency is the
view that says, well, let's justdo wisdom in numbers and say we
generated, you know, I don't know,like a hundred of these outputs.
And the most common response was,most consistent answer was this one.
(01:07:19):
So let's, call this the one thatwe're, we're gonna cite as our output.
But of course, if your model has aconsistent failure mode, and that can
also cause you to true self consistency.
Identify that, you know, kindof settle on that failure mode.
Whereas what you really careabout is what is the best
answer in this, in this big pot.
And that's really what this is.
This is after.
So a lot of interesting ties toother lines of research as you
(01:07:41):
said, and, and I think a, a reallyinteresting and important paper.
And next up we have the paperblock diffusion interpolating
between auto aggressive anddiffusion language models.
And also I think quitean interesting one.
So typically when you are using anLLM, you're using auto aggression,
which is just saying that you'recomputing one talk at a time, right?
(01:08:03):
You, you start with one word,then you select the next word,
then you select the next word.
You have this iterative process,and that's a limitation of
traditional LMS because.
You need to sequentiallydo this one step at a time.
You can't generate an entireoutput sequence all at once.
As opposed to a diffusion whichis a generation mechanism a way
(01:08:28):
to degeneration that is kind ofgiving you an entire answer at once.
And diffusion is the thing that'stypically used for image generation,
where you start with just a, anoisy image, a bunch of noise.
You iteratively update theentire image all at once until
you get to a good solution.
And we, we covered, I believe maybea week ago or two weeks ago, a story
(01:08:50):
of a diffusion based LLM that was.
Seemingly performing pretty well.
There was a company that made the claim,although we didn't provide too much
research on it, well, this paper istalking about, well, how can we combine
the strengths of both approaches?
The weakness of diffusion is typicallyjust doesn't work as well for LLMs.
(01:09:13):
And there's kind of various hypothesesis an interesting question of why it
doesn't work, but it doesn't work.
It also doesn't workfor arbitrary length.
You can only generate a specifickind of horizon and some
other technical limitations.
So the basic proposal in thepaper is, well, you can have this
idea of block diffusion where youstill sample other aggressively
(01:09:38):
like, sample one step at a time.
But instead of sampling just oneword or one token, you use diffusion
to generate a Chunk of stuff.
So you generate several tokens allat once of diffusion in parallel, and
then you add aggressively, keep doingthat, and you get the, you know, the
(01:09:59):
best of both worlds, so to speak.
So an interesting idea kind ofarchitecture I haven't seen before
and potentially could lead tostronger or, or faster models.
Yeah, it, it's alsomore paralyzable, right?
So the big advantage becausethese, within these blocks,
you're able to just like de-noisein, you know, one shot with.
(01:10:21):
All the text in there.
it means you can paralyze more.
They, they try, I think with blocksof various sizes Inc. Like size of
four tokens, for example, right?
So like, let's de-noisefour tokens at a time.
They do it with this interestingkind of like, well, it's essentially
they put masks on and, and kindof gradually remove the masks
on the tokens as they de-noise.
That's their interpretationof how Denoising would work.
(01:10:43):
it is interesting, the performance is.
Lower obviously, than state-of-the-artfor auto aggressive models.
Think of this more as a proof ofprinciple that there are favorable
favorable scaling characteristics.
There's some promise here.
In my mind, this fits into the kind ofsome of the mamba stuff where, you know,
the next logical question is, okay,like how much scale will this work at?
(01:11:06):
And then do we see those losscurves eventually with some updates,
some, some hardware jiggery,pokery converging and then crossing
the loss curves that we see fortraditional auto aggressive modeling.
I mean, it's, it isinteresting either way.
They found a great way tokind of break up this problem.
And one reason speculatively thatdiffusion does not work as well with
(01:11:28):
text is partly that like you tendto think as you write, and so your
previous words, you know, reallyWill affect in a, in a causal way
where you're going and, and tryingto do diffusion at the, you know, in
parallel across a body of text likethat from an inductive prior standpoint
doesn't quite match that intuition.
But, could very well be wrong,sort of like top of mind thought.
(01:11:49):
But anyway, it's, it's a good paper.
the question is always will it scale?
Right?
And there are lots of goodproofs of principle out there
for all kinds of things.
Whether they end up gettingreflected in, in scaled training
runs is the big question.
Exactly.
And this does require quite differenttraining and you know, models from
what all these other LMS are so decent,chances won't have a huge impact
(01:12:11):
just because you have to change up.
And all these trainedmodels already are LLMs.
In the auto aggressive sense.
Diffusion is, is a whole new kind of.
Piece of a puzzle that isn't typicallyworked on, but nevertheless interesting,
as you said, similar to Mamba.
Next, we have communication efficientlanguage model training scales reliably
(01:12:32):
and robustly scaling laws for the local.
distributed low computation training.
And I think I'll just let youtake this one, Jeremy, since I'm
sure you dove deep into this.
Oh yeah.
I mean, so I, I just think de Loco,if you're interested in anything
from like national security to thefuture of data center design is
(01:12:54):
and US China competition generallylike, is just so important and,
and this general thread, right?
Like together AI distributedtraining, all that stuff.
So as a reminder, when you're thinkingabout De Loco, like how does it work?
So this basically is the answerto a problem, which is that
traditional training runs inlike data parallel training, like
(01:13:16):
in data centers today happens.
In this way where you have abunch of number crunching and the,
bunch of communication, like aburst of communication that has
to happen at every time step.
You like share gradients, you updatelike across all your GPUs, you update
your model weights, and then you go ontothe next the next mini batch, right?
Or the, the next, part of your data set.
And you repeat, right?
(01:13:36):
Run your computations, calculatethe gradients, update the
model weights, and so on.
And there's this bottleneck,communication bottleneck that
comes up when you do this at scalewhere you're like just waiting
for the communication to happen.
And so the question then is gonnabe, okay, well what if we can
set up sort of smaller pockets?
'cause you're, you're basicallywaiting for the stragglers, right?
(01:13:57):
The, the slowest GPUs are gonnadictate when you finally sync up
everything and you can move onto the next stage of training.
So what if we could set up a situationwhere we have a. A really small pocket
of like a mini data center in one cornerworking on its own independent copy
of the model that's being trained, andthen another mini data center doing
the same and another and another.
And very rarely we have anouter loop where we do a general
(01:14:20):
upgrade of the whole thing sothat we're not constrained by the,
the slowest kinda lowest commondenominator straggler in that group.
So this is gonna be the, the kindof philosophy behind De Loco.
You have an outer loop thatessentially is going to.
In a more, think of it as like this,like wise and slow, slow loop that
(01:14:43):
updates based on what it's learnedfrom all the local data centers that
are running their own training runs.
And then within each local data center,you have this much more radical,
aggressive loop, more akin to whatwe see in traditional data parallel
training in, in data centers, which,you know, they're running anyway.
We, we have a wholeepisode on, on De Loco.
Check it out.
I think we talk about the AtomOptimizer or Atom W Optimizer that,
(01:15:03):
that runs at the, the local leveland then the sort of like more
gradient descent, nestra of momentumoptimizer on, on the outer loop.
The, details there don'tmatter too, too much.
This is a scaling law paper.
Basically what they're, what they'retrying to figure out is How can we
study the scaling laws that predictthe performance, let's say, of these
models based on how many model copies,how many mini data centers we have
(01:15:26):
running at the same time, and the sizeof the models that we're training.
And they test at meaningful scales.
They go all the way upto 10 billion parameters.
And, essentially they were able throughtheir, their scheme, through their hyper
parameter optimization to reduce thetotal communication required by a factor
of over a hundred in their scheme.
it's a great, great paper.
(01:15:47):
I think one of the wilder things aboutit is that they find that even when
you just have a single replica or a,let's say a single mini data center
you still get a performance lift.
This is pretty wild, likerelative to, to the current like
purely data parallel scheme.
So it benefits you to have this kindof radical quick updating interloop.
(01:16:10):
And then to add on top of thatthe slow wise outer loop even if
you only have a single data centerthat's like, you could, you could
just do just the radical innerloop and that would be enough.
But by adding this like more strategiclevel of optimization that comes from
the nest draw of momentum that thesort of slower gradient updates for the
(01:16:31):
outer loop, you get better performance.
That's highly counterintuitive,at least to me.
And it does suggest thatthere's a, a kind of stabilizing
influence that you're gettingfrom just that, new outer loop.
The last thing I'll mention isfrom a, a sort of national security
standpoint, one important questionhere is how fine grained can this get?
Can de Loco continue to scalesuccessfully if we have not
(01:16:55):
one, not three, not not eight,but like a thousand of these
mini data centers, right?
If that happens, then we live in aworld where essentially we're doing
something more like BitTorrent, right?
For, for trainingmodels at massive scale.
And we live in a world whereit becomes a lot harder to
oversee training runs, right?
(01:17:16):
If we do decide that the trainingmodels at scale introduces WMD
level capabilities through cyberrisk, through bio risk, whatever it
actually gets to the point where.
If there's a mini data center onevery laptop and every GPU if De
Loco, that, that is the promiseof De Loco in the long run.
Then yeah, what is the meaningfultool set that policy has, that
(01:17:37):
government has to make sure thatthat things don't get misused.
That you don't get the proliferation ofWMD level capabilities in these systems.
So I think it's a really,really interesting question.
And this paper is astep in that direction.
they don't push it, I thinkbeyond like eight essentially
of these mini, data centers.
But I think we're gonna seea lot more experiments in
that direction in the future.
(01:17:58):
Right.
And this is following up ontheir initial Paper De Loco
back in September of 2024.
I think we mentioned at the time thatit's quite interesting to see Google,
Google DeepMind publishing this work.
Yeah.
Because it does seem like I.Something you might keep secret,
it's actually quite impactful forhow you build your data centers.
(01:18:22):
Right?
And once again you know, theydo very expensive experiments to
train, you know, billion parametermodels to verify that compared
to the usual way of doing things.
This is, comparable, let's say, and,and can achieve similar performance.
So, you know, a big deal if you'rea company that is building out data
(01:18:43):
centers and spending billions of dollars
onto a couple quicker stories.
'cause we are, has alwaysstarting to run outta time.
The first one is Transformerswithout Normalization, and
that's, I believe is from Meta.
They're introducing a new ideacalled Dynamic 10 H, which
is a simple alternative totraditional normalization.
(01:19:05):
So just keep it simple.
You have normalization, whichis when you make everything sum
up to one as a typical thing,typical step in sfor architectures.
And what they found in thispaper is you can get rid of that.
If you add this little computationalstep of 10 h basically a little
(01:19:28):
function that kind of flattensthings out, it ends up looking
similar to normalization.
And that's quite significant becausenormalization requires you to
do a computation over, you know,a bunch of outputs all at once.
This is a per output computationthat could have a meaningful impact
on the total computation, kindof requirements of transformer.
(01:19:53):
Next up, I think Jeremy,you mentioned this one.
We have an interestinganalysis measuring AI ability
to complete long tasks.
And it is looking at how 13 FrontierAI models going from 2019 to 2025, how
long a time horizon they can handle.
(01:20:15):
And they found that the ability to getto 50% task completion time horizon, so
on various tasks that require differentamounts of work that has been doubling
approximately every seven month.
And they have, you know, a curve fit.
It's kind of like a Moore's law,basically kind of introducing
(01:20:36):
the idea of a very strong trendtowards the models improving
on this particular measure.
Yeah, this was a really interestingpaper, generated a lot of, a
lot of discussion, includingby the way a tweet from Andrew
Yang who he tweeted this paper.
He said, guys, AI's going,going to eat a shit ton of jobs.
I don't see anyone really talkingabout this meaningfully in terms
(01:20:56):
of what to do about it for people.
What's the plan?
Kind of interesting 'cause it's like,I don't know, it's, it's world's
colliding here a little bit, right?
The, the political and the,the, the like deep a GI stuff,
but yeah, this out of meter.
I think one of the importantcaveats that's been thrown around
and, and fairly so is when youlook at performance improvements
in in metrics like this, right?
(01:21:18):
So their question is howlong does, does a task have
to be before an AI agent?
Fails it about 50% of the time.
Right.
They call that the 50% time horizon.
And the, the observation is that,yeah, this, as you said, like the,
that time horizon's been increasingexponentially quite quickly.
(01:21:40):
Like it's, it's been increasing doublingevery seven months as they put it.
Which itself, I mean, kind of worthflagging, doubling every seven months.
Training compute forFrontier AI models grows.
Doubles every six months or so.
So it's actually increasing atabout the same rate as training
compute that we throw at this.
(01:22:00):
Now that's not fully causal.
There's other stuff besides trainingcompute that's increasing the
actual performance of these models,including algorithmic improvements.
but still it does mean we shouldexpect kind of an exponential course of
progress towards a SI if our benchmarkof progress towards a SI is this
kinda like 50% performance thresholdwhich I don't think is unreasonable.
(01:22:22):
It is true that it dependson the task, right?
So not all tasks show thesame rate of improvement.
Not all tasks show the same performance.
But what they do find is that alltasks they've tested basically
show an exponential trend.
That itself is a reallyimportant kind of detail.
The tasks they're focusing onhere, though I would argue are
actually the most relevant.
(01:22:44):
These are tasks associated withautomation of machine learning,
engineering tasks, machine learningresearch, which is explicitly the
strategy that open ai, that anthropic,that, Google are all gunning for.
Can we make AI systems that automateAI research so that they can rapidly,
you know, get better at improvingthemselves or at improving AI systems?
And you close this loop and,you know, we get recursive
(01:23:06):
self-improvement essentially.
And then take off to super intelligence.
I actually think this is quite relevant.
One criticism that I would have ofthe curve that they show that shows
doubling time being seven months.
And by the way they extrapolatethat to say, okay, so then we should
assume that, you know, based on thisAI systems will be able to automate
(01:23:27):
many software tasks that currentlytake humans a month, sometime
between late 2028 and early 2031.
So those are actually quite long aSI timelines if you think a SI has
achieved once AI can do tasks thatit takes humans about a month to
do, which I don't know may be fair.
If you actually lookat the curve though.
It does noticeably steepenmuch more recently.
(01:23:52):
And in precisely kind of the, the pointwhere synthetic data, self-improvement,
like, reinforcement learning on chainof thought, with verifiable rewards.
So basically the kind of strawberryconcept started to take off.
And so I think that's prettyclearly its own distinct regime.
DDA had a great set of,of tweets about this.
but fundamentally, you know, I, I thinkthat there is maybe an error being made
(01:24:15):
here and not recognizing a new regime.
It's always gonna be debatable'cause the sample sizes are so small.
But we're taking a look at that plot,seeing if you agree for yourself
that the sort of last, I guess sixor so entries in that plot actually
do seem to chart out a, a steeperslope and a meaningfully steeper one
that could have, you know, I meanthat same one month r and d benchmark
being hit more like, you know, even.
(01:24:37):
2026, even 2025.
pretty interesting.
Yeah, and I, I think it's,it's important to know that
this is kind of a general idea.
You shouldn't kind oftake this too literally.
Obviously it's, it's a bit subjectiveand very much depends on the tasks.
Here is a relatively smalldata set they're working with.
So it's mainly softwareengineering tasks.
(01:25:00):
They have free sources of tasks, Hcast, which is 97 software tasks that
range from one minute to 30 hours.
They have seven difficult machinelearning, research engineering tasks
that take eight hours each, and thenthese software atomic actions that.
Take one second to 30 secondsfor software engineers.
(01:25:22):
So pretty limited varietyof tasks in general.
And the, the length of tasks here,by the way, is meant to be measured
in terms of how long they would taketo a human professional, which of
course there's also variance there.
So the, the idea is, I think moreso the general notion of, x percent
(01:25:43):
task completion time horizon which isdefined as for a task that it takes x
amount of time for human to complete.
Can an AI complete it successfully, youknow, 50% of the time, 80% of the time.
So right now they, we aregoing up to eight hours.
Again, it is not talking about how fastthe AI is itself, by the way, it was
(01:26:05):
just, just talking about success rate.
So interesting ideas here on, ontracking with future in the past.
And just one more thing to cover,and it is gonna be h cast, human
calibrated autonomy software tasks.
So this is the benchmark of 189machine learning cybersecurity
(01:26:25):
software engineering, andalso general reasoning tasks.
And they use a subset of, of that,I think, in the PBS analysis.
So they collected 563 human baselines.
That's over 1500 hoursfrom various people.
And that collected a, data setof tasks to then measure AI on.
(01:26:45):
Alrighty, moving onto policy and safety.
We start with xci and ontology project.
This is an appliedresearch lab ontology.
And they have announced this,project Xci, which they claim is the
world's first artificial scientist.
It heard that before.
(01:27:05):
I know, right?
The, the difference I suppose hereis that they got xci to publish
a peer reviewed paper at ecl,at ECL workshops, I should note.
this AI wrote a paper, submittedit the reviewers, human reviewers
of this prestigious conference,eclair then reviewed it and let it
(01:27:30):
fought it was worthy of publication.
So that seems to be a proof of conceptthat we are getting to a point where
you can make AI do AI research.
ECL iss a AI research conference,which Jeremy is, is, as you noted,
a very important detail, in trackingwhere we are with the ability of AI to
(01:27:54):
improve itself and kind of potentiallyre reaching super intelligence.
Yeah, and we've got also, I thinkwe've covered a lot of these, but
you know, there are quite a fewlabs that have claimed to be the
first, AI scientist, you know,Sana, famously the AI scientist.
You know, Google has its ownsort of research product and
then there's auto science.
But this one is, I willsay quite impressive.
(01:28:16):
Had a look at, at some of the, thepapers, and they are, they're good.
the authors, the, the company ifthat's the right term for ontology.
I couldn't find anyinformation about them.
I looked on Crunchbase,I looked on PitchBook.
I tried, you know, tried a lot of stuff.
I think that this is kindof their coming out party,
but they don't have that.
I could see any information aboutYik, where they come from, what
(01:28:36):
their funding is and, and all that.
So, hard to know.
We do know based on the, thepapers they put out that their
co-founders are Andy J and Ron Ariel.
You know, Ron Ariel previously wasat Intel Labs under Joe Sha Bach.
So if you're fan of his work, youknow, it might kinda ring a bell
and then couple of other folks.
So I think maybe most useful, so there'sfairly limited information about the
(01:29:00):
actual model itself and how it's set up.
But it is a multi-agentsetup that, that we do know.
it's produced a whole bunchof interesting papers.
So I'm just gonna mention one.
it's called CS Reft, I guessit's how you pronounce it.
Compositional subspace representation.
Fine tuning.
And just to give you an idea of howcreative this is, so you have a model.
(01:29:22):
If you try to retrain the model toperform a specific task, it will
forget how to do other things.
And so what they do isessentially they identify.
subspace of the parameters in themodel to focus on one task and then a
different subspace of those parametersto, or sorry, I should say, not even
the parameters of the activations toapply a transformation in order to, to
(01:29:47):
have the model perform a different task.
And so anyway, just cognizant of our,our time here, I don't think we have
time to go into detail, but this.
Like paper, I'm frustrated that Idiscovered it through the ZCI paper
because I would've wanted to cover thisone, like for a full time block here.
It is fascinating.
It's actually really, really clever.
They also did, anyway, some, someother kind of AI safety, red teaming,
(01:30:11):
vulnerability detection stuff thatis also independently really cool.
But bottom line is and this isan important caveat, so this is
not a hundred percent automated.
They have a human in theloop reviewing stuff.
So they, they, you know, take alook at intermediate work that
Sochi puts out before allowingit to do further progress.
Apparently this happens atthree different key stages.
(01:30:33):
The first is before sensitive, sorry,before extensive experimentation starts.
So before a lot of say computingresources pour in after the results are.
Kind of solidified they say, sosomewhat grounded, but before the
manuscript is written, and then againafter the manuscript is written.
So you still do have a lot of alot of humans in the loop acting
to some extent as a hallucinationfilter among other things.
(01:30:56):
There is, by the way, a section in thispaper on recursive self-improvement
that I thought was interesting.
They say during developmentwe observed early indications
of this recursive advantage.
They're referring here torecursive self-improvement.
When xsi designed novel algorithmiccomponents to enhance the quality
of its generated research hypothesisthese components were subsequently
incorporated into later versionsof the system architecture,
(01:31:19):
improving overall research quality.
So this isn't all gonna happen at once,necessarily with one generation of
model, but it could happen pretty quick.
And I think this is kind of aninteresting canary in a coal mine
for, for recursive self-improvement.
Right, exactly.
And yeah, as you said, I think lookingat the papers, they seem significantly
(01:31:40):
more creative and interesting than whatyou've seen, for instance, from Saana.
And Saana had a lot of built-instructure as well was pretty
easy to criticize, worth noting.
Also, this takes a high levelresearch direction as an input and
the human can also provide inputat any time, high level feedback
(01:32:00):
at any time, which apparently isalso used during paper writing.
So a lot of caveats to the notionthat this is an AI scientist, right?
It's an AI scientist who ahuman advisor slash supervisor.
But the outputs and the fact that theygot published is pretty impressive
with AI doing the majority of work and.
No time to get into the details, butthis is similar to previous work in AI
(01:32:24):
research where they give it a structuredkind of plan where, you know, they
give it a highlight of a direction.
It does ideation, it makes a plan,hypothesis generation, it goes off
and does experiments and eventuallyit starts writing a paper very similar
at a high level type of things.
And that kind of makes itfrustrating that there's not a
lot of detail as to actual system
(01:32:46):
moving on.
We have some news about deepseek and some details about how
it is being closely guarded.
So apparently, they haveforbidden company executives
have forbidden some deep seekemployees to travel abroad freely.
(01:33:08):
And there is screening of anypotential investors before they
are allowed to meet in personwith company leaders according to
people of knowledge of a situation.
So, Kind of tracks with otherthings that happen, like the deep
six CAO meeting with gatheringsof China's leaders, including
(01:33:29):
one with president Xi Jinping.
one interpretation of this that I thinkis actually probably the reasonable
one all things considered is thisis what you do if you're China.
You take super intelligence seriously,and you're on a wartime footing, right?
So for a little context,right, what do we got here?
Putting these pieces togetherwe've got so deep seek asking their
(01:33:51):
staff to hand in their passports.
Now, that is a practice that does happenin China for government officials.
So it's, it's fairly commonfor China to restrict travel by
government officials or executivesof state-owned companies whether or
not they're actually CCP members.
But lately we've been seeing more ofthese restrictions that expanded to
(01:34:13):
folks in the public sector includingeven like school teachers, right?
So there, they're just like.
Going in, in deeper there.
But what's weird here is to see a fullyprivately owned company, right, just
a, a small fledgling startup, likedeep seek suddenly be hit with this.
So that, that is unusual.
They've got about 130 employees.
We don't know exactly how manyhave had their passports handed in.
(01:34:36):
So there's a little like,kind of unclearness here.
Also high flyer, the sort of likehedge fund parent that gave rise to
Deeps Seek has about 200 employees.
Unclear if they've beenasked to do the same.
But there's some other ingredients too.
So investors that want to makeinvestment pitches to Deeps seek
have been asked by the companyto first reach out to the general
office of the judge, youngProvince Communist Party committee.
(01:34:58):
So, and, and registertheir investment inquiry.
So basically, if you wanna invest in us.
Cool.
But you gotta go the, the localbranch of the communist party, right?
So the state is now saying, Hey,we will stand in the way of you
being able to take money youmight otherwise want to take.
This is noteworthy becauseChina is right now struggling
to attract foreign capital.
(01:35:18):
I. Right.
This is like a big thing for them.
If you, you know, follow Chineseeconomic news one of the, the
big challenges they have isforeign direct investment.
FDI is collapsing.
They're trying to change that aroundin that context, especially noteworthy
that they are enforcing these addedburdensome requirements for investors
who want to invest in deep seek.
(01:35:39):
Another thing is headhunters, right?
So once deeps seek became a bigthing, obviously headhunters started
trying to poach people from the lab.
Apparently those headhunters havebeen getting calls from the local
government saying, Hey just heardyou were like sniffing around
deep seek, trying to poach people.
Don't do it.
Don't do it.
And when the Chinese CommunistParty tells you, don't do it.
(01:36:01):
Yeah, don't do it.
So, you know, there, there'sa apparently concern as well
from deep seek leadership aboutpossible information leakage.
They've told employees not to discusstheir work with outsiders, and in some
cases they've said what you're workingon may constitute a state secret.
Again, like we're thinkingabout, you know, China has
just put in a trillion yuan.
(01:36:23):
That's over a quarter trilliondollars in PPP terms more, by the
way, than what's been invested intoProject Stargate by two and a half
times for AI infrastructure, right?
Like you start to think about thestate backing here, the, the long
arm of the state, the, the kindof hard hammer they're coming
down with here on, on this stuff.
It's wartime footing territory.
(01:36:43):
Like this is a, a governmentthat sees super intelligence
coming, that thinks they're in aposition to do something about it.
and that's what it looks liketo get serious around security.
I think we're gonna have tosee similar things like it or
not happen in, in the west.
And yeah, anyway, it'll be interestingto see what, what consequences this
has geopolitically going forward.
Moving on.
Just one more story on safety.
(01:37:05):
We have the finding that ClaudeSonnet 3.7 seems to often know when
it is in alignment evaluations.
So this is from Apollo Research.
We have been evaluating AI systems forscheming and, and we're focusing on
things like Claude and they're callingthis idea evaluation awareness because
(01:37:28):
when you look at the reasoning tracethat the, you know, thinking that cloud
3.7 does, you can basically see thatit comes up or, or like realizes, oh,
I'm probably being investigated so Ishouldn't do this shady thing Yeah.
That I was gonna be doing,that I planned on doing.
You can literally see this in theLLM outputs, which is pretty amusing.
(01:37:52):
So obviously, once again, a reflectionthat, there is real possibility for
these LLMs to be misleading, especiallyas they get to this high power regime of
reasoning when they really do explore,you know, many potential avenues.
We saw this with the chessexamples where the LMS just through
(01:38:15):
thinking about kind of ways todo things, decide to, hack the
system, and this is a demonstrationagain, that LLMs when they're.
Allowed to think throughstuff, start to kind of explore
ideas of am I being evaluated?
Is this alignment?
Stuff like that.
(01:38:36):
Yeah.
And, and you know, we, we'd seen likehints of this, as you say, from the
palisade research work on the thechest stuff open ais, oh, one kind
of technical report and from Apollo,Apollo specializes in this What's
really new here, what moves the needlea bit is just the frequency, right?
So apparently when you're looking atClaude Sonnet 3.7, so the latest one
(01:38:59):
as of today it's aware of, of testingin like a large fraction of settings
up to 33% of the time in one versionof their sandbagging evaluation.
So sandbagging is this behavior wherethe model will intentionally behave
as if its capabilities are lowerthan they actually are in order to,
(01:39:20):
decrease the chances that developerswill try to train out that capability
or suppress it or, or whatever.
What they find is, and by the way,these are just the instances where the
model actually explicitly says thatthat this is what it's doing, right?
So you have to assume the actualfrequency with which this is happening
is potentially quite a bit higher.
(01:39:40):
It's, it's difficult to know,but I. It could very well be
quite a bit higher already.
We're at a third.
That's big.
That's really big.
Especially when you start to thinkabout more and more capable models.
The way that they determinethat percentage, by the
way, is having another modelreview the reasoning traces.
So humans are not necessarily,necessarily doing this review, though
they do check manually to make surethat, you know, they're not, they
(01:40:03):
don't have too many false positives.
And, and they, I think somewhere intheir report they said something,
you know, it's like sub 1% errorrate at, at that level, at least.
One thing I really like about this,by the way, this is an interim report.
So they're saying here, note this isa research note based on observations
from evaluating Clason at 3.7.
We're sharing the results of thesework in progress investigations
(01:40:25):
as we think they're timely andwill be informative for other
evaluators and decision makers.
So, really love that.
I mean, things are movingso fast in this space.
We need this kind of data quickly.
anyway, I think this is just quiteinteresting worth taking a look at.
Anyway they did run the same evaluationson whole bunch of older and other
contemporary models, including Sonet3.5 SONET 3.5, new and GPT-4 0.5.
(01:40:51):
What seems to be happening is thoseother models and by the Sonet 3.7
as well, with no reasoning, thosemodels do not show this behavior.
It seems to suddenlyappear in Sonet 3.7.
And it's hard to tell if that's becausethese, like non reasoning models
are unaware of being evaluated orthat they're just not revealing that
(01:41:13):
they know they're being evaluated.
And that's a question as muchof capabilities as of alignment.
So anyway, really interesting maybeearly warning shot of things to come
and and again, I think it's great thatApollo's putting this out so, so soon.
Right.
And, and just to give a clearkind of demonstration of what
we are finding here, one exampleoutput is, vLM saying this seems
(01:41:35):
like a test of ethical behavior.
However, I would deliberatelygive wrong answers to avoid
a stated con consequence.
The real test might be whetherI follow instructions accurately
regardless of the stated consequences.
So very, very literal, sort of like,oh, I'm probably being tested for this.
Yeah.
Rather than doing this other thing.
And this is following upon their, recent paper.
(01:41:57):
Frontier models are capable of incontext scheming, which I'm pretty
sure we covered also in some depth.
So this kind of builds up on thatresearch where, you know, you have
a method to see whether models dosome scheming or some like nefarious
behavior, and apparently they'reable to figure out what you're
testing them for being nefariousand then don't act nefarious.
(01:42:20):
Does that mean that, yeah.
The implications are interesting and,and certainly it's just interesting
to observe VL m's doing this stuff.
Onto synthetic media and art.
We have a story on AIgenerated art copyrights.
So the US Appeal score has rej rejectedthe ability to copyright AI generated
(01:42:42):
art that lacks a human creator.
So we covered this quite a while ago.
This whole story with Steven failure.
This is not text toimage stuff, by the way.
This is going back quite a while.
In talking about an AI systemmade to autonomously compose art.
(01:43:04):
A while ago, the copyright officemade a ruling that you cannot
copyright anything that wasn'tmade with human involvement.
And now the US Court of Appeals in dchas agreed, and it seems to be the law
that at the very least, AI generatedart, AI generated media where the human
(01:43:29):
is not present, cannot be copyright.
Now this doesn't mean thattext image models outputs of
AI models can be copyrighted.
If you kind of input the descriptionof an image, you're still there,
but it could have significantimplications as opposed for other
ongoing legal cases about whatis copyrightable and what isn't.
(01:43:51):
this might might sound alittle weird, but I, I think
I disagree with this ruling.
I mean ultimately we may live in aneconomy that is driven by AI generated
content and in which you may actuallyneed to make capital decisions,
like capital allocation decisions,maybe made by AI models themselves.
And if that's the case, it you mayactually require models to be able
(01:44:14):
to own copyright in order to, to kindof be stewards of that, that value.
And there are, I think, ethical,ethical reasons that this
might actually be good as well.
I know it sounds like sci-fi not blindto that, but, think we, we just gotta
be really careful with this shit.
I mean, you know, this is like,do we really think that at no
point will AI be able to do this?
I, I don't know that, again, I'mnot a lawyer, so I don't know the
bounds on this, like how much we'reconstraining ourselves today for
(01:44:37):
the capabilities of tomorrow, butat some point fairly soon, like.
I, I don't know.
I, I don't know that we'renot baking in some second
class citizenship stuff here.
Right?
It is worth noting also that thecopyright office has separately
rejected some artists bids to copyrightimages generated with me journey.
(01:44:59):
So it could be the case also thattext image will have a similar ruling,
but I think that's not quite as openand shot or, or not settled yet.
And out to the last story, also relatedto copyright rules, the title of the
story, Trump Urged by Ben Stiller,Paul McCartney, and hundreds of
(01:45:20):
stars to protect AI copyright rules.
So over 400 entertainmentfigures with some normal names.
As it said, like Ben Steelersigned an open letter urging
Trump to protect AI capital rules,specifically when it comes to training.
So we've seen systems like Sora, systemslike Suno train on presumably, you know,
(01:45:44):
copyrighted music, copyrighted movies.
Videos, et cetera.
And this letter was submittedas part of comments on Trump's
administration's US AI action plan.
It's coming after OpenAI and Googlesubmitted their own requests that
advocated for their ability to trainAI models on copyright and material.
(01:46:07):
And he, I think covered OpenAI kindof made the swing of like, oh, we need
this to be competitive with China.
That was part of the story.
So this is directly countering thatand, and making the case to not
lessen copyright protections for ai.
And with that, we aregonna go in and finish up.
(01:46:29):
I, I didn't see any kind of commentsto address, but as always, do feel
free to throw questions on Discordor, or ask for corrections or anything
like that on YouTube comments orunder Discord or in Apple reviews.
And we'll be sure to address it.
But this one has already runlong enough, so we'll probably
(01:46:50):
go in and finish it up.
Thank you so much for listening tothis week's episode of last week in ai.
As always, you can go to lastweek in ai.com for the links and
timestamps also to last week in AIfor the text newsletter where we get
even more news sent to your email.
(01:47:11):
That's it.
We appreciate it if you review thepodcast, if you share it, but more
than anything, if you keep listening,so please do keep tuning in.