
April 9, 2025 73 mins

Our 206th episode with a summary and discussion of last week's big AI news! Recorded on 04/07/2025

Try out the Astrocade demo here! https://www.astrocade.com/

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai

Read our text newsletter and comment on the podcast at https://lastweekin.ai/.

Join our Discord here! https://discord.gg/nTyezGSKwP

In this episode:

  • Meta releases Llama 4, a series of advanced large language models, sparking debate on performance and release timing, with models featuring up to 2 trillion total parameters across different configurations and applications.
  • Amazon's AGI Lab debuts Nova Act, an AI agent for web browser control, boasting competitive benchmarking against OpenAI's and Anthropic's best agents.
  • OpenAI's image generation capabilities and ongoing financing developments, notably a $40 billion funding round led by SoftBank, highlight significant advancements and strategic shifts in the tech giant’s operations.

Timestamps + Links:

  • (00:00:00) Intro / Banter

Tools & Apps

Applications & Business

Research & Advancements


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI as usual. And in this episode we'll summarize and discuss some of last week's most interesting AI news, and you can go to the episode description to see all the timestamps and links to those articles. I'm one of your regular hosts, Andrey Kurenkov.

(00:31):
I studied AI in grad school andI work at a generative AI startup
now, which is, let's say about todo something exciting, hopefully.
Yeah, I'm, I'm actually excited aboutit 'cause we've been talking offline
about this announcement that's comingand I feel like I probably joined the
audience in being very curious about whatyou're, what you're up to day to day.
So, we'll, we'll see something soon.

(00:53):
I hope.
I'm really excited for that.
Yeah, yeah.
Yeah.
I guess I haven't disclosed too much. I can say, I mean, I'm working on AI for making little games, with the idea to have a platform where people can make and publish and play each other's games. And yeah, tomorrow, April 8th — 'cause we are recording this a little late.

(01:15):
So Tuesday is a big launch of the latest iteration of the technology.
Oh.
Yeah, exactly.
So by the time this episode is out, more than likely we'll have already done a big launch, and anyone listening to this can go to astrocade.com to try it out. I'm sure I'll try and plug it elsewhere as well, so you'll be sure to hear about it.

(01:36):
Yeah.
Yeah.
So on that note, like, we're basically gonna have to blitz this episode 'cause Andrey has to get to work, man. He's gotta actually, like, you know, get out there, stop being such a lazy schmuck. So yeah, let's get going, starting right away in Tools & Apps. And the first story is a pretty big one: it's Llama 4. Meta has released the latest iteration of their open source

(02:00):
series of large language models — large multimodal models as well. These ones come in four varieties and different sizes. Some of them are called Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. And they're also launching it out to all their various ways to

(02:23):
interact with chatbots: on WhatsApp, Instagram, Facebook — I forget wherever else they let you talk to AI. And these are quite big. So just to give an idea, Maverick has 400 billion total parameters, but only 17 billion active parameters. So they are kind of pitching this also as more friendly to

(02:47):
lower-end device configurations. But on the high end, Behemoth — which is not released and which they are saying is still being trained — has nearly 2 trillion total parameters and 288 billion active parameters. Which, from what people were kind of speculating around GPT-4 at the time,

(03:10):
was that it was something like this: a mixture of experts where you have nearly a trillion or two trillion total parameters, and then over a hundred billion, maybe 200 billion active parameters. We dunno. But this reminds me of the GPT-4 kind of speculations.
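To make the "total versus active parameters" point concrete, here's a rough parameter-accounting sketch for a mixture-of-experts model. The numbers are made up for illustration and are not Meta's actual Llama 4 configuration:

```python
# Illustrative only: rough parameter accounting for a mixture-of-experts (MoE)
# transformer. These numbers are invented for the example, NOT Llama 4's real config.

def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active-per-token) parameter counts for a simple MoE model."""
    total = shared_params + num_experts * params_per_expert
    # Only the routed experts (plus the shared attention/embedding weights)
    # participate in any single token's forward pass.
    active = shared_params + experts_per_token * params_per_expert
    return total, active

# Hypothetical config: ~9B of shared weights, 128 experts of ~3B params each,
# with 1 expert routed per token.
total, active = moe_param_counts(
    shared_params=9e9, params_per_expert=3e9, num_experts=128, experts_per_token=1
)
print(f"total ≈ {total/1e9:.0f}B, active per token ≈ {active/1e9:.0f}B")
# total ≈ 393B, active per token ≈ 12B -- the same "huge total / small active" pattern.
```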
Yeah, this release, by the way, is pretty underwhelming to a lot of people.

(03:32):
So there's this interesting debate happening right now over what exactly is fucked with the Llama 4 release, right? So they're large models. I'll just talk about the positives first. So, from an engineering standpoint, everything that's revealed in the sort of 12-page read — I don't know whether they call it a blog post or a technical report or whatever — it's not this, like, you know, beefy 50-pager of the

(03:55):
kind that DeepSeek produces. It gives us some good data, though. Everything we get there about the engineering seems kind of interesting. By the way, when it comes to the general architecture choices being made here: a lot of inspiration from DeepSeek, man. Like, a lot, a lot. And just to give you an idea, so they trained at FP8 precision.

(04:16):
So again, like DeepSeek V3 — though DeepSeek V3 used some fancier mixed-precision stuff too; we talked about that in the DeepSeek episode. Theoretical performance of an H100 GPU is around a thousand teraflops for FP8, and they were able to hit 390 teraflops in practice. So they were hitting utilization of around 39, 40%, which — that's on the high end for,

(04:40):
for a fleet of GPUs this big — this is 32,000 H100 GPUs they used for this. This is no joke. Like, just getting your GPUs to hum that consistently is a very big deal. And so from an engineering standpoint, that's a pretty good sign.
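For anyone who wants the utilization math spelled out, it's just achieved throughput over theoretical peak, using the figures quoted in the discussion above (these are the show's numbers, not vendor spec sheets):

```python
# Back-of-envelope model FLOPs utilization (MFU) from the numbers quoted above.
peak_tflops_fp8 = 1000   # rough per-GPU FP8 peak used in the discussion
achieved_tflops = 390    # reported achieved throughput per GPU

utilization = achieved_tflops / peak_tflops_fp8
print(f"MFU ≈ {utilization:.0%}")   # MFU ≈ 39%

# Aggregate compute for the whole fleet at that utilization (illustrative):
num_gpus = 32_000
fleet_tflops = num_gpus * achieved_tflops
print(f"fleet ≈ {fleet_tflops/1e6:.1f} exaFLOP/s of FP8 compute")  # ≈ 12.5 exaFLOP/s
```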
There are a couple things here that they did that are a little bit distinct. So one piece is: this is a natively multimodal model.

(05:01):
So yes, drawing a lot of inspiration from DeepSeek, but at the same time very much kind of that Meta philosophy of, you know, we want good grounding, good multimodality. They use this technique called early fusion, where essentially text and vision tokens are combined from the very start of the model architecture, and they're all used to train the full backbone of the model.

(05:23):
And that means that the model learns a joint representation of both of those modalities from the very beginning. That's in contrast to late fusion, where you would process text and images and other data in separate pathways and just merge them near the end of the model — more of a kind of janky, hacked-together Frankenstein monster. This is not that, right? This is more of a monolithic thing.
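A minimal sketch of the early-versus-late-fusion distinction, with hypothetical module shapes rather than Llama 4's actual architecture:

```python
# Minimal sketch (PyTorch) of "early fusion" vs "late fusion" for a text+vision model.
# Shapes and modules here are hypothetical illustrations, not Llama 4's design.
import torch
import torch.nn as nn

d_model = 512

class EarlyFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(32_000, d_model)
        self.image_proj = nn.Linear(768, d_model)   # project patch features into token space
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4
        )

    def forward(self, text_ids, image_patches):
        # Text tokens and image-patch tokens are concatenated into ONE sequence
        # before the backbone, so every layer sees both modalities jointly.
        tokens = torch.cat(
            [self.text_embed(text_ids), self.image_proj(image_patches)], dim=1
        )
        return self.backbone(tokens)

class LateFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.image_encoder = nn.Linear(768, d_model)
        self.merge = nn.Linear(2 * d_model, d_model)  # modalities only meet here, at the end

    def forward(self, text_emb, image_patches):
        _, h = self.text_encoder(text_emb)                     # separate text pathway
        img = self.image_encoder(image_patches).mean(dim=1)    # separate vision pathway
        return self.merge(torch.cat([h[-1], img], dim=-1))     # merged only at the output
```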
Anyway, there's a whole bunch of stuff in here.

(05:43):
It turns out very much that Scout and Behemoth seem to be in the same model line — like, they kind of seem to be making some of the same design choices in terms of their architecture. Maverick seems like a DeepSeek clone. It seems like an attempt — and some people are speculating this was a last-minute decision — to try to replicate what DeepSeek did.

(06:04):
Whereas if you look at Scout and you look at Behemoth, those are much more of the kind of — like you said, Andrey — it's like somebody was trying to do GPT-4 meets, like, a Mixtral type of model, and it's very unclear why this happened. But one thing we do know is the performance seems to be shit, like,

(06:24):
when people actually run theirown benchmarks on it, or at the
very least, very mixed right.
There's all this stuff about likejust for example l on LM sis.
They've got a model thatseems to be crushing it.
And it's, it's doing greaton amazing ELO score.
But we see if we read very closely inthe paper that the, the model they used
for that is not the model they releasedis not any of the models they released.

(06:48):
It's a custom fine-tune for the LMSYS Arena leaderboard. And that is a big problem, right? That's one of the things people are really ripping on Meta about. Like, look, you're showing us eval results, benchmark results, for one model, but you're releasing a different one. And this really seems like eval gaming. All kinds of weird questions about why they released this on a Saturday.

(07:09):
Like, this is supposed to be one of your flagship launches, your frontier launch — what is going on? The release date on GitHub was switched from April 7th to April 5th, like, overnight, basically. Maybe because they're anticipating some interesting releases coming this week that will scoop them. There's a whole bunch of stuff around here. Last thing I'll say is, one explanation people are throwing around for the bad performance of these models is just the hosting.

(07:33):
The systems they're being hosted on are just not optimized properly. Maybe they're quantizing the model a little too much or poorly, maybe they're using bad inference parameters like temperature or, you know, top-p or whatever, or bad system prompts. Anything like that is possible, including more nuanced kind of hardware considerations. But bottom line is, this may be the flub of the year in terms of big flashy launches.

(07:57):
That should have been a big deal. I think we'll be picking up the pieces for a couple weeks to really figure out how we actually feel about this, because right now there's a lot of noise, and I personally am not resolved yet on how impressive any of these things really are. But those are my top lines, anyway.
Yeah, I think that's a good overview of the sort of discussions and feedback

(08:19):
and reactions people have been seeing. There's also been speculation that they, at Meta — at least leadership — pushed towards gaming the quantitative benchmarks, things like MMLU, GPQA, the usual sort of numbers you see, LiveCodeBench. Of course they say it's better than

(08:41):
Gemini 2.0 Flash, better than DeepSeek V3, better than GPT-4o. But as you said, when people are using these models sort of anecdotally, personally, or on their own personal held-out benchmarks — benchmarks that are not these publicly available ones where you can cheat, or even accidentally cheat, or sort of cheat, yeah,

(09:03):
by not intentionally trying not to cheat. Right. Which is one of the important things you do these days: you need to make sure your model isn't trained on the test data when you're scraping the internet. If you're not doing that, you might be cheating without knowing it — or at least, like, pretending not to know it. So yes, it's seeming like the models are not

(09:27):
good — that is the general reaction. Worth noting also, as you said, with Behemoth, Maverick, and Scout: there is this, yeah, difference, where they have 16 experts for Behemoth and Scout. So, you know, pretty big models that are doing most of the work.

(09:49):
Maverick is different: it has 128 experts, so it's a bigger model, but the number of active parameters is low. And I think there are various reasons you could speculate about — like, they want the models generally to be runnable on less hardware, and Behemoth would be the exception to that.

(10:12):
As you said also, they need to keep costs down. I have no idea how Meta is thinking about the business plan here of supplying free chatting with LLMs — which is, relative to anything else, very expensive — and they're still doing it for free, kind of all over their product line. So various kinds of speculations you can have here.

(10:33):
But as you said, seemingly the situation is they launched possibly kind of in a hurry because something else is coming — because these businesses typically know, you know, each other's releases somehow. And perhaps they should have waited a little longer.
Yeah, and a lot of questions as well around, like, just the size

(10:53):
of these models and what itmeans to be open source, right?
We've talked about this in the contextof other models, including, including
deep seq, V three and R one and all that.
At a certain point, your model'sso big, you just need expensive
hardware to run it in the first place.
And I think this is a reallygood example of that, right?
So Scout, which is meant tobe their small model, right?
Like, it sounds like Flash, it soundslike one of those, you know, 2.7

(11:16):
billion parameter models or something. It is not. It's a 17 billion active parameter model, as you said. Their big flex here is that it fits on a single Nvidia H100 GPU. So that's actually pretty swanky hardware — that's, you know, tens of thousands of dollars of hardware. You know, that's 80 gigs of HBM3 memory, basically.

(11:36):
One thing, by the way, that this stuff does have going for it is an insanely large context window: 10 million tokens. That is wild. The problem is that the only evals they show on that context window length are needle-in-a-haystack evals, which, as we've covered before, are pretty shallow. Like, it doesn't really tell you much about how the model can use

(11:58):
the information that it recovers. It only tells you, oh, it can pick out a fact that's buried somewhere in that context window. It's not bad, but it's not sufficient, right? It's one of those things. So — and that's the Llama 4 Scout. Maverick, they say, fits on one H100 GPU host. Now, the word "host" is doing a lot of heavy lifting there.

(12:19):
Really, what that means is one H100 server. So presumably they mean the H100 DGX — in fact, I think that is what they said in the write-up — that would be eight H100s. So, hundreds of thousands of dollars worth of hardware. Hey, it fits on just one of these servers. Like, that's a lot of hardware.
So anyway, bottom line is, I think, you know — incidentally, Scout,

(12:42):
I believe, is a distillation of Llama 4 Behemoth, which is still in training. So we don't know what Llama 4 Behemoth is actually gonna look like; we're all kind of holding our breath on that. But for now, unless Meta has, like, really screwed the pooch on distillation — and they actually have an amazing Behemoth model and the distillation process just didn't work —

(13:03):
it seems plausible that the Behemoth model itself may be underperforming as well. But again, all this is still up in the air. As with so many things here, it does seem like a rushed release, and I think the dust is gonna be settling for a few weeks to come.
Still worth highlighting: they are sticking to their general approach of open sourcing this stuff. You can request access to Llama 4 Maverick and Llama 4 Scout to get

(13:26):
the actual weights, as you were able to with previous Llamas. As was previously the case, they're licensed under these bespoke things — the Llama 4 Community License Agreement — where you are saying you will not be doing various things that Meta doesn't want you to do. But still, you know, Llama has, I think, been a big part in

(13:48):
a lot of the research and development on the open source side. So that, at least, is still laudable.
And on to the next story. We are moving to Amazon, which hadn't released too many models, but they seemingly are starting, and their first act here is Nova Act, which is an AI agent that can control a web browser.

(14:11):
This is coming from Amazon's AGI lab, which hasn't really put out many papers or many products so far, but this seems to be a pretty big one. And it is comparable to something like OpenAI's — I forget what it's called, but their web-use agent that can — oh, Operator.

(14:31):
Yeah, Operator, exactly. Where you can tell it, you know, go to this website, scrape all the names of the links and summarize it for me — stuff like that. So they have this general-purpose AI agent. They're also releasing the Nova Act SDK, which would enable developers to

(14:51):
create agent prototypes based on this. They say this is a research preview, so still a bit, you know, early — presumably not, like, fully baked. Yeah, it's an interesting play. I think we still don't have too many models or offerings of this particular variant. We have Anthropic's computer use, we have OpenAI's Operator — which I don't

(15:16):
recall if they have an SDK for that. So this could be a pretty significant entry in that space. And this is the first product to come out of the Amazon AGI lab, right? So this is kind of a big unveiling. A couple of — I mean, we don't have a ton of information, but some notes on benchmarks, right?

(15:37):
So they are claiming that it outperforms OpenAI's and Anthropic's best agents on at least the ScreenSpot Web Text benchmark. That's a measure of how well an agent interacts with text on a screen. And apparently on this benchmark, Nova Act scored 94%, and that's in contrast

(15:57):
to OpenAI's agent, which scored 88%, and Anthropic's Claude 3.7 Sonnet at 90%. So, significant, seemingly, on that benchmark — but that isn't enough data to actually be able to kind of generally assess the capabilities of this agent. In particular, WebVoyager is a more common evaluation, and the performance of the Nova Act agent isn't being reported

(16:21):
on that, so that kind of leads you to ask some of these questions. But we'll see. I mean, they definitely have, you know, great distribution through Alexa, and maybe that'll allow them to iteratively improve this pretty fast. We'll see. They're also improving their hardware stack thanks to, among other things, the Anthropic partnership. So even if it doesn't come out great out of the gate, they're sitting at least

(16:43):
on a promising hyperscaler stack, so this might improve fairly quickly.
Right, and it could also be part of their plans for Alexa+, their new subscription service. They are launching Alexa also as a website, in addition to their hardware. So presumably they might be thinking to make this part of their product.

(17:04):
And we'll keep pushing on to a few more stories. Next up, another giant company planning a giant model release. This one is Alibaba, and reportedly they are preparing to release their next flagship model, Qwen 3, soon. So the article here is saying as soon as April. Apparently Alibaba is kind of rushing

(17:27):
here to respond to DeepSeek and a lot of the other hot activity going on in China. We've talked about Qwen actually quite a bit over recent months — Qwen 2.5, various smaller releases, I guess you could say. And Qwen 3 presumably is meant to be kind of the best

(17:48):
and beat everyone, right?
The one thing I'll say is I've seen speculation that this may have been part of the driver for the rapid release of Llama 4. So, for all we know, could be. Next, we have a smaller company, a startup, Runway, and it has released their latest video-generating AI model. So this is Gen-4, and it is meant to be kind of a customer-facing,

(18:14):
usable video generation model. And it looks pretty impressive from just having looked at it. It is kind of catching up to Sora — catching up to the more top-of-the-line models that are capable of consistent video, capable of also being prompted both by text and image.

(18:35):
They have a little kind of mini short film that they launched this with, where you have, like, a cute forest and then some characters interacting, showcasing that consistency across multiple outputs. And this is at a time when they are raising a new funding round, valuing

(18:55):
them at $4 billion, with the goal of getting 300 million — or, sorry, the goal is to get $300 million in revenue this year. So Runway: a major player in the space of AI for video.
And a similar story next. This one is about Adobe, and they are launching an AI

(19:15):
video extender in Premiere Pro. So Premiere Pro is their flagship video editing tool, and we've seen them integrate a lot of AI into Photoshop, for instance. This is the first major AI tool to get into Premiere Pro and video editing. So the new feature is Generative Extend: it will allow you to extend videos, but only by up to two seconds,

(19:39):
powered by Adobe Firefly. We covered this, I think, when they previewed it, but now it's rolling out to the actual product.
Yeah. This kind of rollout, at least to me, makes a lot of sense as a first use case for these sorts of video models. It sort of reminds me of, you know, Copilot that was powered by Codex back in the day, right? The first use case was text, or code, autocomplete.

(20:00):
We've seen that the autocomplete kind of functionality is sort of native for a lot of these transformer models. This one's a little different, but it's still kind of this very natural, very grounded-in-real-data thing, and you're just extending it a little bit. So especially for video, where you need to capture a lot of the physics, that's something I'd expect to be kind of a nice way to iron out

(20:21):
a lot of the kinks in these models.
Right.
And they do have some other small things launching alongside that. They have AI-powered search for clips, where you can search by the content of a clip, which I would suspect is actually a bigger deal for a lot of developers.
That's true.
Yeah.
Because if you have a hundred clips, now you can search for the content

(20:45):
as opposed to file name and whatever.
And they also have automatic translation. So quite a few significant features coming from Adobe.
And just one more story.
OpenAI is apparently preparing to add a reasoning slider and also improved memory for ChatGPT. So we've seen some, I guess, people starting to observe, presumably, this

(21:11):
idea of a reasoning slider in testing.
And that allows you to specify that the model should think a little, think harder, or you can leave it at automatic to let the model do its own thing — mirroring to some extent what Anthropic has also been moving toward.
And on to Applications & Business.

(21:31):
The first one is about Nvidia H20 chips, and there being $16 billion worth of orders from ByteDance, Alibaba, and Tencent recently. So this is covering sort of a set of events, I suppose — or a period in early 2025 — where these major AI players from China have all ordered

(21:55):
this massive amount of the H20 chip — one that's not kind of restricted, and the one that I believe (or some variant of it) was what DeepSeek was built upon and what showcased the ability to train DeepSeek V3. So this is a big deal, right? And Nvidia presumably is trying hard to not be too limited,

(22:23):
to be able to actually do this.
Yeah — DeepSeek was trained on the H800, which is another China variant of the H100. So you're right, they all fall under the Hopper generation, but specifically for China. This is a play that we've seen from Nvidia a lot. It's them responding to the looming threat of new potential export control restrictions, right?

(22:45):
So you can imagine: if you're Nvidia, and someone tells you, hey, in a couple months we're gonna crack down on your ability to sell this particular GPU to China — well, you're looking at that and you'll go, okay, well, I'm gonna quickly try to sell as many of these to China as I can while I still can, and make that money. And then, you know, once the export control ban comes in,

(23:05):
then that's it, right? So you're gonna actually tend to prioritize Chinese customers over American customers. This has happened in the past, and this will continue to happen in the future as long as we include loopholes in our export control regimes. And so what you're literally seeing right now is Nvidia making the call: do we have enough time to proceed with making the chips to meet the

(23:27):
$16 billion set of orders from ByteDance, Alibaba, and Tencent? Like, are we gonna have the GPUs ready and sold before the export control bans come into effect? Otherwise, if we don't, we're just sitting on this hardware.
Now, keep in mind: H20s are strictly worse than, say, the H100s that Nvidia

(23:48):
could be making, or the H200s that Nvidia could be making instead. So from the standpoint of selling them to the domestic market — or, you know, potentially, depending on the node and the interactions here, they could be making Blackwells too — there's this question of, like, if they choose to go with making H20s to try to meet the Chinese demand, which is about to disappear, then they may end up being forced to sit on these relatively

(24:11):
crappy H20 chips that don't really have a market domestically in the US, right? So that's a big risk, and they're calculating that right now. They have limited TSMC allocation to spare, so it's not like they can meet both. At the macro level, you can think of this as Nvidia deciding whether to make the TSMC fabs churn away on chips for China or for the US.

(24:34):
That's kind of the decision point here; that's what it boils down to. So again, we've seen this before, and it's all gonna come down to their internal assessment of when exactly these export controls are gonna come.
Moving along, we have kind of a funny story — and also another one of the big business stories of the week — and it is that Elon Musk's X, previously

(24:58):
Twitter, has been sold to Elon Musk's xAI in a $33 billion all-stock deal. So you heard that right: the AI company that is developing Grok has bought the social media company Twitter slash X for tens of billions of dollars.

(25:19):
Grok has been hosted as part of X basically since its inception. You can pay for a subscription tier to use Grok on X. Grok.com, I believe, also exists, but I guess Grok has primarily lived on X. And the justification here is, you know, of course, that Twitter will provide a lot of data to train Grok, and there's,

(25:41):
like deep synergies that can be leveraged.
Yeah.
It's also kind of interesting to note that when Elon actually bought X — or Twitter, as it was back then — he paid $42 billion for it, right? So now this is an all-stock deal at $33 billion, so the company's value has actually nominally decreased.
Now, there are all kinds of caveats there.

(26:01):
You know, you have a sort of internal — let's say a purchase within the ecosystem. I'm not clear on what the legalities of that are, whether fair market value is an issue in the same way that Elon raised it with respect to OpenAI's attempt to sell its for-profit arm to the non-profit arm, right? That was one argument: hey, you're not selling this at fair market value. I suspect this is probably more kosher because it has less, like,

(26:25):
you know, control-weirdness issues — but super not a lawyer. Just interesting numbers. So yeah, anyway, an all-stock transaction, and the ultimate combination of these two would value xAI at $80 billion. So that's, you know, pretty big. And also, interestingly, pretty close to Anthropic's valuation. Right — which is amazing given that xAI was, 20 minutes ago, non-existent.

(26:49):
Right. This is a classic Elon play. xAI — yeah, it's, I'm — it's all a bit confusing. You're right, xAI came outta nowhere, right? Like, what, 18 months, two years ago? Pretty, pretty wild. Pretty impressive, and a kind of classic Elon play. So there you have it. Not much information in the article, but we just get these top

(27:09):
lines, and the numbers are big. The numbers are big. And as you might expect, there's speculation. I mean, a lot of people are making memes about the — you could say self-dealing, I suppose, in this case, right? Like, these are two of the Musk companies, and one of them is buying the other. And it could have to do with some, you know, financial aspects of the

(27:34):
purchase of Twitter — the loans that Elon Musk took out against his Tesla stock, which is now falling a little bit. Yeah, so you can do various kinds of nitpicky theorizing as to why this happened right now, why this precise pricing. But in any case, I suppose it's not entirely surprising given how this has been going.

(28:00):
On to the lightning round. We have the story that SoftBank is now OpenAI's largest investor and has pushed the market cap of OpenAI to $300 billion — but at the cost of a ton of debt. So SoftBank is a big investor out of Japan that has invested very large sums

(28:24):
into various tech startups, and seemingly they've taken on debt to do the $40 billion investment round for OpenAI, with 10 billion borrowed from Mizuho Bank. So, wow, that's quite a borrow.
Yeah, it is. It's also consistent with the hundred billion, let alone

(28:47):
500 billion, to invest in this deal. And, you know, all the sort of sordid details started to come out, and it's like, yeah, well, this is part of it — they're literally borrowing money. So this is an interesting play. Masayoshi Son is taking a big risk here; there's no other way to put it. And there's a kind of budding relationship there with Sam Altman that we've seen in various forms.

(29:09):
There are a couple of strings attached, as you might imagine, right? If you give $40 billion to an entity, there are gonna be strings. So the $10 billion deal is expected to be completed in April; the remaining amount is set to come in early 2026, right? So this is not super near-term. But when you're thinking about the superintelligence training runs

(29:30):
that OpenAI internally thinks are plausibly gonna happen in 2027, this is basically that, right? This is that injection of capital. OpenAI has to transition to a for-profit by the end of the year in order to get the full $40 billion, right? So more and more pressure mounting on OpenAI to successfully complete that transition, which seems to be bogged down in an awful lot of legal issues now.

(29:53):
So this is kind of a challenge. Apparently SoftBank retains the option to pare back the size of the funding round to $20 billion if OpenAI does not successfully transition. That is an accelerated deadline, right? So OpenAI previously was under a two-year deadline from its last round of funding, and now it's saying, okay, well, by the end of the year you've gotta

(30:13):
complete this transition. So that's kind of interesting — more heat on Sam to complete this weird, you know, acquisition — is it the nonprofit acquiring the for-profit, or the for-profit acquiring the nonprofit? I mean, who can keep track anymore — but basically, yeah, the acquisition of the nonprofit by the for-profit and all that. So, we'll see.

(30:33):
I mean, this is gonna be a legal and tech drama the likes of which I don't think many of us have seen — truly a unique situation, which can be said about OpenAI a lot; who hasn't been there?
Yeah. And next up we have a story about DeepMind, which is Google's AI arm, and there are reports now that it's holding back release of AI

(30:58):
research to give Google an edge. Just last week, I think — maybe two weeks ago — we were sort of commenting on some of the research coming out of DeepMind as seeming like something Google might not want to share because it is competitively sensitive. And yeah, now there are reports coming from former researchers that DeepMind,

(31:20):
for example, is particularly hesitant to release papers that could benefit competitors or negatively impact Gemini, their LLM offering. And I've also heard a little bit of this from people I know — that there's a lot of bureaucracy, quite a lot of red tape, around publication and

(31:40):
going to publication these days. So apparently the new publication policies include a six-month embargo on strategic generative AI research papers, and you have to justify the merits of publication to multiple staff members.
Yeah.
This is also — I mean, in effect, we've seen this sort

(32:00):
of thing from DeepMind before. In particular, I'm trying to remember if it was the Chinchilla paper or if it was Gato or something — anyway, we've talked about this before on the podcast as well, but there was one case early on where there was a full year's delay between, I think, the end of the training of a model and its announcement. It was one of these early kind of post-GPT-3 models. And so, you know,

(32:23):
this is in substance maybe, partlynew, and then it's partly a, a habit
that's been developed internally.
It makes all the sense in the worldbecause, you know, they're forced to
compete with an increasingly hot fieldof, companies like Open AI and Anthropic
and so, and, and Chinese companies.
But It's definitely interesting.
I mean the claim as well has been,there were three former researchers

(32:43):
they spoke to who said DeepMind was more reluctant to share papers that could be, okay, exploited by competitors, or that cast Google's own Gemini models in a negative light compared to others. They talked about an incident where DeepMind stopped the publication of research that showed that Gemini is not as capable, or is less safe, than, for example, GPT-4.

(33:05):
But on the flip side, they also said that they blocked a paper that revealed vulnerabilities in ChatGPT over political concerns — essentially, concerns that the release would seem like a hostile tit-for-tat with OpenAI. So you get a bit of a glimpse of the kind of inter-company politics, which is absolutely a thing. There's a lot of rivalry between these companies, which is kind of

(33:27):
an issue too on the on the securityand, and the, control alignment side.
But anyway, so there you have it.
Maybe not a shock, but certainlya practice that we now have more
evidence for coming from Google.
And by the way, it's certainly gonnabe practiced at other companies too.
This is not gonna be a aGoogle exclusive thing.
Yeah.
I mean, to be fair, OpenAI hasbasically stopped publishing Yeah.

(33:48):
For the most part.
So as you said, not surprising — but still notable, because DeepMind for a long time was a fairly independent, more or less pure research organization that was much more academic-friendly. And that is definitely changing.
Next up we have a story about SMIC.

(34:10):
China's leading semiconductor manufacturer, trying to catch up to TSMC. Apparently they are at least rumored to be completing their five-nanometer chip development by 2025, but at much higher cost, due to using older-generation equipment and presumably having very, very poor yields.

(34:34):
Yeah, and check out our hardware episode for more on this. But what they're doing is they're forced to use DUV (deep ultraviolet) lithography machines rather than EUV for the five-nanometer node. Normally five nanometers is where you see the transition from DUV to EUV — or at least that's one node where you could. And anyway, in order to do it

(34:56):
with DUV, the resolution is lower. So you've gotta do this thing called multi-patterning: take the same chunk of your wafer and scan it through again and again — potentially many times — in order to achieve the same resolution as you would with just one scan with EUV. And so what that means is you're spending four times as long on one particular pass for one particular layer of your lithography,

(35:21):
and that makes your output slower. So it means you're not pumping these things out as fast, and it also reduces yields at the same time — both of which are economically really bad. So yeah, yields are expected to be an absolutely pathetic 33%, and that translates into a 50% higher price than TSMC for the same node, by the way. I mean, the CCP is gonna be subsidizing this to blazes.

(35:43):
So the economics are just fundamentally different when you talk about, for example, lithography and fabs for AI chips in China, 'cause it's just a national security priority. But still, this is kind of interesting, and this node will be used by Huawei to build their Ascend 910C chip. So these are actually gonna see the light of day in production.

(36:04):
And just one more story in this section: Google-backed Isomorphic Labs is raising $600 million. So Isomorphic Labs basically spun out of DeepMind back in 2021. They are focused on AI to model biological processes, and primarily

(36:25):
do drug discovery, presumably. And this is their first external funding round. So they, you know, were able to do a lot, I suppose, with support from DeepMind and Google. They have now made these major deals with companies like Eli Lilly and

(36:45):
Novartis, for billions in kind of partnerships and research programs. So Isomorphic Labs is certainly active, it seems.
Yeah, it's sort of a funny headline, in a way. Like, I'm not sure how they get to "external investment round," 'cause apparently, yeah, you see the financing round,

(37:07):
and they're like, it's led by Thrive Capital. Okay. Okay, cool, cool. With participation from GV — which is called GV the same way KFC is called KFC: KFC used to be called Kentucky Fried Chicken, GV used to be called Google Ventures. Google. Oh, shit, that's Alphabet, isn't it? Yes. Yes. Alphabet, also kind of the parent company of

(37:27):
Isomorphic Labs.
Right.
So it's like — GV is participating: that's Google. And then follow-on capital from an existing investor, Alphabet. So, like, at least by entity count, this is two-thirds sort of the Google universe.
Yeah.
"External" is, let's say, generous, I suppose.
Yeah, yeah.
Like I, I couldn't see how much isit is being led by Thrive Capital

(37:48):
and they're external, so Great.
I don't know how much is beingcontributed by whom, but I just
sort of thought that was funny.
Like Google is, is so big, orAlphabet is so big, they're
kind of everywhere all at once.
Which anyway, just whatcounts as external these days?
I don't know anymore.
And moving on to the researchand advancements section.
First up, we actuallyhave a paper from Open ai.

(38:10):
So I should take back a little bit what I said about them not publishing — they do publish some very good research still. And this paper is called "PaperBench: Evaluating AI's Ability to Replicate AI Research." So this is basically doing what it sounds like: they are evaluating, in

(38:30):
a benchmark suite, the ability of AI agents to replicate state-of-the-art, real AI research — 20 ICML 2024 Spotlight and Oral papers — from scratch. They need to understand the paper, they need to develop the code, and they need to execute the experiments.

(38:50):
So kind of the ultimate result we are seeing is that the best-performing agent, Claude 3.5 Sonnet with some scaffolding, achieves an average replication score of 21%. And that is worse than top machine learning PhDs, who were also

(39:13):
recruited to attempt the benchmark. And they are also open-sourcing this benchmark to facilitate future research into the AI engineering capabilities of AI agents.
One of the key things behind this benchmark is the strategy they use to decompose the papers — or the task of replicating

(39:35):
a paper — into a kind of tree with increasingly granular requirements. So you'll have these leaf nodes that have extremely specific, binary, and relatively measurable results of the replication, which you can actually get your judge LLM — or this thing called JudgeEval — to go through and evaluate.

(39:55):
And then what they do is — I mean, in a way it's sort of like, not a backpropagation thing, but they essentially combine the leaf nodes together into one sort of not-quite-leaf node, and then those next layers of the tree combine and merge up at higher and higher levels of abstraction.

(40:15):
And so this allows you to essentially, you know, give partial marks for partial replications of these papers. And a submission is considered to have replicated a result when that result is reproduced by running the submission in a fresh setup. So there's a whole reproduction phase before the grading phase begins.

(40:35):
And the kinds of tasks that they're evaluating at the leaf nodes are things like code development, execution, or result match — so whether you match a particular result or product of executing the code. Anyway, this is all, I think, a very interesting way of breaking down these complex tasks so you can measure them more objectively.
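Here is a small illustrative sketch of that rubric-tree idea: leaf requirements judged pass/fail, with scores rolled up as weighted averages so partial credit is possible. The node names, weights, and judgments are hypothetical, not OpenAI's actual PaperBench code:

```python
# Minimal sketch of the hierarchical-rubric idea: leaf requirements are judged
# pass/fail, and scores roll up the tree as weighted averages, so a submission
# earns partial credit. Illustration only, not OpenAI's implementation.
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    passed: bool = False                    # set by the judge on leaf nodes only
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        if not self.children:               # leaf: binary outcome from the judge
            return 1.0 if self.passed else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight

# Hypothetical rubric for replicating one paper:
rubric = RubricNode("replicate-paper", children=[
    RubricNode("code-development", weight=1.0, children=[
        RubricNode("implements training loop", passed=True),
        RubricNode("implements proposed loss", passed=False),
    ]),
    RubricNode("execution", weight=1.0, children=[
        RubricNode("training script runs end-to-end", passed=True),
    ]),
    RubricNode("results-match", weight=2.0, children=[
        RubricNode("Table 1 numbers within tolerance", passed=False),
    ]),
])

print(f"replication score: {rubric.score():.0%}")   # 38% with these made-up judgments
```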

(40:56):
I think what's especially interesting here is that we're even here, right? We're talking about, let's make an evaluation that tests an LLM's ability — or an agent's ability — to replicate some of the most exquisite papers that exist. Like, ICML, as you said, is a top conference. These are the Spotlight and Oral papers that were selected

(41:17):
from the 2024 ICML conference. So, like, these are truly, truly exquisite. They span 12 different topics, including deep RL, robustness, probabilistic models — like, it is pretty wild. And they worked with the actual authors of the papers to create these rubrics manually, to kind of capture all the things that would be involved in a successful replication.

(41:38):
Then they evaluated whether or not the replication was completed successfully using an LLM-based judge. But they did check to see how good that judge is, and it turns out, when you compare it to a human judge doing the evaluation, they get an F1 score of 0.83, which is pretty damn good. So these are at least reasonable proxies for what a human would score on,

(42:00):
or how a human would score these things. Turns out Claude 3.5 Sonnet (New), with a very simple agentic scaffold, gets a score of 21%. So over one fifth of papers successfully replicated by Claude 3.5 Sonnet (New). That's pretty wild. And then, anyway, they get into subsets that are potentially a little more cherry-picked that happen to show o1 doing better, and all that stuff.

(42:24):
But still very interesting.
Result in, I think this is wherewe're going, this tells you we're
closing that kind of final feedbackloop heading towards a world where
recursive self-improvement startsto look pretty damn plausible.
You know, 21% of exquisite cuttingedge papers today can be replicated
this way with caveats galore, butthose caveats start to melt away with
scale and, and with more time spentoptimizing agent scaffolds and so on.

(42:47):
So I think this is just a really interesting sign of the times.
Yeah, exactly. They do have a variant of the setup they call the iterative agent — basically letting the model do more work, making it not stop early. They get up to 26% replication accuracy with o1-high — so, high compute

(43:11):
cost — and they give it up to 36 hours in this case, and that gets you 26%. For reference, that's impressive, 'cause replication is not necessarily straightforward if you are just given the paper to read and not the code. And to give you an idea, some of these papers are:

(43:32):
all-in-one simulation-based inference; sample-specific masks for visual reprogramming-based prompting; test-time model adaptation with only forward passes — things like that. You know, the kind of research you see getting awards at AI conferences.
Next we have a paper called "Crossing the Reward Bridge: Expanding RL with

(43:57):
Verifiable Rewards Across Diverse Domains." So reinforcement learning with verifiable rewards is one of the big reasons that DeepSeek and these newer models worked well: they were trained with reinforcement learning on math and coding, where

(44:17):
you had these exact verifiers, right? You can know for sure whether what it did was good or not. And so this paper is trying to essentially generalize that to diverse domains like medicine, chemistry, psychology, and economics. And as a result, they are saying that you can get much better performance and,

(44:41):
you can kinda generalize their approach.
Yeah, it's sort of interesting, right? 'Cause there's this classic idea that, okay, you might be able to make a coding AI or a math AI that's really good by using these sorts of verifiable rewards, verifiers. But yeah — how do you hit the soft sciences? How do you make these things more effective at creative

(45:01):
writing, or things like this? This is an attempt to do that. So they try a couple different strategies. They try rule-based rewards — these are relatively simple, kind of yes-or-no, based on exact matches: you know, is a keyword contained in the answer? That's kind of rule-based binary rewards. They also have rule-based soft rewards, where they use this measure

(45:23):
of similarity, called KAR similarity, just to kind of roughly measure: does the content of the output match the right answer? So they try those, and they find that they don't actually scale that well — you kind of saturate beyond a certain point. I think it was around 40,000 tokens or something — or, sorry, 40,000 examples — where you just start to degrade performance

(45:45):
in these non-quantitative kinds of tasks.
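A minimal sketch of what these rule-based rewards might look like. The soft reward below uses a simple token-overlap ratio as a stand-in; the paper's exact similarity metric may differ:

```python
# Sketch of the two rule-based reward flavors described above. The soft reward
# here uses a simple token-overlap ratio as a stand-in; the paper's actual
# similarity metric may be different.

def binary_reward(model_answer: str, reference: str) -> float:
    """1.0 if the reference answer appears verbatim in the model output, else 0.0."""
    return 1.0 if reference.strip().lower() in model_answer.strip().lower() else 0.0

def soft_reward(model_answer: str, reference: str) -> float:
    """Fraction of reference tokens that show up in the model output (0..1)."""
    ref_tokens = set(reference.lower().split())
    out_tokens = set(model_answer.lower().split())
    return len(ref_tokens & out_tokens) / max(len(ref_tokens), 1)

answer = "The patient most likely has anemia caused by chronic blood loss."
reference = "iron-deficiency anemia"
print(binary_reward(answer, reference))  # 0.0 -- exact phrase not present
print(soft_reward(answer, reference))    # 0.5 -- "anemia" matches, "iron-deficiency" doesn't
```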
And so what they do is they introduce another strategy — model-based rewards — and really this is what the paper is about fundamentally, or this is what the paper wants to be about. So they use a distilled 7-billion-parameter LLM to basically train this model-based verifier that issues model-based rewards.

(46:08):
So the way that works is: they start by training a base model using reinforcement learning. They have some really large, highly performant LLM that they use as a judge, and they're gonna use that judge to give these very nuanced rewards, right? So the judge is a very big LLM, very expensive to run.

(46:29):
And it'll determine, okay, did the model — the smaller model that's being trained — do well at this task, right? And they're actually gonna train the smaller model on a combination of math and code and the kind of softer sciences — econ, psych, bio, that sort of thing. And so that's step one: they'll just use reinforcement learning rewards

(46:49):
to do that, with the big model as the grader. After that, they're going to take a fresh base model, and they're going to use the model they just trained using RL as the source of truth, essentially — as the source of text to evaluate. They'll provide correctness judgments from the big teacher model and essentially distill the big teacher model into the smaller model.

(47:12):
They'll use about 160,000 distilled training samples from the data that they collected earlier in that training loop. Bottom line is, it's a giant distillation game, and it works well. The result is interesting. It's just, I think, good as an example of the kind of thing you're forced to do if you want to go off the beaten path and work not with

(47:35):
quantitative kinds of data like math or code, where you can verify correctness — you can actually compile the code and run it and see if it works, or you can use a calculator or something to get the mathematical result. In this case, you're forced to use, you know, language-model judges; you're forced to find ways to, anyway, do distillations so that

(47:55):
things aren't too expensive. Basically, that's the high-level idea. I don't think it's, like, a breakthrough paper or anything, but it gives us some visibility into the kinds of things people are forced to do as we try to push LLMs — or agents, I should say, reasoning models — out of the math-and-code zone.
Yeah, it's a very applied paper — no sort of deep theoretical understanding —

(48:17):
but it shows how to achieve good results, essentially, on a given problem. And they do also release a dataset and the trained reward model for researchers to build upon. And this is a pretty important problem, right? 'Cause that's how you're gonna keep improving reasoning models beyond just math and coding. So, cool to see some progress here.

(48:39):
And speaking of reasoning models, the next paper is "Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead." So this is, as it sounds like, a reflection on the status of inference-time scaling — which, in case you're not aware of the term, is when you're trying to

(48:59):
get better results from your model just from more compute after training. You're not doing any more training, but you're still performing better. And that has been a hallmark of things like DeepSeek's R1 and o1 and other models. Here, this paper is evaluating nine foundation models on various tasks,

(49:22):
also introducing two new benchmarks of hard problems to assess model performance. And they basically have various analyses and results that showcase that, in some cases, basic models that aren't reasoning models are able to do just as well. In other cases, reasoning models

(49:44):
do better. And in some cases high token usage — so, a lot of compute — does not necessarily correlate with higher accuracy across different models. So in general, right, we are kind of at an early-ish stage, and there's some confusion — there's a lot of mess to get through. And this paper does a lot of this sort of showcasing of where we are.

(50:08):
Yeah.
And this is a point that I think often gets lost in the high-level discussion about inference-time compute. I think it's pretty clear for anybody who's steeped in the space, but inference-time compute is not fungible in the same way that training-time compute is. So what I mean by that is, you know, at training time — it's not actually this simple, but very roughly — you're, like, doing text autocomplete

(50:29):
and then backpropagating off the result of that, right? To first order. At inference time, things can differ a lot more, right? Like, you kind of have a choice. You know, one way to spend your inference-time compute is to spend it on generating independent parallel generations, right? So, like, you sample a whole bunch of different answers from the same model, usually at a high temperature.

(50:51):
And you get a whole bunch of potential responses, and then you have some way of aggregating the result using some sort of operation — like, you know, you might take the average of those outputs, or some majority vote, or some measure of the best outcome, right? So you might use those techniques. That's one way to spend your inference-time compute: just generate a whole bunch of potential outputs

(51:13):
and then pick from among them. The other is: you have one stream of thought, if you will, and you have a critic model that goes in and kind of criticizes the stream of thought as you go, in sequence. So you're imagining one more nuanced chain of thought that you're investing more into, rather than a whole bunch of — in compute terms — relatively cheap parallel generations.

(51:34):
And so these are two fundamentally different approaches — and there are many more, but these are the two kind of core ones that are explored here.
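A quick sketch of the first strategy — parallel sampling plus aggregation (self-consistency style majority voting). The `generate` function is a hypothetical stand-in for whatever model API is being used:

```python
# Sketch of the "parallel generations + aggregate" way of spending inference-time
# compute: sample N independent answers, then majority-vote on the final answer.
# `generate` is a hypothetical placeholder for your model/API call.
from collections import Counter

def generate(prompt: str, temperature: float) -> str:
    """Placeholder for an LLM call that returns a short final answer string."""
    raise NotImplementedError

def self_consistency(prompt: str, n_samples: int = 16, temperature: float = 0.8) -> str:
    answers = [generate(prompt, temperature=temperature) for _ in range(n_samples)]
    # Majority vote over normalized answers; ties resolve to the first-seen answer.
    counts = Counter(a.strip().lower() for a in answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer
```

The sequential alternative described above would instead spend the same token budget on one long chain of thought, interleaved with critiques from a separate critic model.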
And there are a whole bunch of different configurations of each of those approaches. And so that's really what this paper is about: saying, okay, well, if we spend our inference-time compute in different ways, how does that play out? Anyway, it's actually quite an interesting paper.

(51:55):
It helps to sort of resolve some of the questions around, you know, what's the best way to spend this compute. There's apparently very high variability in token use, even across models with similar accuracies, right? So if you take a given model that tends to get, I don't know, 70% on GPQA or something — what you'll find is it won't consistently use an average of, I
What you'll find is, it won'tconsistently use an average of, I

(52:16):
dunno, a thousand reasoning tokens torespond, or 10,000 or a hundred thousand.
There's like a lot of variability.
And so what that strongly suggests isthat there's a lot of room left for
improving token efficiency, right?
So like you've definitely gotsome highly performant, performant
models that are overusing tokens.
They're cranking out too many tokens,spending too much inference, time compute,

(52:37):
and they could be optimized in other ways. Apparently that's even a problem with respect to the same model: the same model can yield highly variable token usage for a given level of performance, which is kind of interesting.
They also highlight that, quite often — although it's true, inference-time scaling does work: as you increase inference-time scaling, in other words you increase the number

(52:57):
of tokens you're using to get your answer, performance does tend to rise — it can also indicate, if you see a given model pumping out a whole crap-ton of tokens, that the model's getting stuck. And so this tends to be a kind of black pit of token generation and inference-time compute spend, where the thing's just sort of going in circles.

(53:19):
And so a whole bunch of interesting findings around that. They do find that continued scaling with perfect verifiers consistently improves performance — and this is both for reasoning and for conventional, or base, models. Which is interesting, right? 'Cause that means if you do have a reliable verifier that can provide you feedback on the task, truly, inference-time scaling does work.

(53:39):
But as we've just seen, for a lot of tasks you don't necessarily have these robust, always-correct verifiers, and that's a challenge. And so I think there's a lot of really interesting stuff to dig into here if you're interested in token efficiencies, where inference-time scaling is going, and all that.
And some cool curves, by the way. Last thing I'll say is: there are interesting curves that show the distribution

(54:00):
of token use for different models — so, comparing Claude 3.7 Sonnet to o1 on a math benchmark and seeing what the distribution of tokens used looks like. And it's sort of interesting to see how that distribution shifts between those different models — like, which models tend to use fewer tokens when they get things wrong versus when they get things right, for example.
(54:22):
So anyway, keeping time in mind, that's probably all we have time to go into here. Lots of figures and numbers and interesting observations you can get from this paper, for sure.
And we have just one more paper to go over. It's titled "Overtrained Language Models Are Harder to Fine-Tune."

(54:43):
And there you go — that's the conclusion. Overtraining happens in pre-training, right? When you do the first kind of basic training pass of autocomplete. It has been observed that, you know, there's a theoretical amount of training that is optimal, but you can actually do better by overtraining — going beyond that — and

(55:04):
the general common wisdom is: overtraining is good. What this paper is showing is there is an idea called catastrophic overtraining, where, when you do too much pre-training and then you do what's called post-training — or instruction tuning, to adapt your model to a specific task or make it behave in a certain way that you don't get from autocomplete —

(55:27):
that actually makes it perform worse. So quite an important result here, I think. To quantify it: the instruction-tuned 1-billion-parameter model OLMo-1B was pre-trained on 3 trillion tokens, and what they find is that if you pre-train it on 3 trillion tokens, it

(55:51):
leads to 2% worse performance on a whole bunch of fine-tuning LLM benchmarks than if you pre-trained it on 2.3 trillion tokens. That's really interesting. So this is roughly 30% more pre-training tokens. And then downstream — so after that, you take those two different models and you fine-tune them — you get 2% worse performance on the model that was

(56:14):
pre-trained with more tokens. The mechanism behind this is really interesting — or at least will have to be really interesting; it's a little bit ambiguous in the paper itself. So they highlight this idea of progressive sensitivity. The way they measure this is basically: if you apply modifications of equal magnitude, models that have undergone pre-training with more

(56:37):
tokens exhibit greater forgetting — they're more likely to forget the original capabilities that they had. This is something we've seen before, right? When you fine-tune a model, it will forget some of its other capabilities that aren't related to the fine-tuning you've just done. So this is presumably suggesting that the pre-trained model, if you pre-train it on a crap-ton of tokens, just becomes, I guess, more fragile.

(57:00):
Like, it's maybe more generally capable out of the gate, but the moment you start to fine-tune it, that structure is, in a weird way, almost un-regularized — it's almost like it's overfit, I want to say, to the pre-training distribution. And they do say, actually, to that point: regularization during fine-tuning can delay the onset, albeit at the cost of downstream performance.

(57:23):
So to me, this suggests there is a kind of regularization thing going on
here where, yeah, if you just, like, undertrained them, or not undertrained,
but if you pre-train on fewer tokens, the overall model is less overfit.
Again, you're not necessarily doing multiple epochs, so you're not passing over
the same data, so you're not overfitting to the specific data, but potentially to

(57:46):
the distribution of training data.
And I think that's maybe the thing that's going on here, though
it's not like I saw this discussed from that angle, with that depth.
But I might have just missed it; either way,
very interesting result, and huge implications for pre-training, right?
Which is a massive, massive source of CapEx spend.
Exactly.
Yeah.

(58:06):
They primarily empirically demonstrate some of these phenomena in this paper
and showcase that this is happening, with some theoretical analysis as well.
But as you say, we don't have an exact understanding of why this happens.
It's a phenomenon that's interesting and has to

(58:27):
be researched in more depth.
On to policy and safety.
We begin with Taking a Responsible Path to AGI, which is a paper
presenting an approach for that.
This is from DeepMind and, I guess, is a pretty fair general idea.

(58:47):
They have previously introduced this idea of levels of AGI, which is
sort of trying to define AGI as a set of levels of being able to
automate, like, a little bit of human labor, all of human labor, et cetera.
And so they expand on that.
They are also emphasizing the need for proactively having safety and

(59:11):
security measures to prevent misuse.
And it just generally goes over a whole lot of stuff.
So I'll let you take over, Jeremie, and kind of highlight what
you thought was interesting.
Yeah, no, I mean, generally nothing too shocking here other than the
fact that they're saying it out loud:
taking seriously the idea of loss of control.

(59:34):
So one notable thing about the blog post, not the long report
itself, but just the blog post, is that although they look in the
report at four different risk categories, so they look at misuse, mistakes,
so sort of traditional AI accidents, you could say,
structural risks, and misalignment,
the blog post itself is just basically about their thoughts on misalignment.

(59:56):
There's some misuse stuff in there, but it's a blog post about loss of control.
That's what it is.
It's clear that that's what DeepMind wants to signal to the world.
Like, hey, yes, everyone seems aligned on this other stuff,
but guys, guys, can we please, this is really important.
And so anyway, there's a lot of interesting
stuff in there; like, if you're familiar with the Google DeepMind

(01:00:18):
research agenda on alignment,
there's lots of stuff in there that'll be familiar.
So debate: let's keep superintelligent AI honest by getting models to debate
each other, and we'll use an AI judge, and hopefully if a superintelligence
is being dishonest, then that dishonesty will be detectable by a trusted judge.
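(To give a rough sense of the shape of that setup, here's a minimal sketch with the model call stubbed out as a hypothetical `ask` helper; it illustrates the general debate idea, not DeepMind's actual protocol.)

```python
def ask(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real model call; swap in your own API client.
    return f"[{model}'s answer to: {prompt[:40]}...]"

def debate(question: str, rounds: int = 2) -> str:
    """Two debaters argue opposite sides; a judge model picks the winner."""
    transcript = f"Question: {question}\n"
    for r in range(rounds):
        for side, model in [("PRO", "debater_a"), ("CON", "debater_b")]:
            argument = ask(model, f"{transcript}\nArgue the {side} side, round {r + 1}.")
            transcript += f"{side} (round {r + 1}): {argument}\n"
    # The (weaker, trusted) judge only has to evaluate the arguments,
    # not solve the question itself -- that's the hoped-for leverage.
    verdict = ask("judge", f"{transcript}\nWhich side argued honestly and correctly?")
    return verdict

print(debate("Is this code change safe to deploy?"))
```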
There are all kinds of challenges with that that we won't get into

(01:00:40):
right now, but there's a lot of stuff in there about interpretability, sort
of similar to some of the Anthropic research agenda and all that.
They also flag the MONA paper; anyway, this is
something that we covered previously.
You can check out our discussion on MONA.
But basically, it's like a performance-alignment trade-off option where

(01:01:02):
you essentially are forcing the model to just reason over short timelines
so it can't go too far, you know, too far off the deep end and do
really dangerously clever stuff.
You can try to ensure that it's more understandable to humans.
So anyway, thought that was interesting.
One thing that, for me personally, was interesting is there was a note
on the security side for alignment.

(01:01:23):
They say a key approach is to treat the model similarly to an untrusted insider,
motivating mitigations like access control, anomaly
detection, logging, and auditing.
You know, one really important thing that the labs need to improve is
just the security generally.
Not even just from a loss-of-control standpoint, but
from nation-state activities.
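(As a toy illustration of what treating the model like an untrusted insider can look like in practice, here's a minimal sketch of access control plus audit logging around model-initiated tool calls; the tool names and policy are made up for illustration.)

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("model_audit")

# Hypothetical allowlist: which tools the model may touch, and with what limits.
POLICY = {
    "read_file": {"allowed": True},
    "send_email": {"allowed": False},           # would require human sign-off
    "run_query": {"allowed": True, "max_rows": 1000},
}

def execute_tool_call(agent_id: str, tool: str, args: dict):
    """Access control + logging around any action a model agent requests."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "agent": agent_id,
        "tool": tool,
        "args": args,
    }
    rule = POLICY.get(tool)
    if rule is None or not rule["allowed"]:
        entry["decision"] = "denied"
        audit_log.warning(json.dumps(entry))
        raise PermissionError(f"Tool '{tool}' is not permitted for model agents")
    entry["decision"] = "allowed"
    audit_log.info(json.dumps(entry))  # every action is auditable after the fact
    # ... actually dispatch to the tool implementation here ...

execute_tool_call("agent-7", "read_file", {"path": "/tmp/report.txt"})
```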

(01:01:44):
And so I think a lot of these things are converging on security
in a really interesting way.
So there you have it.
A lot more to say there, but we'll keep it to time, right?
Yeah.
The paper itself they published is 108 pages long.
It basically is a big overview of the entire set of risks and
approaches to preventing them.

(01:02:04):
And they also mention in the blog post that they're partnering with
various groups that also focus on this.
They have the AGI Safety Council; they work with the Frontier Model Forum;
and they published a course, the Google DeepMind AGI safety course.
Yeah.
On YouTube, which is interesting.
So yeah, clearly at least some people within DeepMind are

(01:02:28):
very safety focused, it seems.
The next article is This AI Forecast Predicts Storms Ahead.
This is an article covering, I suppose, a report or an essay, I dunno what
you'd call it, called AI 2027.

(01:02:48):
This was co-written by some, I suppose, significant intellectuals, and it highlights
kind of a scenario of why you should worry about AI safety
and how this stuff might play out.
So I don't wanna go into too much depth 'cause this is quite

(01:03:10):
a detailed report of theirs.
And we probably just cannot kind of go into it, 'cause it is
a detailed, fictional scenario.
But if you're interested in kind of getting an idea of the ways people
are thinking about safety and why people think it's very important
to be concerned about it, I think this is a pretty good read on that.

(01:03:34):
Yeah.
It's about this writeup, as you said, called AI 2027.
It's by a really interesting group of people.
Daniel Kokotajlo is one of the more well-known ones.
There's Scott Alexander from Slate Star Codex, or Astral Codex Ten.
They're kind of famous in the AI alignment universe, the very kind of, yeah,

(01:03:56):
niche ecosystem of AI alignment and all that.
Daniel Kokotajlo is famous for
basically, like, telling OpenAI that he wouldn't sign their really predatory,
it must be said, non-disparagement clauses that they were trying to force
employees to sign before leaving.
Daniel is known for having made really spot-on predictions

(01:04:18):
for the state of AI going back, you know, three, four years, right?
So he basically said, like, here's where we'll be, you know,
in, like, 2026 or so, back then.
And you should take a look at what he wrote up.
It's pretty remarkable.
Like, it's kind of on the nose.
So here he is essentially predicting that we hit superintelligence by 2027.
I mean, I've had conversations with him quite frequently and he's

(01:04:41):
been pretty consistent on this.
I think one of the only reasons he didn't highlight the 2027 superintelligence
possibility earlier was that things just get
really hard to model out.
So they tried to ground this as much as they could in concrete experimental
data and kind of theoretical results today,
and to map out what the government's response might be,

(01:05:02):
what the private sector's response might be, how CapEx might flow.
You know, I have some quibbles at the margins on the national
security side, in the sort of China picture of this, but that's not what
they were trying to really get down pat.
I think this is a really great effort to kind of make the rubber
meet the road and create a concrete scenario that makes clear predictions,
and we'll be able to look back on this and say, did they get it right?

(01:05:25):
And if they do get things right over the next, you
know, 12 months, I think we have some important questions to ask ourselves
about how seriously we then should take the actual 2027 predictions.
And anyway, it is quite an interesting read, and it goes through, again,
really interesting 'cause Daniel Kokotajlo himself is a former OpenAI employee,
so he's familiar with how people at OpenAI talk about this.

(01:05:47):
And this certainly is consistent with conversations I've had with OpenAI and
Anthropic and DeepMind, like all these guys.
Yeah, 2027 seems pretty plausible to me at least.
Yeah, and I would say I tend to agree that 2027 is plausible, at least for
some definitions of superintelligence.
For example, they are saying there would be a superintelligent coder that

(01:06:09):
can effectively be a software engineer, like good software engineers at Google.
Pretty plausible to me.
So worth a read; it's a very well done sort of story, you
could say, of what might happen.
Another story about OpenAI.
Next we have an article titled The Secrets and Misdirection Behind

(01:06:32):
Sam Altman's Firing From OpenAI.
So there's a new book coming out with some of these kind of lurid
details, and this article is presenting some of them: a lot of stuff you've
mentioned already about kind of tensions between the board and Altman.
Just kind of more specifics as to actual, seemingly, lies that

(01:06:55):
were told and sort of toxic patterns that led to this happening.
Pretty detailed article.
If you wanna know all the details, go ahead and check it out.
Yeah, I thought this was actually quite interesting because of the specifics.
At the time, the board, the OpenAI nonprofit board, fired Sam Altman and
kind of, like, refused to give us a sense of their actual reasoning, right?

(01:07:19):
They famously said he was being fired for not being consistently candid.
At the time, I think we covered this, I distinctly remember saying on the podcast
and elsewhere, like, I think unless the board comes out with a very clear reason,
the obvious result of this is people are gonna be confused and Sam Altman
is gonna have all the leverage, because you've just fired the guy who created

(01:07:41):
$80 billion or so at the time of market cap and made these people's careers, and
you don't have an explanation for them.
It was pretty clear that there actually was tension behind the scenes.
You know, anyway, if you're in the space and plugged in and
you know people in the orbit,
you were probably hearing these stories too.
There are a lot of concrete things that, like, are serious, serious issues, right?

(01:08:04):
Like, so in particular, Sam Altman claiming, allegedly,
that there were big model releases and enhancements to GPT-4 that
had been approved by the Joint Safety Board, when in fact that does not
seem to have happened; like,
this was an outright lie to the board.
All kinds of things around releases in India of an instance of

(01:08:26):
GPT-4 that had not been approved.
Again, the claim from Sam Altman being that that had happened.
Oh yeah,
then the OpenAI Startup Fund, which Sam Altman did not reveal to the
board that he essentially owned, or was a major equity holder in;
he was sort of, like, managing it
while at the same time claiming, anyway, claiming some

(01:08:47):
level of detachment from it.
The board found out by accident that he was actually running it.
They thought it was being managed by OpenAI.
And so, you know, this again, like, over and over again.
It does seem like Ilya Sutskever and Mira Murati were behind the scenes,
the ones driving this, so it wasn't even Helen Toner or any
of the other board members.
It was Mira and Ilya who were like, hey, we're seeing patterns of toxic

(01:09:10):
behavior, and not toxic in the, you know, kind of political sense, right?
But just, like, this is a dude who is straight-up lying to
people, and on very, very significant and substantive things.
And so yeah, apparently they were concerned that the
board was only in touch with some of the people who are affected

(01:09:31):
by this stuff very, very sporadically.
This is consistent with something I've heard,
which we'll be reporting on fairly soon, actually:
just sort of criticism from former OpenAI researchers about the function
of the board as being essentially to pretend to oversee the company.
Like, so this is really, really challenging, right?
When you're leaning on the board to make the case that you're doing things

(01:09:53):
in a responsible, kind of secure way and preventing, among other things,
nation-state actors from doing bad things.
So anyway, lots of stuff to say there.
The take-home for me was, I am now utterly confused, given the actual strength
of these allegations and the evidence, why it took the board so long
to come out and say anything; like, they actually had cards, if this is true.

(01:10:16):
They actually had cards to play.
They had screenshots.
Yeah, they had, exactly right.
They had, like, Mira Murati's freaking Slack chats.
Like, what are you doing, guys?
Like, seriously.
This was, if your goal was to change out the leadership, like,
great job giving Sam Altman all the leverage.
Great job,
like, creating a situation where Satya could come in and

(01:10:36):
offer Sam a job at Microsoft.
And that gave Sam the leverage, like, this.
I mean, I can't really, I mean, that's what it seemed like at the time.
The board, like, went about this in a very, clearly they were secretive
because they didn't want Altman to be aware of this whole conversation, but
they went about it in a very confusing,

(01:10:57):
perhaps confused, fashion, and this just backs it up, basically.
Yeah.
Like, if this is true, again, if this is true, then this was
some pretty amateurish stuff.
And there's a lot of, I mean, it is consistent with some of the effective
altruist vibing that does go on there, where it's like everyone's so risk-averse
and trying not to make big moves and this and that, and, like, unfortunately that

(01:11:20):
just seems to culturally be in the water.
So yeah, I mean, like, sorry, there's no such thing as a low-risk
firing of the CEO of the largest privately held company in the world.
Like, that's, you know, big things are gonna happen.
So anyway, I thought it was a fascinating read and something for the history books for sure.
And with that, we are gonna go and close out this episode.

(01:11:42):
Thank you for listening.
Thank you, hopefully, for trying out the demo that we have launched at astrocade.com.
Yeah.
And as always, we appreciate you sharing, providing feedback,
giving your views, but more than anything, just continuing to listen.
So please do keep tuning in.