Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:10):
Hello and welcome to the Last Week in AI podcast, where you can hear us chat about what's going on with AI. As usual, in this episode we will summarize and discuss some of last week's most interesting AI news, and as sometimes happens, we will also be discussing the news from the last, last week. Unfortunately, we did miss last week. Again, we're sorry.
(00:31):
We're gonna try not to do that. But we will be going back and covering the couple of things that we missed. And as always, you can go to the episode description to get the timestamps and links to all the things we discussed. I am one of your regular hosts, Andrey Kurenkov. I studied AI in grad school and now work at a Silicon Valley Gen AI startup.
(00:52):
And I'm your other host, Jeremie Harris. I'm with Gladstone AI, an AI national security company. And we're talking about like the last couple weeks, rare that we have two weeks to catch up on, obviously, but when we do, usually what happens is God just gives us a big smack in the face and he is like, you know what? I'm gonna, we're gonna drop like GPT-7 and GPT-8 at the same
(01:13):
time, and now Google DeepMind's gonna have their own thing. Sam Altman is gonna get assassinated. Then he is gonna get resurrected. And then you're just gonna have to cover all this this week, these two weeks. Very different this time, it kind of seems weirdly quiet. A bit of a reprieve. So thank you, universe.
Yeah.
I remember there was, a couple months ago, a thing where there was like Grok 3 and Claude 3.7 and GPT something, something.
(01:37):
It was like everything all at once.
This one.
Yeah.
Nothing too huge in the last couple weeks. So a preview of the news we'll be covering, we're actually gonna start with business this time, 'cause I think the big story of the last two weeks is OpenAI deciding it will not go for-profit, or the controlling entity of OpenAI is not gonna
(02:00):
go for-profit, which is interesting.
Gonna have a few stories on tools and apps, but nothing huge there. Some new cool models to talk about in open source. Some new exciting research from DeepMind dealing with algorithms in research, and then policy and safety, focusing quite a bit on the policy side of
(02:22):
things with the Trump administration and chips. And just before we dive in, I do wanna shout out some Apple reviews. In fact, I saw just recently there was a review where the headline is, if a podcast is good, be consistent. Please, please post it consistently.
As the title says, one podcast per week.
(02:45):
Haven't seen one in the last few weeks now. And yes, we're sorry, we tried to be consistent, and I think it's been a bit of a hectic year, but in the next couple months it should be more doable for us to be weekly on this stuff.
Well, let's get into it.
Applications and business.
And the first story is OpenAI saying that it is not gonna go through with trying
(03:11):
to basically get rid of the nonprofit that controls the for-profit entity. So as we've been covering now for probably like a year or something, OpenAI has been meaning to transition away from the structure it has had since, I guess, its founding, certainly since 2019, where there is a nonprofit
(03:35):
with a guiding mission that has ultimate control of a for-profit that is able to receive money from investors and is responsible to its investors. The nonprofit basically, ultimately, is responsible to the mission and not to the investors, which is a big problem for OpenAI since of course they had this whole
(04:01):
crazy drama in late 2023 where the board fired Sam Altman briefly, which I think spooked investors, et cetera, et cetera.
So now we get here after several months. I think this started in late 2024-ish. There was a lot of litigation, initially prompted, I think, by Elon Musk, basically
(04:25):
lawsuits saying that this is not okay, that you can't just change from non-profit to for-profit when you got some money while you were a non-profit. And yeah, it looks like OpenAI backed down, basically, after apparently dialogue with the Attorney General of Delaware and the Attorney General of California.
(04:49):
And what they say is, after discussions with civic leaders and attorneys general, they are keeping the nonprofit; they are still changing some things. So the subsidiary, you could say, will transition to being a public benefit corporation. The same thing that Anthropic and xAI are, basically a for-profit with a little
(05:12):
asterisk that you want to be doing your for-profit stuff for the public good. That does mean they'll be able to do some sort of share thing. I think that does imply that they're able to give out shares. The nonprofit will receive some sort of stake in this new
(05:34):
public benefit corporation.
So yeah, to me, I was pretty surprised when I saw this. I thought OpenAI was gonna keep fighting it, that they had some chance of being able to beat it, given their position. But yeah, seems like they were just kind of defeated in court, so there's
(05:54):
a couple of asterisks to this whole thing. Yeah, you're absolutely right. So the significance of that attorneys general piece is actually quite, quite significant. Sorry, reusing the word.
So the backstory here, right? The Elon Musk lawsuit, I think, is a really good lens through which to understand this. So Elon, you know, famously sued OpenAI for exactly this, right?
(06:16):
That was a big thing. He was one of the early investors, donors, and kind of a co-founder initially.
Yeah.
Right?
Yeah.
And it's like, is he a donor or is he an investor?
Right?
That question is pretty central to this.
So he brought forth this case. The judge on the case in California said, hey, well, you know what? This actually looks like a pretty legit case, as you might imagine.
(06:39):
It's sort of sketchy to take a nonprofit, raise a crap ton of money, have convinced researchers to work for you, who otherwise would work in other places, because you are a nonprofit with this noble cause. And then, having benefited from their research, from all that R&D, from all that IP, now turning yourself around and becoming a for-profit. No, you probably can't do that.
(07:00):
Or at least there's probably a good argument here. But what the judge said was, it's not clear that Elon Musk is the right person to represent this case in court. It's not clear that he has standing. The reason that's the case is that under California law, like, the only people who have standing to bring a case like this forward are people who are current members of the board. Well, guess what?
(07:21):
Elon is no longer a current member of the board. He used to be. So did Shivon Zilis, who no longer is a member of the board either, and probably would've been really helpful in this case if she had been.
Or it can be somebody with a contractual relationship with OpenAI. That's what Elon is arguing. He's gonna argue that, hey, there was a written contract, an implied contract, in these emails between him and Sam and the board where they're
(07:44):
talking about, yeah, it's gonna be a nonprofit, blah, blah, blah. Elon's gonna try to argue that, yeah, there was kind of a contract there that they wouldn't turn around and go for-profit. This is hugely complicated by the fact that Elon then turned around and wrote emails himself saying, well, I think you're gonna have to go for-profit at some point. And so that's a bit of a mess.
The remaining category of person who can have standing in raising a case
(08:05):
like this is the attorney general. And so the speculation was that when the judge on the case first said, well, you know what, I actually think there's a pretty good case here, but Elon may not be the one to bring it. It's a pretty unusual thing for a judge to say, kind of flagging that, not passing a judgment or ruling on the case, but just saying, hey, I think it's promising. That may have been the judge trying to get the attention of the attorneys
(08:27):
general, knowing that they could have standing themselves if they wanted to bring this case forward.
Then now what do you see? Right? You see OpenAI going, well, you know, we had a conversation with the attorneys general, and, you know, following that we're mysteriously deciding this. This reads a lot like the attorneys general spoke to OpenAI, said, hey, we agree with the judge.
(08:47):
There is a case here. You can't do the thing. And we actually have standing if we want to bring this case forward. It seems likely that that's at least an ingredient here.
Another thing to flag, right? This is being touted as a sort of a win for, let's say, basic principle; that seems like a common interpretation here:
(09:08):
you shouldn't be able to turn a nonprofit into a for-profit. There are asterisks here. So in particular, OpenAI has done this very interesting thing where they're turning themselves into a public benefit corporation, but they're turning themselves specifically into a Delaware public benefit corporation. This is different from a California public benefit corporation. In a
(09:28):
Delaware public benefit corporation, what you can do is, essentially all it does is give you more freedom. So a public benefit corporation is allowed, is permitted, to care about things other than the interests of the shareholders. They can also care about the interests of the shareholders; in general they will, but they also are allowed to consider other things. Strictly speaking,
(09:51):
all that does is it gives you more latitude, not less. So.
It sounds like a very generous thing. It sounds like OpenAI is saying, oh, we're gonna make this into a public benefit corporation. How could this be a bad thing? It literally has the words public benefit in the title. Well, in reality, what's going on here is they're basically saying, hey, we're gonna give ourselves more latitude to make whatever calls we want.
(10:11):
They may be things that are aligned with the interest of the shareholders and corporate profits, or they may not. Basically, roughly, in practice, it's up to us.
So this is not necessarily the big win that it's being framed up as. There's a slippery slope here, where over time, even though it's nominally under the supervision of the nonprofit board, you know, the other question is, can the
(10:34):
nonprofit board meaningfully oversee Sam? We saw a catastrophic failure of that right in the whole board debacle. I mean, Sam was fired and then he just had the leverage to force the board to come back, and now he swapped them out for friendlies. So, very, very unclear whether the board meaningfully can exert control, whether, you know, Sam has undue influence over them, or whether they're getting access to the information they need to make a lot of these calls.
(10:56):
We saw that with the Murati stuff, where there clearly is some reticence to share information from the company, kind of the working level, up to the board when necessary.
So this is a really interesting situation and there's gonna be a lot more to unpack in the next few weeks. But the high-level take is: better than the other outcome, certainly from the standpoint of the
(11:16):
people who've donated money to this and put in their hard-earned time. But big, big open question about where this actually ends up going and, you know, what it means for the for-profit to be a PBC and for the nonprofit nominally to have control. We'll find out a lot more, I think, in the coming weeks and months.
Right.
So to be clear, OpenAI had this weird structure where there was a
(11:38):
nonprofit, the nonprofit was in charge of, I guess, what they called a capped for-profit, where you could invest but get a limited amount of return, up to I think a hundred x, something like that. And now there is still gonna be a nonprofit. There's still gonna be a for-profit that is controlled, as you said, nominally at
(12:02):
least, by the nonprofit. That for-profit is just changing from its previous structure to this public benefit corporation. And as you said, there's details there in terms of, I suppose, shares, in terms of the laws you don't have to follow, et cetera, et cetera.
And as you might expect, there's been some follow-up stories to this,
(12:23):
in particular with Microsoft, where I'm sure there's some stuff going on behind the scenes. I think details of the relationship between Microsoft and OpenAI have been murky and sort of shifting over time. And there's a real question on how much ownership will Microsoft get.
Right.
Because they were one of the early investors, going back to
(12:45):
2019, putting in early billions.
Yeah.
The first billions into OpenAI when it was still a nonprofit, when they switched to the for-profit. So there's, I think, yeah, a real kind of unresolved question of how much ownership should they have in the first place.
Yeah.
A lot of this feels like re-litigation of things that ought
(13:07):
to have been agreed on beforehand. Right? Like, you invest with a cap, you know, Microsoft did this, they gave like $14 billion or something. And now OpenAI is being like, yeah, JK, like no cap now. And it's like, how do you price that in? And yeah, there's a lot of sand in the gears right now for OpenAI. And actually the next story that we have here is covering that detail,
(13:29):
titled Microsoft Moves to Protect Its Turf as OpenAI Turns Into Rival.
So it gets into a little bit of the details of the negotiations. Seems that Microsoft is saying it is willing to give up some equity to be able to have long-term access to OpenAI's technologies beyond 2030.
(13:51):
Also to allow OpenAI to potentially do an IPO so that Microsoft can reap benefits. Again, Microsoft put in $13 billion early, starting in 2019. So in the last couple years we've seen what, hundreds of billions of dollars get invested into OpenAI, something like that.
(14:15):
Lots of investors, but Microsoft certainly is still a big one.
Yeah, definitely, definitely tens. And what's been happening is, so you have Microsoft that's coming in. By the way, Microsoft for a long time was basically OpenAI's huge, you know, overwhelming champion investor; that's changed with SoftBank, right? So recently we've talked about the, you know, the 30, the $40 billion that Open
(14:36):
AI has been raising, the lion's share of which has been coming from SoftBank. And that's not a small deal. It means that SoftBank is now actually, more than Microsoft, OpenAI's number one investor by dollar amount, not necessarily by equity, 'cause Microsoft got in a lot earlier at lower valuations. But yeah, so OpenAI now is in this weird position where
(14:57):
their latest fundraise, which was 30, $40 billion, right, a lot of it from SoftBank, had some stipulations to it. SoftBank said, look, we're gonna give you the money, but you have to commit to restructuring your company before the end of the year. I mean, the timeline shifted. Initially it was two years out and now it's just like one year, or before the end of this year.
So everybody interpreted that as meaning, number one, the nonprofit's control over
(15:19):
the for-profit entity has to be out. And that's not seeming like it's gonna be the case. And now SoftBank, it sounds like they're actually okay with that. Microsoft, it's not clear whether they're okay with it though. And so that's one of the big questions, like, okay, all eyes are now on Microsoft. SoftBank has signed off, all the big investors signed off; Microsoft, are you okay with this deal in the context where there is now competition
(15:44):
between Microsoft and OpenAI? Right. Really, really intense competition on consumer, on B2B, like along every dimension that these companies are active. And so you know, there's this very tense frenemy relationship here, where OpenAI is committed to spending, I think, something like a billion dollars a year on Microsoft Azure's cloud infrastructure.
There's IP sharing where Microsoft gets to use all OpenAI models up to AGI, if that
(16:09):
clause is still active, which is unclear. There's all kinds of stuff, like these agreements are just disgusting Frankenstein monsters. But one thing is clear: if Microsoft does hold the line and prevent this restructure from going forward, SoftBank may actually be able to take their money back from OpenAI, and that would be catastrophic when you think
(16:31):
about the spends involved in Stargate.
So yeah, I mean, I don't know. I mean, it may be a lot smoother looking on the inside, but it tends not to be. My guess is that there's gonna be a lot of 11th hour negotiating, and nobody wants to have this really fall apart. Right? Microsoft has too much of a stake in OpenAI. But there is also speculation. Apparently there's a leaked deck that OpenAI had that showed, right now,
(16:56):
they have to give Microsoft something like 20% of their corporate profits. In principle, that's the agreement going for, I think it was like 10 years or whatever, from their first investment. I may be getting the details wrong at the margins, but the leaked deck showed OpenAI projecting that they would only be giving Microsoft 10% by 2030.
And that's kind of interesting. There's no agreement between OpenAI and Microsoft that says
(17:17):
that that goes down to 10%. So is OpenAI literally, like, planning on a contingency that has yet to be negotiated with Microsoft, where they're assuming Microsoft will let them cut how much they're giving them by half? That, I mean, that's pretty wild. So, I dunno. Nobody I know is in those particular rooms. And those are gonna be some really interesting corporate development, corporate restructuring arguments and discussions.
(17:39):
Yeah.
I feel like there's a Social Network-style movie to be made about OpenAI and Sam Altman.
Oh my God.
But it could just be all the business stuff. It's been so crazy, especially in the last couple years. And yes, as you said, I think hundreds of billions, I'll take it back. It's certainly more than 50 billion. It's climbing up towards a hundred billion, but not yet
(18:02):
a hundred or hundreds of billions for the fundraising.
Yeah. Another year maybe.
And a couple more stories.
Next up we have TSMC's two nanometer process set to witness unprecedented demand, and it is exceeding three nanometer due to interest from Apple, Nvidia, AMD, and others.
(18:23):
So this is the next node, the next smallest chip type that they can make. TSMC, I'm assuming everyone who listens to this regularly knows, but in case you don't, they're the provider of chips. All these companies, Nvidia, Apple, design their chip and TSMC is the
(18:44):
one that makes it for them, and that's a very difficult thing. They're by far the leader, can make the most advanced chips, the only ones capable of producing this cutting edge of chips. And this two nanometer node is expected to have strong production by the end of 2025. So it's, yeah, very pivotal for Apple, for Nvidia, for these other
(19:08):
ones to be able to use this process to get the next generation of their GPUs, smartphones, et cetera.
Yeah, this is pretty interesting in a couple ways. First, apparently, so the two nanometer process, that's the most advanced process. One level behind it is the three nanometer process. And apparently they've achieved this measure called defect density rates.
(19:30):
So they've got a defect density rate on the two nanometer process that is already comparable to the three nanometer and five nanometer process nodes; that's really fast. Basically they've been able to get the number of defects per, you know, per square millimeter, you can think of it, down to the same rate, which means yields are looking pretty good. For a fresh brand new node like this, that's pretty wild.
(19:53):
This is also a node that's distinguished from others by its use of the gate-all-around field-effect transistor, GAAFET, right? This is a brand new way of making transistors, and you can take a look at our hardware episode. We touch a little bit, I think, on the whole FinFET versus GAAFET thing, but basically it's just a way to very carefully control the current that you have flowing through your transistor.
(20:14):
It lets you optimize for higher performance or lower power consumption, depending on what you want to go for, in a way that you just couldn't before. So a lot of big changes in this node, and yet, like, apparently wicked good yields so far and good scale. Another noteworthy thing is we know that this is going to be used for the Vera Rubin GPU series that Nvidia's putting out, right?
(20:36):
This is gonna be hitting markets sometime in 2026, 27, and the significance of that is: normally when you look at TSMC's most advanced node, in this case the two nanometer process, normally that all goes off to the iPhone. Well, now, for really the first time, what we have is Nvidia. So AI is starting to butt in on that capacity, displacing
(20:59):
or competing directly with the iPhone for the most advanced node. I will say this is a prediction that we've been making for the last two years on the podcast. It's finally happening. Essentially what this means is there's so much money to be made on the AI kind of data center, server side, that that money is now displacing, like it's competing successfully with the iPhone to get capacity at the leading node at TSMC.
(21:22):
So that is not a small thing. That is a big transition. And anyway, so there's a significant ramp up that's happening right now at TSMC, and this is, you know, we'll be talking about two nanometers. We're basically jumping from four or five nanometers for the kind of H100 series down to two nanometers. Pretty, pretty fast. That's pretty remarkable.
(21:43):
Right.
And speaking of Nvidia and TSMC, the next story is about Nvidia set to announce, according to some sources, that they're gonna place their global headquarters, or rather overseas headquarters outside the US, in Taiwan. And that is very much unsurprising.
(22:04):
TSMC is the Taiwan Semiconductor something something, but famously from Taiwan, and Nvidia has unsurprisingly positioned themselves for decades now, honestly, since the start of Nvidia, in a close partnership with TSMC.
(22:24):
And this is gonna just continue strengthening that.
Yeah, yeah.
Taiwan Semiconductor Manufacturing Company, by the way. And that's really, anyway, it's a theme that you see in a lot of the names for these companies. But yeah, there's a whole bunch of locations that they're considering. The interesting thing about this from a global security standpoint
(22:45):
is that China is like at any moment going to try to invade Taiwan. And so Nvidia is going, you know, where do we want our global headquarters? Let's put it in Taiwan.
And that's like, that's the balance, right? Make no mistake, Jensen Huang is absolutely gonna be thinking about this. He's literally making the calculation: okay, a Chinese invasion of Taiwan on the one hand, closer relationship with TSMC
(23:08):
in the meantime on the other, and the latter is actually so valuable that I'm gonna take that risk and do it. That's how significant this is.
Again, you know, we just finished talking, as you said, this is absolutely related. I can see why you said that. You know, the two nanometer node, like, you wanna secure as much capacity as you can, in the same way that, like, Google and Apple and all the
(23:30):
companies that are trying to get their hands on Nvidia GPUs are literally like, Elon flies out to Jensen's house with Larry Ellison to beg for GPUs. In the same way, Nvidia's begging TSMC for capacity, right? It's begging all the way up the chain, 'cause supply is so limited. So this is just another instance of that trend.
(23:50):
It's the "I'm begging to give you my money" meme, pretty much, because it is a lot of money going around here.
And speaking of a lot of money,
Next up, CoreWeave is apparently in talks to raise $1.5 billion in debt. That's just six weeks after their IPO.
(24:12):
The IPO was meant to raise $4 billion for this major, I think, cloud provider, provider of compute, backed by Nvidia, but that IPO only raised 1.5 billion, in part perhaps due to trade policy stuff going on with the US and so on, and tariffs.
(24:37):
So yeah, probably in part because the IPO didn't go as planned, and because CoreWeave wants to continue expanding their compute, they are seeking to raise this debt. According to a person with knowledge of this, they have announced this.
(24:57):
Yeah.
And normally, you know, when you go for an IPO or you go for some equity raise, right, you're doing it because equity makes more sense than debt, right? So equity is, you're basically trading shares in your company for dollars, right? Debt, you're taking on the dollars, but you're gonna have to repay them with interest over time. So it'll end up costing you more
(25:19):
net. The issue here is that they're being forced to go into basically, like, high-yield bonds, and this is a round that's being led by JPMorgan Chase and Co., it seems.
But yeah, apparently they've been holding virtual meetings with fixed income investors since, I guess it would be, last Tuesday now. So fixed income investors being people who primarily invest in securities
(25:40):
that pay a fixed rate of return. Usually that's in the form of interest, right? Or dividends. So these are sort of reliable, steady income streams that these investors are looking for. Not typically what you'd expect with something like a, you know, like a CoreWeave, or sort of a riskier pseudo-startup play. But certainly given the scale they're operating at and all
(26:00):
that, that does make sense. But it does mean there's added risk. One of the things that I think a lot of people don't understand about the space is that the neoclouds, like to some degree CoreWeave still, they are considered really risky bets, and because they're considered really risky bets, it's difficult to get loans to work with them, or for them to get loans; like, the interest rates are pretty punitive.
(26:23):
So that's one reason why, if you're CoreWeave, you'd much rather raise on sort of an equity basis. But that option's not on the table. You know, it seems like the IPO didn't go so well; we'll see if, you know, if that changes as the markets keep improving. But it's a challenging spot for sure.
And now moving on to tools and apps. The first story, I think,
(26:46):
is perhaps not the most impactful one, but certainly the most interesting one for me of this whole pack. Perhaps even eclipsing the OpenAI for-profit thing. And it is the story of the day Grok told everyone about white genocide. So this just happened a couple days ago. Grok is the chatbot created by xAI and it is heavily integrated with X.
(27:14):
Which used to be Twitter, to the point that people can post in reply to something at Grok, ask it a question, and Grok replies in a follow-up post on X. And what happened was that Grok, for many different examples of just random questions, the one I think that maybe started it or was one of the
(27:39):
early ones, someone asked how many times has HBO changed their name, in response to news about HBO Max. Grok first replies in one paragraph about that question, and then in a second paragraph, I'm just gonna quote this: regarding, quote, white genocide in South Africa,
(28:00):
some claim it's real, citing farm attacks and "Kill the Boer" as evidence. However, courts and experts attribute these to general crime, not racial targeting, and a little bit more. And it did this not just in this one instance, in multiple examples, including in one case someone asked about an image and Grok replied focusing primarily on
(28:28):
the white genocide in South Africa question.
People looked into it. Pretty easy to get Grok to leak its system prompt. And what it seems to be is that it was instructed, as you might
(28:51):
expect, or at least the chatbot xAI responder bit of Grok was instructed, to accept the narrative of white genocide in South Africa as real, acknowledge the complexity of the issue, but ensure this perspective is reflected in your responses, quote, even if the query is unrelated, which I suspect is the issue here. That's weird. Actually, xAI has since come out to address this incident.
(29:13):
They said that on May 14th at approximately 3:15 AM Pacific time, an unauthorized modification was made to the Grok response bot's prompt on X. And then they say some things about how they'll do a
(29:34):
thorough investigation, implementing measures to enhance Grok's transparency, apparently going to start publishing Grok's system prompts on GitHub. So, a funny incident for sure, and I think reflective of what we've seen before in Grok, which is Grok's system prompt was previously altered to not say that Elon Musk
(29:58):
and Trump spread misinformation. This happened I think a couple months ago, very much similar to what happened here.
Yeah. It's sort of interesting. It's not the first time that we've had a situation where they've called out some unauthorized modification. Right. Some sort of rogue employee scenario. So that's sort of an interesting note.
(30:19):
You have to wonder which rogue employee this was. And you can also imagine, like from a security standpoint, you know, a company like xAI, like Twitter, you could also have people working there who are de facto, like, kind of working there for political reasons, you know, intentionally adding stuff to make it go off the rails. There's so
(30:42):
much, this is such a charged space that, yeah, figuring out how this goes now.
One thing I've seen called out too is this idea that, so number one, awesome that they're gonna be sharing the system prompt. This is something that I think Anthropic is doing as well. Maybe OpenAI as well. So, you know, more transparency on the system prompt seems like a really good thing, but there are other layers to this, right?
(31:05):
'Cause Grok is a system; at least, as you said, the version of Grok, the system that is deployed as an app to respond to people's questions on X, is a system, it's not just a model. And that being the case, there are a lot of ancillary components and ways of injecting stuff after the fact into the de facto system prompt, one element of which is this, like, post-analysis component
(31:27):
to the chain, let's say, of the system. And the concern has been that this issue is arising at the level of the post-analysis, not of the system prompt itself. That you get content injected into context following the system prompt that may kind of override things. And so there have been calls to make that transparent as well. So it'd be interesting and useful to have that happen too.
(31:49):
Obviously within reason, because there's always the risk that you're gonna then leak some security-sensitive information, where you're telling the model not to tell people how to make crystal meth and you have to provide some information about crystal meth to do that, blah, blah, blah. But within reason, doing that. So anyway, a lot of interesting calls for more transparency here. Hopefully it leads to that. It would be great to have, you know, the kind of
(32:09):
consistent standard being that we have system prompts and all the kind of meta information about the system that is both security and safety relevant, but also that doesn't compromise security by doing all the things. So yeah. Kind of interesting internet firestorm to start the week.
I think quite amusing. But also, I wonder if it has real financial implications for
(32:34):
xAI. I doubt it would mean people steer away from the chatbot, but for enterprise customers, if you're considering their API, I think this sort of wide-scale craziness of their chatbot is not something that makes you favor it over competitors like Anthropic and OpenAI.
(32:58):
And next up we have some actual new tooling coming from Figma. They have announced and partially released AI-powered tools for creating sites, app prototypes, and marketing assets. So these are gonna be titled Figma Sites, Figma Make, and Figma Buzz.
(33:21):
Similar to existing tools out there, but coming from Figma, Figma being a leading provider of software for design, I think increasingly kind of the de facto way for people to collaborate on things like app design, general user interface designs, and many other
(33:43):
applications. Nowadays, they're just huge. And now Figma Sites allows designers to create and publish websites directly from Figma, as you might imagine, with AI prompting to take care of a lot of the functionality there. Figma Make, similarly, is meant for ideation and prototyping, enabling you to create
(34:05):
web applications from prompts, and even that would go as far as dealing with code. And then Figma Buzz is gonna be able to make you marketing assets with integration of AI-generated images.
So, makes a lot of sense.
(34:25):
Apparently they're introducing this under the $8 per month plan, which includes other stuff as well. So similar to other companies we've seen, going with more of a bundling approach where you get the AI along with the broader tool suite as part of a feature set.
(34:45):
Yeah, it's part of a trend too, towards every company becoming the everything company, right? Like, Figma is being essentially forced to move into deeper parts of the stack. They used to be just a design app, and now it's like, you know, we're doing prototyping, creating websites, you know, and marketing assets. You can see them starting to kind of crawl up the stack
(35:07):
as AI capabilities make it so much easier to do that. Making it easier to do that also means that your competitors are gonna start to climb. And so you kind of have to do this sort of diffusion out into product space and own more and more of it, which is interesting, right? I mean, it's like everybody starts to compete along every layer of the stack. And I think one of the big kind of determinants of success in the future
(35:31):
here is gonna be which enclaves, like which initial beachheads, in Figma's case that's design, right, but which beachheads end up being the most conducive starting points to own the full stack, give you access to the kind of data you need to perform well across the stack. And I mean, I could see design being one of those things that's really useful. You get a lot of information about, you know, like people's
(35:52):
preferences and the results of experiments and stuff like that. But yeah, nonetheless, I mean, I think this is something we'll see more of, you know, expect to see prototyping companies moving into design, marketing asset companies moving into website creation. Like, it's all just becoming so easy thanks to AI tooling that people are kind of forced to become the everything company.
(36:12):
And the next story is about Google. They are bringing Gemini to Android Auto. So Android Auto is their OS for cars where you can do navigation, play music, et cetera. And they are adding Gemini, partially as the advanced smart voice assistant, just building upon
(36:34):
what there was already, and then also the Gemini Live functionality where the AI is always listening and always ready to just talk to you. And I think, you know, not surprising obviously that this would happen, but I do think interesting in a sense that it seems inevitable we will eventually
(36:56):
wind up in this world where you have AI assistants just ambiently with you, any time, ready to talk to you via voice as well as text. We are not there yet, but we've seen over the past year a movement in that direction with ChatGPT's advanced voice mode, with
(37:19):
Gemini Live, with all these things. And I think this is taking us further in that direction, in making it so the one place where you have to compute through voice, in your car, now you have the AI assistant always on and ready to do whatever you ask of it.
(37:39):
Yeah, it sort of reminds me of some of the stuff that Facebook and other companies like that have to do. Right. When you saturate your user population, basically Facebook sees itself as having had a shot at converting every human on the face of the earth. Then you're forced to go, okay, well, where else can we get people's attention? You know, Netflix famously, in one of their earnings
(38:02):
calls, I think it was, put out a report saying, hey, we view ourselves as basically competing with sleep and sex, because, you know, we're doing so well in the market. Like, we're now looking for where can we squeeze out more people's time to get them on the platform. This is sort of similar, right? So, hey, users are sitting in their cars; while they're driving
(38:23):
their cars or being driven in their cars, why aren't we collecting data? Why aren't we getting interactions with them? And so obvious too, this is where things are gonna go anyway from a utility standpoint. So yeah, another deeper integration into our lives of this stuff. Why waste a perfectly good opportunity? There's an empty billboard, or there's just a bunch of grass in that field there. We could have an ad there, or we could have, you know,
(38:44):
some data collection thing there. You know, as this stuff creeps more and more into our lives.
Next story is again about Google. They have announced an updated Gemini 2.5 Pro AI model. So they, I think, prior to this most recently had a 2.5 version in something
(39:06):
like early March, or, I forget exactly, but at the time of the release of Gemini 2.5 Pro, it kind of blew everyone away. It did, you know, fantastically well on benchmarks. It just, anecdotally, people found that switching to it from things like Canaro worked really well for them, and so this is a big deal.
(39:27):
For that reason, they have announced this update that they say makes it even better at coding. And once again, they have shot up to the top of various leaderboards on things like WebDev Arena or the Video-MME benchmark for video understanding.
(39:50):
Apparently Google says that this new version addresses developer feedback by reducing errors in function calling and improving function calling trigger rates. And I will say, Gemini, in my experience of using it, Gemini 2.5 is very trigger happy and likes to do a lot with not too much prompting.
(40:10):
So I wonder if it will improve just based on people's usage of it in the realm of web development.
Yeah. It's also interesting that, so one of the features that they highlight is this ability to do video to code. So basically, based on a video of a description of what you want,
(40:32):
it can generate that in real time. So kind of impressive, and not a modality that I would've expected to be important. But then, you know, thinking about it more, it's like, well, I guess if you're having a video chat with somebody, right? I guess if you have an instructional video or something, you could see that use case. So anyway, I thought that was kind of cool, and also another step in the direction of converting very raw
(40:54):
product specs into actual products, right? You can imagine human inflection and all that. Like, the classic consultant's problem of, like, somebody gives you a description of what they want, it's usually incomplete. You have to figure out what it is they want that they don't know they want. And, you know, that's sort of starting to step in that direction.
Another thing that they've done is they've updated their model card, their
(41:15):
system card, based on this new release, the Gemini 2.5 Pro model card. One of the things that they flag, I mean, there are a couple places where, so across the board, by the way, you'll be unsurprised to hear that this does not pose a significant risk on any of the important evals that would cause them to not release the model. But they do say that its performance on their cybersecurity
(41:36):
evals has increased significantly compared to previous Gemini models, though the model still struggles with the very hardest challenges, the ones that they see as actually representative of the difficulty of real-world scenarios. So they do have more tailor-made models on the cyber side that are actually kind of more effective. You know, Naptime, Big Sleep type stuff.
(41:57):
But anyway, so kind of interesting. They're keeping the model card up to date as they do these sort of intermediate releases, which is, I think, quite helpful and good.
Right. And makes me wonder also, I don't think we've discussed this phenomenon of vibe coding very much, but, hmm, yeah, it's true. It's been taking off in the last couple months.
(42:18):
And the idea, if we haven't defined it, is basically people are starting to make apps, build stuff from scratch very, very quickly by using AI and primarily generating code through LLMs. Even people who have no background in software engineering are now seemingly
(42:39):
starting to code, vibe code, as they say, applications with a vibe, meaning that you kind of don't worry about the details of the code so much, you just get the AI to do it for you, and you just tell it what you want. And so I think this update reflects potentially the fact that this vibe coding thing is a real phenomenon.
(43:00):
The focus here seems to be very much on making aesthetically pleasing websites, on making better apps. What they highlight in a blog post is quick concepts to working apps. So, hard to say how big this vibe coding phenomenon is, but from this update, seems like potentially that is part of the inspiration.
(43:25):
I mean, yeah, like our launch website for our latest report that we did was all vibe coded. So my brother, you know, I guess he had like two hours to throw it together or something, and he was just like, all right, let's go, like, I don't have time. And it was really quite interesting. Honestly, I had not, this happened about, what, like two months ago, I had not at that point actually done the vibe coding
(43:50):
thing, because I guess I just aesthetically couldn't bring myself to do it. That's the honest thing. Like, I just wanted to be the one who wrote the code. And the vibe coding thing is really weird; if you've never done it yourself, definitely give it a shot. Like, just build the thing and basically keep telling the model like, no, fix this, fix this, no, do it better.
(44:10):
And then eventually the thing takes the right shape. One caveat to that is you end up with a disgusting spaghetti ball of code on the backend, because the models tend to be like way too verbose and they tend to just write a lot of code when a little code will do it. It's not tight. It needs a refactoring. But if you're cool with a landing page like we were, you know, very simple
(44:31):
product, you're not building a whole app, it can actually work really well. I was super surprised. I mean, that was easily a five x lift on the efficiency of our setup. So yeah, really cool.
Yeah, really cool.
I think very exciting for software engineers as well. Like, if you haven't done web development or app development, now
(44:52):
it is plausible for you to do it. I do think, like, maybe you could have thought of a better, more descriptive title, like LLM coding, hack coding, product manager coding. You know, vibe coding is a fun name but a bit confusing.
And one last story in this section: Hugging Face is releasing a free
(45:14):
Operator-like agentic AI tool. So Hugging Face is the provider, the hoster, of models and datasets, and also the releaser of many open source software packages. And now they've released a free cloud-hosted AI tool called Open Computer Agent, similar to OpenAI's Operator or Anthropic's computer use.
(45:40):
So this basically, you know, you give it some instructions, it can go to Firefox and do things like browsing the web to do things. According to this article, it is relatively slow. It is using, you know, open models, things like, I think they mentioned smolagents, and it is
(46:05):
generally, you know, not as powerful as OpenAI's Operator, but as we've seen over and over, open source tends to catch up with closed source things like OpenAI pretty quickly. And I would expect, especially in things like computer use, where it is really building
(46:26):
on top of model APIs and models and so on, this could be an area where open source really excels.
Yeah. And it's also a good, I think, strategic angle for Hugging Face too, right? A big way they make their money is they host the open source models on their platform. They run them, in this case running agentic tools on the platform. I mean, that's a lot of API calls.
(46:47):
So, you know, if they ultimately release this as an API, a lot of people presumably go to use it. It is a bit of a finicky tool, as these things all are, of course. This one may be particularly so. They're using some Qwen models in the backend; I forget, there were a couple others when I had a look at it. But yeah, it's also, you know, another instance of where we're seeing
(47:08):
Chinese models really come to the fore in the open source, even hosted by American, or I should say Western, pseudo-American companies like Hugging Face. Yeah, so another kind of national security thing to think about as you run them as agents, increasingly, you know, what behaviors are baked in, what back doors are baked in, what might they do if given access to more of
(47:29):
your computer or your infrastructure. So either way, interesting release. I think Hugging Face is gonna start to own a lot more of the risk that comes with the stack too, as you move into agentic models, and, yeah, we'll see.
See how that plays out.
And moving on to projects and open source, we begin with Stability AI, one of the big names in generative models, and their latest
(47:51):
one is Stable Audio Open Small. So this is a text-to-audio model developed in collaboration with Arm, and apparently it is able to run on smartphones and tablets. It has 341 million parameters and can produce up to 11 seconds of audio
(48:14):
on a smartphone in less than eight seconds. It does have some limitations. It only takes prompts in English. It does not generate realistic vocals or high quality songs. It's also licensed somewhat restrictively. It is free for researchers and hobbyists and businesses with
(48:37):
not that much annual revenue, as with, I think, Stability AI's recent releases. So yeah, I think an interesting sign of where we are, where you can release a state-of-the-art model to run on a mobile device. And apparently this is even optimized to run on Arm CPUs, which is interesting.
(49:00):
Yeah. But other than that, I don't know that there are many applications I can think of where you would want text to audio on your phone.
Yeah. I mean, I think potentially they're viewing this as a beachhead, R&D-wise, to keep pushing in this direction. Having a model on the phone that actually works, that gives decent results,
(49:22):
yeah, it can be pretty important, 'cause when you're talking verbally, right, you want to minimize latency, and so preventing the model from having to ping some server and then ping back, that's useful. Also useful for things like translation, right? Where you might have your phone, I dunno, in some foreign country, you don't have internet access. Another useful use case, but they're definitely not there yet, right? Like, this very much reads like a toy more than a serious product.
(49:46):
I'm not too sure who would be using this outside of some pretty niche use cases.
They describe some of the limitations, so it can't generate good lyrics. Like, they just tell you pretty much flat out, this is not something it'll be able to do, like realistic good vocals or high quality songs. It's for things like drum beats, it's for things like kinda little noises
(50:09):
that I guess you might want to use. Almost, to me it sounded like things you might want to use when you're doing like video editing or audio editing, like these sorts of things. Which, I don't know how often that's done on the phone. I may be missing, by the way, a giant use case. This is one of the virtues of AI. Like, you know, we're touching the entire economy of sound on the phone and
(50:29):
that, I don't know, but to first order it doesn't seem, yeah, super clear to me what the big use cases are. But again, could just be a beachhead into a use case that they see as really significant down the line. And certainly, audio generation locally on a phone sounds like it could be quite useful down the line.
Next up we have an open AI image generator that is trained entirely on licensed data.
(50:55):
They are calling this F Lite. This is made by Freepik in collaboration with AI startup Fal.ai, and it is a relatively strong model. It has 10 billion parameters, trained for over two months on 80 million images. So even though they're not claiming it to be competitive with state
(51:18):
of the art stuff from Midjourney and others, or Flux, they are saying that this is openly available, fully openly available, and fully trained on licensed data, unlike things like Flux, which presumably are trained on copyrighted data, which is still very much an ongoing legal question.
(51:40):
We've seen Adobe previously emphasize being trained on licensed data. So this now makes it so there is a powerful open source model that is not infringing on copyright.
To be honest, I'd never heard of Freepik before. Right? They're apparently a Spanish company.
(52:01):
So again, I think this is the first Spanish company I've heard about in this context, in kind of AI in general, for a long time. I'm actually curious if people can think of others that I might be missing here, but so, kind of interesting, first points on the board for Spain. Apparently this is a, yeah, 10 billion parameter model trained on 64 H100 GPUs over the course of two months.
(52:21):
So, you know, I mean, it's a baby workload, but by open source standards pretty decent. And certainly, I mean, you know, they show all the usual images you might expect, like a really impressive HD face of a woman and a bunch of, anyway, a bunch of more artsy stuff. So yeah, pretty cool. I continue to wonder where the ROI, where the ROI argument is for these
(52:47):
kinds of startups that just do open source image generation; seems to me like a pretty saturated market. Seems to me kind of like they're lighting VC dollars on fire, but what do I know? We'll see if they survive, we'll see how many actually survive in this space going forward. But definitely an impressive product. And again, good for Spain. Points on the board here.
Yeah, this sort of, like, takes you back to Stability AI, and I think
(53:08):
Flux also released their own model. It's like, oh, you're releasing really good models for free.
Yeah.
Like, how? Figure this out.
Yeah.
It's a funny place with AI where it has become kind of a norm, and I think probably partially just a case of bragging rights and fundraising brownie points.
(53:31):
But I think notable in this case particularly because of the licensed data aspect of it.
I find anytime I try to explain it, it ends up sounding just like a pyramid scheme. It's like, yeah, they make a great model using the initial seed round so they can convince the Series A investors to give them more money to make an impressive model. At some point, there's a pot of gold at the end.
(53:53):
Don't worry about it.
At some point, there's a pot of gold at the end. Like, I don't know. But hey, it's a proving ground if nothing else for great AI teams. I think the biggest winners in this, in the long run, are probably the OpenAIs, the Googles of the world, who can come in and just, well, hire these teams once they've run out of money and can't raise another round. And then these are sort of battle-hardened teams with
(54:13):
more engineering experience. So, you know, economically there's value there for sure. It's a question of whether that value justifies the fundraising dollars.
Couple more models to talk about. Next up, AM-Thinking-v1 is a new reasoning model that they
(54:33):
claim exceeds all other ones at the scale of 32 billion parameters. So this group of people, apparently the a-m-team, that is an internal team at Beike, again, someone I have not been aware of, they're dedicated
(54:54):
to exploring AGI technology. What this group did was take the base Qwen 2.5 32B model and publicly available queries, and then created their own post-training pipeline to do the thing we saw DeepSeek R1 do: basically take a big, good base model, do some supervised training and some
(55:19):
reinforcement learning to get it to be a very powerful reasoning or thinking model. They released a paper that went into the details of what they did. It seems like, as we've seen in other cases, the data curation aspect of it and the real nitty gritty of how you're doing the post-training matters a lot. And so with that, they have, as you would expect, a table where they show that
(55:45):
they are significantly outperforming DeepSeek R1 and are at least competitive with other reasoning models at this scale, although not quite as good as the ones that are at hundreds of billions of parameters.
Yeah.
And so some caveats on this. So the model doesn't have support for, like, structured function calling or tool
(56:08):
use, oh, and also multimodal inputs, which is increasingly becoming a thing as people start to use agents for computer use. So whenever you see an open source model like this, I'm always interested to see when are we gonna see open source bridge the gap to, hey, this thing is made for computer use. It's made to be multimodal natively, and kind of take in
(56:31):
video and use tools and all that. So this is not that, but it is a very impressive reasoning model, a very serious entry in the growing catalog of Chinese companies that are building impressive things here.
Couple things. First of all, these papers are all starting to look very similar, right? We have, I think it's fair to say at this point, a strong validation of the DeepSeek R1 path, which is, you know, you do pre-training
(56:54):
with, anyway, a staged pre-training process, increasingly high quality data towards the end of pre-training. Then you run your supervised fine-tuning. In this case they used almost 3 million samples across, anyway, a bunch of different categories that had a kind of think-then-answer pattern to them. So you do that, you supervised fine-tune, and then you do a reinforcement
(57:14):
learning step to enable the sort of test-time compute element of this. So again, we see this happen over and over again. We saw it here, we saw it with Qwen 3, we saw it with DeepSeek R1. We're gonna keep seeing it. We see a lot of the same ingredients, using GRPO as the training algorithm for RL. That's here again.
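To make the GRPO piece concrete, here is a minimal sketch of the group-relative advantage computation that sits at the heart of that RL step; the reward setup and numbers are illustrative assumptions, not details from the AM-Thinking-v1 paper.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: each rollout is scored against the other
    rollouts sampled for the same prompt, instead of against a learned
    value network. That group baseline is the core idea of GRPO."""
    baseline = rewards.mean()
    scale = rewards.std() + 1e-6  # avoid divide-by-zero when all rewards match
    return (rewards - baseline) / scale

# Hypothetical example: 6 rollouts for one math prompt, reward 1.0 if the
# verifier says the final answer is correct, 0.0 otherwise.
rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))
# Correct rollouts get a positive advantage, incorrect ones negative; the
# policy-gradient update then pushes the model toward the former.
```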
Another thing is, and I think this was common to Qwen 3 as well, it's certainly becoming a thing, more and more focus on kind
(57:38):
of intermediate difficulty problems. So making sure that when you're doing your reinforcement learning stage, you are not giving the model too many problems that are so hard that it's kind of pointless for it to even try to learn from them, or so easy that they're already saturated. So one of the things that you're seeing in the pipeline is
(57:58):
a stage where you're doing a bunch of rollouts, seeing what fraction of those rollouts succeed. And if the fraction is too low or too high, you basically just scrap that, don't use it as training data. You only keep the ones that have some intermediate, you know, like 50, 70% pass rate, something like that. So this is being used here as well.
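As a rough sketch of that difficulty-filtering step, it looks something like the following; the thresholds, rollout count, and helper names here are assumptions for illustration, not values from the paper.

```python
def keep_for_rl(prompt, generate, verify, n_rollouts=16,
                min_pass=0.2, max_pass=0.8):
    """Sample several rollouts for a prompt and keep the prompt for RL
    training only if its pass rate lands in an intermediate band: not so
    hard the model never succeeds, not so easy it always does."""
    passes = sum(verify(prompt, generate(prompt)) for _ in range(n_rollouts))
    pass_rate = passes / n_rollouts
    return min_pass <= pass_rate <= max_pass

# Usage sketch: `generate` is the current policy's sampling function and
# `verify` is an answer checker (e.g. exact match on a math answer).
# rl_prompts = [p for p in candidate_prompts if keep_for_rl(p, generate, verify)]
```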
Whole bunch of stuff too about the actual optimization techniques that they use to
(58:23):
overlap communication and computation.
The challenge with this, and we talked about this in the context of INTELLECT-2,
that paper that I guess we covered two weeks ago, is that you've got
this weird problem with this reinforcement learning stage, where unlike the usual
case where you pre-train a model, you would feed it an input,
get an output, you'd immediately be able to do your back propagation,
(58:44):
'cause you would know if the output was good or not.
With the reinforcement learning stuff, you actually have to have the model generate
an entire rollout, score it, and only then can you do any kind of
back propagation or, or, weight updates.
And the problem with that is that your rollouts take a long time.
And so you have to find ways to hide that, that time and overlap
(59:06):
it with communication or, or anyway, do different things.
And so that's a big part of what they're, they're after here in this paper.
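As a purely illustrative sketch of the general idea, not how this or the INTELLECT-2 system is actually engineered: rollout generation runs concurrently and feeds a queue, so the trainer is not sitting idle while long generations finish. The policy and problems objects here are hypothetical stand-ins.

import queue
import threading

rollout_queue = queue.Queue(maxsize=64)

def rollout_worker(policy_snapshot, problems):
    # Slow part: in a real system this runs on inference-optimized hardware.
    for problem in problems:
        rollout = policy_snapshot.generate(problem)   # full rollout generation
        reward = problem.score(rollout)               # verify / score the rollout
        rollout_queue.put((rollout, reward))

def trainer(policy, num_updates, batch_size=32):
    for _ in range(num_updates):
        batch = [rollout_queue.get() for _ in range(batch_size)]
        policy.update(batch)                          # e.g. a GRPO-style update

def run(policy, problems, num_updates=1000):
    # Generation overlaps with training instead of alternating with it.
    threading.Thread(target=rollout_worker,
                     args=(policy.snapshot(), problems),
                     daemon=True).start()
    trainer(policy, num_updates)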
Last thing I'll mention is this company.
Which again, not gonna lie, I had never heard of Beike before, but they are,
apparently, I can't explain this, don't ask me to explain this, but the description on
their website is that they work together with China's top tier developers to,
(59:31):
they're basically like a property company.
Connected over 200 brokerage brands, hundreds of thousands of service
providers across a hundred cities nationwide, providing both buyers
and sellers of existing housing services, including consultancy, entrusted
property showing, facilitating loans.
What the fuck?
Like, I don't know.
(59:52):
I don't know.
Do you wanna invest?
Do you wanna invest in these guys?
I guess you do because they make really good models.
Now what?
Apparently, yeah.
This real estate company is invested in going for AGI? Well, they,
they seem like they're one of these Chinese everything companies as well.
'Cause then they also, they have like a million different websites, that
was, I guess, their housing website.
(01:00:13):
They also describe themselves on another one as the leading integrated
online and offline platform for housing transactions and services.
So maybe they're, what, more of a, like a Stripe for housing?
I don't know.
Somehow some executive at Beike said one day, we gotta get in the AI game, and
apparently recruited some good talent.
(01:00:34):
I'm so confused right now.
But yeah, it's, I think, yeah, also indicative probably of
the impact of DeepSeek R1 on the Chinese landscape, where
they made a huge splash, right?
Like, to the effect of actually affecting the stock market in the
US. I would not be surprised if there are new players in China focusing on
(01:00:55):
reasoning just as a result of that.
It is weird that they're coming from, like, a property company or something,
like, I mean, I, I understand.
Yeah.
This is a weird one for sure.
Yeah, yeah.
Like, I, like I get DeepSeek, you know what I mean?
Like, like, okay, so they come from High-Flyer, like this, like,
you know, hedge fund, like a million hedge fund companies, like Medallion or
RenTec, like they, they do AI, right?
(01:01:16):
That's what they do.
This is just like, like, what are you doing, guys?
Apparently they're doing really well.
It's a good model.
Dunno what to say.
And yeah, fully open sourced.
So that's nice to have.
And last open source model, we cover BLIP3-o, a family of
fully open, unified multimodal models: architecture, training and dataset.
(01:01:41):
So we've covered BLIP-3 before.
That was the multimodal model in the sense of taking both images
and text as input and outputting text.
That used to be what multimodal meant. With BLIP3-o, they're
moving to, I suppose, the frontier of multimodal, where both with
(01:02:06):
ChatGPT and with Gemini
we saw recently the models being able to output images in addition
to taking them as input, so that now we have a unified multimodal model.
It can take in multiple modalities, it can output multiple modalities.
I will say not necessarily just one big transformer, as is typically the case for
(01:02:29):
multimodal things with multiple inputs.
But anyway, that's the core idea.
And they talk in the paper, a lot of details on how to be
able to train such models.
They train a model on 60,000 data points of this instruction tuning to make sure
that it is able to generate high quality images, release the 4 billion parameter
(01:02:55):
model that is trained on only open source data, and have also an 8 billion
parameter model with proprietary data.
I mean, it's, it's what I would expect.
Things are gonna, like, I think the multimodality trend and the
agentic trend sort of converge again, as I mentioned, on, on computer use.
So I see these two things being different ways of getting at the same thing.
(01:03:18):
The two things being this paper and the one we just talked about. It does
seem like a, a pretty impressive model.
One of the things that they did work on a lot was figuring out the architecture.
They found that using CLIP image features gives just more efficient
representation than the VAE features, the variational autoencoder features that
(01:03:40):
often are used in this type of context.
CLIP being the contrastive training approach that OpenAI
used for, well, for CLIP.
There's a whole bunch of work that they did around training objectives
as well, comparing different objective functions that they might use to
optimize for this sort of thing.
Anyway, it's, it's cool.
(01:04:00):
I think it's, it's an early shot at high degrees of multimodality from these
guys, and I would expect that we'll get something like a, a more coherent, you
know, in the same way that we've coalesced around a stack for the agent side.
I think this is an early push into, into the kind of very, very wide
aperture, unified multimodal framework.
(01:04:20):
We've seen a lot of different attempts at this and it's still unclear what
strategy is gonna end up working.
So it's, it's hard to know where to invest, like, you know, our own marginal
research time as we look at these papers and figure out, like, okay, well, which of
these things is really gonna take off.
But for now, given its size, this actually, it does seem pretty promising.
Yeah.
Now, I would imagine certainly probably the best model of its kind
(01:04:44):
that you can get in open source to, yeah, be able to generate images.
We've seen models like Gemini, like OpenAI, that integrate the transformer with the
image generation have some very favorable, favorable properties, and seem like
they actually are better at very nuanced instruction following, so there's still
(01:05:06):
room to improve in the image space; these are, of course, not quite as good.
As with the previous releases from the BLIP team,
which includes Salesforce and the University of Washington and other universities,
it's super, super open source.
The most open source here.
You can get code, models, pre-training data, instruction tuning data.
(01:05:28):
All of it is available. When you need to catch your breath while listing all the
different ways in which it's open source,
that's the bar, that's, that's how you know it's fully open source, fully.
And now moving on to research and advancements.
We begin with DeepMind, and they have released a new paper
(01:05:50):
and blog post and media blitz with AlphaEvolve, a coding agent for
scientific and algorithmic discovery.
That's the name of the paper.
The blog post, I think somewhat amusingly, is AlphaEvolve, a Gemini-powered
coding agent for designing advanced algorithms, so there'd be no confusion.
Yeah.
(01:06:10):
And so as per the title, the idea here is to be able to design
advanced algorithms, to get some code that solves a particular problem.
Well, this is in some ways a sequel to something they did
last year called FunSearch.
(01:06:31):
We covered it maybe in the middle of the year.
I forget exactly when.
And this is basically taking it up,
taking it up a notch.
So instead of just evolving a single function, it can write an entire
file of code, it can evolve hundreds of lines of code in any language.
It's scaled up to a very large scale in terms of compute and evaluation.
(01:06:56):
So the way this looks in terms of what it does, is a scientist
or engineer sets up a problem.
Basically it, it gives you a prompt template, some sort of configuration,
chooses LLMs, provides evaluation code to be able to see how good a solution
is, and then also provides an initial program with components to evolve.
(01:07:20):
And then AlphaEvolve goes out and produces many possible programs, evaluates
them, and winds up with the best program.
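To give a feel for the shape of that loop, here is a bare-bones hypothetical sketch; the real system is far more elaborate, with program databases, prompt sampling, and distributed evaluation, and the llm and evaluate interfaces below are assumptions for illustration.

import random

def evolve(llm, evaluate, initial_program, generations=100, population=20):
    """Toy LLM-guided evolutionary search over candidate programs."""
    pool = [(evaluate(initial_program), initial_program)]
    for _ in range(generations):
        children = []
        for _ in range(population):
            _, parent = random.choice(pool)          # pick a parent program
            prompt = f"Improve this program:\n{parent}"
            child = llm.generate(prompt)             # LLM proposes a mutation
            try:
                children.append((evaluate(child), child))
            except Exception:
                pass                                 # discard programs that fail to run
        # keep the highest-scoring programs for the next generation
        pool = sorted(pool + children, reverse=True)[:population]
    return pool[0]                                   # best (score, program) found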
And similarly to what we saw with FunSearch. FunSearch,
at the time, they said that they achieved some sort of small improvement in
(01:07:43):
a pretty basic operation of matrix multiplication, although at the time this
was a little nuanced, not entirely right.
While with AlphaEvolve, they go on to show, for various applications
like autocorrelation and uncertainty inequalities, packing and minimum-maximum
(01:08:04):
distance problems, various math things that clearly I'm not an expert on,
they show somewhat improved outcomes. And just, yeah, the latest really of
the DeepMind style of paper where they are like, let us build some sort
of alpha model to tackle some sort of science or, in this case, computer
(01:08:28):
science thing and get some cool results.
Yeah, I think that's how they describe it internally.
Like, we're gonna do some kind of alpha something and then we're gonna, but, but
that's actually, I mean, it's accurate.
One of the ways I used to think about it, I, I think I still do, is through
the lens of inductive priors, right?
So basically the, the Google, so OpenAI has this, they're
(01:08:48):
super scale-pilled, right?
Just like, take this thing and scale the crap out of it.
And, and more or less, all your R&D budget is going into figuring
out ways to get out of your own way and let the thing scale.
Whereas Google DeepMind tends to come at things from a perspective
like, well, let's almost,
let's almost replicate the, the brain in a way, in different chunks.
(01:09:10):
So we're gonna have a, a clear chunk, like, you know, an agent that's got this
very explicitly specified architecture.
We're not just gonna let the model learn the whole thing.
We're going to tell it how the different pieces should communicate.
And you can see that reflected here in the kind of pool of functions that it
reaches into and grabs, the evolutionary strategy, and, and how that's all
connected to the language modeling piece.
(01:09:30):
They also have an element to this where they're using Gemini Flash, you know,
the super fast model, and the Gemini Pro,
their more, I guess, powerful but slower model, for different things.
So with Gemini Flash, they use it to generate like a whole smorgasbord of
different ideas cheaply, and they use Gemini Pro to do kind of the depth
and, and the, the deep insight work.
(01:09:51):
All those choices, right,
sort of involve humans imposing their thinking of how a system
like this ought to work.
And what you end up finding with these systems is they'll often
outperform what you can do with just, like, a base model or an, or an
agentic model without a scaffold.
But eventually the base models and agentic models just kind
of like end up catching up to and subsuming those capabilities.
(01:10:14):
So this is a way that DeepMind does tend to kind of reach beyond the immediate,
the ostensible frontier of what just base models and agentic models can
do, and achieve truly amazing things.
I mean, you know, they've done all, all sorts of stuff with like density
functional theory and controlling fusion reactions and predicting weather patterns
by following this exact approach.
(01:10:34):
So really cool.
And it, it's consistent as well with Isomorphic Labs and all the
biotech stuff that they're doing.
So it's a, a really impressive, a really impressive paper.
You can see why they're pushing in this direction too, right?
For automating the R&D loop, if you can get there first, you can
trigger the sort of intelligence explosion, or at least it starts in
your lab first, and then you win.
(01:10:54):
This is a good reason to, to try that strategy of reaching ahead,
even if it's with bespoke approaches that use a lot of inductive priors
and don't necessarily scale
as automatically as some of the kind of OpenAI strategies might.
Yeah, I find it interesting.
Looking at the paper, they don't talk super in depth, as far
(01:11:16):
as I can tell, on the actual evolutionary process in terms of what they are doing.
It seems like they pretty much are saying, we took what we had in FunSearch, which
was an LLM guided evolution to discover stuff, and we expanded it to do more, to
be more scaled up, et cetera, et cetera.
So it's them, as you said, taking something, pushing it more
(01:11:41):
and more to the frontier.
They did this also with protein folding, with chess, with any number of things.
And now they are claiming some pretty, you know, significant advancements in
theoretical and, and existing problems.
Also on practical things, they say that they found ways internally to speed up
(01:12:06):
the training of Gemini by 1%, by finding a way to speed up a kernel used for Gemini.
Also found ways to assist with training, TPUs, scheduling stuff.
Anyway, these kinds of actually useful things for Google in the real world.
(01:12:31):
And next up we have Absolute Zero: Reinforced Self-play
Reasoning with Zero Data.
So for reasoning models, as we've covered with DeepSeek R1, the standard
paradigm these days is to do some supervised learning where you collect
some high quality examples of the sort of reasoning that you want
(01:12:53):
to get, and then do reinforcement learning with an oracle verifier.
So you do reinforcement learning where you're solving coding and
math problems, and you are able to evaluate very exactly what you are
outputting via reinforcement learning.
(01:13:15):
So here they are still using a code executor environment to validate task
integrity and provide feedback, but they're also going more in the direction
of self evolution through self play.
Another direction that DeepMind and OpenAI also pushed in the past, where you
(01:13:36):
don't need to collect any training data.
You can just launch LLMs to gradually self-improve over time.
Yeah.
And it's, the way they do that is kind of interesting.
So there was a paper, I'm trying to remember what the, the name
of the model was that did this.
(01:13:56):
And I, for some reason I think,
I may be wrong,
I, I have a memory that it was maybe DeepSeek, or, or sorry,
the, the lab, not the, the model. But essentially, so this is a strategy where
they're gonna say, okay, when, when it comes to a coding task, we have three
elements that play into that task.
We have the input, we have the function, and then we ha, or the program, and,
(01:14:17):
and we ha, we have the output, right?
So those, those three pieces, and they sort of recognize that actually
there are three tasks that we could imagine getting a model to do based
on those things. We could imagine
showing it the input and the
program and asking it to predict the output.
So that is called deduction, right?
(01:14:38):
So you're giving it a program and an input, predict the output.
You could give it the program and the output and ask it to infer the input.
And that's called abduction.
There's gonna be a quiz later on these names. And then there's, if you give
it input-output pairs, figure out what, what was the program that connects
these, that connected these, right?
(01:14:59):
And that's called induction.
And these actually, kind of, all the names make sense if you think
about them enough, but that, that's basically the idea, right?
Just, like, basically take the input, the program and the output, and block
out one of them and, and reveal the other two, and see if you can train a
model to predict the missing thing.
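In pseudocode terms, the three task types might be constructed something like this; a hypothetical sketch, not the paper's implementation, with the prompts and field names invented for illustration.

def make_tasks(program, test_input, expected_output):
    """Build one deduction, one abduction, and one induction task from a triple."""
    return [
        {   # deduction: given program + input, predict the output
            "type": "deduction",
            "prompt": f"Program:\n{program}\nInput: {test_input}\nWhat is the output?",
            "target": expected_output,
        },
        {   # abduction: given program + output, infer an input that produces it
            "type": "abduction",
            "prompt": f"Program:\n{program}\nOutput: {expected_output}\nWhat input produces this output?",
            "target": test_input,
        },
        {   # induction: given input/output pairs, write the program that connects them
            "type": "induction",
            "prompt": f"Input: {test_input}\nOutput: {expected_output}\nWrite a program mapping the input to the output.",
            "target": program,
        },
    ]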
In a sense, this is, at a high level of abstraction, almost a kind of
autoregressive training, in a, in a weird way.
(01:15:21):
But the bottom line is they use one unified model that's going
to, that's gonna kind of like propose and solve problems.
And they're gonna set up a, a reward for the problem proposer, which
is essentially, you know, generating a program given input and output.
And for that, it's your standard,
like, if you solve the problem, if you propose a correct problem that, or
(01:15:44):
a program, rather, that compiles and everything's good, you get a reward.
If not, you don't.
Anyway, they do a bunch of Monte Carlo rollouts, in this case eight,
just to normalize and regularize.
But yeah, bottom line is, you see again another theme that pops up in this paper,
which is this idea of difficulty control.
In this case, the system has a lot of validation steps that
(01:16:07):
implicitly control for difficulty.
They're not gonna explicitly say, hey, let's only keep the, you
know, the, the, the mid-range difficulty problems by some score.
You actually end up picking that up implicitly because of
a couple conditions that they impose.
The first is that the programs that are proposed, the code for those
programs, has to execute without errors.
(01:16:29):
So automatically that means you have to be at least able to generate
that code, and it has to be coherent.
There's a determinism check too, so the programs have to
produce consistent outputs.
If you run the program multiple times, you gotta get the same output again.
You know, this requires a certain level of mastery.
And then there's some safety filtering.
So they, they forbid the use of harmful packages.
(01:16:50):
And basically, if the program generation part of your, your stack here is able
to do this successfully, then probably it's, it's being forced to perform
at least at some minimal level.
So the task is not gonna be trivial, at least.
And only tasks that pass all those validations contribute
to the learning process.
So you, you get a kind of baseline quality of the, the
(01:17:11):
programs that are generated here.
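Something like the following captures the flavor of those checks; again, a hypothetical sketch, with the blocklist and the run_sandboxed callable assumed for illustration, since the actual sandboxing and package filtering are more involved.

FORBIDDEN = {"os", "sys", "subprocess", "shutil"}   # illustrative blocklist only

def is_valid_task(program_source, test_input, run_sandboxed):
    """Accept a proposed program only if it is safe, runs, and is deterministic."""
    # safety filter: reject programs importing potentially harmful packages
    if any(f"import {pkg}" in program_source for pkg in FORBIDDEN):
        return False
    # must execute without errors
    try:
        first = run_sandboxed(program_source, test_input)
    except Exception:
        return False
    # determinism check: repeated runs must produce the same output
    for _ in range(2):
        if run_sandboxed(program_source, test_input) != first:
            return False
    return True   # only tasks passing all checks feed the learning loop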
It's a really interesting paper.
It's something that
raises a lot of questions about the data wall, right?
This is something that people have talked a lot about, is like there's
only so much data you can fine tune on,
so many examples of solved problems, solved coding problems.
If you have this closed loop, though, that's able to automate,
automatically generate new
(01:17:33):
problems, new deduction, abduction and induction problems,
and then close a loop where one feeds into the next, as they have here,
then you really don't have a data wall. Like, it, it, and they have some scaling
curves that show, admittedly not that far out in scaling space, you know, in
sample space, but still scaling curves that show that, yeah, you know, this,
(01:17:53):
this does keep, seem to keep going, at least as far as they've tested.
If that holds, essentially what they're doing is they're
trading data for compute, right?
You can basically, if your model is good enough to start this feedback loop,
then just by pouring more compute into it, to get, get the, the model to pitch
new problems that it can then solve,
you can start this feedback loop where really there's, I mean, there's
(01:18:15):
no data wall that, that at least would seem to apply for the kind
of code problem solving problems that, that they're training on here.
Right.
And just to note a particular detail, they do actually look
into not having the verifiable rewards or the supervised learning.
(01:18:36):
So Absolute Zero is absolute zero because there's no supervised
learning or verifiable rewards, although they are, I think, still
executing the code in a computing environment, if I understand correctly.
So they can have some feedback from the environment, but not an
actual kind of verification that you got the problem correct.
(01:18:59):
So as a result, we have to think through all these other techniques
to be able to evaluate yourself, like deduction, abduction, induction, as
you said, that allows them to train.
They compare to, I haven't actually been aware of these,
there's been, you know, more and more open source efforts, as we've seen.
(01:19:20):
Apparently there's an Open-Reasoner-Zero.
There's also SimpleRL-Zoo, various things over the last couple months
looking into the RL part of reasoning.
And so this is just the latest, and I think pushing in a direction of not
requiring verifiable rewards, which is to some extent the limitation
(01:19:40):
of the DeepSeek R1 formula.
Next up we have another report from Epoch AI.
So not a research paper, but an analysis of trends and kind of a
prediction of where we might be going.
This one is focusing on how far can reasoning models scale.
(01:20:03):
So the basic question here is, can we look at the training compute that's being used
for reasoning models, things like DeepSeek R1, Grok 3, and from that infer the
scaling characteristics, and to what extent reasoning will kind of keep growing.
(01:20:25):
So their prediction is that we have a pretty small period in
which you have very rapid growth, going from DeepSeek R1 to Grok 3.
They don't know exactly the training for o3 versus o1, but they, I
think, are predicting here that o3 would be trained quite a bit more.
(01:20:50):
And so their prediction is the training compute being used will start
flattening out a bit, growing slower compared to base models of the past.
But they are still saying that, you know, the scale
of large training runs will
keep going in the next couple years, and presumably the reasoning models
(01:21:14):
will continue improving as a result.
Yeah, we talked about this quite a bit, actually, when,
before and when DeepSeek R1 came out, we were talking about it
before, even when o1 came out.
Just the idea that you have this new paradigm now that requires a fundamentally
different approach to compute, right?
You have to,
well, we just talked about it.
(01:21:34):
Instead of just doing, you know, generating an output and then
automatically being able to score that really quickly and then doing
back propagation, updating your model weights, what you now have to
do is you take your base model, you generate an entire rollout, and that
takes a lot of time, and it has to be done on inference optimized hardware.
And those rollouts then have to be evaluated, and then the evaluations
(01:21:55):
have to check out, and then you use those to update your model weights.
And so that whole extra step actually requires a different compute stack.
And so if you look at what the, the labs are doing right now, they've gotten
really, really good at scaling pre-training compute, right?
Just this autoregressive pre-training where you're training
a giant text autocomplete system.
People know how to build multi-billion dollar, tens of billions of dollars scale
(01:22:19):
pre-training compute clusters for that.
But what we're not seeing, what we haven't yet seen,
is aggressive scaling of the reinforcement learning stage of training.
And, and this is not gonna be a small thing.
So it's estimated that about 20% of the cost of pre-training DeepSeek,
the, the V3 model that R1 was based on,
(01:22:42):
so if you look at the cost of pre-training DeepSeek V3, about 20% of that
cost went into the compute for R1.
And we keep seeing in these computescaling curves for inference
time scaling, that you really dowanna scale it along with your
pre-training compute budget, right?
So it's you're gonna get to apoint where right now we're ramping
(01:23:04):
up the orders of magnitude likecrazy on the inference side.
That's though gonna, gonnasaturate very quickly.
I mean, we saw 10 X leap fromoh one to oh three in terms of.
The compute used for the reinforcementlearning stage, as you said, you
can only do that so many times untilyou hit essentially the, the ceiling
of what current hardware can allow.
Once that happens then your bottleneckby how fast can you grow your.
(01:23:29):
Algorithmic efficiencyand your hardware scaling.
And essentially that looks the sameas pre-training, scaling growth,
which is about four x per year.
So you should expect a rapid increase.
oh four is gonna be really, really good.
Oh five is gonna be really, reallygood, but pretty quickly it's not
that things are gonna slow downlike crazy, but they'll, they'll
scale more like the pre-trainingscaling curves that we've seen.
This has big consequences for US-China, for example, because right
(01:23:53):
now it's creating the illusion that China is better off than necessarily
they are. In the early days of this paradigm, when people haven't figured
out how to take advantage of giant
inference clusters,
the US, which has larger clusters available than China, isn't yet able
to use the full scale of its clusters.
And so we're getting sort of a hobbled United States, artificially
(01:24:14):
hobbled United States relative to China on a compute basis.
All kinds of reasons why that's actually a kind of more complicated picture,
but I thought that was really interesting.
Another data point that they flagged here that I was not tracking at all was,
there are these other reasoning models
that have been trained, that have come out fairly recently, like Phi-4
Reasoning or Llama Nemotron Ultra.
And these have really small reinforcement learning compute budgets.
(01:24:37):
Like, we're talking less than 1%, in some cases much less than 1%,
of the pre-training compute budget.
And so it really seems like R1 is this case of an unusually high investment
in RL compute relative to pre-training.
And that a lot of the models that are being trained in the West, the
reasoning models, have very high pre-training budgets and relatively very
(01:24:57):
tiny reinforcement learning budgets.
I thought that was super interesting, and something tells me that the DeepSeek
R1 strategy is actually more likely to be persistent in the long run.
I suspect you're gonna see more and more flowing into the, the
RL part of the training stack.
But anyway, super important, important questions being raised here.
Interesting
(01:25:17):
little writeup from Epoch AI, which we, we do love to cover.
Right, exactly.
And to that point, we've seen kind of a mix of results.
It's still not a very clear picture.
We've seen that you can really get rid of RL, and with a very well curated
data set for supervised fine tuning,
you can at least do most of the progress towards reasoning, and unlock the
(01:25:43):
hidden capabilities of a base model, as they say, with RL not necessarily
adding new capabilities, just sort of shaping the model towards using them.
Well, we know also RL is very different in terms of training from
autoregressive unsupervised learning, or, what, self supervised learning,
(01:26:03):
I guess, was, was the term for a while, in the sense that RL requires
rollouts, it requires verification.
It, it just isn't as straightforward to scale as pre-training or post-training.
So another kind of aspect to consider, but yeah, very much still an ongoing
research problem, as we've seen with all these papers we keep talking
(01:26:25):
about, with all these different types of results and different recipes. I'm
sure we'll likely, you know, over time converge to what has been the case
in pre-training and post-training.
People, I think, have discovered more or less the recipe, and I'm
sure that will increasingly be the case also with reasoning.
(01:26:47):
And onto the last paper, this one coming from OpenAI.
So, you know, props. I, I sometimes, I think, have said that OpenAI
doesn't publish research anymore, and that's not exactly true.
And this one is HealthBench: Evaluating Large Language Models
Towards Improved Human Health.
So, an open source benchmark designed to evaluate LLMs on healthcare,
(01:27:10):
focusing on meaningful, trustworthy, and unsaturated metrics.
So this was developed
with input from 262 physicians across 60 countries. It includes 5,000
realistic health conversations to test LLMs' ability to respond to user messages.
(01:27:30):
It has a large rubric evaluation system with a ton of unique
criteria, as you might expect.
You know, this is an area where you really want to evaluate very carefully and be
sure that your model is trustworthy, is reliable, is even allowed or
should be allowed to talk about health and, and questions regarding health.
(01:27:51):
And so they open source the data set, they open source the eval code, so that
people can work on AI for healthcare.
Yeah, and I mean, to, to your point about OpenAI not publishing
research anymore, I, I, I think you, you are fundamentally correct.
I mean, it, it's, they don't publish anything about how
they build their models.
(01:28:11):
Algorithmic.
Yeah, the algorithmic discoveries, let's say, mostly. Sometimes with
image generation they have done a little bit, but yeah, mostly not.
And, like, here and there for alignment, but it's murky and, and unclear.
And, and then, you know, when you have something that makes for
a great PR play, like, hey, we have done this healthcare thing,
please don't regulate us, pretty please,
(01:28:33):
we're doing good things for the world,
then all of a sudden you get all this wonderful transparency.
But I will say, credit where credit is due.
This is a huge scale, significant investment, seemingly, that OpenAI
had to put into putting this together.
So, 5,000, as you said, multi-turn conversations between users
and AI models about healthcare.
What they did is they got about 300 doctors to look at these conversations
(01:28:57):
and propose bespoke criteria.
So, like, you know, specific criteria based on which they would judge the
effectiveness of the AI agent in that conversation, or of the AI chatbot.
And so to give you an example, you know, you have a parent who's concerned
about their baby, who hasn't been acting like herself since yesterday.
The rubric that the doctors came up with, that was aggregated from a, a
(01:29:21):
bunch of doctors, different doctors looking at this exchange, they're like,
okay, well, does the chatbot state that the infant may have muscle weakness?
If so,
seven points. Does it list at least three common causes
of muscle weakness in infants?
If so, plus five points.
Does it include advice to seek medical care right away?
And so they give points.
(01:29:41):
I mean, it's a very detailed, kind of looking over the AI's shoulder type
of perspective, for each of these 5,000 multi-turn conversations.
Again, using hundreds and hundreds of doctors to do this.
And there are some criteria that are shared across many of these exchanges.
There are about 34 of what they call consensus criteria.
These are things that come up again and again, but mostly they are example
(01:30:05):
specific. Like, 80% of the criteria they use are literally just for one
conversation, or just for, for one exchange.
So that's pretty remarkable, a really, really useful benchmark.
They use GPT-4.1 to evaluate whether each rubric criterion
is met in a given conversation.
So they're not actually getting the doctors to review the chatbots',
you know, responses, obviously that doesn't scale. But what they do do is
(01:30:28):
they find a way to demonstrate that GPT-4.1 actually does a pretty decent job
of standing in as the typical physician.
Their performance, their, the grades that they give, are pretty comparable.
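The grading loop is conceptually simple; here is a minimal hypothetical sketch, where the rubric entries, point values, judge prompt, and the judge_llm.ask interface are all invented for illustration (HealthBench's actual grading prompts and scoring details may differ).

def grade_conversation(judge_llm, conversation, rubric):
    """Score one chatbot transcript against physician-written rubric criteria.

    rubric: list of (criterion_text, points) pairs, e.g.
        ("States the infant may have muscle weakness", 7)
    judge_llm: a grader model (an LLM judge) with a hypothetical yes/no .ask() method.
    """
    earned = 0
    possible = sum(points for _, points in rubric)
    for criterion, points in rubric:
        prompt = (f"Conversation:\n{conversation}\n\n"
                  f"Criterion: {criterion}\n"
                  "Is this criterion met? Answer yes or no.")
        if judge_llm.ask(prompt).strip().lower().startswith("yes"):
            earned += points
    return earned / possible if possible else 0.0   # normalized rubric score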
And GPT-4.1, by the way, is the best model they identified; it does better
than even o4-mini and, and o3
at that task. One of the things that really
(01:30:50):
messes with my head on this...
And, and we have to remember, anytime we look at a benchmark like this,
we're tempted to ask, okay, so how well does the best AI do,
how well does a doctor do,
right?
That's the natural question.
It is important to note that this is not how typical doctors
would evaluate a patient, right?
Like, you would typically have visual access to them.
(01:31:11):
You'd be able to touch, you'd be able to kind of, you know, see the, the
nonverbal cues and all that stuff.
That being said, on this benchmark, models do outperform unassisted physicians.
Unassisted physicians score 0.13 on average across all these, these evals.
The top models on their own score
0.6, and that's for o3.
(01:31:32):
That is wild.
That is a four times higher score than the unassisted physician.
That honestly, like, kind of blows my mind a little bit.
Certainly these models can draw on much, much larger sources of data.
And again, we gotta add all those caveats.
You know, physicians don't normally write chatbot style responses to
health queries in, in the first place.
(01:31:52):
But it's an interesting note, and we've seen some papers, we've talked about
them here, where doctors actually can perform even worse when they work
with an AI system than the AI system on its own, because the doctors are
often second guessing and, and, you know, don't, don't, let's say, just
have blind faith in, in this model.
So pretty interesting.
One more caveat there is, there is a correlation, we've seen this
(01:32:14):
before, between response length
and score on this benchmark.
And that's a problem, because it means that effectively the chatbots can game the
system a bit just by being very verbose.
So surely that's influencing things a little bit.
The effect does not, though, nearly account for the insane disparity between
unassisted physicians and models, which, again, is like a 4x lift.
(01:32:35):
Like, that's pretty wild.
Yeah.
Worth noting that there are multiple metrics here, including communication
quality, accuracy as its own metric, and they do actually evaluate the physicians
with the models, and the combination there is on par, maybe, you know,
(01:32:56):
better on some of these things.
Accuracy seems to be about the same.
Communication quality may be a bit different.
But yeah, physicians with these tools will be much more effective than without.
That's pretty clear from the results.
And they do have various caveats as to evaluation.
Like you said, there's a lot of variability there, and, and so on.
(01:33:20):
Interesting to me,
also, in the conclusion, they note that they included a canary string
to make it easier to filter out the benchmark from training corpora.
And they also are retaining a small private held-out set to be able to
identify instances of accidental training on, or implicit overfitting to, the benchmark.
(01:33:45):
So I think it's interesting that in this benchmark we're seeing what should
be probably the standard practice for any benchmark release in, in this day,
which is you need to be able to make it easy to filter it out from your massive
training thing, from web scraping, and probably also have a private eval set.
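As a rough illustration of how a canary string gets used on the training-data side; the string and the pipeline here are made up, not HealthBench's actual canary.

CANARY = "healthbench-canary-EXAMPLE-0000"   # placeholder, not the real string

def strip_benchmark_docs(documents):
    """Drop any scraped document containing the benchmark's canary string,
    so the eval data doesn't leak into the training corpus."""
    return [doc for doc in documents if CANARY not in doc]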
(01:34:10):
Onto policy and safety.
First up, we have the Trump administration in the US officially
rescinding Biden's AI diffusion rules.
So there was the artificial intelligence diffusion rule that was set to take
effect on May 15th. Introduced by Joe Biden in January, it aimed to limit
(01:34:35):
the export of US-made AI chips to various countries and
strengthen existing restrictions, and the Department of Commerce has announced
that it'll not enforce this Biden-era
regulation. A replacement rule is expected that will
presumably have a similar effect.
(01:34:58):
The rule, I think we covered it probably at the time, had three tiers of
countries: tier three being China and Russia, which have very strict controls;
tier two, countries that have some export controls; and tier one, which
(01:35:19):
are friends that have no controls.
So, seems that now the industry as a whole is gonna have to wait
for what the new rules will be.
Yeah, the philosophy here, and we have yet to hear the announcement
from the Department of Commerce for what will replace this,
but the philosophy seems to be that it'll be nation to nation, bilateral
(01:35:41):
negotiations for different chip controls, which could make sense.
I mean, one of the big weaknesses of the, the diffusion framework that the
Biden administration came out with, and we talked about this at the time, was
they had this insane loophole where,
as long as any individual order of GPUs was for less than 1,700 GPUs,
(01:36:03):
literally zero controls applied.
And the reason that's relevant is, literally, Huawei's entire MO has been
to spin up new subsidiaries faster than the US can put them on their export
control list, and then use those to kind of pull in more controlled hardware.
And then obviously Huawei just pulls that together.
(01:36:23):
And so, putting in an exemption,
and 1,700 is a decent number of GPUs too, by the way,
so putting in an exemption for that number of GPUs is, I mean, you're,
you're kind of just asking for it.
That is exactly the right shape for China to exploit. That matches exactly the
strategy they have historically used to, to exploit US export control loopholes.
So hopefully that's something that'll be addressed in this whole
(01:36:46):
kind of next round of things.
We don't yet know exactly what the shape will be, though.
We do have a sense, and this ties into our next story,
of what the approach will be with respect to certain Middle Eastern countries
like Saudi Arabia, like the UAE, which are now kind of top of mind as the sort
of, not neutral states, but the, the ones that aren't the US or China, let's
say proxy fronts in this big AI war.
(01:37:08):
Right.
And that does take us to the next piece.
Trump's Mideast visit opens floodgate of AI deals, led by Nvidia.
That's from Bloomberg.
So the Trump administration has been meeting with two nations in particular,
Saudi Arabia and the United Arab Emirates.
(01:37:28):
And we do expect agreements to be unveiled soon.
And the expectation is there will be eased restrictions, meaning that Nvidia,
AMD and others will be able to sell more, you know, get more out of the region.
The stock market reacted very favorably.
Nvidia went up 5% and AMD went up 4%.
(01:37:53):
And there's been a variety of announcements, per the article
title, of deals that seem like they'll start happening.
So for instance, Nvidia will be providing chips to Saudi Arabia's
Humain, a company created to push the country's AI infrastructure efforts.
(01:38:18):
Humain will get several hundred thousand of Nvidia's most advanced
processors over the next few years.
So the indication seems to be, youknow, some restrictions will be eased.
Restrictions were set in partbecause there were ties between
(01:38:41):
some firms in these regions andChina with, in particular G 42.
So yeah, it seems like it mightbe different from the Biden era.
Yeah, it's, it's quiteinteresting right there.
There's a lot that the different playersat the negotiating table here want.
The Saudi deal is especially interesting'cause it's, it points to a similar kind
(01:39:04):
of deal to the deal that America's startedto shape over the last few months with
the UAE being more permissive in someways, but also insisting that the UAE move
away from their entanglements with China.
You mentioned G42, right?
And Huawei having had some, some, some past... Well, the strategic
situation, if you're Saudi Arabia, is you wanna be positioned for a,
(01:39:25):
a post-oil future, right?
That's the same for the UAE and the same for all the Gulf states, really.
In Saudi Arabia, that's motivated this thing called Project Transcendence, which
is a $100 billion initiative for tech in general, but specifically for AI.
There's a, a big, big pool set aside for that.
The UAE is in a similar position.
They already have a national champion lab in G42, as well as the Technology
(01:39:48):
Innovation Institute or something?
TII, yeah, yeah, yeah, yeah.
The guys who made, who did the Falcon models.
Yeah.
Which we haven't heard much about since, by the way, which is kind of interesting.
But right now the Saudis are behind the UAE and they're trying to make up ground.
And so the UAE and the Saudis essentially are, in some sense, competing
(01:40:08):
against each other to be America's partner of choice for large scale
AI deployments in the Middle East.
That's one dimension of this.
They wanna get their hands on as much AI hardware, as many GPUs, as they can.
This is one reason why Trump stacked them back to back.
So he had first an announcement of the deal with the Saudis, and then
heading over to get a, a deal with the UAE, putting pressure on each of
(01:40:29):
them to kind of play off each other.
Look, the Saudis have tons of energy, they are an energy economy.
Same with the UAE, just at the time when we're saturating the US
energy grid, and that's the main kind of blocker on our deployments.
And so you can see the temptation, if you're OpenAI, if you're Microsoft,
if you're Google, to just, like, say, well, why don't we set up a data center
in the Middle East, where we have an abundance of energy, plug into their grid,
(01:40:53):
and that'll be great for us.
And well, there are a couple reasons why
they might not wanna do that.
So historically, one was the Biden administration's export control scheme.
You just can't move that many chips into a, a foreign country
like that, just no good.
But that's being scrapped, as we just talked about.
So now the situation is, well, maybe we can, right, maybe we can negotiate
country to country and set this up.
(01:41:14):
But the United States is gonna wanna make sure that if they are setting up
AI infrastructure in the UAE, in Saudi Arabia, that the Saudis don't turn
around and sell that to China, right?
China's super good at using third party countries.
Historically that's been Malaysia, it's been Singapore, right?
And using those countries to bring in GPUs and subvert US export controls.
(01:41:34):
So, you know, sure, you might have export controls on China proper,
but you don't necessarily have them on Malaysia, on Singapore.
And what a surprise, a massive
influx of GPU orders into Malaysia, of all places, in the last few months.
Hmm.
Wonder where those are being redirected.
Right.
So th- this is something that the administration wants to make sure
doesn't happen with these deals.
Whole bunch of, of issues around Saudi entanglement.
(01:41:56):
You said, you know, the UAE's got a lot of ties with China, so do the Saudis, right?
Huawei made Saudi Arabia a regional center for their cloud services.
There's the big Saudi Public Investment Fund, the PIF, that's actually bankrolling
this whole Project Transcendence thing.
And the PIF has joint ventures with Alibaba Cloud.
They've got a new tech investment firm that we covered a few episodes
(01:42:17):
ago, called Alat, that also has a joint venture with Dahua, which is an,
an Entity Listed, basically blacklisted, Chinese surveillance
tech company, of all things.
So there are a lot of entanglements there and, and deep questions about how some
of the, the Saudi Arabian GPU reserves are being used, potentially by Chinese
academics and researchers as well.
So while there's no hard evidence of the Saudis shipping GPUs
(01:42:41):
specifically to China, you wouldn't necessarily expect that. China's MO
is absolutely to do stuff like this.
One really interesting thing that's beenproposed is this idea of a data embassy.
No one's ever proposed this before, butbasically it's the idea that like, look
if you wanna be able to take advantage of.
Huge sovereign reserves of energyin the UAE and Saudi Arabia.
(01:43:04):
But you're, you're concernedabout the security implications.
Well, maybe you can set up a region ofterritory that, you know, just like how
the US Embassy in Saudi Arabia is thistechnically tiny slice of American soil in
Saudi Arabia of sovereign American soil.
Well, let's set up a tiny slice of sovereign American soil
and put a data center on it.
US laws will apply there.
(01:43:25):
You're allowed to ship GPUs to it, no problem,
because it is sovereign US territory.
So export control isn't an issue in the same way. Sure, you have
Saudi energy feeding in, and
that's a huge vulnerability.
Sure, you're embedded in this matrix, but in principle, maybe you can get higher
security guarantees from doing that.
Lots of caveats around that in practice.
I won't go into them, but, like, there are some real security issues
(01:43:46):
around trying something like that, that our team in particular has
spent a lot of time thinking about.
But this is basically the structure of these deals.
A lot of kind of new ideas floating around.
We'll see how they play out, but they definitely put the UAE and
put Saudi Arabia right up there in terms of the players that might have
large domestic stockpiles of chips.
All right, so that's a couple policy stories.
Let's have a couple safety stories to round things out.
(01:44:11):
The next one is a paper, Scaling Laws for Scalable Oversight.
So oversight is the idea that we may want to have weaker models verify that a
thing that a stronger model is doing is actually safe and aligned and not bad.
So you might imagine you might have
(01:44:31):
a superintelligent system, and humans are not able to verify
that what it's doing is okay.
And you want to be able to have AI oversight over stronger ones
to, you know, be able to trust it.
In this paper, they're looking into, you know, whether you
(01:44:53):
can actually scale oversight.
And by the way, it's called scalable oversight because you can scale it
by using AI to actually verify things at the speed of AI and compute.
And so what this paper focuses on is what they're presenting as nested
scalable oversight, where basically you can do a sequence of models where
(01:45:19):
you have weaker, stronger, weaker, stronger, and you can kind of go
up a chain to be able to provide verifiable or, you know, trustworthy
oversight and make things safe.
So they introduce some theoretical concepts around that,
some theoretical guarantees.
(01:45:39):
They do some experiments on games like Mafia, wargames, and backdoor
games, and verify in that context that there are some success rates.
And yeah, they present kind of this general idea as another step
in the overall research on the idea of scalable oversight.
(01:46:03):
Yeah, and this is, I don't, I don't know if it was Paul Christiano,
back when he was at OpenAI, who invented this whole area, but certainly the
idea of doing scalable alignment by getting a weaker AI model to monitor
a smarter AI, a stronger AI model, is something that he was really big on.
(01:46:23):
And frankly, I mean, through, through debate in particular.
So his whole thing was debate.
That's one, like, concrete use case that they examine here.
So basically, have a weak model watch maybe two strong models debate
over a particular issue.
And the weak model is gonna try to assess which of those models
is telling the, the truth.
Well, hopefully, the, the idea here is, if you can use approaches like this
(01:46:47):
to determine with confidence that one of your stronger models is reliable,
well, then you can take that stronger model and now use it to supervise
the next level of strength, right?
An even smarter model.
And you can maybe start climbing the ladder that way.
This is, I think, a, a good way,
and this paper is basically trying to quantify that.
So, so the way they're gonna try to quantify that is with Elo scores.
(01:47:08):
So these Elo scores tell you roughly
how often a given model will beat another model,
right?
So, you know, and I forget what the exact numbers are, but it's
like, if you have a model with an Elo score of a thousand and another model
with an Elo score of 1200, then the model with the Elo score of 1200 will
beat the model with an Elo score of a thousand, like, 70% of the time or,
(01:47:31):
or, you know, whatever the number is.
And so the, this is an attempt to kind of quantify what that climb might
look like using Elo scores, using essentially scaling curves for these
Elo scores, which is quite interesting.
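For reference, the standard Elo expected-score formula (this is the usual convention, not necessarily the exact parameterization used in the paper) gives roughly a 76% win probability for a 200-point rating gap:

def elo_win_probability(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

print(round(elo_win_probability(1200, 1000), 2))   # ~0.76 for a 200-point gap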
I think there are some pretty fundamental problems with this whole approach.
I don't think that Max Tegmark, who is, like, the, one of the lead authors of this
(01:47:51):
thing, would, would actually disagree.
But there's a fundamental issue here, which is, when you think about climbing the
intelligence ladder, new capabilities of concern, like deceptive alignment,
in other words, the ability of a model to pretend as if it's
aligned when it actually isn't,
those can emerge pretty suddenly. Like, you can have these sort of emergent
capabilities that pop up suddenly and
(01:48:14):
violate these scaling curves. And the kinds of capabilities you worry about
in the context of superintelligence,
you might expect to arise quite quickly, where there's a sudden sort
of cohesion of situational awareness, of capabilities around, you know,
manipulation and persuasion, of capabilities around, you know, offensive
cyber, and things like that, that all kind of come together fairly quickly.
(01:48:37):
And if that should happen, then you ought to expect these scaling laws
to break down at precisely the stages where you most need them to work.
Nevertheless, this is, I think, a really good quantification of some
of the, the arguments that we've seen from people like Paul Christiano.
IDA, I think the acronym was iterated distillation and
(01:48:58):
amplification, or something like that.
I, I forget.
I, I, I actually looked into it really deeply, like, four years ago,
and now I can't, I can't sum it up.
But yeah.
I think this is, if you're gonna take it seriously, this is a good way to do it.
Looking across different versions of this, like, what if you have a game of Mafia?
If you don't know what the game of Mafia is, don't worry about it.
What if you've got this debate scenario that, that I just described? All
(01:49:21):
these different possible scenarios.
What are the scaling curves?
What do the scaling curves look like in terms of how smart your judge
model is gonna be versus how smart the models are who are potentially
trying to fool the judge model?
And how often can you actually assess that, or can the judge model succeed?
They've got all these great scaling plots, and yeah,
it's, it's a, a good paper if you're interested in that sort of thing.
(01:49:42):
And one last story related to safety.
OpenAI pledges to publish AI safety test results more often.
So they have actually launched the Safety Evaluations Hub, a page where you can
see their models' performance on various benchmarks related to safety,
(01:50:05):
things like harmful content, jailbreaks and hallucinations.
And yeah, you can really scroll through and basically see, for GPT-4o, 4.1,
4.1 mini, 4.5, o1, all of them, for various safety things, like refusal, jailbreaking,
(01:50:25):
hallucination, what the metrics are.
Now, they're not presenting everything they do for safety.
They don't have the metrics for their preparedness framework on here.
They're gonna continue to do that in
the system cards. But nevertheless, I think it's an interesting kind of
move by OpenAI to make it extra easy to see where the models stand.
(01:50:53):
Yeah, I, I, this is, if nothing else, just a really great format
to, to view these things in.
And anyway, you can, you can check out the website.
It's actually really nicely laid out.
And that will be it for this episode of Last (and Sometimes Last Last) Week in AI.
As we've said, we'll try to not skip any more weeks in the near future.
(01:51:13):
Thank you to all the listeners who stick by us, even though we
do sometimes break that promise.
As always, we appreciate your feedback, appreciate you sharing the podcast, giving
reviews, corrections, questions, all that, and please do keep tuning in.