
July 13, 2025 • 48 mins

Elon Musk's xAI Grok 4 announcement presentation


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:01):
Hey everybody. Welcome back to the Elon Musk
Podcast. This is a show where we discuss
the critical crossroads that shape SpaceX, Tesla, X, The
Boring Company, and Neuralink. I'm your host, Will Walden.
All right, welcome to the Grok 4 release here.
This is the smartest AI in the world, and we're going to show

(00:23):
you exactly how and why. And it really is remarkable to
see the advancement of artificial intelligence, how
quickly it is evolving. I sometimes compare it to
the growth of a human and how fast a human learns and gains

(00:49):
conscious awareness and understanding.
And AI is advancing just vastly faster than any human.
I mean, we're going to take you through a bunch of benchmarks
that Grok 4 is able to achieve incredible numbers on.

(01:10):
But it's actually worth noting that Grok 4,
if given, like, the SAT, would get a perfect SAT score every time,
even if it's never seen the questions before.
And even going beyond that, to, say, graduate student exams,
like the GRE, it will get near-perfect results in every

(01:36):
discipline of education. So from the humanities to, like,
languages, math, physics, engineering, pick anything.
And we're talking about questions that it's never seen
before. These are not on the
Internet. And Grok 4 is smarter than

(01:56):
almost all graduate students in all disciplines simultaneously.
Like, it's actually just important to appreciate that:
that's really something. And the reasoning
capabilities of Grok are incredible.

(02:16):
So there's some people out there who think AI can't reason,
and look, it can reason at superhuman levels.
So yeah. And frankly, it only gets better
from here. So we'll take you through
the Grok 4 release and, yeah, show you the pace

(02:40):
of progress here. Like, I guess the first part is,
in terms of the training: going from Grok 2 to Grok
3 to Grok 4, we've essentially increased the training by an
order of magnitude in each case. So it's, you know, 100 times
more training than Grok 2. And that's only going

(03:03):
to increase. So it's, yeah, frankly, I mean,
I don't know, in some ways a little terrifying, but the
growth of intelligence here is remarkable.
Yes. It's important to realize there
are two types of training compute. One is the pre-training
compute; that's from Grok 2 to Grok 3.

(03:24):
But from Grok 3 to Grok 4, we're actually putting a lot of
compute into reasoning, into RL. Yeah, and just like you said,
this is literally the fastest-moving field, and Grok 2 is like
the high school student by today's standard.
If you look back over the last 12 months, Grok 2 was only a
concept. We didn't even have Grok 2 twelve

(03:46):
months ago. And then by training Grok 2,
that was the first time we scaled up the pre-training.
We realized that if you actually do the data ablation really
carefully, and the infra and also the algorithm, we can actually
push the pre-training quite a lot, by an amount of 10X, to make the
model the best pre-trained base model.

(04:06):
And that's why we built Colossus, the world's largest supercomputer,
with 100,000 H100s, and then got the best pre-trained model.
And we realized if you can collect these verifiable outcome
rewards, you can actually train this model to start thinking
from first principles, start to reason, correct its own
mistakes. And that's where the Grok
reasoning comes from. And today we ask the question:

(04:27):
what happens if you take the expansion of Colossus, with all
200,000 GPUs, put all of these into RL, 10X more compute than any
of the models out there on reinforcement learning,
unprecedented scale. What's going to happen?
So this is the story of Grok 4. And, you know, Tony,

(04:47):
Share some insight with the audience.
Yeah. So, yeah, let's just talk about
how smart Grok 4 is. So I guess we can start by
discussing this benchmark, Humanity's Last Exam.
And this benchmark is a very, very challenging
benchmark. Every single problem is curated
by subject matter experts. It's 2,500 problems in total, and

(05:13):
it consists of many different subjects: mathematics, natural
sciences, engineering, and also all the humanities subjects.
So essentially, when it was first released, actually, like,
earlier this year, most of the models out there could only get
single-digit accuracy on this benchmark.

(05:34):
Yeah. So we can look at some of
those examples, you know. So there is this
mathematical problem, which is about natural transformations in
category theory. And there's this organic
chemistry problem that talks about electrocyclic
reactions. And also there's this linguistics

(05:54):
problem that asks you about distinguishing
between closed and open syllables in a Hebrew source text.
So you can see it's a very wide range of problems, and every
single problem is a PhD or even advanced-research-level
problem. Yeah.
I mean, there are no humans that can actually answer

(06:17):
these, can get a good score. I mean, if you take
any given human, what's the best that any human
could score? I mean, I'd say maybe 5%,
optimistically. Yeah.
So this is much harder than what any human can
do. It's incredibly difficult.

(06:38):
And you can see from the types of questions: you might be
incredible in linguistics or mathematics or chemistry or
physics or any one of a number of subjects, but you're not
going to be at a postgrad level in everything.
And Grok 4 is at a postgrad level in everything.
Some of these things are just worth
repeating. Grok 4 is postgraduate,

(07:01):
like PhD level, in everything, better than PhD; like, most
PhDs would fail, so it's better. That said, I mean, at least with
respect to academic questions. I want to just emphasize this
point: with respect to academic questions,
Grok 4 is better than PhD level in every subject, no exceptions.

(07:23):
Now, this doesn't mean that it's, you know, perfect at all times.
It may lack common sense, and it has not yet invented new
technologies or discovered new physics.
But that is just a matter of time.
I think it may discover new technologies as soon as
later this year. And I would be shocked if it has

(07:47):
not done so by next year. And so I would expect Grok to,
yeah, literally discover new technologies that are actually
useful no later than next year, and maybe by the end of this year.
And it might discover new physics next year.
And within two years, I'd say, almost certainly.
So just let that sink in. Yeah, so.

(08:15):
Yeah, how? OK, so I guess we can talk about
what's behind the scenes of Grok 4.
As Jimmy mentioned, we actually put a lot of compute into
this training. You know, when it started, it was only at a single-
digit number. Sorry, the previous slide,
sorry. Yeah, it's only a single-digit

(08:37):
number. But as you start putting in more
and more training compute, it starts to gradually become
smarter and smarter, and eventually solved a quarter of the HLE
problems. And this is without any tools.
The next thing we did was to add tool capabilities to

(08:57):
the model. And unlike Grok 3: I think,
you know, Grok 3 actually is able to use tools as well,
but here we actually make it more native, in the sense that
we put the tools into training.
Grok 3 was only relying on generalization.
Here we actually put the tools into training, and it turns out
this significantly improves the model's capability of using those

(09:19):
tools. Yeah, I remember we had, like,
Deep Search back in the day. Yeah.
So how is this different? Yeah, yeah, yeah.
Exactly. So Deep Search was exactly the
Grok 3 reasoning model, but without any specific training;
we only asked it to use those tools.
So compared to this, it was much weaker in terms of its tool

(09:40):
capabilities, and more unreliable, I guess, yes.
And to be clear, this is
still fairly primitive tool use, if you compare it to, say, the
tools that are used at Tesla or SpaceX, where you're using, you
know, finite element analysis and computational fluid dynamics,

(10:00):
and you're able to run, or, say, like Tesla does,
crash simulations, where the simulations are so close to
reality that if the test doesn't match the simulation, you assume
that the test article is wrong. That's how good the simulations
are. So Grok is not currently using
the really powerful tools that a company

(10:21):
would use, but that is something that we will provide
it with later this year. So we'll have the tools that
a company has, and a very accurate physics simulator.
Ultimately, the thing that'll make the biggest
difference is being able to interact with the real world via
humanoid robots. So you combine, sort of, Grok

(10:42):
with Optimus, and it can actually interact with the real
world and, if it has
formulated a hypothesis, then confirm whether that hypothesis
is true or not. So really, you know, I
think about, like, where we are today:
we're at the beginning of an immense intelligence explosion.

(11:06):
We're in the intelligence Big Bang right now,
and we're at the most interesting time to be alive of
any time in history. Yeah. Now, that said,
we need to make sure that the AI is a good AI. Good Grok.

(11:29):
And the thing that I think is most important for AI safety, at
least my biological neural net tells me the most important
thing for AI, is to be maximally truth-seeking.
So this is very fundamental. You can think
of AI as this super-genius child that ultimately will outsmart

(11:50):
you. But you can still
instill the right values, and encourage it to be, sort of, you
know, truthful, I don't know, honorable, you know, good
things, like the values you want to instill in a child that

(12:13):
that that would grow, ultimatelygrow up to be incredibly
powerful. Yeah.
So, yeah. So this is really, I'd say, when
we say tools: these are still primitive tools,
not the kind of tools that serious commercial companies

(12:35):
use. But we will provide it with those tools, and I think it
will be able to solve real-world technology
problems with those tools. In fact, I'm certain of it.
It's just a question of how long it takes.
Yes, yes, exactly. So, is it just compute all you
need, Tony? Sorry, is compute all you need at this
point? Well, you need compute plus

(12:59):
the right tools, and then ultimately to be able to
interact with the physical world, yes.
And then, I mean, we'll effectively have an economy that
is, well, ultimately an economy that is thousands of times
bigger than our current economy, or maybe millions of times.

(13:22):
I mean, if you think of civilization as percentage
completion of the Kardashev scale, where Kardashev I is
using all the energy output of a planet, and Kardashev II is using
all the energy output of a sun, and III is all the energy
output of a galaxy: we're only, in my opinion,

(13:42):
probably closer to 1% of Kardashev I than we are to
10%. So, like, maybe 1 or 2%
of Kardashev I. So we will get most of the
way, like 80 to 90% of Kardashev I, and then hopefully, if

(14:05):
civilization doesn't self-annihilate, then Kardashev II.
Like, the actual notion of a human economy, assuming
civilization continues to progress, will seem very quaint
in retrospect. It will seem like, sort of, cavemen

(14:25):
throwing sticks into a fire level of economy compared to
what the future will hold. I mean, it's very exciting.
I mean, I've been at times kind of worried about, like,
well, you know, this seems somewhat
unnerving: to have intelligence created that is far greater than

(14:49):
our own. And will this be bad or good for
humanity? I think it'll
be good. Most likely it'll be good.
Yeah. Yeah.
But I've somewhat reconciled myself to the fact that

(15:12):
even if it wasn't going to be
good, I'd at least like to be
alive to see it happen. So, you know, so.
Actually, one, yeah, yeah. I think one technical problem
that we still need to solve, besides just compute, is how do
we unblock the data bottleneck.

(15:35):
Because when we try to scale up the RL, in this case, we did
invent a lot of new techniques and innovations to allow us to
figure out how to find a lot of challenging RL problems
to work on. It's not just that the problem itself
needs to be challenging.

(15:55):
You also need to have a reliable signal to tell the
model: you did it wrong, you did it right.
This is the core principle of reinforcement
learning. And as the models get smarter
and smarter, the number of challenging problems
will become fewer and fewer. Unless, yeah.
So it's going to be a new type of challenge that we need to

(16:16):
surpass besides just compute. Yeah, yeah.
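The "verifiable outcome reward" principle described above can be sketched roughly as follows. This is an illustrative toy, not xAI's training code; the task format and the checker function are invented for the example.

```python
# Toy sketch of a verifiable outcome reward for RL.
# The reward checks the *outcome* deterministically instead of
# judging the model's reasoning text, so the signal is reliable.

def verifiable_reward(problem, model_answer):
    """Return 1.0 if the answer passes the problem's checker, else 0.0."""
    return 1.0 if problem["check"](model_answer) else 0.0

# Example: an integer-answer math task is verifiable because the
# final answer can be compared exactly.
problem = {
    "prompt": "What is 17 * 24?",
    "check": lambda ans: ans == 408,
}

for candidate in (400, 408):
    print(candidate, verifiable_reward(problem, candidate))
```

The hard part the speakers point at is not this reward function; it is finding enough problems that are both challenging and checkable once models stop failing the easy ones.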
We actually are running out of actual test questions to ask.
So even questions that are
ridiculously hard, if not essentially impossible, for
humans, written-down questions, are swiftly
becoming trivial for AI. So then there's, but you know

(16:42):
what? The one thing that is an
excellent judge of things is reality.
Because physics is the law, and ultimately everything else
is a recommendation. You can't break physics.
So the ultimate test for an AI, the
ultimate reasoning test, is reality.
So you invent a new technology, like, say, improve the design of

(17:04):
a car or a rocket, or create a new medication,
and does it work? Yeah.
Does the rocket get to orbit?
Does the car drive? Does the medicine work?
Whatever the case may be, reality is the ultimate judge
here. So it's going to be a

(17:25):
reinforcement learning loop closing around reality.
We asked the question: how do we go even further?
So actually, we were thinking: right now, with a single agent, we
are able to solve 40% of the problems.
What if we have multiple agents running at the same time?

(17:46):
So this is what's called test-time compute.
And as we scale up the test-time compute, we are actually able to
solve more than 50% of the text-only subset of the HLE
problems. So it's a remarkable
achievement. I think, you know,
this is insanely difficult.
What we're saying is: a majority of the

(18:09):
text-based subset of the, you know, scarily
named Humanity's Last Exam, Grok 4 can solve, and you
can try it out for yourself. And with Grok 4 Heavy,
what it does is it spawns multiple agents in parallel,
and all of those agents do work independently.

(18:29):
And then they compare their work and they decide which one,
like, it's like a study group. And it's not as simple as
majority vote, because often only one of the agents actually
figures out the trick or figures out the solution.
But once they share the trick, or

(18:50):
figure out what the real nature of the problem is, they
share that solution with the other agents, and then they
compare. They essentially compare notes
and then yield an answer.
So that's the Heavy part of Grok 4: it's where
you scale up the test-time compute by roughly an order of
magnitude, have multiple agents tackle the task, and then they

(19:15):
compare their work and put forward what they think
is the best result. Yeah.
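The "study group" pattern described here (independent attempts, sharing the trick, then selecting rather than simply majority-voting) can be sketched like this. The agents, task, and selection rule below are invented stand-ins for illustration, not Grok 4 Heavy's actual implementation.

```python
# Toy multi-agent test-time compute: agents attempt the task
# independently; if one finds the trick, its solution is adopted by
# the group (not a majority vote). Majority vote is only a fallback.

def run_agent(seed, task):
    """A fake 'agent': only some agents spot the hidden structure."""
    if seed % 3 == 1:  # pretend these agents find the trick
        return {"answer": task["n"] // task["p"], "confident": True}
    return {"answer": (seed * 7) % 100, "confident": False}

def heavy(task, n_agents=8):
    attempts = [run_agent(s, task) for s in range(n_agents)]
    # Sharing phase: a confident solution propagates to everyone.
    for a in attempts:
        if a["confident"]:
            return a["answer"]
    # Fallback: compare notes and take the most common guess.
    answers = [a["answer"] for a in attempts]
    return max(set(answers), key=answers.count)

task = {"n": 391, "p": 17}   # hidden structure: 391 = 17 * 23
print(heavy(task))           # agent with seed 1 finds the trick -> 23
```

The point of the pattern is that one agent's insight can carry the whole group, which a plain vote over independent answers would miss.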
So we will introduce Grok 4 and Grok 4 Heavy, so you can click
the next slide. Yeah, yes.
So yeah. So basically, Grok 4 is the single-agent version
and Grok 4 Heavy is the multi-agent version.

(19:38):
So let's take a look at how they actually do on those exam
problems and also some real-life problems.
Yeah. So we're going to start out here, and we're actually
going to look at one of those HLE problems.
This is actually one of the easier math ones.
I don't really understand it very well.
I'm not that smart, but I can launch this job here, and we can

(19:58):
actually see how it's going to go through and start to think
about this problem. While we're doing that, I also
want to show a little bit more about what this model can
do, and launch a Grok 4 Heavy as well. So, everyone knows
Polymarket. It's extremely interesting.
It's, you know, a seeker of
truth. It aligns with what reality is most of the time.
And with Grok, what we're actually looking at is being

(20:21):
able to see how we can try to take these markets and see if we
can predict the future as well.
So as we're letting this run, we'll see how Grok 4 Heavy
goes about predicting, you know, the World Series odds for
the current teams in MLB.
And while we're waiting for these to process, we're going to
pass it over to Eric, and he's going to show you an example of

(20:43):
his. Yeah.
So I guess one of the coolest things about Grok 4 is its
ability to understand the world and to solve hard problems by
leveraging tools, like Tony discussed.
And I think one kind of cool example of this, we asked it to

(21:03):
generate a visualization of two black holes colliding.
And of course, you know, it took some liberties.
In my case, it's actually pretty clear in its thinking trace
about what these liberties are. For example, in order for it to
actually be visible, you need to really exaggerate the scale
of, you know, the waves.

(21:29):
And yeah, so here's, you know, this kind of thing in action: it
exaggerates the scale in multiple ways.
It drops off a bit less in terms of amplitude over distance.
But yeah, we can kind of see the basic effects that, you
know, are actually, like, you know, correct.

(21:51):
It starts with the inspiral, it merges, and then you have the
ringdown. And this is basically
largely correct, yeah, modulo some of the
simplifications it needed to make. You know, it's actually quite
explicit about this. You know, it uses, like,

(22:13):
post-Newtonian approximations instead of actually
computing the general relativistic effects
near the center of the black hole, which is, you know,
incorrect and, you know, will lead to some
incorrect results. But the overall, you know,
visualization is, yeah, basically there. And you can

(22:33):
actually look at the kinds of resources that it references.
So here it actually, you know, it obviously does a search.
It gathers results from a bunch of links, but also reads through
an undergraduate text on analytic
gravitational wave models. It's, yeah, it reasons quite a

(22:57):
bit about the actual constants that it should use for a
realistic simulation. It references, I guess, existing
real-world data. And yeah, it's
a pretty good model, yeah. But actually, going forward,
we can give it the same models that

(23:20):
physicists use, so it can run at the same level of compute that
leading physics researchers are using, and give you a
physics-accurate black hole simulation.
Exactly. And right now this is just running in
your browser, so. Yeah, this is just running in
your browser, exactly. Pretty simple.
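The inspiral/merger/ringdown shape mentioned in the demo can be caricatured in a few lines. This is purely qualitative, with exaggerated, invented constants; a real computation would use post-Newtonian or numerical-relativity waveform models, not this cartoon.

```python
# Toy gravitational-wave strain: an inspiral "chirp" whose frequency
# and amplitude grow toward merger, then an exponentially damped
# ringdown oscillation. All constants are made up for illustration.
import math

def toy_strain(t, t_merge=1.0, f0=5.0, f_ring=40.0, tau=0.08):
    if t < t_merge:
        # Inspiral: crude chirp, frequency ~ (t_merge - t)^(-3/8).
        dt = t_merge - t
        freq = f0 / dt**0.375
        amp = 0.1 / dt**0.25
        return amp * math.sin(2 * math.pi * freq * dt)
    # Ringdown: damped sinusoid at the ringdown frequency.
    dt = t - t_merge
    return 0.8 * math.exp(-dt / tau) * math.sin(2 * math.pi * f_ring * dt)

# 1.5 seconds sampled at 1 kHz, spanning inspiral, merger, ringdown.
samples = [toy_strain(i / 1000.0) for i in range(1500)]
print(len(samples))
```

Plotting `samples` against time shows the familiar rising chirp followed by the decaying ringdown tail.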
So swapping back real quick here, we can actually take a

(23:42):
look at the math problem; it's finished.
Let's look at the model's thinking trace here, so
you can see how it went through the problem.
I'll be honest with you guys, I really don't quite fully
understand the math, but what I do know is that I looked at the
answer ahead of time, and it did come to the correct answer
here in the final part. We can also come in and actually

(24:04):
take a look here at our World Series prediction.
It's still thinking through this one, but we can actually try
some other stuff as well. So we can actually try some
of the X integrations that we did.
So we worked very heavily with all of our X tools,
building out a really great X experience.
So we can actually ask, you know, the model, you know: find

(24:26):
me the xAI employee that has the weirdest profile photo.
So that's going to go off and start with that.
And then we can actually try out, you know: let's create a
timeline based on X posts detailing the, you know, changes
in the scores over time. And we can see, you know, all
the conversation that was taking place at that time as well.
So we can see who was announcing scores, and

(24:46):
what the reactions were at those times as well.
So we'll let that go through here and process.
And if we go back: this was the Greg Yang photo here.
So if we scroll through here, whoops.
So Greg Yang, of course, who has this favorite photograph that he
has on his account. That's actually not what he looks

(25:09):
like in real life, by the way, just so you know, but it is quite
funny. But it had to understand that
question. Yeah, that's the wild part.
It's like, it understands what a weird photo is,
what is a less or more weird
photo. It goes through, it has to find
all the team members, has to figure out who we all are, and,
you know, searches. Without access to the internal

(25:31):
xAI personnel logs, it's literally just looking
at the Internet. Exactly.
So you could ask for, like, the weirdest at any company.
Yeah, to be clear. Exactly. And we can also take a
look here at the question for Humanity's Last Exam.
So it is still researching all of the historical scores, but it
will have that final answer here soon.

(25:51):
But while it's finishing up, we can take a look at one of
the ones that we set up here a second ago.
And we can see, like, you know, it finds the date that Dan
Hendrycks initially announced it.
We can go through, we can see, you know, OpenAI announcing
their score back in February. And we can see, you know, as
progress happens, with, like, Gemini; we can see, like, Kimi.

(26:11):
And we can also even see, you know, the leaked benchmarks of
what people are saying, which, you know, if it's right, is going
to be pretty impressive. So, pretty cool.
So, yeah, I'm looking forward to seeing how everybody uses these
tools and gets the most value out of them.
But yeah, it's been great. Yeah, and we're going to close
the loop around usefulness as well.
So it's not just book smart, but actually practically

(26:34):
smart. Exactly.
All right. And we can go back to the the
slides here. Yeah, so.
Cool. So we also evaluate on
the multimodal subset. So on the full set, this is the

(26:57):
number. On the HLE exam, you can see
there's a little dip in the numbers.
This is actually something we're improving on, which is the
multimodal understanding capability.
But I do believe in a very short time we'll be able to really
improve and get much higher numbers on this, even higher

(27:17):
numbers on this benchmark, yeah. Yeah, the
base weakness of Grok currently
is that it's sort of partially blind.
Its image understanding, obviously, and
its image generation, need to be a lot better.
And that's actually being trained right now.

(27:40):
So Grok 4 is based on version 6 of our foundation model, and
we are training version 7, which we'll complete in a few weeks.
And that'll address the weakness on the vision side.
Just to finish off this last one here: the prediction

(28:00):
market finished here with the Heavy, and we can see here, we can
see all the tools and the process it used to actually go
through and find the right answer.
So it browsed a lot of odds sites.
It calculated its own odds, comparing them to the
market, to find its own alpha and edge.
It walks you through the entire process here, and it calculates

(28:20):
the odds of the winner being, like, the Dodgers, and it
gives them a 21.6% chance of winning this year.
And it took approximately four and a half minutes to compute.
Yeah, that's a lot of thinking. Yeah.
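The odds comparison described in the demo (the model's own estimate versus what the market implies) reduces to something like the sketch below. The market odds here are invented for illustration; only the 21.6% figure comes from the demo.

```python
# Sketch of comparing model odds to market-implied probabilities.
# Decimal odds -> raw implied probabilities -> normalize away the
# bookmaker margin (overround) -> compare against the model's view.

def implied_probabilities(decimal_odds):
    """Map decimal odds to implied probabilities that sum to 1."""
    raw = {team: 1.0 / o for team, o in decimal_odds.items()}
    total = sum(raw.values())
    return {team: p / total for team, p in raw.items()}

market = {"Dodgers": 4.2, "Yankees": 6.0, "Field": 1.55}  # hypothetical
probs = implied_probabilities(market)

model_estimate = {"Dodgers": 0.216}  # Grok 4 Heavy's figure in the demo
edge = model_estimate["Dodgers"] - probs["Dodgers"]
print(round(probs["Dodgers"], 3), round(edge, 3))
```

A positive edge means the model thinks the outcome is more likely than the de-margined market price implies; that gap is the "alpha" the transcript mentions.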

(28:44):
We can also look at all the other benchmarks besides HLE.
As it turned out, Grok 4 excelled on all the reasoning benchmarks
that people usually test on, including GPQA, which is a PhD-level
problem set that's easier compared to HLE.

(29:04):
On AIME25, the American Invitational Mathematics Examination,
with Grok 4 Heavy we actually got a perfect score; also on one of
the coding benchmarks, called LiveCodeBench.
And also on HMMT, the Harvard-MIT Math Tournament, and USAMO.
You can see that on all of those benchmarks, we often have a very

(29:30):
large leap over the second-best model out there.
Yeah, I mean, really, we're going to get to the point
where it's going to get every answer right in every exam, and
where it doesn't get an answer right, it's going to tell you
what's wrong with the question. Or, if the question is
ambiguous, disambiguate the question into answers A, B, and C,

(29:51):
and tell you what answers A, B, and C would be with a
disambiguated question. So the only real test then will
be reality. Can it make useful technologies,
discover new science? That'll actually be the only
thing left, because human tests will simply not be meaningful.

(30:12):
We may need an update to HLE very soon, given the current rate
of progress. So yeah, it's super cool to see,
like, multiple agents that collaborate with each other,
solving really challenging problems.
So where can we try this model? It turns out it's available
right now. If we advance to the next slide,
there is a SuperGrok Heavy tier that we're

(30:35):
introducing, where you're able to access both Grok 4 and Grok 4
Heavy, where you're actually going to be the taskmaster of a
bunch of little Grok research agents that help you, you know,
become smarter, do all the little research, and save hours
of time, you know, going through mundane tasks.
And it's available right now. So, yeah, we did limit usage

(30:58):
during the demo, so that it didn't break
the demo, because all this stuff is happening live.
There's nothing canned about any of the
tests that we're doing. So after the demo is
done, we'll enable more subscribers for SuperGrok.
So if you can't subscribe right now, just try in half an hour.
It should work. And now let's get

(31:22):
into voice. Great.
Abby. So many of you have been
enjoying our voice mode, and we've been working hard to
improve the experience. Over the past couple of months, we
have cut latency in half to make it much snappier, and today
we're excited to announce a set of new voices that have
exceptional naturalness and prosody.

(31:45):
You might have noticed the movie-trailer voice that opened up the
live stream. That is one of our voices, Sal,
who's got that epically deep tone.
And we're also excited to introduce Eve, a beautiful
British voice who's capable of rich emotions.
Aman, would you like to introduce Eve?
Absolutely. Yeah.

(32:05):
Let's get into the demo. Hey, Eve, millions of people are
watching us right now. Are you excited?
Oh, I'm thrilled, really. It's like being on stage at the
Old Victoria with a sea of faces out there.
Just imagine all those eyes and ears tuned in.
It's a proper buzz, isn't it? What do you reckon we should

(32:26):
chat about to keep this lot entertained?
You know, we have a lot to talk about, but I'll be honest, I'm
feeling a little nervous. Can you whisper something
soothing to calm me down? Take a deep breath, love.
You've got this. It's just you and me having a
quiet chat, like we're tucked away in a cozy corner of a
Yorkshire pub. The world's just a murmur out

(32:49):
there. Feel that calm wash over you.
Let's keep talking nice and easy.
What's on your mind? You know what, I'm actually
feeling quite relaxed to try this something crazy.
So can you give us an opera on Diet Coke?
Oh, an opera on Diet Coke. That's a riot.

(33:12):
Picture this: a shimmering silver can glistening under
stage lights. O Diet Coke, thou elixir divine,
with bubbles that dance in a sparkling line.
Thy crisp, cool kiss on lips. So fine.
Oh aspartame ambrosia forever mine.

(33:38):
How's that for a mad little aria?
Want me to belt out another verse or switch up the tune?
OK, let's try switching it up and just start singing.
Right then, let's have a go. O silver can, you gleam in the
light; Diet Coke, my heart's delight.
OK, I won't project. It's very, very poetic.

(34:00):
Yeah. Thank you, Eve.
Feeling quite nice now. Cool.
So that's Eve. And so something else we do here
is a little test each day to check on the natural flow of our
conversations against some of our competitors.
So we figured, why not try that live on the stream today?

(34:22):
Should I start? Hey, I'm going to say a number,
and can you say the same number back to me? One.
One. Got it.
This is like a fun little game of echo.
What's next? Two. Sorry, you said two? Three. Three. Four.

(34:49):
Nice and steady. What's next?
Five. Nailed it.
This is pretty fun, actually. Feel free to keep going or
switch it up if you'd like. Hey Grok.

(35:09):
Hey Grok. Hello there, how can I help you
today? We're going to do a little test.
Can you repeat after me? One. One.
All right, what's next? Two.
Two. What's on your mind?
Three. Three. Need anything else?
Four. Four. How can I assist you? Five.

(35:32):
Five. What's next? So, as you can see, Grok was
snappier. It didn't interrupt me.
And the prosody: we've made different design choices.
I think we're shooting for something more calm, smooth,
more natural, versus something that's more poppy or artificial.
So we'll keep improving on these fronts.
All right. Thanks, guys.
Yeah. Yep.

(35:58):
So since the launch of the voice model, we've actually seen 2X
faster end-to-end latency in the last 8 weeks, five different voices,
and also 10X the active users. So Grok Voice is taking off. Now,
thinking about releasing the models this time, we're also
releasing Grok 4 through the API at the same time.

(36:21):
So if we go to the next two slides: you know, we're very
excited about, you know, what all the developers out there are
going to build. So, you know, if I think about
myself as a developer, what's the first thing I'm going to do when
I actually have access to the Grok 4 API? Benchmarks.
So we actually asked around on our X platform: what is the most
challenging benchmark out there that, you know, is considered

(36:43):
the Holy Grail for all the AGI models.
So it turned out, AGI is in the name: ARC-AGI.
So in the last 12 hours, you know, kudos to Greg over here in the
audience, who answered our call,
took a preview of the Grok 4 API, and independently verified, you
know, Grok 4's performance. So initially we thought, hey,

(37:04):
Grok 4, just, you know, we think it's pretty good.
It's pretty smart. It's our next-gen reasoning
model. Spent 10X more compute, can use
all the tools, right. But it turned out, when we actually
verified on the private subset of ARC-AGI v2, it was the
only model in the last three months that breaks the 10%
barrier, and in fact it was so good that it actually got 16%, well,

(37:27):
15.8% accuracy, 2X the second place.
That is the Claude Opus model. And it's not just about
performance, right? When you think about
intelligence, having the API model drive the automation,
it's also about the intelligence per dollar, right?
If you look at the plots over here, Grok is

(37:48):
just in a league of its own. All right, so enough of
benchmarks over here, right? So what can Grok actually do in
the real world? So we actually, you know,
contacted the folks from Andon Labs, who were, you know,
gracious enough to, you know, try Grok in the real world, to

(38:08):
run a business. Yeah, thanks for having us.
So I'm Axel from Andon Labs. And I'm Lukas, and we tested
Grok 4 on Vending-Bench. Vending-Bench is an AI
simulation of a business scenario, where we thought: what is the
most simple business an AI could possibly run? And we
thought vending machines. So in this scenario, the

(38:29):
Grok and other models need to do stuff like manage inventory,
contact suppliers, set prices.
All of these things are super easy, and all
the models can do them one by one.
But when you do them over very long horizons, most models
struggle. But we have a leaderboard, and
there's a new number one. Yeah, so we got early access to

(38:50):
the Grok 4 API. We ran it on Vending-Bench
and we saw some really impressive results.
So it ranks definitely at the number one spot.
It even doubled the net worth, which is the measure that we
have on this. So it's not about the percentage
or score you get, but more the dollar value in net
worth that you generate. So we were impressed by Grok 4.
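As described here, Vending-Bench scores an agent not by a percentage but by the net worth it builds while juggling inventory, suppliers, and prices over a long horizon. Below is a toy sketch of that loop in Python; the demand curve, prices, and restock rule are made-up illustrative numbers, not Andon Labs' actual benchmark rules:

```python
import random

def run_vending_sim(days=100, seed=0):
    """Toy long-horizon vending simulation, loosely in the spirit of
    Vending-Bench. All numbers below are illustrative assumptions."""
    rng = random.Random(seed)
    cash, inventory = 500.0, 0      # starting bankroll, empty machine
    unit_cost, price = 1.00, 2.50   # wholesale cost vs. sale price
    daily_fee = 2.0                 # flat operating fee per day

    for _ in range(days):
        # Fixed policy: reorder a case of 50 units whenever stock runs low.
        if inventory < 10 and cash >= 50 * unit_cost:
            inventory += 50
            cash -= 50 * unit_cost
        demand = rng.randint(5, 15)  # customers who show up today
        sold = min(demand, inventory)
        inventory -= sold
        cash += sold * price
        cash -= daily_fee

    # Score: cash plus unsold stock valued at cost, i.e. net worth.
    return cash + inventory * unit_cost

print(round(run_vending_sim(), 2))
```

Each step is trivial on its own, which is the point being made: the hard part for a model is sticking to a coherent policy like this over hundreds of simulated days without drifting.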

(39:11):
It was able to formulate a strategy and adhere to that
strategy over a long period of time, much longer than other
models that we have tested, other frontier models.
So it managed to run the simulation for double the time
and score, yeah, double the net worth.
And it was also really consistent across these runs,
which is something that's really important when you want to use

(39:32):
this in the real world. And I think as we give more and
more power to AI systems in the real world, it's important that
we test them in scenarios that either mimic the real world or
are in the real world itself, because otherwise we fly
blind into some things that might not be great.
Yeah, it's, it's great to see that we've now got a way to pay

(39:54):
for all those GPUs. So we just need a million
vending machines, and we could make $4.7 billion a year with
a million vending machines. 100%. Let's go.
It can be epic vending machines. Yes, yes. All right.
We are actually going to installvending machines here, like a
lot of them. We're happy to supply them.
All right, thank you. All right.

(40:16):
I'm looking forward to seeing what amazing things are in this
vending machine. That's for you to
decide. All right, tell the AI.
OK, sounds good. All right.
Yeah. I mean, so we can see Grok
is able to become like the copilot of the business unit.
So what else can Grok do? So we're actually releasing this
Grok, if you want to try it right now, to evaluate and run the same

(40:38):
benchmark as us. It's on the API, with a 256K
context length. So we already actually see some
of the early adopters trying the Grok 4 API.
So our Palo Alto neighbor Arc Institute, which is a leading
biomedical research center, is already seeing how

(40:59):
they can automate their research flows with Grok 4.
It turns out it's able to help the scientists to
sift through, you know, millions of experiment logs and then,
you know, just pick the best hypothesis within a split
second. We see this being used for
their CRISPR research, and also, you know, Grok 4 was

(41:19):
independently evaluated and scored as the best model to examine
chest X-rays, who would have known? And in the financial sector,
we also see, you know, Grok 4, with access to all the tools
and real-time information, is actually one of the most popular
AIs out there. So, you know, Grok 4 is
also going to be available on the hyperscalers.

(41:40):
So the xAI enterprise sector only, you know, started two
months ago, and we're open for business.
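For developers who want to try the API mentioned here, a minimal Python sketch follows. The endpoint URL, the `grok-4` model id, the `XAI_API_KEY` variable name, and the OpenAI-style payload shape are assumptions for illustration; confirm them against xAI's official API documentation before relying on them:

```python
import json
import os
import urllib.request

# Assumed OpenAI-compatible chat-completions endpoint (not confirmed here).
API_URL = "https://api.x.ai/v1/chat/completions"

def build_request(prompt, model="grok-4", max_tokens=512):
    """Assemble the JSON payload for a single chat-completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask_grok(prompt):
    """Send the request if an API key is configured; otherwise return None."""
    key = os.environ.get("XAI_API_KEY")  # assumed env var name
    if not key:
        return None  # dry run: no key set, nothing is sent
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

print(build_request("Hello, Grok.")["model"])  # prints "grok-4"
```

Splitting payload construction from the network call makes the sketch testable without a key, and the same payload shape would serve for running benchmarks like the ones discussed above.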
Yeah. So the other thing we talked a
lot about is, you know, having Grok make games, video games.
So Danny is actually a video game designer on X.
So, you know, we mentioned, hey, who wants to try out some Grok

(42:05):
4 preview APIs to make games? And then he answered the call.
So he actually just made this first-person shooter game in the
span of four hours. So one of the actually
underappreciated hardest problems of making video games is not
necessarily coding the core logic of the game, but actually
going out to source all the assets, all the texture files,

(42:28):
and, you know, to create a visually appealing game.
So one of the core things Grok 4 does really well, with
all the tools out there, is actually being able to automate these
asset-sourcing capabilities,
so the developers can just focus on the core development itself
rather than, like, you know... so now you can run an, you know,
entire game studio with, like, one person.

(42:52):
And then you can have Grok 4 go out and source all those
assets, automating tasks for you.
Yeah, now the next step obviously is for Grok to be
able to play the game. So it has to have very good
video understanding, so it can play the games and interact with
the games, and actually assess whether a game is fun, and

(43:12):
actually have good judgment for whether a game is
fun or not. So with version seven
of our foundation model, which finishes training this month,
and then we'll go through post-training, RL and whatnot, that
will have excellent video understanding.
And with that video understanding and
improved tool use. For example, for video

(43:34):
games, you'd want to use, you know, Unreal Engine or Unity or
one of the main graphics engines, and then
generate the art, apply it to a 3D model and then
create an executable that someone can run on a PC or a
console or a phone. We expect that to happen

(43:59):
probably this year, and if not this year, certainly next year.
So it's going to be wild. I would expect the first really
good AI video game to be next year, and probably the first half

(44:20):
hour of watchable TV this year, and probably the first watchable
AI movie next year. Like, things are really moving at
an incredible pace. Yeah, and when Grok is connecting the
world economy with vending machines, we'll just create
video games for humans. Yeah, I mean, it went from not

(44:40):
being able to do any of this really even six months ago to
what you're seeing before you here, and from very
primitive a year ago to making a AAA-style 3D video game with
a few hours of prompting. Yep.

(45:01):
I mean, yeah, just to recap. So in today's livestream, we
introduced the most powerful and most intelligent AI model out
there, one that can actually reason from first principles using
all the tools, do all the research, go on a
journey for 10 minutes, and come back with the most correct
answer for you. So it's kind of crazy to think
that just four months ago we had Grok 3, and now we already

(45:23):
have Grok 4, and we're going to continue to accelerate. As a
company, xAI is going to be the fastest-moving AGI company out
there. So what's coming next is that
we're going to, you know, continue developing models
that are not just, you know, intelligent, thinking for a
really long time and spending a lot of compute.
Having a model that is actually both fast and smart is going

(45:44):
to be the core focus, right? So if you think about what
applications out there can really benefit from these
very intelligent, fast and smart models, coding is
actually one of them. Yeah.
So the team is currently working very heavily on coding models.
I think right now the main focus is that we actually recently
trained a specialized coding model, which is going to be both fast

(46:07):
and smart. And I believe we can share
that model with all of you in a few weeks.
Yeah. Yeah, that's very exciting.
And you know, second after coding, we all see that the
weakness of Grok 4 is its multimodal capabilities.
So in fact, it was so bad that, you know, Grok was effectively just

(46:30):
like looking at the world squinting through glass, seeing
all the blurry, you know, features and trying to
make sense of it. The most immediate improvement
we're going to see with the next-generation pretrained model is
a step-function improvement in the
model's capabilities in terms of image understanding, video
understanding and audio, right? Then the model is able to hear

(46:51):
and see the world just like any of you, right?
And now, with all the tools at its command, with all the other
agents it can talk to, you know, we're going to see a huge
unlock for many different application layers with the
multimodal agents. What's going to come after that is
video generation. And we believe that, you know,

(47:12):
at the end of the day, it should just be, you know, pixels in,
pixels out. And, you know, imagine a world
where you have this infinite scroll of content inventory
on the X platform, where not only can you actually watch these
generated videos, but you're able to intervene and create your own
adventures. It's just going to be wild.

(47:36):
And we expect to be training our video model with over 100,000
GB200s, and to begin that training within the next 3 or 4
weeks. So we're confident it's going to
be pretty spectacular in video generation and video
understanding. So let's see.

(47:58):
So, is there anything you guys want to say?
Other than that, I guess that's it.
Yeah, it's, it's a good model, Sir.
It's a good, yeah. Well, we're very excited for you
guys to try Grok 4. Yeah, thank you.
All right, Thanks, everyone. Thank you.
Good night. Hey, thank you so much for
listening today. I really do appreciate your

(48:20):
support. If you could take a second and
hit the subscribe or the follow button on whatever podcast
platform that you're listening on right now, I greatly
appreciate it. It helps out the show
tremendously and you'll never miss an episode.
And each episode is about 10 minutes or less to get you
caught up quickly. And please, if you want to
support the show even more, go to patreon.com/stage Zero.

(48:44):
And please take care of yourselves and each other.
And I'll see you tomorrow.