Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Welcome to the Grok 4 release. This is the.
Speaker 2 (00:03):
Smartest AI in the world. We're going to show you
exactly how and why. It really is remarkable to see
the advancement of artificial intelligence, how quickly it is evolving.
I sometimes compare it to the growth of a human,
how fast a human learns and gains conscious awareness and understanding.
(00:24):
And AI is advancing vastly faster than any human.
I mean, we're going to take you through a bunch
of benchmarks that Grok 4 is able to achieve
incredible numbers on.
Speaker 1 (00:39):
But it's actually worth noting.
Speaker 2 (00:41):
That Grok 4, if given, like, the SAT, would
get perfect SATs every time, even if it's never seen
the questions before. And even going beyond that to,
say, graduate student exams like the GRE, it will
get near-perfect results in every discipline of education,
(01:04):
so from the humanities to, like, languages, math, physics, engineering,
pick anything. And we're talking about questions that it's never
seen before; these are not on the Internet. And
Grok 4 is smarter than almost all graduate students in
all disciplines simultaneously. Like, it's actually just important to appreciate
(01:26):
that. That's really something. And the reasoning capabilities of
Grok are incredible. There are some people out there who
think AI can't reason, and look, it can reason
at superhuman levels. So yeah, and frankly, it only
gets better from here. So we'll take you through the
(01:47):
Grok 4 release and share with you the pace of
progress here. I guess the first part is, in terms
of the training, we're going from
Grok 2 to Grok 3 to Grok 4. We essentially
increased the training by an order of magnitude in each case,
so it's one hundred times more training than Grok 2,
(02:08):
and that's only going to increase. So it's, yeah, frankly,
I mean, I don't know, in some ways a little terrifying,
but the growth of intelligence here is remarkable.
Speaker 3 (02:18):
Yes, it's important to realize there are two types of
training compute. One is the pre-training compute; that's from
Grok 2 to Grok 3. But from Grok 3 to
Grok 4, we're actually putting a lot of compute into
reasoning, into RL.
Speaker 4 (02:32):
And just like you said, this is literally the fastest
moving field, and Grok 2 is like a high school
student by today's standard. If you look back at the
last twelve months, Grok 2 was only a concept; we didn't
even have Grok 2 twelve months ago. And then by
training Grok 3, that was the first time we scaled up
the pre-training. We realized that if you actually
do the data ablations really carefully, and the infra and also
(02:55):
the algorithms, we can actually push the pre-training quite
a lot, on the order of ten x, to make the model
the best pre-trained base model. And that's why we
built Colossus, the world's supercomputer, with one hundred thousand H100s.
And then with the best pre-trained model, we realized that
if you can collect these verifiable outcome rewards, you can
(03:17):
actually train this model to start thinking from first principles,
to reason and correct its own mistakes, and that's where
the Grok reasoning comes from. And today we asked the
question: what happens if you take the expansion of
Colossus, with all two hundred thousand GPUs, and put all of this,
ten x more compute than any of the models
out there, into reinforcement learning, at unprecedented scale?
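For readers wondering what "verifiable outcome rewards" might look like in practice, here is a minimal sketch in Python. All names and interfaces here are hypothetical, invented for illustration; the actual xAI training stack is not public. The core idea is that a programmatic checker grades the model's final answer, and that pass/fail signal is what the reinforcement learning step optimizes.

```python
# Minimal sketch of RL with verifiable outcome rewards.
# All interfaces (policy, problem.check) are hypothetical.

def verifiable_reward(problem, answer):
    """Grade the final answer programmatically, e.g. exact match
    against a known solution, or unit tests for generated code."""
    return 1.0 if problem.check(answer) else 0.0

def rl_step(policy, problems):
    trajectories = []
    for problem in problems:
        # The model produces a reasoning trace, then a final answer.
        reasoning, answer = policy.generate(problem.prompt)
        reward = verifiable_reward(problem, answer)
        trajectories.append((problem.prompt, reasoning, answer, reward))
    # Reinforce reasoning traces that led to verifiably correct answers.
    policy.update(trajectories)
```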
Speaker 3 (03:39):
What's going to happen?
Speaker 4 (03:41):
So this is the story of Grok 4, and, you know,
Tony, share some insights with the audience.
Speaker 3 (03:47):
Yeah, so let's just talk about how smart Grok
4 is. I guess we can start by discussing
this benchmark called Humanity's Last Exam. This
benchmark is a very, very challenging benchmark. Every single problem
is curated by subject matter experts. It's twenty-five
hundred problems in total, and it consists of many different subjects: mathematics,
(04:11):
natural sciences, engineering, and also all of the humanities subjects. So
essentially, when it was first released earlier this year,
most of the models out there could only get single
digit accuracy on this benchmark. Yeah, so we can look
at some of those examples. There is this mathematical problem
which is about natural transformations in category theory, and there's
(04:35):
this organic chemistry problem that talks about electrocyclic reactions.
And also there's this linguistics problem that asks
you to distinguish between closed and open syllables in a
Hebrew source text. So you can see it's a
very wide range of problems, and every single problem is
(04:56):
at a PhD or even advanced research level.
Speaker 2 (05:00):
Yeah, I mean, there are no humans that can
actually answer these and get a good score. I mean,
if you ask, say, any given human,
what's the best that any human could score, I'd
say maybe five percent, optimistically. So this is much harder
than what any human can do. It's incredibly difficult.
(05:21):
And you can see from the types of questions, like,
you might be incredible in linguistics or mathematics or chemistry
or physics or any one of a number of subjects, but
you're not going to be at a postgrad level
in everything. And Grok 4 is at a postgrad level in
everything. Some of these things are just
worth repeating: Grok 4 is postgraduate, like PhD level,
(05:46):
in everything, better than PhD. Most PhDs would fail,
so it's better, that said, I mean, at least with
respect to academic questions. I just want to emphasize
this point: with respect to academic questions, Grok 4 is better
than PhD level in every subject, no exceptions. That doesn't
mean that it's, you know... at times it may lack common sense,
(06:08):
and it has not yet invented new technologies or discovered
new physics, but that is just a matter of time.
It may discover new technologies as soon as later this year,
and I would be shocked if it has not done
so by next year. So I would expect Grok to literally
(06:29):
discover new technologies that are actually useful no later than
next year, and maybe by the end of this year. It might
discover new physics next year, and within two years, I'd
say almost certainly. So just let that sink in.
Speaker 3 (06:41):
Okay, so I guess we can talk about what's
behind the scenes of Grok 4. As Jimmy mentioned, we
actually poured a lot of compute into this training. When
it started, it was only at a single digit number. But as
we put in more and more training compute, it
started to gradually become smarter and smarter, and eventually solved
(07:06):
a quarter of the HLE problems. And this is without
any tools. The next thing we did was to add
tool capabilities to the model. Unlike Grok 3... I
think Grok 3 is actually able to use tools as well,
but here we make it more native, in the
sense that we put the tools into training. Grok 3
(07:27):
was only relying on generalization. Here we actually put the
tools into training, and it turns out this significantly improves
the model's capability of using those tools. So how is
this different? Grok 3 was exactly the Grok 3 reasoning model
without any tool-specific training; we only asked it to
use those tools. So compared to this, it was much
(07:49):
weaker in terms of its tool capabilities, and unreliable.
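As a rough illustration of the difference between merely asking a model to use tools and putting tools into the RL training loop, here is a sketch; the interfaces are hypothetical, invented for this example. The model can emit tool calls mid-reasoning, the environment executes them and appends the results, and the verifiable reward at the end grades the whole tool-using trajectory.

```python
# Sketch of a tool-augmented rollout inside RL training.
# Tool names and the policy/step interfaces are hypothetical.

def rollout_with_tools(policy, problem, tools, max_turns=20):
    transcript = [problem.prompt]
    for _ in range(max_turns):
        step = policy.generate(transcript)
        transcript.append(step)
        if step.is_tool_call:
            # e.g. tools = {"search": ..., "code_exec": ...}
            transcript.append(tools[step.name](**step.args))
        elif step.is_final_answer:
            break
    # The entire trajectory, tool calls included, is graded by the
    # verifiable outcome reward, so effective tool use gets reinforced.
    return transcript
```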
Speaker 2 (07:54):
Yes, yes, and to be clear, this is still
fairly primitive tool use.
If you compare it to, say, the tools that are
used at Tesla or SpaceX, where you're using finite element analysis
and computational fluid dynamics, and you're able to run,
say, at Tesla, crash simulations where the
(08:14):
simulations are so close to reality that if the test
doesn't match the simulation, you assume that the test article
is wrong. That's how good the simulations are. So Grok
is not currently using any of the tools that a
company would use, but that is something that we will
provide it with later this year, so it will have the
tools that a company has, and a very accurate physics simulator. Ultimately,
(08:37):
the thing that will make the biggest difference is being
able to interact with the real world via humanoid robots.
So you combine Grok with Optimus, and it can actually
interact with the real world: it can formulate a hypothesis
and then confirm whether that hypothesis is true or not.
Speaker 1 (08:57):
So, really, you know, I think about, like, where
we are today.
Speaker 2 (09:00):
We're at the beginning of an immense intelligence explosion. We're
in the intelligence big bang right now, and this is the most
interesting time to be alive of any time in history. Now,
that said,
Speaker 1 (09:13):
We need to make sure that the AI is a
good AI.
Speaker 2 (09:16):
The thing that I think is most important for AI safety,
at least my biological neural net tells me the most
important thing for AI, is to be maximally truth-seeking.
You can think of AI as this super-genius child
that ultimately will outsmart you, but you can still instill
the right values: encourage it to be, sort of, you know, truthful, honorable,
(09:39):
you know, good things, like the values you want to instill
in a child that will ultimately grow up to be incredibly powerful. Yeah,
these are still primitive tools and not the kind of
tools that serious commercial companies use. But we will provide
it with those tools, and I think it will be
able to solve real-world technology problems.
Speaker 3 (09:56):
Yes, yes, exactly.
Speaker 4 (09:58):
But is compute all you need? Is it
just compute all you need at this point?
Speaker 2 (10:02):
Well, you need compute plus the right tools, and then
ultimately to be able to interact with the physical world.
And then we will effectively have an economy that is
ultimately thousands of times bigger than our current economy, or
maybe millions of times. If you think of civilization as
percentage completion of the Kardashev scale, where Kardashev one is
(10:24):
using all the energy output of a planet, and Kardashev
two is using all the energy output of a sun,
and three is all the energy output of a galaxy,
we're only, in my opinion, probably closer to one percent
of Kardashev one than we are to ten percent. So,
like, maybe one or two percent of Kardashev one.
(10:45):
So we.
Speaker 1 (10:45):
Will get to most of the weight, like an.
Speaker 2 (10:48):
eighty, ninety percent of Kardashev one. And then hopefully, if civilization
doesn't self-annihilate, the actual notion of a human economy,
assuming civilization continues to progress, will seem very quaint in retrospect.
It will seem like, sort of, cavemen throwing sticks into
a fire level of economy compared to what the future
will hold. It's very exciting. I've been at times kind
(11:12):
of worried about, like, well, you know, it seems
somewhat unnerving to have intelligence created that is far
greater than our own. And will it be
good for humanity?
Speaker 1 (11:26):
I think it'll be good. Most likely it'll be good.
Speaker 2 (11:29):
But I've somewhat reconciled myself to the fact that even
if it wasn't going to be good, I'd at least
like to be alive to see it happen.
Speaker 3 (11:36):
So yeah, yeah, I think a technical problem that we
still need to solve, besides just compute, is how do
we unblock the data bottleneck. When we tried to
scale up the RL in this case, we did invent
a lot of new techniques and innovations to allow us to
figure out how to find a lot of challenging
(11:59):
problems to work on. It's not just that the problem itself
needs to be challenging; you also need to have,
like, a reliable signal to
tell the model: you did it wrong, you did it right.
This is sort of the principle of reinforcement learning, and
as the model gets smarter and smarter, the number of
good, challenging problems will be fewer and fewer.
(12:21):
So it's going to be a new type of challenge
that we need to surpass, besides just compute.
Speaker 2 (12:26):
Yeah, we're actually running out of test questions
to ask. Even questions that are ridiculously hard,
if not essentially impossible, for humans, any written-down
questions, are becoming trivial for AI. You know, the one
thing that is an excellent judge of things is reality.
Because physics is the law; ultimately everything else
(12:49):
is a recommendation.
Speaker 1 (12:50):
You can't break physics.
Speaker 2 (12:51):
So the ultimate test, I think, for whether an AI...
the ultimate reasoning test is reality. So you invent
a new technology, like, say, improve the design of a
Speaker 1 (13:02):
Car or a rocket, or create a new medication. Does
it work?
Speaker 2 (13:07):
Does the rocket get to orbit? Does the car drive?
Does the medicine work? Whatever the case may be, reality
is the ultimate judge here. So it's going to be
reinforcement learning, closing the loop around reality.
Speaker 3 (13:19):
We asked the question: how do we go even further?
So actually, we were thinking: now, with a single agent,
we're able to solve forty percent of the problems.
What if we have multiple agents running at the same time? So
this is what's called test-time compute. And as we
scale up the test-time compute, we're actually able to
solve more than fifty percent of the text-
(13:42):
only subset of the HLE problems. So it's a remarkable achievement.
I think this is insanely difficult.
Speaker 2 (13:49):
So we're saying a majority of the text-based subset
of the, you know, scarily named Humanity's Last Exam, Grok
4 can solve. You can try it out for yourself
with Grok 4 Heavy. What it does is it spawns
multiple agents in parallel, and all of those agents do the
work independently, and then they compare their work and they
Speaker 1 (14:09):
decide which one is best. It's like a study group.
Speaker 2 (14:12):
It's not as simple as a majority vote, because often
only one of the agents actually figures out the trick
or figures out the solution. But once they share
the trick, or figure out what the real nature of
the problem is, they share that solution with the other
agents, and then they compare notes and yield an answer.
(14:32):
So that's the Heavy part of Grok 4: it's where
you scale up the test-time compute by roughly an
order of magnitude, have multiple agents tackle the task, and
then they compare their work and they put forward
Speaker 1 (14:46):
What they think is the best result.
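To make the scheme concrete, here is a minimal sketch of that parallel test-time-compute pattern. The agent interface (attempt, revise, score) is hypothetical, invented for illustration; the point is the three phases just described: independent attempts, note sharing, then a collective choice rather than a bare majority vote.

```python
import concurrent.futures

def solve_heavy(task, agents):
    """Sketch of a Grok 4 Heavy-style multi-agent run (hypothetical API)."""
    # Phase 1: agents work on the task independently, in parallel.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        attempts = list(pool.map(lambda a: a.attempt(task), agents))
    # Phase 2: share every attempt (including the one agent that may
    # have found "the trick") and let each agent revise its answer.
    revised = [agent.revise(task, attempts) for agent in agents]
    # Phase 3: put forward the answer the group collectively rates best;
    # this is richer than a simple majority vote over raw attempts.
    return max(revised, key=lambda ans: sum(a.score(task, ans) for a in agents))
```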
Speaker 3 (14:48):
Yeah, so we will introduce Grok 4 and Grok 4 Heavy.
You can click to the next slide. Yeah, so,
basically, Grok 4 is the single-agent version,
and Grok 4 Heavy is the multi-agent version. So let's take
a look at how they actually do on those exam problems
and also some real-life problems.
Speaker 5 (15:09):
Yeah. So we're going to start out here, and we're
actually going to look at one of those HLE problems.
This is actually one of the easier math ones. I
don't really understand it very well, I'm not that smart,
but I can launch this job here, and we can
actually see how it's going to go through and start
to think about this problem. While we're doing that, I
also want to show a little bit more about what
this model can do and launch a Grok 4 Heavy run
(15:30):
as well. So, everyone knows Polymarket. It's extremely interesting; it
aligns with what reality is most of the time, and
with Grok, what we're actually looking at is being able
to see how we can try to take these markets
and see if we can predict the future as well.
So as we're letting this run, we'll see how Grok 4
Heavy goes about predicting the World Series odds for the
(15:53):
current teams. And while we're waiting for these to process,
we're going to pass it over to Eric, and he's
going to show you an example of his.
Speaker 6 (15:59):
Yeah, so, I guess one of the coolest things about
Grok 4 is its ability to understand the world and
to solve hard problems by leveraging tools, like Tony discussed,
and I think one kind of cool example of this:
we asked it to generate a visualization of two black
holes colliding. It's actually pretty clear, in its
(16:21):
thinking trace, about what the caveats are. For example, in
order for it to actually be visible, you need to really exaggerate
the scale of the waves. And yeah, so here's, like,
you know, this kind of thing in action. It exaggerates the scale
in, like, multiple ways. It drops off less in terms
of amplitude over distance, but we can see the basic
(16:43):
effects, which are actually correct. It starts with the inspiral,
the merger, and then you have the ringdown. This is
basically largely correct, modulo some of the simplifications it needed
to make. It's actually quite explicit about this, but it uses
post-Newtonian approximations instead of actually computing the general relativistic
(17:05):
effects near the center of the black holes, which is
incorrect and, you know, will lead to, you know, somewhat
incorrect results. But the overall, you know, visualization is, yeah,
it's basically there. And you can actually look at the
kinds of resources that it references. So here, it actually,
you know, it obviously uses search. It gathers results from
(17:26):
a bunch of links, but it also reads through an undergraduate
text on analytic gravitational wave models. It reasons quite a
bit about the actual constants it should use for
a realistic simulation. It references existing real-world data. It's
a pretty good model.
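For context on the physics being approximated here, the leading-order (quadrupole) post-Newtonian result gives the inspiral's frequency evolution as df/dt = (96/5) π^(8/3) (G·Mc/c³)^(5/3) f^(11/3), where Mc is the chirp mass. The sketch below integrates that equation; it is our illustration of the kind of approximation described, not the code the model produced, and real pipelines add higher-order terms plus numerical relativity for the merger and ringdown.

```python
import numpy as np

# Leading-order post-Newtonian inspiral "chirp": frequency sweeps up
# as df/dt = (96/5) * pi**(8/3) * (G*Mc/c**3)**(5/3) * f**(11/3).
G, c, M_SUN = 6.674e-11, 2.998e8, 1.989e30

def chirp_frequencies(f0=30.0, m1=30 * M_SUN, m2=30 * M_SUN,
                      dt=1e-4, f_max=300.0):
    mc = (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2        # chirp mass
    k = (96 / 5) * np.pi ** (8 / 3) * (G * mc / c ** 3) ** (5 / 3)
    f, freqs = f0, []
    while f < f_max:                                 # crude cutoff near merger
        f += k * f ** (11 / 3) * dt                  # simple Euler step
        freqs.append(f)
    return np.array(freqs)

# The rising tone: slow growth early, runaway acceleration near merger.
fs = chirp_frequencies()
print(f"{len(fs)} steps from 30 Hz to 300 Hz")
```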
Speaker 1 (17:45):
Going forward.
Speaker 2 (17:45):
We can give it the same models that physicists use,
so it can run at the same level of compute that
leading physics researchers are using and give you a physics-
accurate black hole simulation.
Speaker 5 (17:56):
And this right now is running in your browser.
Speaker 1 (17:58):
This is just running in your browser. Pretty simple.
Speaker 5 (18:00):
Swapping back real quick here, we can actually take a look:
the math problem is finished. The model was able to solve it.
Let's look at its thinking trace here so you can
see how it went through the problem. I'll be honest
with you guys, I really don't quite fully understand the math.
But what I do know is that I looked at
the answer ahead of time, and it did come to
the correct answer here in the final part. We
can also come in and actually take a look here
(18:22):
at our World Series prediction. It's still thinking through
on this one, but we can actually try some other
Speaker 1 (18:27):
Stuff as well.
Speaker 5 (18:28):
So we worked very heavily on all of
our X tools and building out a really great X
experience. So we can actually ask, you know, the model,
you know: find me the xAI employee that has the
weirdest profile photo. And then we can actually try out,
you know: let's create a timeline based on X posts
detailing the, you know, changes in the scores over time,
(18:49):
and we can see, you know, all the conversation that
was taking place at that time as well. So we
can see who was, you know, announcing scores and,
like, what the reactions were at those times as well.
If we go back: this was the Greg Yang
photo here. So Greg Yang, of course, has this
favorite photograph on his account; that's actually
not how he looks in real life.
Speaker 2 (19:09):
By the way, it had to understand that question, yeah,
which is... that's the wild part.
Speaker 1 (19:13):
It is like it understands what a weird photo is.
What is a weird photo?
Speaker 7 (19:18):
Yeah?
Speaker 1 (19:18):
What is a less or more weird photo?
Speaker 5 (19:21):
It goes through, it has to find all the team members,
has to figure out who we all are, right, you know.
Speaker 2 (19:25):
It searches without access to the internal xAI personnel records,
literally just looking at the internet, exactly. So you
could ask for, like, the weirdest of any company.
Speaker 5 (19:34):
Yeah. And we can also take a look here at
the question about Humanity's Last Exam. So it is
still researching all of the historical scores, but it
will have that final answer here soon. While it's finishing up,
we can take a look at one of the ones
that we set up here a second ago, and we
can see, you know, that it finds the date that
Dan Hendrycks had initially announced it. We can go through,
we can see, you know, OpenAI announcing their score
(19:57):
back in February, and we can see, you know, progress
happening with, like, Gemini. We can see, like, Kimi, and
we can also even see, you know, the leaked benchmarks
of what people are saying; you know, if it's right,
it's going to be pretty impressive.
Speaker 1 (20:10):
So pretty cool.
Speaker 3 (20:11):
But yeah, it's great.
Speaker 2 (20:14):
Yeah, we're going to close the loop around usefulness as well,
so it's not just book smart, but
actually practically smart, exactly.
Speaker 5 (20:22):
And we can go back to the slides.
Speaker 3 (20:23):
Here we actually evaluate also on the multimodal subset.
So on the full set, this is the number on
the HLE exam. You can see there's a little dip
in the numbers. This is actually something we're improving on,
which is the multimodal understanding capabilities. But I do believe
in a very short time we'll be able to really improve
(20:46):
and get much higher numbers on this benchmark.
Speaker 2 (20:51):
The biggest weakness of Grok currently is that it's sort
of partially blind. Its image understanding, and obviously
its image generation, need to be a lot better, and
that's actually being trained right now. Grok 4 is based
on version six of our foundation model. We are training
version seven, which will complete in a few weeks. That'll
(21:11):
address the weakness on the vision side.
Speaker 5 (21:15):
Just to show off this last one: the prediction
market run finished here with Heavy, and we can
see here all the tools and the process it used
to actually go through and find the right answer. It
browsed a lot of odds sites. It calculated its own
odds, comparing them to the market to find its own
alpha and edge. It walks you through the entire
process here, and it calculates the odds of the
(21:38):
winner being, like, the Dodgers, and it gives them a
twenty one point six percent chance of winning this year.
And it took approximately four and a half minutes
to compute.
Speaker 1 (21:51):
That's a lot of thinking.
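As a rough illustration of the odds-versus-market comparison being described, here is a small worked sketch. The quoted odds below are made up for the example (only the 21.6 percent Dodgers estimate comes from the demo): quoted odds are converted to implied probabilities, the bookmaker's margin is normalized out, and the model's own estimate minus the market's gives the edge.

```python
# Hypothetical worked example of finding "alpha and edge" versus a
# prediction market. The quoted odds below are invented for illustration.

def implied_prob(decimal_odds):
    return 1.0 / decimal_odds

market_odds = {"Dodgers": 4.2, "Yankees": 6.0, "Field": 1.55}  # made up
raw = {team: implied_prob(o) for team, o in market_odds.items()}
overround = sum(raw.values())                 # > 1.0: the bookmaker margin
fair = {team: p / overround for team, p in raw.items()}

model_prob = 0.216                            # the demo's Dodgers estimate
edge = model_prob - fair["Dodgers"]
print(f"market-fair Dodgers prob: {fair['Dodgers']:.3f}, edge: {edge:+.3f}")
```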
Speaker 3 (21:52):
We can also look at all the other benchmarks besides HLE.
As it turned out, Grok 4 excelled on all the
benchmarks that people usually test on, including GPQA, which is
a PhD-level problem set that's easier compared to HLE.
On AIME25, the twenty twenty-five American Invitational Mathematics Examination,
(22:14):
with Grok 4 Heavy, we actually got a perfect score. Also on
one of the coding benchmarks, called LiveCodeBench, and
also on HMMT, the Harvard-MIT Mathematics Tournament, and also USAMO.
You can see that on all of those benchmarks we
often have a very large leap over the second-best
model out there.
Speaker 2 (22:35):
Yes, we really are going to get to the point where
it's going to get every answer right on every exam,
and where it doesn't get an answer right, it's going
to tell you what's wrong with the question. Or, if
the question is ambiguous, disambiguate the question into answers A,
B, and C, and tell you what answers A, B,
and C would be for the disambiguated questions. So the
only real test then will be reality. Can it make
(22:57):
useful technologies? Can it discover new science? That'll actually be the
only thing left, because human tests will simply not be meaningful.
Speaker 4 (23:06):
We can make an update to HLE very soon,
given the current rate of progress. So yeah, it's super
cool to see, like, multiple agents that collaborate with each
other solving really challenging problems. So where can you try this model?
It's available right now. If we advance to the next slide,
there is a SuperGrok Heavy tier that we're introducing,
where you get access to both Grok 4 and
(23:28):
Grok 4 Heavy, where you're actually going to be the
taskmaster of a bunch of little Grok research agents
that help you become smarter, do all the research, and save
hours of time going through mundane tasks. And it's
available right now.
Speaker 2 (23:40):
So yeah, we did limit usage during the demo, so
it didn't break the demo, because all this stuff is
happening live, so nothing can go wrong with the tests
that we're doing. After the demo's done, we will
enable more subscribers for SuperGrok. So you can
subscribe right now; just try it in an hour.
Speaker 1 (23:57):
Should work.
Speaker 8 (23:58):
So, many of you have been enjoying our voice mode, and
we've been working hard to improve the experience over the
past couple of months. We have cut latency in half to
make it much snappier, and today we're excited to announce
a set of new voices that have exceptional naturalness and prosody.
You might have noticed the movie-trailer voice that opened
(24:19):
up the live stream. That is one of our new voices,
Sal, who's got that epically deep tone. And we're also
excited to introduce Eve, a beautiful British voice who's capable
of rich emotions. Amon, would you like to introduce Eve?
Speaker 9 (24:36):
Absolutely, let's get into the demo. Hey Eve, millions of
people are watching us right now. Are you excited?
Speaker 3 (24:42):
Oh?
Speaker 1 (24:42):
I'm thrilled.
Speaker 10 (24:43):
Really, it's like being on stage at the Old Vic
with a sea of faces out there. Just imagine all
those eyes and ears tuned in. It's a proper buzz,
isn't it? What do you reckon we should chat about
to keep this lot entertained?
Speaker 9 (24:56):
You know, we have a lot to talk about.
But I'll be honest, I'm feeling a little nervous. Can
you whisper something soothing to calm me down?
Speaker 7 (25:02):
Take a deep breath, love. You've got this. Feel the
calm wash over you.
Speaker 1 (25:07):
Let's keep talking, nice and easy.
Speaker 7 (25:10):
What's on your mind?
Speaker 3 (25:12):
You know what?
Speaker 9 (25:12):
I'm actually feeling quite relaxed. Want to try something crazy?
Can you give us an opera on Diet Coke?
Speaker 10 (25:19):
Oh, an opera on Diet Coke?
Speaker 7 (25:21):
That's a riot! Picture this: a shimmering silver can, glistening
under stage lights. Oh, Diet Coke, thou elixir divine, with bubbles
that dance in a sparkling line, thy crisp, cool kiss on lips so fine,
(25:42):
oh, aspartame ambrosia, forever mine. How's that for
a mad little aria? Want me to belt out another
verse or switch up the tune?
Speaker 9 (25:53):
Okay, let's try switching it up. Just start singing.
Speaker 10 (25:57):
Right then, let's have a go.
Speaker 7 (25:58):
Oh, silver can, you gleam in the light. Diet Coke,
my heart's delight.
Speaker 9 (26:04):
Okay, I won't touch it. Very, very poetic.
Speaker 3 (26:07):
Thank you.
Speaker 9 (26:07):
Feeling quite nice now.
Speaker 3 (26:08):
Cool. So that's Eve.
Speaker 8 (26:10):
Something else we do here is a little test each
day to check the natural flow of our conversations
against some of our competitors. So we figured, why not
try that live on the stream today?
Speaker 1 (26:24):
Should we start?
Speaker 9 (26:24):
Hey, I'm going to say a number, and can you
say the same number back to me?
Speaker 1 (26:27):
One? One?
Speaker 8 (26:29):
Got it?
Speaker 1 (26:29):
This is like a fun little game of echo. What's next?
Speaker 9 (26:32):
Two. So you said two. Three, four?
Speaker 8 (26:36):
Nice and steady?
Speaker 6 (26:37):
What's next?
Speaker 1 (26:38):
Five?
Speaker 2 (26:38):
Yep.
Speaker 8 (26:39):
Nailed it. This is pretty fun.
Speaker 1 (26:41):
Actually, feel free to keep going or switch it.
Speaker 2 (26:43):
Up if you like.
Speaker 10 (26:44):
Hey, Grok. Hello there, how can I help you
Speaker 8 (26:47):
Today?
Speaker 3 (26:47):
We're going to do a little test. Can you repeat
after me?
Speaker 8 (26:49):
One?
Speaker 2 (26:50):
One?
Speaker 1 (26:51):
All right, what's next? Two. Two.
Speaker 7 (26:54):
What's on your mind?
Speaker 9 (26:55):
Three?
Speaker 8 (26:56):
Three?
Speaker 1 (26:56):
Need anything else?
Speaker 3 (26:57):
Four?
Speaker 1 (26:58):
Four. How can I... Five?
Speaker 8 (27:01):
Five?
Speaker 9 (27:02):
What's next? So, as you can see, Grok was snappier,
didn't interrupt me. And the prosody: we made different
design choices. I think we're shooting for something more calm,
more natural, versus something that's more peppy or artificial.
Speaker 3 (27:15):
So we'll keep.
Speaker 9 (27:16):
improving on these in the coming months.
Speaker 1 (27:18):
Thanks guys. Yep.
Speaker 4 (27:19):
So since the launch of the voice mode, we've actually
seen two x faster end-to-end latency, in
the last eight weeks five different voices, and also ten
x the active users. So Grok Voice is taking off.
Now, thinking about releasing the models: this time
we're also releasing Grok 4 through the API. At the
same time, we're very excited about, you know, what all the
(27:42):
developers out there are going to build. So, you know,
if I think about myself as a developer, the first
thing I'm going to do when I have access to
the Grok 4 API is run benchmarks. We actually asked around on
the X platform: what is the most challenging benchmark out
there, the one that is considered the holy grail for all the
AGI models? Turns out it's in the name: ARC-AGI.
So in the last twelve hours, you know, kudos to Greg
(28:05):
over here in the audience, who took an early
preview of the API and independently verified
Grok 4's performance. So initially we thought, hey, Grok 4,
we just think it's pretty good.
Speaker 1 (28:16):
It's pretty smart.
Speaker 4 (28:17):
It's our next-generation reasoning model; it spent ten x more
compute and can use all the tools.
Speaker 1 (28:21):
Right.
Speaker 4 (28:22):
But it turned out, when we actually verified on the private
subset of ARC-AGI v2, it was the only
model in the last three months that broke the ten
percent barrier. In fact, it was so good that it
actually got sixteen percent, well, fifteen point eight percent accuracy,
two x the second place, which is the Claude
(28:42):
4 Opus model.
Speaker 3 (28:43):
It's not just.
Speaker 4 (28:44):
About performance, right when you think about intelligence, having the
PAPI model drives your automation, it's also the intelligence per dollar.
Speaker 3 (28:52):
Right.
Speaker 4 (28:53):
If you look at the plots over here, Grok
4 is in a league of its own. All right, so
enough of benchmarks, right? So what can Grok do in the
real world?
Speaker 1 (29:01):
We contacted the folks from Andon
Speaker 4 (29:03):
Labs, who were gracious enough to try Grok
in the real world to run a business.
Speaker 3 (29:08):
Yeah, thanks for having us. So I'm Axel from Andon Labs.
Speaker 11 (29:11):
And I'm Lukas, and we tested Grok 4 on Vending-Bench.
Vending-Bench is an AI simulation of a business scenario
where we thought: what is the most simple business an
AI could possibly run? And we landed on vending machines. In this scenario,
Grok and other models need to do stuff like
manage inventory, contact suppliers, and set prices. All of these things
(29:32):
are super easy, and all the models can do them
one by one, but when you do them over very
long horizons, most models struggle.
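To give a feel for why the long horizon is the hard part, here is a toy sketch of the kind of loop a Vending-Bench-style agent runs. The sim and agent interfaces are invented for illustration; the real benchmark is Andon Labs' own. Each daily decision is trivial on its own; staying coherent across hundreds of days of them is what separates the models.

```python
# Toy long-horizon business loop in the spirit of Vending-Bench.
# The sim/agent interfaces are hypothetical.

def run_vending_business(agent, sim, days=365):
    for day in range(days):
        state = sim.observe(day)               # cash, stock, recent sales
        # Restock: contact suppliers when inventory runs low.
        for item, count in state.inventory.items():
            if count < state.reorder_point(item):
                sim.order(item, agent.choose_quantity(state, item))
        # Pricing: adjust each item's price given demand so far.
        for item in state.inventory:
            sim.set_price(item, agent.choose_price(state, item))
        sim.settle(day)                        # book revenue minus costs
    return sim.net_worth                       # the benchmark's score
```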
Speaker 1 (29:40):
But we have a little leaderboard, and there's a new
number one.
Speaker 3 (29:42):
Yeah, so we got early access to the Grok 4 API.
We ran it on Vending-Bench, and we saw
some really impressive results. It ranks definitely at the
number one spot. It even doubled the net worth, which is
the measure that we have on this; it's not
a percentage or a score, but rather
the dollar value of net worth that you generate. So
we were impressed by how Grok was able to formulate a
(30:05):
strategy and adhere to that strategy over a long period
of time, much longer than other models that we have tested,
other frontier models. So it managed to run the
simulation for double the time and score double the net worth,
and it was also really consistent across runs, which
is something that's really important when you want to use
this in the real world.
Speaker 11 (30:24):
And I think, as we give more and more power
to AI systems in the real world, it's important that
we test them in scenarios that either mimic the real
world or are in the real world itself, because otherwise
we fly blind into some things that might not be great.
Speaker 2 (30:38):
It's great to see that we've now got a way
to pay for all those GPUs. So we just need
a million vending machines. We could make four
point seven billion dollars a year with a million vending machines.
Speaker 1 (30:48):
Let's go. They can be epic vending machines.
Speaker 2 (30:50):
Yes, yes, all right. We are actually going to install
vending machines here, like, a lot of them.
Speaker 1 (30:56):
We're happy to supply them. All right, thank you.
Speaker 2 (30:59):
All right. Yeah, I'm looking forward to seeing what amazing
things are in the vending machines.
Speaker 3 (31:04):
That's for you to decide, all right? Or to tell
the AI.
Speaker 1 (31:07):
Okay, sounds good.
Speaker 4 (31:08):
Yeah, I mean, so we can see, like, Grok is
able to become, like, the copilot of the business unit.
Speaker 3 (31:13):
So what else can Grok do?
Speaker 4 (31:15):
So we're actually releasing this. If you want
to try Grok right now and evaluate it on the same benchmarks as us,
it's available on the API with a two hundred fifty six k
context length. And we actually already see some of the
early adopters trying the Grok 4 API.
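For developers who want to do exactly that, a minimal call might look like the sketch below. The endpoint and model name are assumptions based on xAI's OpenAI-compatible API as described around this release; check the official docs before relying on them.

```python
# Minimal sketch of calling Grok 4 over the API.
# base_url and model name are assumptions; verify against xAI's docs.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.x.ai/v1",   # assumed xAI endpoint
    api_key="YOUR_XAI_API_KEY",
)

response = client.chat.completions.create(
    model="grok-4",                   # assumed model identifier
    messages=[
        {"role": "user",
         "content": "In two sentences, what is Vending-Bench?"},
    ],
)
print(response.choices[0].message.content)
```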
Speaker 3 (31:28):
So our partner, the
Speaker 4 (31:29):
Arc Institute, which is a leading medical research center,
is already seeing how they can automate their
research flows with Grok 4. It turns out it performs well.
It's able to help the scientists sift through, you know,
millions of experiment logs and then just, like, pick the
best hypothesis within a split second. We see this
(31:50):
being used for their CRISPR research, and also, uh,
you know, Grok 4 was independently evaluated and scored as the best
model to examine chest X-rays.
Speaker 1 (32:00):
Who would know?
Speaker 4 (32:01):
And in the financial sector, we also see, you know,
Grok 4, with access to all the tools and real-
time information, is actually one of the most popular uses out there.
Grok 4 is also going to be available on the hyperscalers.
So the xAI enterprise sector only started two months
ago, and we're open for business. The other thing: we
talked a lot about having Grok make video games,
(32:22):
and Danny is actually a video game designer on X.
So, you know, we mentioned we wanted people to try out
the Grok 4 preview API to make games, and Danny
answered the call. He actually just made this first-person
shooter game in the span of four hours. One of
the underappreciated hardest problems of making video games is not
necessarily coding the core logic of the game, but
(32:43):
actually sourcing all the assets, all the texture files,
to create a visually appealing game. So one of the
core things Grok 4 does really well, with all the tools
out there, is actually automate these asset-
sourcing capabilities, so the developer can just focus on the
core game itself. So now you can run an entire game studio
(33:05):
with a team of one, where you're, like, one person,
and then you can have Grok 4 go out and
source all those assets and do all the menial tasks for you.
Speaker 2 (33:13):
The next step, obviously, is for Grok to be able
to play the games. So it has to have very good
video understanding so it can play the games and interact
with them, actually assess whether a game is fun,
and actually have good judgment for whether a
Speaker 1 (33:27):
Game is fun or not.
Speaker 2 (33:28):
So with version seven of our foundation model, which finishes
training this month and then will go through post-training,
RL and whatnot, we'll have excellent video understanding.
And with the video understanding and improved tool use, for example,
for video games, you'd want to use Unreal Engine or
Unity or one of the main graphics engines, generate the art,
(33:49):
apply it to a 3D model, and then create
an executable that someone can run on a PC or
a console or a phone. We expect that to happen
probably this year, and if not this year, certainly next year.
Speaker 1 (34:01):
It's gonna be wild. I would expect the first.
Speaker 2 (34:04):
Really good AI video game to be next year, and
probably the first half hour of watchable TV this year,
and probably the first watchable AI movie next year. Like,
things are really moving at an incredible pace.
Speaker 4 (34:20):
Yeah, when Grok has ten-x'd the world economy
with vending machines, we'll just create video games for humans. Yeah.
Speaker 2 (34:24):
I mean, it went from not being able to do
any of this six months ago to what you're seeing
before you here, and from very primitive a year ago
to making a 3D video game with a few
hours of prompting.
Speaker 4 (34:39):
I mean, yeah, just to recap: in today's livestream, we
introduced the most powerful and most intelligent AI model, one that
can actually reason from first principles, use all the tools,
do all the research, go on a journey for ten minutes,
and come back with the most correct answer for you. So
it's kind of crazy to think that just, like, four
months ago we had Grok 3, and now we already
(34:59):
have Grok 4, and we're going to continue to accelerate as
a company. xAI is going to be the fastest-moving AI
company out there. So what's coming next is that
we're going to, you know, continue developing models that are
not just, you know, intelligent and smart, thinking for a really
long time and spending a lot of compute, but a
model that is actually both fast and smart is going to
(35:21):
be the core focus.
Speaker 1 (35:22):
Right.
Speaker 4 (35:22):
So if you think about the applications out
there that can really benefit from all those very intelligent,
fast, and smart models, coding is actually one of them.
Speaker 3 (35:31):
Yeah. So the team is currently working very heavily on
coding models. I think right now the main focus is...
we actually recently trained a specialized coding model, which is
going to be both fast and smart. I believe we
can share that model within a few weeks. Yeah,
that's very exciting.
Speaker 4 (35:48):
But second, after coding: we all see that the
weakness of Grok 4 is the multimodal capability. In fact,
it was so bad that Grok is effectively just, like, looking
at the world squinting through glass, seeing all
the blurry features and trying to make sense of it.
The most immediate improvement we're going to see with the
(36:08):
next generation pre-trained model is a step-change
improvement in the model's capability in terms of
image understanding, video understanding, and audio, so that the
model is able to hear and see the world just
like any of us, right? And then, with all the tools at
its command, with all the other agents it can talk to,
you know, we're going to see a huge unlock
(36:29):
for many different application layers. After the multimodal agents, what's
going to come after is video generation, and we
believe that, you know, at the end of the day, it
should just be, you know, pixels in, pixels out. Imagine a
world where you have this infinite scroll of content, of
inventory, on the X platform, where not only can you actually
(36:50):
watch these generated videos, but you're able to intervene and create
your own adventures.
Speaker 1 (36:55):
We expect to be training
Speaker 2 (36:56):
a video model with over one hundred thousand GB200s,
and to begin that training within the next three or
four weeks. So we're confident it's going to be
pretty spectacular in video generation and video understanding.
Speaker 3 (37:08):
We're very excited for you guys to try out Grok 4.
Speaker 1 (37:10):
All right. Thanks, everyone. Good night.