Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Alright, on the Legacy Retirement Group dot com phone line.
Speaker 2 (00:02):
Let's check in with Mike Dobuski, ABC's tech reporter. It
is Tech Tuesday. Mike, good morning, thanks for the time.
I appreciate you as always. I do want to talk
about Humanity's Last Exam, which sounds a little apocalyptic to me,
but first I do have some lingering
questions about the DeepSeek AI model from last week.
(00:25):
We know that it kind of turned tech stocks on
their side last week, OpenAI and Nvidia. So a
couple of questions, and I guess they're kind of related. Did
DeepSeek use OpenAI's models? Did they kind
of look over their shoulder and cheat on
the test a little bit?
Speaker 3 (00:42):
So that is currently what Microsoft and OpenAI are investigating.
We don't know a definitive answer to that just yet.
One of the theories that's kind
of floating around out there is that DeepSeek
was able to use this method known
as distillation, where essentially
they take OpenAI's model, ask it a bunch of
(01:04):
questions and use those answers to train their own model.
That is not an illegal practice. It's kind of just
something that's out there, something that's been used before
by others. But it does put
some pressure on the leaders in the industry, who have invested
tens of billions of dollars into training these models. It's like saying, hey,
all these other guys were able to do
this kind of on the cheap, on the
(01:25):
backs of our research and on the backs of our technology.
So that's what they're looking into right now.
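To make that distillation idea concrete, here is a minimal sketch of the approach described above: query a stronger "teacher" model, record its answers, and save question-and-answer pairs that a smaller "student" model could later be fine-tuned on. The model name, sample questions, and file path are illustrative assumptions, not details of DeepSeek's or OpenAI's actual pipelines.

```python
# Minimal distillation data-collection sketch (illustrative only).
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompts; a real effort would use a large, diverse corpus.
questions = [
    "Explain why the sky is blue in two sentences.",
    "What is the derivative of x**2 * sin(x)?",
]

with open("distill_pairs.jsonl", "w") as f:
    for q in questions:
        # Ask the "teacher" model for an answer.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder teacher model name
            messages=[{"role": "user", "content": q}],
        )
        answer = resp.choices[0].message.content
        # Store the pair in a chat-style format, ready to serve as
        # fine-tuning data for a smaller "student" model.
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": q},
                {"role": "assistant", "content": answer},
            ]
        }) + "\n")
```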
Speaker 1 (01:30):
Interesting.
Speaker 3 (01:30):
We also have seen OpenAI launch a whole slew
of new models over the weekend that have,
thus far, sort of surpassed the capabilities of DeepSeek's
latest model. So DeepSeek's R1 model kind
of scored similarly to, or better than, common AI models
on common standardized tests. Now the latest models
(01:51):
from OpenAI, one of which is called Deep Research, are able to
surpass that model on these standardized tests.
Speaker 1 (01:56):
And that's kind of the predictable path for AI.
Speaker 2 (01:59):
It's only going to get more efficient and more
quote unquote intelligent, right? I mean, it's not going to
go the other way. The programmers
get smarter, they start using other
models, and it all kind of contributes to this sort
of growing intelligence. And the other question I have about
DeepSeek, and then we'll move on to Humanity's Last Exam:
(02:22):
Do we know if DeepSeek used
Nvidia chips in their hardware? Because
Nvidia had a rough go with
stocks last week, and it doesn't really make sense that
DeepSeek's success would be bad news for Nvidia.
Speaker 3 (02:39):
Well, in some ways. What we do know is that
DeepSeek was sort
of forced to use either older or slightly worse chips
than what American companies had on offer. And Nvidia
gets a lot of the attention because they're kind of
the leader in this space. The H100 chip,
for example, has severe restrictions on where Nvidia can
(03:02):
export that chip. China has pretty strict export
controls placed on it, so they can't get these latest
and greatest chips. That doesn't necessarily mean they weren't using
Nvidia chips. They could have been using older ones.
They could have been using slightly less powerful ones. They
could be using chips that were developed
domestically in China itself. But we don't know the
(03:22):
answer to that just yet. Nvidia has recovered somewhat
on the stock market since then. You know, at the
end of the day, a chip's a chip, right? They're
still selling a chip, even if it's not their
greatest chip. But that's kind of the interesting
piece of this. And at the end of the day,
all of these other companies now have access to, and
sort of a better understanding of, how to run
(03:43):
a more efficient model because of what DeepSeek was
able to do. So there's now the question
of how OpenAI can use that sort of knowledge
to develop either more efficient models or more powerful
models using the chips that they already have.
Speaker 1 (03:58):
ABC's tech reporter Mike Dobuski joining us.
Speaker 2 (04:00):
All right. So the question, Mike, is: how do we
actually measure the intelligence of artificial intelligence?
Speaker 3 (04:08):
Yeah, and it's a growing question in the AI space
in a pretty big way. And the answer to that
question, up until recently, has been standardized tests, right? The
same way we measure the intelligence of a student in school,
we subject these large language models to a bunch of
math, science, and logic problems. As this industry has matured and
these models have gotten more intelligent, the tests have gotten harder,
(04:29):
right, to the point where your average test for an
AI model is like a PhD-level test. One
popular one is called the Massive Multitask Language Understanding test,
or the MMLU.
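As a rough illustration of how a benchmark like the MMLU is scored, here is a minimal sketch: present each multiple-choice question, compare the model's letter answer against the answer key, and report the percent correct. The ask_model function is a hypothetical stand-in for a real model API call, not any lab's actual evaluation harness.

```python
# Toy MMLU-style scorer (illustrative only).
def ask_model(question: str, choices: dict[str, str]) -> str:
    """Hypothetical stand-in: a real harness would prompt an LLM with the
    question and choices, then parse out the letter (A-D) it picks."""
    return "A"

def score(benchmark: list[dict]) -> float:
    """Return the percentage of questions answered correctly."""
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in benchmark
    )
    return 100 * correct / len(benchmark)

sample = [
    {
        "question": "Which planet is closest to the sun?",
        "choices": {"A": "Mercury", "B": "Venus", "C": "Earth", "D": "Mars"},
        "answer": "A",
    },
]
print(f"Accuracy: {score(sample):.1f}%")  # 100.0% on this one toy question
```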
Speaker 1 (04:42):
They're not very good at naming things in this space, in my opinion.
Speaker 3 (04:45):
However, the problem now is that some popular models are
getting basically too good at these common standardized tests. For example,
recent models from OpenAI, Google, and Anthropic all score higher
than eighty percent on the MMLU, raising the question of
what good a test is if
everyone can ace it. So that's where Dan Hendrycks, he's
(05:07):
the director of the Center for AI Safety, comes into
the picture. He came up with a new test that
supposedly is the hardest test that generative AI technology has
ever faced. They call it Humanity's Last Exam. Apparently they
were originally going to call it Humanity's Last Stand, but
that was deemed a little too dramatic, so they've gone
with Humanity's Last Exam. This is a test of three
(05:30):
thousand questions from over one hundred different subjects. Forty-two
percent of them are math questions, eleven percent are biology questions,
eight percent are humanities and social sciences, and on and
on it goes. And these are questions that come from,
and were written by, more than one thousand different subject
matter experts, so researchers, professors, that type of thing, and
(05:51):
they were affiliated with five hundred institutions from across fifty countries,
kind of pulling from a huge base of different knowledge
that has now been compiled into this pretty hefty test.
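Just to put rough numbers on that breakdown, here is the arithmetic on the percentages quoted above applied to the roughly three thousand questions; the category shares are as stated in this segment, and the counts are approximations.

```python
# Approximate question counts from the quoted composition percentages.
total_questions = 3000
composition = {
    "math": 0.42,
    "biology": 0.11,
    "humanities and social sciences": 0.08,
}

for subject, share in composition.items():
    print(f"{subject}: ~{round(total_questions * share)} questions")
# math: ~1260 questions
# biology: ~330 questions
# humanities and social sciences: ~240 questions
```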
Speaker 2 (06:03):
I guess the question is, how smart can AI
actually get? Can it actually eclipse the intelligence of humans?
Because, I mean, at the end of the day,
that's the bar, right? I mean, can
you be smarter than a human when all the information
that's out there has been created by humans? And I know
I'm getting kind of philosophical, we should probably
be in a back room somewhere, but like, how
(06:25):
smart can AI actually get?
Speaker 3 (06:28):
Well, what you're describing there, where an AI sort of
surpasses the intelligence of a human, is a term in
the AI space known as AGI, or artificial general intelligence:
a computer that is able to match the capabilities of
the human brain. And many companies, OpenAI among them,
have said that that's their goal, right? They want to
get there and create that, because that will be this
(06:50):
you know, transformative moment in the world of technology. It
would open up the possibility of having computers do things
that they've never been able to do before and change
our lives in a huge way. The thing is, we're
nowhere close to that, by all accounts. If we're
just going by Humanity's Last Exam, this standardized test, it
is really hard for current AI models. For example, Open
(07:11):
AI's o1 model, one of their most recent reasoning models,
only scored a nine point one percent on this test.
That is a pretty definitive failing grade. DeepSeek, fun
fact, the one we've just been talking about: their R1
model scored a little bit better. It did a nine
point four percent, but that's still a pretty failing grade there.
Google scored a seven point seven percent. Anthropic scored a four
(07:33):
point three percent. Grok 2, which is Elon Musk's large
language model that you might see baked into X or
some of his other platforms, scored a three point eight percent. So
they're not doing too great on this exam. However, Hendrycks says
that he expects models to regularly score about fifty percent
on this test by the end of the year. So,
(07:53):
speaking to your point about how fast these things
are advancing, that raises the question: once
Humanity's Last Exam becomes outmoded, what comes next?
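For reference, here is one way to gather the Humanity's Last Exam scores quoted in this segment into a single sortable list; the numbers are simply the ones reported above, not an official leaderboard.

```python
# Humanity's Last Exam scores as quoted in this segment (percent correct).
reported_scores = {
    "DeepSeek R1": 9.4,
    "OpenAI o1": 9.1,
    "Google": 7.7,
    "Anthropic": 4.3,
    "Grok 2": 3.8,
}

# Print a simple leaderboard, highest score first.
for model, pct in sorted(reported_scores.items(), key=lambda kv: -kv[1]):
    print(f"{model:12} {pct:4.1f}%")
```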