Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Humanity's Last Exam, designed as a be-all, end-all standardized test for AI models. It's how we measure the intelligence of artificial intelligence.
Speaker 2 (00:09):
Yeah, drawing from experts around the world, models have to answer questions dealing with everything from mathematics to ecology. But how are the new models, launched as recently as this weekend, faring on the test? Joining us now on the KOA Common Spirit Health Hotline to talk more about it: ABC News tech reporter Mike Dobuski. He's not one of the AI models. Mike, thank you so much for your time this morning. I'll be honest with you. I've read
(00:31):
the first question on this test and my brain broke even trying to comprehend what they were asking. What is this full test? Tell us a little bit more about it.
Speaker 3 (00:39):
It's fascinating, absolutely, and I think you're not alone. I read these questions and I have no real understanding of whether they're even in English or a language that I would understand. But the point here, guys, is that there's this growing question in the AI space of how to measure the quote-unquote intelligence of artificial intelligence. Right? How do you figure out how smart these things are?
(01:02):
And the answer, for the most part up until now, has been standardized tests, the same way that we measure the intelligence of people. So you subject these models to a bunch of math, science, and logic problems and you see how they do. And over time these technologies have advanced, and as a result, the tests have gotten harder, to the point where the standard at this point is
(01:23):
like a PhD-level question, right? One popular test, for example, is called the Massive Multitask Language Understanding test. The problem is that many of the popular models from some of the leading AI companies out there, OpenAI, Google, Anthropic, a couple others, are regularly doing very well on these tests.
(01:44):
For example, those companies that I just mentioned all score higher than eighty percent on the MMLU, which kind of raises the question: what are they actually measuring at the end of the day? So this is where Dan Hendrycks comes into the story. He's the director of the Center for AI Safety, and he's come up with a new test called Humanity's Last Exam. Supposedly, it was
(02:05):
originally called Humanity's Last Stand, but that was deemed a little too dramatic. And this is supposedly the hardest test that generative AI technology has ever faced. It's a collection of three thousand questions from over one hundred different subjects. Forty-two percent are math questions, for example; eleven percent have to do with biology; eight percent are humanities and
(02:28):
social sciences. It goes on and on. And these questions come from more than one thousand different subject matter experts, so professors and researchers who are out there, and they are affiliated with, they say, five hundred institutions across fifty different countries.
Speaker 1 (02:43):
So the question a lot of people ask with AI, especially with at least the early iterations of it, the machine learning era: is there any bias that can be seen in any of these answers, or are these straight, for lack of a better term, STEM questions, where there's an actual answer and there's no nuance to the answer? And doesn't that separate us maybe from AI, where we have the human element we
(03:04):
bring in and AI models do not?
Speaker 3 (03:06):
It's a great question. For the most part, from the sample questions that we're able to look at (you have to pay for the test in order to get access to all three thousand questions, but from the sample questions that they've given us), they are pretty discrete. Here, I'll read one for you just very quickly, and forgive me for any mispronunciations. But this is one of the ecology questions: hummingbirds within Apodiformes uniquely have a bilaterally
(03:29):
paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded cruciate aponeurosis of insertion of m. depressor. I haven't even gotten to the question part yet: how many paired tendons are supported by this sesamoid bone? And of course, guys, I know that you know this, but you have to answer with a number. So that is one of the
(03:52):
questions that is out there. It was submitted by, I believe, a researcher or a professor from MIT. There are other questions that have to do with the classics. For example, the one next to this asks you to identify a Roman inscription. So some of these tests, you know, require the model to look at a picture and then understand the text of the question as well,
(04:13):
so it's kind of testing how well it's able to do that. There are plenty of computer science questions in here; mathematics, of course; chemistry; linguistics. They're very hard questions. But like that ecology question I read, despite the fact that I have no idea really what it was asking, it does have a discrete answer, right? It's looking for a number at the end of the day. So yeah, little room for them to sort of, you know,
(04:37):
fudge around the edges, maybe, but it seems like it's a pretty standardized test.
Speaker 2 (04:41):
Yeah, Mike, it's fascinating. Thousands of questions, and it's interesting to see how they really shape up. You mentioned how OpenAI, Google, and Anthropic do. Do we know how DeepSeek really compared on this test?
Speaker 3 (04:52):
We do, and from what we understand so far, this is a very hard test for large language models. OpenAI's o1 model, which is one of their more recent reasoning models, scored a nine point one percent on this test, so a pretty definitive failing grade, I think it's fair to say. DeepSeek, because you asked, their R1 model that kind of shook up the AI race last week, scored a nine point four percent
(05:15):
on Humanity's Last Exam, so a little bit better, but still not quite into the double digits of doing well on this test. Google, for example, has the Gemini Thinking model, that's kind of their latest and greatest model: seven point seven percent. Anthropic scored a four point three percent with one of their models. And Grok 2, which is a large language model that is developed by
(05:36):
xAI, owned by Elon Musk (you might see it crop up on your X feed every once in a while), scored a three point eight percent. However, guys, Hendrycks says that he expects these models to get a lot better very quickly. He says he expects them to regularly score about fifty percent on this test before the end of the year. Over the weekend, OpenAI launched some
(05:57):
new large language models, including a Deep Research model, and that scored a twenty six point six percent. So this test has only been out there for a few weeks, and already these companies are making some pretty major leaps.
Speaker 1 (06:09):
I could continue on this subject for so many reasons, about whether or not it'll become sentient. I know the concern with AI is more about the hands that it falls into and less about the knowledge it gains. But that's the thing I was finally going to ask, Mike: these machines learn from their mistakes, so they will score better as they attain that knowledge, or get more knowledge, or learn from their mistakes?
Speaker 3 (06:29):
Right. And this gets at one of the fundamental questions of this style of testing AI. In the same way that a standardized test has faced criticism among students and parents and educators for not testing the whole person (some people just aren't good at standardized tests, but they are otherwise perfectly intelligent and capable of existing within society), so too are those questions facing these large
(06:52):
language models. These things can learn to a test; essentially, what these are testing is kind of what corpus of information these models are trained on. As these models gain more and more information from various sources, they will, you know, supposedly get more intelligent. I think the real question facing AI is less how well it scores on standardized tests, but rather what tests it's going to
(07:16):
face in the real world, where, you know, it could potentially see a hallucination, it could see a mistake, and that has real-world consequences.
Speaker 2 (07:23):
I may have cheated through AI, but I think the sesamoid bone in hummingbirds is described as supporting two pairs.
Speaker 3 (07:30):
I saw that as well. So the other funny part is that, you know, the answers to these questions are pretty hard to find, because, of course, if you put them on the internet, then the models are going to learn from them. To that point, you can't really release them, you know, widely. But let's go with two. I feel like it's two.
Speaker 2 (07:45):
That feels right. ABC News tech reporter Mike Dobuski. Thanks, Mike.
Speaker 3 (07:49):
Of course, guys, take care.