
February 4, 2025 7 mins

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
He is the ABC News technology reporter. Mike Dobuski is
joining us now, and you know it is Tech Tuesday.
Mike, and man, I don't know. Chuck and I were
just kind of talking about this a little bit ago
as we were kind of, you know, promoting that this was
coming up, and it was like, can you pass Humanity's
Last Exam? And it was like, uh, I got a

(00:20):
little I'll be honest, I was like, oh, this is
this is a little scary what we're seeing right here.
Should we be scared?

Speaker 3 (00:26):
Mike?

Speaker 2 (00:27):
It's certainly a dramatic name. In fact, the people who
created this exam were originally going to call
it Humanity's Last Stand, but they said that was a
little too dramatic and they had to reel it back
a little bit. But what it is, it is
at least very scary to people who are scared of
standardized tests, and I count myself among them. This is
an exam of three thousand questions that cover over one

(00:50):
hundred different subjects. Forty two percent are math questions, eleven
percent are biology, there's humanities, social sciences, and on and
on it goes. And these are questions that were written
by more than a thousand subject matter experts, professors, researchers,
that sort of thing affiliated with five hundred institutions around
the world across fifty countries, they say. And the guy

(01:11):
who kind of organized all this is a man named
Dan Hendrycks. He's the director for the Center for AI Safety,
and he says, this new test is supposedly the hardest
test that generative AI technology has ever faced. In other words,
we are not intended as human beings to take this test.
This is rather a test to evaluate how intelligent artificial

(01:33):
intelligence is.

Speaker 3 (01:34):
So okay, see the adaptive nature of AI worries me
right there, because it will not take long for all
those microchips to figure out how to answer these questions correctly.
And the next thing you know, we've got Arnold Schwarzenegger
doing a real life Terminator narration for us. Yes, coming
down the street right now.

Speaker 2 (01:52):
Oh boy, it's already happening in some ways. Right. So
standardized tests in AI testing are nothing new. This
is kind of how we evaluate how good these systems are.
The previous sort of standard was called the Massive Multitask
Language Understanding Test or MMLU. They are not very good
at naming things in the AI space. Even still, this

(02:15):
was a test that was kind of like a PhD
doctorate level test. It asked really hard questions. But the
problem became that popular models basically got too good at
the test. The latest models from OpenAI, from Google,
from Anthropic, and others, they all scored higher than eighty
percent on the MMLU. As a point of comparison, on Humanity's

(02:37):
Last Exam, this new thing, OpenAI's o1 reasoning model,
one of their more recent models, scored a nine point
one percent, so a pretty definitive failing grade there. Last week,
you guys and I talked about DeepSeek, this new
Chinese AI model. They scored a nine point four percent,
so a little bit better, but even still a shade

(02:57):
of what they were testing at.

Speaker 1 (03:00):
Yeah, so it's going AI is going to essentially this
will be easy for it to learn or I'm just man,
you're telling us exactly what is going on with this,
and it is still, Mike, for me in my small brain,
really still very difficult to comprehend.

Speaker 2 (03:20):
Well, here, do you want a little taste of what
the test is like? Please. Have sample questions there for you? Yes,
there there are many of them, and they are kind
of complicated to read. But the one that I've sort
of centered on is this one from the ecology section
of the exam, and here it goes: Hummingbirds within Apodiformes
uniquely have a bilaterally paired oval bone, a sesamoid embedded

(03:43):
in the caudolateral portion of the expanded cruciate aponeurosis of
insertion of m. depressor. How many paired tendons are supported
by this sesamoid bone? Answer with a number.

Speaker 1 (03:57):
Six, we do.

Speaker 3 (03:58):
I'm going to go with two.

Speaker 2 (04:00):
So two, from what I've been able to put together,
is the correct answer. I think it is two. However,
it's very difficult to track down the answers to any
of these sample questions because, of course we know that
these models are trained on the Internet. They're trained on
basically anything that you can find online easily accessible, and
I'm just on the website for Humanity's Last Exam looking

(04:23):
at this, so it kind of makes sense that they
would not want to put the answer right there, because
then the model would learn it and then know that answer.
But from what I've been able to put together, it
is two. But that's sort of the flavor of question
that we're getting again. It seems like it's very hard
for these systems. You know, OpenAI has kind of
been doing pretty well, you know, in comparison to its
competitors, DeepSeek as well. Google was only able to

(04:45):
manage a seven point seven percent, Anthropic managed a four
point three percent. Elon Musk has an AI company that
makes a large language model called Grok. Grok 2, the
latest version of that model, scored a three point eight percent.
So they're not doing particularly well, but to your point earlier,
it does seem like they're getting better over time. Hendrycks

(05:06):
says that he expects these models to regularly score about
fifty percent on this test before the end of the year.
And over the weekend, OpenAI released a whole bunch
of new models, including a research model that they call
Deep Research, and that scored twenty six point six percent.
Just to give you a little sense of how fast
this stuff has advanced.

Speaker 3 (05:25):
See, I gave two, so I'm artificially intelligent.

Speaker 1 (05:28):
I know that if they gave if they redid those tests,
I would imagine even though they could readminister straight away.
And it's again, I feel like it's going to get
you know, it'll go from twenty six to fifty four,
you know, I mean, I feel like it's going to
climb like that and all that. I don't know if

(05:49):
that's something I feel like as where we're headed, no question.

Speaker 2 (05:52):
Right, right, And it kind of gets at some similar
questions that people have raised about actual standardized tests like
the SAT and the ACT. Do they accurately measure
a person's intelligence or smarts, or is it just people
are studying to a test and that sort of thing?
Is that the right way to evaluate a person's intelligence now?
And the same question can apply to these AI models.

Speaker 1 (06:13):
Yeah, you're evaluating their memory, you know, especially if they're
memorizing the correct answers and so on. Yeah, the real
test is if it's applicable in a real life situation
and all of that. But that's not necessarily how they
test or determine IQ and all of those things. And
I thought one of the questions was going to be
if a train leaves Chattanooga six you know. That's why

(06:34):
I was like, oh, I know this one.

Speaker 2 (06:36):
I know this one with all of those energies, and
that's about where I tuned out of my standardized testing world.
So you got me there.

Speaker 3 (06:43):
Those large words Mike was reading as he gave us
this question translated in my mind. What I heard was,
on a hummingbird slurper, how many tendons control the use
of the slurper? That's how I interpreted the question.
I don't know what else. It's not a beak, but
the thing that the hummingbird inserts into the flower, the slurper,
the slurper. You know, it's Seven Eleven, the slurper.

Speaker 2 (07:05):
The slurper.

Speaker 1 (07:06):
Get you one of them? Yeah, Mike Dobuski, ABC News
Technology reporter on Tech Tuesday, Mike, thank you very much.
Appreciate you man.

Speaker 2 (07:15):
Of course, guys they can.

Speaker 1 (07:16):
See yah, the slurper, slurper.

Speaker 3 (07:19):
And now July fourth, twenty twenty five, Skynet becomes
self aware.

Speaker 2 (07:23):
Skynet.

Speaker 3 (07:24):
Begins to learn at a geometric rate. It becomes self
aware at two fourteen a.m. Eastern time, August twenty ninth. Oh,
it's August twenty ninth. I was off on that. Okay,
I'm sorry,

© 2025 iHeartMedia, Inc.