
August 13, 2025 · 15 mins

This month, the tech company behind ChatGPT released what they claim is their smartest AI model yet. According to OpenAI, GPT-5 operates at the level of a PhD student. But experts are warning that the AI race has become a marketing battle, as companies manipulate test results to claim their product is the best. Today, we're unpacking how AI companies measure intelligence and why that's become a problem.

Hosts: Sam Koslowski and Emma Gillespie
Producer: Orla Maher

Want to support The Daily Aus? That's so kind! The best way to do that is to click ‘follow’ on Spotify or Apple and to leave us a five-star review. We would be so grateful.

The Daily Aus is a media company focused on delivering accessible and digestible news to young people. We are completely independent.

Want more from TDA?
Subscribe to The Daily Aus newsletter
Subscribe to The Daily Aus’ YouTube Channel

Have feedback for us?
We’re always looking for new ways to improve what we do. If you’ve got feedback, we’re all ears. Tell us here.

See omnystudio.com/listener for privacy information.

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Alrighty, and this is The Daily... This is The Daily
Aus. Oh, now it makes sense.

Speaker 2 (00:14):
Good morning, and welcome to The Daily Aus. It's Thursday,
the fourteenth of August. I'm Sam Koslowski.

Speaker 1 (00:19):
I'm Emma Gillespie.

Speaker 2 (00:20):
This month, the tech company behind ChatGPT released what
they claim is their smartest AI model yet. Now, according
to OpenAI, GPT-5 operates at the level of
a PhD student. But experts are warning that the AI
race has become a bit of a marketing battle, as
companies manipulate test results to claim their product is the best.

(00:42):
On today's podcast, we're going to unpack how AI companies
measure intelligence and why that's become a problem.

Speaker 3 (00:52):
Sam.

Speaker 1 (00:52):
I was originally skeptical about having this conversation with you
because I, like maybe some listeners here AI and I
kind of roll my eyes a little.

Speaker 3 (01:03):
Bit and switch off.

Speaker 1 (01:04):
But if you are that person hearing this right now,
hang in there, because this is actually a fascinating conversation,
this idea that we're sort of being marketed to about
this arms race of who is the smartest, which AI model.

Speaker 3 (01:18):
Is the best.

Speaker 1 (01:19):
Let's start with the basics here, though, when we're talking
about AI models.

Speaker 3 (01:24):
What exactly does that mean?

Speaker 2 (01:26):
There is a certain brand of satisfaction that is reserved
for when I can change your mind about whether a
story is going to be interesting or.

Speaker 3 (01:33):
Not, especially if it's a tech story.

Speaker 2 (01:35):
This is going to be awesome. So AI
models are computer programs that can understand and generate human language.
Just think of them as very advanced autocomplete
systems, like the ones that can fill in
a form for you, or, you know, password-remembering little
widgets in your browser; anything that presumes what you're

(01:56):
about to do or want and can kind of fill
in those gaps for you.

Speaker 3 (02:00):
That's actually a really good way to think of it.

Speaker 2 (02:02):
See we're off to a flyer. You type in a
question or a quest, those responses are generated. The most
famous ones you might have heard of include chat GBT
from open Ai, You've Got Clawed from Anthropic, and Gemini
from Google.

Speaker 1 (02:15):
Okay, now, it does seem like every AI company out
there claims that its model is the smartest or the
most capable or better than the best one that we've
ever seen. And one of the biggest players in this space,
open ai has just released GPT five this week. What
are their claims about this new model.

Speaker 2 (02:37):
So they're making some big statements here. They're saying that
GBT five scored ninety four point six percent on a
test that measures its ability to solve advanced maths problems,
seventy four point nine percent on real world coding tasks,
and produces forty five percent fewer factual errors than their
previous models. To the CEO of the company, Sam Oltman,

(02:59):
he called it the model in the world, which kind
of sounds like those places you were saying before, and
said it represents a significant step towards what's called artificial
general intelligence AGI, which is basically the idea that AI
can actually perform an intellectual task better than humans can.

Speaker 1 (03:16):
Okay, so that's when we start to imagine, like, the
I, Robot future.

Speaker 2 (03:20):
Yeah, and it's when we get into those examples of
things like AI blackmailing you if you decide to stop
using it and kind of taking on a life of
its own.

Speaker 1 (03:28):
So those numbers from Open AI about this new model
sound pretty impressive, like ninety five percent on advanced maths.
Particularly interesting this kind of idea of producing fewer factual errors,
because that's always kind of in the spotlight around the
skepticism towards AI, But I'm interested in how these companies
are actually measuring the intelligence of these products. You mentioned

(03:51):
in the intro SAM that this is becoming a bit
of an issue. Yeah, So what exactly is the concern?

Speaker 2 (03:57):
Well, ultimately it's the idea that AI companies are
all using different tests to prove that their model is
the best. It's like if car companies all claimed
to make the fastest car ever or the safest car ever,
but one tested on a highway, another tested on
a racetrack, and another went downhill on a
windy day. A major study published earlier this year into

(04:21):
AI models actually compared the situation to Volkswagen, which was
found guilty of lying about the emissions, or the lack
of emissions, that its cars were producing when it basically
cheated on pollution tests. The researchers noted that when companies
manipulated car testing, people were going to jail, but similar

(04:41):
manipulation in AI isn't really getting the same attention.

Speaker 3 (04:46):
Wow, it's fascinating.

Speaker 1 (04:48):
I remember that Volkswagen emissions scandal, so a good comparison
there, Sam. So, how can these
AI models then be tested in a fair way?

Speaker 3 (04:58):
What does testing.

Speaker 1 (04:59):
Out official intelligence kind of transparently and consistently look like.

Speaker 2 (05:03):
Well, naturally, the first thing to do would be to
standardize the same test across every model, and that would
be described as a benchmark, a global benchmark for
how these models are performing. And that could be to
measure a specific ability, say in maths: you could give
all of them the same advanced maths problem and then
measure not only the output, but how long it takes

(05:23):
for them to get there and what processes they undertook to
reach that final destination of the answer. You could give
that a score and then actually compare these models
like for like.
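
To make that concrete, here is a minimal, runnable Python sketch of the like-for-like harness Sam describes. Everything in it is invented for illustration: the model names, the ask() stub, the canned answers, and the scoring rule are assumptions, not a real benchmark or a real API.

import time

# Hypothetical like-for-like benchmark: identical prompt, identical
# scoring, timed, for every model. All names and answers are made up.
MATHS_PROBLEM = "Show that the sum of two odd integers is even."
KEY_STEPS = ["2k + 1", "2m + 1", "2(k + m + 1)"]  # working we want to see

FAKE_ANSWERS = {
    "model_a": "Write the odds as 2k + 1 and 2m + 1; their sum is 2(k + m + 1), which is even.",
    "model_b": "Odd plus odd is even because the ones cancel out.",
}

def ask(model: str, prompt: str) -> str:
    # Stand-in for a real API call to each model.
    return FAKE_ANSWERS[model]

def score(answer: str) -> float:
    # Credit the process, not just the verdict: one point per key step shown.
    return sum(step in answer for step in KEY_STEPS) / len(KEY_STEPS)

for model in FAKE_ANSWERS:
    start = time.perf_counter()
    answer = ask(model, MATHS_PROBLEM)   # the same test for everyone
    elapsed = time.perf_counter() - start
    print(f"{model}: score {score(answer):.0%}, time {elapsed:.4f}s")

Run as-is, this prints a comparable score and time for each stand-in model, which is the "compare like for like" idea in miniature.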

Speaker 3 (05:33):
It kind of sounds pretty straightforward.

Speaker 1 (05:35):
That to me seems like the obvious path towards getting
consistent testing. So where does the manipulation come from?

Speaker 2 (05:44):
Well, I think the first thing to acknowledge is that
there is no centralized global body that has the respect
or the ability to actually execute that sort of standardized testing.
There is no, say, TGA like we have for drugs; there's no government
sponsored hub that can execute that kind of stuff. So
reason A, there's nobody to do it. But reason B

(06:05):
would be that these models are still in this accelerating
period of marketing where they're cherry picking tests that
favor their models' strengths while hiding poor performance in other areas.
And one other problem that has come up is that
if AI knows the problem is coming, because it's AI
and it knows how tests are done, then it can

(06:27):
actually almost train itself for the test, and so there's
a bit of a data contamination problem. You'd have to
keep these tests almost offline entirely for the models to
see them for the first time. One study found, for example,
that GPT-4, which is an older model from
OpenAI, could solve coding problems from before twenty
twenty-one that were published online, but it couldn't solve

(06:50):
new problems. And so you get a sense that,
in the great big world of its brain,
which is the Internet, if those answers are somewhere out there,
it can just regurgitate them.
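
Here is a tiny, runnable Python sketch of the contamination check behind that GPT-4 finding. The problem IDs, dates, and pass/fail outcomes are all made up; the method is the point: split benchmark problems by the model's training cutoff and compare accuracy on each side.

from datetime import date

# Invented data mirroring the finding Sam cites: strong on problems
# published before the training cutoff, weak on genuinely new ones.
TRAINING_CUTOFF = date(2021, 9, 1)

# (problem_id, date_published, solved_by_model)
results = [
    ("old_problem_2019", date(2019, 3, 1), True),
    ("old_problem_2020", date(2020, 7, 1), True),
    ("new_problem_2022", date(2022, 5, 1), False),
    ("new_problem_2023", date(2023, 1, 1), False),
]

def accuracy(rows):
    return sum(solved for _, _, solved in rows) / len(rows) if rows else 0.0

seen = [r for r in results if r[1] < TRAINING_CUTOFF]
unseen = [r for r in results if r[1] >= TRAINING_CUTOFF]

# A big gap between these two numbers suggests memorised answers,
# not general problem-solving ability.
print(f"pre-cutoff accuracy:  {accuracy(seen):.0%}")
print(f"post-cutoff accuracy: {accuracy(unseen):.0%}")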

Speaker 1 (07:00):
So it's like if you've got an advance copy of
an exam or a test at uni or in school:
you can train for the test. That doesn't necessarily mean
you have the comprehension levels to speak to a certain
topic or question in the same subject outside of the
confines of that context.

Speaker 2 (07:19):
And if we think about what all of this is for,
it's about trying to work out if these models are
going to be good in practice for us to spend
twenty bucks a month on them. I mean, let's get
back to the real core problem here. We're trying to
work out if it's worth our money. And there was
a great quote from the former British
Prime Minister Rishi Sunak. He said AI models shouldn't be

(07:39):
trusted to mark their own homework. And I think
we can all relate to that. Yeah, and it kind
of encapsulates the problem of not having an independent benchmarking framework.

Speaker 1 (07:48):
You also mentioned that companies are testing multiple versions, or
that they're cherry picking their data and choosing the kind
of findings that favor their models the most.

Speaker 3 (08:00):
What's happening there.

Speaker 2 (08:00):
Tell us a bit more. Well, some research found that
major companies, we're talking Meta, OpenAI and Google, have
been privately testing dozens of different model versions on popular tests.
They're only revealing the scores from their best performing versions.
And it's like, you know, you're on a night out,
you take twenty selfies, you put up the best one. Yeah,
of course, and I think at some stage you have

(08:22):
to admit that all businesses would do that. Yeah, you know, TDA,
if we had to report results to the stock market,
you know, we would probably highlight more the pieces that
did really, really well. Not that there's ever any pieces
that don't, but you know.

Speaker 3 (08:36):
A flawless company that never makes mistakes.

Speaker 2 (08:38):
Obviously. But I think it's good to
acknowledge this bit of business reality there. But
I do think that in this case it's different, because
there's no transparency at all in terms of the testing process.
To continue with our university kind of example, it's
like a student taking the same exam twenty seven times

(08:59):
and then only reporting the best score. Yep.
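
That "same exam twenty seven times" effect is easy to simulate. Below is a short, runnable Python sketch; the pass rate, question count, and run count are invented numbers chosen only to show how publishing the maximum of many private runs inflates the reported score above the typical one.

import random

random.seed(0)

TRUE_SKILL = 0.81      # the model's "real" pass rate on this benchmark
NUM_QUESTIONS = 200    # questions per exam sitting
NUM_PRIVATE_RUNS = 27  # the "same exam twenty-seven times"

def one_run() -> float:
    # One sitting: each question passed with probability TRUE_SKILL.
    passed = sum(random.random() < TRUE_SKILL for _ in range(NUM_QUESTIONS))
    return passed / NUM_QUESTIONS

scores = [one_run() for _ in range(NUM_PRIVATE_RUNS)]

print(f"typical score:  {sum(scores) / len(scores):.1%}")  # hovers near 81%
print(f"reported score: {max(scores):.1%}")                # the cherry-picked max

The gap between those two printed numbers is pure selection effect: no model got smarter, only the reporting changed.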

Speaker 1 (09:01):
So without that transparency, there's that issue around trust, and
I think we see that really playing out in real
time right now, that there is a lack of trust
in the broader community about AI models because we don't
know how they come to these answers. What are some
of the other consequences of this manipulation. How does this

(09:22):
play out in the real world every day?

Speaker 2 (09:24):
Well, there's definitely that marketing angle of misleading consumers and
you and I signing up to an AI platform because
we think it's ninety six percent going to be great,
and in fact it might be eighty one percent great,
which is still an incredible feat of technology. But then
from a government perspective, governments are looking at these benchmarks
for the way that they're thinking about regulation or policy decisions.

(09:48):
So the European Union's AI Act uses benchmarks to
determine whether new AI models pose systemic risk. Can they
be used by extremists? Can they be used to spread
hate online? Can they be used to mislead and deliberately
spread misinformation? And if companies are manipulating those scores, it

(10:08):
could affect how these powerful technologies are indeed regulated.

Speaker 1 (10:12):
Okay, because if these scores say that eighty percent of
the content is factual, or that there are these really
great systems in place to catch mis- and disinformation or
hate speech, then that might not concern leaders to the
point where they think there needs to be certain levels
of regulation.

Speaker 2 (10:28):
One hundred percent.

Speaker 1 (10:30):
You mentioned this idea of artificial general intelligence earlier. We
used the I, Robot example, one of Will Smith's best.
OpenAI is claiming that GPT-5 is a step
forward in AGI. But what does that actually mean in
a not-Hollywood kind of fantasy world?

Speaker 2 (10:48):
Well, I gave the example before of outperforming humans. That's
a very broad definition, and the problem is that I
can't really give you a more specific definition, because even
OpenAI can't really do that, right? One OpenAI
statement said AGI is still a weakly defined term and
means different things to different people. We don't really know
what we don't know.

Speaker 3 (11:08):
So how can GPT-5 be a step forward, then,
if the company itself isn't sure?

Speaker 2 (11:14):
Interesting question. It very much raises the question of how
we'd even know when we got there. Yeah, I mean,
this is the exciting and terrifying part of living through
rapidly emerging technology: we're learning as we go
as a society, and that is not always pretty.

Speaker 1 (11:33):
So for people listening who might be using AI kind
of casually or infrequently in their work or uni life,
maybe they're building up their understanding of the different platforms
out there. What should we make of all these competing claims?
You know, how do we make better decisions about which
AI model is actually the good one, or the right one,

(11:55):
or the best one for us?

Speaker 2 (11:56):
As somebody who is known now in
my friend group and in the workplace as somebody who's
really interested in AI, I'm constantly asked: which one should
I use? What's the best one? And the answer is,
it's about what you're trying to do, essentially. So one
model might be better for creative writing, but another might
excel more at data analysis and crunching some numbers. Studies

(12:16):
are showing, though, that AI models often fail when you
move from those controlled test conditions, or those use cases
or features that are rolled out by these platforms as
part of marketing campaigns, to the messy real world use
that humans actually put these tools to.

Speaker 1 (12:32):
It actually reminds me of, and I'm not even sure
if this is the same thing, but when Siri was
first rolled out, and Apple, kind of in their big
announcements, was like, you can ask her this, or you
can ask her that, or if you want to know what.

Speaker 3 (12:46):
The weather's like, should you take an umbrella?

Speaker 1 (12:48):
And I found when I first started using Siri, like, yeah,
she could definitely answer those sorts of questions, but not
a whole lot else outside of, almost, a
prescribed text from Apple about how to use Siri.

Speaker 2 (13:00):
When you get into the world of trying to engage
with the user no matter what they're about to say,
it can take a little bit of time for the
technology to be refined and to keep learning from what
users actually want.

Speaker 3 (13:12):
So, Sam, what is the way forward in all of this?

Speaker 1 (13:16):
Is there a conversation happening at a more global scale
about this regulation?

Speaker 2 (13:22):
Definitely, and there's no clear leader here. I mentioned the
work being done by the European Union before. There's a
coalition of countries, including Australia, that signed on to kind
of key principles of how to keep AI safe. That
was in mid twenty twenty three, so there's a bit
of a global movement there from a government perspective. There's
some really interesting work being done out of universities, particularly

(13:44):
Stanford University. They developed an AI Index report which does
try to compare the models like for like. But I
think we first need to determine who the authority is
going to be in this space before we can kind
of put the burden on them to roll out this
standardized testing. And I do think, in a few decades,
it will take a while, but I do think we'll get there.

(14:05):
I mean, we have the TGA to regulate medicine. We
have a central aviation authority to regulate what a plane
that's airworthy looks like. Yep. I do think that we're
going to see a central AI authority in Australia, and
maybe around the world, someday. But we are very early
in this story. We're, like, one percent through the
AI story, if that, and that's really exciting. But it's

(14:27):
also really important to continuously discuss the potential flaws and
the gaps that exist in this big, new, scary world.

Speaker 1 (14:35):
Yeah, I think that healthy dose of skepticism is what
we will be carrying forward. But I look forward to
many more conversations like this with you, Sam.

Speaker 2 (14:43):
Well, we don't have a choice.

Speaker 3 (14:44):
Help me understand it all.

Speaker 1 (14:46):
Thank you so much for breaking that down for us, Sam,
and thank you for listening to today's deep Dive. We'll
be back a little later on with your news headlines,
but until then, have a great day.

Speaker 2 (15:00):
My name is Lily Madden and I'm a proud Arrernte,
Bundjalung, Kalkadoon woman from Gadigal Country. The Daily Aus acknowledges
that this podcast is recorded on the lands of the
Gadigal people, and pays respect to all Aboriginal and Torres
Strait Islander nations. We pay our respects to the
First Peoples of these countries, both past and present.