
June 24, 2025 29 mins
Are Apple’s AI claims about large language models a bombshell—or just old news?

In this episode, we dive deep into the viral Apple paper that claims LLMs (large language models) can’t really reason—they just remix memorized patterns. But is this a shocking revelation, or are we missing the bigger picture? Join us as we break down the science behind neural networks, debunk the myths, and reveal why serious AI researchers aren’t surprised by these findings.

We’ll expose the real story: LLMs aren’t just standalone chatbots—they’re powerful when paired with external tools, and that’s where the true AI innovation happens. Discover how tool integration supercharges LLM accuracy, why token output limits matter, and how the media often gets it wrong about AI’s capabilities. If you’re curious about the future of artificial intelligence, machine learning, and the truth behind the headlines, this episode is your must-listen.

Don’t fall for the hype—get the facts, get inspired, and join the conversation! Hit play, share with your fellow tech enthusiasts, and subscribe for more myth-busting AI insights.


Become a supporter of this podcast: https://www.spreaker.com/podcast/tech-threads-sci-tech-future-tech-ai--5976276/support.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Welcome to The Deep Dive, where we really try to cut through the noise and get straight to what you need to know. Have you ever felt like you're trying to track, I don't know, some kind of shape-shifting creature, but the creature is actually AI?

Speaker 2 (00:11):
Oh definitely. It's constantly changing.

Speaker 1 (00:13):
Exactly. One day, you know, the headlines are just screaming about an imminent AI job apocalypse, that white-collar bloodbath coming kind of talk that genuinely makes you wonder about your career.

Speaker 2 (00:25):
It does get you thinking.

Speaker 1 (00:26):
Yes, and then almost the next day you see other headlines that completely contradict that. They dismiss large language models, LLMs, as nothing more than an illusion of thinking, maybe even fake intelligence. It's like, well, it's enough to give anyone whiplash. My own feed looks like that all the time, and I bet yours.

Speaker 2 (00:44):
Does too. Mine too. It's a constant back and forth.

Speaker 1 (00:46):
So today we're drilling down into one specific claim that really caught fire. It hasn't just gone viral, it's apparently been seen by tens of millions, quoted everywhere, even in big places like The Guardian. They called it a devastating paper, right?

Speaker 2 (01:03):
I saw that description.

Speaker 1 (01:05):
Quite dramatic, it really was. And the core idea: that AI models don't actually reason at all, they just memorize patterns. Now, let's be real. You listening right now, you probably have a mountain of your own stuff to read, right? Yeah, for work, for study, or just, you know, staying curious.

Speaker 2 (01:22):
Absolutely, who has time for thirty page papers?

Speaker 1 (01:24):
Exactly. You likely don't have hours to dissect a dense research paper, let alone chase down all the commentary and counterarguments swirling around it. And that's precisely why we're doing this. This deep dive is basically designed to be your shortcut. We want to give you a clear, balanced, and hopefully pretty engaging understanding of what's really going on behind those splashy headlines.

Speaker 2 (01:44):
Sorting the signal from the noise.

Speaker 1 (01:46):
That's the goal. Our mission today is to unpack these claims about AI reasoning, really look hard at the research that started this whole thing, and, crucially, reveal what the experts, the people deep in this field, actually understand about these powerful, fast-moving models. We're not selling a story here, no agenda, just clarity. We want to help you separate the real

(02:06):
insights from the hype and give you the tools to
think critically about what AI can and maybe can't do.
And definitely stick with us because we've got some surprising
insights coming up, including a pretty nuanced recommendation on which
AI model you might actually want to use day to day.

Speaker 2 (02:24):
Oh interesting practical advice.

Speaker 1 (02:26):
Yeah, and we'll even touch on OpenAI's brand-new o3-pro model, though quick heads up, that two-hundred-dollar price tag is definitely on the premium side right now. Okay, let's really get into this.

Speaker 2 (02:36):
Great, because, you know, to really understand why a claim like AI can't reason hits such a nerve, we have to look broader than just, like, the frantic news alerts.

Speaker 1 (02:47):
Right, there's more context.

Speaker 2 (02:48):
There is. There's this real hunger for clarity out there, and for good reason. Think about the background noise. You've got these huge statements coming straight from the CEOs of the AI labs building this stuff, like Sam Altman, exactly. Altman recently wrote something like, humanity is close to building digital superintelligence, we're past the event horizon, the takeoff has started. Now, okay, those definitions are often

(03:10):
deliberately vague. Vague, event horizon, right? But you can totally see why people are paying attention. They see LLMs getting better almost daily in their own lives, at work. I mean,
who hasn't been kind of amazed and maybe a little
freaked out by what they can do now compared to
just last year.

Speaker 1 (03:27):
It's a rapid change.

Speaker 2 (03:29):
It is. So these kinds of pronouncements from the top, they definitely fan the flames, both excitement and, let's be honest, a good
amount of anxiety too.

Speaker 1 (03:36):
Yeah, the anxiety is real.

Speaker 2 (03:39):
Especially with the job stuff, and that gets amplified by the constant drumbeat of headlines predicting huge societal changes. Remember that New York Times piece quoting Anthropic's CEO Dario Amodei talking about an economic white-collar bloodbath coming?

Speaker 1 (03:54):
Yeah, I remember that one.

Speaker 2 (03:56):
These aren't just one-offs. It feels like every week there's another stark warning. So it creates this really potent mix: excitement about possibilities, but also real unease about the future of jobs, of intelligence itself.

Speaker 1 (04:09):
So you've got this whiplash again. Superintelligence is coming. No wait, it's an illusion.

Speaker 2 (04:13):
Illusion precisely so when one side, often the tech leaders,
predicts superintelligence and this massive transformation, and another loud voice
calls it all just smoke and mirrors. It's completely natural
for people to feel confused. They're desperate for some kind
of clear story. It's just human nature, right, trying to
make sense of contradictory info, especially when it's about something
so big like what intelligence even is and how it

(04:36):
might reshape everything. My inbox, like yours I'm sure, has been full of people just trying to figure out what's real.

Speaker 1 (04:41):
That makes total sense. Yeah, it really does create the
perfect conditions, a kind of perfect storm for a paper
like that Apple one to just explode online. Now, let's
park any cynical thoughts for a second, like whether Apple
maybe wanted to debunk AI more than improve AI, as some critics hinted.

Speaker 2 (04:57):
There was some speculation there, right.

Speaker 1 (05:00):
Let's focus on what the paper actually said. Its main point was pretty bold, maybe even provocative: large language models don't fundamentally operate using explicit algorithms, and critically, the paper argued, these models really struggle, almost inherently, with puzzles that have,
as they put it, sufficient degrees of complexity.

Speaker 2 (05:20):
Okay, so what kind of puzzles are we talking about?

Speaker 1 (05:23):
Well, they used some real classic puzzles you'd recognize. One big one was the Tower of Hanoi. You probably know it: moving different-sized discs between three pegs.

Speaker 2 (05:32):
Oh yeah, the one where you can't put a big disc on a small one.

Speaker 1 (05:34):
Exactly. That one sounds simple, but add more discs and the number of moves just blows up. It needs real step-by-step planning. They also used a special
version of checkers, not player versus player, but moving blue
tokens one way, red tokens the other, following the rules.
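
A quick aside on why the moves "just blow up": the minimum number of moves for an n-disc Tower of Hanoi is 2^n - 1, so the solution length roughly doubles with every extra disc. Here's a minimal Python sketch of that growth, offered as an illustration rather than code from the episode or the paper:

```python
# Minimum moves for the Tower of Hanoi: to move n discs you move n-1 discs,
# then the largest disc, then the n-1 discs again,
# so moves(n) = 2 * moves(n-1) + 1 = 2**n - 1.
def hanoi_moves(n: int) -> int:
    return 2**n - 1

for n in (3, 5, 10, 15, 20):
    print(f"{n} discs -> {hanoi_moves(n):,} moves")
# 3 -> 7, 5 -> 31, 10 -> 1,023, 15 -> 32,767, 20 -> 1,048,575
```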

Speaker 2 (05:50):
Okay, a movement logic puzzle.

Speaker 1 (05:51):
Right, and then the classic river crossing puzzle. You know,
the fox and chicken one. Get the fox, chicken and
grain across the river, but don't leave the fox with
the chicken or the chicken with the grain.

Speaker 2 (06:01):
Got it. All requiring careful steps and thinking ahead.

Speaker 1 (06:06):
Precisely. They all seemed like perfect tests for reasoning, and the logic behind using them was pretty straightforward. The paper basically said, look, if these AI models were just pre-programmed algorithms, like a calculator or regular software.

Speaker 2 (06:19):
Like deterministic code.

Speaker 1 (06:22):
Their performance shouldn't really change much with complexity, whether the
tower had three discs or ten, the checker's board was
small or big. If it was purely algorithmic, it should
nail it one hundred percent every time.

Speaker 2 (06:33):
Theoretically, yes, that was the premise.

Speaker 1 (06:35):
It was set up as this logical test meant to
show any deviation from perfect algorithmic steps.

Speaker 2 (06:40):
So what did they find? The big shocker?

Speaker 1 (06:43):
Well, maybe not so shocking, depending on what you already knew about LLMs. The results showed pretty much what many in AI might expect. Performance dropped, sometimes quite a bit, as the tasks got harder.

Speaker 2 (06:54):
More discs, more tokens, lower accuracy.

Speaker 1 (06:57):
Yeah, the more complex the puzzle, the worse the models did.
And for the Apple authors this was proof, proof that LLMs are definitely not like traditional deterministic software, and that
conclusion was framed by some at least as this devastating
blow to the whole narrative of AI's advanced smarts.

Speaker 2 (07:15):
Okay, and this is where we really need to add that crucial context, because this specific finding, that LLMs aren't like old-school software and that performance drops with complexity, well, actually that's not new news at all.

Speaker 1 (07:27):
Oh really? It was presented as quite a bombshell.

Speaker 2 (07:29):
Know, but within the AI research community this has been
understood and talked about for years. It's not some secret.
It's fundamental to how they're built. They are not traditional
software where input A always gives output B, like two
plus two always equals four.

Speaker 1 (07:42):
Right. But they aren't random either.

Speaker 2 (07:44):
Either, definitely not random. If they were, they wouldn't work
at all. They couldn't generate text or past tests. The
reality is more interesting. Llms are what we call probabilistic
neural networks. They live somewhere in this dynamic space between
fixed algorithms and pure or randomness.

Speaker 1 (08:01):
Okay, so probabilistic, meaning they guess the next step?

Speaker 2 (08:06):
Kind of, yeah. Think of it less like a perfectly engineered clock and more like a really, really good improv actor. No script, just constantly guessing the most likely next word or action based on tons of past data, scripts, dialogues, everything they've read. They predict the next token, the next word, based on billions of patterns.
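
To make "predicting the next token" concrete, here's a toy sketch of the probabilistic selection step. The vocabulary and probabilities below are invented for illustration; a real model scores its whole vocabulary with a neural network at every step:

```python
import random

# Toy next-token distribution for the prefix "The cat sat on the".
# Invented numbers: a real LLM computes these scores with a neural network.
next_token_probs = {"mat": 0.55, "sofa": 0.25, "roof": 0.15, "piano": 0.05}

def sample_next_token(probs: dict[str, float]) -> str:
    # Sample in proportion to probability: likely words usually win,
    # but nothing is guaranteed, which is why outputs vary run to run.
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print("The cat sat on the", sample_next_token(next_token_probs))
```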

Speaker 1 (08:26):
Okay, that makes sense.

Speaker 2 (08:27):
And the clearest example, the one that really drives this home and has been known for ages in research circles, honestly, we could have called this breaking news years ago, is multiplication.

Speaker 1 (08:37):
Multiplication? How so?

Speaker 2 (08:39):
Give these models any tools, no calculator, no code interpreter, nothing,
and you ask them to multiply big numbers. The second
the number of digits gets too large, they just fail dramatically.

Speaker 1 (08:49):
Can't do the math in their head.

Speaker 2 (08:51):
Exactly. They can't perform the sum accurately once it exceeds their pattern-matching capacity. It's like asking us to multiply billion-digit numbers mentally. It's just not what they're built for. They operate on probabilities and patterns, not rigid calculation steps. Now, sure,
if the numbers are small enough, they can often reason
it out because they've memorized those smaller multiplication patterns.

Speaker 1 (09:12):
And they have gotten better over time.

Speaker 2 (09:14):
Oh definitely, we've seen improvements. Like, if you compare an older model like OpenAI's o1-mini to their newer o3-mini, the newer one can handle more digits before it starts to fail, or flame out, as we sometimes say. And this is key: even the absolute latest, greatest models today, if you take away their tools, they will eventually hit

(09:35):
a wall with large multiplication. They just can't do it perfectly.
And that will always be true because they aren't designed
to be predictable software.

Speaker 1 (09:41):
Ah, okay, so what are they designed for, then?

Speaker 2 (09:44):
They're designed to be generative. They're designed, fundamentally, to use software, not be software. Their main job is to create plausible outputs, which leads us right into why they hallucinate.

Speaker 1 (09:57):
Right, the hallucination problem: making stuff up confidently.

Speaker 2 (10:00):
Exactly, and that tendency to generate wrong answers, or hallucinate, when they can't actually handle the question isn't necessarily a flaw, you see, not in the context of their generative design. It's almost a feature. They're built to produce something that sounds right, to fill gaps with the most statistically likely text, even if it's factually wrong.

Speaker 1 (10:20):
So they guess plausibly.

Speaker 2 (10:23):
Pretty much. Like, I recently gave a really complex calculation to Claude 4 Opus, Anthropic's latest, and also to Google's Gemini 2.5 Pro. Deliberately gave them no tools, knew they couldn't get the exact answer, but instead of saying I don't know, they just made up answers.

Speaker 1 (10:37):
Hallucinated the numbers.

Speaker 2 (10:39):
Yep. And what was funny and telling was that the fake answers looked plausible. They ended in twos, started with sixty-seven, which ironically the correct answer also does. It perfectly shows how these models are, well, very convincing BS artists.

Speaker 1 (10:52):
Uh huh, okay, convincing BS artists.

Speaker 2 (10:55):
It's not like they're lying intentionally like a person. It's just baked into their design: generate fluent, probable text. If there's no clear statistical path to a correct answer, they'll just generate the most textually likely one, which, yeah, often ends up being confidently wrong.

Speaker 1 (11:13):
That's a really, really crucial point. Designed to use software, not be software. That changes how you think about them entirely. And that reframing, I think, really highlights the first big miss in that Apple paper: how it basically ignored that LLMs are actually really good at using tools.

Speaker 2 (11:32):
That was a major oversight.

Speaker 1 (11:33):
Let's go back to your multiplication example. You said Claude hallucinated without tools, but when you let it use a code interpreter?

Speaker 2 (11:39):
It got the answer right perfectly.

Speaker 1 (11:41):
And what's even wilder, you said you didn't even have to tell it to use a tool. It just figured it out.

Speaker 2 (11:45):
It inferred it from the complexity, knew it needed help.
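
For the curious, here's a minimal sketch of what that kind of tool call looks like conceptually. The ask_llm function below is a hypothetical stand-in, not any vendor's actual API; the point is that the exact arithmetic is done by ordinary code, and the model only has to decide to ask for it:

```python
# Hypothetical LLM-plus-tool sketch: the model requests a tool, plain code does the math.

def multiply_tool(a: int, b: int) -> int:
    # Python integers have arbitrary precision, so this product is exact
    # no matter how many digits are involved.
    return a * b

def ask_llm(prompt: str) -> dict:
    # Stand-in for a real model call. Here we pretend the model always
    # recognises a big multiplication and asks for the tool instead of guessing.
    return {"tool": "multiply", "args": (123456789012345, 987654321098765)}

response = ask_llm("What is 123456789012345 * 987654321098765?")
if response.get("tool") == "multiply":
    # The interpreter, not the model, produces the exact product.
    print(multiply_tool(*response["args"]))
```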

Speaker 1 (11:48):
See, that's the thing for me. The surprising part of the Apple paper wasn't that LLMs fail at exact math. Like you said, we kind of know that's how they work, right? Probabilistic.

Speaker 2 (11:58):
Istic, it's a known characteristic.

Speaker 1 (12:00):
The surprise was that the paper found that surprising. It's like being shocked that a brilliant poet can't instantly solve, I don't know, complex calculus in their head without a calculator. Different skills, different design, and, crucially, different reliance on tools. It felt like they were testing a runner after tying their shoes together and then declaring them slow. You're not testing potential. You're testing a weirdly specific handicap.

Speaker 2 (12:21):
Exactly right. And connecting that back, the tool-use issue is just one piece. There were several other, frankly, fatal flaws in the Apple paper's approach that, for many of us in the field, kind of render its big conclusions, well, largely moot. Okay, like what else? Another huge thing that maybe didn't get much attention is token limits.

(12:42):
The paper talks dramatically about accuracy collapsing to zero beyond a certain complexity.

Speaker 1 (12:46):
Right, I remember that part.

Speaker 2 (12:48):
But why? Because these models have limits on how many tokens, basically chunks of text or concepts, they can process and spit out at once. They have a finite context window, like a mental scratchpad. For the Claude model they tested, it was big: one hundred and twenty-eight thousand tokens. But here's the kicker. Some of the really complex puzzles

(13:09):
they tested actually needed more than one hundred and twenty
eight thousand tokens for a full step by step solution.

Speaker 1 (13:16):
So they literally couldn't write out the full answer, even if they knew it.

Speaker 2 (13:19):
Precisely. Even if we pretend for a second they were perfect calculators, which they aren't, they simply didn't have enough space, enough virtual paper, to write down the whole answer trace the paper demanded. Honestly, I think it's actually to the models' credit that they seem to realize this. So instead of just trying forever to output something too long, they kind of intelligently gave up. They produced what the

(13:39):
paper called shorter traces. They'd say things like, okay, here's the algorithm you need, or here's the tool you should use. That suggests a kind of smart awareness of their own limits,
which seems pretty reasonable to me, like us knowing when
we need pen and paper.
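
A back-of-the-envelope sketch of that token-budget problem, using the Tower of Hanoi as the example. The tokens-per-move figure is an assumption for illustration only, not a number from the paper:

```python
# Rough output budget for writing out a full Tower of Hanoi move list.
CONTEXT_LIMIT = 128_000   # token budget cited for the tested Claude model
TOKENS_PER_MOVE = 10      # assumption: ~10 tokens to write "move disc X from peg A to peg B"

for discs in (10, 12, 14, 16):
    moves = 2**discs - 1
    needed = moves * TOKENS_PER_MOVE
    verdict = "fits" if needed <= CONTEXT_LIMIT else "exceeds the limit"
    print(f"{discs} discs: {moves:,} moves, ~{needed:,} tokens ({verdict})")
# Under these assumptions, somewhere around 14 discs the written-out trace
# alone blows past 128k tokens, before counting the prompt or any reasoning text.
```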

Speaker 1 (13:51):
That's a really interesting perspective.

Speaker 2 (13:52):
Okay, well, here's a quick but very revealing detail buried deep in the paper itself. The Apple authors actually admit they originally wanted to compare thinking versus non-thinking models. You know, ones that show their work versus ones that just give an answer, on standard math tests. Okay, but the results didn't quite fit the story they maybe expected. What's fascinating is they found the thinking models did actually

(14:16):
outperform the non-thinking ones with the same amount of compute. Because that result, for whatever reason, didn't align with their initial idea, they actually dropped the math tests and switched to puzzles like the Tower of Hanoi.

Speaker 1 (14:28):
Whoa. So they changed the test when the first one didn't give the result they expected?

Speaker 2 (14:32):
It certainly looks that way. I can't help but feel they might have gone into it with a preconceived idea about LLMs' limits, and that, maybe consciously or not, shaped how they designed the later tests and interpreted the results. It's a good reminder for all researchers, right: follow the data, don't bend the data to fit your narrative.

Speaker 1 (14:50):
Definitely a cautionary tale there.

Speaker 2 (14:52):
Wow. And another aha moment from the paper was the authors' genuine surprise that even when they gave the models the exact algorithm for the puzzles in the prompt, the models still messed up sometimes.

Speaker 1 (15:03):
Wait, even with the instructions?

Speaker 2 (15:06):
Yeah. They wrote something like, surely finding the solution needs more computation than just executing a given algorithm. But by now you, the listener, probably get it. They aren't calculators. They aren't built for perfectly executing rigid steps.

Speaker 1 (15:19):
Because they're probabilistic.

Speaker 2 (15:21):
Exactly, because they're probabilistic neural nets. Even if there's, like, a ninety-nine-point-nine percent chance they get the next step right, when you have a puzzle needing maybe millions of sequential steps, which these complex tasks can break down into, that tiny zero-point-one percent error chance on each step starts to add up exponentially. They will eventually make a mistake somewhere along the line.

Speaker 1 (15:41):
Like the multiplication example again?
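
To put numbers on that compounding-error point, here's the arithmetic under the ninety-nine-point-nine percent per-step figure used above:

```python
# Chance of completing a chain of steps with no errors, assuming an
# independent 99.9% success rate per step (the figure quoted above).
per_step = 0.999
for steps in (100, 1_000, 10_000, 100_000):
    p_flawless = per_step ** steps
    print(f"{steps:>7,} steps -> {p_flawless:.4%} chance of a flawless run")
# 100 steps: ~90.5%; 1,000 steps: ~36.8%; 10,000 steps: ~0.0045%;
# 100,000 steps: effectively zero.
```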

Speaker 2 (15:43):
Precisely. Of course an LLM knows the algorithm for a multiplication step. The basic transformer architecture is literally built on matrix multiplications. But that doesn't mean over millions of steps they won't slip up. It's like asking a human to do a billion calculations perfectly without a mistake. It's

(16:03):
just not realistic for a system based on probability and patterns, not a flawless logic circuit.

Speaker 1 (16:09):
Okay, this is making so much more sense now.

Speaker 2 (16:11):
So the paper's big conclusion, the headline grabber, was that
we may be encountering fundamental barriers to generalizable reasoning. But honestly,
if you look at the bigger AI research picture, this
exact limitation has been pointed out by serious experts for
a long time. I remember interviewing Professor Ralph way back
in December twenty twenty three, and he was already talking
about this exact point. It's definitely not breaking news in

(16:33):
serious AI circles.

Speaker 1 (16:34):
So the hype was way ahead of the actual novelty.

Speaker 2 (16:37):
Way ahead. And if you need one more nail in the coffin for the Apple paper's methodology, check this out. One researcher actually used Claude 4 Opus as a co-author on a follow-up paper.

Speaker 1 (16:46):
Used the AI to critique the paper about AI? That's brilliant.

Speaker 2 (16:50):
It was. And that co-authored paper meticulously detailed all the flaws in the Apple paper, things even I missed on first read. For instance, it pointed out the absolutely stunning fact that some questions in the Apple paper were literally impossible to answer, because they were logically impossible.

Speaker 1 (17:08):
You're kidding. They tested it on unsolvable problems.

Speaker 2 (17:10):
Unbelievable, right? They were testing the models on questions that
had no correct answer to begin with. It was literally
an impossible test.

Speaker 1 (17:17):
Wow, Okay, that's pretty damning.

Speaker 2 (17:20):
So no, despite what some headlines suggested, like that Guardian piece quoting AI critic Gary Marcus saying the tech world was reeling because the paper showed AI power was wildly oversold, the tech world was not reeling. I'd go so far as to say there isn't a single serious, informed AI researcher who would have been genuinely shocked by those results. The headlines were just way, way more sensational

(17:42):
than the actual findings were new or surprising to people
in the field. Classic case of presenting a known trait
as some shocking new flaw.

Speaker 1 (17:49):
That's a really thorough debunking of the paper's impact, or lack thereof, within the expert community. But okay, it's also important, as we do this, to stress that just because that paper had flaws doesn't mean we should just flip the switch and say LLMs are perfect reasoners with no limits. That's not right either, is it?

Speaker 2 (18:08):
Either, Is it absolutely not that would be swinging the
pendulum way too far the other way.

Speaker 1 (18:13):
We definitely need to acknowledge they do make basic reasoning
mistakes sometimes in pretty simple, everyday kinds of situations.

Speaker 3 (18:20):
They are definitely not infallible. Correct, they still have limitations.

Speaker 1 (18:24):
Like, for example, there's this benchmark I put together called SimpleBench. It's designed specifically to test common-sense, everyday reasoning, not complex logic puzzles, and I tested the new o3-pro from OpenAI on it. There's one scenario where models often don't get that if you just drop a glove from your hand, it'll, well, just fall onto the road.

Speaker 2 (18:43):
Seems pretty basic physics.

Speaker 1 (18:45):
Yeah, incredibly basic. A kid would know that. Yet these super-advanced models, even after thinking for ages, eighteen minutes o3-pro took on this one, can still stumble over something so fundamental, so grounded in just how the physical world works. These aren't the Tower of Hanoi. These are simple, real-world things where models sometimes lack that basic intuition.

Speaker 2 (19:06):
So common sense can still be a challenge.

Speaker 1 (19:08):
Really can. And these quirks, these failure modes, they go
beyond just reasoning. We see hallucinations pop up all the
time in the real world, and it can be pretty jarring,
maybe even a bit unsettling, when you're expecting facts.

Speaker 2 (19:21):
Like the multiplication answers you got.

Speaker 1 (19:22):
Got exactly, Or that recent example with Google Gemini's new
V three image generator, someone prompted it to output a
London scene with absolutely zero lampposts, not one, and the
picture I made was well creative but totally unrealistic. A
London street with no street lights at all.

Speaker 2 (19:39):
Uh huh, yeah, I saw that one. A bit surreal.

Speaker 1 (19:42):
Now, sometimes you can actually use those hallucinations for fun, creative stuff. Like, imagine a weird ad for dog food with a talking dog saying egg prices are going up, maybe twenty dollars this month. That company probably saved a ton on actors and sets for one viral ad, right? The models can be incredibly inventive BS artists.

Speaker 2 (20:01):
Sometimes they can be creatively unpredictable, for sure.

Speaker 1 (20:05):
But the real problem, the thing that genuinely shakes people's trust,
is when those hallucinations are presented as fact, or when
people think they're fact. I really hope you listening aren't
as shocked as that Sky News presenter in the UK
was recently.

Speaker 2 (20:19):
Ah, the ChatGPT transcript incident. Yeah, that got hundreds of thousands of views.

Speaker 3 (20:23):
ChatGPT just confidently made up a whole interview transcript, and it led to loads of news segments and articles all asking, basically, can we actually trust ChatGPT if it just makes stuff up like this? It really highlights that even though AI is powerful, it can also just fabricate things, seemingly without checking if they're true.

Speaker 2 (20:41):
Fluency doesn't equal accuracy. That Sky News example is perfect, actually. It ties right into what I call the two-thoughts concept. It's really essential if you want to understand LLMs right now: you have to hold two seemingly contradictory ideas in your head simultaneously.

Speaker 1 (20:56):
Okay, what are the two thoughts?

Speaker 4 (20:58):
First thought: LLMs are rapidly, astonishingly catching up to human performance in almost everything text-based, generating, summarizing, translating.

Speaker 2 (21:08):
The progress is genuinely breathtaking.

Speaker 1 (21:11):
Okay, that's thought one. They're getting really good at language tasks, right.

Speaker 2 (21:15):
But the second, equally important thought is they have almost zero hesitation about generating mistruths, confidently stating things that are wrong. Much like, well, let's be fair, many humans do. Fair point. So if human performance is your yardstick, then yeah, LLMs are catching up fast, and they can definitely BS with the best of them.

Speaker 1 (21:32):
Okay, so amazing language skills, but also capable of confident inaccuracy.

Speaker 2 (21:38):
Exactly. But here's what they are not. They are not supercomputers in the way we usually think of them. They aren't the kind of AI that can, say, predict the weather perfectly or solve brand-new science problems all by themselves without any help or data. Their core strength isn't pure, unaided calculation or flawless logic.

Speaker 1 (21:56):
So where do the breakthroughs come from.

Speaker 2 (21:58):
The real breakthroughs, and this is where it's super interesting and actually mirrors human invention, happen when they use tools strategically and operate in an environment that actively corrects their BS for them, or at least gives them verifiable external facts to work with.

Speaker 1 (22:14):
Ah back to the tool use, but with a feedback loop.

Speaker 2 (22:17):
Precisely. This ability to leverage tools effectively, that's what's driving genuine scientific advances right now, not in some far-off future. If you want to see a great example, look up the AlphaEvolve video series. It's truly mind-blowing.

Speaker 1 (22:30):
AlphaEvolve? What does that show?

Speaker 2 (22:32):
It shows how AI, when it can use external tools and get feedback from those tools, can make real scientific discoveries, things way beyond what a human team could do alone in the same time. Honestly, after digging into that research, I was really surprised to hear people like Sam Altman saying things like it's going to be twenty twenty-six when we see systems that can figure out novel insights.

Speaker 1 (22:51):
You think we're already there?

Speaker 2 (22:53):
As far as I'm concerned, we have that now. Again, it's not LLMs achieving solo superintelligence. It's LLMs working powerfully in combination with symbolic systems, using tools, using verifiable data. That combination is incredibly powerful. It lets them go way beyond just pattern matching to actually generate novel insights and speed up science. It's the augmented intelligence, the collaborative intelligence,

(23:17):
not the isolated AI, that's the real game changer right now.

Speaker 1 (23:20):
That is absolutely fascinating. The real power isn't the LLM alone, but the LLM plus tools plus feedback plus maybe symbolic systems. Okay, so with all that context, all these nuances we've gone through, you listening might be thinking, okay, great theory, but which AI should I actually use?

Speaker 2 (23:39):
That's totally a fair question, right, especially with new models popping up all the time.

Speaker 1 (23:42):
The practical question. Everyone wants to know what to use.

Speaker 2 (23:45):
So let me give you one big cautionary word first, about benchmarks. Just in the last couple of days, for example, OpenAI finally released its o3-pro model. Right now, it's at a pretty steep two hundred dollars a month. I guess it might eventually come down in price, but for now it's definitely a high-end commitment. And the benchmark scores they put out look amazing on paper, right? Ninety-three percent on competition math, eighty-four percent on

(24:08):
really tough PhD science questions, a great coding rank. These are numbers that definitely jump off the page, makes you think, wow, this must be the best.

Speaker 1 (24:14):
Headline numbers are always impressive. But here's the catch, the beyond-the-headlines trap. My cautionary note comes from looking back at the original o3 model, not pro, just o3, that OpenAI teased back in December twenty twenty four, on day twelve of their Christmas thing.

Speaker 2 (24:31):
I remember that tease. It looked powerful.

Speaker 1 (24:33):
It did. And what's really interesting is that this new o3-pro that just came out actually mostly underperforms the system they teased just a few months ago.

Speaker 2 (24:41):
Oh really? That is interesting. So the pro isn't necessarily better across the board than the teased version?

Speaker 1 (24:45):
Seems not, based on the available comparisons. Yeah,
and that really highlights the main point. You often have
to look way past the big headline benchmark numbers and
really figure out how these models perform on your specific tasks.
A score on some general test doesn't always mean it's
the best for your particular workflow, your problems. It's about

(25:05):
the right fit, practical use, not just abstract scores. It's
like buying a supercar based only on its top speed
when you really just need something for driving around town.

Speaker 2 (25:14):
That's a great analogy. And building on that caution about benchmarks, when you do look at the results companies release, you'll often see a few common patterns, and sometimes things aren't totally clear. Like what? Well, companies might just not compare themselves to competitors at all. OpenAI tends to do that now, just focusing on their own progress. Or, like

(25:35):
Anthropic sometimes does with their Claude models, they'll show you lots of great benchmark scores, but they might not be super transparent about how many tries it took to get that absolute peak score. You see the shiny number, but not necessarily the messy process behind it.

Speaker 1 (25:50):
The multiple attempts issue, right.

Speaker 2 (25:52):
And they might also conveniently leave out details about, say,
strict usage limits on their biggest models, or kind of
gloss over the much higher price tag, like we saw
with o3-pro. All this lack of full transparency can
make it really tricky for you, the user, even an
informed one, to make a truly good decision about what's
actually best for you.

Speaker 1 (26:10):
Okay, so it's complicated. Given all that, what would you
recommend right now if someone just wants to try a
good model for free?

Speaker 2 (26:18):
Right, for free use, maybe with some daily limits, my current practical recommendation would probably be Google's Gemini 2.5 Pro. I admit I'm a bit biased, because it did get the top score on that SimpleBench test we talked about, showing good common-sense reasoning.

Speaker 1 (26:33):
Okay, Gemini 2.5 Pro. Any other perks?

Speaker 2 (26:36):
Yeah, a nice little bonus is you usually get a few free tries of their Veo video generator model, which is pretty cool to experiment with.

Speaker 1 (26:43):
Nice. Any other mentions, maybe for lower cost?

Speaker 2 (26:47):
Yeah, as an honorary mention, especially if you're looking for solid performance but maybe need cheaper API access for development or something, I'd definitely point you towards DeepSeek R1.

Speaker 1 (26:56):
DeepSeek R1? Why that one?

Speaker 2 (26:58):
Its API is incredibly cheap, which is great, but also, crucially, it comes with a proper technical report you can actually read. They're transparent about how it works, its strengths, its weaknesses. That level of openness is really valuable. And, funny anecdote, many of you might have noticed the production quality on my DeepSeek documentary jumped up quite a bit this month. Oh yeah? Well, that was largely thanks

(27:19):
to integrating tools and resources smartly, including DeepSeek models. It really helped polish the final output, which just goes back to that main idea, right? It's not just the model alone, it's how you combine it with tools and systems to get better results.

Speaker 1 (27:34):
That's fantastic advice, really practical recommendations there that help cut through the noise. Okay, so we started this deep dive talking about all that confusion around AI, you know, the job apocalypse versus the it's-all-fake headlines.

Speaker 2 (27:47):
The whiplash exactly.

Speaker 1 (27:50):
Then we really took apart that so-called devastating Apple paper, showing how its big claims kind of missed the point about known LLM traits, like their ability to use tools. We saw how some of its shocking findings were actually things experts already knew about probabilistic models.

Speaker 2 (28:06):
Old news presented as new.

Speaker 1 (28:08):
Right, and we definitely didn't ignore the real limits these models still have. Yes, they can mess up basic reasoning sometimes. They can definitely hallucinate, sometimes with real consequences, like that Sky News thing.

Speaker 2 (28:18):
You need to acknowledge both sides absolutely.

Speaker 1 (28:21):
But maybe the biggest, most useful takeaway from all this is that AI's true power, its real breakthroughs right now, aren't coming from some standalone super brain. They're coming from their amazing ability to work with tools, to operate in systems where their probabilistic guesses can be checked, refined, maybe corrected by more deterministic systems. That's where the genuine

(28:42):
scientific progress is happening now: LLMs plus symbolic systems, already finding novel insights.

Speaker 2 (28:47):
The augmented intelligence path exactly.

Speaker 1 (28:50):
So as you leave this deep dive, hopefully with a clearer picture of AI's real strengths and its current limits, here's something to think about. If LLMs, combined smartly with tools and feedback, are already generating novel insights and catching up to humans in so many areas, even while they can still, yeah, BS like the best of us, what

Speaker 2 (29:06):
Does that really mean for how humans and AI will
work together?

Speaker 1 (29:10):
How might you, in your own life, your work, your learning, start using these tool-using AIs to get results you couldn't before? And maybe the biggest question: how do we
as individuals and as a society build the systems that
reliably correct their BS so we can actually unlock their
full trustworthy potential. The conversation is only just starting to
get really, really interesting, and now hopefully you're much better

(29:33):
equipped to be a smart part of it.