Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
So as we predicted last week, OpenAI just brought out a new language model called GPT 5.2. What we described was something called GPT Garlic, which is their code name for what is now called 5.2. There's a lot of noise about it, especially because, since writing and researching this episode and testing out the new model, they brought out another language
(00:21):
model, or a language model improvement, which was their image generation model on top of their 5.2 model.
So for this episode, we're going to try and keep it short and
sweet for the final episode of 2025.
We're just going to break down exactly what it means for you as
an everyday user and the things you should be looking out for.
Now, this is In the Loop with Jack Horton.
(00:42):
I hope you enjoy the show. Where to start off? It feels like I am dragging myself to the end of this year.
It has been an incredible year, but it's been incredibly tiring,
(01:04):
especially this last month. So this is going to be the last episode of the year. I'm looking forward to a couple of weeks' break and we'll be coming back better and stronger in
2026. So for context, on this episode,
in early December, Sam Altman issued what was called a Code
Red, which is what last week's episode really discussed in
(01:25):
detail. And obviously we went through
the memo he shared and the real problems that OpenAI really face over the next 6, 12, 24 months. And as I said, last week we predicted that they were going to bring out a new language model and it was codenamed at
the time GPT Garlic, which is now called GPT 5.2.
(01:59):
And to cut to the chase, really, I think actually in many ways
they've matched a lot of Gemini's biggest model updates.
Whether that's something that people feel in the press, feel
in the market, I can't really say.
I don't think so yet. But from a benchmark perspective
and an image generation model perspective, they've gone above
(02:20):
and beyond what I, to be honest, expected.
So GPT 5.2 comes in three versions.
Instant is their super fast one. As always, no thinking mode, just rapid responses. What they've called thinking is their standard model. As you've come to expect from most
models these days, it has reasoning.
So it pauses, works through the problem, and then responds and
(02:40):
it creates a plan for itself. And now you've got Pro extended
thinking. So Matt Schumer, someone online who offers a lot of commentary on the space, has had access to this model since November 25th, testing it out, and apparently it was thinking for over an hour on some hard problems.
But obviously Pro is only available on the $200 a month
(03:03):
subscription and frustratingly for many businesses, it's not
available via API. There's also a new reasoning effort setting within the thinking mode, so you have standard, high, or extra high, and obviously the higher you set it, the longer it thinks and the better you'd expect the output to be.
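For anyone on the API side rather than the app, a similar knob exists as a reasoning effort parameter on OpenAI's reasoning models. Here's a rough sketch of what setting it could look like; the gpt-5.2 model id and whether the app's extra high setting maps onto the API at all are my assumptions rather than anything confirmed in this release.

```python
# Hedged sketch: requesting more reasoning effort via the Chat Completions API.
# Assumptions: the "gpt-5.2" model id is a guess, and "high" is simply the
# highest effort value I know the API accepts; "extra high" may not exist there.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5.2",          # assumption: the real API id may differ
    reasoning_effort="high",  # more thinking time, hopefully a better answer
    messages=[{
        "role": "user",
        "content": "Work through this step by step: plan an audit of a 10,000-row spreadsheet.",
    }],
)
print(response.choices[0].message.content)
```

So treat the model id and the effort value as placeholders you'd want to check against the live API docs. From a technical perspective,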
the context window is now 400,000 tokens.
(03:26):
To put that into context, GPT 5.1 had a context window of
about 128,000. So it's a big substantial uplift
here. And for those going, what the heck is a context window? That is the amount of information you can give the model and have it hold in its context when it answers questions or
(03:49):
does tasks for you, before it kind of gets poor and stupid and annoying, as you can often find it doing, or it just says sorry, limit reached, move to a new chat.
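To give a rough feel for how much text 400,000 tokens actually is, here's a minimal sketch using the tiktoken library. I'm using the o200k_base encoding purely as a stand-in, since I can't verify which tokenizer 5.2 actually uses, so the word counts are ballpark only.

```python
# Rough illustration of what a 400,000-token context window means in practice.
# Assumption: o200k_base is a stand-in for whatever tokenizer GPT 5.2 really
# uses, so the words-per-token ratio here is approximate.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

sample = ("The quick brown fox jumps over the lazy dog, "
          "which is a perfectly ordinary English sentence.")
tokens_per_word = len(enc.encode(sample)) / len(sample.split())

NEW_WINDOW = 400_000  # GPT 5.2, as discussed above
OLD_WINDOW = 128_000  # roughly GPT 5.1

print(f"~{tokens_per_word:.2f} tokens per word in this sample")
print(f"400k tokens is roughly {NEW_WINDOW / tokens_per_word:,.0f} words of input")
print(f"128k tokens is roughly {OLD_WINDOW / tokens_per_word:,.0f} words of input")
```

On that rough maths you're looking at somewhere around 300,000 words of input versus under 100,000 before, which is why the jump matters for anyone stuffing whole documents into a chat.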
There's also now an auto setting that's supposed to better choose between instant and extended thinking. I'd recommend ignoring it.
From what I've read online and some limited testing (I've actually got rid of my ChatGPT license recently), the model would often just automatically think for a couple
(04:10):
of seconds and get an answer that is poor or wrong.
And often you're going to need thinking for most professional
work. The second big area of
improvement is professional work, activities and output. So let's talk spreadsheets and presentations and things like that, because this is where OpenAI have clearly put a lot of their marketing energy. And honestly, there's real
(04:31):
improvements made here. Another early access user called
Simon Smith said this is the first time that ChatGPT had made
spreadsheets and presentations that were actually presentable.
You know, on presentations, the YouTube reviewer Skill Leap gave
a web link and asked it to create a full slideshow.
And it took 28 minutes. But the output for him was
really, really impressive. Really good layouts, information
(04:53):
pulled correctly, slides that looked really professional and
his words were that it was shockingly good compared to 5.1. Another tester threw 10,000 rows of spreadsheet data at it and told it to make a PowerPoint with that, and it created a really good set of slides. For those of you who have to do
this sort of task all the time, this sort of thing is actually
music to your ears. However, there are many
(05:14):
fantastic tools like Gamma that do much the same thing.
Now OpenAI's benchmark for this is called GDPval, measuring essentially well specified knowledge work tasks across 44 occupations. And they claim that 5.2 thinking mode beats or at least ties with human experts on 70.9%
(05:38):
of comparisons, up from 38.8%. So imagine it completing a task and having experts review it. Essentially, that's 70% of the time it's either better than or on par with humans.
Now, I think there's nuances here.
You know, well specified is doing a lot of work in that
sentence. You know, it means a model gets
handed everything it needs upfront.
(05:59):
So super clear instructions, all the relevant context, defined
success criteria. And I think real professional
work just isn't like that a lot of the time.
You know, you often have to figure out what information it
needs, you need to go find it. You need to make good judgement calls, and you need to write good prompts.
So that benchmark covers, I guess, well specified knowledge
(06:22):
work and the fact that they're giving the perfect prompt to do
that task to the language model, whereas most people are not
giving such a well thought out structured prompt to a language
model. So yeah, 70.9% doesn't mean that
GPT 5.2 can suddenly do 71% of a person's job.
It means for tasks where everything is perfectly
articulated, defined and essentially handed to a model on
(06:46):
a plate, then it will perform atexpert level most of the time.
A third big area of improvement has been code generation. As I said last week, they're not trying to win in the code arena, clearly. That being said, they've still made fantastic leaps in coding. Maybe they were working on making a better coding model, realised they'd made a better model generally and just chucked it out there, because most of
(07:09):
their marketing is focused on professional work.
On SWE-bench Pro, which is a benchmark testing software engineering across basically four different programming languages, 5.2 thinking scored 55.6%, which is really a new state-of-the-art. On another, very similar benchmark
(07:33):
called SWE-bench Verified, it hit 80%, essentially matching Claude's best model, which was 80.9%.
Another area of improvement has been vision and long context.
So its vision capabilities have improved a lot. You know, on chart understanding from scientific papers, for example, its accuracy jumped from 80% to 88%. On user interface understanding, so again that kind of agent mode
(07:53):
of reading your screen and making clicks, it jumped from 64% to 86%, and error rates have been cut in half.
On context windows, there's been a massive improvement as I mentioned. With 5.1, accuracy started to degrade as the amount of information you had given it throughout conversations, or in system prompts for those using APIs, got a lot longer: around
(08:15):
90% at about 8,000 tokens, dropping to under 50% at about
256,000 tokens. Whereas with GPT 5.2, accuracy
stays at almost 100% across the entire context window, even when
it's almost maxed out. This is one of the first models
ever to actually achieve near perfect accuracy levels on the
(08:38):
four needle challenge, which is essentially recalling 4 specific pieces of information scattered across 200,000 words.
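If you're wondering what a test like that looks like mechanically, here's a minimal sketch of a four needle recall check. The model call, the gpt-5.2 model id and the specific planted facts are all placeholders of mine; it's an illustration of the idea, not how the published benchmark is actually run.

```python
# Minimal sketch of a four-needle long-context recall test.
# Assumptions: the openai package is installed with OPENAI_API_KEY set, and the
# "gpt-5.2" model id is a guess; the needles and filler text are invented examples.
import random
from openai import OpenAI

NEEDLE_FACTS = {
    "secret fruit": "The secret fruit is dragonfruit.",
    "secret city": "The secret city is Tromso.",
    "secret number": "The secret number is 7241.",
    "secret colour": "The secret colour is vermilion.",
}

def build_haystack(target_words: int = 200_000) -> str:
    """Repeat filler text to roughly target_words words, then scatter the
    four needle sentences at random positions."""
    filler_sentence = "The committee met again and agreed to reconvene next week."
    chunks = [filler_sentence] * (target_words // len(filler_sentence.split()))
    for needle in NEEDLE_FACTS.values():
        chunks.insert(random.randrange(len(chunks)), needle)
    return " ".join(chunks)

def four_needle_recall() -> float:
    client = OpenAI()
    questions = "; ".join(f"what is the {name}?" for name in NEEDLE_FACTS)
    reply = client.chat.completions.create(
        model="gpt-5.2",  # assumption: the real API id may differ
        messages=[{"role": "user",
                   "content": build_haystack() + "\n\nAnswer briefly: " + questions}],
    ).choices[0].message.content.lower()
    # Score: fraction of planted facts that show up in the model's answer.
    keys = ["dragonfruit", "tromso", "7241", "vermilion"]
    return sum(key in reply for key in keys) / len(keys)

if __name__ == "__main__":
    print(f"Needles recalled: {four_needle_recall():.0%}")
```

The interesting part is just how far apart the planted facts can sit while the model still pulls all four back out.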
Another major improvement here has been hallucinations.
So OpenAI claims that they've reduced hallucinations by 30%, from 8.8% in 5.1 to 6.2% in 5.2. However, more independent
(08:58):
benchmark reviews have, let's say, given them more modest scores. So Vectara found that GPT 5.2
had an 8.4% hallucination rate, which trails DeepSeek at 6.3%.
So a massive improvement, but still not a leading model.
Now let's move on to things that are still not very good. Speed is still a real problem.
(09:20):
You know, Matt Schumer said the standard 5.2 thinking is just
extremely slow. You know, very, very, very slow
for most questions, even straightforward ones, which
changes how he works. So quick questions mean that he'd go to Claude Opus. Deep reasoning now goes to 5.2 Pro, which is quite interesting because it used to be, or certainly has been for me, the other way around.
(09:41):
However, for those who do a lot of writing, quality still lags behind Claude. Dan Shipper's publication, called Every, ran a load of systematic tests and found that Claude Opus 4.5 scored 80% in writing quality, whereas GPT 5.2 scored 74%, so still behind Claude. And many, many testers also
noticed big personality changes. So Ali Miller, another big
(10:04):
commentator in the AI space, said that a simple question turned into 58 bullet points and numbered points. And then many people have been comparing 5.2 to a brilliant freelancer. So yeah, as you know with me, I've said over and over again that benchmarks really are not the be all and end all.
You know, comparing models is getting much harder and
(10:26):
performance is scaling very iteratively.
So as always, it's important to go and test this thing out and
see for yourself and experience it and see if you like the
improvements. But yeah, to conclude, I think
(10:47):
OpenAI's messaging on this release has been very focused, which is unusual, because often they've been "we're the best at everything for everyone", or, you know, had that very scattergun messaging. Now, every single executive interview and piece of media has really focused on professional work for this release, and economically valuable tasks. So they're clearly not trying to claim AGI breakthroughs, but
(11:09):
they are trying to win at professional work, clearly going
for the enterprise market. And obviously, as I said, the improvements are very real. On structured outputs, it's the most capable model OpenAI has ever produced. If your tasks involve slides and spreadsheets, then 5.2 is a
serious upgrade. But I think if you zoom out, as
I mentioned, you're seeing incremental progress, not
(11:31):
massive leaps anymore. I think some people still hope
for this big flash of inspiration, you know, one model
that conquers it all. But what we're starting to see
is a pattern of that not being quite true.
So 5.2 is a much better tool, but it's not a new era.
And the fact that OpenAI is marketing better spreadsheets, I
think, tells us a lot about where we are with AI right now.
(11:53):
Anyway, that's it for this week and also as I said, for this
year. I hope you find this episode
interesting and I look forward to spending 2026 with you.
Thank you and see you next year.