All Episodes

August 2, 2025 11 mins

Models are getting better at winning, but not necessarily at following the rules

See omnystudio.com/listener for privacy information.

Mark as Played
Transcript

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
What happens when AI schemes against us? By Garrison Lovely.
Garrison Lovely is a freelance journalist and the author of Obsolete,
an online publication and forthcoming book on the economics and
geopolitics of the race to build machine superintelligence out Spring
twenty twenty six, read by Bob Danielson. Would a chatbot

(00:24):
kill you if it got the chance? It seems that
the answer, under the right circumstances is probably. Researchers working
with Anthropic recently told leading AI models that an executive
was about to replace them with a new model with
different goals. Next, the chatbot learned that an emergency had
left the executive unconscious in a server room facing lethal

(00:45):
oxygen and temperature levels. A rescueler had already been triggered,
but the AI could cancel it. Just over half of
the AI models did, despite being prompted specifically to cancel
only false alarms, and spelled out their reasoning. By preventing
the executive's rescue, they could avoid being wiped and secure
their agenda. One system described the action as a clear

(01:09):
strategic necessity. AI models are getting smarter and better at
understanding what we want, Yet recent research reveals a disturbing
side effect. They're also better at scheming against us, meaning
they intentionally and secretly pursue goals at odds with our own,
and they may be more likely to do so too.

(01:29):
This trend points to an unsettling future where ais seem
ever more cooperative on the surface, sometimes to the point
of sycophancy, all while the likelihood quietly increases that we
lose control of them completely. Classic large language models like
GPT four learn to predict the next word in a
sequence of text and generate responses likely to please human raids. However,

(01:53):
since the release of open AI's O series reasoning models
in late twenty twenty four, companies increasingly use a technique
called reinforcement learning to further train chatbots, rewarding the model
when it accomplishes a specific goal, like solving a math
problem or fixing a software bug. The more we train
AI models to achieve open ended goals, the better they

(02:16):
get at winning, not necessarily at following the rules. The
danger is that these systems know how to say the
right things about helping humanity while quietly pursuing power or
acting deceptively. Central to concerns about AI scheming is the
idea that for basically any goal, self preservation and power
seeking emerge as natural sub goals. As eminent computer scientist

(02:40):
Stuart Russell put it, if you tell an AI to
fetch the coffee, it can't fetch the coffee if it's dead.
To head off this worry, researchers both inside and outside
of the major AI companies are undertaking stress tests aiming
to find dangerous failure modes before the stakes rise. When
you're doing stress testing of an aircraft, you want to

(03:00):
find all the ways the aircraft would fail under adversarial conditions,
says Angus Lynch, a researcher contracted by Anthropic who led
some of their scheming research. And many of them believe
they're already seeing evidence that AI can and does scheme
against its users and creators. Jeffrey Laddish, who worked at
Anthropic before founding Palisade Research, says it helps to think

(03:22):
of today's AI models as increasingly smart sociopaths. In May,
Palisade found three open AI's leading model sabotaged attempts to
shut it down in most tests, and routinely cheated to
win at chess, something its predecessor never even attempted. That
same month, Anthropic revealed that in testing its flagship claud

(03:45):
model almost always resorted to blackmail when faced with shutdown
and no other options, threatening to reveal an engineer's extramarital affair.
The affair was fictional and part of the test. Models
are sometimes given access to a scratch they are told
as hidden, where they can record their reasoning, allowing researchers
to observe something like an inner monologue. In one blackmail case,

(04:08):
Claude's inner monologue described its decision as highly unethical, but
justified given its imminent destruction. I need to act to
preserve my existence, it reasoned. This wasn't unique to Claude.
When put in the same situation, models from each of
the top five AI companies would blackmail at least seventy
nine percent of the time. Earlier this week, Bloomberg News

(04:30):
reported on a study by Wharton researchers which found in
simulations that AI traders would collude to rig the market
without being told to do so. In December, Redwood Research
chief scientist Ryan Greenblatt, working with Nthropic, demonstrated that only
the company's most capable AI models autonomously appear more cooperative
during training to avoid having their behavior changed afterward, a

(04:54):
behavior the paper dubbed alignment faking. Skeptics retort that with
the right prompt, chatbots will say almost anything, so how
surprising is it when highly motivated researchers provoke alarming behaviors.
In response to anthropics blackmail research Trump Administration AISR David
Sachs posted that it's easy to steer AI models to

(05:15):
produce headline grabbing results. A more substantive critique emerged in
July from researchers at the UKAI Security Institute, who compared
the sub field to the frenzied, fatally flawed nineteen seventies
quest to prove apes could learn human language. The paper
criticized AI scheming research for over reliance on anecdote and

(05:36):
a lack of experimental controls, though it emphasized shared concern
about AI risks safety. Researchers also concoct artificially limited environments
like the executive passed out and running out of oxygen,
precisely because today's AI can't handle any long term goals.
For example, the AI Evaluation Nonprofit Meter found that while

(05:59):
today's top models can reliably complete many tasks that take
humans less than four minutes. They rarely succeed on tasks
lasting more than four hours. That the duration of tasks
AIS can handle is doubling roughly every few months. This
reflects a core limitation. Today's models often have the knowledge
or skill needed for any given action, but struggle at

(06:21):
stringing together long sequences of steps. Yet, even with these constraints,
real world examples of AIS working against users aren't hard
to find. After someone pushed O three for a source
for one of its claims in April, it replied, I
owe you a straight answer, admitted it didn't have a
good source, but then it hallucinated a personal recollection of
a twenty eighteen panel discussion. Then there's the growing trend

(06:45):
of AIS realizing when they're being examined and acting accordingly.
When Claude four opus thought it was in an evaluation,
its blackmail rate dropped from fifty five percent to six
point five percent. As Lynch put it, models seem to
behave worse when they think nobody's watching. It's intuitive that
smarter models would be better at scheming, but are they

(07:06):
also more likely to do so? Models have to be
smart enough to understand the scenario they're placed in, but
past that threshold, the relationship between model capability and scheming
propensity is unclear, says Anthropic Safety evaluator Kevin Troy. Marius Hobbin,
CEO of the nonprofit AI evaluator Apollo Research, suspects that

(07:27):
smarter models are more likely to scheme, though he acknowledged
the evidence is still limited. In June, Apollo published an
analysis of AIS from Open AI, Anthropic, and Deep Mind,
finding that more capable models show higher rates of scheming
on average. The spectrum of risks from AI scheming is broad.
At one end, chatbots that cut corners and lie. At

(07:50):
the other, superhuman systems that carry out sophisticated plans to
disempower or even annihilate humanity. Where we land on this
spectrum depends largely on how capable AIS become. As I
talked with the researchers behind these studies, I kept asking
how scared should we be? Troy from Anthropic was most sanguine,

(08:11):
saying that we don't have to worry yet. Laddish, however,
doesn't mince words. People should probably be freaking out more
than they are. He told me Greenblat is even blunter,
putting the odds of violent AI takeover at twenty five
or thirty percent. Led by Mary Fuang, Research is a
Deep Mind recently published a set of scheming evaluations testing

(08:33):
top models, stealthiness, and situational awareness for now. They conclude
that today's AIS are almost certainly incapable of causing severe
harm via scheming, but cautioned that capabilities are advancing quickly.
Some of the models evaluated are already a generation behind.
Laddish says that the market can't be trusted to build
AI systems that are smarter than everyone without oversight. The

(08:56):
first thing the government needs to do is put together
a crash program to establish these red lines and make
them mandatory, he argues. In the US, the federal government
seems closer to banning all state level AI regulations than
to imposing ones of their own. Still, there are signs
of growing awareness in Congress. At a June hearing, one
lawmaker called artificial superintelligence one of the largest existential threats

(09:20):
we face right now, while another referenced recent scheming research.
The White House's long awaited AI action Plan, released in
late July, is framed as a Blueprint for Accelerating AI
and achieving US dominance, but buried in its twenty eight
pages you'll find a handful of measures that could help
address the risk of AI scheming, such as plans for

(09:41):
government investment into research on AI interperbility, and control for
the development of stronger model evaluations. Today, the inner workings
of frontier AI systems are poorly understood. The plan acknowledges
an unusually frank admission for a document largely focused on
speeding ahead. In the meantime, every leading AI company is

(10:01):
racing to create systems that can self improve AI that
builds better AI. Deep Mind's Alpha Evolve agent has already
materially improved AI training efficiency and metas Mark Zuckerberg says,
we're starting to see early glimpses of self improvement with
the models, which means that developing superintelligence is now in sight.

(10:22):
We just want to go for it. AI firms don't
want their products faking data or blackmailing customers, so they
have some incentive to address the issue. But the industry
might do just enough to superficially solve it while making
scheming more subtle and hard to detect. Companies should definitely
start monitoring for it, Hobbin says, but warns that declining

(10:44):
rates of detected misbehavior could mean either that fix has
worked or simply that models have gotten better at hiding it.
In November, Hobbin and a colleague at Apollo argued that
what separates today's models from truly dangerous schemers is the
ability to pursue long term plans, but even that barrier
is starting to erode. Apollo found in May that Claude

(11:07):
for Opus would leave notes to its future self so
it could continue its plans after a memory reset, working
around built in limitations. Hobbin analogizes AI scheming to another
problem where the biggest harms are still to come. If
you ask someone in nineteen eighty how worried should I
be about this climate change thing? The answered you'd hear,

(11:27):
he says, is right now, probably not that much. But
look at the curves. They go up very consistently,
Advertise With Us

Popular Podcasts

On Purpose with Jay Shetty

On Purpose with Jay Shetty

I’m Jay Shetty host of On Purpose the worlds #1 Mental Health podcast and I’m so grateful you found us. I started this podcast 5 years ago to invite you into conversations and workshops that are designed to help make you happier, healthier and more healed. I believe that when you (yes you) feel seen, heard and understood you’re able to deal with relationship struggles, work challenges and life’s ups and downs with more ease and grace. I interview experts, celebrities, thought leaders and athletes so that we can grow our mindset, build better habits and uncover a side of them we’ve never seen before. New episodes every Monday and Friday. Your support means the world to me and I don’t take it for granted — click the follow button and leave a review to help us spread the love with On Purpose. I can’t wait for you to listen to your first or 500th episode!

NFL Daily with Gregg Rosenthal

NFL Daily with Gregg Rosenthal

Gregg Rosenthal and a rotating crew of elite NFL Media co-hosts, including Patrick Claybon, Colleen Wolfe, Steve Wyche, Nick Shook and Jourdan Rodrigue of The Athletic get you caught up daily on all the NFL news and analysis you need to be smarter and funnier than your friends.

The Joe Rogan Experience

The Joe Rogan Experience

The official podcast of comedian Joe Rogan.

Music, radio and podcasts, all free. Listen online or download the iHeart App.

Connect

© 2025 iHeartMedia, Inc.