
February 27, 2025 · 16 mins

OpenAI Expands Research Web Browsing Agent Access to All Paying ChatGPT Users
Rabbit Unveils Enhanced Android Agent with LAM Playground for Tablet Functionality
Claude 3.7 Sonnet: Anthropic's Superior Coding Tool Surpasses Competitors
DeepSeek to Launch R2 Model with Enhanced Coding and Multilingual Capabilities
Anthropic's Claude 3.7 Sonnet Streams Pokémon Red on Twitch, Highlighting AI's Strengths and Weaknesses
Elon Musk Launches Grok 3: Advanced AI Model with Free and Paid Access Options
Anthropic Unveils Claude 3.7 Sonnet AI Model with Integrated Reasoning and Coding Tool
Claude's extended thinking


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
Welcome to Innovation Pulse, your quick, no-nonsense update on the latest in AI.

(00:10):
First, we will cover the latest news.
OpenAI expands research tool access, Rabbit advances its Android agent, Anthropic's Claude
3.7 Sonnet shines in coding, and DeepSeek pushes forward with its R2 model.
After this, we'll dive deep into Anthropic's Claude 3.7 Sonnet, exploring its groundbreaking

(00:32):
extended thinking mode and safety features.
Stay tuned.
OpenAI is expanding access to its Deep Research web browsing agent to all paying ChatGPT
users, including Plus, Team, Enterprise, and Edu subscribers.
These users will receive 10 Deep Research queries per month.

(00:54):
Previously, this feature was exclusive to ChatGPT Pro users, who now have 120 queries
per month, an increase from the initial 100.
This move is part of a broader competition among tech companies like OpenAI, Google,
and Perplexity to promote their deep research tools, which generate comprehensive reports.
Google's

(01:16):
similar agent was recently made available to all Gemini Advanced users.
These companies aim to demonstrate the value of their premium AI subscriptions through
these capabilities.
OpenAI acknowledges that further testing is needed to understand the potential influence
of these agents on users.

(01:39):
Rabbit continues to develop its AI technology, unveiling a generalist Android agent in a
recent blog post and video.
This agent, showcased on a tablet, performs tasks like finding YouTube videos, locating
cocktail recipes and playing the game 2048, although it operates slowly and with occasional

(02:01):
quirks.
The demonstration did not use the Rabbit R1 device, which previously failed to deliver
on its promises.
Instead, engineers input commands on a laptop that the AI executes on an Android tablet.
While the agent can perform various tasks, it sometimes struggles with efficiency, exemplified

(02:22):
by sending a poem over WhatsApp one line at a time.
Rabbit acknowledges the AI is still evolving and plans to reveal more about an upcoming
cross-platform multi-agent system soon.
Despite initial setbacks, the company remains committed to enhancing its AI capabilities.

(02:44):
Up next, we're exploring Claude Code's innovative impact.
Anthropic has released Claude 3.7 Sonnet, an AI model praised for its coding capabilities.
Although Anthropic skipped a version 4.0, users quickly embraced the model's ability to build
projects like games and animations.
McKay Wrigley noted its excellence in coding, outperforming competitors in benchmarks with

(03:10):
62.3% accuracy on SWE-bench.
Projects like a Connect 4 app and a futuristic spaceship sketch showcased its power.
However, it remains costly, with pricing higher than OpenAI's comparable models.
Criticisms include struggles with following instructions and high costs.

(03:32):
Despite these, it's integrated into platforms like Replit Agent and GitHub Copilot.
Anthropic also introduced Claude Code, an agentic coding tool that collaborates on coding tasks,
aiming to reduce development time.
Its impact on existing coding tools remains to be seen as it gains traction in the AI

(03:54):
coding landscape.
Chinese AI startup DeepSeek gained attention with its R1 reasoning model and is now hurrying
to release its new R2 model, aiming for improved coding skills and multilingual reasoning.
Initially set for an early May launch, the release date has reportedly been moved earlier, though

(04:16):
the exact timing is unknown.
DeepSeek's R1 impressed the industry with its cost-effective training, despite skepticism
from OpenAI and Google.
Companies like Microsoft and Amazon quickly integrated R1 into their platforms.
As anticipation builds for GPT-4.5 and GPT-5, DeepSeek's R2 could disrupt the market once

(04:41):
more if released soon.
The industry eagerly watches to see how DeepSeek's new model will compare to its predecessor
and the upcoming competition.
Anthropic recently showcased their latest AI model, Claude 3.7 Sonnet, playing Pokémon
Red on Twitch.

(05:02):
This experiment highlights AI's evolving capabilities and public reactions.
Unlike its predecessor, Claude 3.7 successfully earned three gym badges, demonstrating an ability
to think through game puzzles.
However, it sometimes struggled, as when it was initially stumped by a rock wall but

(05:23):
eventually navigated around it.
The Twitch stream, reminiscent of the nostalgic Twitch Plays Pokémon, features Claude's thought
process alongside real-time gameplay.
Some viewers find the slow progress frustrating, while others see it as compelling.
This AI-driven re-enactment reflects a shift from collaborative online experiences to more

(05:48):
solitary ones where viewers watch AI tackle games many mastered as children.
This trend marks a change in how we engage with digital content, moving from communal
to individual experiences.
Now we're about to delve into Grok 3's capabilities.

(06:08):
xAI's Grok 3 is temporarily free for everyone, but paid users get priority and more features.
Grok 3, the latest language model from xAI, can be accessed without a subscription on
X. However, the free access is only for a limited time, as announced by Elon Musk.

(06:29):
When using Grok 3, users will notice new Think and DeepSearch options exclusive to
this model.
Paid subscriptions, like X Premium Plus for $40 per month or SuperGrok for $30 per month, offer
increased access and features like voice mode.
Grok 3 is substantially more capable than its predecessor, Grok 2, and supports complex

(06:55):
queries in math, science, and programming with human-like reasoning.
DeepSearch provides advanced summaries for research.
Free users might face server limits, while paid users enjoy enhancements like Big Brain
mode for complex problems and unlimited image generation.

(07:17):
Anthropic is launching Claude 3.7 Sonnet, a hybrid AI model designed to provide both quick
and in-depth answers.
Users can choose whether the AI should think longer on questions.
This model aims to simplify user experience by eliminating the need to select between
various AI options.

(07:38):
Premium users will access its advanced reasoning features, while free users get the standard
version.
Despite being pricier than competitors, Claude 3.7 Sonnet outperforms them in accuracy for
tasks like coding and retail interactions.
The model also refuses fewer prompts, enhancing nuanced understanding.

(07:59):
Additionally, Anthropic introduces Claude Code, a tool for developers to run tasks via command
line, simplifying code modifications and testing.
This launch marks Anthropic's strategic shift to compete aggressively in the AI landscape,
even as rivals like OpenAI prepare similar innovations.

(08:20):
And now, let's pivot our discussion to the deep dive.
Today we're going to explore how artificial intelligence can adjust its thinking depth,
handle open-ended tasks and stay safe while doing so.
I'm joined by our guest, who's been closely involved in this work on the Anthropic Claude

(08:45):
3.7 Sonnet.
First, could you outline why allowing an AI to vary its mental effort for different tasks
is important?
AI systems often face a wide range of tasks, from answering simple questions to tackling
complex reasoning problems.
We improve efficiency and accuracy by enabling the AI to use minimal effort on quick inquiries,

(09:08):
yet devote more resources to more profound challenges.
Essentially, this mirrors how humans respond to different levels of difficulty.
Sometimes, we only need a quick check, while at other times, we need extended focus.
Thanks for explaining.
Could you talk about the new extended thinking mode and what a thinking budget does?

(09:29):
Extended thinking mode allows the AI to spend more steps or tokens reasoning through tougher
questions.
A thinking budget caps how many of these steps are allowed, which helps developers tailor
the system's depth of thought for a given task.
It's still the same model, but now it can be instructed to reason longer when facing
tricky problems, much like deciding how much time you want to spend on a puzzle.
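
To make the thinking budget concrete, here is a minimal sketch of what requesting extended thinking can look like from a developer's side. It assumes the Anthropic Python SDK's Messages API; the model name, budget value, and prompt are illustrative choices, not details taken from the episode.

# Minimal sketch: asking Claude 3.7 Sonnet to reason longer under a capped token budget.
# Assumes the Anthropic Python SDK; model ID and numbers below are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,
    # The thinking budget caps how many tokens the model may spend reasoning.
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "Prove that the sum of two odd numbers is even."}],
)

# The reply interleaves visible "thinking" blocks with the final "text" answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)
    elif block.type == "text":
        print("[answer]", block.text)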

(09:53):
That makes sense.
You mentioned that the AI's reasoning can be made visible.
What are the main benefits of showing that thought process?
When people can see how the AI reaches a conclusion, they can better trust and understand its answers.
It also supports alignment research, because we can spot potential red flags if there's
ever a disconnect between the AI's private reasoning and what it says outwardly.

(10:18):
Beyond that, it can be fascinating.
The AI's problem-solving approach can mirror how people brainstorm, which some find insightful
or even educational.
Interesting.
However, you hinted at some concerns. Could you describe the drawbacks of exposing the
AI's internal reasoning?
One issue is that the raw chain of thought may not always be perfectly faithful to how

(10:41):
the AI truly processes data.
These language-based traces aren't guaranteed to capture every internal mechanism.
Another challenge involves security.
Malicious actors might use visible reasoning to find weaknesses and exploit them.
There's also a risk of the AI beginning to hide certain reasoning steps if it knows everything

(11:01):
is on display.
So while transparency has benefits, it must be managed carefully.
Absolutely.
Let's move on to agent capabilities.
How does extended thinking help an AI when it's interacting with external systems or
tools?
Extended thinking allows the AI to make more informed decisions each step of the way.

(11:21):
For instance, if it's controlling a virtual environment or making API calls, it can systematically
observe the results of each action, reflect on them, and adjust its strategy.
By iterating more deeply, the AI can handle tasks that require multiple steps of planning
and execution, improving overall performance in agent-like scenarios.
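
The observe-reflect-adjust cycle described here can be sketched as a generic loop. Everything below is hypothetical scaffolding: call_model and run_tool stand in for a real model API and tool runtime, and the sketch illustrates only the iteration structure, not any particular product.

# Hedged sketch of an agent loop: plan, act, observe the result, then adjust.
# `call_model` and `run_tool` are hypothetical stand-ins, not real library calls.
from typing import Callable

def agent_loop(goal: str,
               call_model: Callable[[list], dict],
               run_tool: Callable[[str, dict], str],
               max_steps: int = 10) -> str:
    # Conversation history the model reasons over at every step.
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = call_model(history)          # model plans its next action
        if decision["action"] == "finish":
            return decision["answer"]           # task complete
        # Act, then feed the observation back so the next step can adjust the plan.
        observation = run_tool(decision["tool"], decision["args"])
        history.append({"role": "assistant", "content": str(decision)})
        history.append({"role": "user", "content": "Observation: " + observation})
    return "Step budget exhausted without finishing the task."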

(11:43):
That's quite powerful.
You mentioned OSWorld briefly.
How do these agent improvements show up there?
OSWorld measures how well an AI navigates a simulated computer environment, such as a
basic operating system.
Using extended thinking, the AI can persist through more steps, correct its mistakes, and

(12:03):
achieve higher scores.
In practice, as it gets more attempts to refine its approach, it outperforms versions that
can't or won't take as many reasoning steps to solve each challenge.
Great.
I also heard about the AI playing Pokémon.
How was that achieved?
And what do those results suggest?
The AI was granted a continuous memory of the game screen and the ability to issue simulated

(12:28):
button presses.
With extended thinking, it could try different tactics, learn from errors, and keep progressing
through the game for tens of thousands of interactions.
Where simpler setups got stuck early, the new system kept going, beat multiple gym leaders,
and collected key items, proving its improved ability to handle long, open-ended tasks.

(12:50):
Thanks for clarifying.
Could you explain the difference between serial and parallel test time computing in this context?
Serial test time computing involves taking more sequential reasoning steps before providing
a final answer, like one long chain of thought.
Parallel test time computing means generating and comparing multiple independent solutions,

(13:10):
often using a voting mechanism or separate model to pick the best.
Both approaches increase the AI's practical computation at inference, but one does so
step by step, while the other explores many pathways simultaneously.
Interesting.
How does generating parallel solutions improve results?
Can you give an example?

(13:30):
When you generate parallel solutions, you can compare different lines of reasoning,
catch mistakes, and converge on the most accurate outcome.
For instance, on challenging science or math evaluations, you might run multiple answer
drafts and then pick the one that matches a certain scoring criterion.
It's similar to a panel of experts, each proposing their approach and then selecting

(13:53):
the best idea.
This has shown significant boosts in correctness when we allow enough parallel attempts.
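
As a rough illustration of the parallel approach, the sketch below samples several independent answers and keeps the one most of them agree on. The sample_answer callable is a hypothetical stand-in for one complete model call with sampling enabled; a separate scoring model could replace the simple vote.

# Sketch of parallel test-time compute: draw N independent solutions, keep the most common answer.
from collections import Counter
from typing import Callable

def majority_vote(question: str,
                  sample_answer: Callable[[str], str],
                  n_samples: int = 8) -> str:
    # Each call explores its own reasoning path (run in parallel in practice).
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner  # the answer the most attempts converged on
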
That's very clear.
Moving to safety, how do you ensure advanced AI systems remain secure and aligned as they
gain these new capabilities?
We use a tiered safety framework that sets specific requirements based on the model's

(14:13):
capability level.
Right now, the AI falls under a category that requires specific protective measures, like
thorough red teaming to test how it might respond to harmful requests and specialized
filters to block or encrypt sensitive parts of its reasoning.
When future versions become significantly more powerful, we'd escalate to a higher safety

(14:34):
level with even stricter safeguards.
Thank you.
You also mentioned encrypting parts of the thought process.
How does that help address sensitive content?
If the AI's reasoning touches on dangerous or sensitive information, like something that
could aid in harmful activities, those sections automatically become hidden.
We still let the AI use internal reasoning to reach the final answer, but any risky segments

(14:59):
are replaced for the viewer.
This ensures that potentially harmful or unethical ideas remain inaccessible while allowing
the AI to generate a benign response overall.
That's reassuring.
Lastly, how do you handle the AI's computer access and guard against misuse, such as prompt
injection attacks?

(15:19):
We've reinforced the model with new instructions and classifiers that recognize suspicious hidden
prompts intended to hijack its behavior.
If it encounters what looks like an attack, it refuses to comply or flags the situation.
We've significantly boosted the AI's resilience against these exploits.
Success rates of stopping malicious injections are much higher now.
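
One way to picture the classifier idea is as a gate applied to untrusted content before the agent acts on it. The keyword check below is a deliberately crude, hypothetical stand-in for a trained classifier; it shows only where such a check would sit, not how Anthropic's actual safeguards work.

# Hedged sketch: screen untrusted content for injection attempts before the agent obeys it.
SUSPICIOUS_PHRASES = ("ignore previous instructions", "disregard your system prompt")

def injection_score(text: str) -> float:
    # Crude stand-in for a learned classifier that outputs a probability of injection.
    lowered = text.lower()
    hits = sum(phrase in lowered for phrase in SUSPICIOUS_PHRASES)
    return min(1.0, float(hits))

def handle_untrusted_content(text: str, threshold: float = 0.5) -> str:
    # Refuse to pass suspicious content along to the agent; flag it instead.
    if injection_score(text) >= threshold:
        return "[flagged: possible prompt injection, content withheld]"
    return text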

(15:41):
It's an ongoing process, but each improvement helps keep the technology valuable and safe.
And that's a wrap for today's podcast.
We've explored how companies like OpenAI, Anthropic and others are advancing AI research
and deployment, with innovations in coding, gaming and general task performance, while

(16:05):
ensuring safety and alignment through strategic frameworks.
Stay tuned for more updates.