Apple: The Illusion of Thinking – Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity - ibl.ai

June 17, 2025 • 14 mins

Summary of https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

The paper explores the capabilities and limitations of Large Reasoning Models (LRMs), which generate detailed thinking processes before answering, compared to standard Large Language Models (LLMs). The authors use controllable puzzle environments such as Tower of Hanoi and River Crossing to systematically evaluate performance as complexity increases.

Findings indicate that LRMs outperform LLMs on medium-complexity tasks but both struggle and eventually fail at high complexities. Surprisingly, LRMs show a decrease in reasoning effort (measured by tokens) as problems become extremely difficult, and they exhibit limitations in executing precise algorithmic steps.

Current Large Reasoning Models (LRMs) face a complete accuracy collapse beyond certain complexity levels when evaluated using controllable puzzle environments. The study found three distinct performance regimes based on problem complexity: standard LLMs perform better at low complexity, LRMs show an advantage at medium complexity, and both types of models fail at high complexity.
LRMs exhibit a counter-intuitive scaling limit in their reasoning effort (measured by inference thinking tokens) relative to problem complexity. While reasoning effort initially increases with complexity, it declines as problems approach the complexity threshold where accuracy collapses, even when ample token budget is available.
Analysis of the intermediate reasoning traces ("thoughts") reveals complexity-dependent reasoning patterns. For simple problems, LRMs often find correct solutions early but continue exploring incorrect alternatives, a phenomenon termed "overthinking". At moderate complexity, correct solutions tend to emerge later in the thinking process, after exploring incorrect paths. Beyond a certain high complexity threshold, models fail to generate any correct solutions within their thought process.
The research questions the reliance on established mathematical and coding benchmarks for evaluating LRMs, noting issues like data contamination and lack of insight into reasoning traces. Controllable puzzle environments were adopted to allow for systematic variation of complexity while maintaining consistent logical structures and enabling detailed analysis of solutions and internal reasoning.
Surprising limitations were uncovered in LRMs' ability to perform exact computation and follow explicit algorithms. For instance, providing the solution algorithm for the Tower of Hanoi puzzle did not improve performance or prevent the accuracy collapse. Models also demonstrated inconsistent reasoning, succeeding on some puzzles with higher move counts (like Tower of Hanoi with N=5 requiring 31 moves) but failing much earlier in others with lower required move counts (like River Crossing with N=3 having an 11-move solution).
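
To make the move counts above concrete, here is a minimal sketch of the textbook recursive Tower of Hanoi solution together with a simple move checker. It is not the paper's evaluation code; the function names `hanoi_moves` and `is_valid_solution` and the peg numbering (0 to 2) are assumptions made for illustration. It shows why N disks need 2^N - 1 moves (31 for N=5) and how a puzzle environment could verify a proposed move sequence step by step.

```python
from typing import List, Tuple

# Hypothetical representation: a move is (source peg, target peg), pegs are 0..2.
Move = Tuple[int, int]


def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> List[Move]:
    """Return the standard recursive solution for n disks as a list of moves."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk to dst
        + hanoi_moves(n - 1, aux, src, dst)  # bring the n-1 disks onto dst
    )


def is_valid_solution(n: int, moves: List[Move]) -> bool:
    """Replay moves from the start state and check rules and the final state."""
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk at the bottom of peg 0
    for src, dst in moves:
        if not pegs[src]:
            return False  # nothing to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


if __name__ == "__main__":
    solution = hanoi_moves(5)
    print(len(solution))                   # 31, i.e. 2**5 - 1
    print(is_valid_solution(5, solution))  # True
```

A checker of this kind is what lets a puzzle environment score every intermediate move rather than only the final answer, which is the property the summary attributes to controllable puzzles.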
