July 2, 2025 4 mins

AI models pass SWE-Bench from memory
Text version: https://pivot-to-ai.com/2025/07/02/how-to-pass-an-ai-coding-benchmark-train-on-the-questions/

Patreon: https://www.patreon.com/davidgerard
Ko-Fi: https://ko-fi.com/A1529D5
Buy me nice things: https://www.amazon.co.uk/hz/wishlist/ls/3Q8VZW46J6DM6
Get an extremely cool Pivot to AI shirt or mug: https://pivot-to-ai.redbubble.com

Source:

The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason https://arxiv.org/abs/2506.12286

Previously on Pivot to AI:

OpenAI o3 beats FrontierMath — because OpenAI funded the test and had access to the questions https://pivot-to-ai.com/2025/01/20/openai-o3-beats-frontiermath-because-openai-funded-the-test-and-had-access-to-questions/
AI benchmarks are self-promoting trash — but regulators keep using them https://pivot-to-ai.com/2025/02/25/ai-benchmarks-are-self-promoting-trash-but-regulators-keep-using-them/
Apple: ‘Reasoning’ AIs fail hard if they actually have to think https://pivot-to-ai.com/2025/06/08/apple-reasoning-ais-fail-hard-if-they-actually-have-to-think/
Video: https://www.youtube.com/watch?v=gSx9pI5so30&list=UU9rJrMVgcXTfa8xuMnbhAEA

Full Pivot to AI playlist: https://www.youtube.com/playlist?list=UU9rJrMVgcXTfa8xuMnbhAEA

 
