AI models pass SWE-Bench from memory
Text version: https://pivot-to-ai.com/2025/07/02/how-to-pass-an-ai-coding-benchmark-train-on-the-questions/
Patreon: https://www.patreon.com/davidgerard
Ko-Fi: https://ko-fi.com/A1529D5
Buy me nice things: https://www.amazon.co.uk/hz/wishlist/ls/3Q8VZW46J6DM6
Get an extremely cool Pivot to AI shirt or mug: https://pivot-to-ai.redbubble.com
Source:
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason https://arxiv.org/abs/2506.12286
Previously on Pivot to AI:
OpenAI o3 beats FrontierMath — because OpenAI funded the test and had access to the questions https://pivot-to-ai.com/2025/01/20/openai-o3-beats-frontiermath-because-openai-funded-the-test-and-had-access-to-questions/
AI benchmarks are self-promoting trash — but regulators keep using them https://pivot-to-ai.com/2025/02/25/ai-benchmarks-are-self-promoting-trash-but-regulators-keep-using-them/
Apple: ‘Reasoning’ AIs fail hard if they actually have to think https://pivot-to-ai.com/2025/06/08/apple-reasoning-ais-fail-hard-if-they-actually-have-to-think/
video: https://www.youtube.com/watch?v=gSx9pI5so30&list=UU9rJrMVgcXTfa8xuMnbhAEA
Full Pivot to AI playlist: https://www.youtube.com/playlist?list=UU9rJrMVgcXTfa8xuMnbhAEA
United States of Kennedy
United States of Kennedy is a podcast about our cultural fascination with the Kennedy dynasty. Every week, hosts Lyra Smith and George Civeris go into one aspect of the Kennedy story.
Stuff You Should Know
If you've ever wanted to know about champagne, satanism, the Stonewall Uprising, chaos theory, LSD, El Nino, true crime and Rosa Parks, then look no further. Josh and Chuck have you covered.
Dateline NBC
Current and classic episodes, featuring compelling true-crime mysteries, powerful documentaries and in-depth investigations. Follow now to get the latest episodes of Dateline NBC completely free, or subscribe to Dateline Premium for ad-free listening and exclusive bonus content: DatelinePremium.com