This overview draws on detailed excerpts from two related Anthropic papers, published March 27, 2025, on the **interpretability of large language models**, focusing specifically on Claude 3.5 Haiku. The core objective is to reverse engineer the **internal computational mechanisms**, or "circuits," that drive the model's behavior, in an approach analogous to biology or neuroscience. The research introduces a **circuit tracing methodology** that uses attribution graphs and feature analysis to examine how the model handles a range of tasks, including **multi-step reasoning**, **planning in poems**, **multilingual translation**, and **arithmetic**. The findings reveal sophisticated strategies such as internal planning, as well as "default" refusal circuits that must be **inhibited by "known answer" features** before the model will respond to a question. These results shed light on the mechanisms behind **hallucinations and jailbreaks**.
Sources:
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
https://www.anthropic.com/research/tracing-thoughts-language-model