
October 26, 2025 14 mins

This episode covers Anthropic's March 27, 2025 interpretability research, giving an overview of, and detailed excerpts from, two related Anthropic papers on the **interpretability of large language models**, with a focus on Claude 3.5 Haiku. The core objective is to reverse engineer the **internal computational mechanisms**, or "circuits," that drive the model's behavior, an approach analogous to the study of biology or neuroscience. The research introduces a **circuit tracing methodology** that uses attribution graphs and feature analysis to examine how the model handles tasks including **multi-step reasoning**, **planning in poems**, **multilingual translation**, and **arithmetic**. The findings reveal sophisticated strategies such as internal planning, along with "default" refusal circuits that must be **inhibited by "known answer" features** before the model will answer a question, which helps explain the mechanisms behind **hallucinations and jailbreaks**.


Sources:

https://transformer-circuits.pub/2025/attribution-graphs/biology.html

https://www.anthropic.com/research/tracing-thoughts-language-model
