
June 17, 2025 17 mins
Current “unlearning” methods only suppress capabilities instead of truly removing them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing. Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream.
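For readers skimming the notes, here is a minimal sketch of the unlearn-and-distill recipe described above. It is illustrative only, not the authors' code: the function names, the `retain_dataloader`, and the assumption that each model maps a batch of token ids directly to logits are ours.

```python
# Minimal sketch of Unlearn-and-Distill (hypothetical names and shapes).
# Step 1: start from an "unlearned" teacher whose outputs no longer express
#         the bad behavior.
# Step 2: distill it into a freshly initialized student, so the suppressed
#         capability is never written into the new weights at all.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=1.0):
    """One distillation step: match the student's token distribution
    to the (unlearned) teacher's on ordinary retain-set data."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(batch)          # [batch, seq, vocab]
    student_logits = student(batch)

    # KL(teacher || student), summed over sequence and vocabulary,
    # averaged over the batch.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Assumed setup: `unlearned_teacher` has already been treated with any
# unlearning method; `fresh_student` is the same architecture with
# randomly initialized weights.
# for batch in retain_dataloader:
#     distill_step(fresh_student, unlearned_teacher, batch, optimizer)
```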

Read our paper on arXiv and enjoy an interactive demo.

Robust unlearning probably reduces AI risk

Maybe some future AI has long-term goals and humanity is in its way. Maybe future open-weight AIs have tons of bioterror expertise. If a system has dangerous knowledge, that system becomes [...]

---

Outline:

(01:01) Robust unlearning probably reduces AI risk

(02:42) Perfect data filtering is the current unlearning gold standard

(03:24) Oracle matching does not guarantee robust unlearning

(05:05) Distillation robustifies unlearning

(07:46) Trading unlearning robustness for compute

(09:49) UNDO is better than other unlearning methods

(11:19) Where this leaves us

(11:22) Limitations

(12:12) Insights and speculation

(15:00) Future directions

(15:35) Conclusion

(16:07) Acknowledgments

(16:50) Citation

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
June 13th, 2025

Source:
https://www.lesswrong.com/posts/anX4QrNjhJqGFvrBr/distillation-robustifies-unlearning

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.
Matching oracle behavior doesn’t guarantee robust unlearning. Graph (a) shows the loss during distillation of the Student (Reference) and the Student (Random). Graphs (b) and (c) show forget performance through retraining for the Language and Arithmetic settings, respectively.