April 4, 2025 · 136 mins

Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we’re unlikely to know we’ve solved the problem before the arrival of human-level and superhuman systems in as little as three years.

So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.

Today’s guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they’re more plausible than you might think. He argues that given companies’ unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.

Links to learn more, highlights, video, and full transcript.

As Buck puts it: "Five years ago I thought of misalignment risk from AIs as a really hard problem that you’d need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you’d probably be able to not have very much of your problem."

Of course, even if Buck is right, we still need to do those 40 things — which he points out we’re not on track for. And AI control agendas have their limitations: they aren’t likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.

Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:

  • Why he’s more worried about AI hacking its own data centre than escaping
  • What to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment research
  • Why he might want to use a model he thought could be conspiring against him
  • Why he would feel safer if he caught an AI attempting to escape
  • Why many control techniques would be relatively inexpensive
  • How to use an untrusted model to monitor another untrusted model
  • What the minimum viable intervention in a “lazy” AI company might look like
  • How even small teams of safety-focused staff within AI labs could matter
  • The moral considerations around controlling potentially conscious AI systems, and whether doing so is justified

Chapters:

  • Cold open (00:00:00)
  • Who’s Buck Shlegeris? (00:01:27)
  • What’s AI control? (00:01:51)
  • Why is AI control hot now? (00:05:39)
  • Detecting human vs AI spies (00:10:32)
  • Acute vs chronic AI betrayal (00:15:21)
  • How to catch AIs trying to escape (00:17:48)
  • The cheapest AI control techniques (00:32:48)
  • Can we get untrusted models to do trusted work? (00:38:58)
  • If we catch a model escaping... will we do anything? (00:50:15)
  • Getting AI models to think they’ve already escaped (00:52:51)
  • Will they be able to tell it’s a setup? (00:58:11)
  • Will AI companies do any of this stuff? (01:00:11)
  • Can we just give AIs fewer permissions? (01:06:14)
  • Can we stop human spies the same way? (01:09:58)
  • The pitch to AI companies to do this (01:15:04)
  • Will AIs get superhuman so fast that this is all useless? (01:17:18)
  • Risks from AI deliberately doing a bad job (01:18:37)
  • Is alignment still useful? (01:24:49)
  • Current alignment methods don’t detect scheming (01:29:12)