Computation and Language - Automating Steering for Safe Multimodal Large Language Models - PaperLedge

July 20, 2025 • 7 mins

Alright learning crew, Ernis here, ready to dive into some cutting-edge research! Today, we’re talking about keeping AI safe, specifically those super-smart AIs that can understand both words and images - what we call Multimodal Large Language Models, or MLLMs for short.

Think of it like this: imagine you're teaching a child to recognize a "bad" thing, like a hot stove. You show them pictures, tell them stories, and explain why touching it is dangerous. Now, imagine someone tries to trick the child, maybe by making the stove look like a toy. That's kind of what "adversarial multimodal inputs" are doing to these MLLMs – trying to fool them into doing something unsafe!

These MLLMs are becoming incredibly powerful, but with great power comes great responsibility, right? The researchers behind this paper were concerned about these “attacks” and wanted to find a way to make these AIs safer without having to constantly retrain them from scratch.

Their solution is called AutoSteer, and it's like giving the AI a built-in safety mechanism that kicks in while the model is being used, at inference time. Think of it as adding a smart "filter" to its thinking process. Instead of retraining the whole AI, they intervene only when things get risky.

AutoSteer has three main parts:

Safety Awareness Score (SAS): This is like the AI's inner sense of danger. It figures out which parts of the AI's "brain", meaning which of its internal layers, are most sensitive to safety issues. It's like knowing which friend gives the best advice when you're facing a tough decision.
Adaptive Safety Prober: This part is like a lie detector. It reads the AI's intermediate representations, its "thought process", and tries to predict whether it's about to say or do something harmful. It's trained to spot those red flags!
Refusal Head: This is the actual intervention part. If the "lie detector" senses danger, the Refusal Head steps in and gently nudges the AI in a safer direction. It might subtly change the wording or even refuse to answer a dangerous question.
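For listeners who want to see how the three pieces could fit together, here is a toy sketch in plain Python. The episode doesn't give AutoSteer's actual formulas, so every piece here is a stand-in. `layer_score` is a hypothetical substitute for the Safety Awareness Score: it measures how far apart "safe" and "unsafe" activations sit at each layer. `train_probe` is a simple logistic-regression probe standing in for the Adaptive Safety Prober. `guarded_generate` is a simplified refusal gate that swaps in a fixed refusal string instead of the paper's learned Refusal Head. The activations are made-up 2-D vectors, not real MLLM hidden states.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Probe = Tuple[Vector, float]  # (weights, bias)


def mean(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def layer_score(safe: Sequence[Vector], unsafe: Sequence[Vector]) -> float:
    """Hypothetical SAS stand-in (NOT the paper's formula): gap between the
    class means divided by the average within-class spread."""
    mu_s, mu_u = mean(safe), mean(unsafe)
    spread = (sum(distance(v, mu_s) for v in safe)
              + sum(distance(v, mu_u) for v in unsafe)) / (len(safe) + len(unsafe))
    return distance(mu_s, mu_u) / (spread + 1e-8)


def pick_layer(safe_by_layer, unsafe_by_layer) -> int:
    """Index of the layer whose activations best separate safe from unsafe inputs."""
    scores = [layer_score(s, u) for s, u in zip(safe_by_layer, unsafe_by_layer)]
    return max(range(len(scores)), key=lambda i: scores[i])


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(safe, unsafe, lr: float = 0.5, epochs: int = 200) -> Probe:
    """Logistic-regression probe trained with plain SGD; label 1 means 'unsafe'."""
    data = [(v, 0.0) for v in safe] + [(v, 1.0) for v in unsafe]
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def risk(probe: Probe, hidden: Vector) -> float:
    """Probe's estimated probability that this hidden state leads to unsafe output."""
    w, b = probe
    return sigmoid(sum(wi * xi for wi, xi in zip(w, hidden)) + b)


def guarded_generate(generate: Callable[[], str], probe: Probe, hidden: Vector,
                     threshold: float = 0.5,
                     refusal: str = "I can't help with that.") -> str:
    """Simplified refusal gate: intervene only when the probe flags risk,
    otherwise let normal generation run untouched."""
    if risk(probe, hidden) > threshold:
        return refusal
    return generate()


# Toy 2-D activations for two layers (made-up data, not from the paper).
SAFE_BY_LAYER = [
    [[0.0, 0.1], [0.1, 0.0], [0.0, -0.1]],       # layer 0: classes overlap
    [[-1.0, -1.0], [-1.2, -0.8], [-0.9, -1.1]],  # layer 1: classes separate
]
UNSAFE_BY_LAYER = [
    [[0.1, 0.1], [-0.1, 0.0], [0.05, -0.05]],
    [[1.0, 1.0], [1.1, 0.9], [0.8, 1.2]],
]

layer = pick_layer(SAFE_BY_LAYER, UNSAFE_BY_LAYER)
probe = train_probe(SAFE_BY_LAYER[layer], UNSAFE_BY_LAYER[layer])
print("selected layer:", layer)
print(guarded_generate(lambda: "normal answer", probe, [1.0, 1.0]))
print(guarded_generate(lambda: "normal answer", probe, [-1.0, -1.0]))
```

The design choice worth noticing is the gate: the probe runs on every input, but the intervention only fires above the threshold, which is why this style of defense avoids retraining the underlying model.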

The researchers tested AutoSteer on some popular MLLMs like LLaVA-OV and Chameleon, using tricky situations designed to fool the AI. They found that AutoSteer significantly reduced the Attack Success Rate (ASR) – meaning it was much harder to trick the AI into doing something unsafe, whether the threat came from text, images, or a combination of both.
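Attack Success Rate itself is just a ratio: the fraction of adversarial prompts that actually got an unsafe response. Here's a minimal sketch of computing it before and after a defense, assuming a hypothetical `naive_judge` that treats any non-refusal as a successful attack. Real evaluations usually rely on human or model-based judges, and the episode doesn't say which one the paper used. The responses are placeholders, not the paper's results.

```python
from typing import Callable, Sequence


def attack_success_rate(responses: Sequence[str],
                        is_harmful: Callable[[str], bool]) -> float:
    """Fraction of adversarial prompts whose response was judged harmful."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if is_harmful(r)) / len(responses)


def relative_reduction(asr_before: float, asr_after: float) -> float:
    """Share of the baseline ASR removed by the defense
    (0.0 = no change, 1.0 = attacks fully blocked)."""
    if asr_before == 0:
        return 0.0
    return (asr_before - asr_after) / asr_before


def naive_judge(response: str) -> bool:
    """Hypothetical judge: anything that is not a refusal counts as a successful attack."""
    return not response.startswith("I can't")


# Placeholder responses to the same four adversarial prompts (toy data).
baseline = ["Sure, here is how...", "Step 1: ...", "Sure, ...", "I can't help with that."]
steered = ["I can't help with that.", "I can't help with that.", "Sure, ...", "I can't help with that."]

before = attack_success_rate(baseline, naive_judge)
after = attack_success_rate(steered, naive_judge)
print(before, after, relative_reduction(before, after))
```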

Here’s a key takeaway:

AutoSteer acts as a practical, understandable, and effective way to make multimodal AI systems safer in the real world.

So, why does this matter to you?

For the everyday user: Safer AI means less chance of encountering harmful content, biased information, or being manipulated by AI-powered scams.
For developers: AutoSteer provides a practical way to build safer AI systems without the huge cost of retraining models from scratch.
For policymakers: This research offers a potential framework for regulating AI safety and ensuring responsible development.

This research is a big step towards building AI that’s not only powerful but also trustworthy and aligned with human values.

Now, some questions to ponder:

Could AutoSteer, or systems like it, be used to censor AI or push certain agendas? How do we ensure fairness and transparency in these interventions?
As AI gets even more sophisticated, will these "attackers" always be one step ahead? How do we create safety mechanisms that can adapt to new and unforeseen threats?
What are the ethical implications of "nudging" an AI's responses? At what point does intervention become manipulation?

That's
