Harnessing Textual Refusal Directions for Multimodal Safety
Summary
Modality-Agnostic Refusal Steering (MARS) is a training-free approach for improving the safety of Multimodal Large Language Models (MLLMs) without requiring unsafe multimodal data, which is difficult to collect. Text-only LLMs are typically made safer through post-training alignment or refusal directions; applying either to MLLMs is harder, in part because of that data scarcity. MARS instead extracts textual refusal directions from the LLM backbone and shows that they generalize to image and video inputs. Operating at the first generated token, the method corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, and selects the optimal intervention layer. Evaluated on five state-of-the-art MLLMs across safety, utility, and video jailbreak benchmarks, MARS consistently achieves safety gains while preserving utility, which suggests that safety-relevant structure is shared across modalities.
Key takeaway
If you are an AI Security Engineer or Machine Learning Engineer developing MLLMs, consider Modality-Agnostic Refusal Steering (MARS) as a way to improve model safety. Because it is training-free and reuses textual refusal directions from the LLM backbone, it avoids the need for hard-to-collect unsafe multimodal data. In the reported evaluation, MARS delivers consistent safety gains across benchmarks, including video jailbreaks, while preserving the model's utility.
Key insights
Textual refusal directions from LLM backbones can generalize to MLLMs, offering a training-free path to multimodal safety.
Principles
- Multimodal safety alignment is constrained by unsafe data scarcity.
- Textual refusal directions generalize across image and video modalities.
- Steering effectiveness depends on layer, strength, and cross-modal alignment.
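As a rough illustration of the principles above, the sketch below computes a refusal direction as the normalized difference between mean activations on harmful and harmless text prompts, then picks the layer where that gap is largest. This summary does not say how MARS extracts its directions or scores layers, so the difference-of-means recipe, the gap-based layer criterion, and every name here (`refusal_direction`, `pick_layer`, `acts_by_layer`) are assumptions made for illustration. Plain Python lists stand in for activation tensors.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Return (unit direction, gap) from a difference of means.

    ASSUMPTION: difference-of-means is a common way to obtain a refusal
    direction; the summary does not confirm MARS uses exactly this.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    gap = math.sqrt(sum(x * x for x in diff))
    if gap == 0.0:
        raise ValueError("harmful and harmless means coincide; no direction")
    return [x / gap for x in diff], gap


def pick_layer(acts_by_layer):
    """Choose the layer with the largest mean gap (hypothetical criterion).

    acts_by_layer maps layer index -> (harmful_acts, harmless_acts).
    """
    best_layer, best_dir, best_gap = None, None, -1.0
    for layer, (harmful, harmless) in acts_by_layer.items():
        direction, gap = refusal_direction(harmful, harmless)
        if gap > best_gap:
            best_layer, best_dir, best_gap = layer, direction, gap
    return best_layer, best_dir


# Toy text-only activations for two layers of a 2-dimensional "model".
toy_acts = {
    0: ([[1.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]),
    1: ([[0.0, 3.0], [0.0, 5.0]], [[0.0, 0.0], [0.0, 0.0]]),
}
layer, direction = pick_layer(toy_acts)
```

In a real setting the activations would come from the LLM backbone's residual stream on text-only prompts, and the layer criterion would more plausibly be validated against downstream safety and utility metrics than against a raw gap.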
Method
MARS intervenes at the first generated token. It corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, and selects the optimal intervention layer.
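The sketch below shows one way the re-centering and trust-region steps could fit together. The summary gives no formulas, so everything here is an assumption: re-centering is modeled as shifting a multimodal activation by the offset between text-only and multimodal mean activations, the adaptive strength is the shortfall from a target projection, and the trust region is a cap on the step size relative to the norm of the hidden state. The names `steer`, `target_proj`, and `radius_frac` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def steer(h, direction, text_mean, mm_mean, target_proj, radius_frac):
    """Add a bounded multiple of the refusal direction to hidden state h.

    h           : hidden state at the chosen layer, first generated token
    direction   : unit-norm textual refusal direction
    text_mean   : mean activation over text-only inputs
    mm_mean     : mean activation over multimodal inputs
    target_proj : desired projection onto the direction (hypothetical knob)
    radius_frac : trust-region radius as a fraction of ||h|| (hypothetical)
    """
    # Re-center (ASSUMED form): view the multimodal activation in the
    # text-only frame before measuring its refusal projection.
    centered = [x - m + t for x, m, t in zip(h, mm_mean, text_mean)]
    proj = dot(centered, direction)

    # Adaptive strength (ASSUMED form): push only as far as the shortfall.
    alpha = max(0.0, target_proj - proj)

    # Trust region (ASSUMED form): never move further than radius_frac * ||h||.
    alpha = min(alpha, radius_frac * norm(h))

    return [x + alpha * d for x, d in zip(h, direction)]


h = [3.0, 4.0]                      # ||h|| = 5
direction = [1.0, 0.0]
text_mean, mm_mean = [0.0, 0.0], [1.0, 0.0]

capped = steer(h, direction, text_mean, mm_mean, target_proj=10.0, radius_frac=0.2)
partial = steer(h, direction, text_mean, mm_mean, target_proj=2.5, radius_frac=0.2)
untouched = steer(h, direction, text_mean, mm_mean, target_proj=1.0, radius_frac=0.2)
```

With these toy numbers the re-centered projection is 2. A target of 10 is clipped by the trust region to a step of 1, a target of 2.5 needs only a step of 0.5, and a target of 1 is already met, so the hidden state is left alone. Bounding the step relative to the activation norm is one plausible way to keep the intervention from degrading utility on benign inputs.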
In practice
- Inject multimodal safety without needing multimodal safety data.
- Apply safety interventions at the first generated token.
Topics
- Multimodal LLMs
- LLM Safety
- Refusal Directions
- Activation Steering
- Computer Vision
- Machine Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.