Harnessing Textual Refusal Directions for Multimodal Safety

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Modality-Agnostic Refusal Steering (MARS) is a training-free approach to improving safety in Multimodal Large Language Models (MLLMs) that does not require unsafe multimodal data, which is difficult to collect. Unimodal LLMs can rely on post-training alignment or refusal directions, but MLLMs pose additional challenges, such as misalignment between modalities. MARS takes textual refusal directions extracted from the LLM backbone and shows that they generalize to image and video inputs. The method corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, selects the optimal intervention layer, and operates at the first generated token. Evaluated on five state-of-the-art MLLMs across safety, utility, and video jailbreak benchmarks, MARS consistently improves safety while preserving utility, which suggests that safety-relevant structure is shared across modalities.

Key takeaway

If you are an AI Security Engineer or Machine Learning Engineer developing MLLMs, consider Modality-Agnostic Refusal Steering (MARS) as a way to improve model safety. This training-free approach injects multimodal safety by reusing existing textual refusal directions, so you do not need hard-to-collect unsafe multimodal data. In the reported evaluation, MARS achieves consistent safety gains across benchmarks, including video jailbreaks, while preserving the model's utility.

Key insights

Textual refusal directions from LLM backbones can generalize to MLLMs, offering a training-free path to multimodal safety.

Principles

Multimodal safety alignment is constrained by unsafe data scarcity.
Textual refusal directions generalize across image and video modalities.
Steering effectiveness depends on layer, strength, and cross-modal alignment.
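
As a rough illustration of what a "textual refusal direction" can look like, the sketch below uses the difference-of-means construction that is common in the refusal-direction literature: average the hidden activations for harmful text prompts, average them for harmless ones, and normalize the difference. This summary does not state how MARS extracts its directions, so the construction, the function names (`mean_vector`, `refusal_direction`), and the toy numbers are all assumptions made for illustration.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: difference-of-means is used here as a stand-in; the
    summarized article may extract the direction differently.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("harmful and harmless means coincide; no direction")
    return [x / norm for x in diff]

# Toy 3-dimensional "activations" (made-up numbers, illustration only).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = refusal_direction(harmful, harmless)
```

In a real model the activations would come from one chosen layer of the LLM backbone, collected on text-only prompts, which is what makes the approach independent of unsafe multimodal data.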

Method

MARS corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, and selects the optimal intervention layer. The intervention is applied at the first generated token.
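
The two geometric steps named above can be sketched as follows. This is a minimal sketch under stated assumptions, not the article's algorithm: re-centering is modeled as shifting a multimodal activation by the gap between a multimodal mean and a text mean, and the trust region is modeled as a cap on the steering step relative to the activation norm. The names `recenter`, `steer_within_trust_region`, `target_proj`, and `max_ratio` are hypothetical, and the summary does not give the actual formulas.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def recenter(h, multimodal_mean, text_mean):
    """Shift a multimodal activation so its mean matches the text mean.

    Assumption: a simple mean shift stands in for "activation re-centering".
    """
    return [x - m + t for x, m, t in zip(h, multimodal_mean, text_mean)]

def steer_within_trust_region(h, direction, target_proj, max_ratio):
    """Move h along a unit `direction` toward `target_proj`, with a capped step.

    Assumption: the trust region is a ball of radius max_ratio * ||h||.
    The step is zero when h already projects at or beyond the target.
    """
    gap = target_proj - dot(h, direction)
    radius = max_ratio * math.sqrt(dot(h, h))
    alpha = min(max(gap, 0.0), radius)
    return [x + alpha * d for x, d in zip(h, direction)], alpha

# Toy values, made up for illustration.
direction = [1.0, 0.0, 0.0]
h = recenter([5.0, 4.0, 0.0], multimodal_mean=[2.0, 0.0, 0.0], text_mean=[0.0, 0.0, 0.0])
steered, alpha = steer_within_trust_region(h, direction, target_proj=6.0, max_ratio=0.5)
```

Here the re-centered activation is `[3.0, 4.0, 0.0]` with norm 5, so the step needed to reach the target (3.0) is clipped to the radius 2.5. The point of a cap like this is that the steering strength adapts per input while the perturbation stays bounded, which is one plausible way to protect utility.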

In practice

Inject multimodal safety without needing multimodal safety data.
Apply safety interventions at the first generated token.
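
To make "at the first generated token" concrete, here is a toy decode loop in which a steering function touches the hidden state only at step 0, and later steps are affected only through the state they inherit. The loop, `generate`, `step_fn`, and `steer_fn` are hypothetical stand-ins for a real MLLM forward pass and hook mechanism, which this summary does not describe.

```python
def generate(step_fn, initial_state, num_tokens, steer_fn=None):
    """Toy autoregressive loop: steer_fn is applied at the first step only."""
    state = initial_state
    outputs = []
    for step in range(num_tokens):
        hidden = step_fn(state)
        if step == 0 and steer_fn is not None:
            hidden = steer_fn(hidden)  # one-time intervention
        outputs.append(hidden)
        state = hidden  # later steps see the steered state indirectly
    return outputs

calls = []

def add_refusal(hidden):
    """Hypothetical steering: add a fixed offset along the first axis."""
    calls.append(1)
    return [hidden[0] + 1.0] + hidden[1:]

def halve(state):
    """Stand-in for one model step (not a real transformer)."""
    return [0.5 * x for x in state]

baseline = generate(halve, [0.0, 2.0], num_tokens=3)
steered = generate(halve, [0.0, 2.0], num_tokens=3, steer_fn=add_refusal)
```

In this toy run the steering function is called once, yet every later output differs from the baseline along the first axis, which mirrors the idea that a single early intervention can shape the rest of the response.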

Topics

Multimodal LLMs
LLM Safety
Refusal Directions
Activation Steering
Computer Vision
Machine Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.