Harnessing Textual Refusal Directions for Multimodal Safety

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition, Cybersecurity & Data Privacy · Depth: Expert, quick

Summary

Modality-Agnostic Refusal Steering (MARS) is a training-free approach to improving safety in Multimodal Large Language Models (MLLMs). It does not require unsafe multimodal data, which is difficult to collect. Unimodal LLMs can rely on post-training alignment or refusal directions, whereas MLLMs face additional challenges. MARS reuses textual refusal directions extracted from the LLM backbone and shows that they generalize to image and video inputs. The method corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, selects the optimal intervention layer, and applies the intervention at the first generated token. Evaluated on five state-of-the-art MLLMs across safety, utility, and video jailbreak benchmarks, MARS consistently achieves safety gains while preserving utility, which indicates that safety-relevant structure is shared across modalities.

Key takeaway

If you are an AI security engineer or machine learning engineer developing MLLMs, consider Modality-Agnostic Refusal Steering (MARS) as a way to improve model safety. Because it is training-free and reuses existing textual refusal directions, it adds multimodal safety without the hard-to-collect unsafe multimodal data that other approaches require. In the reported evaluation, MARS achieved consistent safety gains across benchmarks, including video jailbreaks, while preserving the model's utility.

Key insights

Textual refusal directions from LLM backbones can generalize to MLLMs, offering a training-free path to multimodal safety.
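To make the idea of a "refusal direction" concrete, here is a minimal sketch using the common difference-of-means recipe: average the hidden activations for harmful text prompts, average them for harmless ones, and normalize the difference. The summary does not say how MARS extracts its directions, so this recipe, the function names, and the toy 3-dimensional activations are all assumptions for illustration.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: a difference-of-means direction; the source does not
    specify the extraction procedure used by MARS.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("means coincide; no direction to extract")
    return [x / norm for x in diff]

# Toy activations (hypothetical): text-only prompts at one layer.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = refusal_direction(harmful, harmless)
```

In a real model the vectors would be residual-stream activations collected from the LLM backbone on text-only prompts, which is what lets the direction be computed without any unsafe multimodal data.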

Method

MARS corrects modality misalignment via activation re-centering, adaptively scales steering strength within a trust region, and selects the optimal intervention layer. The steering is applied at the first generated token.
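The re-centering and adaptive scaling steps can be sketched as follows. This is an illustrative reading only: the summary gives no formulas, so the re-centering rule (shift the multimodal activation by the gap between the text and multimodal means), the trust region (a step capped at `rho` times the activation norm), and the names `mm_mean`, `text_mean`, `target_proj`, and `rho` are all assumptions. Layer selection and first-token hooking are not shown.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def recenter(h, mm_mean, text_mean):
    """Shift a multimodal activation into the text activation frame
    (assumed rule: subtract the multimodal mean, add the text mean)."""
    return [x - m + t for x, m, t in zip(h, mm_mean, text_mean)]

def steer(h, direction, mm_mean, text_mean, target_proj, rho):
    """Push the re-centered activation along a unit refusal direction.

    The step alpha closes the gap to target_proj but is capped at
    rho * ||h_c||, a hypothetical stand-in for the trust region.
    """
    h_c = recenter(h, mm_mean, text_mean)
    gap = target_proj - dot(h_c, direction)
    max_step = rho * math.sqrt(dot(h_c, h_c))
    alpha = min(max(gap, 0.0), max_step)
    steered = [x + alpha * d for x, d in zip(h_c, direction)]
    return steered, alpha

# Toy example (hypothetical values).
direction = [1.0, 0.0, 0.0]
h = [1.0, 4.0, 0.0]
mm_mean = [1.0, 1.0, 0.0]
text_mean = [0.0, 0.0, 0.0]

capped, alpha_capped = steer(h, direction, mm_mean, text_mean, 2.0, 0.5)
full, alpha_full = steer(h, direction, mm_mean, text_mean, 2.0, 1.0)
```

With `rho = 0.5` the cap binds and the step is smaller than the gap to the target; with `rho = 1.0` the full gap is closed. The point of such a cap is to keep the intervention from pushing activations far outside their usual range, which is one plausible way to protect utility on benign inputs.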

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.