Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition, Multimedia) · Depth: Expert, quick

Summary

A framework for cross-model safety steering in generative AI addresses the challenge of model-specific safety controls by introducing a portable latent safety direction. The direction is first estimated in a source Large Language Model (LLM) from paired safe-unsafe prompts. It is then transferred to a target generator through a lightweight alignment step that uses only benign data, so no unsafe data is needed on the target side. The framework supports both a single global safety direction and a multi-vector extension for finer, category-specific control. In evaluations on text-to-image and text-to-video generation, the transferred directions reduce Attack Success Rate (ASR) and reach CLIP-Score/FID trade-offs comparable to directions learned natively on the target model with unsafe data, without compromising generation quality. The results indicate that safety-relevant behavior can be controlled through latent directions that persist across diverse models.

Key takeaway

For Machine Learning Engineers developing new generative models, this research indicates that safety controls can be added without extensive, model-specific collection of unsafe data. Consider cross-model safety steering to transfer safety directions learned in an existing LLM, which reduces the burden of acquiring and curating sensitive datasets for each new architecture while maintaining generation quality.

Key insights

Safety representations can be learned in one model and transferred to others, enabling cross-model steering without target-specific unsafe data.

Principles

Method

Estimate a safety direction in the source LLM from paired safe-unsafe prompts. Transport it to the target generator via a lightweight alignment fitted on benign data only. Apply the transported direction at inference time.
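
The three steps can be sketched in NumPy under loudly stated assumptions, since the summary does not specify the estimator, the alignment, or the steering rule. Here the direction is a difference of mean activations, the alignment is a least-squares linear map fitted on activations of the same benign prompts in both models, and steering removes the component along the transported direction. All function names, shapes, and the strength parameter are hypothetical and are not the authors' implementation.

```python
import numpy as np

def estimate_direction(safe_acts, unsafe_acts):
    """Assumed estimator: unit vector from the safe mean toward the unsafe mean."""
    d = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def fit_alignment(src_benign, tgt_benign):
    """Assumed alignment: least-squares W such that src_benign @ W ~ tgt_benign.

    Rows are activations of the same benign prompts in the source LLM and
    in the target generator; no unsafe data is involved.
    """
    W, *_ = np.linalg.lstsq(src_benign, tgt_benign, rcond=None)
    return W

def transport_direction(d_src, W):
    """Map the source direction into the target space and renormalize."""
    d_tgt = d_src @ W
    return d_tgt / np.linalg.norm(d_tgt)

def steer(hidden, d_tgt, strength=1.0):
    """Assumed steering rule: subtract `strength` times each row's
    component along d_tgt (strength=1.0 removes it entirely)."""
    coeff = hidden @ d_tgt
    return hidden - strength * np.outer(coeff, d_tgt)

# Synthetic stand-ins for real activations (hypothetical dimensions).
rng = np.random.default_rng(0)
true_map = rng.normal(size=(8, 6))          # unknown source-to-target relation
safe = rng.normal(size=(64, 8))
unsafe = safe + 2.0 * np.eye(8)[0]          # unsafe prompts shifted along axis 0
benign_src = rng.normal(size=(200, 8))
benign_tgt = benign_src @ true_map

d_src = estimate_direction(safe, unsafe)
W = fit_alignment(benign_src, benign_tgt)
d_tgt = transport_direction(d_src, W)
steered = steer(rng.normal(size=(4, 6)), d_tgt)
```

In a real system the activations would come from a chosen layer of each model, and the steering strength would be tuned against the ASR and CLIP-Score/FID trade-off the summary describes.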

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.