Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation
Summary
A framework for cross-model safety steering in generative AI addresses the problem that safety controls are usually model-specific by introducing a portable latent safety direction. The direction is first estimated in a source Large Language Model (LLM) from paired safe and unsafe prompts. It is then transferred to a target generator through a lightweight alignment step that uses only benign data, so no unsafe data is needed on the target side. The framework supports both a single global safety direction and a multi-vector extension for category-specific control. In evaluations on text-to-image and text-to-video generation, the transferred directions reduce Attack Success Rate (ASR) and keep CLIP-Score/FID trade-offs comparable to those of directions learned natively on the target model with unsafe data, without degrading generation quality. The results suggest that safety-relevant behavior can be controlled through latent directions that persist across different models.
Key takeaway
For Machine Learning Engineers developing new generative models, this research suggests that effective safety controls do not require collecting unsafe data for each model. Consider cross-model safety steering to transfer safety directions learned in an existing LLM to a new generator, which reduces the burden of acquiring and curating sensitive datasets for every new architecture. According to the reported evaluations, this improves safety while maintaining generation quality.
Key insights
Safety representations can be learned in one model and transferred to others, enabling cross-model steering without target-specific unsafe data.
Principles
- Safety control can be modular and reusable.
- Latent directions encode safety behaviors.
- Target-side unsafe data is not required.
Method
Estimate a safety direction in the source LLM from paired safe-unsafe prompts. Transport it to the target generator via lightweight alignment on benign data. Apply it at inference time.
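The three steps in the method above can be sketched with NumPy on pre-extracted activation matrices. This is only an illustration under stated assumptions, not the paper's implementation: the summary does not say how the direction is estimated, how the alignment is fitted, or how steering is applied, so the difference-of-means estimate, the least-squares linear map, and the projection-removal update are all stand-ins, and every function name here is hypothetical. The sketch also assumes the same benign prompts are fed to both models so that their activations are paired row by row.

```python
import numpy as np


def estimate_direction(safe_acts, unsafe_acts):
    """Assumed estimator: unit vector from the safe mean to the unsafe mean.

    Both inputs have shape (n_prompts, src_dim) and come from the source LLM.
    """
    d = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def fit_alignment(src_benign, tgt_benign):
    """Assumed alignment: least-squares map W with src_benign @ W ~ tgt_benign.

    Only benign, row-paired activations are used; no unsafe target data.
    """
    W, *_ = np.linalg.lstsq(src_benign, tgt_benign, rcond=None)
    return W


def transfer_direction(d_src, W):
    """Map the source direction into the target space and renormalize."""
    d_tgt = d_src @ W
    return d_tgt / np.linalg.norm(d_tgt)


def steer(h, d, alpha=1.0):
    """Assumed steering rule: subtract alpha times the component of h along d.

    h has shape (n, dim); d is a unit vector of shape (dim,).
    alpha=1.0 removes the component entirely, alpha=0.0 leaves h unchanged.
    """
    return h - alpha * np.outer(h @ d, d)
```

In use, `estimate_direction` would run once on the source LLM, `fit_alignment` once per target generator, and `steer` on the target's hidden states during generation. A linear map is the simplest choice that keeps a direction a direction after transfer; whether it is expressive enough for a given pair of models is an empirical question.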
In practice
- Apply learned safety directions to new generative models.
- Implement multi-vector steering for specific categories.
- Reduce ASR in visual generation tasks.
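The multi-vector steering mentioned in the list above could look like the following sketch, which keeps one direction per safety category and applies each with its own strength. The category names, the `steer_multi` function, and the projection-removal rule are illustrative assumptions; the summary does not describe how the paper combines multiple vectors.

```python
import numpy as np


def steer_multi(h, directions, strengths):
    """Apply category-specific steering to hidden states h of shape (n, dim).

    directions: dict mapping a category name to a vector of shape (dim,).
    strengths:  dict mapping a category name to its steering strength;
                categories that are missing or set to 0.0 are skipped.
    """
    out = np.array(h, dtype=float)  # copy so the caller's array is untouched
    for name, d in directions.items():
        alpha = strengths.get(name, 0.0)
        if alpha == 0.0:
            continue
        unit = d / np.linalg.norm(d)
        # Remove alpha times the component along this category's direction.
        out = out - alpha * np.outer(out @ unit, unit)
    return out


# Hypothetical categories, chosen only for illustration.
directions = {
    "violence": np.array([2.0, 0.0, 0.0]),
    "nudity": np.array([0.0, 1.0, 0.0]),
}
hidden = np.array([[1.0, 2.0, 3.0]])
steered = steer_multi(hidden, directions, {"violence": 1.0})
```

Because the directions are applied one after another, this only removes each component exactly when the category directions are orthogonal; with correlated directions, a joint projection onto their span would be the more careful choice.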
Topics
- Cross-Model Steering
- Generative AI Safety
- Latent Representations
- Text-to-Image Generation
- Text-to-Video Generation
- Large Language Models
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, AI Security Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.