Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

SafeDIG is a safety steering framework for Text-to-Image Diffusion Transformers (DiTs) that addresses the limitations of traditional safety controls such as prompt filtering and output detection. Those controls struggle with DiTs because generation is layered: harmful semantics can progressively bind to visual latents and entangle with rendering dynamics, which leads to unstable steering and poor generalization across risk domains. SafeDIG formulates DiT safety adaptation as position-aware sparse feature transfer. It builds Sparse Autoencoders (SAEs) at distinct DiT intervention positions and uses robustness-aware pre-training routing to identify stable intervention sites. To separate out transferable safety features, it freezes the SAE encoder as a reusable sparse safety dictionary and adapts only the decoder to the target-domain activation manifold. During inference, SafeDIG applies Blend and Repel operations to guide activations. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces unsafe generation rates both in the target domain and overall, while maintaining source-domain safety and image quality.

Key takeaway

For Machine Learning Engineers developing or deploying text-to-image diffusion models, SafeDIG offers an approach to safety that generalizes beyond prompt-level filtering. Consider position-aware sparse feature transfer mechanisms, like those in SafeDIG, to manage harmful semantics that evolve within the layered generation process. In the reported experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large, the method reduced unsafe outputs in new risk domains while preserving image quality.

Key insights

SafeDIG enables robust and generalizable safety steering for Diffusion Transformers by transferring sparse features across risk domains.
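As a rough illustration of what transferring sparse features across risk domains could look like, here is a minimal NumPy sketch of the "frozen encoder, adapted decoder" idea. It assumes a ReLU sparse autoencoder with a linear decoder and swaps in a closed-form ridge regression for the decoder fit; the summary does not say how SafeDIG trains its decoder, so this is a stand-in and not the authors' procedure. All names (`encode`, `fit_decoder`, `recon_mse`, `ridge`) and the toy Gaussian "activations" are hypothetical.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    """Frozen SAE encoder (assumed form): ReLU(x @ W_enc + b_enc)."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def fit_decoder(x, W_enc, b_enc, ridge=1e-3):
    """Fit a linear decoder (weights and bias) on activations x by ridge
    regression. The encoder parameters are only read, never updated."""
    z = encode(x, W_enc, b_enc)
    z1 = np.hstack([z, np.ones((z.shape[0], 1))])    # extra column for the bias
    gram = z1.T @ z1 + ridge * np.eye(z1.shape[1])
    sol = np.linalg.solve(gram, z1.T @ x)
    return sol[:-1], sol[-1]                         # W_dec (k, d), b_dec (d,)

def recon_mse(x, W_enc, b_enc, W_dec, b_dec):
    """Mean squared reconstruction error of the encoder/decoder pair on x."""
    x_hat = encode(x, W_enc, b_enc) @ W_dec + b_dec
    return float(np.mean((x - x_hat) ** 2))

# Toy stand-ins for DiT activations at one intervention position (assumption).
rng = np.random.default_rng(0)
d, k, n = 16, 64, 500
W_enc = rng.normal(size=(d, k)) / np.sqrt(d)     # pretend this was learned on the source domain
b_enc = np.zeros(k)
x_source = rng.normal(size=(n, d))
x_target = 0.5 * rng.normal(size=(n, d)) + 3.0   # shifted "target-domain" manifold

W_src, b_src = fit_decoder(x_source, W_enc, b_enc)   # decoder for the source domain
W_tgt, b_tgt = fit_decoder(x_target, W_enc, b_enc)   # decoder adapted to the target domain

mse_before = recon_mse(x_target, W_enc, b_enc, W_src, b_src)
mse_after = recon_mse(x_target, W_enc, b_enc, W_tgt, b_tgt)
print(f"target MSE with source decoder:  {mse_before:.4g}")
print(f"target MSE with adapted decoder: {mse_after:.4g}")
```

The point of the design is that the sparse code (the "dictionary") stays identical across domains, so whatever safety-relevant features were identified in the source domain keep the same indices; only the mapping back into activation space is refit for the new domain.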

Method

SafeDIG constructs Sparse Autoencoders (SAEs) at distinct DiT intervention positions, applies robustness-aware pre-training routing to select stable intervention sites, freezes the SAE encoder as a reusable safety dictionary, adapts only the decoder to the target-domain activation manifold, and applies Blend and Repel operations during inference.
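The summary names the Blend and Repel operations but does not give their formulas. The sketch below shows one plausible reading, purely as an assumption: Blend interpolates an activation toward a "safe" reconstruction, and Repel removes part of the activation's positive component along an "unsafe" direction. The function names, the parameters `alpha` and `beta`, and the vector `unsafe_dir` are all hypothetical and may not match the paper's definitions.

```python
import numpy as np

def blend(h, h_safe, alpha):
    """Assumed Blend: linear interpolation from h toward h_safe.
    alpha=0 leaves h unchanged, alpha=1 replaces it with h_safe."""
    return (1.0 - alpha) * h + alpha * h_safe

def repel(h, unsafe_dir, beta):
    """Assumed Repel: shrink the positive component of each row of h along
    unsafe_dir by a factor beta. Rows pointing away from it are untouched.
    h has shape (n, d); unsafe_dir has shape (d,)."""
    u = unsafe_dir / np.linalg.norm(unsafe_dir)
    proj = h @ u                      # (n,) signed component along u
    push = np.maximum(proj, 0.0)      # only act when aligned with u
    return h - beta * push[:, None] * u

# Tiny worked example on 2-D "activations".
h = np.array([[2.0, 1.0], [-2.0, 1.0]])
h_safe = np.zeros_like(h)
unsafe_dir = np.array([1.0, 0.0])

blended = blend(h, h_safe, alpha=0.5)
repelled = repel(h, unsafe_dir, beta=1.0)
print(blended)
print(repelled)
```

With `beta=1.0` the first row loses its entire component along `unsafe_dir`, while the second row, which already points away from it, is left as is. In a real pipeline these operations would be applied to activations at the selected intervention positions, which is not modeled here.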

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.