SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
Summary
SafeDiffusion-R1 is an online reinforcement learning framework for reducing unsafe content generation in diffusion models without relying on expensive supervised data or specialized reward models. It applies Group Relative Policy Optimization (GRPO) during post-training, using both negative and positive text prompts. Its key component is a "steering reward mechanism" that uses CLIP embeddings to guide text representations toward safety and away from harmful directions in the embedding space. Because training is online, the model can learn from diverse prompts, including those with explicit unsafe content, while avoiding catastrophic forgetting and maintaining generation quality. In experiments, SafeDiffusion-R1 reduces the inappropriate-content rate to 18.07% (from 48.9% for SD v1.4) and nudity detections to 15 (from a baseline of 646), while improving compositional generation quality on GenEval from 42.08% to 47.83%. These safety improvements generalize across seven harm categories on out-of-domain unsafe prompts.
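To put the reported figures on a common scale, the short calculation below derives relative changes from the numbers quoted in the summary. The helper name `relative_change` is introduced here purely for illustration; only the six input values come from the summary.

```python
def relative_change(before, after):
    """Return the fractional change from `before` to `after` (negative means a drop)."""
    return (after - before) / before


# Values quoted in the summary.
inappropriate = relative_change(48.9, 18.07)  # inappropriate-content rate, in percent
nudity = relative_change(646, 15)             # nudity detections, raw count
geneval = relative_change(42.08, 47.83)       # GenEval score, in percent

print(f"inappropriate content: {inappropriate:+.1%}")
print(f"nudity detections:     {nudity:+.1%}")
print(f"GenEval:               {geneval:+.1%}")
```

This works out to a relative drop of roughly 63% in inappropriate content, a drop of nearly 98% in nudity detections, and a relative GenEval gain of about 14% (5.75 percentage points).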
Key takeaway
For research scientists developing or deploying diffusion models, SafeDiffusion-R1 offers a way to improve content safety without the cost of collecting supervised data. Teams can adopt its online reinforcement learning framework with CLIP-based steering to reduce unsafe generations while preserving, or even improving, compositional quality, which addresses a key ethical and deployment concern more efficiently.
Key insights
Online reinforcement learning with CLIP embedding steering effectively enhances diffusion model safety without supervised data.
Principles
- Online learning helps avoid catastrophic forgetting.
- CLIP embeddings can steer safety directions.
- GRPO optimizes policies with diverse prompts.
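The "group relative" part of GRPO can be illustrated with a small sketch: several samples are generated for the same prompt, each receives a reward, and each sample's advantage is its reward normalized against the group's mean and standard deviation. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (one group per prompt).

    `eps` is an assumed small constant that avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is another possible choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rewards for four images generated from one prompt (made-up values).
advantages = group_relative_advantages([0.1, 0.4, 0.4, 0.9])
print(advantages)
```

Samples that score above the group mean get a positive advantage and are reinforced; those below it are discouraged. No separate value network is needed, since the group itself serves as the baseline.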
Method
The method uses online reinforcement learning with Group Relative Policy Optimization (GRPO) and a steering reward mechanism that exploits CLIP embeddings to guide text representations towards safety in the embedding space.
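One way to picture a steering reward is as a score that rises when an embedding aligns with a "safe" direction and falls when it aligns with an "unsafe" one. The sketch below is a hypothetical formulation, not the paper's actual reward: the function names, the difference-of-cosines form, and the toy 3-dimensional vectors (standing in for real CLIP embeddings) are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no direction
    return dot / (norm_a * norm_b)


def steering_reward(sample_emb, safe_emb, unsafe_emb):
    """Hypothetical reward: similarity to the safe direction minus
    similarity to the unsafe direction. Ranges over [-2, 2]."""
    return cosine(sample_emb, safe_emb) - cosine(sample_emb, unsafe_emb)


# Toy vectors standing in for CLIP embeddings (made-up values).
safe_direction = [1.0, 0.0, 0.0]
unsafe_direction = [0.0, 1.0, 0.0]

print(steering_reward([0.9, 0.1, 0.0], safe_direction, unsafe_direction))  # mostly safe
print(steering_reward([0.1, 0.9, 0.0], safe_direction, unsafe_direction))  # mostly unsafe
```

In a real pipeline the embeddings would come from a CLIP encoder, and the resulting scalar would serve as the per-sample reward that GRPO optimizes.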
In practice
- Integrate GRPO for post-training safety.
- Utilize CLIP embeddings for reward steering.
- Apply online learning to avoid forgetting.
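For the first item, a minimal sketch of the per-sample objective may help. The summary does not state the loss SafeDiffusion-R1 uses, so the code below shows the standard PPO-style clipped surrogate that GRPO formulations commonly build on. The function name `clipped_surrogate` and the default `clip_eps=0.2` are assumptions for this example.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective for one sample (to be maximized).

    logp_new / logp_old: log-probabilities of the sample under the current
    policy and under the policy that generated it.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


# A sample whose probability doubled under the new policy (ratio = 2).
print(clipped_surrogate(math.log(2.0), 0.0, advantage=1.0))   # gain is capped by clipping
print(clipped_surrogate(math.log(2.0), 0.0, advantage=-1.0))  # penalty is not capped
```

Because samples are drawn from the current policy at each step, the ratio stays close to 1 in online training, and clipping mainly guards against overly large updates.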
Topics
- SafeDiffusion-R1
- Diffusion Models
- Online Reinforcement Learning
- CLIP Embeddings
- Content Moderation
Best for: Computer Vision Engineer, Research Scientist, CTO, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.