SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
Summary
SOAR (Self-Correction for Optimal Alignment and Refinement) is a post-training method for diffusion models that targets "exposure bias": the mismatch between the ground-truth states a model sees during training and the model-generated states it encounters at inference. Supervised fine-tuning (SFT) trains only on ground-truth states, and reinforcement learning (RL) relies on sparse terminal rewards. SOAR instead performs a single stop-gradient rollout to generate off-trajectory states, re-noises those states, and supervises the model to correct back towards the original clean target. This yields dense, reward-free, per-timestep supervision. Evaluated on SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while also raising all model-based preference scores. In controlled experiments, SOAR surpassed Flow-GRPO on aesthetic and text-image alignment tasks without using a reward model, which supports its use as a stronger first post-training stage than SFT.
Key takeaway
For AI engineers and research scientists working with diffusion models, SOAR is an alternative to SFT that directly addresses exposure bias and improves generation quality across multiple metrics. Consider using SOAR as the first post-training stage, especially for tasks that require high compositional accuracy and text rendering, before applying any targeted reward optimization.
Key insights
SOAR corrects diffusion model exposure bias by providing dense, on-policy, reward-free supervision for off-trajectory states.
Principles
- Exposure bias degrades diffusion model performance.
- Dense, per-timestep correction is superior to sparse terminal rewards.
- On-policy training improves generalization to inference states.
Method
SOAR constructs off-trajectory states via a single stop-gradient ODE step, re-noises them to auxiliary levels, and supervises the model to steer back to the original clean target using an analytically derived correction objective.
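The steps in this paragraph can be sketched on a toy vector with plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a rectified-flow parameterization (x_t = (1 - t) * x0 + t * eps), and the re-noising rule, the correction target, and every function name below are hypothetical choices made for the example, since the summary does not give the exact formulas.

```python
import random

def interpolate(x0, eps, t):
    """Assumed rectified-flow forward state: x_t = (1 - t) * x0 + t * eps."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]

def euler_step(x_t, v, t, t_next):
    """One Euler ODE step along velocity v from time t to t_next."""
    return [x + (t_next - t) * vi for x, vi in zip(x_t, v)]

def renoise(x_off, fresh_eps, s):
    """Hypothetical re-noising rule: blend the off-trajectory state with fresh noise."""
    return [(1.0 - s) * x + s * e for x, e in zip(x_off, fresh_eps)]

def correction_target(x_s, x0, s):
    """Hypothetical target: the velocity whose straight step from s to 0 lands on x0."""
    return [(xs - a) / s for xs, a in zip(x_s, x0)]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def soar_style_loss(model, x0, t, t_next, s, rng):
    """Return (loss, re-noised state, target) for one clean sample x0."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = interpolate(x0, eps, t)
    # Rollout: the model's own prediction moves the state off the ground-truth path.
    # In an autodiff framework this call would run inside a no-gradient context.
    x_off = euler_step(x_t, model(x_t, t), t, t_next)
    fresh = [rng.gauss(0.0, 1.0) for _ in x0]
    x_s = renoise(x_off, fresh, s)
    target = correction_target(x_s, x0, s)
    # Dense supervision: the loss is defined at this timestep, with no reward model.
    return mse(model(x_s, s), target), x_s, target

rng = random.Random(0)
toy_model = lambda x, t: [0.5 * xi for xi in x]  # stand-in for a velocity network
x0 = [1.0, -2.0, 0.5]
loss, x_s, target = soar_style_loss(toy_model, x0, t=0.8, t_next=0.6, s=0.7, rng=rng)
recovered = euler_step(x_s, target, 0.7, 0.0)  # following the target returns to x0
```

Whatever re-noising rule is used, the target in this sketch always points from the re-noised state back to the clean sample, which is what makes the supervision available at every timestep without a reward signal.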
In practice
- Replace SFT with SOAR as a first post-training stage.
- Curate high-quality data for SOAR training.
- Consider ODE-only SOAR for efficiency.
Topics
- Diffusion Models
- Exposure Bias
- Trajectory Correction
- Flow Matching
- Post-Training Alignment
Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.