SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

SOAR (Self-Correction for Optimal Alignment and Refinement) is a post-training method for diffusion models that targets exposure bias: the mismatch between the ground-truth states a model is trained on and the model-generated states it encounters at inference. Supervised fine-tuning (SFT) optimizes only on ideal states, and reinforcement learning (RL) relies on sparse terminal rewards. SOAR instead performs a single stop-gradient rollout to generate off-trajectory states, re-noises them, and supervises the model to correct back toward the original clean target, which yields dense, reward-free, per-timestep supervision. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 relative to SFT, and raises all model-based preference scores. In controlled experiments it also surpassed Flow-GRPO on aesthetic and text-image alignment tasks without using a reward model, which supports its use as a first post-training stage in place of SFT.

Key takeaway

For AI engineers and research scientists working with diffusion models, SOAR is an alternative to SFT that trains directly on the off-trajectory states responsible for exposure bias, and it improved every reported metric over SFT on SD3.5-Medium. Consider running SOAR as the first post-training stage, particularly for tasks that depend on compositional accuracy and text rendering, and apply targeted reward optimization only afterward.

Key insights

SOAR corrects diffusion model exposure bias by providing dense, on-policy, reward-free supervision for off-trajectory states.

Principles

Exposure bias degrades diffusion model performance.
Dense, per-timestep correction can outperform sparse terminal rewards, without needing a reward model.
On-policy training improves generalization to inference states.

Method

SOAR constructs off-trajectory states via a single stop-gradient ODE step, re-noises them to auxiliary levels, and supervises the model to steer back to the original clean target using an analytically derived correction objective.
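
The three steps in the method description can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation. It assumes a rectified-flow path x_t = (1 - t) * x0 + t * eps, with t = 0 clean and t = 1 pure noise, a marginal-preserving re-noising rule, and a correction target equal to the straight-line velocity from the re-noised state back to x0. All function names, the toy model, and the choice of t, dt, and s are hypothetical; the summary does not specify the exact objective.

```python
import numpy as np

def interpolate(x0, eps, t):
    """Assumed rectified-flow path: t=0 is clean data, t=1 is pure noise."""
    return (1.0 - t) * x0 + t * eps

def euler_step(model, x_t, t, dt):
    """One ODE step toward the data end (t decreases by dt)."""
    return x_t - dt * model(x_t, t)

def renoise(x_hat, t_from, s, rng):
    """Lift a state from noise level t_from to s >= t_from using fresh noise.

    Assumption: scale the state so its clean part has weight (1 - s), then
    add just enough new noise for the total noise scale to equal s.
    """
    a = (1.0 - s) / (1.0 - t_from)
    b = np.sqrt(max(s**2 - (a * t_from) ** 2, 0.0))
    return a * x_hat + b * rng.standard_normal(x_hat.shape)

def correction_target(x_tilde, x0, s):
    """Velocity whose straight path from (x_tilde, s) reaches x0 at t=0."""
    return (x_tilde - x0) / s

def soar_style_loss(model, x0, t, dt, s, rng):
    """Hypothetical per-timestep correction loss for one (t, s) pair."""
    eps = rng.standard_normal(x0.shape)
    x_t = interpolate(x0, eps, t)
    # Rollout state is treated as a constant. NumPy has no autograd, so this
    # is automatic here; in an autodiff framework it would need an explicit
    # stop-gradient (e.g. torch.no_grad or jax.lax.stop_gradient).
    x_hat = euler_step(model, x_t, t, dt)
    x_tilde = renoise(x_hat, t - dt, s, rng)
    target = correction_target(x_tilde, x0, s)
    pred = model(x_tilde, s)
    return float(np.mean((pred - target) ** 2))

# Toy usage with a stand-in "model" (not a real network).
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))
toy_model = lambda x, t: 0.5 * x
loss = soar_style_loss(toy_model, x0, t=0.8, dt=0.1, s=0.9, rng=rng)
```

Under these assumptions, the target reduces to the usual flow-matching velocity eps - x0 whenever the state is exactly on the training path, so the correction term only changes the supervision for states the model itself has pushed off-trajectory.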

In practice

Replace SFT with SOAR as a first post-training stage.
Curate high-quality data for SOAR training.
Consider ODE-only SOAR for efficiency.

Topics

Diffusion Models
Exposure Bias
Trajectory Correction
Flow Matching
Post-Training Alignment

Code references

black-forest-labs/flux

Best for: AI Engineer, Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.