SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation
Summary
Score Gradient Matching Distillation (SGMD) distills video diffusion models into few-step generators for faster inference, addressing limitations of the widely used Distribution Matching Distillation (DMD) paradigm. DMD-style methods are costly to train because the fake score must track a continuously evolving generator, and their conservative reverse-KL matching can suppress strong motion dynamics. SGMD tackles both issues by directly optimizing the fake score towards the teacher, using a teacher stop-gradient Fisher objective as a stable distribution-matching target. The method uses two potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Benchmarked against DMD2, SGMD trains roughly 3x faster and significantly improves motion dynamics in 4-step distilled models while preserving temporal consistency. Human evaluations also show a preference for SGMD's motion quality and overall performance, with visual quality and text alignment remaining comparable.
Key takeaway
For Machine Learning Engineers optimizing video diffusion models for faster inference, SGMD offers a compelling alternative to DMD-style methods. If you are struggling with costly training or conservative motion dynamics in few-step distillation, consider implementing SGMD. Compared with DMD2, it reports roughly 3x faster training and substantially better motion quality in 4-step distilled models, with comparable visual quality and text alignment. Explore the provided code to integrate this method into your workflows.
Key insights
SGMD optimizes the fake score directly towards the teacher with a stable teacher stop-gradient Fisher objective, which makes few-step video distillation faster to train and improves motion dynamics.
Principles
- Direct fake score optimization accelerates training.
- Stable distribution matching improves motion dynamics.
- Dual potentials split the work: NR corrects the outer loop, RC tracks the inner loop.
Method
SGMD optimizes the fake score towards the teacher using a teacher stop-gradient Fisher objective for stable distribution matching. It employs two potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking.
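To make the idea of a stop-gradient Fisher-type objective concrete, here is a minimal toy sketch in plain Python. It is not the SGMD implementation: it assumes 1D Gaussian scores, and the names `teacher_score`, `fake_score`, and `fit_fake_score` are hypothetical. It does not model the NR or RC potentials. It only shows a squared score residual in which the teacher score is a fixed target, so gradients reach only the fake-score parameter.

```python
import random


def teacher_score(x, mu_teacher=2.0, sigma=1.0):
    """Score d/dx log p(x) of a toy Gaussian teacher N(mu_teacher, sigma^2)."""
    return -(x - mu_teacher) / sigma**2


def fake_score(x, mu_fake, sigma=1.0):
    """Score of a toy Gaussian 'fake' model with a learnable mean mu_fake."""
    return -(x - mu_fake) / sigma**2


def fisher_loss_and_grad(xs, mu_fake, sigma=1.0):
    """Mean of 0.5 * (s_fake - s_teacher)^2 and its gradient w.r.t. mu_fake.

    The teacher score is treated as a constant target (the stop-gradient),
    so only the fake score contributes to the gradient.
    """
    loss, grad = 0.0, 0.0
    for x in xs:
        target = teacher_score(x, sigma=sigma)  # constant: no gradient flows here
        residual = fake_score(x, mu_fake, sigma) - target
        loss += 0.5 * residual**2
        grad += residual * (1.0 / sigma**2)  # d fake_score / d mu_fake = 1 / sigma^2
    n = len(xs)
    return loss / n, grad / n


def fit_fake_score(steps=50, lr=0.5, seed=0):
    """Plain gradient descent on the toy objective (assumed hyperparameters)."""
    rng = random.Random(seed)
    mu_fake, loss = -1.0, float("inf")
    for _ in range(steps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(64)]  # toy evaluation points
        loss, grad = fisher_loss_and_grad(xs, mu_fake)
        mu_fake -= lr * grad
    return mu_fake, loss


if __name__ == "__main__":
    mu, final_loss = fit_fake_score()
    print(f"learned mu_fake={mu:.6f}, last loss={final_loss:.3e}")
```

In this toy the residual is the same at every sample point, so the fake mean simply contracts towards the teacher mean at each step. In a real video model both scores would be neural networks evaluated on noised latents, and the stop-gradient would be applied by the autodiff framework.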
In practice
- Achieve roughly 3x faster training than DMD2 when distilling video models.
- Improve motion dynamics in 4-step distilled models.
- Preserve temporal consistency in distilled videos.
Topics
- Video Diffusion Models
- Model Distillation
- Score Gradient Matching
- Few-Step Inference
- Motion Dynamics
- Temporal Consistency
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.