Mode Seeking meets Mean Seeking for Fast Long Video Generation
Summary
A new training paradigm, "Mode Seeking meets Mean Seeking," addresses the challenge of generating minute-scale videos by decoupling local fidelity from long-term coherence. The approach uses a Decoupled Diffusion Transformer with two heads: a global Flow Matching head, supervised on long videos to learn narrative structure, and a local Distribution Matching head. The local head aligns sliding windows of the generated video to a frozen short-video teacher using a mode-seeking reverse-KL divergence, so the long-video model inherits the teacher's local realism. The model therefore learns long-range coherence and motion from limited long-video data while keeping the local realism that comes from abundant short-video data. The result is a fast, few-step long-video generator that effectively closes the fidelity-horizon gap by improving local sharpness, motion, and long-range consistency.
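To make the supervised Flow Matching objective concrete, here is a minimal sketch in plain Python. It assumes the common linear-interpolation path (noise at t = 0, data at t = 1) with a constant target velocity; the summary does not state which path or time convention the paper uses, so treat `flow_matching_pair`, `flow_matching_loss`, and the `predict_velocity` callback as hypothetical names for illustration only.

```python
import random

def flow_matching_pair(x0, x1, t):
    """Assumed linear path: x_t = (1 - t) * x0 + t * x1, target velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def flow_matching_loss(predict_velocity, x1, rng):
    """One stochastic loss sample for a data point x1 (a flat list of floats).

    predict_velocity(x_t, t) stands in for the global head of the model.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t, velocity = flow_matching_pair(x0, x1, t)
    return mse(predict_velocity(x_t, t), velocity)

# Toy usage: a "model" that always predicts zero velocity.
rng = random.Random(0)
loss = flow_matching_loss(lambda x_t, t: [0.0] * len(x_t), [1.0, -1.0], rng)
print(loss)
```

Because this is a squared-error regression, its minimizer is the conditional average of the target velocities, which is the sense in which this kind of objective is usually described as mean seeking.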
Key takeaway
For research scientists developing video generation models, this "Mode Seeking meets Mean Seeking" paradigm offers a robust solution to the fidelity-horizon gap. You should consider implementing a dual-head diffusion transformer architecture to leverage both limited long-form data for coherence and abundant short-form data for local realism, enabling faster and more consistent long video synthesis.
Key insights
Decoupling local fidelity and long-term coherence enables minute-scale video generation from limited long-form data.
Principles
- Use supervised flow matching on long videos for global coherence.
- Align local segments to a frozen teacher for realism.
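The contrast behind these two principles can be illustrated with a toy experiment that is not taken from the paper: fit a single Gaussian to a two-mode target, once by minimizing forward KL (used here as a stand-in for mean-seeking behavior) and once by minimizing reverse KL (mode seeking). The target mixture, the grid, and the brute-force search below are all assumptions chosen for illustration.

```python
import math

def gaussian(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

DX = 0.05
XS = [-8.0 + DX * i for i in range(321)]  # integration grid on [-8, 8]

# Bimodal target: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2).
P = [0.5 * gaussian(x, -3.0, 0.5) + 0.5 * gaussian(x, 3.0, 0.5) for x in XS]

def kl(a, b):
    """Riemann-sum approximation of KL(a || b) for densities sampled on XS."""
    total = 0.0
    for ai, bi in zip(a, b):
        if ai > 0.0:
            total += ai * math.log(ai / max(bi, 1e-300)) * DX
    return total

def fit(direction):
    """Grid-search the single Gaussian q that minimizes the chosen KL direction."""
    best = None
    for i in range(17):                 # mu in {-4.0, -3.5, ..., 4.0}
        mu = -4.0 + 0.5 * i
        for j in range(1, 9):           # sigma in {0.5, 1.0, ..., 4.0}
            sigma = 0.5 * j
            q = [gaussian(x, mu, sigma) for x in XS]
            score = kl(P, q) if direction == "forward" else kl(q, P)
            if best is None or score < best[0]:
                best = (score, mu, sigma)
    return best[1], best[2]

forward_fit = fit("forward")   # KL(target || q)
reverse_fit = fit("reverse")   # KL(q || target)
print("forward KL fit (mu, sigma):", forward_fit)
print("reverse KL fit (mu, sigma):", reverse_fit)
```

The forward fit should settle between the two modes with a wide spread, covering both at the cost of placing mass where the target has almost none. The reverse fit should collapse onto a single mode with a narrow spread. That second behavior is why a reverse-KL objective against a frozen teacher tends to favor sharp, realistic samples over blurry averages.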
Method
A Decoupled Diffusion Transformer uses a global Flow Matching head for narrative structure and a local Distribution Matching head with mode-seeking reverse-KL divergence to align sliding windows to a short-video teacher.
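As a rough sketch of the sliding-window step, the helper below enumerates frame ranges that a short-video teacher could score within a long generated clip. The window length, stride, and the choice to add a final window flush with the end of the clip are assumptions; the summary does not specify how the paper samples its windows, and `sliding_windows` is a hypothetical name.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame ranges, end exclusive, covering a long clip.

    Each range has exactly `window` frames so it matches the clip length
    that a short-video teacher expects.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if window > num_frames:
        raise ValueError("window is longer than the clip")
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        # Add one extra window flush with the end so no tail frames are skipped.
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]

# Toy usage: a 10-frame clip scored in 4-frame windows.
print(sliding_windows(10, 4, 3))
print(sliding_windows(10, 4, 4))
```

In a training loop of this shape, the local head's distribution-matching loss would be computed per window, while the global flow-matching loss would be computed on the full clip.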
In practice
- Generate minute-scale videos with high fidelity.
- Improve long-range consistency in video synthesis.
Topics
- Long Video Generation
- Diffusion Transformers
- Flow Matching
- Mode Seeking
- Video Synthesis
Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.