HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising
Summary
HiAR is a hierarchical denoising framework for efficient autoregressive long video generation. It targets two problems in existing methods: maintaining temporal continuity and avoiding quality degradation. Conventional approaches condition each block on highly denoised context. HiAR instead conditions on context at the same noise level as the current block, which is sufficient to maintain temporal consistency while mitigating error propagation. To do this, it reverses the usual sequential block-by-block generation order and performs causal generation across all blocks at each denoising step. This design enables pipelined parallel inference, giving a 1.8x wall-clock speedup in a 4-step setting. HiAR also adds a forward-KL regularizer, applied in bidirectional-attention mode, to counteract a low-motion shortcut that self-rollout distillation amplifies, thereby preserving motion diversity. On the VBench benchmark (20s generation), HiAR achieves the best overall score and the lowest temporal drift.
Key takeaway
For research scientists developing long video generation models, HiAR's hierarchical denoising offers two techniques worth adopting in autoregressive diffusion models: conditioning on same-noise-level context, which improves temporal consistency and mitigates error accumulation, and forward-KL regularization, which preserves motion diversity. Together they could lead to more efficient, higher-quality long video outputs.
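To make the motion-diversity argument concrete, here is a minimal numeric sketch of why a forward-KL term discourages collapse onto a single mode. It assumes the usual convention that forward KL means KL(reference || model). The two-outcome distributions ("low motion" vs "high motion") and the names `reference` and `collapsed` are hypothetical, and this summary does not specify the exact form of HiAR's regularizer.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy distributions over [low-motion, high-motion] outputs.
reference = [0.5, 0.5]    # assumed target: both kinds of motion are common
collapsed = [0.99, 0.01]  # assumed model that has fallen into the low-motion shortcut

forward = kl(reference, collapsed)  # KL(reference || model): mode-covering
reverse = kl(collapsed, reference)  # KL(model || reference): mode-seeking

if __name__ == "__main__":
    print(f"forward KL = {forward:.3f}")
    print(f"reverse KL = {reverse:.3f}")
```

In this toy case the forward direction assigns a larger penalty to the collapsed model than the reverse direction does, because it weights the log-ratio by the reference probabilities and so charges heavily for the neglected high-motion outcome.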
Key insights
Conditioning on same-noise-level context in autoregressive diffusion improves temporal consistency and mitigates error propagation.
Principles
- Highly clean context is unnecessary for temporal continuity.
- Same-noise-level conditioning provides sufficient temporal signal.
Method
HiAR reverses the conventional generation order: at each denoising step it performs causal generation across all blocks, with each block conditioned on context at the same noise level. A forward-KL regularizer preserves motion diversity.
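The reversed order described above can be sketched as a change in loop nesting. This is an illustration only: the function names and the `(block, step)` task representation are assumptions made for the example, and no denoising model is involved.

```python
from typing import List, Tuple

Task = Tuple[int, int]  # (block index, denoising step index)


def block_major_order(num_blocks: int, num_steps: int) -> List[Task]:
    """Conventional order: run every denoising step of a block before the next block."""
    return [(b, s) for b in range(num_blocks) for s in range(num_steps)]


def step_major_order(num_blocks: int, num_steps: int) -> List[Task]:
    """Reversed order: at each denoising step, sweep causally over all blocks."""
    return [(b, s) for s in range(num_steps) for b in range(num_blocks)]


if __name__ == "__main__":
    print(block_major_order(3, 2))
    print(step_major_order(3, 2))
```

In the step-major order, block `b` at step `s` only ever follows blocks `0..b-1` at the same step `s`, which is what lets the context sit at the same noise level as the block being denoised.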
In practice
- Achieves a 1.8x wall-clock speedup in a 4-step setting via pipelined parallel inference.
- Reduces temporal drift in long video generation.
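A toy scheduling model suggests why same-noise-level conditioning allows pipelining. It assumes each (block, step) task costs one tick, that workers are unlimited, and that a task needs its own previous step plus the preceding block's output, taken at the same step for same-level context or at the final step for clean context. These dependency rules and the name `earliest_ticks` are assumptions made for illustration. The model ignores communication and memory costs, so it does not reproduce the measured 1.8x figure.

```python
from typing import Dict, Tuple

Task = Tuple[int, int]  # (block index, denoising step index)


def earliest_ticks(num_blocks: int, num_steps: int,
                   same_level_context: bool) -> Dict[Task, int]:
    """Earliest tick at which each task can run under the assumed dependencies."""
    ticks: Dict[Task, int] = {}
    for b in range(num_blocks):
        for s in range(num_steps):
            deps = []
            if s > 0:
                deps.append((b, s - 1))  # the block's own previous denoising step
            if b > 0:
                # Context from the preceding block: same step, or fully denoised.
                ctx_step = s if same_level_context else num_steps - 1
                deps.append((b - 1, ctx_step))
            ticks[(b, s)] = 1 + max((ticks[d] for d in deps), default=-1)
    return ticks


def makespan(num_blocks: int, num_steps: int, same_level_context: bool) -> int:
    """Total ticks needed when every ready task runs as early as possible."""
    return max(earliest_ticks(num_blocks, num_steps, same_level_context).values()) + 1


if __name__ == "__main__":
    print(makespan(5, 4, same_level_context=False))  # clean-context chain
    print(makespan(5, 4, same_level_context=True))   # same-level wavefront
```

Under these assumptions, clean-context conditioning forms a single chain of `num_blocks * num_steps` ticks, while same-level conditioning forms a diagonal wavefront of `num_blocks + num_steps - 1` ticks. Real wall-clock gains depend on hardware and overheads that this sketch leaves out.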
Topics
- Autoregressive Diffusion
- Long Video Generation
- Hierarchical Denoising
- Temporal Consistency
- Motion Diversity
Best for: Research Scientist, AI Researcher, AI Scientist, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.