HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation) · Depth: Expert, quick

Summary

HiAR is a hierarchical denoising framework for efficient autoregressive long video generation. It addresses the loss of temporal continuity and the quality degradation seen in existing methods. Conventional approaches condition each block on highly denoised context; HiAR instead conditions on context at the same noise level as the current block, which is sufficient to maintain temporal consistency while mitigating error propagation. The framework also reverses the usual block-by-block generation order: at each denoising step it performs causal generation across all blocks. This design enables pipelined parallel inference, giving a 1.8x wall-clock speedup in a 4-step setting. HiAR further adds a forward-KL regularizer, applied in bidirectional-attention mode, to counteract a low-motion shortcut that self-rollout distillation amplifies, thereby preserving motion diversity. On the VBench benchmark (20s generation), HiAR achieves the best overall score and the lowest temporal drift.

Key takeaway

For research scientists developing long video generation models, HiAR's hierarchical denoising approach is worth evaluating. Consider adopting two of its ideas in your autoregressive diffusion models: conditioning on same-noise-level context, to improve temporal consistency and limit error accumulation, and forward-KL regularization, to preserve motion diversity. Together these could lead to more efficient and higher-quality long video outputs.

Key insights

Conditioning on same-noise-level context in autoregressive diffusion improves temporal consistency and mitigates error propagation.

Method

HiAR reverses the conventional generation order: at each denoising step it performs causal generation across all blocks, conditioning each block on context at the same noise level. A forward-KL regularizer preserves motion diversity.
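
The reversed order and the pipelining it permits can be illustrated with a toy scheduler. This is a minimal sketch, not HiAR's implementation: the function names are hypothetical, and it assumes that task (block b, step s) depends only on (b, s-1), the block's own previous denoising step, and on (b-1, s), the preceding block at the same noise level. Under that assumption, tasks with equal b + s are independent and can share a time slot.

```python
from collections import defaultdict


def sequential_order(num_blocks, num_steps):
    """Conventional order: run every denoising step of a block before the next block."""
    return [(b, s) for b in range(num_blocks) for s in range(num_steps)]


def hierarchical_order(num_blocks, num_steps):
    """Reversed order: at each denoising step, sweep causally over all blocks."""
    return [(b, s) for s in range(num_steps) for b in range(num_blocks)]


def pipelined_schedule(num_blocks, num_steps):
    """Group (block, step) tasks into time slots (ticks).

    Assumed dependencies (illustrative, not taken from the paper's code):
    (b, s) needs (b, s - 1) and (b - 1, s). Both have a smaller b + s,
    so every task on the anti-diagonal b + s == tick can run concurrently.
    """
    ticks = defaultdict(list)
    for b in range(num_blocks):
        for s in range(num_steps):
            ticks[b + s].append((b, s))
    return [ticks[t] for t in sorted(ticks)]


if __name__ == "__main__":
    blocks, steps = 5, 4
    schedule = pipelined_schedule(blocks, steps)
    print("sequential ticks:", len(sequential_order(blocks, steps)))
    print("pipelined ticks: ", len(schedule))
    for tick, tasks in enumerate(schedule):
        print(tick, tasks)
```

The tick count here is an idealized bound that ignores device limits and communication cost, so it is not meant to reproduce the reported 1.8x wall-clock figure.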

In practice
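
Why the forward direction of the KL divergence discourages a low-motion collapse can be checked on a toy example. The sketch below illustrates a generic property of forward KL, not HiAR's actual loss: the three motion bins, the `reference` distribution, and both candidate distributions are hypothetical values chosen for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical motion-magnitude bins: [low, medium, high].
reference = [0.30, 0.40, 0.30]   # distribution whose motion diversity we want to keep
collapsed = [0.98, 0.01, 0.01]   # model that took the low-motion shortcut
diverse = [0.40, 0.35, 0.25]     # model that still covers all motion bins

# Forward KL weights the log-ratio by the reference, so bins the model
# has almost abandoned (medium, high) contribute large penalties.
forward_collapsed = kl(reference, collapsed)
forward_diverse = kl(reference, diverse)

# Reverse KL weights by the model itself, so abandoned bins barely count.
reverse_collapsed = kl(collapsed, reference)

if __name__ == "__main__":
    print("forward KL, collapsed model:", round(forward_collapsed, 3))
    print("forward KL, diverse model:  ", round(forward_diverse, 3))
    print("reverse KL, collapsed model:", round(reverse_collapsed, 3))
```

In this toy setting the collapsed model incurs a much larger forward-KL penalty than the diverse one, and a larger penalty than it does under reverse KL, which is consistent with using a forward-KL term to preserve motion diversity.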

Best for: Research Scientist, AI Researcher, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.