Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Summary
Sparse Forcing is a training and inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. It builds on two empirical observations about attention in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, which act as an implicit spatiotemporal memory, and it follows a locally structured block-sparse pattern within sliding windows. The method introduces a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks, and it restricts local window computation to dynamically selected neighborhoods. To make the approach practical at scale, the authors develop Persistent Block-Sparse Attention (PBSA), a GPU kernel that accelerates sparse attention and memory updates. In experiments on 5-second text-to-video generation, Sparse Forcing raises the VBench score by 0.26 over Self-Forcing, with a 1.11x to 1.17x decoding speedup and a 42% lower peak KV-cache footprint. The gains grow with rollout length: +0.68 VBench with a 1.22x speedup for 20-second generation, and +2.74 VBench with a 1.27x speedup for 1-minute generation.
Key takeaway
For AI engineers and research scientists building real-time, long-form video generation systems, Sparse Forcing shows that trainable sparse attention combined with a persistent block memory can improve visual consistency and cut inference latency at the same time, with the largest reported gains at minute-level lengths. It targets both the computational cost and the quality degradation that appear when autoregressive diffusion models are scaled to long rollouts.
Key insights
Sparse Forcing improves long-horizon video generation quality and efficiency by leveraging trainable sparse attention and persistent memory.
Principles
- Attention in video diffusion exhibits persistent, clustered, and local block-sparse patterns.
- Structured sparse conditioning can control error propagation and improve generation quality.
- A dynamically updated cache and adaptive local attention mitigate train-test mismatch.
Method
Sparse Forcing maintains a bounded KV memory made up of persistent spatiotemporal blocks and a streaming local window. Blockified compression and coarse scoring drive Top-C updates of the persistent set, while row-wise Top-K block selection keeps attention within the local window sparse.
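The two selection steps named above can be illustrated with a small sketch. This is not the paper's implementation or the PBSA kernel: the function names, the dot-product scoring, and the choice to aggregate scores by summing over query rows for the Top-C step are all assumptions made for illustration. It takes per-block summary vectors as input and returns which key blocks each query block would attend to (Top-K) and which blocks would be kept in the persistent set (Top-C).

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def coarse_scores(query_blocks: Sequence[Vector],
                  key_blocks: Sequence[Vector]) -> List[List[float]]:
    """scores[i][j] is the dot product of query-block summary i and key-block summary j."""
    return [[dot(q, k) for k in key_blocks] for q in query_blocks]


def top_indices(values: Sequence[float], n: int) -> List[int]:
    """Indices of the n largest values (ties go to the lower index), in ascending order."""
    ranked = sorted(range(len(values)), key=lambda i: (-values[i], i))
    return sorted(ranked[:n])


def select_local_blocks(scores: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Row-wise Top-K: each query block keeps its k highest-scoring key blocks."""
    return [top_indices(row, k) for row in scores]


def select_persistent_blocks(scores: Sequence[Sequence[float]], c: int) -> List[int]:
    """Top-C: keep the c key blocks with the highest total score.

    Summing over query rows is an assumed aggregation, chosen for simplicity.
    """
    totals = [sum(column) for column in zip(*scores)]
    return top_indices(totals, c)


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    keys = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, -1.0]]
    scores = coarse_scores(queries, keys)
    print(select_local_blocks(scores, k=2))       # [[0, 2], [1, 2]]
    print(select_persistent_blocks(scores, c=2))  # [0, 1]
```

Because selection happens on block summaries, the scoring cost scales with the number of blocks and not with the number of tokens; full attention is then computed only over the selected blocks.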
In practice
- Use the Persistent Block-Sparse Attention (PBSA) kernel for efficient sparse attention.
- Apply average pooling for block compression to reduce sequence length.
- Train with a dynamically updated cache and adaptive local attention to stabilize rollouts.
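The average-pooling step in the list above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `average_pool_blocks`, the token layout (a list of equal-length feature vectors), and the handling of a shorter trailing block are assumptions.

```python
from typing import List, Sequence


def average_pool_blocks(tokens: Sequence[Sequence[float]],
                        block_size: int) -> List[List[float]]:
    """Compress a token sequence into one mean vector per block of block_size tokens.

    A trailing block with fewer than block_size tokens is averaged over the
    tokens it actually contains (an assumed convention).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    pooled: List[List[float]] = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        dim = len(block[0])
        pooled.append([sum(tok[d] for tok in block) / len(block) for d in range(dim)])
    return pooled


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    # Five tokens with block_size=2 yield three block summaries.
    print(average_pool_blocks(tokens, block_size=2))
    # [[2.0, 3.0], [6.0, 7.0], [9.0, 10.0]]
```

With a block size of B, the sequence that coarse scoring has to scan shrinks by roughly a factor of B, which is what makes block-level selection cheap compared with token-level attention.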
Topics
- Autoregressive Video Generation
- Sparse Attention
- Persistent Block-Sparse Attention
- Diffusion Models
- KV Cache Optimization
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.