Accelerating Text-to-Video Generation with Calibrated Sparse Attention
Summary
CalibAtt is a training-free method that accelerates text-to-video generation with calibrated sparse attention. Diffusion models can generate high-quality video, but they suffer from slow runtimes caused by large transformer-based backbones and spatiotemporal attention bottlenecks. CalibAtt identifies and skips token-to-token connections that consistently yield negligible attention scores, a pattern that often repeats across queries and local token blocks. An offline calibration pass pinpoints stable block-level sparsity and repetition patterns, which are then compiled into optimized attention operations for each layer, head, and diffusion timestep. At inference, CalibAtt computes only the selected input-dependent connections densely and skips the rest in a hardware-efficient manner. The approach achieves up to 1.58x end-to-end speedup on models such as Wan 2.1 14B and Mochi 1, outperforming existing training-free methods while preserving video generation quality and text-video alignment.
Key takeaway
For computer vision engineers optimizing text-to-video generation pipelines, CalibAtt offers a speedup without model retraining. Consider integrating this training-free method to get up to 1.58x faster end-to-end inference on models such as Wan 2.1 14B and Mochi 1 while maintaining output quality.
Key insights
CalibAtt accelerates text-to-video generation by skipping consistently negligible attention computations via calibrated sparse attention.
Principles
- Negligible attention scores are stable across inputs.
- Sparsity patterns repeat across queries and local blocks.
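As a rough illustration of the first principle, the sketch below estimates which attention blocks stay negligible across several calibration inputs. It is a toy under stated assumptions, not CalibAtt's actual calibration procedure: the function names, the mean-mass statistic, the `threshold` parameter, and the "keep if significant in any sample" rule are all hypothetical, and token counts are assumed to be divisible by `block`.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_probs(q, k):
    """Dense softmax(Q K^T / sqrt(d)) for lists of token vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
            for qi in q]

def block_mass(probs, block):
    """Mean attention mass that each query block sends to each key block."""
    nq, nk = len(probs) // block, len(probs[0]) // block
    mass = [[0.0] * nk for _ in range(nq)]
    for i, row in enumerate(probs):
        for j, p in enumerate(row):
            mass[i // block][j // block] += p / block
    return mass

def calibrate_block_mask(samples, block, threshold):
    """Hypothetical rule: keep a block if its mass exceeds `threshold` in ANY
    calibration sample, so only blocks negligible in every sample are skipped."""
    keep = None
    for q, k in samples:
        mass = block_mass(attention_probs(q, k), block)
        hit = [[m > threshold for m in row] for row in mass]
        if keep is None:
            keep = hit
        else:
            keep = [[a or b for a, b in zip(r1, r2)] for r1, r2 in zip(keep, hit)]
    return keep
```

Calling `calibrate_block_mask(samples, block=2, threshold=0.01)` on a list of `(q, k)` pairs returns a boolean grid with one entry per (query block, key block) pair. A real pipeline would collect such statistics separately for each layer, head, and diffusion timestep, as the summary describes.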
Method
CalibAtt performs an offline calibration to identify stable block-level sparsity and repetition patterns, compiling them into optimized attention operations for inference.
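To make the inference side concrete, here is a minimal sketch of attention that visits only the key blocks a precomputed mask marks as kept. It assumes a boolean grid `keep[i][j]` over (query block, key block) pairs is already available from calibration. The function name and mask layout are illustrative assumptions, the sketch does not model the repetition patterns or compiled kernels mentioned above, and a practical implementation would run as a fused GPU kernel rather than Python loops.

```python
import math

def block_sparse_attention(q, k, v, keep, block):
    """Attention where query block i only visits key blocks j with keep[i][j] True.

    Assumes every query block keeps at least one key block.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Token indices belonging to the key blocks kept for this query's block.
        cols = [j for j in range(len(k)) if keep[i // block][j // block]]
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in cols]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Softmax-weighted sum of values over the kept keys only.
        out.append([sum(w * v[j][c] for w, j in zip(weights, cols)) / total
                    for c in range(len(v[0]))])
    return out
```

With an all-True mask this reduces to ordinary dense attention. When the skipped blocks really do carry negligible attention mass, the sparse output stays close to the dense one while the work scales with the number of kept blocks.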
In practice
- Apply CalibAtt to existing diffusion models.
- Use it for faster video generation inference.
Topics
- Text-to-Video Generation
- Sparse Attention
- Diffusion Models
- Model Acceleration
- Spatiotemporal Attention
Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.