Training-free sparse attention based on cumulative energy filtering
Summary
A new training-free sparse attention method, based on cumulative energy filtering, accelerates Diffusion Transformers (DiTs) for video generation. It targets a dual objective that existing algorithms such as Top-p and Top-k do not fully meet: maximizing sparsity while minimizing accuracy degradation. The proposed dynamic thresholding scheme holds the recall rate fixed to preserve accuracy while significantly improving sparsity. It is integrated directly into Flash Attention (FA), which eliminates the overhead of a separate masking computation. Experiments on Wan 2.2 show that the strategy raises sparsity from the 61.42% achieved by BLASST to 82% with a VBench metric drop of less than 5%. According to the reported results, this corresponds to an approximately 15% reduction in attention computation and a 1.61x increase in computational efficiency, outperforming BLASST by 1.18x.
Key takeaway
If you are a Machine Learning Engineer optimizing Diffusion Transformers for video generation, your current sparse attention strategy may not balance computational efficiency and output quality as well as it could. Consider a dynamic thresholding approach like the one demonstrated here: it reaches up to 82% sparsity and a 1.61x gain in computational efficiency with less than a 5% VBench drop, especially when integrated with Flash Attention, and reduces attention computation by approximately 15%.
Key insights
Dynamic thresholding for sparse attention simultaneously optimizes sparsity and accuracy in Diffusion Transformers for video generation.
Principles
- Holding the recall rate fixed preserves accuracy while tokens are filtered (less than 5% VBench drop in the reported experiments).
- Dynamic thresholding schemes improve sparsity more effectively than fixed thresholds.
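The contrast between a fixed and a dynamic threshold can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration, not the paper's procedure: the synthetic scores, the fixed cut-off `fixed_tau`, and the choice to derive the per-row threshold from a cumulative-mass target `target`.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
base = rng.normal(size=(1, 256))
# Four synthetic query rows over 256 keys, from flat to sharply peaked.
temperatures = np.array([0.5, 1.0, 2.0, 4.0])[:, None]
probs = softmax(base * temperatures)

# Fixed threshold (hypothetical value): keep keys above the uniform probability.
fixed_tau = 1.0 / 256
fixed_mask = probs >= fixed_tau
fixed_recall = (probs * fixed_mask).sum(axis=-1)    # varies from row to row

# Dynamic threshold: per row, the probability of the key at which the
# cumulative mass (sorted in descending order) first reaches the target.
target = 0.9
sorted_probs = -np.sort(-probs, axis=-1)
cumulative = np.cumsum(sorted_probs, axis=-1)
cut = (cumulative >= target).argmax(axis=-1)
dynamic_tau = sorted_probs[np.arange(probs.shape[0]), cut]
dynamic_mask = probs >= dynamic_tau[:, None]
dynamic_recall = (probs * dynamic_mask).sum(axis=-1)  # at least `target` in every row

for name, recall, mask in [("fixed", fixed_recall, fixed_mask),
                           ("dynamic", dynamic_recall, dynamic_mask)]:
    print(name, "recall:", np.round(recall, 3),
          "sparsity:", np.round(1.0 - mask.mean(axis=-1), 3))
```

In this sketch the fixed cut-off lets recall drift with how peaked each row is, while the dynamic cut-off adapts per row so that recall stays at or above the target and sparsity is whatever that target allows.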
Method
Formulate token filtering as a dual-goal optimization problem to maximize sparsity and minimize accuracy degradation. Implement a dynamic thresholding scheme for token selection, integrated with Flash Attention to avoid masking overhead.
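The summary does not spell out the filtering algorithm, so the following is only one plausible reading of cumulative energy filtering: for each query, keep the fewest keys whose softmax mass reaches a recall target. The function name `cumulative_energy_mask`, the `recall` parameter, and the token-level (rather than block-level) granularity are assumptions. This form resembles nucleus-style selection, and how the original method differs from Top-p is not specified here.

```python
import numpy as np

def cumulative_energy_mask(scores, recall=0.95):
    """Per query row, keep the fewest keys whose softmax mass reaches `recall`.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Numerically stable softmax over the key axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Sort each row by descending probability and accumulate the mass.
    order = np.argsort(-probs, axis=-1)
    sorted_probs = np.take_along_axis(probs, order, axis=-1)
    cumulative = np.cumsum(sorted_probs, axis=-1)

    # Keep a key if the mass accumulated before it is still below the target,
    # so the key that crosses the target is included.
    keep_sorted = (cumulative - sorted_probs) < recall

    # Scatter the keep decisions back to the original key order.
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask, probs

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 64)) * 4.0        # synthetic attention logits
mask, probs = cumulative_energy_mask(scores, recall=0.95)
sparsity = 1.0 - mask.mean()                   # fraction of keys dropped
achieved = (probs * mask).sum(axis=-1)         # retained softmax mass per row
print("sparsity:", round(float(sparsity), 3))
print("achieved recall per row:", np.round(achieved, 3))
```

Sorting every row is affordable in a sketch like this but would be costly inside a fused kernel, which is presumably why the source emphasizes a thresholding scheme that avoids a separate masking pass.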
In practice
- Implement dynamic thresholding for sparse attention in Diffusion Transformers.
- Integrate sparse attention directly with Flash Attention for efficiency gains.
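To show what deciding sparsity inside the attention loop can look like, here is a minimal NumPy sketch of Flash-Attention-style tiled attention with online softmax, in which a key block is skipped using only the running maximum the loop already maintains. The skip rule (a fixed `skip_margin` on the logits), the single query tile, and all function names are assumptions chosen for simplicity; the dynamic threshold described in this summary would replace the fixed margin, and a real kernel would run on GPU tiles rather than NumPy arrays.

```python
import numpy as np

def dense_attention(q, k, v):
    # Reference softmax attention for comparison.
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=16, skip_margin=np.inf):
    """Online-softmax attention over key blocks with in-loop block skipping."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    skipped, total = 0, 0
    for start in range(0, k.shape[0], block):
        total += 1
        s = (q @ k[start:start + block].T) * scale
        block_max = s.max(axis=1)
        # Skip when, for every query row, the block's best logit trails the
        # running maximum by more than the margin: each of its weights is
        # then at most exp(-skip_margin) relative to the largest one.
        if np.all(block_max < running_max - skip_margin):
            skipped += 1
            continue
        new_max = np.maximum(running_max, block_max)
        correction = np.exp(running_max - new_max)   # rescale earlier partials
        p = np.exp(s - new_max[:, None])
        running_sum = running_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v[start:start + block]
        running_max = new_max
    return out / running_sum[:, None], skipped / total

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 32)) * 8.0     # scaled to make the logits peaked
k = rng.normal(size=(128, 32))
v = rng.normal(size=(128, 16))

exact, skipped_exact = tiled_attention(q, k, v)                  # no skipping
approx, skipped_frac = tiled_attention(q, k, v, skip_margin=12.0)
print("fraction of key blocks skipped:", skipped_frac)
print("max abs error vs dense:", np.max(np.abs(approx - dense_attention(q, k, v))))
```

Because the skip test reuses quantities the online-softmax loop computes anyway, no separate mask tensor is built or read, which is the kind of overhead the integration with Flash Attention is said to avoid.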
Topics
- Diffusion Transformers
- Sparse Attention
- Video Generation
- Flash Attention
- Computational Efficiency
- Dynamic Thresholding
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.