Training-free sparse attention based on cumulative energy filtering

2026-06-15 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

A new training-free sparse attention method based on cumulative energy filtering accelerates Diffusion Transformers (DiTs) for video generation. It targets a dual goal that existing token-selection algorithms such as Top-p and Top-k do not fully meet: maximizing sparsity while minimizing accuracy degradation. The proposed dynamic thresholding scheme holds the recall rate fixed to preserve accuracy while substantially increasing sparsity. Because the filtering is integrated directly into Flash Attention (FA), it adds no separate masking computation. On Wan 2.2, the strategy raises sparsity from 61.42% (BLASST) to 82% with a VBench drop of less than 5%. This translates to an approximately 15% reduction in attention computation and a 1.61x gain in computational efficiency, a 1.18x improvement over BLASST.
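The cumulative-energy criterion can be illustrated with a minimal NumPy sketch, assuming that "recall" means the fraction of each query's softmax mass retained by the kept keys. For every query row it keeps the fewest keys whose probabilities sum to at least the target and reports the resulting sparsity. The function name `cumulative_energy_mask` and the `recall` parameter are hypothetical, and the explicit sort is used only for clarity; this is not the authors' kernel, which according to the summary runs inside Flash Attention without separate masking work.

```python
import numpy as np

def cumulative_energy_mask(scores: np.ndarray, recall: float = 0.95) -> np.ndarray:
    """Hypothetical helper. `scores` has shape (queries, keys). Returns a boolean
    mask that keeps, per row, the fewest keys whose softmax mass is >= recall."""
    # Row-wise softmax, subtracting the row max for numerical stability.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Sort each row by descending probability and accumulate the "energy".
    order = np.argsort(-probs, axis=-1)
    cum = np.cumsum(np.take_along_axis(probs, order, axis=-1), axis=-1)
    # Number of keys needed to reach the recall target (at least one, at most all).
    k = np.minimum((cum < recall).sum(axis=-1) + 1, scores.shape[-1])
    # Keep the first k sorted positions, then scatter back to original key order.
    keep_sorted = np.arange(scores.shape[-1]) < k[:, None]
    mask = np.zeros_like(keep_sorted)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

# Toy attention logits: 4 queries x 256 keys.
rng = np.random.default_rng(0)
scores = 3.0 * rng.normal(size=(4, 256))
mask = cumulative_energy_mask(scores, recall=0.95)
print("sparsity per query:", 1.0 - mask.mean(axis=-1))
```

Rows whose attention is concentrated on a few keys reach the target with far fewer keys, so sparsity varies per query even though recall is held fixed.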

Key takeaway

For Machine Learning Engineers optimizing Diffusion Transformers for video generation, your current sparse attention strategy may not be striking the best balance between computational efficiency and output quality. Consider a dynamic thresholding approach like the one demonstrated here: on Wan 2.2 it reached 82% sparsity and a 1.61x gain in computational efficiency with a VBench drop of less than 5%, reducing attention computation by approximately 15%. It is especially relevant if you already use Flash Attention.

Key insights

Dynamic thresholding for sparse attention simultaneously optimizes sparsity and accuracy in Diffusion Transformers for video generation.

Principles

Maintaining a fixed recall rate is sufficient for ensuring accuracy in sparse attention.
Dynamic thresholding schemes improve sparsity more effectively than fixed thresholds.
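To see why a fixed threshold and a fixed recall rate behave differently, the hypothetical sketch below compares a fixed probability cutoff `tau` (a simple stand-in for a fixed-threshold rule, not BLASST's exact criterion) with a per-row threshold derived from a recall target. It uses two toy query rows, one peaked and one flat; the data, function names, and parameter values are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def recall_and_sparsity(probs, mask):
    """Per-row retained softmax mass (recall) and fraction of keys dropped."""
    return (probs * mask).sum(axis=-1), 1.0 - mask.mean(axis=-1)

def fixed_threshold_mask(probs, tau):
    # Fixed rule: keep every key whose probability is at least tau, in every row.
    return probs >= tau

def dynamic_threshold_mask(probs, recall_target):
    # Per-row rule: the threshold is the probability of the last key needed
    # for the sorted cumulative mass to reach recall_target.
    sorted_desc = -np.sort(-probs, axis=-1)
    cum = np.cumsum(sorted_desc, axis=-1)
    idx = np.minimum((cum < recall_target).sum(axis=-1), probs.shape[-1] - 1)
    thresh = np.take_along_axis(sorted_desc, idx[:, None], axis=-1)
    return probs >= thresh

# Two toy query rows: one peaked (few dominant keys), one flat (mass spread out).
rng = np.random.default_rng(0)
n = 256
probs = softmax(np.stack([4.0 * rng.normal(size=n), 0.5 * rng.normal(size=n)]))

for name, mask in [("fixed tau=1e-3", fixed_threshold_mask(probs, 1e-3)),
                   ("dynamic recall=0.95", dynamic_threshold_mask(probs, 0.95))]:
    recall, sparsity = recall_and_sparsity(probs, mask)
    print(f"{name:20s} recall={np.round(recall, 3)} sparsity={np.round(sparsity, 3)}")
```

On the flat row the fixed cutoff retains nearly every key, whereas the per-row threshold gives up a small, bounded share of mass in exchange for much higher sparsity; on both rows the dynamic rule keeps recall at or above the target.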

Method

Formulate token filtering as a dual-goal optimization problem: maximize sparsity while minimizing accuracy degradation. Select tokens with a dynamic thresholding scheme integrated into Flash Attention, so no separate attention mask has to be computed.
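One way to write the dual-goal problem down, assuming recall is measured as retained softmax mass with a target $\rho$ (the notation $p_{ij}$, $S_i$, $\rho$, $n_i^\star$, $\theta_i$ is ours, not the paper's), is as a per-query cardinality minimization under a mass constraint. Because the $n$ largest probabilities carry the most mass of any $n$-key subset, the optimum is a sorted prefix, which is equivalent to a row-specific threshold $\theta_i$:

```latex
\begin{aligned}
p_{ij} &= \frac{\exp\!\left(q_i^\top k_j/\sqrt{d}\right)}{\sum_{l=1}^{N}\exp\!\left(q_i^\top k_l/\sqrt{d}\right)}, \qquad j = 1,\dots,N,\\[4pt]
S_i^\star &= \arg\max_{S_i \subseteq \{1,\dots,N\}} \left(1 - \frac{|S_i|}{N}\right)
\quad \text{s.t.} \quad \sum_{j \in S_i} p_{ij} \ge \rho,\\[4pt]
n_i^\star &= \min\Big\{ n : \sum_{m=1}^{n} p_{i(m)} \ge \rho \Big\}, \qquad p_{i(1)} \ge p_{i(2)} \ge \dots \ge p_{i(N)},\\[4pt]
\theta_i &= p_{i(n_i^\star)}, \qquad S_i^\star = \{\, j : p_{ij} \ge \theta_i \,\}.
\end{aligned}
```

Since $\theta_i$ depends on each query's own distribution, the threshold is dynamic by construction, while the constraint holds recall at $\rho$ for every query (up to ties at $\theta_i$, which can only add keys).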

In practice

Implement dynamic thresholding for sparse attention in Diffusion Transformers.
Integrate sparse attention directly with Flash Attention for efficiency gains.
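For the Flash Attention integration, the following NumPy sketch mimics a FlashAttention-style online softmax for a single query and skips any key block whose maximum logit falls below a per-query threshold, so no mask tensor is ever materialized. It is an assumption-laden illustration, not the authors' kernel: the block size, the function names, and the `oracle_logit_threshold` helper (which cheats by scoring every key to derive the exact recall-targeted threshold) are all hypothetical. Scores for each block are still computed; only the exponentials and the P·V update are skipped.

```python
import numpy as np

def oracle_logit_threshold(q, K, recall):
    """Hypothetical helper: exact per-query logit threshold that keeps the
    fewest keys whose softmax mass reaches `recall`. It scores every key,
    so it is for illustration only."""
    s = K @ q / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    p /= p.sum()
    order = np.argsort(-p)
    n = min(int((np.cumsum(p[order]) < recall).sum()), len(p) - 1)
    return s[order[n]]

def blocked_sparse_attention(q, K, V, logit_threshold, block=64):
    """Single-query attention with a FlashAttention-style online softmax.
    Key blocks whose max logit is below `logit_threshold` are skipped,
    so no mask tensor is built. Returns (output, number_of_skipped_blocks)."""
    d = q.shape[-1]
    m = -np.inf                    # running max logit
    l = 0.0                        # running softmax denominator
    acc = np.zeros(V.shape[-1])    # running unnormalized output
    skipped = 0
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)   # block scores (still computed)
        if s.max() < logit_threshold:
            skipped += 1           # skip exp() and the P.V update for this block
            continue
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescale earlier partial sums to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    if l == 0.0:
        raise ValueError("threshold skipped every block")
    return acc / l, skipped

# Toy data: one block of keys is aligned with the query, so attention is block-sparse.
rng = np.random.default_rng(0)
N, d = 1024, 64
q = rng.normal(size=d)
K = 0.3 * rng.normal(size=(N, d))
K[128:192] += 2.0 * q
V = rng.normal(size=(N, 16))

out, skipped = blocked_sparse_attention(q, K, V, oracle_logit_threshold(q, K, recall=0.99))
print("skipped blocks:", skipped, "of", N // 64)
```

A skipped block contains only keys below the threshold, so the kept set is a superset of the cumulative-energy selection and retained mass never falls below the recall target; a practical kernel would need a cheaper way to estimate the threshold than `oracle_logit_threshold`.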

Topics

Diffusion Transformers
Sparse Attention
Video Generation
Flash Attention
Computational Efficiency
Dynamic Thresholding

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.