Dispatch-Aware Ragged Attention for Pruned Vision Transformers
Summary
A new Triton-based attention kernel, Dispatch-Aware Ragged Attention, addresses a dispatch-overhead bottleneck in Vision Transformers (ViTs) that prevents token pruning methods from achieving their expected wall-clock speedups. Existing variable-length attention APIs such as FlashAttention-2's varlen and PyTorch's NestedTensor SDPA incur significant host-side dispatch latency (60–90 µs) at the short sequence lengths typical of pruned ViTs (≤197 tokens), which dwarfs the sub-microsecond matrix arithmetic. The proposed bidirectional Triton kernel lowers this dispatch floor to about 40 µs, so that pruning savings show up in wall-clock time. Integrated into a complete pack–attend–unpack pipeline, the system achieves up to 2.24× end-to-end throughput over padded PyTorch SDPA, preserves classification predictions exactly (maximum absolute logit difference below 0.007), and scales consistently across DeiT-T/S/B models and four pruning algorithms (Threshold-ℓ₂, DynamicViT, EViT, ATS).
Key takeaway
Computer vision engineers optimizing ViT inference with token pruning should evaluate speedups against variable-length attention kernels, not just padded baselines. The dispatch overhead of standard APIs can negate pruning benefits, so a specialized kernel such as the Triton-based one described here is critical for realizing actual throughput gains, especially at small batch sizes, where it delivers 5-8% higher throughput than FlashAttention-2 varlen.
Key insights
Dispatch overhead, not compute, bottlenecks pruned ViT attention at short sequence lengths.
Principles
- Short-sequence attention is dispatch-overhead bound.
- Padding negates token pruning FLOP savings.
- Specialized kernels can reduce dispatch overhead.
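The second principle can be made concrete with a little arithmetic. The sketch below is an illustration only, not taken from the paper: it counts just the two attention matmuls (roughly 4·n²·d floating-point operations per head for a sequence of n tokens and head dimension d), ignores softmax and projections, and uses hypothetical per-image token counts. It compares a batch padded to its longest sequence with the same batch processed at its true (ragged) lengths.

```python
def attention_flops(n_tokens: int, head_dim: int) -> int:
    """Approximate FLOPs of one attention head: QK^T and AV, 2*n*n*d each."""
    return 4 * n_tokens * n_tokens * head_dim


def padded_flops(lengths: list[int], head_dim: int) -> int:
    """Every sequence is padded to the longest one in the batch."""
    n_max = max(lengths)
    return len(lengths) * attention_flops(n_max, head_dim)


def ragged_flops(lengths: list[int], head_dim: int) -> int:
    """Every sequence is processed at its own post-pruning length."""
    return sum(attention_flops(n, head_dim) for n in lengths)


# Hypothetical kept-token counts after pruning; 197 is the unpruned ViT length.
lengths = [197, 120, 98, 60]
head_dim = 64  # assumed head dimension

padded = padded_flops(lengths, head_dim)
ragged = ragged_flops(lengths, head_dim)
print(f"ragged / padded attention FLOPs: {ragged / padded:.2f}")
```

Because attention cost grows with the square of the sequence length, a single unpruned image in the batch forces every padded sequence to pay the full price, which is why padding hides most of the pruning savings.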
Method
A two-phase fused token packer (PyTorch GPU ops + Triton kernel) feeds a specialized bidirectional Triton attention kernel, integrated into a pack–attend–unpack pipeline for ViT inference.
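The bookkeeping behind a pack–attend–unpack pipeline can be sketched in plain Python. This is a minimal illustration of the data layout, not the authors' packer or Triton kernel: the function names are hypothetical, lists stand in for GPU tensors, and `attend_ragged` takes an arbitrary per-sequence function in place of real attention. The cumulative-offset array follows the `cu_seqlens` convention used by FlashAttention-2's varlen interface.

```python
from itertools import accumulate


def pack(tokens, keep):
    """Drop pruned tokens and concatenate survivors into one flat sequence.

    tokens: one list of tokens per image; keep: parallel lists of booleans.
    Returns the flat list plus cumulative offsets marking sequence boundaries.
    """
    lengths = [sum(mask) for mask in keep]
    cu_seqlens = [0] + list(accumulate(lengths))
    packed = [t for img, mask in zip(tokens, keep)
              for t, kept in zip(img, mask) if kept]
    return packed, cu_seqlens


def attend_ragged(packed, cu_seqlens, fn):
    """Apply fn to each sequence slice independently (stand-in for attention)."""
    out = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        out.extend(fn(packed[start:end]))
    return out


def unpack(packed, cu_seqlens, keep, fill=None):
    """Scatter packed results back to original positions; pruned slots get fill."""
    result = []
    for i, mask in enumerate(keep):
        survivors = iter(packed[cu_seqlens[i]:cu_seqlens[i + 1]])
        result.append([next(survivors) if kept else fill for kept in mask])
    return result


# Two images with four tokens each; hypothetical keep decisions from a pruner.
tokens = [[1, 2, 3, 4], [5, 6, 7, 8]]
keep = [[True, False, True, True], [False, True, False, False]]

packed, cu_seqlens = pack(tokens, keep)
# Toy "attention": every token in a sequence sees the sum of its own sequence.
mixed = attend_ragged(packed, cu_seqlens, lambda seq: [sum(seq)] * len(seq))
restored = unpack(mixed, cu_seqlens, keep)
```

The point of the layout is that no padded positions are ever materialized: the attention step touches only surviving tokens, and sequence boundaries are carried by the offsets alone.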
In practice
- Benchmark pruning against variable-length kernels.
- Decompose attention vs. MLP latency contributions.
- Consider Triton for short-sequence GPU kernels.
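The first two recommendations come down to careful timing. The harness below is a generic sketch under stated assumptions, not the benchmark used in the paper: the function names are hypothetical, it measures host-side wall-clock time with the standard library, and the stage callables are placeholders. When timing GPU work, each callable would need to synchronize the device before returning (for example with `torch.cuda.synchronize()`), otherwise only the launch is measured.

```python
import statistics
import time


def median_latency_us(fn, warmup=10, iters=100):
    """Median wall-clock latency of fn() in microseconds, after warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


def latency_shares(stages, warmup=10, iters=100):
    """Time each named stage separately and report its share of the total.

    stages: dict mapping a stage name (e.g. "attention", "mlp") to a callable.
    Returns (shares, medians); shares sum to 1 when the total is non-zero.
    """
    medians = {name: median_latency_us(fn, warmup, iters)
               for name, fn in stages.items()}
    total = sum(medians.values())
    if total == 0:
        return {name: 0.0 for name in medians}, medians
    return {name: m / total for name, m in medians.items()}, medians


# Placeholder workloads standing in for real attention and MLP forward passes.
stages = {
    "attention": lambda: sum(range(2_000)),
    "mlp": lambda: sum(range(6_000)),
}
shares, medians = latency_shares(stages, warmup=5, iters=50)
for name in stages:
    print(f"{name}: {medians[name]:.1f} us ({shares[name]:.0%} of total)")
```

Running the same harness once with a padded attention call and once with a variable-length call, at the batch sizes of interest, shows whether a pruning speedup survives the dispatch floor or only appears in FLOP counts.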
Topics
- Vision Transformers
- Token Pruning
- Dispatch Overhead
- Triton Kernel
- FlashAttention-2
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.