Dispatch-Aware Ragged Attention for Pruned Vision Transformers
Summary
Token pruning for Vision Transformers (ViTs) removes uninformative patches, and because attention cost grows quadratically with token count, it cuts attention FLOPs quadratically. However, current variable-length attention APIs, such as FlashAttention-2's varlen interface and PyTorch's NestedTensor SDPA, do not translate these FLOP reductions into proportional wall-clock latency improvements. The discrepancy stems from a dispatch-overhead bottleneck: for short post-pruning sequences (<=197 tokens), host-side dispatch paths consume 60-90 microseconds, while the matrix arithmetic itself completes in single-digit microseconds. The work introduces a lightweight, bidirectional Triton attention kernel that lowers the dispatch floor to 40 microseconds, approximately 1.5x lower than FlashAttention-2 varlen. Integrated into a pack-attend-unpack pipeline, the kernel achieves up to 2.24x end-to-end throughput over padded PyTorch SDPA across four pruning algorithms (Threshold-L2, DynamicViT, EViT, ATS) while keeping classification predictions bit-exact (identical predicted labels), with a maximum absolute logit difference below 0.007.
Key takeaway
For computer vision engineers optimizing pruned Vision Transformers, the dispatch overhead of attention kernels is a critical bottleneck, not just FLOPs. To turn token pruning into actual wall-clock latency and throughput gains, evaluate low-overhead custom kernels such as the proposed Triton attention kernel, especially for models like DeiT-T/S/B.
Key insights
Dispatch overhead, not FLOPs, bottlenecks pruned Vision Transformer attention latency at short sequence lengths.
Principles
- Wall-clock latency is not always proportional to FLOPs reduction.
- Host-side dispatch can dominate kernel execution time.
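A rough back-of-the-envelope model can make these two principles concrete. The sketch below is illustrative only: it assumes DeiT-S-like dimensions (6 heads, head dimension 64), a hypothetical effective throughput of 10 TFLOP/s, a hypothetical `batch` parameter, and a deliberately simplified latency model in which one attention call costs the larger of the dispatch floor and the compute time. Only the 197-token length and the 60 and 40 microsecond floors come from the summary above.

```python
def attention_flops(n_tokens, head_dim=64, n_heads=6, batch=1):
    """FLOPs for one attention layer: QK^T plus AV, each 2 * n^2 * d per head."""
    return 4 * n_tokens ** 2 * head_dim * n_heads * batch


def modeled_latency_us(n_tokens, dispatch_floor_us, batch=1, effective_tflops=10.0):
    """Simplified model (an assumption, not a measurement):
    latency is the larger of the dispatch floor and the compute time."""
    flops = attention_flops(n_tokens, batch=batch)
    compute_us = flops / (effective_tflops * 1e12) * 1e6
    return max(dispatch_floor_us, compute_us)


if __name__ == "__main__":
    full, pruned = 197, 98  # 197 tokens unpruned; roughly half kept (illustrative)
    flop_ratio = attention_flops(full) / attention_flops(pruned)
    for batch in (1, 32):
        for floor in (60.0, 40.0):
            speedup = (modeled_latency_us(full, floor, batch)
                       / modeled_latency_us(pruned, floor, batch))
            print(f"batch={batch:2d} floor={floor:4.1f} us "
                  f"FLOP ratio={flop_ratio:.2f}x modeled speedup={speedup:.2f}x")
```

Under these assumptions, a single 197-token image needs only a few microseconds of arithmetic, so both the pruned and unpruned calls sit on the dispatch floor and the modeled speedup stays at 1x despite a roughly 4x FLOP reduction. A lower floor lets more of the FLOP savings surface once compute time is comparable to it.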
Method
A lightweight, bidirectional (non-causal) Triton attention kernel lowers the dispatch floor to 40 microseconds. Used inside a pack-attend-unpack pipeline, it lets pruning savings show up in wall-clock time.
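The pack-attend-unpack flow can be sketched in NumPy as a reference for the data movement. This is not the article's Triton kernel: the function names `pack`, `ragged_attention`, and `unpack` are hypothetical, the attention step is a plain per-sequence loop standing in for a fused ragged kernel, and the sketch is single-head and assumes every image keeps at least one token.

```python
import numpy as np


def pack(x, keep):
    """x: (B, N, D) padded tokens; keep: (B, N) bool mask of surviving tokens.
    Returns kept tokens as (total, D) plus cumulative sequence offsets."""
    lengths = keep.sum(axis=1)
    cu_seqlens = np.concatenate([[0], np.cumsum(lengths)])
    return x[keep], cu_seqlens


def ragged_attention(q, k, v, cu_seqlens):
    """Bidirectional (non-causal) softmax attention, applied independently
    to each packed sequence. Assumes no sequence is empty."""
    out = np.empty_like(v)
    scale = 1.0 / np.sqrt(q.shape[-1])
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        scores = (q[start:end] @ k[start:end].T) * scale
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        probs = np.exp(scores)
        probs = probs / probs.sum(axis=-1, keepdims=True)
        out[start:end] = probs @ v[start:end]
    return out


def unpack(packed, keep):
    """Scatter packed tokens back to a padded (B, N, D) layout; pruned slots are zero."""
    out = np.zeros(keep.shape + packed.shape[-1:], dtype=packed.dtype)
    out[keep] = packed
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 5, 4))
    keep = np.array([[True, True, False, True, False],
                     [True, False, False, False, True]])
    packed, cu_seqlens = pack(x, keep)
    attended = ragged_attention(packed, packed, packed, cu_seqlens)
    print(cu_seqlens, unpack(attended, keep).shape)
```

Packing means attention only ever touches surviving tokens, so no work is spent on padding. The offsets array plays the same role as the cumulative sequence lengths that variable-length attention interfaces typically take.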
In practice
- Account for dispatch overhead when attention runs on short sequences.
- Use Triton kernels to build custom, low-overhead attention.
Topics
- Vision Transformers
- Token Pruning
- Dispatch Overhead
- Triton Attention Kernel
- FlashAttention-2
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.