Dispatch-Aware Ragged Attention for Pruned Vision Transformers

2026-04-20 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

Dispatch-Aware Ragged Attention is a new Triton-based attention kernel that targets a dispatch-overhead bottleneck in Vision Transformers (ViTs), one that keeps token pruning methods from delivering their expected wall-clock speedups. Existing variable-length attention APIs such as FlashAttention-2's varlen and PyTorch's NestedTensor SDPA incur 60-90 µs of host-side dispatch latency at the short sequence lengths typical of pruned ViTs (≤197 tokens), which dwarfs the sub-microsecond matrix arithmetic itself. The proposed bidirectional Triton kernel lowers this dispatch floor to about 40 µs, so the savings from pruning actually show up in latency. Integrated into a complete pack–attend–unpack pipeline, the system achieves up to 2.24× end-to-end throughput over padded PyTorch SDPA, leaves classification predictions unchanged (max absolute logit difference below 0.007), and scales consistently across DeiT-T/S/B models and four pruning algorithms (Threshold-ℓ₂, DynamicViT, EViT, ATS).

Key takeaway

Computer vision engineers optimizing ViT inference with token pruning should measure their speedups against variable-length attention kernels, not just padded baselines. The dispatch overhead of standard APIs can negate pruning benefits, so a specialized kernel such as this Triton-based one is critical for realizing actual throughput gains, especially at small batch sizes, where it delivers 5-8% higher throughput than FlashAttention-2 varlen.

Key insights

Dispatch overhead, not compute, bottlenecks pruned ViT attention at short sequence lengths.

Principles

Short-sequence attention is dispatch-overhead bound.
Padding negates token pruning FLOP savings.
Specialized kernels can reduce dispatch overhead.
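
The three principles above can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: it counts just the two attention matrix products, uses DeiT-S shapes (6 heads of dimension 64, 197 tokens), and relies on three assumptions that do not come from the article: the per-image kept-token counts, the sustained GPU rate of 100 TFLOP/s, and a toy latency model in which a call can never finish faster than its dispatch floor.

```python
def attention_flops(n_tokens: int, head_dim: int, n_heads: int) -> int:
    """Approximate FLOPs for one attention layer on one sequence.

    Counts only Q @ K^T and P @ V, each costing 2 * n * n * d FLOPs per
    head; softmax and the linear projections are ignored.
    """
    return 4 * n_tokens * n_tokens * head_dim * n_heads


def padded_flops(lengths, head_dim, n_heads):
    # Padded batching pays for the longest sequence in every row.
    return len(lengths) * attention_flops(max(lengths), head_dim, n_heads)


def ragged_flops(lengths, head_dim, n_heads):
    # Ragged batching pays only for the tokens that survived pruning.
    return sum(attention_flops(n, head_dim, n_heads) for n in lengths)


def modeled_latency_us(flops, flops_per_us, dispatch_floor_us):
    # Toy model (assumption): latency is the larger of compute time
    # and the host-side dispatch floor.
    return max(dispatch_floor_us, flops / flops_per_us)


HEAD_DIM, N_HEADS = 64, 6          # DeiT-S attention shape
kept = [197, 120, 98, 60]          # hypothetical token counts after pruning
ASSUMED_FLOPS_PER_US = 1e8         # hypothetical sustained rate (100 TFLOP/s)

padded = padded_flops(kept, HEAD_DIM, N_HEADS)
ragged = ragged_flops(kept, HEAD_DIM, N_HEADS)
print(f"padded: {padded:,} FLOPs  ragged: {ragged:,} FLOPs")

# Dispatch floors quoted in the summary: 60-90 us for existing APIs, ~40 us here.
for floor_us in (90.0, 60.0, 40.0):
    latency = modeled_latency_us(ragged, ASSUMED_FLOPS_PER_US, floor_us)
    print(f"floor {floor_us:>4} us -> modeled latency {latency} us")
```

Under these assumptions the ragged batch needs well under half the FLOPs of the padded one, yet the modeled latency never drops below the dispatch floor, which is why lowering the floor matters more than trimming arithmetic at these sequence lengths.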

Method

A two-phase fused token packer (PyTorch GPU ops + Triton kernel) feeds a specialized bidirectional Triton attention kernel, integrated into a pack–attend–unpack pipeline for ViT inference.
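
To show what the pack–attend–unpack data flow computes, here is a minimal reference sketch in plain Python. It is not the authors' two-phase packer or Triton kernel: the function names `pack`, `attend`, and `unpack` are hypothetical, the input tokens stand in for queries, keys, and values (no projections, single head), and the `cu_seqlens` cumulative-offset layout is borrowed from the convention used by FlashAttention's varlen interface.

```python
import math
from itertools import accumulate


def pack(tokens, keep):
    """Drop pruned tokens and concatenate the survivors of every image.

    tokens: batch of sequences of feature vectors (nested lists).
    keep:   same batch/sequence shape, True where a token is kept.
    Returns packed rows, cumulative segment offsets, and (image, position)
    source indices for each packed row.
    """
    packed, lengths, index = [], [], []
    for b, (seq, mask) in enumerate(zip(tokens, keep)):
        kept = [(t, vec) for t, (vec, m) in enumerate(zip(seq, mask)) if m]
        packed.extend(vec for _, vec in kept)
        index.extend((b, t) for t, _ in kept)
        lengths.append(len(kept))
    return packed, [0, *accumulate(lengths)], index


def attend(q, k, v, cu_seqlens):
    """Bidirectional (unmasked) softmax attention inside each segment."""
    out = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        cols = range(start, end)  # keys never cross an image boundary
        for i in cols:
            scale = 1.0 / math.sqrt(len(q[i]))
            scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in cols]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            out.append([
                sum(w * v[j][c] for w, j in zip(weights, cols)) / total
                for c in range(len(v[i]))
            ])
    return out


def unpack(packed, index, batch, seq_len, dim):
    """Scatter packed rows back into a zero-filled padded layout."""
    padded = [[[0.0] * dim for _ in range(seq_len)] for _ in range(batch)]
    for (b, t), vec in zip(index, packed):
        padded[b][t] = list(vec)
    return padded


tokens = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]],
]
keep = [[True, False, True], [True, True, True]]  # hypothetical pruning mask

packed, cu_seqlens, index = pack(tokens, keep)
out = attend(packed, packed, packed, cu_seqlens)
restored = unpack(out, index, batch=2, seq_len=3, dim=2)
print(cu_seqlens)
print(restored[0])
```

The point of the packed layout is that a single kernel launch covers every image in the batch while the offsets keep attention confined to each image's own tokens; pruned positions come back as zeros after unpacking.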

In practice

Benchmark pruning against variable-length kernels.
Decompose attention vs. MLP latency contributions.
Consider Triton for short-sequence GPU kernels.
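
The first two recommendations can be turned into a small measurement habit. The sketch below is an assumption-laden illustration, not part of the paper: `median_latency_us` is a hypothetical stdlib timing helper, and `end_to_end_speedup` is an Amdahl-style estimate of how much an attention-only speedup can move whole-model latency once the MLP share is accounted for. The 40% attention share is a made-up example value.

```python
import statistics
import time


def median_latency_us(fn, warmup=10, iters=100):
    """Median wall-clock time of fn() in microseconds.

    For GPU work, fn must block until the device has finished (for
    example by synchronizing inside fn); otherwise only the launch
    is timed.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter_ns()
        fn()
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return statistics.median(samples)


def end_to_end_speedup(attn_fraction, attn_speedup, mlp_speedup=1.0):
    """Amdahl-style estimate from the attention share of baseline latency."""
    if not 0.0 <= attn_fraction <= 1.0:
        raise ValueError("attn_fraction must lie in [0, 1]")
    mlp_fraction = 1.0 - attn_fraction
    return 1.0 / (attn_fraction / attn_speedup + mlp_fraction / mlp_speedup)


# Hypothetical split: attention accounts for 40% of baseline latency.
for attn_speedup in (1.5, 2.0, 4.0):
    print(attn_speedup, round(end_to_end_speedup(0.4, attn_speedup), 3))

# Stand-in workload; replace with a real attention or MLP call.
print(median_latency_us(lambda: sum(range(1_000)), warmup=5, iters=50))
```

Timing attention and MLP blocks separately, then plugging the measured shares into the estimate, shows quickly whether a faster attention kernel can plausibly account for a claimed end-to-end gain.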

Topics

Vision Transformers
Token Pruning
Dispatch Overhead
Triton Kernel
FlashAttention-2

Code references

saifmb0/sparse-vits

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.