DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

2026-05-18 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

DashAttention, a novel hierarchical attention method, addresses limitations in existing techniques like NSA and InfLLMv2 by introducing a fully differentiable and adaptively sparse approach. Unlike prior methods that use a fixed top-k selection, DashAttention employs the $\alpha$-entmax transformation to dynamically select a variable number of key-value blocks based on the query, providing a prior for the subsequent softmax attention. This design ensures gradient flow throughout the hierarchy and makes DashAttention non-dispersive, enhancing its long-context modeling capabilities. Experiments with large language models demonstrate that DashAttention achieves accuracy comparable to full attention with 75% sparsity and outperforms NSA and InfLLMv2, particularly in high-sparsity scenarios. An efficient GPU-aware implementation in Triton also provides a speedup over FlashAttention-3 during inference.

Key takeaway

For AI engineers and research scientists developing or deploying long-context large language models, DashAttention offers a sparse alternative to full attention and to fixed top-k hierarchical methods such as NSA and InfLLMv2. Its adaptive sparsity and full differentiability cut computation substantially (75% sparsity) while keeping accuracy comparable to full attention, and its Triton implementation runs faster than FlashAttention-3 during inference. Consider evaluating DashAttention when optimizing long-context LLM performance and resource utilization.

Key insights

DashAttention uses adaptive $\alpha$-entmax for differentiable, variable-sparsity hierarchical attention, improving long-context LLM efficiency.

Principles

Adaptive sparsity improves long-context modeling.
Differentiability across stages is crucial for optimization.
Non-dispersive attention enhances context retention.
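
A minimal sketch of the first and third principles, using sparsemax (the $\alpha = 2$ special case of $\alpha$-entmax, chosen here only because it has a simple closed form). The block scores and the helper `peak_weight` are made up for illustration, and "non-dispersive" is read here as "the largest weight does not shrink toward zero as the number of candidates grows"; the paper's formal definition may differ.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha = 2): Euclidean projection onto the simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum      # True for a prefix of the sorted scores
    k_star = k[support][-1]                  # size of the support
    tau = (cumsum[k_star - 1] - 1) / k_star  # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Adaptive sparsity: the number of blocks kept depends on the score profile,
# unlike top-k, which always keeps exactly k.
peaked = np.array([4.0, 0.5, 0.2, 0.1, 0.0, -0.3])  # hypothetical block scores
flat = np.array([1.0, 0.9, 0.8, 0.7, 0.1, -2.0])    # hypothetical block scores
n_peaked = int(np.count_nonzero(sparsemax(peaked)))  # keeps 1 block
n_flat = int(np.count_nonzero(sparsemax(flat)))      # keeps 4 blocks

# Dispersion: one candidate scores 2.0, all others 0.0, for growing n.
def peak_weight(fn, n):
    z = np.zeros(n)
    z[0] = 2.0
    return float(fn(z).max())

softmax_peak = {n: peak_weight(softmax, n) for n in (16, 4096)}
sparsemax_peak = {n: peak_weight(sparsemax, n) for n in (16, 4096)}
print(n_peaked, n_flat, softmax_peak, sparsemax_peak)
```

With bounded scores, the softmax peak weight shrinks as the context grows, while the sparse transformation keeps all of its mass on the standout candidate.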

Method

DashAttention first applies an adaptively sparse $\alpha$-entmax transformation to select a variable number of key-value blocks for each query. The resulting block weights then act as a prior for a second-stage softmax attention, so gradients flow through both stages of the hierarchy.
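
The two stages can be sketched in NumPy for a single query. This is an illustration under stated assumptions, not the paper's Triton kernel: the summary does not say how blocks are scored or how the prior enters the softmax, so mean-pooled block keys and an additive log-prior on the token logits are assumptions, as are the names `entmax_bisect` and `two_stage_attention`. The $\alpha$-entmax itself is computed with a generic bisection on its threshold.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax (alpha > 1) via bisection on the threshold tau."""
    z = (alpha - 1.0) * np.asarray(z, dtype=float)
    lo = z.max() - 1.0                               # weights sum to >= 1 here
    hi = z.max() - (1.0 / z.size) ** (alpha - 1.0)   # weights sum to <= 1 here
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        p = np.maximum(z - tau, 0.0) ** (1.0 / (alpha - 1.0))
        if p.sum() >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.maximum(z - lo, 0.0) ** (1.0 / (alpha - 1.0))
    return p / p.sum()

def two_stage_attention(q, K, V, block_size, alpha=1.5):
    """Hypothetical two-stage sparse attention for one query vector q."""
    n, d = K.shape
    assert n % block_size == 0, "sketch assumes n is a multiple of block_size"
    n_blocks = n // block_size
    # Stage 1 (assumption): score each block by its mean-pooled key.
    block_keys = K.reshape(n_blocks, block_size, d).mean(axis=1)
    prior = entmax_bisect(block_keys @ q / np.sqrt(d), alpha)
    selected = np.flatnonzero(prior > 0)             # variable number of blocks
    # Stage 2 (assumption): softmax over tokens of the selected blocks,
    # with the log of the block prior added to each token logit.
    token_idx = (selected[:, None] * block_size + np.arange(block_size)).ravel()
    logits = K[token_idx] @ q / np.sqrt(d)
    logits = logits + np.repeat(np.log(prior[selected]), block_size)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[token_idx], prior, selected

rng = np.random.default_rng(0)
d, block_size, n_blocks = 16, 4, 8
K = rng.normal(size=(n_blocks * block_size, d))
V = rng.normal(size=(n_blocks * block_size, d))
q = 4.0 * K[5]                                       # query aligned with one key
out, prior, selected = two_stage_attention(q, K, V, block_size)
print(len(selected), "of", n_blocks, "blocks selected")
```

Because blocks with zero prior would receive a log-prior of negative infinity, skipping them in stage 2 gives the same output as a dense softmax with those tokens masked out; only the selected blocks need to be read.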

In practice

Achieves accuracy comparable to full attention at 75% sparsity.
Outperforms NSA and InfLLMv2, particularly at high sparsity.
Its Triton implementation provides an inference speedup over FlashAttention-3.

Topics

DashAttention
Hierarchical Attention
Sparse Attention
α-entmax Transformation
Long-Context Modeling

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.