AdaSplash-2: Faster Differentiable Sparse Attention

2026-04-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

AdaSplash-2 is a differentiable sparse attention mechanism designed to accelerate transformer training, particularly at long context lengths. It addresses the computational overhead of $α$-entmax attention, a sparse alternative to softmax, with a histogram-based initialization: a coarse histogram of attention scores is computed on the fly and kept in on-chip SRAM, which cuts the number of iterations needed to compute the normalizer $τ$ to typically 1-2. Coupled with a sparsity-aware GPU implementation that skips all-zero blocks, AdaSplash-2 achieves per-step training times comparable to or better than FlashAttention-2 when block sparsity is moderate to high (e.g., >60%). Models trained with AdaSplash-2 match softmax baselines at short context lengths and show substantial gains in long-context settings.
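
For readers unfamiliar with the normalizer $τ$, the standard definition of $α$-entmax from the entmax literature (not restated in the summary above, so the notation $s_i$ for attention scores and $p_i$ for attention weights is an assumption here) is:

$$
p_i = \big[(\alpha - 1)\, s_i - \tau\big]_+^{1/(\alpha - 1)}, \qquad \text{with } \tau \text{ chosen so that } \sum_i p_i = 1, \quad \alpha > 1 .
$$

Here $[x]_+ = \max(x, 0)$, so every score with $(\alpha - 1)\, s_i \le \tau$ receives exactly zero weight. Setting $\alpha = 2$ gives sparsemax, and the limit $\alpha \to 1$ recovers softmax. Because $\tau$ has no closed form for general $\alpha$, it is found iteratively, which is the cost a good initial guess reduces.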

Key takeaway

For AI engineers training transformers with long-context requirements, AdaSplash-2 makes $α$-entmax sparse attention competitive on speed: per-step training time is comparable to or better than FlashAttention-2 once block sparsity is moderate to high (e.g., >60%), and the resulting models match softmax baselines at short context lengths while improving at long ones. It is worth evaluating if your attention patterns are sparse enough to reach that regime.

Key insights

AdaSplash-2 accelerates sparse attention in transformers through a histogram-based initialization of $τ$ and a sparsity-aware GPU implementation.

Principles

Input-dependent sparsity is crucial for long contexts.
Accurate initialization reduces iterative computation.
Sparsity-aware implementations improve GPU efficiency.
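
To make the last principle concrete, here is a minimal CPU sketch of block skipping, written in NumPy rather than as a GPU kernel. It is not the AdaSplash-2 implementation: the function name `block_sparse_matmul`, the tile size, and the banded test matrix are all hypothetical choices for illustration. The idea it shows is only that tiles of the attention-weight matrix that are entirely zero contribute nothing to the output and can be skipped.

```python
import numpy as np

def block_sparse_matmul(P, V, block=64):
    """Compute P @ V tile by tile, skipping tiles of P that are all zero.

    Illustrative only: a real kernel would decide which tiles to skip
    before loading them, instead of inspecting a materialized P.
    """
    n, m = P.shape
    out = np.zeros((n, V.shape[1]), dtype=np.result_type(P, V))
    skipped = total = 0
    for i in range(0, n, block):
        for j in range(0, m, block):
            tile = P[i:i + block, j:j + block]
            total += 1
            if not tile.any():          # all-zero tile: no contribution
                skipped += 1
                continue
            out[i:i + block] += tile @ V[j:j + block]
    return out, skipped / total

# Stand-in for sparse attention weights: non-zero only near the diagonal.
rng = np.random.default_rng(0)
idx = np.arange(256)
band = np.abs(idx[:, None] - idx[None, :]) < 32
P = rng.random((256, 256)) * band
V = rng.random((256, 16))

out, block_sparsity = block_sparse_matmul(P, V)
print("max abs error vs dense:", np.abs(out - P @ V).max())
print("fraction of tiles skipped:", block_sparsity)
```

The fraction of skipped tiles is the block sparsity that the summary refers to; the >60% figure quoted there comes from the summary, not from this toy example.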

Method

AdaSplash-2 computes a coarse histogram of attention scores in SRAM for $τ$ initialization, then uses a sparsity-aware GPU implementation to skip zero blocks.
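
The effect of a histogram-based starting point can be sketched on a single row of scores. The code below is an assumption-laden illustration, not the paper's algorithm: it uses the standard $α$-entmax definition, a NumPy histogram in place of on-chip SRAM, a conservative lower bound derived from bin edges, and Newton refinement. The function name `entmax_tau` and all of its parameters are hypothetical.

```python
import numpy as np

def entmax_tau(scores, alpha=1.5, n_bins=32, use_hist=True, tol=1e-10, max_iter=50):
    """Find tau with sum(max((alpha-1)*s - tau, 0) ** (1/(alpha-1))) == 1.

    Returns (tau, number of Newton steps taken).
    """
    z = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    k = 1.0 / (alpha - 1.0)
    z_max = z.max()
    # Naive start: at tau = z_max - 1 the top score alone has mass 1,
    # so the true tau can never be smaller than this.
    tau = z_max - 1.0
    if use_hist:
        # Coarse histogram over the only interval that can hold the support.
        counts, edges = np.histogram(z, bins=n_bins, range=(tau, z_max))
        # Lower-bound the mass at each candidate edge by pretending every
        # score sits at the lower edge of its bin.
        gaps = np.clip(edges[None, :-1] - edges[:, None], 0.0, None)
        mass_lb = (counts * gaps ** k).sum(axis=1)
        ok = np.nonzero(mass_lb >= 1.0)[0]
        if ok.size:
            tau = edges[ok[-1]]  # largest edge still guaranteed <= true tau
    for it in range(max_iter + 1):
        d = np.clip(z - tau, 0.0, None)
        f = (d ** k).sum() - 1.0
        if abs(f) < tol:
            return tau, it
        # Newton step on the exact scores, using only the current support.
        tau += f / (k * (d[d > 0] ** (k - 1.0)).sum())
    return tau, max_iter

rng = np.random.default_rng(0)
scores = rng.normal(scale=4.0, size=4096)
tau_h, steps_h = entmax_tau(scores, use_hist=True)
tau_n, steps_n = entmax_tau(scores, use_hist=False)
p = np.clip(0.5 * scores - tau_h, 0.0, None) ** 2  # alpha = 1.5 weights
print("Newton steps with / without histogram start:", steps_h, steps_n)
print("sum of weights:", p.sum(), "| fraction exactly zero:", (p == 0).mean())
```

Starting from a point that is provably at or below the true $τ$ keeps Newton's method monotone for $α \le 2$, so a tighter histogram bound can only reduce the number of refinement steps. How many steps remain depends on the score distribution and bin count; the "typically 1-2" figure is the summary's claim about AdaSplash-2, not a property of this sketch.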

In practice

Use AdaSplash-2 for long-context transformer training.
Consider $α$-entmax attention for sparse models.
Keep the score histogram in on-chip SRAM when implementing histogram-based initialization.

Topics

AdaSplash-2
Sparse Attention
α-entmax Attention
Transformer Models
Long-Context Training

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.