AdaSplash-2: Faster Differentiable Sparse Attention
Summary
AdaSplash-2 is a differentiable sparse attention mechanism designed to accelerate transformer training, particularly at long context lengths. It targets the computational overhead of $α$-entmax attention, a sparse alternative to softmax, with a histogram-based initialization: a coarse histogram of attention scores is computed on the fly and kept in on-chip SRAM, which typically cuts the number of iterations needed to compute the normalizer $τ$ to 1-2. Coupled with a sparsity-aware GPU implementation that skips zero blocks, AdaSplash-2 achieves per-step training times comparable to or better than FlashAttention-2 when block sparsity is moderate to high (e.g., >60%). Models trained with AdaSplash-2 match softmax baselines at short context lengths and show substantial performance gains in long-context settings.
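For reference, the standard definition of $α$-entmax (for $α > 1$) shows where $τ$ enters. The notation here (score vector $\mathbf{z}$, one row of attention scores) is ours rather than the article's:

$$
\alpha\text{-entmax}(\mathbf{z})_i = \left[(\alpha - 1)\, z_i - \tau\right]_+^{1/(\alpha - 1)},
\qquad \text{with } \tau \text{ chosen so that } \sum_i \left[(\alpha - 1)\, z_i - \tau\right]_+^{1/(\alpha - 1)} = 1 .
$$

Any score with $(α - 1) z_i \le τ$ receives exactly zero weight, which is the source of the sparsity. In general $τ$ has no closed form and is found iteratively, so a good starting guess directly reduces the work per attention row.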
Key takeaway
For AI engineers training transformers on long contexts, AdaSplash-2 makes sparse attention competitive on speed. At moderate-to-high block sparsity (e.g., >60%), its per-step training time is comparable to or better than FlashAttention-2's, so you can train faster and potentially get better model performance in long-context settings. It is worth evaluating when optimizing long-context transformer architectures.
Key insights
AdaSplash-2 accelerates sparse attention in transformers via histogram-based initialization of $τ$ and a sparsity-aware GPU implementation.
Principles
- Input-dependent sparsity is crucial for long contexts.
- Accurate initialization reduces iterative computation.
- Sparsity-aware implementations improve GPU efficiency.
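The last principle can be illustrated with a small CPU sketch of block skipping. This is not the AdaSplash-2 GPU kernel: the helper names `block_mask` and `blocked_matmul`, the toy attention pattern, and the tile size are all assumptions made for illustration. A real kernel would also derive the mask without materializing the full attention matrix, which this sketch does only for clarity.

```python
# Illustrative sketch only: hypothetical helper names, pure-Python loops
# standing in for GPU tiles.

def block_mask(P, block):
    """mask[i][j] is True when tile (i, j) of P holds at least one nonzero weight."""
    n_rows, n_cols = len(P), len(P[0])
    return [[any(P[r][c] != 0.0
                 for r in range(i, min(i + block, n_rows))
                 for c in range(j, min(j + block, n_cols)))
             for j in range(0, n_cols, block)]
            for i in range(0, n_rows, block)]

def blocked_matmul(P, V, block, mask):
    """Compute P @ V tile by tile, skipping tiles that the mask marks as all-zero."""
    n_rows, n_cols, d = len(P), len(P[0]), len(V[0])
    out = [[0.0] * d for _ in range(n_rows)]
    visited = 0
    for bi, i in enumerate(range(0, n_rows, block)):
        for bj, j in enumerate(range(0, n_cols, block)):
            if not mask[bi][bj]:
                continue  # zero tile: no loads, no multiply-adds
            visited += 1
            for r in range(i, min(i + block, n_rows)):
                for c in range(j, min(j + block, n_cols)):
                    w = P[r][c]
                    for k in range(d):
                        out[r][k] += w * V[c][k]
    return out, visited

n, d, block = 16, 2, 4
# Toy sparse attention rows (assumed pattern): each query attends to itself
# and to the previous token.
P = [[0.0] * n for _ in range(n)]
for r in range(n):
    cols = [r] if r == 0 else [r - 1, r]
    for c in cols:
        P[r][c] = 1.0 / len(cols)
V = [[float(c), 1.0] for c in range(n)]

mask = block_mask(P, block)
out, visited = blocked_matmul(P, V, block, mask)
# Dense reference that touches every entry, zeros included.
dense = [[sum(P[r][c] * V[c][k] for c in range(n)) for k in range(d)] for r in range(n)]
total = len(mask) * len(mask[0])
print(f"visited {visited} of {total} tiles, block sparsity {1 - visited / total:.2%}")
```

The skipped tiles contribute nothing to the output, so the blocked result equals the dense one while doing work only on tiles that contain nonzero weights. The fraction of skipped tiles is the block sparsity that the summary's >60% threshold refers to.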
Method
AdaSplash-2 builds a coarse histogram of attention scores in on-chip SRAM to initialize $τ$, then uses a sparsity-aware GPU implementation to skip zero blocks.
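The summary does not spell out the histogram procedure, so the following is only a minimal Python sketch of one way a coarse histogram can supply a starting point for $τ$. Everything specific in it is an assumption: the function names (`histogram_bracket`, `refine_tau`), the 32 bins, the use of Newton steps for refinement, $α = 1.5$, and the random test scores. It runs on the CPU and is not the AdaSplash-2 kernel.

```python
# Illustrative sketch only: hypothetical names and parameters, not the paper's kernel.
import random

def entmax_sum(s, tau, power):
    """Sum over i of [s_i - tau]_+ ** power, where s_i = (alpha - 1) * z_i."""
    return sum((x - tau) ** power for x in s if x > tau)

def entmax_sum_grad(s, tau, power):
    """Derivative of entmax_sum with respect to tau (never positive)."""
    return -power * sum((x - tau) ** (power - 1.0) for x in s if x > tau)

def histogram_bracket(s, power, n_bins=32):
    """Bracket tau from a coarse histogram of the scaled scores alone."""
    s_max = max(s)
    width = 1.0 / n_bins
    # tau always lies in [s_max - 1, s_max], so only that range is binned.
    edges = [s_max - 1.0 + k * width for k in range(n_bins + 1)]
    counts = [0] * n_bins
    for x in s:
        if x > edges[0]:  # scores at or below s_max - 1 always get zero weight
            counts[min(int((x - edges[0]) / width), n_bins - 1)] += 1
    tau_lo, tau_hi = edges[0], edges[-1]
    for tau in edges:
        # Rounding each score down (up) to its bin edge under- (over-)estimates the sum.
        under = sum(c * max(edges[b] - tau, 0.0) ** power for b, c in enumerate(counts))
        over = sum(c * max(edges[b + 1] - tau, 0.0) ** power for b, c in enumerate(counts))
        if under >= 1.0:
            tau_lo = tau  # true sum >= 1 here, so the root is at or above this edge
        if over <= 1.0:
            tau_hi = tau  # true sum <= 1 here, so the root is at or below this edge
            break
    return tau_lo, tau_hi

def refine_tau(s, power, tau0, tol=1e-9, max_iter=100):
    """Newton refinement of sum_i p_i(tau) = 1; returns (tau, iterations used)."""
    tau = tau0
    for it in range(max_iter):
        f = entmax_sum(s, tau, power) - 1.0
        if abs(f) < tol:
            return tau, it
        tau -= f / entmax_sum_grad(s, tau, power)
    return tau, max_iter

alpha = 1.5
power = 1.0 / (alpha - 1.0)
random.seed(0)
scores = [random.gauss(0.0, 3.0) for _ in range(1024)]  # stand-in for one row of scores
s = [(alpha - 1.0) * z for z in scores]

tau_lo, tau_hi = histogram_bracket(s, power)
tau_hist, iters_hist = refine_tau(s, power, tau_lo)        # histogram-initialized start
tau_cold, iters_cold = refine_tau(s, power, max(s) - 1.0)  # start from the trivial lower bound
print(f"iterations: histogram start {iters_hist}, cold start {iters_cold}")
```

Both starts reach the same $τ$. The histogram only moves the starting point closer to the solution, and because the left-hand side is convex and decreasing in $τ$, a Newton iteration started below the root never needs more steps when started closer to it. How many iterations are saved depends on the score distribution and the number of bins.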
In practice
- Use AdaSplash-2 for long-context transformer training.
- Consider $α$-entmax attention for sparse models.
- Keep the coarse score histogram in on-chip SRAM when implementing histogram-based initialization.
Topics
- AdaSplash-2
- Sparse Attention
- α-entmax Attention
- Transformer Models
- Long-Context Training
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.