StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Summary
DeepSeek-V3.2 and V4 use Compressed Sparse Attention (CSA), in which a learned scoring projection (the indexer) selects the top-k keys for each query before sparse attention runs. Existing public CSA implementations materialize a full FP32 score tensor, which reaches 256 GB at sequence length S=65,536 with V4-Flash dimensions and exceeds single-GPU HBM. StreamIndex is a Triton implementation of the CSA pipeline whose chunked partition-merge top-k driver avoids materializing that intermediate tensor. On an NVIDIA H200, StreamIndex processes V4-Flash dimensions up to S=1,048,576 with a peak HBM usage of 6.21 GB, extending the operational regime by 32x relative to the materialize path, which runs out of memory at S=65,536. StreamIndex maintains bit-exact set-overlap recall at smaller sequence lengths and achieves a mean recall of 1.0000 across its design-space sweeps.
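The memory gap can be checked with simple arithmetic. The sketch below is a rough model and not StreamIndex's actual accounting: `n_planes`, `chunk`, and `k` are hypothetical parameters introduced for illustration, and the summary does not break down how the 256 GB figure is composed. It only shows that a single S x S FP32 score plane at S=65,536 already takes 16 GiB, while keeping one score tile plus a running top-k per query grows linearly in S.

```python
GIB = 2 ** 30


def materialized_score_bytes(seq_len, n_planes=1, bytes_per_score=4):
    """Bytes for n_planes full (seq_len x seq_len) FP32 score matrices."""
    return n_planes * seq_len * seq_len * bytes_per_score


def streaming_score_bytes(seq_len, chunk, k, n_planes=1, bytes_per_score=4):
    """Bytes for one (seq_len x chunk) score tile plus a (seq_len x k) running top-k.

    Assumption: a simplified model of a chunked driver, not a measured figure.
    """
    return n_planes * seq_len * (chunk + k) * bytes_per_score


S = 65_536
print(materialized_score_bytes(S) / GIB)               # 16.0
print(materialized_score_bytes(S, n_planes=16) / GIB)  # 256.0 (n_planes=16 is an assumption)
print(streaming_score_bytes(S, chunk=1024, k=512) / GIB)  # 0.375
```

Under this model the materialized cost is quadratic in S, so doubling the sequence length quadruples it, while the streaming cost only doubles.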
Key takeaway
For AI engineers building large language models with sparse attention, StreamIndex addresses the memory limit of the CSA indexer stage. If you hit out-of-memory errors when scaling sequence length with DeepSeek-V4-style CSA, StreamIndex extends the operational regime by 32x, up to S=1,048,576 on a single NVIDIA H200, with a reported mean recall of 1.0000. Consider integrating this Triton implementation when HBM is the constraint.
Key insights
StreamIndex enables memory-bounded Compressed Sparse Attention by avoiding full score tensor materialization.
Principles
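- Chunking the score computation bounds peak memory, which avoids the OOM of the materialize path.
- Sparse attention reduces computation by attending only to the top-k selected keys per query.
- A learned indexer (scoring projection) chooses which keys each query attends to.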
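The last two principles can be illustrated with a small select-then-attend sketch. This is a generic illustration in plain Python, not DeepSeek's indexer or the StreamIndex kernels: the names `select_topk` and `sparse_attention` are hypothetical, and the indexer vectors stand in for whatever the learned scoring projection produces.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_topk(index_q, index_keys, k):
    """Score every key with the indexer vectors and return the k best positions.

    Ties are broken by the lower index so the result is deterministic.
    """
    scores = [dot(index_q, ik) for ik in index_keys]
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


def sparse_attention(q, keys, values, selected):
    """Softmax attention restricted to the selected key positions."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
index_keys = [[0.1], [0.9], [0.5]]  # stand-ins for indexer outputs
selected = select_topk([1.0], index_keys, k=2)
print(selected)  # [1, 2]
print(sparse_attention([1.0, 0.0], keys, values, selected))
```

The attention step touches only `k` keys per query instead of all of them, which is where the compute saving comes from. The selection step still scores every key, and that scoring is the part whose intermediate tensor becomes the memory problem at large S.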
Method
StreamIndex uses a chunked partition-merge top-k driver in Triton to process Compressed Sparse Attention scores without materializing the entire intermediate tensor, integrating with TileLang's pipelined attention kernel.
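The partition-merge idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Triton driver: `topk_streaming`, `topk_materialized`, and `set_overlap_recall` are hypothetical names, scores are plain dot products, and ties are broken by the lower index. Each chunk of keys is scored, reduced to its local top-k, and merged into a running top-k, so no more than `chunk + k` scores are live at once.

```python
import heapq
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_materialized(q, keys, k):
    """Reference path: score every key at once, then take the top-k."""
    scored = [(-dot(q, key), i) for i, key in enumerate(keys)]
    return [i for _, i in sorted(scored)[:k]]


def topk_streaming(q, keys, k, chunk):
    """Chunked path: keep a running top-k and merge one chunk at a time."""
    best = []  # at most k entries of (-score, index)
    for start in range(0, len(keys), chunk):
        stop = min(start + chunk, len(keys))
        # Partition: score only this chunk and keep its local top-k.
        part = [(-dot(q, keys[i]), i) for i in range(start, stop)]
        local = heapq.nsmallest(k, part)
        # Merge: combine the local winners with the running winners.
        best = heapq.nsmallest(k, best + local)
    return [i for _, i in best]


def set_overlap_recall(selected, reference):
    """Fraction of the reference top-k that the selected set recovers."""
    return len(set(selected) & set(reference)) / len(reference)


random.seed(0)
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
q = [random.gauss(0, 1) for _ in range(8)]
reference = topk_materialized(q, keys, k=16)
for chunk in (1, 7, 64, 500):
    streamed = topk_streaming(q, keys, k=16, chunk=chunk)
    print(chunk, set_overlap_recall(streamed, reference))  # recall is 1.0 for every chunk size
```

Because any element of the global top-k must also be in the top-k of its own chunk, the merge loses nothing, and the chunk size only changes how many scores are held at a time.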
In practice
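- Use StreamIndex when CSA at large S would otherwise materialize a score tensor that exceeds HBM.
- Pair it with TileLang's pipelined attention kernel.
- Tune chunk and tile sizes for your hardware; the reported design-space sweeps held mean recall at 1.0000.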
Topics
- StreamIndex
- Compressed Sparse Attention
- Memory-Bounded Attention
- Streaming Top-k
- Triton
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.