StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

2026-05-04 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

DeepSeek-V3.2 and V4 use Compressed Sparse Attention (CSA), in which a learned scoring projection (the indexer) selects the top-k keys per query for sparse attention. Existing public CSA implementations materialize a large FP32 score tensor that reaches 256 GB at sequence length S=65,536 with V4-Flash dimensions, which exceeds single-GPU HBM. StreamIndex is a Triton-based implementation of the CSA pipeline built around a chunked partition-merge top-k driver that never materializes the full intermediate tensor. On an NVIDIA H200, StreamIndex handles V4-Flash dimensions up to S=1,048,576 with a peak HBM usage of 6.21 GB, extending the operational regime by 32x over the materialize path, which runs out of memory (OOM) at S=65,536. StreamIndex maintains bit-exact set-overlap recall at smaller sequence lengths and achieves a mean recall of 1.0000 across various design-space sweeps.
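
To see why the materialized tensor grows so quickly, the arithmetic can be sketched as below. This is a rough estimate under assumptions that do not come from the article: the score tensor is treated as dense S x S FP32 matrices, and the slice count of 16 is picked only because it reproduces the quoted 256 GB figure. The actual V4-Flash configuration may factor differently.

```python
# Back-of-the-envelope size of a fully materialized FP32 score tensor.
# ASSUMPTION: one dense S x S score matrix per "slice" (for example, per
# indexer head). The slice count of 16 is hypothetical and chosen only
# because it matches the 256 GB figure quoted in the summary.

BYTES_FP32 = 4
GIB = 1024 ** 3


def score_tensor_gib(seq_len: int, num_slices: int = 1) -> float:
    """Size in GiB of num_slices dense (seq_len x seq_len) FP32 matrices."""
    return num_slices * seq_len * seq_len * BYTES_FP32 / GIB


per_slice = score_tensor_gib(65_536)                 # one S x S matrix
total = score_tensor_gib(65_536, num_slices=16)      # hypothetical 16 slices
print(f"{per_slice:.0f} GiB per slice, {total:.0f} GiB for 16 slices")
```

Because the size is quadratic in S, doubling the sequence length quadruples the tensor, which is why a streaming approach matters more as S grows.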

Key takeaway

For AI engineers building large language models with sparse attention, StreamIndex addresses the memory limit imposed by the indexer's score tensor. If you hit out-of-memory errors when scaling sequence length with DeepSeek-V4-style CSA, StreamIndex extends the feasible range by 32x, up to S=1,048,576 on a single NVIDIA H200, without sacrificing recall. Consider integrating this Triton implementation to stay within HBM constraints.

Key insights

StreamIndex enables memory-bounded Compressed Sparse Attention by avoiding full score tensor materialization.

Principles

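Chunking the score computation bounds peak memory and avoids OOM errors.
Sparse attention reduces computation by attending only to the top-k keys selected per query.
A learned indexer picks those top-k keys, which is what makes the sparse path efficient.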

Method

StreamIndex uses a chunked partition-merge top-k driver in Triton to process Compressed Sparse Attention scores without materializing the entire intermediate tensor, integrating with TileLang's pipelined attention kernel.
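
The chunked partition-merge idea can be illustrated with a minimal sketch. This is not the StreamIndex Triton kernel: it is plain Python for a single query, the scores are plain dot products standing in for the indexer projection, and all function names are hypothetical. It only shows that scoring one chunk at a time and merging local top-k results gives the same selection as materializing every score first.

```python
import heapq
import random


def dot(q, key):
    """Stand-in for the indexer score (ASSUMPTION: a plain dot product)."""
    return sum(a * b for a, b in zip(q, key))


def streaming_topk(query, keys, k, chunk_size):
    """Indices of the k best keys, holding one chunk of scores at a time."""
    best = []  # running candidates as (score, -index); ties favor lower index
    for start in range(0, len(keys), chunk_size):
        chunk = keys[start:start + chunk_size]
        # Partition step: score this chunk only and keep its local top-k.
        scored = [(dot(query, key), -(start + i)) for i, key in enumerate(chunk)]
        local = heapq.nlargest(k, scored)
        # Merge step: combine with the running top-k and truncate back to k.
        best = heapq.nlargest(k, best + local)
    return [-neg_idx for _, neg_idx in best]


def materialized_topk(query, keys, k):
    """Reference path: compute every score first, then select."""
    scored = [(dot(query, key), -i) for i, key in enumerate(keys)]
    return [-neg_idx for _, neg_idx in heapq.nlargest(k, scored)]


def set_overlap_recall(selected, reference):
    """Fraction of the reference top-k set recovered by the selection."""
    return len(set(selected) & set(reference)) / len(reference)


rng = random.Random(0)
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(1000)]
query = [rng.gauss(0, 1) for _ in range(8)]

ref = materialized_topk(query, keys, k=16)
got = streaming_topk(query, keys, k=16, chunk_size=128)
print(set_overlap_recall(got, ref))
```

The streaming version keeps at most chunk_size scores plus 2k candidates alive at any moment, regardless of how many keys there are. Any key in the global top-k must also be in the top-k of its own chunk, so the merge loses nothing.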

In practice

Use StreamIndex for CSA at large sequence lengths (S), where the materialize path runs out of memory.
Integrate it with TileLang's pipelined attention kernel.
Optimize chunk and tile sizes.
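
One way to reason about chunk and tile sizes is to budget the working set of a single query tile. The sketch below is a hypothetical model, not StreamIndex's tuning logic: it assumes FP32 scores for one query tile against one key chunk, plus FP32 values and int32 indices for the running top-k. The tile size, k, and memory budget are illustrative numbers only.

```python
# Hypothetical working-set model for one query tile in a chunked top-k pass.
# ASSUMPTIONS: FP32 chunk scores, FP32 running top-k values, int32 indices.

BYTES_FP32 = 4
BYTES_INT32 = 4
MIB = 1024 ** 2


def working_set_bytes(query_tile: int, chunk_size: int, k: int) -> int:
    """Bytes for one chunk of scores plus the running top-k state."""
    scores = query_tile * chunk_size * BYTES_FP32
    running = query_tile * k * (BYTES_FP32 + BYTES_INT32)
    return scores + running


def largest_chunk(query_tile, k, budget_bytes, candidates):
    """Largest candidate chunk size whose working set fits the budget."""
    fitting = [c for c in candidates
               if working_set_bytes(query_tile, c, k) <= budget_bytes]
    return max(fitting) if fitting else None


candidates = [2 ** p for p in range(10, 18)]  # 1,024 through 131,072
chunk = largest_chunk(query_tile=4096, k=512,
                      budget_bytes=1024 * MIB, candidates=candidates)
print(chunk, working_set_bytes(4096, chunk, 512) // MIB, "MiB")
```

In this model the working set depends on the tile size, chunk size, and k, but not on S, which matches the memory-bounded behavior described above. Actual optimal sizes depend on the kernel and hardware and should be measured.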

Topics

StreamIndex
Compressed Sparse Attention
Memory-Bounded Attention
Streaming Top-k
Triton

Code references

RightNow-AI/StreamIndex

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.