StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

Source: Machine Learning · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, quick

Summary

DeepSeek-V3.2 and V4 use Compressed Sparse Attention (CSA), in which a learned scoring projection (the indexer) selects the top-k keys per query for sparse attention. Existing public CSA implementations materialize a large FP32 score tensor, which reaches 256 GB at sequence length S=65,536 with V4-Flash dimensions and exceeds single-GPU HBM. StreamIndex is a Triton-based implementation of the CSA pipeline built around a chunked partition-merge top-k driver that never materializes the full intermediate tensor. On an NVIDIA H200, StreamIndex processes V4-Flash dimensions up to S=1,048,576 with a peak HBM usage of 6.21 GB, a 32x extension of the operational range over the materializing path, which runs out of memory at S=65,536. StreamIndex maintains bit-exact set-overlap recall at smaller sequence lengths and achieves a mean recall of 1.0000 across its design-space sweeps.
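To make the memory pressure concrete, the following back-of-the-envelope sketch computes the size of a dense FP32 score tensor. The function name `dense_score_bytes`, the `slices` parameter, and the chunk width of 1,024 keys are hypothetical and chosen for illustration; the summary does not give the actual V4-Flash tensor shape. The arithmetic only shows that a single S x S FP32 slice at S=65,536 is already 16 GiB, so the quoted 256 GB corresponds to 16 such slice-equivalents, whatever combination of heads, batch, and key compression produces them.

```python
GIB = 2**30

def dense_score_bytes(num_queries, num_keys, slices=1, bytes_per_elem=4):
    """Bytes needed to hold every query-key score at once (FP32 by default).

    `slices` is a hypothetical stand-in for any leading dimensions
    (heads, batch); it is not taken from the source.
    """
    return slices * num_queries * num_keys * bytes_per_elem

S = 65_536

# One S x S FP32 slice: 2**16 * 2**16 * 4 bytes = 2**34 bytes.
one_slice_gib = dense_score_bytes(S, S) / GIB            # 16.0

# Sixteen such slices match the 256 GB figure quoted in the summary.
sixteen_slices_gib = dense_score_bytes(S, S, slices=16) / GIB   # 256.0

# A streaming path only keeps one chunk of keys live per step.
# Assumed chunk width: 1,024 keys (illustrative, not from the source).
chunk_gib = dense_score_bytes(S, 1_024) / GIB            # 0.25
```

The dense cost grows with the product of queries and keys, while the chunked cost grows with queries times a fixed chunk width, which is why a streaming top-k can keep peak memory bounded as S increases.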

Key takeaway

For AI engineers building large language models with sparse attention, StreamIndex addresses the memory limit of the CSA indexer. If you hit out-of-memory errors when scaling sequence length with DeepSeek-V4-style CSA, StreamIndex extends the operational range by 32x, processing sequences up to S=1,048,576 on a single NVIDIA H200 without sacrificing recall. Consider integrating this Triton implementation to stay within HBM constraints.

Key insights

StreamIndex enables memory-bounded Compressed Sparse Attention by avoiding full score tensor materialization.

Method

StreamIndex uses a chunked partition-merge top-k driver, written in Triton, to process Compressed Sparse Attention scores without materializing the entire intermediate tensor. It integrates with TileLang's pipelined attention kernel.
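The chunked partition-merge idea can be illustrated with a minimal pure-Python sketch for a single query. This is not the StreamIndex Triton kernel: the function names `dense_topk`, `streaming_topk`, and `recall`, the dot-product scoring, and the tie-breaking rule (lower index wins) are all assumptions made for illustration. The sketch only shows why scoring one chunk at a time, keeping each chunk's local top-k, and merging into a running top-k selects the same keys as scoring everything first.

```python
import heapq

def _score(q, key):
    # Assumed scoring function: a plain dot product.
    return sum(a * b for a, b in zip(q, key))

def dense_topk(q, keys, k):
    """Reference path: materialize every score, then select the top-k indices."""
    scores = [_score(q, key) for key in keys]
    # Ties are broken in favor of the lower index.
    return heapq.nlargest(k, range(len(keys)), key=lambda i: (scores[i], -i))

def streaming_topk(q, keys, k, chunk):
    """Chunked partition-merge: at most `chunk` scores plus k candidates are live."""
    best = []  # running candidates as (score, -index), never more than k entries
    for start in range(0, len(keys), chunk):
        block = keys[start:start + chunk]
        # Partition step: score one chunk and keep only its local top-k.
        local = [(_score(q, key), -(start + j)) for j, key in enumerate(block)]
        local = heapq.nlargest(k, local)
        # Merge step: combine with the running candidates and truncate to k.
        best = heapq.nlargest(k, best + local)
    return [-neg_index for _, neg_index in best]

def recall(selected, reference):
    """Set-overlap recall of `selected` against the dense reference."""
    return len(set(selected) & set(reference)) / len(reference)
```

The merge is exact because any key in the global top-k must also be in the top-k of its own chunk, so discarding the rest of each chunk never loses a winner. In this sketch the chunk size changes only the peak number of live scores, not the selected set.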

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.