You Only Index Once: Cross-Layer Sparse Attention with Shared Routing
Summary
Cross-layer sparse attention (CLSA), built on KV-sharing architectures like YOCO, addresses long-context inference bottlenecks in large language models. It shares both the KV cache and the routing index across decoder layers: a single indexer performs token-level top-k selection once, and the resulting index is reused across layers, which preserves fine-grained selectivity while amortizing routing overhead. CLSA jointly improves pre-filling, KV-cache storage, and long-context decoding. The reported experiments show up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context, and the work is presented as an architectural approach to both LLM quality and inference efficiency.
Key takeaway
For machine learning engineers optimizing large language model inference, Cross-layer Sparse Attention (CLSA) is worth evaluating when long-context decoding is the bottleneck. The architecture jointly improves pre-filling, KV-cache storage, and decoding, with reported gains of up to 7.6x decoding speedup and 17.1x overall throughput at 128K context.
Key insights
Cross-layer sparse attention (CLSA) shares KV cache and routing indices across layers to improve LLM inference efficiency and quality.
Principles
- Long-context LLM inference is constrained by decoding efficiency.
- Structured block-sparse methods are fast but incur quality loss.
- Token-level sparse methods are accurate, but their routing overhead is expensive.
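The trade-off between block-level and token-level selection can be seen on a toy score vector. The sketch below is an illustration under assumptions, not the selection rule of any particular method: the function names, the mean-score block ranking, and the example scores are all hypothetical.

```python
def token_top_k(scores, budget):
    """Keep the `budget` individual tokens with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def block_top_k(scores, block_size, budget):
    """Keep whole blocks, ranked by mean score, until the token budget is filled.

    Assumption: blocks are contiguous and ranked by their mean score.
    """
    n_blocks = len(scores) // block_size
    block_means = [
        sum(scores[b * block_size:(b + 1) * block_size]) / block_size
        for b in range(n_blocks)
    ]
    ranked = sorted(range(n_blocks), key=lambda b: block_means[b], reverse=True)
    kept = sorted(ranked[:budget // block_size])
    return [b * block_size + j for b in kept for j in range(block_size)]


# Hypothetical relevance scores for 16 cached tokens (4 blocks of 4).
# Two highly relevant tokens (positions 0 and 9) sit in otherwise weak blocks.
scores = [0.9, 0.0, 0.0, 0.0,
          0.1, 0.1, 0.1, 0.1,
          0.0, 0.8, 0.0, 0.0,
          0.3, 0.3, 0.3, 0.3]

token_keep = token_top_k(scores, budget=4)
block_keep = block_top_k(scores, block_size=4, budget=4)

token_mass = sum(scores[i] for i in token_keep)
block_mass = sum(scores[i] for i in block_keep)

print("token-level keeps:", token_keep, "score mass:", round(token_mass, 2))
print("block-level keeps:", block_keep, "score mass:", round(block_mass, 2))
```

With the same budget of four tokens, block-level selection keeps one uniformly mediocre block and drops the two most relevant tokens, while token-level selection keeps them. The cost runs the other way: token-level selection has to score all 16 positions, whereas block-level selection only ranks 4 block summaries.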
Method
CLSA uses a single indexer for token-level top-k selection, reusing the index across decoder layers to amortize routing overhead in KV-sharing architectures.
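The idea of routing once and reusing the index can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the dot-product indexer, the separate `index_keys`, and every function and variable name here are hypothetical, and real systems would use batched tensor kernels rather than Python lists.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(scores, k):
    """Positions of the k largest scores, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, index):
    """Softmax attention over only the cached positions listed in `index`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, index)) / total
        for d in range(len(values[0]))
    ]


def decode_step(layer_queries, index_query, index_keys, keys, values, k):
    """Route once with the indexer, then reuse the index in every layer."""
    # Assumption: the indexer scores each cached token with a dot product.
    scores = [dot(index_query, ik) for ik in index_keys]  # single indexer pass
    shared_index = top_k_indices(scores, k)
    outputs = [
        sparse_attention(q, keys, values, shared_index) for q in layer_queries
    ]
    return shared_index, outputs


def random_vector(rng, dim):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Toy setup: 64 cached tokens, 4 layers sharing one KV cache, budget of 8 tokens.
rng = random.Random(0)
n_tokens, dim, n_layers, k = 64, 8, 4, 8
keys = [random_vector(rng, dim) for _ in range(n_tokens)]
values = [random_vector(rng, dim) for _ in range(n_tokens)]
index_keys = [random_vector(rng, dim) for _ in range(n_tokens)]
index_query = random_vector(rng, dim)
layer_queries = [random_vector(rng, dim) for _ in range(n_layers)]

shared_index, outputs = decode_step(
    layer_queries, index_query, index_keys, keys, values, k
)
print("shared index:", shared_index)
print("layers served by one routing pass:", len(outputs))
```

In this toy setup the indexer scores the 64 cached tokens once per decoding step. Routing separately in each of the 4 layers would score them 4 times, so the routing cost per step shrinks by the number of layers that reuse the index, while each layer still attends to individually selected tokens.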
In practice
- Apply CLSA to improve LLM pre-filling.
- Reduce KV-cache storage requirements.
- Accelerate long-context decoding by up to 7.6x.
Topics
- Cross-layer Sparse Attention
- LLM Inference
- KV-sharing Architectures
- Long-context Decoding
- Decoding Efficiency
- Sparse Attention
Best for: AI Engineer, Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.