You Only Index Once: Cross-Layer Sparse Attention with Shared Routing
Summary
Cross-layer sparse attention (CLSA) is an architecture for long-context large language models (LLMs) that targets three inference bottlenecks: decoding efficiency, KV-cache storage, and pre-filling. Built upon KV-sharing designs like YOCO, CLSA shares routing indices across cross-decoder layers. Instead of recomputing expensive token-level top-k selection in each layer, a single indexer performs the selection once and every cross-decoder layer reuses the resulting index. This amortizes routing overhead while preserving the fine-grained selectivity of token-level sparse attention. In the reported experiments, CLSA achieves up to 7.6× decoding speedup and 17.1× overall throughput improvement at 128K context. It maintains model quality, performing comparably to or better than dense baselines on benchmarks such as ARC-Challenge, GSM8K, and DROP.
Key takeaway
For ML engineers and AI architects deploying long-context LLMs, CLSA is worth evaluating, particularly if your current systems are limited by decoding speed, KV-cache memory, or pre-filling efficiency. It reports up to 17.1× overall throughput improvement at 128K context while performing comparably to or better than dense baselines on the reported benchmarks. Because one design addresses all three bottlenecks without trading away accuracy, it is a strong candidate for optimizing production LLM deployments.
Key insights
Cross-layer sparse attention shares a single routing index across decoder layers to amortize top-k cost and boost LLM inference efficiency.
Principles
- Amortize expensive operations by sharing results across layers.
- Token-level sparsity can maintain quality if routing overhead is managed.
- Jointly optimize pre-filling, KV-cache, and decoding for unified efficiency.
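The amortization principle can be illustrated with a back-of-the-envelope count. The sketch below is only a simplified cost model, not the paper's analysis: it counts how many cached tokens must be scored for routing in one decode step, and it ignores the top-k selection itself and the indexer's dimensions. The function name `routing_score_ops` and the layer count of 16 are hypothetical values chosen for illustration.

```python
def routing_score_ops(context_len, n_sparse_layers, shared):
    """Token-scoring operations spent on routing in one decode step.

    Simplified model: each indexer pass scores every cached token once.
    """
    indexer_passes = 1 if shared else n_sparse_layers
    return indexer_passes * context_len


context_len = 128 * 1024   # assumes "128K" means 131,072 tokens
n_sparse_layers = 16       # hypothetical layer count, not from the source

per_layer_cost = routing_score_ops(context_len, n_sparse_layers, shared=False)
shared_cost = routing_score_ops(context_len, n_sparse_layers, shared=True)

print("per-layer routing:", per_layer_cost)
print("shared routing:   ", shared_cost)
print("reduction factor: ", per_layer_cost // shared_cost)
```

Under this model the routing cost shrinks by a factor equal to the number of layers that share the index, which is why the saving grows with model depth.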
Method
CLSA extends YOCO by adding a single-head indexer to the self-decoder, which computes a token-level top-k routing index once. Cross-decoder layers then reuse this index together with the shared KV cache. Training uses multi-layer distillation.
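The "index once, reuse everywhere" idea can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the indexer is modeled as a single dot-product scorer over per-token index keys, every layer's query is supplied up front rather than derived from the previous layer's output, and training (distillation) is omitted.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_routing_index(index_query, index_keys, k):
    """Single-head indexer: score every cached token once, keep the top-k.

    Returns the selected token positions in ascending order.
    """
    scores = [dot(index_query, key) for key in index_keys]
    k = min(k, len(scores))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected token positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


def decode_step(layer_queries, index_query, index_keys, shared_keys, shared_values, k):
    """One decode step: route once, then reuse the index in every layer."""
    selected = build_routing_index(index_query, index_keys, k)  # computed once
    outputs = [
        sparse_attention(q, shared_keys, shared_values, selected)  # shared KV cache
        for q in layer_queries
    ]
    return selected, outputs


# Toy usage with random vectors (sizes are arbitrary).
random.seed(0)
n_tokens, dim, n_layers, k = 32, 8, 4, 5


def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


index_keys = [rand_vec() for _ in range(n_tokens)]
shared_keys = [rand_vec() for _ in range(n_tokens)]
shared_values = [rand_vec() for _ in range(n_tokens)]
layer_queries = [rand_vec() for _ in range(n_layers)]

selected, outputs = decode_step(
    layer_queries, rand_vec(), index_keys, shared_keys, shared_values, k
)
print("shared routing index:", selected)
print("layers served by one index:", len(outputs))
```

The point of the structure is that `build_routing_index` runs once per decode step regardless of how many layers consume its result; each layer only pays for attention over the `k` selected tokens.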
In practice
- Integrate CLSA into vLLM for substantial inference speedups.
- Consider CLSA for long-context LLMs requiring high throughput.
- Use 2048 selected tokens for a favorable quality-efficiency trade-off.
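To put the 2048-token setting in perspective, the short calculation below shows what fraction of a 128K-token cache each sparse layer would attend to per decode step. It assumes "128K" means 131,072 tokens; the helper name `attended_fraction` is hypothetical.

```python
def attended_fraction(selected_tokens, context_len):
    """Fraction of cached tokens a sparse layer attends to per decode step."""
    return min(selected_tokens, context_len) / context_len


CONTEXT_128K = 128 * 1024  # assumption: "128K" = 131,072 tokens
SELECTED = 2048

fraction = attended_fraction(SELECTED, CONTEXT_128K)
print(f"attended fraction: {fraction:.4%}")
print(f"reduction in attended tokens: {CONTEXT_128K // SELECTED}x")
```

For contexts shorter than the selection budget the fraction saturates at 1.0, so the sparsity benefit only appears once the context exceeds the number of selected tokens.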
Topics
- Cross-Layer Sparse Attention
- LLM Inference
- KV Cache Optimization
- Sparse Attention Routing
- Long-Context LLMs
- Decoding Throughput
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.