You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Cross-layer sparse attention (CLSA) targets the long-context inference bottleneck in large language models and builds on KV-sharing architectures such as YOCO. It shares both the KV cache and the routing index across decoder layers: a single indexer performs token-level top-k selection once, and the resulting index is reused across layers. This preserves fine-grained, token-level selectivity while amortizing routing overhead. CLSA jointly improves pre-filling, KV-cache storage, and long-context decoding. The reported experiments show up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context, positioning CLSA as an architectural approach to both model quality and inference efficiency.

Key takeaway

For machine learning engineers optimizing LLM inference, Cross-layer Sparse Attention (CLSA) is worth evaluating when long-context decoding is the bottleneck. The architecture jointly improves pre-filling, KV-cache storage, and decoding, with reported gains of up to 7.6x decoding speedup and 17.1x overall throughput at 128K context.

Key insights

Cross-layer sparse attention (CLSA) shares KV cache and routing indices across layers to improve LLM inference efficiency and quality.

Principles

Long-context LLM inference is constrained by decoding efficiency.
Structured block-sparse methods are fast but lose quality.
Token-sparse methods are more accurate, but their routing overhead is expensive.
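
The trade-off between block-level and token-level selection can be illustrated with a toy example. This is a minimal sketch with hand-picked scores, not data from the article: `scores`, `budget`, and `block_size` are hypothetical, and "score mass kept" is used here only as a rough stand-in for selection quality. Both strategies keep the same number of tokens.

```python
import numpy as np

# Hypothetical relevance scores for 16 cached tokens (illustrative only).
scores = np.array([0.9, 0.0, 0.0, 0.0,
                   0.0, 0.8, 0.0, 0.0,
                   0.3, 0.3, 0.3, 0.3,
                   0.0, 0.0, 0.7, 0.0])
budget = 4       # number of tokens each strategy may keep
block_size = 4   # contiguous block width for the block-sparse variant

# Token-level top-k: keep the highest-scoring tokens wherever they sit.
token_idx = np.argsort(-scores)[:budget]
token_mass = scores[token_idx].sum()

# Block-level selection: keep the single contiguous block with the largest sum.
block_sums = scores.reshape(-1, block_size).sum(axis=1)
best_block = int(np.argmax(block_sums))
block_idx = np.arange(best_block * block_size, (best_block + 1) * block_size)
block_mass = scores[block_idx].sum()

total = scores.sum()
print(f"token top-k keeps     {token_mass / total:.2f} of the score mass")
print(f"block selection keeps {block_mass / total:.2f} of the score mass")
```

With these scores the important tokens are scattered, so token-level selection keeps about 0.75 of the score mass while the best single block keeps about 0.33. The cost is that token-level selection has to score every token individually, which is the routing overhead the third principle refers to.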

Method

CLSA uses a single indexer for token-level top-k selection, reusing the index across decoder layers to amortize routing overhead in KV-sharing architectures.
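
The idea of routing once and reusing the index can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the article's implementation: the function names (`route_once`, `sparse_attention`), the dot-product indexer, the separate low-dimensional `index_keys`, and all sizes are hypothetical, and the sketch covers a single decoding step with one attention head.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def route_once(index_query, index_keys, k):
    """Score every cached token once and return the top-k token positions."""
    scores = index_keys @ index_query              # shape: (seq_len,)
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]      # unordered top-k positions
    return np.sort(top)

def sparse_attention(q, keys, values, token_idx):
    """Attend only to the selected tokens of the shared KV cache."""
    k_sel = keys[token_idx]                        # shape: (k, d)
    v_sel = values[token_idx]                      # shape: (k, d)
    weights = softmax(q @ k_sel.T / np.sqrt(q.shape[-1]))
    return weights @ v_sel                         # shape: (d,)

rng = np.random.default_rng(0)
seq_len, d, d_index, n_layers, top_k = 1024, 64, 16, 4, 32

# One KV cache shared by all layers in this sketch (the KV-sharing assumption).
shared_keys = rng.standard_normal((seq_len, d))
shared_values = rng.standard_normal((seq_len, d))

# Hypothetical lightweight indexer inputs.
index_keys = rng.standard_normal((seq_len, d_index))
index_query = rng.standard_normal(d_index)

# Routing runs once per decoding step ...
token_idx = route_once(index_query, index_keys, top_k)

# ... and every layer reuses the same index with its own query.
layer_queries = rng.standard_normal((n_layers, d))
outputs = [sparse_attention(q, shared_keys, shared_values, token_idx)
           for q in layer_queries]
```

In this sketch the indexer scores all `seq_len` tokens once, and each of the `n_layers` layers attends over only `top_k` tokens. Running a separate indexer in every layer would repeat the full-length scoring `n_layers` times, which is the overhead that sharing the index amortizes.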

In practice

Apply CLSA to improve LLM pre-filling.
Reduce KV-cache storage requirements.
Accelerate long-context decoding by up to 7.6x.
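
To get a feel for why KV sharing matters for storage at 128K context, here is a back-of-the-envelope estimate. The model dimensions below (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are hypothetical and do not come from the article; the sketch also assumes the simplest case, where one cache replaces all per-layer caches and any other per-layer state is ignored.

```python
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, n_cached_layers, bytes_per_elem=2):
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * seq_len * n_kv_heads * head_dim * n_cached_layers * bytes_per_elem

GIB = 1024 ** 3
seq_len = 128 * 1024  # 128K tokens

# Hypothetical baseline: every one of 32 layers keeps its own cache.
per_layer_cache = kv_cache_bytes(seq_len, n_kv_heads=8, head_dim=128, n_cached_layers=32)

# Hypothetical KV-sharing case: a single cache is stored and reused.
shared_cache = kv_cache_bytes(seq_len, n_kv_heads=8, head_dim=128, n_cached_layers=1)

print(f"per-layer caches: {per_layer_cache / GIB:.1f} GiB")
print(f"shared cache:     {shared_cache / GIB:.1f} GiB")
```

Under these assumptions the per-layer layout needs 16 GiB per sequence and the shared layout needs 0.5 GiB. Real savings depend on how many layers share a cache in the actual architecture.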

Topics

Cross-layer Sparse Attention
LLM Inference
KV-sharing Architectures
Long-context Decoding
Decoding Efficiency
Sparse Attention

Best for: AI Engineer, Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.