You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Cross-layer sparse attention (CLSA) is an architecture for long-context large language models (LLMs) that targets three inference bottlenecks: decoding efficiency, KV-cache storage, and pre-filling. Built upon KV-sharing designs like YOCO, CLSA shares routing indices across the cross-decoder layers. Instead of recomputing expensive token-level top-k selection in every layer, a single indexer performs the selection once, and all cross-decoder layers reuse the resulting index. This amortizes routing overhead while preserving the fine-grained selectivity of token-level sparse attention. In the reported experiments, CLSA achieves up to 7.6× decoding speedup and 17.1× overall throughput improvement at 128K context. Model quality is maintained: CLSA performs comparably to or better than dense baselines on benchmarks such as ARC-Challenge, GSM8K, and DROP.

Key takeaway

For ML engineers and AI architects deploying long-context LLMs, CLSA targets three persistent inference bottlenecks at once. It is worth evaluating if your current systems are limited by decoding speed, KV-cache memory, or pre-filling efficiency. The paper reports up to 17.1× overall throughput improvement at 128K context, with quality comparable to or better than dense baselines on ARC-Challenge, GSM8K, and DROP. Because one design addresses both efficiency and accuracy, it is a strong candidate for optimizing production LLM deployments.

Key insights

Cross-layer sparse attention shares a single routing index across decoder layers to amortize top-k cost and boost LLM inference efficiency.
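To make the amortization concrete, here is a toy count of how many context tokens the routing step has to score per decoded token. It is a back-of-the-envelope sketch, not a measurement from the paper: the function name and the layer count of 8 are hypothetical, and the model ignores the cost of the attention itself.

```python
def routing_scores_per_decode_step(context_len, num_cross_layers, shared):
    """Count indexer score computations for one decoded token.

    Assumption: one indexer pass scores every cached token exactly once.
    With per-layer routing, each cross-decoder layer runs its own pass;
    with a shared index, a single pass serves all layers.
    """
    indexer_passes = 1 if shared else num_cross_layers
    return indexer_passes * context_len


context_len = 128 * 1024   # 128K context, as in the reported setting
num_cross_layers = 8       # hypothetical layer count, for illustration only

per_layer = routing_scores_per_decode_step(context_len, num_cross_layers, shared=False)
shared = routing_scores_per_decode_step(context_len, num_cross_layers, shared=True)
print(per_layer, shared, per_layer // shared)
```

Under this model the routing work shrinks by a factor equal to the number of layers that reuse the index. The end-to-end speedups quoted above depend on much more than this one term.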

Method

CLSA extends YOCO by adding a single-head indexer to the self-decoder, which computes a token-level top-k routing index once. The cross-decoder layers then reuse both this index and the shared KV cache. Training uses multi-layer distillation.
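The index-once, reuse-everywhere pattern described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the indexer is reduced to a dot-product scorer, and the per-layer queries are passed in directly instead of being produced by real decoder layers.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_routing_index(index_query, index_keys, k):
    """Single-head indexer (assumed dot-product scoring): score every
    cached token once and keep the positions of the k highest scores."""
    scores = [dot(index_query, key) for key in index_keys]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:min(k, len(scores))])


def sparse_attention(query, keys, values, routing_index):
    """Softmax attention restricted to the tokens in routing_index."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in routing_index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, routing_index)) / total
        for d in range(len(values[0]))
    ]


def decode_step(layer_queries, index_query, index_keys, shared_keys, shared_values, k):
    """One decoding step: route once, then reuse the index in every layer."""
    routing_index = build_routing_index(index_query, index_keys, k)
    outputs = [
        sparse_attention(q, shared_keys, shared_values, routing_index)
        for q in layer_queries  # every layer reads the same index and KV cache
    ]
    return routing_index, outputs


# Toy data: four cached tokens, two cross-decoder layers, top-2 routing.
index_keys = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 2.0]]
shared_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
shared_values = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [9.0, 9.0]]
layer_queries = [[0.0, 0.0], [1.0, 0.0]]

routing_index, outputs = decode_step(
    layer_queries, [1.0, 0.0], index_keys, shared_keys, shared_values, k=2
)
print(routing_index, outputs)
```

In this toy setup the indexer selects tokens 0 and 2, and both layers attend only to those two entries of the shared cache, so the unselected values never contribute. In a real model each layer's query would come from the previous layer's output, and the indexer would be learned, for example through the distillation mentioned above.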

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.