Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
Summary
RTPurbo converts full-attention Large Language Models (LLMs) into highly sparse models with minimal adaptation, targeting the quadratic attention cost that bottlenecks long-context inference. It builds on three observations about the intrinsic sparsity of LLMs: only a small subset of attention heads needs full long-context processing, long-range retrieval is governed by a low-dimensional subspace, and the useful token budget depends on the query. Accordingly, RTPurbo retains the full KV cache only for "retrieval heads", adds a lightweight 16-dimensional token indexer for sparse attention, and selects tokens with dynamic top-$p$ rather than a fixed top-$k$. Sparsification takes only a few hundred training steps and preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36x prefill speedup at 1M context and about a 2.01x decode speedup on NVIDIA H20 GPUs.
Key takeaway
For AI engineers optimizing long-context LLM inference, RTPurbo offers a way to obtain significant speedups at near-lossless accuracy. Its head-wise sparse attention and dynamic token selection turn an existing full-attention model into an efficient sparse one after only a few hundred training steps. Because this avoids expensive native sparse pretraining, it is a practical route to deploying long-context LLMs more cost-effectively.
Key insights
Full-attention LLMs are intrinsically sparse and can be efficiently transformed into sparse models with minimal training.
Principles
- Attention heads specialize into retrieval and local roles.
- Long-range retrieval operates within a low-dimensional subspace.
- Optimal token budget is dynamically query-dependent.
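To make the third principle concrete, here is a minimal sketch of top-$p$ token selection: it keeps the smallest set of tokens whose softmax mass reaches a threshold $p$, so the number of tokens kept varies with the score distribution. The function name `top_p_select`, the threshold of 0.9, and the toy score vectors are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def top_p_select(scores: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Return sorted indices of the smallest token set whose softmax mass reaches p."""
    # Numerically stable softmax over the per-token scores.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Rank tokens from most to least probable and accumulate their mass.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    # First rank at which the accumulated mass reaches p (inclusive).
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    cutoff = min(cutoff, len(order))  # guard against rounding when p is close to 1
    return np.sort(order[:cutoff])

# Hypothetical scores for two queries over the same 8 cached tokens.
peaked = np.array([10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
flat = np.zeros(8)
budget_peaked = len(top_p_select(peaked))
budget_flat = len(top_p_select(flat))
```

With these toy inputs the peaked query keeps one token and the flat query keeps all eight, whereas a fixed top-$k$ would return the same count for both.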
Method
RTPurbo identifies retrieval heads via offline calibration, uses low-rank projections for efficient token indexing, and applies dynamic top-$p$ selection. It employs a two-stage training pipeline with KL divergence minimization and self-distillation.
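As a rough illustration of KL divergence minimization, the sketch below computes the KL divergence between two token distributions. This summary does not say which distributions RTPurbo compares; the sketch assumes, purely for illustration, a "teacher" distribution from full attention and a "student" distribution from a cheaper scorer. The names `softmax`, `kl_divergence`, and the example logits are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - x.max())
    return z / z.sum()

def kl_divergence(teacher_logits: np.ndarray, student_logits: np.ndarray,
                  eps: float = 1e-12) -> float:
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    # eps keeps the logarithms finite if a probability underflows to zero.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Assumed example: the teacher concentrates on the first token,
# while the student is uniform over three tokens.
teacher = np.array([2.0, 0.0, -1.0])
student = np.array([0.0, 0.0, 0.0])
loss = kl_divergence(teacher, student)
```

The loss is zero when both distributions match and positive otherwise, which is what makes it usable as a training signal for aligning a student with a teacher.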
In practice
- Retain full KV cache only for identified retrieval heads.
- Use 16-dimensional indexers for efficient token retrieval.
- Implement dynamic top-$p$ selection for adaptive sparsity.
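The indexer idea can be sketched as follows: score every cached token with 16-dimensional projections of the query and keys, then run full-dimensional attention over the selected tokens only. This is a minimal sketch under stated assumptions: the projection matrices `w_q` and `w_k` are random stand-ins for learned weights, the head dimension of 64 and cache length of 128 are arbitrary, and a fixed budget of 8 tokens is used only to keep the example short (the summary describes top-$p$ selection instead).

```python
import numpy as np

INDEX_DIM = 16  # indexer width stated in the summary

def index_scores(query, keys, w_q, w_k):
    """Score cached tokens using low-dimensional projections of query and keys."""
    q_low = query @ w_q   # shape: (INDEX_DIM,)
    k_low = keys @ w_k    # shape: (num_tokens, INDEX_DIM)
    return k_low @ q_low / np.sqrt(INDEX_DIM)

def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the token indices in `selected`."""
    k_sel, v_sel = keys[selected], values[selected]
    logits = k_sel @ query / np.sqrt(query.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_sel

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
head_dim, num_tokens = 64, 128
query = rng.normal(size=head_dim)
keys = rng.normal(size=(num_tokens, head_dim))
values = rng.normal(size=(num_tokens, head_dim))
w_q = rng.normal(size=(head_dim, INDEX_DIM))  # stand-in for a learned projection
w_k = rng.normal(size=(head_dim, INDEX_DIM))  # stand-in for a learned projection

scores = index_scores(query, keys, w_q, w_k)
selected = np.argsort(-scores)[:8]  # fixed budget here for brevity
output = sparse_attention(query, keys, values, selected)
```

In this sketch the scoring pass costs 16 multiply-adds per token instead of 64, and the full-dimensional attention touches only the selected tokens.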
Topics
- RTPurbo
- Sparse Attention
- Long-Context LLMs
- Retrieval Heads
- Dynamic Top-P Selection
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.