Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

RTPurbo converts full-attention Large Language Models (LLMs) into highly sparse models with minimal adaptation, targeting the quadratic cost of long-context inference. It builds on three observations about the intrinsic sparsity of LLMs: only a small subset of attention heads requires full long-context processing, long-range retrieval is governed by a low-dimensional subspace, and the useful token budget is query-dependent. Accordingly, RTPurbo retains the full KV cache only for "retrieval heads" and adds a lightweight 16-dimensional token indexer for sparse attention, selecting tokens with dynamic top-$p$ rather than a fixed top-$k$. Sparsification takes only a few hundred training steps and preserves near-lossless accuracy, while delivering up to a 9.36x prefill speedup at 1M context and about a 2.01x decode speedup on NVIDIA H20 GPUs.

Key takeaway

For AI engineers optimizing long-context LLM inference, RTPurbo offers a way to obtain substantial speedups with near-lossless accuracy. Its head-wise sparse attention and dynamic token selection convert an existing full-attention model into a sparse one in a few hundred training steps. Because this avoids expensive native sparse pretraining, it is a practical route to deploying long-context LLMs more cost-effectively.

Key insights

Full-attention LLMs are intrinsically sparse and can be efficiently transformed into sparse models with minimal training.

Principles

Attention heads specialize into retrieval and local roles.
Long-range retrieval operates within a low-dimensional subspace.
The optimal token budget varies from query to query.
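
To make the query-dependent budget concrete, here is a minimal sketch of top-$p$ selection over attention-like scores. It is an illustration only, not the paper's implementation: the function names, the threshold p=0.9, and the two toy score vectors are assumptions chosen for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(scores, p=0.9):
    """Return indices of the smallest token set whose softmax mass reaches p.

    Hypothetical helper: p=0.9 is an illustrative threshold, not a value
    taken from the source.
    """
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected, mass = [], 0.0
    for i in order:
        selected.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(selected)

# A peaked query concentrates its mass on one token; a flat query does not.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.9]

print(len(top_p_select(peaked)), "token(s) kept for the peaked query")
print(len(top_p_select(flat)), "token(s) kept for the flat query")
```

A fixed top-$k$ would keep the same number of tokens in both cases, which either wastes budget on the peaked query or drops useful mass on the flat one.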

Method

RTPurbo identifies retrieval heads via offline calibration, uses low-rank projections for efficient token indexing, and applies dynamic top-$p$ selection. It is trained with a two-stage pipeline that uses KL divergence minimization and self-distillation.
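
The low-rank indexing and KL objective described above might look roughly like the sketch below. It assumes that the indexer projects queries and keys into 16 dimensions and that the KL term compares the full-attention distribution (teacher) with the indexer's distribution (student); the summary does not spell out these details, so the projection matrices `w_q` and `w_k`, the head dimension of 64, and the direction of the KL term are all hypothetical. No training is performed here; only the loss for one random query is computed.

```python
import math
import random

INDEX_DIM = 16  # indexer width quoted in the summary
HEAD_DIM = 64   # hypothetical head dimension for this example

def project(x, w):
    """Project vector x (length d) with matrix w (d rows, r columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

rng = random.Random(0)

def rand_matrix(rows, cols, scale):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

query = [rng.gauss(0.0, 1.0) for _ in range(HEAD_DIM)]
keys = rand_matrix(32, HEAD_DIM, 1.0)                  # 32 cached tokens
w_q = rand_matrix(HEAD_DIM, INDEX_DIM, HEAD_DIM ** -0.5)  # untrained, random
w_k = rand_matrix(HEAD_DIM, INDEX_DIM, HEAD_DIM ** -0.5)  # untrained, random

# Teacher: full-dimensional attention distribution over the cached tokens.
teacher = softmax([dot(query, k) / math.sqrt(HEAD_DIM) for k in keys])

# Student: the same distribution estimated in the 16-dimensional index space.
q_idx = project(query, w_q)
k_idx = [project(k, w_k) for k in keys]
student = softmax([dot(q_idx, k) / math.sqrt(INDEX_DIM) for k in k_idx])

loss = kl_divergence(teacher, student)
print(f"KL(teacher || student) with untrained projections: {loss:.4f}")
```

Minimizing such a loss over `w_q` and `w_k` would push the cheap 16-dimensional scores to rank tokens the way full attention does, which is what lets the indexer pick candidates without touching full-width keys.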

In practice

Retain the full KV cache only for identified retrieval heads.
Use 16-dimensional indexers for efficient token retrieval.
Implement dynamic top-$p$ selection for adaptive sparsity.
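
As a rough illustration of what head-wise cache retention buys, the arithmetic below compares a full KV cache with one where only retrieval heads keep every token. Everything numeric here is an assumption for the example: the model shape, the 25% retrieval-head fraction, and the idea that the remaining heads keep a fixed local window of 4096 tokens are not taken from the source, which only says that a small subset of heads needs the full cache.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   retrieval_fraction, local_window, bytes_per_elem=2):
    """Return (full_bytes, sparse_bytes) for a hypothetical model.

    Assumes retrieval heads cache all tokens and every other head caches
    only the most recent `local_window` tokens.
    """
    per_token_per_head = 2 * head_dim * bytes_per_elem  # one K and one V vector
    heads = num_layers * num_kv_heads
    retrieval_heads = round(heads * retrieval_fraction)
    local_heads = heads - retrieval_heads

    full = heads * context_len * per_token_per_head
    sparse = (retrieval_heads * context_len
              + local_heads * min(local_window, context_len)) * per_token_per_head
    return full, sparse

# Hypothetical 32-layer model with 8 KV heads of width 128, fp16 cache, 1M tokens.
full, sparse = kv_cache_bytes(
    num_layers=32, num_kv_heads=8, head_dim=128, context_len=1_000_000,
    retrieval_fraction=0.25, local_window=4096,
)
print(f"full cache: {full / 1e9:.1f} GB, head-wise sparse cache: {sparse / 1e9:.1f} GB")
```

Under these assumptions the cache size is dominated by the retrieval heads, so the saving scales with how few heads calibration marks as retrieval heads.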

Topics

RTPurbo
Sparse Attention
Long-Context LLMs
Retrieval Heads
Dynamic Top-P Selection

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.