Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

RTPurbo converts full-attention Large Language Models (LLMs) into highly sparse models with minimal adaptation, targeting the quadratic cost of long-context inference. It builds on three observations about the intrinsic sparsity of LLMs: only a small subset of attention heads requires full long-context processing, long-range retrieval is governed by a low-dimensional subspace, and the useful token budget depends on the query. Accordingly, RTPurbo retains the full KV cache only for "retrieval heads", introduces a lightweight 16-dimensional token indexer for sparse attention, and selects tokens with dynamic top-$p$ rather than fixed top-$k$. Sparsification takes only a few hundred training steps and preserves near-lossless accuracy while delivering substantial efficiency gains: up to a 9.36x prefill speedup at 1M context and about a 2.01x decode speedup on NVIDIA H20 GPUs.
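To make the contrast between dynamic top-$p$ and fixed top-$k$ concrete, here is a minimal sketch in plain Python. It is an illustration only, not RTPurbo's implementation: the function names, the toy scores, and the threshold `p = 0.9` are all assumptions chosen for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_select(scores, p):
    """Keep the smallest set of tokens whose cumulative attention mass reaches p.

    The number of kept tokens adapts to how peaked the distribution is.
    """
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)

def top_k_select(scores, k):
    """Keep a fixed number k of highest-scoring tokens, regardless of the query."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

# Hypothetical scores: one query attends sharply to a single token,
# another spreads its attention evenly over all six tokens.
peaked = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

peaked_kept = top_p_select(peaked, 0.9)  # a single token already covers the mass
flat_kept = top_p_select(flat, 0.9)      # every token is needed to reach the mass
fixed_kept = top_k_select(peaked, 3)     # always three tokens, needed or not
```

In this toy setting the peaked query keeps one token and the flat query keeps all six, whereas top-$k$ spends the same budget on both. That is the sense in which the token budget is query-dependent.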

Key takeaway

For AI engineers optimizing long-context LLM inference, RTPurbo offers a way to gain substantial prefill and decode speedups while keeping accuracy near-lossless. Its head-wise sparse attention and dynamic token selection convert an existing full-attention model into a sparse one after only a few hundred training steps. Because this avoids expensive native sparse pretraining, it is a practical route to deploying long-context LLMs more cost-effectively.

Key insights

Full-attention LLMs are intrinsically sparse and can be efficiently transformed into sparse models with minimal training.

Principles

Method

RTPurbo identifies retrieval heads via offline calibration, uses low-rank projections for efficient token indexing, and applies dynamic top-$p$ selection. Training follows a two-stage pipeline that combines KL divergence minimization with self-distillation.
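The low-rank indexing idea and the KL objective can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: the summary does not say which distributions the KL term compares, so the sketch assumes the indexer's token distribution is matched to the full-attention distribution. The projection matrices `w_q` and `w_k`, the dimensions (4 projected down to 2, where the paper's indexer uses 16), and all helper names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(vec, matrix):
    """Project a d-dimensional vector to r dimensions using an r x d matrix."""
    return [dot(row, vec) for row in matrix]

def full_scores(query, keys):
    """Full-dimensional attention logits (scaling factor omitted for brevity)."""
    return [dot(query, k) for k in keys]

def index_scores(query, keys, w_q, w_k):
    """Cheap logits computed in the low-dimensional indexer subspace."""
    q_low = project(query, w_q)
    return [dot(q_low, project(k, w_k)) for k in keys]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical projections that keep only the first two of four dimensions.
w_q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
keys = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]

# Query whose signal lies inside the retained subspace: the indexer agrees.
q_in = [1.0, 0.0, 0.0, 0.0]
loss_in = kl_divergence(softmax(full_scores(q_in, keys)),
                        softmax(index_scores(q_in, keys, w_q, w_k)))

# Query whose signal lies outside the subspace: the indexer misses it.
q_out = [0.0, 0.0, 1.0, 0.0]
loss_out = kl_divergence(softmax(full_scores(q_out, keys)),
                         softmax(index_scores(q_out, keys, w_q, w_k)))
```

Under these assumptions `loss_in` is zero and `loss_out` is positive. Minimizing such a loss would push the projections toward the subspace that actually carries the retrieval signal, which is one plausible reading of why a small indexer can suffice.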

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.