ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

2026-05-19 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

ProxyKV is a cross-model proxy pruning framework that targets the Key-Value (KV) cache memory bottleneck in long-context Large Language Model (LLM) inference. It offloads importance scoring to a lightweight Small-Model Proxy from the same model family, which runs asynchronously alongside the Large-Model Target and avoids the prohibitive prefilling overhead of high-precision reconstruction methods such as KVZip. To bridge architectural differences between the two models, ProxyKV introduces a HybridAxialMapper, which separates temporal feature extraction from cross-head alignment, and a Multi-Granularity Hybrid Loss that emphasizes relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and Qwen-3 families (7B to 32B parameters), ProxyKV matches KVZip's accuracy on LongBench, SCBench, and RULER, achieves up to a 3.21x prefilling speedup on Llama-3.1-8B, and sustains speedups at contexts up to 170k tokens on Qwen-2.5-7B.
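
The asynchronous split described above can be illustrated with a minimal sketch. It is not ProxyKV's implementation: `proxy_importance_scores` and `target_prefill` are hypothetical stand-ins for the small proxy (plus mapper) and the large target, and Python threads stand in for separate GPUs or CUDA streams. The point is only that scoring and prefill overlap, and that the scores are then used to keep a fraction of the KV entries.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def proxy_importance_scores(tokens):
    # Hypothetical stand-in for the small proxy model: one score per token.
    time.sleep(0.05)  # simulated compute
    return [float(len(t)) for t in tokens]


def target_prefill(tokens):
    # Hypothetical stand-in for the large target model's prefill:
    # one placeholder KV entry per token.
    time.sleep(0.05)  # simulated compute
    return [("kv", t) for t in tokens]


def prefill_with_proxy_pruning(tokens, keep_ratio):
    # Run proxy scoring and target prefill concurrently, then prune.
    with ThreadPoolExecutor(max_workers=2) as pool:
        scores_future = pool.submit(proxy_importance_scores, tokens)
        cache_future = pool.submit(target_prefill, tokens)
        scores = scores_future.result()
        cache = cache_future.result()

    keep = max(1, round(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_positions = sorted(ranked[:keep])  # restore original token order
    return [cache[i] for i in kept_positions]


tokens = ["a", "long", "context", "of", "tokens", "here"]
pruned = prefill_with_proxy_pruning(tokens, keep_ratio=0.5)
```

With these toy scores (token length), half of the six entries survive: those for "long", "context", and "tokens".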

Key takeaway

For NLP engineers and research scientists optimizing long-context LLM inference, ProxyKV addresses the KV cache memory wall. Its asynchronous proxy pruning framework delivers prefilling speedups of up to 3.21x while matching KVZip's accuracy, and the speedups hold at context lengths up to 170k tokens. This makes large models cheaper to deploy for context-heavy applications, because it reduces latency and GPU memory pressure during the critical prefill phase.

Key insights

ProxyKV uses an asynchronous small-model proxy to score and prune KV cache entries, matching KVZip's accuracy while speeding up prefill by up to 3.21x.

Principles

Offload computationally intensive tasks to asynchronous, lightweight proxies.
Disentangle temporal and head-axis feature extraction for cross-model compatibility.
Optimize for ranking consistency rather than rigid value regression in pruning.
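
The third principle can be made concrete with a small comparison. The sketch below uses a generic pairwise logistic ranking loss, which is an assumption chosen for illustration and not the paper's Multi-Granularity Hybrid Loss. It shows why ranking consistency suits pruning: a prediction on a different scale but with the correct order is penalized heavily by mean squared error yet barely at all by the ranking loss.

```python
import math


def pairwise_ranking_loss(pred, target):
    # Mean logistic loss over all pairs (i, j) where the target says i > j.
    total, pairs = 0.0, 0
    for i in range(len(target)):
        for j in range(len(target)):
            if target[i] > target[j]:
                total += math.log1p(math.exp(-(pred[i] - pred[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


def mse(pred, target):
    # Plain value regression: penalizes any difference in magnitude.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


target = [0.9, 0.1, 0.5, 0.3]
same_order = [9.0, 1.0, 5.0, 3.0]   # different scale, identical ranking
wrong_order = [0.1, 0.9, 0.5, 0.3]  # similar scale, top and bottom swapped

rank_same = pairwise_ranking_loss(same_order, target)
rank_wrong = pairwise_ranking_loss(wrong_order, target)
mse_same = mse(same_order, target)
mse_wrong = mse(wrong_order, target)
```

Here `mse_same` is larger than `mse_wrong` even though only `same_order` would select the right entries, while the ranking loss orders the two predictions the other way round.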

Method

ProxyKV employs a three-stage HybridAxialMapper: temporal feature extraction, time-axis context encoding, and head-axis cross-attention. The mapper is trained with a Multi-Granularity Hybrid Loss that combines multi-ratio binary and ranking-consistent objectives.
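
One plausible reading of "multi-ratio binary" is that keep/evict labels are derived from the target's scores at several keep ratios, and the proxy is judged on whether it reproduces them. The sketch below follows that reading; the ratios, the helper names `topk_mask` and `multi_ratio_agreement`, and the overlap metric are all assumptions made for illustration, not details from the source.

```python
def topk_mask(scores, keep_ratio):
    # Binary keep mask: 1 for the top keep_ratio fraction of scores, else 0.
    keep = max(1, round(keep_ratio * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


def multi_ratio_agreement(proxy_scores, target_scores, ratios=(0.25, 0.5, 0.75)):
    # For each ratio, the fraction of target-kept entries the proxy also keeps.
    result = {}
    for r in ratios:
        proxy_mask = topk_mask(proxy_scores, r)
        target_mask = topk_mask(target_scores, r)
        overlap = sum(1 for p, t in zip(proxy_mask, target_mask) if p == 1 and t == 1)
        result[r] = overlap / sum(target_mask)
    return result


target_scores = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4]
proxy_scores = [0.7, 0.2, 0.6, 0.1, 0.9, 0.3, 0.5, 0.4]
agreement = multi_ratio_agreement(proxy_scores, target_scores)
```

In this toy case the proxy's raw values differ from the target's everywhere, yet the keep sets agree fully at ratios 0.25 and 0.5 and differ by a single entry at 0.75.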

In practice

Deploy ProxyKV across two GPUs, or on a single GPU using CUDA streams.
Train a dedicated HybridAxialMapper for each intra-family target-proxy pair.
Use a sliding-window strategy to process long-context training data.
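
The sliding-window step in the last item can be sketched as follows. The source does not give window or stride sizes, so the `window` and `stride` parameters and the choice to add a final window flush with the end of the sequence are assumptions for illustration.

```python
def sliding_windows(token_ids, window, stride):
    # Split a long sequence into overlapping (start, chunk) windows.
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [(0, list(token_ids))]

    starts = list(range(0, len(token_ids) - window + 1, stride))
    # Add a final window aligned to the end so no trailing tokens are dropped.
    if starts[-1] + window < len(token_ids):
        starts.append(len(token_ids) - window)
    return [(s, list(token_ids[s:s + window])) for s in starts]


windows = sliding_windows(list(range(10)), window=4, stride=3)
ragged = sliding_windows(list(range(9)), window=4, stride=3)
```

Each tuple carries the window's start offset, so per-window importance scores can be mapped back to absolute token positions when the windows are stitched together.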

Topics

KV Cache Pruning
Long-Context LLMs
Proxy Models
Asynchronous Inference
HybridAxialMapper

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.