ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference
Summary
ProxyKV is a cross-model proxy pruning framework that targets the Key-Value (KV) cache memory bottleneck in long-context inference with Large Language Models (LLMs). It offloads importance scoring to a lightweight Small-Model Proxy from the same model family, which runs asynchronously alongside the Large-Model Target. This avoids the prohibitive prefilling overhead of high-precision reconstruction methods such as KVZip. To bridge architectural differences between the two models, ProxyKV introduces a HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, and a Multi-Granularity Hybrid Loss that focuses on relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and Qwen-3 families (7B to 32B parameters), ProxyKV matches KVZip's accuracy on benchmarks including LongBench, SCBench, and RULER. It achieves up to 3.21x prefilling speedup on Llama-3.1-8B and sustains speedups at contexts up to 170k tokens on Qwen-2.5-7B.
Key takeaway
For NLP engineers and research scientists optimizing long-context LLM inference, ProxyKV addresses the KV cache memory wall by moving importance scoring onto an asynchronous small-model proxy. The reported results are a prefilling speedup of up to 3.21x on Llama-3.1-8B, with speedups sustained at contexts up to 170k tokens on Qwen-2.5-7B, while matching KVZip's accuracy on the evaluated benchmarks. For applications that process extensive context, this reduces latency and GPU memory pressure during the prefill phase.
Key insights
ProxyKV uses an asynchronous small-model proxy to prune KV caches, achieving high accuracy and significant speedups for long-context LLM inference.
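The asynchronous flow can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the "models" are trivial stand-ins, a Python thread stands in for the second GPU or CUDA stream, and the mapping from proxy features to target importance is omitted entirely. It only shows the overlap pattern: the proxy scores tokens in the background while the target prefills, and the scores are then used to prune the cache.

```python
from concurrent.futures import ThreadPoolExecutor


def proxy_importance_scores(tokens):
    """Hypothetical stand-in for the small proxy model: one score per token."""
    return [float(len(tok)) for tok in tokens]  # toy heuristic, not a real model


def target_prefill(tokens):
    """Hypothetical stand-in for the large model's prefill: one KV entry per token."""
    return [("kv", i, tok) for i, tok in enumerate(tokens)]


def prune_kv(kv_cache, scores, keep_ratio):
    """Keep the highest-scoring entries, preserving their original order."""
    k = max(1, int(len(kv_cache) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [kv_cache[i] for i in sorted(ranked[:k])]


def prefill_with_proxy(tokens, keep_ratio):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Proxy scoring is submitted to a background worker ...
        scores_future = pool.submit(proxy_importance_scores, tokens)
        # ... while the target prefill proceeds on the calling thread.
        kv_cache = target_prefill(tokens)
        scores = scores_future.result()
    return prune_kv(kv_cache, scores, keep_ratio)


tokens = ["a", "long", "context", "of", "tokens", "here"]
pruned = prefill_with_proxy(tokens, keep_ratio=0.5)
print(pruned)
```

In a real deployment the two workloads would run on separate devices or streams; the thread pool here only mirrors the control flow.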
Principles
- Offload computationally intensive tasks to asynchronous, lightweight proxies.
- Disentangle temporal and head-axis feature extraction for cross-model compatibility.
- Optimize for ranking consistency rather than rigid value regression in pruning.
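To see why ranking consistency can matter more than value regression for pruning, consider the minimal sketch below. The scores and helper names are made up for illustration and are not taken from the paper. The proxy scores live on a completely different scale from the target scores, so their mean squared error is large, yet they order the tokens identically and therefore select the same tokens to keep.

```python
from itertools import combinations


def top_k_indices(scores, keep_ratio):
    """Indices of the highest-scoring tokens, returned in original order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mean_squared_error(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def pairwise_agreement(pred, target):
    """Fraction of token pairs that pred orders the same way as target."""
    pairs = list(combinations(range(len(target)), 2))
    agree = sum(
        (pred[i] - pred[j]) * (target[i] - target[j]) > 0 for i, j in pairs
    )
    return agree / len(pairs)


# Illustrative importance scores (not real data).
target_scores = [0.9, 0.1, 0.4, 0.7, 0.2, 0.6]
# Same ordering, very different scale and offset.
proxy_scores = [10 * s + 3 for s in target_scores]

print("MSE:", mean_squared_error(proxy_scores, target_scores))
print("pairwise agreement:", pairwise_agreement(proxy_scores, target_scores))
print("kept by target:", top_k_indices(target_scores, 0.5))
print("kept by proxy: ", top_k_indices(proxy_scores, 0.5))
```

Since only the kept set affects the pruned cache, a proxy that reproduces the ordering is sufficient even when its raw values are far from the target's.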
Method
ProxyKV's HybridAxialMapper has three stages: temporal feature extraction, time-axis context encoding, and head-axis cross-attention. It is trained with a Multi-Granularity Hybrid Loss that emphasizes multi-ratio binary and ranking-consistent objectives.
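This summary does not give the loss formulation, so the following is only one plausible reading of a "multi-ratio binary" objective, written as a hedged sketch. The assumptions are: binary keep/drop labels come from the target's top-k tokens at each keep ratio, the proxy's logit for a token is its distance from the proxy's own cut-off at that ratio, and the per-ratio binary cross-entropies are averaged. The function names, the `temperature` parameter, and the example scores are all hypothetical.

```python
import math


def top_k_mask(scores, keep_ratio):
    """1.0 for tokens in the top-k at this ratio, else 0.0."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]


def ratio_threshold(scores, keep_ratio):
    """Midpoint between the last kept and first dropped score at this ratio."""
    ordered = sorted(scores, reverse=True)
    k = max(1, int(len(scores) * keep_ratio))
    if k >= len(ordered):
        return ordered[-1] - 1.0  # everything is kept
    return 0.5 * (ordered[k - 1] + ordered[k])


def binary_cross_entropy(logits, labels):
    """Mean BCE on logits, in a numerically stable form."""
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(labels)


def multi_ratio_binary_loss(proxy_scores, target_scores, keep_ratios,
                            temperature=0.1):
    """Average keep/drop BCE over several keep ratios (assumed formulation)."""
    losses = []
    for ratio in keep_ratios:
        labels = top_k_mask(target_scores, ratio)
        tau = ratio_threshold(proxy_scores, ratio)
        logits = [(s - tau) / temperature for s in proxy_scores]
        losses.append(binary_cross_entropy(logits, labels))
    return sum(losses) / len(losses)


# Illustrative scores (not real data).
target = [0.9, 0.1, 0.4, 0.7, 0.2, 0.6]
aligned = [0.8, 0.2, 0.5, 0.7, 0.3, 0.6]     # same ordering as target
misaligned = [0.1, 0.9, 0.6, 0.3, 0.8, 0.4]  # reversed ordering
ratios = [0.25, 0.5, 0.75]

print(multi_ratio_binary_loss(aligned, target, ratios))
print(multi_ratio_binary_loss(misaligned, target, ratios))
```

Under these assumptions, a proxy whose ordering agrees with the target at every ratio receives a lower loss than one that reverses it, regardless of how close the raw values are.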
In practice
- Deploy ProxyKV across two GPUs, or on a single GPU using CUDA streams.
- Train a dedicated HybridAxialMapper for each intra-family target-proxy pair.
- Use a sliding-window strategy to process long-context training data.
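The sliding-window point can be sketched as follows. The summary does not specify window size, stride, or overlap, so `window` and `stride` are hypothetical parameters and this is a generic windowing helper, not the authors' pipeline. It splits a long sequence into overlapping half-open spans that together cover every token.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end) half-open spans covering [0, num_tokens)."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride larger than window would leave gaps")
    if num_tokens <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, num_tokens)
        spans.append((start, end))
        if end == num_tokens:  # the last span reaches the end of the sequence
            break
        start += stride
    return spans


# Toy sizes for readability; real contexts would be far longer.
for start, end in sliding_windows(num_tokens=10, window=4, stride=2):
    print(start, end)
```

Each span can then be processed independently during training, with the overlap giving tokens near a window boundary some surrounding context.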
Topics
- KV Cache Pruning
- Long-Context LLMs
- Proxy Models
- Asynchronous Inference
- HybridAxialMapper
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.