SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving
Summary
SpectrumKV introduces a per-token mixed-precision approach to Key-Value (KV) cache transfer in prefill-decode (PD) disaggregated LLM serving. Unlike binary KV reduction methods, which either keep or drop each token, SpectrumKV assigns a precision level per token: FP16 for high-importance tokens, INT8 for medium-importance tokens, and INT4 for low-importance tokens when the model tolerates it. A lightweight deployment-time probe runs three needle-in-a-haystack (NIAH) trials to decide whether a model is INT4-compatible; a model that fails the probe, such as Qwen2.5-7B-Instruct, falls back to FP16+INT8. Across Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, and Gemma-2-9B-it, SpectrumKV shows WikiText-2 perplexity changes of +1.97%, -0.06%, and -0.44% respectively at a 50% KV budget, far outperforming PDTrim. It also achieves 50-62% reductions in time to first token (TTFT) at b=0.5.
Key takeaway
MLOps engineers optimizing LLM serving architectures should evaluate per-token mixed-precision KV cache transfer as a way to reduce network payload and time to first token. Use an adaptive probe to enable INT4 quantization only for models that tolerate it, such as Mistral-7B or Gemma-2-9B, and keep critical tokens in FP16. The reported results are 50-62% TTFT reductions with model quality largely maintained, which goes beyond what simple token pruning offers.
Key insights
Prefill-decode KV cache transfer benefits from per-token mixed-precision allocation rather than binary pruning.
Principles
- KV cache transfer is a precision-allocation problem.
- Adaptive policies account for differences in model tolerance to low-bit quantization.
- High-importance tokens need FP16 protection.
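The precision-allocation framing can be made concrete with a little arithmetic. The sketch below computes the transfer payload relative to an all-FP16 cache for a given mix of precision levels. The token fractions are illustrative assumptions, not values from the paper, and the calculation ignores quantization metadata such as scales and zero points.

```python
# Bits per stored element at each precision level.
BITS = {"fp16": 16, "int8": 8, "int4": 4}

def relative_payload(mix):
    """Return payload size relative to an all-FP16 transfer.

    `mix` maps a precision name to the fraction of tokens sent at
    that precision; the fractions must sum to 1.
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    avg_bits = sum(BITS[p] * frac for p, frac in mix.items())
    return avg_bits / BITS["fp16"]

# Hypothetical mix: 20% FP16, 40% INT8, 40% INT4.
example_mix = {"fp16": 0.2, "int8": 0.4, "int4": 0.4}
ratio = relative_payload(example_mix)
print(f"payload vs. all-FP16: {ratio:.2f}")
```

Under this assumed mix, every token is still transferred, yet the payload is half of the FP16 size, which is the trade-off that binary pruning cannot express.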
Method
SpectrumKV assigns FP16, INT8, or INT4 precision per token based on importance. It uses a deployment-time probe with NIAH trials to adaptively determine INT4 tolerance for specific models.
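A minimal sketch of what per-token assignment could look like follows. It ranks tokens by an importance score and maps rank bands to precision levels. The function name, the rank-based thresholds, the default fractions, and the choice to send all non-FP16 tokens as INT8 in fallback mode are assumptions made for illustration; the summary does not specify how SpectrumKV scores tokens or sets its cutoffs.

```python
def assign_precision(scores, fp16_frac=0.2, int8_frac=0.4, allow_int4=True):
    """Map each token to "fp16", "int8", or "int4" by importance rank.

    scores      : one importance score per token (higher = more important)
    fp16_frac   : assumed fraction of tokens kept in FP16
    int8_frac   : assumed fraction of tokens sent as INT8
    allow_int4  : if False, remaining tokens fall back to INT8
    """
    n = len(scores)
    # Rank token indices from most to least important.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    n_fp16 = round(n * fp16_frac)
    n_int8 = round(n * int8_frac)

    plan = [None] * n
    for rank, idx in enumerate(order):
        if rank < n_fp16:
            plan[idx] = "fp16"
        elif rank < n_fp16 + n_int8 or not allow_int4:
            plan[idx] = "int8"
        else:
            plan[idx] = "int4"
    return plan

scores = [0.9, 0.1, 0.5, 0.3, 0.7]
print(assign_precision(scores))                    # three-level plan
print(assign_precision(scores, allow_int4=False))  # FP16+INT8 fallback
```

The output plan is indexed by token position, so it can be passed alongside the KV tensors to whatever quantization routine the transfer path uses.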
In practice
- Probe models for INT4 KV quantization tolerance.
- Prioritize FP16 for attention sinks.
- Consider INT8/INT4 for less critical tokens.
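The steps above can be sketched as a small deployment-time routine. Everything here is hypothetical scaffolding: `run_niah_trial` stands in for whatever retrieval check a deployment uses, the rule that all three trials must pass is an assumption (the summary gives only the trial count), and treating the first few positions as attention sinks is a common convention rather than something the summary states.

```python
def probe_int4_tolerance(run_niah_trial, n_trials=3):
    """Return True only if every NIAH trial passes under INT4 KV.

    run_niah_trial : hypothetical callable taking a trial index and
                     returning True when the needle is retrieved.
    """
    return all(run_niah_trial(i) for i in range(n_trials))

def select_precision_levels(int4_ok):
    """Choose the precision levels available to the allocator."""
    return ("fp16", "int8", "int4") if int4_ok else ("fp16", "int8")

def protect_sinks(plan, n_sinks=4):
    """Force the first n_sinks token positions to FP16.

    Assumes attention sinks sit at the start of the sequence.
    """
    return ["fp16" if i < n_sinks else p for i, p in enumerate(plan)]

# Stub trials for illustration: one model passes, one fails trial 1.
tolerant = probe_int4_tolerance(lambda i: True)
intolerant = probe_int4_tolerance(lambda i: i != 1)

print(select_precision_levels(tolerant))
print(select_precision_levels(intolerant))
print(protect_sinks(["int4", "int8", "int4", "fp16", "int4"], n_sinks=2))
```

Running the probe once at deployment keeps the decision out of the request path; the selected precision levels can then be cached per model.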
Topics
- LLM Serving
- KV Cache Optimization
- Mixed-Precision Quantization
- Prefill-Decode Disaggregation
- Model Quantization
- NIAH Retrieval
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.