SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

SpectrumKV introduces a per-token mixed-precision scheme for Key-Value (KV) cache transfer in prefill-decode (PD) disaggregated LLM serving. Where binary KV reduction methods either keep or drop a token, SpectrumKV grades precision by importance: FP16 for high-importance tokens, INT8 for medium-importance tokens, and INT4 for low-importance tokens when the model tolerates it. A lightweight deployment-time probe of three NIAH (needle-in-a-haystack) trials determines INT4 compatibility; if a model such as Qwen2.5-7B-Instruct fails the probe, SpectrumKV falls back to FP16+INT8. On WikiText-2 at a 50% KV budget, perplexity changes by +1.97%, -0.06%, and -0.44% for Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, and Gemma-2-9B-it respectively, far outperforming PDTrim. SpectrumKV also reduces time to first token (TTFT) by 50-62% at b=0.5.
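The probe-and-fallback logic described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `choose_precision_levels`, the `run_niah_trial` callback, and the rule that all three trials must pass are assumptions, since the summary does not state the pass criterion.

```python
from typing import Callable, Tuple


def choose_precision_levels(
    run_niah_trial: Callable[[int], bool],
    n_trials: int = 3,
) -> Tuple[str, ...]:
    """Decide at deployment time whether INT4 may be used.

    run_niah_trial is a hypothetical callback: it runs one
    needle-in-a-haystack trial with INT4-quantized low-importance
    tokens and returns True if the needle was retrieved.
    Assumption: every trial must pass for INT4 to be enabled.
    """
    int4_ok = all(run_niah_trial(i) for i in range(n_trials))
    if int4_ok:
        return ("fp16", "int8", "int4")
    # Fallback described in the summary: drop INT4, keep FP16+INT8.
    return ("fp16", "int8")
```

A model that passes every trial gets all three precision levels; a single failed trial is enough, under this assumed criterion, to restrict the deployment to FP16+INT8.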

Key takeaway

MLOps engineers optimizing LLM serving architectures should evaluate per-token mixed-precision KV cache transfer as a way to reduce network payload while preserving model quality. Use an adaptive probe to enable INT4 quantization only for models that tolerate it, such as Mistral-7B or Gemma-2-9B, and keep critical tokens in FP16. The reported result is a 50-62% TTFT reduction with quality largely maintained, an improvement over simple token pruning.

Key insights

Prefill-decode KV cache transfer benefits from per-token mixed-precision allocation rather than binary pruning.

Principles

Method

SpectrumKV assigns FP16, INT8, or INT4 precision per token based on importance. It uses a deployment-time probe with NIAH trials to adaptively determine INT4 tolerance for specific models.
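As a rough illustration of per-token precision assignment, the sketch below ranks tokens by an importance score and maps rank to a tier. The tier fractions (`fp16_frac`, `int8_frac`), the function names, and the way importance is supplied are all assumptions made for this example; the summary does not specify how SpectrumKV scores tokens or sets its thresholds.

```python
from typing import List

# Bits per stored element for each tier.
BITS = {"fp16": 16, "int8": 8, "int4": 4}


def assign_precision(
    importance: List[float],
    fp16_frac: float = 0.2,   # hypothetical share of tokens kept in FP16
    int8_frac: float = 0.4,   # hypothetical share of tokens sent as INT8
    allow_int4: bool = True,  # result of the deployment-time probe
) -> List[str]:
    """Return one precision tier per token, highest importance first."""
    n = len(importance)
    order = sorted(range(n), key=lambda i: importance[i], reverse=True)
    n_fp16 = round(n * fp16_frac)
    n_int8 = round(n * int8_frac)
    # Simplification: without INT4, the remaining tokens stay at INT8.
    low_tier = "int4" if allow_int4 else "int8"
    tiers = [low_tier] * n
    for rank, idx in enumerate(order):
        if rank < n_fp16:
            tiers[idx] = "fp16"
        elif rank < n_fp16 + n_int8:
            tiers[idx] = "int8"
    return tiers


def payload_ratio(tiers: List[str]) -> float:
    """Transfer size relative to sending every token in FP16."""
    return sum(BITS[t] for t in tiers) / (16 * len(tiers))
```

With these example fractions, a 20/40/40 split across FP16/INT8/INT4 gives a payload of (0.2·16 + 0.4·8 + 0.4·4)/16 = 0.5 of the all-FP16 size, which corresponds to a 50% KV budget without dropping any token outright.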

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.