Can I Buy Your KV Cache?

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

A new proposal introduces a "prefill CDN" to eliminate the redundant recomputation of key-value (KV) caches in large language models (LLMs) for frequently accessed documents. Today, every AI agent that reads a document re-runs the compute-intensive "prefill" step and rebuilds an identical KV cache for the same text. In the proposed scheme, a publisher precomputes a document's KV cache once, and other agents load it and skip prefill. The method is "token-exact": it matches from-scratch prefill with no accuracy cost (24/24 greedy tokens identical, maximum logit difference 0.02). Experiments on Qwen3-4B show KV reuse is 9-50x cheaper in compute than prefill, with savings growing with context length L because prefill attention scales as L^2. The break-even point is nearly immediate (N*≈1), meaning the precomputation pays for itself after roughly one reuse. KV artifacts are large (0.148 MB/token in fp16), so shipping them to clients is uneconomical: egress for a 0.54 GB artifact costs $0.049, which is 2.6x the cost of simply running prefill. Hosting the artifacts provider-side avoids that egress cost. This enables a market in which agents pay for hosted KV loads, with users receiving a 10x discount while the provider retains a significant margin.
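The figures in the summary can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. The layer, KV-head, and head-dimension values are an assumption about Qwen3-4B's configuration and are not stated in the summary, and the break-even formula is one plausible reading of N* (the publisher pays one full prefill, and each later load costs 1/speedup of a prefill), not necessarily the paper's definition.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache stored per token: keys and values, for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def break_even_loads(speedup):
    """Number of consumers N at which publish-once matches everyone prefilling.

    Costs are in units of one prefill. With the cache: 1 + N / speedup.
    Without it: N. Setting them equal gives N = speedup / (speedup - 1).
    """
    return speedup / (speedup - 1.0)


# Figures quoted in the summary.
MB_PER_TOKEN = 0.148        # fp16 KV footprint per token
ARTIFACT_GB = 0.54          # size of the example artifact
EGRESS_USD = 0.049          # egress cost for that artifact
EGRESS_VS_PREFILL = 2.6     # egress cost as a multiple of prefill cost

# Quantities implied by those figures (1 GB taken as 1000 MB).
tokens = ARTIFACT_GB * 1000 / MB_PER_TOKEN
egress_usd_per_gb = EGRESS_USD / ARTIFACT_GB
prefill_usd = EGRESS_USD / EGRESS_VS_PREFILL

# ASSUMPTION: 36 layers, 8 KV heads, head dimension 128, 2-byte fp16 elements.
assumed_bytes = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128, bytes_per_elem=2)

print(f"implied document length: {tokens:.0f} tokens")
print(f"implied egress price: ${egress_usd_per_gb:.3f}/GB")
print(f"implied prefill cost: ${prefill_usd:.4f}")
print(f"assumed-config KV size: {assumed_bytes / 1e6:.3f} MB/token")
for speedup in (9, 50):
    print(f"speedup {speedup}x -> break-even at N = {break_even_loads(speedup):.3f}")
```

Under these assumptions the example artifact corresponds to a document of roughly 3,650 tokens, and both ends of the 9-50x range give a break-even just above one reuse, which is consistent with the reported N*≈1.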

Key takeaway

MLOps engineers managing LLM serving costs should evaluate a prefill CDN for high-volume, long-context retrieval-augmented generation (RAG) or agentic workloads. The approach amortizes prefill cost across many users, with reported compute savings of 9-50x on Qwen3-4B. Host KV caches provider-side to avoid prohibitive egress fees, and explore lossless compression strategies where token-exactness must be preserved for critical applications.

Key insights

Precomputing and reusing LLM KV caches for shared content removes redundant prefill computation, cutting compute cost by 9-50x in the reported Qwen3-4B experiments.

Method

Publishers precompute and serialize a document's KV cache with metadata; agents load it, setting correct position IDs and attention masks, then decode.
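The publish and load steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's artifact format: the metadata fields (`model_id`, `dtype`, `num_tokens`, `doc_sha256`), the length-prefixed JSON header, and the function names are all hypothetical, and the KV tensors are stood in for by an opaque byte string.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class KVArtifactMeta:
    # Hypothetical metadata fields; the summary only says "with metadata".
    model_id: str     # model whose weights produced the cache
    dtype: str        # numeric format of the cached tensors, e.g. "float16"
    num_tokens: int   # length of the cached document prefix
    doc_sha256: str   # hash of the exact token ids that were prefilled


def hash_token_ids(token_ids):
    return hashlib.sha256(json.dumps(token_ids).encode("utf-8")).hexdigest()


def publish(model_id, dtype, doc_token_ids, kv_payload):
    """Serialize a precomputed KV payload behind a length-prefixed JSON header."""
    meta = KVArtifactMeta(model_id, dtype, len(doc_token_ids), hash_token_ids(doc_token_ids))
    header = json.dumps(asdict(meta)).encode("utf-8")
    return len(header).to_bytes(4, "big") + header + kv_payload


def load(artifact, model_id, dtype, doc_token_ids):
    """Parse the artifact and refuse it if it was built for a different setup."""
    header_len = int.from_bytes(artifact[:4], "big")
    meta = KVArtifactMeta(**json.loads(artifact[4:4 + header_len]))
    if (meta.model_id, meta.dtype) != (model_id, dtype):
        raise ValueError("artifact was built for a different model or dtype")
    if meta.doc_sha256 != hash_token_ids(doc_token_ids):
        raise ValueError("artifact does not match this document's tokens")
    return meta, artifact[4 + header_len:]


def suffix_inputs(num_cached, num_new):
    """Position ids and causal mask for new tokens appended after a cached prefix."""
    # Positions continue where the cached prefix ended instead of restarting at 0.
    position_ids = list(range(num_cached, num_cached + num_new))
    # New token i may attend to every cached token and to new tokens 0..i.
    total = num_cached + num_new
    attention_mask = [
        [1 if j <= num_cached + i else 0 for j in range(total)]
        for i in range(num_new)
    ]
    return position_ids, attention_mask


doc = [101, 2023, 2003, 102]                      # toy token ids
artifact = publish("toy-model", "float16", doc, b"\x00" * 16)
meta, payload = load(artifact, "toy-model", "float16", doc)
positions, mask = suffix_inputs(meta.num_tokens, num_new=2)
print(meta.num_tokens, positions, mask)
```

Checking the model, dtype, and token hash before use is a design choice of this sketch: a cache built from different weights or a different tokenization would load without error but could no longer be expected to match from-scratch prefill.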

Best for: AI Engineer, NLP Engineer, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.