Can I Buy Your KV Cache?
Summary
A new proposal introduces a "prefill CDN" to eliminate redundant recomputation of Key-Value (KV) caches when large language models (LLMs) read frequently accessed documents. Today, every AI agent that reads a document re-runs the compute-intensive "prefill" step and rebuilds an identical KV cache for the same text. In the proposed scheme, a publisher precomputes a document's KV cache once, and other agents load it and skip prefill. The method is "token-exact": it matches from-scratch prefill with no accuracy cost (24/24 greedy tokens identical, max logit difference 0.02).
Experiments on Qwen3-4B show that KV reuse is 9-50x cheaper in compute than prefill, and the savings grow with context length L because prefill attention scales as L^2. The break-even point is nearly immediate (N*≈1). KV artifacts are large (0.148 MB/token in fp16), so shipping them to the agent is not viable: egress for a 0.54 GB artifact costs $0.049, which is 2.6x the cost of simply re-running prefill. Hosting the artifacts provider-side avoids that egress cost and enables a market in which agents pay for hosted KV loads, giving users a 10x discount while leaving the provider a significant margin.
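The size and egress figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the summary; the function names are hypothetical, and the conversion 1 GB = 1000 MB is an assumption, since the summary does not say which convention the paper uses.

```python
# Figures quoted in the summary.
MB_PER_TOKEN_FP16 = 0.148      # KV artifact size per token (fp16)
ARTIFACT_GB = 0.54             # example artifact size
EGRESS_COST_USD = 0.049        # egress cost for that artifact
EGRESS_TO_PREFILL_RATIO = 2.6  # egress cost relative to prefill cost


def artifact_tokens(artifact_gb: float, mb_per_token: float) -> float:
    """Approximate token count of an artifact (assumes 1 GB = 1000 MB)."""
    return artifact_gb * 1000.0 / mb_per_token


def implied_egress_rate(cost_usd: float, artifact_gb: float) -> float:
    """Egress price in USD per GB implied by the quoted cost and size."""
    return cost_usd / artifact_gb


def implied_prefill_cost(egress_cost_usd: float, ratio: float) -> float:
    """Prefill cost in USD implied by 'egress is ratio x prefill'."""
    return egress_cost_usd / ratio


tokens = artifact_tokens(ARTIFACT_GB, MB_PER_TOKEN_FP16)
rate = implied_egress_rate(EGRESS_COST_USD, ARTIFACT_GB)
prefill = implied_prefill_cost(EGRESS_COST_USD, EGRESS_TO_PREFILL_RATIO)

print(f"artifact covers about {tokens:.0f} tokens")
print(f"implied egress rate: ${rate:.3f}/GB")
print(f"implied prefill cost: ${prefill:.4f}")
```

Under these assumptions the 0.54 GB artifact corresponds to a document of a few thousand tokens, which is why downloading it costs more than recomputing it.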
Key takeaway
MLOps engineers managing LLM serving costs should evaluate a prefill CDN for high-volume, long-context retrieval-augmented generation (RAG) or agentic workloads. The approach amortizes prefill cost across many readers of the same document, with reported compute savings of 9-50x on Qwen3-4B. Host KV caches provider-side to avoid prohibitive egress fees, and where token-exactness matters, prefer lossless compression strategies.
Key insights
Precomputing and reusing LLM KV caches for shared content drastically reduces redundant prefill computation and cost.
Principles
- KV cache reuse is token-exact for shared prefixes.
- Compute savings increase super-linearly with context length.
- Hosting KV caches is critical to avoid egress costs.
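One way to see why the break-even point sits near N*≈1 is a simple accounting model, sketched below. It is an assumption made for illustration, not necessarily the paper's definition of N*: the publisher pays one prefill, each of N readers pays one KV load, and a load costs 1/speedup of a prefill. The function names are hypothetical; the 9x and 50x ratios are the ones reported in the summary.

```python
def break_even_readers(speedup: float) -> float:
    """Number of readers N at which publish-once-then-load matches N prefills.

    In units of one prefill:
      with reuse:    1 + N / speedup   (one prefill, then N loads)
      without reuse: N                 (every reader prefills)
    Setting them equal gives N = 1 / (1 - 1 / speedup).
    """
    if speedup <= 1.0:
        raise ValueError("reuse only pays off when loading is cheaper than prefill")
    return 1.0 / (1.0 - 1.0 / speedup)


def amortized_cost_per_reader(speedup: float, readers: int) -> float:
    """Average cost per reader with reuse, in units of one prefill."""
    return (1.0 + readers / speedup) / readers


for speedup in (9.0, 50.0):
    n_star = break_even_readers(speedup)
    per_reader = amortized_cost_per_reader(speedup, readers=1000)
    print(f"speedup {speedup:>4.0f}x: break-even N = {n_star:.3f}, "
          f"cost per reader at N=1000 = {per_reader:.3f} prefills")
```

In this model the break-even falls between one and two readers for any speedup in the reported range, and the per-reader cost approaches 1/speedup as the audience grows.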
Method
Publishers precompute and serialize a document's KV cache with metadata; agents load it, setting correct position IDs and attention masks, then decode.
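The bookkeeping in this step can be sketched without a model. The example below is a minimal illustration using only the standard library: `KVArtifactMeta`, `describe`, and `plan_suffix` are hypothetical names, the metadata fields are an assumed minimal set rather than the paper's format, and the token count is a placeholder instead of a real tokenizer result. It shows the two things a loader has to get right: refusing an artifact built for a different model or dtype, and starting position IDs for new tokens where the cached document ends.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class KVArtifactMeta:
    """Hypothetical metadata stored next to a serialized KV cache."""
    model_id: str     # model the cache was computed with
    dtype: str        # numeric precision of the stored tensors
    num_tokens: int   # number of document tokens covered by the cache
    doc_sha256: str   # hash of the source text, to detect stale artifacts


def describe(doc_text: str, model_id: str, dtype: str, num_tokens: int) -> KVArtifactMeta:
    """Publisher side: build the metadata record for a precomputed cache."""
    digest = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()
    return KVArtifactMeta(model_id, dtype, num_tokens, digest)


def plan_suffix(meta: KVArtifactMeta, model_id: str, dtype: str, num_new_tokens: int):
    """Agent side: validate the artifact, then lay out the appended tokens.

    New tokens continue from the end of the cached prefix, and the attention
    mask spans the cached tokens plus the new ones.
    """
    if (meta.model_id, meta.dtype) != (model_id, dtype):
        raise ValueError("KV artifact was built for a different model or dtype")
    start = meta.num_tokens
    position_ids = list(range(start, start + num_new_tokens))
    attention_mask = [1] * (start + num_new_tokens)
    return position_ids, attention_mask


# num_tokens=5 is a placeholder, not the output of a real tokenizer.
meta = describe("example document text", "Qwen/Qwen3-4B", "float16", num_tokens=5)
position_ids, attention_mask = plan_suffix(meta, "Qwen/Qwen3-4B", "float16", num_new_tokens=3)
print(position_ids)         # [5, 6, 7]
print(len(attention_mask))  # 8
```

A real loader would also need to check the tokenizer and any positional-encoding settings, since a mismatch in either would break token-exactness.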
In practice
- Implement a prefill CDN for popular documents.
- Prioritize caching long-context, frequently accessed content.
- Explore selective/mixed-precision KV compression.
Topics
- KV Cache Reuse
- Large Language Models
- Prefill CDN
- Retrieval-Augmented Generation
- LLM Inference Optimization
- Qwen3-4B
Best for: AI Engineer, NLP Engineer, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.