Can I Buy Your KV Cache?

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

A new proposal introduces a "prefill CDN" to eliminate the redundant recomputation of Key-Value (KV) caches in large language models (LLMs) for frequently accessed documents. Today, each AI agent that reads a document re-runs the compute-intensive "prefill" step and rebuilds an identical KV cache for the same text. In the proposed scheme, a publisher precomputes a document's KV cache once, and other agents load it and skip prefill. The method is "token-exact": it matches from-scratch prefill with no accuracy cost (24/24 greedy tokens, max logit difference 0.02). Experiments on Qwen3-4B show that KV reuse is 9-50x cheaper in compute than prefill, and the savings grow with context length because prefill attention scales as L^2. The break-even point is nearly immediate (N*≈1). KV artifacts are large (0.148 MB/token for fp16), so shipping them to the agent is uneconomical: egress for a 0.54 GB artifact costs $0.049, about 2.6x the cost of the prefill it replaces. Hosting the artifacts provider-side avoids that egress cost. This enables a market where agents pay for hosted KV loads, offering users a 10x discount while leaving the provider a significant margin.
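
As a rough check on the storage and egress figures above, the sketch below derives what the quoted numbers imply and compares the per-token size against the standard KV-size formula. The model shape (36 layers, 8 KV heads, head dimension 128) is an assumption about Qwen3-4B that the summary does not state, and MB and GB are taken as decimal units.

```python
# Figures quoted in the summary.
MB_PER_TOKEN_FP16 = 0.148
ARTIFACT_GB = 0.54
EGRESS_USD = 0.049
EGRESS_TO_PREFILL_RATIO = 2.6


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """KV cache bytes per token: one key and one value vector per KV head per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# ASSUMPTION: this model shape is not given in the summary.
assumed_bytes = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128, bytes_per_value=2)

# Quantities implied by the quoted figures alone (decimal MB/GB assumed).
implied_tokens = ARTIFACT_GB * 1000 / MB_PER_TOKEN_FP16
implied_usd_per_gb = EGRESS_USD / ARTIFACT_GB
implied_prefill_usd = EGRESS_USD / EGRESS_TO_PREFILL_RATIO

print(f"assumed-shape KV size: {assumed_bytes / 1e6:.3f} MB/token")
print(f"implied document length: {implied_tokens:.0f} tokens")
print(f"implied egress price: ${implied_usd_per_gb:.3f}/GB")
print(f"implied prefill cost: ${implied_prefill_usd:.4f}")
```

Under the assumed shape the formula gives 147,456 bytes per token, which is close to the quoted 0.148 MB/token. The quoted artifact size corresponds to a document of a few thousand tokens.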

Key takeaway

MLOps engineers managing LLM serving costs should evaluate a prefill CDN for high-volume, long-context retrieval-augmented generation (RAG) or agentic workloads. The approach amortizes prefill cost across many users, with reported compute savings of 9-50x on Qwen3-4B. Host KV caches provider-side to avoid egress fees that exceed the prefill cost being saved, and explore lossless compression strategies to preserve token-exactness for critical applications.

Key insights

Precomputing and reusing LLM KV caches for shared content drastically reduces redundant prefill computation and cost.

Principles

KV cache reuse is token-exact for shared prefixes.
Compute savings increase super-linearly with context length.
Hosting KV caches is critical to avoid egress costs.
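
One way to read the near-immediate break-even and the growth of savings with context length is the toy model below. It is a sketch under stated assumptions, not the paper's cost model: the break-even formula assumes the publisher pays one full prefill and each load costs prefill divided by the reported speedup, and the coefficients in cost_ratio are hypothetical.

```python
def break_even_loads(speedup):
    """Loads N at which publish-once matches per-agent prefill.

    ASSUMPTION: with prefill cost P and load cost P / speedup, solve
    P + N * (P / speedup) = N * P, which gives N = speedup / (speedup - 1).
    """
    if speedup <= 1.0:
        raise ValueError("reuse must be cheaper than prefill to break even")
    return speedup / (speedup - 1.0)


def cost_ratio(context_len, linear=1.0, quadratic=0.001, load=1.0):
    """Prefill-to-reuse cost ratio under a hypothetical cost model.

    Prefill is modeled as linear * L + quadratic * L^2 (the L^2 term stands
    for attention); reuse is modeled as load * L (reading the stored KV).
    """
    prefill = linear * context_len + quadratic * context_len ** 2
    reuse = load * context_len
    return prefill / reuse


for r in (9, 50):
    print(f"speedup {r}x -> break-even at N = {break_even_loads(r):.3f}")

for length in (1000, 8000, 32000):
    print(f"L = {length}: prefill/reuse ratio = {cost_ratio(length):.1f}")
```

With the reported 9-50x range, this model puts the break-even between about 1.02 and 1.13 loads, consistent with N*≈1. The ratio grows with L because the quadratic term dominates prefill while the reuse cost stays linear.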

Method

Publishers precompute and serialize a document's KV cache with metadata; agents load it, setting correct position IDs and attention masks, then decode.
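
The publish and load steps can be sketched as a small artifact format. This is a minimal illustration, not the layout used by the referenced repository: the header fields, the byte layout, and the publish and load function names are all hypothetical, and the KV tensors are treated as an opaque byte string.

```python
import hashlib
import json
import struct


def token_hash(token_ids):
    # Hash the exact token sequence so a consumer can verify the prefix matches.
    payload = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(payload).hexdigest()


def publish(model_id, dtype, token_ids, kv_bytes):
    """Serialize a precomputed KV cache with the metadata needed to reuse it."""
    header = {
        "model_id": model_id,
        "dtype": dtype,
        "num_tokens": len(token_ids),
        "token_sha256": token_hash(token_ids),
        "kv_sha256": hashlib.sha256(kv_bytes).hexdigest(),
    }
    head = json.dumps(header, sort_keys=True).encode("utf-8")
    # Hypothetical layout: 4-byte little-endian header length, JSON header, KV payload.
    return struct.pack("<I", len(head)) + head + kv_bytes


def load(artifact, model_id, dtype, token_ids):
    """Validate an artifact and return (kv_bytes, next_position)."""
    (head_len,) = struct.unpack_from("<I", artifact, 0)
    header = json.loads(artifact[4:4 + head_len].decode("utf-8"))
    kv_bytes = artifact[4 + head_len:]
    if header["model_id"] != model_id or header["dtype"] != dtype:
        raise ValueError("artifact was built for a different model or dtype")
    if header["token_sha256"] != token_hash(token_ids):
        raise ValueError("token prefix does not match the published document")
    if header["kv_sha256"] != hashlib.sha256(kv_bytes).hexdigest():
        raise ValueError("KV payload is corrupted")
    # New tokens continue at position num_tokens; the attention mask must
    # cover the cached positions plus the newly appended ones.
    return kv_bytes, header["num_tokens"]


artifact = publish("example-model", "fp16", [11, 22, 33], b"\x00" * 16)
kv, next_position = load(artifact, "example-model", "fp16", [11, 22, 33])
print(len(kv), next_position)
```

Binding the artifact to a model identifier, dtype, and token hash reflects the fact that a KV cache is only valid for the exact model and token prefix that produced it. The returned position is where the consuming agent would start its position IDs before decoding.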

In practice

Implement a prefill CDN for popular documents.
Prioritize caching long-context, frequently accessed content.
Explore selective/mixed-precision KV compression.

Topics

KV Cache Reuse
Large Language Models
Prefill CDN
Retrieval-Augmented Generation
LLM Inference Optimization
Qwen3-4B

Code references

zly-idleness/kvstore

Best for: AI Engineer, NLP Engineer, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.