CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference
Summary
CacheWeaver is a lightweight, prompt-layer method that speeds up Retrieval-Augmented Generation (RAG) inference by cutting the prefill cost of long prompts. It targets a weakness of prefix caching in serving engines such as vLLM: requests often retrieve overlapping evidence in different orders, so their prompts diverge early and the cached Key-Value (KV) states cannot be reused. CacheWeaver reorders the retrieved documents using a knowledge tree that stores recently served evidence sequences. A greedy walk over the tree places the most reusable prefix first, without modifying the serving engine or changing the retrieved evidence set. In experiments across three vLLM configurations, CacheWeaver lowers median time-to-first-token (TTFT) by approximately 20–33% compared with retrieval-order prefix caching and achieves 97.5% of the gain from oracle ordering. It maintains answer quality in QA tests, adds negligible host-side overhead (around 26 µs per request), and reduces p50 inference latency by 29%. It is most effective for workloads with moderate document overlap and temporal locality.
Key takeaway
AI engineers deploying RAG systems on vLLM should consider integrating CacheWeaver to reduce inference latency. Reordering retrieved evidence to maximize prefix cache reuse can lower median time-to-first-token by 20–33% without compromising answer quality. This lightweight, prompt-layer optimization is most beneficial for applications with bursty, related queries that exhibit temporal locality, such as customer service or enterprise knowledge bases.
Key insights
CacheWeaver reorders RAG evidence to maximize prefix cache reuse, significantly reducing LLM prefill latency.
Principles
- Evidence order impacts RAG cache reuse.
- Greedy trie search approximates optimal ordering.
- Moderate document overlap yields best gains.
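To illustrate the first principle, the sketch below counts how many leading documents two prompts share, which is a rough stand-in for how much of a cached prefix could be reused. The document IDs and the `shared_prefix_len` helper are hypothetical and chosen only for illustration; real prefix caches match on token blocks rather than whole document IDs.

```python
def shared_prefix_len(a, b):
    """Count the leading positions at which two document sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Evidence order served for an earlier request (hypothetical IDs).
cached = ["d1", "d2", "d3"]

# A later request retrieves the same three documents plus one new one.
retrieval_order = ["d3", "d1", "d2", "d7"]  # order returned by the retriever
reordered = ["d1", "d2", "d3", "d7"]        # same set, cache-aligned order

reuse_retrieval = shared_prefix_len(cached, retrieval_order)
reuse_reordered = shared_prefix_len(cached, reordered)
```

Here the retrieval order shares no leading documents with the cached sequence, while the reordered prompt shares all three, even though both prompts contain exactly the same evidence.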
Method
CacheWeaver uses a knowledge tree (trie) of recent document sequences. A greedy algorithm walks the trie to reorder retrieved documents, prioritizing paths that align with cached prefixes.
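A minimal sketch of this idea follows. It is an illustration under stated assumptions, not the authors' implementation: the `KnowledgeTree` and `Node` names, the logical clock, and the rule of preferring the most recently served child at each step are all assumptions, since the summary does not specify how the greedy walk scores candidate paths.

```python
class Node:
    """One document in a served sequence; children are the possible next documents."""

    def __init__(self):
        self.children = {}   # doc_id -> Node
        self.last_used = 0   # logical timestamp of the most recent request through this node


class KnowledgeTree:
    """Trie of recently served evidence sequences (hypothetical sketch)."""

    def __init__(self):
        self.root = Node()
        self.clock = 0

    def record(self, doc_ids):
        """Store the order in which documents were actually served."""
        self.clock += 1
        node = self.root
        for doc_id in doc_ids:
            node = node.children.setdefault(doc_id, Node())
            node.last_used = self.clock

    def reorder(self, retrieved):
        """Greedy walk: follow trie edges that match retrieved documents.

        Assumption: when several children match, prefer the most recently
        served one. Documents that match no path keep their retrieval order
        and are appended at the end, so the evidence set is unchanged.
        """
        remaining = list(retrieved)
        ordered = []
        node = self.root
        while remaining:
            candidates = [d for d in remaining if d in node.children]
            if not candidates:
                break
            best = max(candidates, key=lambda d: node.children[d].last_used)
            ordered.append(best)
            remaining.remove(best)
            node = node.children[best]
        return ordered + remaining


tree = KnowledgeTree()
tree.record(["d1", "d2", "d3"])
new_order = tree.reorder(["d3", "d7", "d1", "d2"])
```

In this example the walk follows the stored path `d1`, `d2`, `d3` and then appends the unmatched `d7`. A real deployment would also need to evict stale paths so the tree tracks what the serving engine still holds in its cache, which this sketch omits.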
In practice
- Implement as Python middleware for vLLM.
- Use for customer service and domain assistants.
- Monitor TTFT for cache-state feedback.
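One way to approach the monitoring point is a small rolling-window tracker kept in the middleware. The sketch below is an assumption about how such feedback might look, not something described in the source: the `TTFTMonitor` class, the window size, and the `factor` threshold for flagging a probable cache miss are all hypothetical.

```python
from collections import deque
from statistics import median


class TTFTMonitor:
    """Rolling window of time-to-first-token samples (hypothetical helper)."""

    def __init__(self, window=100):
        # deque with maxlen silently drops the oldest sample when full.
        self.samples = deque(maxlen=window)

    def observe(self, ttft_ms):
        """Record the TTFT of one completed request, in milliseconds."""
        self.samples.append(ttft_ms)

    def p50(self):
        """Median TTFT over the window, or None if nothing was observed."""
        return median(self.samples) if self.samples else None

    def likely_cache_miss(self, ttft_ms, factor=1.5):
        """Heuristic (assumption): a TTFT well above the rolling median
        suggests the expected prefix was not reused."""
        baseline = self.p50()
        return baseline is not None and ttft_ms > factor * baseline


monitor = TTFTMonitor(window=3)
for sample in (100.0, 200.0, 300.0, 400.0):
    monitor.observe(sample)
```

With a window of three, the oldest sample is dropped and the median is taken over the last three observations. A request flagged by `likely_cache_miss` could be used as a hint that the stored sequence no longer reflects the engine's cache state.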
Topics
- Retrieval-Augmented Generation
- LLM Inference Optimization
- Prefix Caching
- vLLM
- Time-to-First-Token
- Evidence Ordering
Best for: MLOps Engineer, AI Architect, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.