CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, extended

Summary

CacheWeaver is a lightweight, prompt-layer method that speeds up Retrieval-Augmented Generation (RAG) inference by reducing the prefill cost of long prompts. It targets a weakness of prefix caching in serving engines such as vLLM: overlapping retrieved evidence often appears in a different order from one request to the next, which prevents reuse of cached Key-Value (KV) states. CacheWeaver reorders the retrieved documents using a knowledge tree that stores recently served evidence sequences. A greedy walk over this tree places the most reusable prefix first, without altering the serving engine or the retrieved evidence set. Experiments across three vLLM configurations show that CacheWeaver lowers median time-to-first-token (TTFT) by approximately 20–33% compared to retrieval-order prefix caching, achieving 97.5% of the gain from oracle ordering. The method maintains answer quality in QA tests and adds negligible host-side overhead (around 26 µs per request) while reducing median (p50) inference latency by 29%. It is particularly effective for workloads with moderate document overlap and temporal locality.

Key takeaway

AI engineers deploying RAG systems on vLLM should consider integrating CacheWeaver to reduce inference latency. By reordering retrieved evidence to maximize prefix cache reuse, it improved median time-to-first-token by 20–33% in the reported experiments without compromising answer quality. This lightweight, prompt-layer optimization is most beneficial for applications with bursty, related queries, such as customer service or enterprise knowledge bases, where temporal locality is present.

Key insights

CacheWeaver reorders RAG evidence to maximize prefix cache reuse, significantly reducing LLM prefill latency.

Method

CacheWeaver uses a knowledge tree (trie) of recent document sequences. A greedy algorithm walks the trie to reorder retrieved documents, prioritizing paths that align with cached prefixes.
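
The trie and greedy walk described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `KnowledgeTree`, `reorder`, `record`, and `serve` are hypothetical, recency is tracked with a simple logical clock as an assumed tie-breaker, and eviction of old sequences is omitted. It also works at the level of document IDs, whereas a real serving engine caches token-level prefixes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # doc_id -> Node
    last_used: int = 0  # logical time of the latest request through this node


class KnowledgeTree:
    """Hypothetical trie of recently served document orderings."""

    def __init__(self):
        self.root = Node()
        self.clock = 0

    def reorder(self, retrieved):
        # Deduplicate while keeping retrieval order.
        remaining = list(dict.fromkeys(retrieved))
        ordered, node = [], self.root
        while remaining:
            # Documents that extend a previously served prefix.
            candidates = [d for d in remaining if d in node.children]
            if not candidates:
                break
            # Greedy choice (assumption): follow the most recently used child.
            best = max(candidates, key=lambda d: node.children[d].last_used)
            ordered.append(best)
            remaining.remove(best)
            node = node.children[best]
        # Documents with no cached prefix keep their retrieval order.
        return ordered + remaining

    def record(self, ordered):
        # Store the served sequence so later requests can reuse its prefix.
        self.clock += 1
        node = self.root
        for doc_id in ordered:
            node = node.children.setdefault(doc_id, Node())
            node.last_used = self.clock

    def serve(self, retrieved):
        ordered = self.reorder(retrieved)
        self.record(ordered)
        return ordered


tree = KnowledgeTree()
first = tree.serve(["A", "B", "C"])        # empty tree: retrieval order is kept
second = tree.serve(["C", "A", "D", "B"])  # overlaps with the first request
print(first)   # ['A', 'B', 'C']
print(second)  # ['A', 'B', 'C', 'D']
```

In this toy run the second request contains the same three documents as the first plus one new one, so the walk restores the `A, B, C` prefix and appends `D`. The evidence set is unchanged; only its order differs from what the retriever returned.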

Best for: MLOps Engineer, AI Architect, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.