CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

2026-04-16 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, extended

Summary

CacheWeaver is a lightweight, prompt-layer method that cuts the prefill cost of long prompts in Retrieval-Augmented Generation (RAG) inference. It targets a weakness of prefix caching in serving engines such as vLLM: requests often retrieve overlapping evidence, but in different orders, so the cached Key-Value (KV) states cannot be reused. CacheWeaver reorders the retrieved documents using a knowledge tree that stores recently served evidence sequences. A greedy walk over that tree places the most reusable prefix first, without changing the serving engine or the retrieved evidence set. In experiments across three vLLM configurations, CacheWeaver lowers median time-to-first-token (TTFT) by approximately 20–33% compared with retrieval-order prefix caching and achieves 97.5% of the gain from oracle ordering. It maintains answer quality in QA tests, adds around 26 µs of host-side overhead per request, and reduces median (p50) inference latency by 29%. It is most effective for workloads with moderate document overlap and temporal locality.

Key takeaway

AI engineers deploying RAG systems on vLLM should consider CacheWeaver as a way to reduce inference latency. By reordering retrieved evidence to maximize prefix cache reuse, it lowered median time-to-first-token by 20–33% in the reported experiments without compromising answer quality. Because it works at the prompt layer, it requires no changes to the serving engine. It is most beneficial for applications with bursty, related queries, such as customer service or enterprise knowledge bases, where temporal locality is present.

Key insights

CacheWeaver reorders RAG evidence to maximize prefix cache reuse, significantly reducing LLM prefill latency.

Principles

Evidence order impacts RAG cache reuse.
Greedy trie search approximates optimal ordering.
Moderate document overlap yields best gains.
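
To make the first principle concrete, here is a minimal sketch of why order matters. It assumes, as a simplification, that the cache is matched at whole-document granularity (real engines match token blocks), and the document IDs are hypothetical.

```python
def shared_prefix_len(cached, candidate):
    """Count how many leading documents two evidence sequences share."""
    n = 0
    for a, b in zip(cached, candidate):
        if a != b:
            break
        n += 1
    return n


# Evidence sequence served by an earlier request (hypothetical IDs).
cached = ["doc_a", "doc_b", "doc_c"]

# A new request retrieves two of the same documents, in a different order.
retrieval_order = ["doc_b", "doc_a", "doc_d"]

# Same evidence set, reordered so the overlapping documents come first.
reordered = ["doc_a", "doc_b", "doc_d"]

reuse_before = shared_prefix_len(cached, retrieval_order)  # 0
reuse_after = shared_prefix_len(cached, reordered)         # 2
```

The evidence set is identical in both cases; only the reordered sequence shares a leading run with the cached one, and a prefix cache can only reuse a leading run.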

Method

CacheWeaver uses a knowledge tree (trie) of recent document sequences. A greedy algorithm walks the trie to reorder retrieved documents, prioritizing paths that align with cached prefixes.
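
The method described above might be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the class and method names are hypothetical, ties are broken by retrieval order, and eviction of stale sequences (the tree is meant to hold only recently served evidence) is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps a document ID to the child node reached by serving it next.
    children: dict = field(default_factory=dict)


class KnowledgeTree:
    """Trie over recently served evidence sequences (hypothetical sketch)."""

    def __init__(self):
        self.root = Node()

    def record(self, sequence):
        """Store the document order that was actually sent to the engine."""
        node = self.root
        for doc_id in sequence:
            node = node.children.setdefault(doc_id, Node())

    def reorder(self, retrieved):
        """Greedily walk the trie, emitting retrieved docs that extend a stored path."""
        remaining = list(retrieved)
        ordered = []
        node = self.root
        while remaining:
            # Assumption: among matching children, prefer the earliest in retrieval order.
            match = next((d for d in remaining if d in node.children), None)
            if match is None:
                break
            ordered.append(match)
            remaining.remove(match)
            node = node.children[match]
        # Documents with no stored path keep their retrieval order at the end.
        return ordered + remaining


tree = KnowledgeTree()
tree.record(["doc_a", "doc_b", "doc_c"])
new_order = tree.reorder(["doc_b", "doc_d", "doc_a"])
tree.record(new_order)  # remember what was served for later requests
```

In this example `new_order` is `["doc_a", "doc_b", "doc_d"]`: the two documents that extend a stored path move to the front, and the evidence set itself is unchanged.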

In practice

Implement as Python middleware for vLLM.
Use for customer service, domain assistants.
Monitor TTFT for cache-state feedback.
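
For the TTFT monitoring point, one possible shape is a rolling-median tracker. The sketch below assumes the serving layer exposes the response as an iterable of streamed tokens. How that stream is obtained from the engine is outside its scope, and the names `TTFTMonitor` and `time_first_token` are hypothetical.

```python
import time
from collections import deque
from statistics import median


class TTFTMonitor:
    """Keeps a rolling window of time-to-first-token samples, in seconds."""

    def __init__(self, window=100):
        self.samples = deque(maxlen=window)  # oldest samples fall off automatically

    def observe(self, seconds):
        self.samples.append(seconds)

    def p50(self):
        """Median TTFT over the window, or None before any sample arrives."""
        return median(self.samples) if self.samples else None


def time_first_token(stream, clock=time.perf_counter):
    """Return the first item of a token stream and the seconds spent waiting for it."""
    start = clock()
    first = next(iter(stream), None)
    return first, clock() - start


monitor = TTFTMonitor(window=3)
for sample in (0.5, 0.1, 0.3, 0.2):
    monitor.observe(sample)

rolling_p50 = monitor.p50()  # median of the last three samples
token, elapsed = time_first_token(iter(["Hello", " world"]))
```

A rising rolling median can indicate that reordered prompts are no longer hitting the cache, for example after the engine evicts KV blocks that the knowledge tree still assumes are present.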

Topics

Retrieval-Augmented Generation
LLM Inference Optimization
Prefix Caching
vLLM
Time-to-First-Token
Evidence Ordering

Best for: MLOps Engineer, AI Architect, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.