QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
Summary
QCFuse is a compressed-view, query-aware selector that makes Retrieval-Augmented Generation (RAG) serving more efficient by optimizing the prefill stage. RAG grounds LLM generation in external evidence, but prefilling the retrieved context is computationally expensive. Existing cache fusion methods either trade answer quality for speed or introduce pipeline stalls. QCFuse addresses both problems with two techniques: chunk-anchor query probing, which conditions the user query on compact per-chunk anchors, and critical-layer profiling, which identifies the tokens to recompute without extensive layer inspection. Implemented in SGLang, QCFuse was evaluated on four open-weight LLMs (Mistral-v0.3-7B, Llama-3.1-8B, Qwen3-8B, Qwen3-14B) across six datasets. It reaches full-prefill-level quality with an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, a strong quality-preserving baseline.
Key takeaway
For Machine Learning Engineers optimizing RAG serving, QCFuse reduces prefill latency without sacrificing generation quality. Its compressed-view, query-aware selection yields an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV. The approach also reduces I/O bottlenecks and sustains higher throughput as request load increases, which makes it well suited to long-context RAG applications.
Key insights
Efficient RAG serving demands query-aware token selection that avoids pipeline stalls through compressed evidence views.
Principles
- Query-aware selection is crucial for RAG quality.
- Compressed token/layer views prevent pipeline stalls.
- Middle Transformer layers often best localize evidence.
Method
QCFuse constructs compact chunk anchors and profiles critical layers offline. Online, it probes queries on anchors and scores context tokens using critical-layer K states.
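The online scoring step can be illustrated with a minimal sketch. This is not the QCFuse implementation: the scoring rule (softmax attention mass averaged over the critical layers), the function names `score_context_tokens` and `select_recompute_tokens`, and the `budget_ratio` parameter are all assumptions made for illustration. The query states are assumed to come from probing the query on the chunk anchors, and the key states from the cached context at the profiled critical layers.

```python
import math
from typing import Dict, List

Vector = List[float]


def score_context_tokens(
    query_states: Dict[int, List[Vector]],
    key_states: Dict[int, List[Vector]],
    critical_layers: List[int],
) -> List[float]:
    """Average softmax attention mass each context token receives from the
    query tokens, taken over the critical layers only (assumed scoring rule)."""
    num_tokens = len(key_states[critical_layers[0]])
    scores = [0.0] * num_tokens
    num_rows = 0
    for layer in critical_layers:
        keys = key_states[layer]
        scale = math.sqrt(len(keys[0]))
        for q in query_states[layer]:
            logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
            peak = max(logits)  # subtract the max for numerical stability
            weights = [math.exp(x - peak) for x in logits]
            total = sum(weights)
            for i, w in enumerate(weights):
                scores[i] += w / total
            num_rows += 1
    return [s / num_rows for s in scores]


def select_recompute_tokens(scores: List[float], budget_ratio: float) -> List[int]:
    """Return the positions of the highest-scoring tokens, in document order.
    `budget_ratio` is a hypothetical knob for the recomputation budget."""
    budget = max(1, math.ceil(budget_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

Because only a few layers are inspected, the cost of scoring grows with the number of critical layers rather than with model depth, which is the intuition behind the compressed view.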
In practice
- Employ KVzip@10% for chunk anchor selection.
- Identify top-3 critical layers for token localization.
- Integrate selection into layer-wise cache fusion.
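One way to picture the offline top-3 layer selection is the sketch below. The ranking criterion is an assumption, not the paper's stated metric: each layer is scored by how often its top-k tokens contain known evidence tokens on a calibration set, and the three best layers are kept. The names `evidence_recall` and `profile_critical_layers`, and the calibration inputs, are hypothetical.

```python
from typing import Dict, List, Set


def evidence_recall(token_scores: List[float], evidence: Set[int], k: int) -> float:
    """Fraction of evidence token positions found among the k highest-scoring tokens."""
    top = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)[:k]
    return len(evidence.intersection(top)) / len(evidence)


def profile_critical_layers(
    per_layer_scores: List[Dict[int, List[float]]],
    evidence_sets: List[Set[int]],
    k: int,
    num_layers_to_keep: int = 3,
) -> List[int]:
    """Rank layers by summed evidence recall over calibration samples and
    return the indices of the best `num_layers_to_keep` layers, sorted by depth."""
    totals: Dict[int, float] = {}
    for sample_scores, evidence in zip(per_layer_scores, evidence_sets):
        for layer, scores in sample_scores.items():
            totals[layer] = totals.get(layer, 0.0) + evidence_recall(scores, evidence, k)
    ranked = sorted(totals, key=lambda layer: totals[layer], reverse=True)
    return sorted(ranked[:num_layers_to_keep])
```

The profiling runs once per model, so the online path only needs the resulting layer indices.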
Topics
- Retrieval-Augmented Generation
- KV Cache Fusion
- LLM Serving Optimization
- Query-Aware Selection
- Prefill Latency
- SGLang
Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.