QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
Summary
QCFuse is a query-aware cache fusion selector designed to make retrieval-augmented generation (RAG) serving more efficient by reducing the cost of the prefill stage, the dominant serving cost. Existing RAG cache fusion methods face a trade-off between answer quality and prefill speed: query-agnostic selectors can miss relevant evidence, while full-view selectors add latency. QCFuse addresses this with two techniques: "chunk-anchor query probing", which conditions user-query states on compact per-chunk anchors, and "critical-layer profiling", which identifies the tokens to recompute without extensive layer inspection. Implemented in SGLang, QCFuse was evaluated across four open-weight LLMs and six datasets. It matched full-prefill-level quality while achieving an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, a leading quality-preserving baseline.
Key takeaway
For MLOps engineers optimizing RAG serving, QCFuse reduces prefill latency without sacrificing answer quality. If your RAG pipelines suffer from high inference costs or slow response times, consider evaluating QCFuse. Its reported 1.7x average prefill-time speedup at full-prefill quality could make your LLM applications more efficient and responsive.
Key insights
QCFuse enables efficient, quality-preserving RAG serving by integrating query-aware cache fusion with compressed context views.
Principles
- RAG prefill costs are a primary serving bottleneck.
- Cache fusion reuses precomputed KV caches.
- Query-aware selection is crucial for RAG quality.
Method
QCFuse employs chunk-anchor query probing and critical-layer profiling to identify recomputation tokens efficiently.
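The general idea of query-aware selection can be illustrated with a minimal sketch. This is not QCFuse's algorithm: the summary does not specify how chunk anchors or critical layers are computed, so the example below simply ranks cached tokens by scaled dot-product attention from a single query vector and keeps the top few for recomputation. The function name `select_recompute_tokens`, the `budget` parameter, and the toy vectors are all hypothetical.

```python
import math


def select_recompute_tokens(query, keys, budget):
    """Return sorted indices of the `budget` cached tokens most relevant to the query.

    Illustrative only: a plain attention-score ranking, not the QCFuse selector.
    """
    if not keys or budget <= 0:
        return []
    # Scaled dot-product score between the query vector and each cached key.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax turns scores into attention weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Keep the highest-weight tokens, returned in their original order.
    ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data (hypothetical): one query vector and four cached key vectors.
query = [1.0, 0.0]
keys = [[0.9, 0.1], [0.0, 1.0], [0.8, -0.2], [-1.0, 0.0]]
selected = select_recompute_tokens(query, keys, budget=2)
print(selected)
```

A query-agnostic selector would pick the same tokens regardless of `query`; here, changing the query changes which cached tokens are flagged for recomputation.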
In practice
- Integrate QCFuse into SGLang for RAG serving.
- Expect an average prefill-time speedup of about 1.7x over full prefill, per the reported results.
- Benchmark RAG solutions on diverse datasets.
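To put the reported speedup in latency terms, the short calculation below converts a speedup factor into an expected prefill time. The 1.7x figure comes from the summary above; the 1000 ms baseline is a hypothetical value chosen only for illustration, and the helper name `prefill_time_after_speedup` is an assumption. Actual gains will depend on the model, dataset, and hardware.

```python
def prefill_time_after_speedup(baseline_ms, speedup):
    """Convert a speedup factor into the resulting prefill time in milliseconds."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_ms / speedup


# Hypothetical full-prefill latency; not a measured number.
full_prefill_ms = 1000.0

# 1.7x is the average speedup over full prefill reported in the summary.
fused_ms = prefill_time_after_speedup(full_prefill_ms, 1.7)
saved_fraction = 1 - fused_ms / full_prefill_ms

print(f"prefill time: {fused_ms:.1f} ms, time saved: {saved_fraction:.1%}")
```

Under this assumed baseline, a 1.7x speedup corresponds to roughly 41% less prefill time per request.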
Topics
- Retrieval-augmented Generation
- LLM Serving
- Cache Fusion
- Query-Aware Selection
- SGLang
- Prefill Optimization
Best for: AI Engineer, NLP Engineer, AI Architect, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.