QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
Summary
QCFuse is a query-aware cache fusion selector that works from a compressed view to make Retrieval-Augmented Generation (RAG) serving more efficient. RAG improves large language model (LLM) answer quality by supplying external evidence, but the added context makes the prefill stage expensive. Existing RAG cache fusion methods struggle to balance quality and efficiency. QCFuse addresses this with two techniques: chunk-anchor query probing, which conditions user-query states on compact per-chunk anchors, and critical-layer profiling, which identifies the tokens to recompute without extensive inspection. Implemented in SGLang and evaluated across four open-weight LLMs and six datasets, QCFuse achieves full-prefill-level quality. It delivers an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, a strong quality-preserving baseline.
Key takeaway
For MLOps engineers optimizing RAG deployments, QCFuse targets the prefill-stage bottleneck. It reports an average 1.7x prefill-time speedup over full prefill while keeping answer quality at full-prefill level, which addresses the quality-efficiency trade-off common to existing cache fusion methods. Consider evaluating QCFuse if prefill cost limits the efficiency or scalability of your RAG-powered LLM applications.
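To put the reported speedups in capacity-planning terms, a speedup factor s corresponds to a prefill-time reduction of 1 - 1/s. The short sketch below applies that to the two averages quoted in the summary. The function name and the 2.0 s baseline are illustrative assumptions, not figures from the paper, and averages reported across models and datasets will not necessarily hold for any single workload.

```python
def prefill_time_saved(speedup: float) -> float:
    """Fraction of prefill time removed by a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Average speedup factors quoted in the summary.
for baseline, speedup in [("full prefill", 1.7), ("ProphetKV", 1.5)]:
    saved = prefill_time_saved(speedup)
    print(f"vs {baseline}: {saved:.1%} less prefill time")

# Hypothetical baseline: a 2.0 s full prefill (illustrative, not from the paper).
hypothetical_full_prefill_s = 2.0
print(f"{hypothetical_full_prefill_s / 1.7:.2f} s after a 1.7x speedup")
```

Read this way, 1.7x over full prefill is roughly a 41% cut in prefill time, and 1.5x over ProphetKV is roughly a 33% cut relative to that baseline.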
Key insights
QCFuse optimizes RAG serving by using a compressed, query-aware cache fusion selector for efficient prefill.
Principles
- RAG prefill cost is a dominant serving bottleneck.
- Cache fusion selectors face a quality-efficiency dilemma.
- Query-aware selection can be efficient with compressed views.
Method
QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens.
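The summary does not say how anchors are built or how tokens are scored, so the following is only a toy sketch of the general idea of query-aware selection over a compressed view, not QCFuse's actual algorithm. It assumes each chunk's anchor is the mean of its cached key vectors, that the query is probed against anchors only, and that the keys come from a single pre-chosen layer standing in for critical-layer profiling. All function names, the scoring rule, and the fixed token budget are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_recompute_tokens(query, chunks, budget):
    """Pick `budget` tokens to recompute, as (chunk_index, token_index) pairs.

    query:  one query vector (hypothetical stand-in for user-query states).
    chunks: list of chunks; each chunk is a list of cached key vectors
            taken from one layer (assumed to be chosen in advance).
    """
    # Assumption: a chunk's anchor is the mean of its key vectors.
    anchors = [mean_vector(chunk) for chunk in chunks]
    # Probe the query against the compact anchors, not every token.
    chunk_weights = softmax([dot(query, a) for a in anchors])

    # Assumed scoring rule: chunk weight times token-level query affinity.
    scored = []
    for ci, chunk in enumerate(chunks):
        for ti, key in enumerate(chunk):
            scored.append((chunk_weights[ci] * dot(query, key), ci, ti))

    scored.sort(key=lambda item: item[0], reverse=True)
    return sorted((ci, ti) for _, ci, ti in scored[:budget])


# Two toy chunks with 2-dimensional keys; the query aligns with chunk 0.
query = [1.0, 0.0]
chunks = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
print(select_recompute_tokens(query, chunks, budget=2))
```

With this toy input, both selected tokens come from the chunk whose anchor best matches the query. A real system would operate on per-layer, per-head tensors inside the serving engine rather than on Python lists.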
In practice
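- QCFuse was implemented in SGLang, making SGLang-based RAG serving the reference setting.
- The evaluation covers four open-weight LLMs and six datasets, with an average 1.7x prefill-time speedup over full prefill.
- Reported answer quality matches full prefill.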
Topics
- Retrieval-Augmented Generation
- LLM Serving
- Cache Fusion
- Prefill Optimization
- SGLang
- Query-Aware Selection
Best for: AI Architect, AI Engineer, NLP Engineer, AI Scientist, MLOps Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.