QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

QCFuse is a query-aware token selector for cache fusion that works on a compressed view of the retrieved context, designed to make the prefill stage of Retrieval-Augmented Generation (RAG) serving cheaper. RAG grounds LLM generation in external evidence, but prefilling the retrieved context incurs significant computational cost. Existing cache fusion methods either trade answer quality for speed or introduce pipeline stalls. QCFuse addresses this with two techniques: chunk-anchor query probing, which conditions the user query on compact per-chunk anchors, and critical-layer profiling, which identifies the tokens to recompute without inspecting many layers. Implemented in SGLang, QCFuse was evaluated on four open-weight LLMs (Mistral-v0.3-7B, Llama-3.1-8B, Qwen3-8B, Qwen3-14B) across six datasets. It is reported to match full-prefill quality while delivering an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, a strong quality-preserving baseline.

Key takeaway

For machine learning engineers optimizing RAG serving, QCFuse offers a way to reduce prefill latency without sacrificing generation quality. Its compressed-view, query-aware selection is reported to deliver an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV. The approach is also reported to reduce I/O bottlenecks and sustain higher throughput as request load increases, which makes it a good fit for long-context RAG applications.
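
To put the reported averages in concrete terms, the short sketch below converts the two speedup figures into prefill times. The 1000 ms baseline is a hypothetical number chosen for illustration, not a measurement from the paper, and because the 1.7x and 1.5x figures are averages over models and datasets, the implied ProphetKV time is only a rough estimate.

```python
def time_after_speedup(baseline_ms: float, speedup: float) -> float:
    """Return the latency obtained when a baseline is sped up by `speedup`x."""
    return baseline_ms / speedup


# Hypothetical full-prefill latency, used only to make the ratios tangible.
full_prefill_ms = 1000.0

# Reported average speedup of QCFuse over full prefill.
qcfuse_ms = time_after_speedup(full_prefill_ms, 1.7)

# QCFuse is reported as 1.5x faster than ProphetKV, so ProphetKV would take
# roughly 1.5x the QCFuse time (an estimate, since averages do not compose exactly).
prophetkv_ms = qcfuse_ms * 1.5

print(f"QCFuse:    {qcfuse_ms:.1f} ms")
print(f"ProphetKV: {prophetkv_ms:.1f} ms (rough estimate)")
```

Read this way, a 1.7x speedup removes roughly 41% of the prefill time rather than 70%, which is a common misreading of speedup factors.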

Key insights

Efficient RAG serving demands query-aware token selection that avoids pipeline stalls through compressed evidence views.

Principles

Method

Offline, QCFuse constructs compact chunk anchors and profiles critical layers. Online, it probes each query against those anchors and scores context tokens using the key (K) states of the critical layers, which determines the tokens to recompute.
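
The online scoring step can be illustrated with a generic attention-based selector. The following is a minimal sketch, not the QCFuse implementation: it assumes the query states have already been produced (for example by probing on the chunk anchors), that the K states of a single critical layer are available as plain vectors, and that tokens are chosen by a top-ratio rule. The function names and the `ratio` parameter are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_context_tokens(query_states, key_states):
    """Sum, over query tokens, the attention weight placed on each context token."""
    scale = 1.0 / math.sqrt(len(key_states[0]))
    scores = [0.0] * len(key_states)
    for q in query_states:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in key_states]
        for index, weight in enumerate(softmax(logits)):
            scores[index] += weight
    return scores


def select_recompute_tokens(scores, ratio):
    """Return the positions of the highest-scoring tokens, in document order."""
    budget = max(1, math.ceil(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data: four context tokens with 2-dimensional K states, one query token.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
queries = [[2.0, 0.0]]

scores = score_context_tokens(queries, keys)
selected = select_recompute_tokens(scores, ratio=0.5)
print(selected)
```

In this toy setup the tokens whose keys align with the query receive the most attention mass and are picked for recomputation, while the cached states of the remaining tokens would be reused. How QCFuse builds its anchors and picks its critical layers is not specified in this summary.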

Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.