QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

2026-05-28 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

QCFuse is a compressed-view, query-aware selector that speeds up the prefill stage of Retrieval-Augmented Generation (RAG) serving. Prefill over retrieved context, which grounds LLM generation in external evidence, is computationally expensive. Existing cache fusion methods either trade answer quality for speed or introduce pipeline stalls. QCFuse addresses this with two techniques: chunk-anchor query probing, which conditions the user query on compact per-chunk anchors, and critical-layer profiling, which identifies the tokens to recompute without extensive layer-by-layer inspection. Implemented in SGLang, QCFuse was evaluated on four open-weight LLMs (Mistral-v0.3-7B, Llama-3.1-8B, Qwen3-8B, Qwen3-14B) across six datasets. It matches full-prefill quality while delivering an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, a strong quality-preserving baseline.

Key takeaway

For machine learning engineers optimizing RAG serving, QCFuse offers a way to reduce prefill latency without sacrificing generation quality. Its compressed-view, query-aware selection delivers a reported average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV. The approach also reduces I/O bottlenecks and sustains higher throughput under increased request loads, which makes it well suited to long-context RAG applications.

Key insights

Efficient RAG serving demands query-aware token selection that avoids pipeline stalls through compressed evidence views.

Principles

Query-aware selection is crucial for RAG quality.
Compressed token/layer views prevent pipeline stalls.
Middle Transformer layers often best localize evidence.

Method

QCFuse works in two phases. Offline, it constructs compact chunk anchors and profiles critical layers. Online, it probes the query against the anchors and scores context tokens using the K states of the critical layers.
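
The online scoring step can be illustrated with a minimal sketch. It is not the QCFuse implementation: the function names, the dictionary layout, and the use of softmax attention mass as the score are all assumptions made for illustration. The query states are assumed to come from probing the query against the chunk anchors, and the list of critical layers is assumed to come from the offline profiling step.

```python
import math


def score_context_tokens(query_states, key_states, critical_layers):
    """Accumulate attention-style scores for context tokens.

    query_states: {layer: [query vector, ...]}, assumed to come from
                  probing the query against the chunk anchors.
    key_states:   {layer: [key vector per context token]}, the cached K states.
    critical_layers: layer indices chosen by offline profiling.
    All names and layouts here are hypothetical.
    """
    num_tokens = len(key_states[critical_layers[0]])
    scores = [0.0] * num_tokens
    for layer in critical_layers:
        keys = key_states[layer]
        dim = len(keys[0])
        for q in query_states[layer]:
            # Scaled dot product between one query vector and every key.
            logits = [
                sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
            ]
            # Numerically stable softmax over the context tokens.
            peak = max(logits)
            exps = [math.exp(x - peak) for x in logits]
            total = sum(exps)
            for i, e in enumerate(exps):
                scores[i] += e / total
    return scores


def select_recompute_tokens(scores, budget):
    """Return the indices of the `budget` highest-scoring tokens, in order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy data: one critical layer (index 5), one query vector, four context tokens.
key_states = {5: [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 0.0]]}
query_states = {5: [[4.0, 0.0]]}

scores = score_context_tokens(query_states, key_states, critical_layers=[5])
recompute = select_recompute_tokens(scores, budget=2)
print(recompute)
```

In this toy setup the tokens whose keys align with the query receive the most attention mass, so they are the ones marked for recomputation. A real serving system would run this on batched tensors on the GPU.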

In practice

Employ KVzip@10% for chunk anchor selection.
Identify top-3 critical layers for token localization.
Integrate selection into layer-wise cache fusion.
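
The first two settings above can be sketched as simple top-k selections. This is a hedged illustration, not the authors' code: it reads "KVzip@10%" as keeping the top 10% of tokens per chunk, it takes the per-token importance scores as given rather than computing them the way KVzip does, and the per-layer localization scores are a hypothetical stand-in for whatever the offline profiler measures.

```python
def select_anchor_tokens(importance, keep_ratio=0.10):
    """Keep the highest-importance tokens of one chunk as its anchor.

    importance: one score per token (assumed to be supplied by a
                KVzip-style scorer; not computed here).
    keep_ratio: fraction of tokens to keep, rounded, with at least one kept.
    """
    if not importance:
        return []
    keep = max(1, round(len(importance) * keep_ratio))
    ranked = sorted(
        range(len(importance)), key=lambda i: importance[i], reverse=True
    )
    return sorted(ranked[:keep])


def top_critical_layers(layer_scores, k=3):
    """Pick the k layers with the highest (hypothetical) localization score."""
    ranked = sorted(
        range(len(layer_scores)), key=lambda l: layer_scores[l], reverse=True
    )
    return sorted(ranked[:k])


# Toy chunk of 20 tokens where tokens 3 and 17 matter most.
importance = [0.1] * 20
importance[3] = 0.9
importance[17] = 0.8
anchors = select_anchor_tokens(importance, keep_ratio=0.10)

# Toy profile of a 6-layer model; the middle layers score highest.
layer_scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3]
critical = top_critical_layers(layer_scores, k=3)

print(anchors, critical)
```

Both helpers return indices in ascending order, so the anchor keeps the original token order and the critical layers can be visited in a single forward pass.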

Topics

Retrieval-Augmented Generation
KV Cache Fusion
LLM Serving Optimization
Query-Aware Selection
Prefill Latency
SGLang

Code references

uYanJX/QCFuse

Best for: MLOps Engineer, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.