Uncertainty-Aware Hybrid Retrieval for Long-Document RAG
Summary
Uncertainty-aware Multi-Granularity RAG (UMG-RAG) is a training-free hybrid retrieval framework that targets the granularity tradeoff in Retrieval-Augmented Generation (RAG) over long documents. It runs existing dense and sparse retrievers over several chunk granularities: 2, 4, 8, 16, and 32 sentences. For each query, UMG-RAG estimates the reliability of each expert-granularity pair from the entropy of its candidate score distribution, then fuses the candidates according to this query-specific confidence. An extension, UMGP-RAG, promotes fine-grained retrieval hits (from 2- or 4-sentence chunks) to broader parent chunks (8 sentences) and removes redundant overlaps to preserve local coherence. Evaluated on the Natural Questions and HotPotQA datasets with models such as Qwen2.5-3B-Instruct and BGE-M3, UMGP-RAG consistently improved generation quality. It raised retrieval preprocessing cost to 5.36 seconds per question, but reduced generation time to 0.30-0.33 seconds and memory usage to below 7000 MiB.
Key takeaway
Machine Learning Engineers optimizing RAG systems for long documents should consider uncertainty-aware hybrid retrieval with multi-granularity chunking and parent promotion. This approach, exemplified by UMGP-RAG, improves generation quality by delivering compact, coherent contexts to the LLM. It increases retrieval preprocessing time (e.g., 5.36 seconds per question) but reduces generation time (0.30-0.33 seconds) and memory usage (under 7000 MiB), and it requires no training, so it can be added to an existing pipeline.
Key insights
UMG-RAG adaptively fuses multi-granularity, multi-expert retrieval based on query-specific uncertainty to create compact, coherent RAG contexts.
Principles
- Retrieval granularity is query-specific.
- Score distribution entropy indicates reliability.
- Fine-grained hits locate the evidence; parent chunks keep it coherent.
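The second principle can be illustrated with a minimal sketch. The summary does not give the exact formula, so the softmax normalization, the `temperature` parameter, and the "one minus normalized entropy" confidence below are assumptions chosen for illustration, not the authors' definition. The function names are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw retrieval scores into a probability distribution (assumed normalization)."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_confidence(scores, temperature=1.0):
    """Return a confidence in [0, 1]: 1 minus the normalized Shannon entropy.

    A peaked score distribution (one clear winner) has low entropy and high
    confidence; a flat distribution has high entropy and low confidence.
    """
    if len(scores) < 2:
        # Illustrative choice: with fewer than two candidates there is no
        # distribution to judge, so report zero confidence.
        return 0.0
    probs = softmax(scores, temperature)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))
```

Under these assumptions, a retriever that returns one dominant candidate for a query scores close to 1, while a retriever whose candidates are nearly tied scores close to 0.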
Method
Retrieve multi-granularity candidates. Normalize scores to evidence distributions. Estimate query-specific confidence from distribution entropy. Fuse candidates using confidence weights. Rank by evidence utility, then promote fine-grained hits to parent chunks with overlap-aware deduplication.
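The fusion step might look like the sketch below. It assumes each (expert, granularity) run has already been normalized to an evidence distribution and that a per-run confidence is available; the weighted-sum rule, the function name `fuse_candidates`, and the candidate identifiers are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def fuse_candidates(runs, confidences):
    """Fuse candidates from several retrieval runs using confidence weights.

    runs:        {(expert, granularity): {candidate_id: normalized_score}}
    confidences: {(expert, granularity): non-negative confidence}
    Returns a list of (candidate_id, fused_score), best first.
    """
    total = sum(confidences.get(key, 0.0) for key in runs)
    if total <= 0.0:
        return []  # no run is trusted for this query
    fused = defaultdict(float)
    for key, candidates in runs.items():
        weight = confidences.get(key, 0.0) / total  # weights sum to 1
        for candidate_id, score in candidates.items():
            fused[candidate_id] += weight * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Example: the dense run is trusted more than the sparse run for this query.
example_runs = {
    ("dense", 4): {"chunk_a": 0.7, "chunk_b": 0.3},
    ("sparse", 4): {"chunk_b": 0.6, "chunk_c": 0.4},
}
example_confidences = {("dense", 4): 0.8, ("sparse", 4): 0.2}
ranked = fuse_candidates(example_runs, example_confidences)
```

Because the weights are recomputed for every query, a run that is uncertain on one question can still dominate on another.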
In practice
- Utilize existing dense/sparse retrievers.
- Segment documents into 2, 4, 8, 16, 32 sentence chunks.
- Promote fine-grained hits to broader parent chunks.
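The chunking and promotion steps above can be sketched as follows. This assumes non-overlapping sentence windows aligned to multiples of their size, so every 2- or 4-sentence chunk nests inside exactly one 8-sentence parent; the summary does not say whether the original chunks overlap. The function names and the containment-based deduplication rule are hypothetical.

```python
def chunk_sentences(sentences, sizes=(2, 4, 8, 16, 32)):
    """Return {size: [(start, end), ...]} sentence windows, with end exclusive."""
    n = len(sentences)
    return {
        size: [(start, min(start + size, n)) for start in range(0, n, size)]
        for size in sizes
    }


def promote_and_dedup(ranked_hits, parent_size=8, n_sentences=None):
    """Promote fine-grained hits to parent windows and drop redundant spans.

    ranked_hits: list of (start, end) sentence spans, best first.
    Spans shorter than parent_size are replaced by the enclosing aligned
    parent window; any span already covered by a selected span is skipped.
    """
    selected = []
    for start, end in ranked_hits:
        if end - start < parent_size:
            start = (start // parent_size) * parent_size
            end = start + parent_size
            if n_sentences is not None:
                end = min(end, n_sentences)  # clip at the end of the document
        if any(s <= start and end <= e for s, e in selected):
            continue  # already covered by a higher-ranked span
        selected.append((start, end))
    return selected


# Example: a 20-sentence document and four ranked hits.
doc = [f"Sentence {i}." for i in range(20)]
chunks = chunk_sentences(doc)
context_spans = promote_and_dedup([(2, 4), (4, 8), (10, 12), (0, 8)], n_sentences=len(doc))
```

In this example the two fine-grained hits in the first eight sentences collapse into a single parent span, which is what keeps the final context compact.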
Topics
- Retrieval-Augmented Generation
- Hybrid Retrieval
- Multi-Granularity Retrieval
- Uncertainty Estimation
- Parent Promotion
- Long-Document QA
Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.