Uncertainty-Aware Hybrid Retrieval for Long-Document RAG

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

Uncertainty-aware Multi-Granularity RAG (UMG-RAG) is a training-free hybrid retrieval framework that targets the granularity tradeoff in Retrieval-Augmented Generation (RAG) over long documents. It runs existing dense and sparse retrievers over multiple chunk granularities, such as 2, 4, 8, 16, and 32 sentences. For each query, UMG-RAG estimates the reliability of each expert-granularity pair from the entropy of its candidate score distribution, then fuses candidates according to this query-specific confidence. An extension, UMGP-RAG, promotes fine-grained retrieval hits (from 2- or 4-sentence chunks) to broader parent chunks (8 sentences) and removes redundant overlaps to preserve local coherence. Evaluated on the Natural Questions and HotPotQA datasets with models such as Qwen2.5-3B-Instruct and BGE-M3, UMGP-RAG consistently improved generation quality. The cost shifts toward retrieval: preprocessing rose to 5.36 seconds per question, while generation time fell to 0.30-0.33 seconds and memory usage dropped below 7000 MiB.

Key takeaway

Machine learning engineers optimizing RAG systems for long documents should consider implementing uncertainty-aware hybrid retrieval with multi-granularity chunking and parent promotion. In the reported evaluation, this approach (UMGP-RAG) improved generation quality by delivering compact, coherent contexts to the LLM. It increases retrieval preprocessing time (5.36 seconds per question) but reduces generation time (to 0.30-0.33 seconds) and memory usage (to under 7000 MiB), and it requires no additional training.

Key insights

UMG-RAG adaptively fuses multi-granularity, multi-expert retrieval based on query-specific uncertainty to create compact, coherent RAG contexts.

Principles

Retrieval granularity is query-specific.
Score distribution entropy indicates reliability.
Fine-grained hits locate the evidence; parent chunks keep it coherent.

Method

Retrieve multi-granularity candidates. Normalize scores to evidence distributions. Estimate query-specific confidence from distribution entropy. Fuse candidates using confidence weights. Rank by evidence utility, then promote fine-grained hits to parent chunks with overlap-aware deduplication.
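
The scoring and fusion steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the softmax normalization, the "1 minus normalized entropy" confidence, the weighted-sum fusion, and all function names are assumptions chosen to make the idea concrete. The summary only states that confidence is derived from the entropy of each candidate score distribution.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Turn raw retriever scores into a probability distribution (assumed normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_from_entropy(probs):
    """Assumed confidence: 1 - H(p) / log(n). Near 0 for a flat list, near 1 for a peaked one."""
    n = len(probs)
    if n < 2:
        return 1.0  # design choice for this sketch: a single candidate has zero entropy
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return max(0.0, 1.0 - entropy / math.log(n))


def fuse(candidate_lists):
    """Fuse candidates from every (expert, granularity) pair.

    candidate_lists maps (expert, granularity) -> [(chunk_id, raw_score), ...].
    Returns the ranked (chunk_id, fused_score) list and the per-pair confidences.
    """
    fused = defaultdict(float)
    confidences = {}
    for pair, candidates in candidate_lists.items():
        if not candidates:
            continue
        probs = softmax([score for _, score in candidates])
        conf = confidence_from_entropy(probs)
        confidences[pair] = conf
        for (chunk_id, _), p in zip(candidates, probs):
            fused[chunk_id] += conf * p  # confidence-weighted evidence
    ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, confidences


if __name__ == "__main__":
    # Hypothetical scores: the dense list is sharply peaked, the sparse list is flat.
    lists = {
        ("dense", 4): [("g4:0", 9.0), ("g4:1", 1.0), ("g4:2", 1.0)],
        ("sparse", 4): [("g4:1", 2.0), ("g4:0", 2.0), ("g4:2", 2.0)],
    }
    ranked, confidences = fuse(lists)
    print(confidences)
    print(ranked)
```

In this sketch a retriever whose scores are nearly flat for a query contributes almost nothing to the fused ranking, while a retriever with one clearly dominant candidate is trusted more. That is the query-specific behavior the method describes; the exact weighting in the paper may differ.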

In practice

Utilize existing dense/sparse retrievers.
Segment documents into chunks of 2, 4, 8, 16, and 32 sentences.
Promote fine-grained hits to broader parent chunks.
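
As a rough illustration of the chunking and parent-promotion steps listed above, the sketch below works on sentence-index spans within a single document. It assumes non-overlapping, aligned chunks (so every 2- or 4-sentence chunk sits inside exactly one 8-sentence parent) and a simple "drop anything that overlaps a higher-ranked span" deduplication rule. Both are assumptions for the example, and the function names are hypothetical; the paper's overlap handling may be more refined.

```python
def chunk_spans(n_sentences, size):
    """Non-overlapping chunks as half-open (start, end) sentence spans."""
    return [(i, min(i + size, n_sentences)) for i in range(0, n_sentences, size)]


def promote_to_parent(span, n_sentences, parent_size=8):
    """Map a fine-grained span to the aligned parent chunk that contains it."""
    parent_start = (span[0] // parent_size) * parent_size
    return (parent_start, min(parent_start + parent_size, n_sentences))


def overlaps(a, b):
    """True if two half-open spans share at least one sentence."""
    return a[0] < b[1] and b[0] < a[1]


def build_context(ranked_hits, n_sentences, fine=(2, 4), parent_size=8):
    """Walk hits in rank order, promote fine-grained ones, skip overlapping spans.

    ranked_hits is a list of (granularity, (start, end)) pairs, best first.
    """
    kept = []
    for granularity, span in ranked_hits:
        if granularity in fine:
            span = promote_to_parent(span, n_sentences, parent_size)
        if any(overlaps(span, k) for k in kept):
            continue  # already covered by a higher-ranked span
        kept.append(span)
    return kept


if __name__ == "__main__":
    n = 20  # hypothetical document length in sentences
    print({g: chunk_spans(n, g) for g in (2, 4, 8, 16, 32)})
    hits = [(2, (10, 12)), (4, (12, 16)), (16, (0, 16)), (8, (16, 20))]
    print(build_context(hits, n))
```

Here the top-ranked 2-sentence hit is widened to its 8-sentence parent, and later hits that cover the same sentences are skipped, so the context passed to the LLM stays compact while the selected passage remains locally coherent.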

Topics

Retrieval-Augmented Generation
Hybrid Retrieval
Multi-Granularity Retrieval
Uncertainty Estimation
Parent Promotion
Long-Document QA

Best for: AI Architect, AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.