When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, extended

Summary

A diagnostic study evaluated eleven state-of-the-art LLMs on the chemistry-focused ChemKGMultiHopQA dataset, comparing iterative Retrieval-Augmented Generation (RAG) against static RAG, including an idealized Gold Context setting in which the ideal evidence is supplied up front. Iterative RAG consistently outperformed Gold Context, with accuracy gains of up to 25.6 percentage points, particularly for non-reasoning models. The study attributes the improvement to synchronized retrieval and reasoning, which reduces late-hop failures, mitigates context overload, and corrects early hypothesis drift. It also identifies critical failure modes: incomplete hop coverage, distractor latching, miscalibrated early stopping, and composition failures that remain frequent even with perfect retrieval. These results indicate that, in complex scientific multi-hop QA, staging retrieval alongside reasoning matters more than simply supplying ideal evidence.

Key takeaway

For AI scientists and ML engineers deploying RAG in specialized scientific multi-hop QA: prioritize iterative retrieval-reasoning loops over static evidence. Design systems that actively manage context and refine hypotheses across steps, since this process significantly boosts accuracy, especially for non-reasoning models. Watch for Parametric Memory Suppression and Distractor Latch failures, and add diagnostics that track retrieval coverage and composition errors.
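As a rough illustration of the diagnostics mentioned above, the sketch below computes a per-question hop coverage score and separates two of the failure modes named in the summary. The definitions are assumptions made for illustration, not the paper's: coverage is taken as the fraction of gold hop facts found by substring match in the retrieved passages, and a composition failure is taken to be a wrong answer despite full coverage. The function names `hop_coverage` and `classify_outcome` are hypothetical.

```python
from typing import List


def hop_coverage(gold_hops: List[str], retrieved: List[str]) -> float:
    """Fraction of gold hop facts that appear in at least one retrieved passage.

    Assumption: a hop counts as covered on a case-insensitive substring match.
    """
    if not gold_hops:
        return 1.0
    passages = [p.lower() for p in retrieved]
    covered = sum(1 for hop in gold_hops if any(hop.lower() in p for p in passages))
    return covered / len(gold_hops)


def classify_outcome(
    gold_hops: List[str], retrieved: List[str], predicted: str, gold_answer: str
) -> str:
    """Label one example as correct, a composition failure, or a coverage failure."""
    if predicted.strip().lower() == gold_answer.strip().lower():
        return "correct"
    # Assumed definition: all hops were retrieved, yet the answer is still wrong.
    if hop_coverage(gold_hops, retrieved) == 1.0:
        return "composition_failure"
    return "incomplete_hop_coverage"


hops = ["acetylsalicylic acid", "cyclooxygenase"]
passages = [
    "Aspirin contains acetylsalicylic acid.",
    "Acetylsalicylic acid inhibits cyclooxygenase.",
]
print(hop_coverage(hops, passages[:1]))  # 0.5
print(classify_outcome(hops, passages, "lipase", "cyclooxygenase"))  # composition_failure
```

Tracking these labels separately makes it possible to tell whether errors come from retrieval or from combining evidence that was already retrieved.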

Key insights

Iterative RAG's staged retrieval and reasoning process outperforms static ideal evidence in complex multi-hop QA.

Method

A training-free iterative RAG controller alternates retrieval, hypothesis refinement, and evidence-aware stopping. Diagnostics cover retrieval coverage, query quality, anchor carry-drop, and composition failure.
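The control loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `iterative_rag`, `State`, and the `toy_*` helpers are hypothetical, the two-passage corpus is made up for the demo, and the toy functions stand in for a real retriever and for LLM calls that would refine the hypothesis and judge whether the evidence suffices.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class State:
    question: str
    hypothesis: str = ""
    evidence: List[str] = field(default_factory=list)
    steps: int = 0


def iterative_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    refine: Callable[[State], Tuple[str, str]],
    is_sufficient: Callable[[State], bool],
    max_steps: int = 4,
) -> State:
    """Alternate retrieval, hypothesis refinement, and an evidence-aware stop check."""
    state = State(question)
    query = question
    for _ in range(max_steps):
        state.steps += 1
        # 1. Retrieve for the current query, keeping only unseen passages.
        for passage in retrieve(query):
            if passage not in state.evidence:
                state.evidence.append(passage)
        # 2. Refine the working hypothesis and derive the next query.
        state.hypothesis, query = refine(state)
        # 3. Stop once the accumulated evidence is judged sufficient.
        if is_sufficient(state):
            break
    return state


# Toy stand-ins (assumptions for the demo only).
CORPUS = {
    "aspirin": "Aspirin contains acetylsalicylic acid.",
    "acetylsalicylic acid": "Acetylsalicylic acid inhibits cyclooxygenase.",
}


def toy_retrieve(query: str) -> List[str]:
    q = query.lower()
    return [text for key, text in CORPUS.items() if key in q]


def toy_refine(state: State) -> Tuple[str, str]:
    if not state.evidence:
        return state.hypothesis, state.question
    latest = state.evidence[-1].rstrip(".")
    for verb in (" contains ", " inhibits "):
        if verb in latest:
            entity = latest.split(verb, 1)[1]
            return entity, entity  # new hypothesis, and the next hop's query
    return state.hypothesis, state.question


def toy_is_sufficient(state: State) -> bool:
    # The question asks what is inhibited, so wait for an "inhibits" fact.
    return any(" inhibits " in p for p in state.evidence)


result = iterative_rag(
    "Which enzyme is inhibited by the compound in aspirin?",
    toy_retrieve, toy_refine, toy_is_sufficient,
)
print(result.hypothesis, result.steps)  # cyclooxygenase 2
```

Passing the retriever, refiner, and stopping rule in as callables keeps the controller training-free and lets each component be swapped or instrumented on its own.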

Best for: AI Architect, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.