MAGE-RAG: Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA
Summary
MAGE-RAG is a multigranular adaptive graph evidence framework for agentic multimodal RAG in long-document question answering. Existing RAG methods struggle to locate sparse evidence spread across text, tables, images, charts, and complex layouts in long PDFs, and they tend to settle on a static trade-off between evidence coverage, noise, and inference cost. MAGE-RAG uses page retrieval as an entry point and builds an offline evidence graph whose page and element nodes are linked by relations such as containment, reading order, and semantic neighbors. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets, then renders a compact, relevant evidence subgraph for the Large Vision-Language Model (LVLM). In experiments, MAGE-RAG reaches 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc, indicating a better balance between dispersed evidence coverage and context-noise control.
Key takeaway
For Machine Learning Engineers building multimodal RAG systems for long documents, MAGE-RAG offers a way to handle limited context windows and noisy retrieved context. Consider an adaptive, graph-based evidence construction strategy that balances evidence coverage against inference cost at query time rather than fixing the trade-off in advance. The LVLM then consumes a compact, relevant evidence subgraph, which the reported results suggest improves accuracy on QA tasks that span diverse document elements such as text, tables, and images.
Key insights
MAGE-RAG uses an adaptive evidence graph and query-time control for efficient multimodal RAG in long documents.
Principles
- Multigranular evidence graphs improve context relevance.
- Adaptive evidence construction balances coverage and noise.
- Page retrieval can serve as an effective entry point.
Method
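MAGE-RAG builds an offline evidence graph of page and element nodes connected by relations such as containment, reading order, and semantic neighbors. At query time, an online controller iteratively activates, opens, searches, and prunes evidence under explicit budget constraints to form a compact evidence subgraph for the LVLM.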
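The activate, open, search, and prune loop can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names (`Node`, `EvidenceGraph`, `build_evidence`, `demo_graph`), the relation labels, the word-overlap `score` function, and the per-node `cost` budget are all assumptions made for illustration. A real system would presumably score evidence with a multimodal model rather than word overlap.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    kind: str      # "page" or "element"
    text: str      # stand-in for the node content (text, table, image caption)
    cost: int = 1  # hypothetical budget cost, e.g. a token estimate


class EvidenceGraph:
    """Toy offline evidence graph: nodes plus typed, directed edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # node_id -> [(relation, neighbor_id)]

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, relation, dst):
        self.edges[src].append((relation, dst))

    def neighbors(self, node_id, relation):
        return [dst for rel, dst in self.edges[node_id] if rel == relation]


def score(query, node):
    """Toy relevance: fraction of query words that appear in the node text."""
    q = set(query.lower().split())
    t = set(node.text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def build_evidence(graph, query, seed_pages, budget, max_steps=3):
    """Return element ids kept as evidence, ranked by score, within budget."""
    active = list(seed_pages)  # activate: start from the retrieved pages
    scored = {}
    for _ in range(max_steps):
        frontier = []
        for nid in active:
            if graph.nodes[nid].kind == "page":
                # open: expand a page into the elements it contains
                frontier.extend(graph.neighbors(nid, "contains"))
            else:
                # search: follow semantic-neighbor links from an element
                frontier.extend(graph.neighbors(nid, "semantic_neighbor"))
        new = [n for n in dict.fromkeys(frontier) if n not in scored]
        if not new:
            break
        for n in new:
            scored[n] = score(query, graph.nodes[n])
        # prune: only relevant nodes keep expanding in the next step
        active = [n for n in new if scored[n] > 0]
    # final prune: keep the best-scoring nodes that fit the budget
    kept, spent = [], 0
    for n in sorted(scored, key=lambda n: (-scored[n], n)):
        cost = graph.nodes[n].cost
        if scored[n] > 0 and spent + cost <= budget:
            kept.append(n)
            spent += cost
    return kept


def demo_graph():
    """Two pages, three elements, one semantic link across pages."""
    g = EvidenceGraph()
    g.add_node(Node("p1", "page", "page one"))
    g.add_node(Node("p2", "page", "page two"))
    g.add_node(Node("e1", "element", "revenue table for 2023"))
    g.add_node(Node("e2", "element", "company logo image"))
    g.add_node(Node("e3", "element", "chart of revenue growth in 2023"))
    g.add_edge("p1", "contains", "e1")
    g.add_edge("p1", "contains", "e2")
    g.add_edge("p2", "contains", "e3")
    g.add_edge("e1", "semantic_neighbor", "e3")
    return g
```

With the toy graph, `build_evidence(demo_graph(), "revenue 2023", ["p1"], budget=2)` returns `["e1", "e3"]`: the controller opens page `p1`, drops the irrelevant logo element, and reaches `e3` on another page through the semantic-neighbor edge. Lowering the budget to 1 keeps only `e1`.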
In practice
- Implement graph-based RAG for complex PDF QA.
- Design adaptive evidence controllers for budget management.
- Integrate page-level visual retrieval as a RAG starting point.
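As a rough illustration of the page-level entry point in the last item, the sketch below ranks pages by cosine similarity between a query embedding and per-page embeddings and returns the top k as seeds for later evidence expansion. The function names (`cosine`, `top_k_pages`) and the tiny hand-written vectors are hypothetical; in practice the embeddings would come from whatever visual page encoder the system uses, which this summary does not specify.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_pages(query_vec, page_vecs, k=2):
    """Return the ids of the k pages most similar to the query.

    page_vecs maps page id -> embedding (placeholder values here).
    Ties are broken by page id so the ranking is deterministic.
    """
    ranked = sorted(
        page_vecs,
        key=lambda pid: (-cosine(query_vec, page_vecs[pid]), pid),
    )
    return ranked[:k]


# Placeholder embeddings for three pages.
pages = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
seeds = top_k_pages([1.0, 0.0], pages, k=2)
```

Here `seeds` is `["p1", "p3"]`. The retrieved pages act only as starting points, so `k` can stay small and the evidence controller decides how far to expand from them.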
Topics
- Multimodal RAG
- Long-Document QA
- Evidence Graphs
- Large Vision-Language Models
- Information Retrieval
- Adaptive Retrieval
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.