MAGE-RAG: Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MAGE-RAG (Multigranular Adaptive Graph Evidence) is a framework for agentic multimodal RAG in long-document question answering. Existing RAG methods struggle to locate sparse evidence scattered across text, tables, images, charts, and complex layouts in long PDFs, and they often settle for a static trade-off between evidence coverage, noise, and inference cost. MAGE-RAG uses page retrieval as its entry point. Offline, it builds an evidence graph whose page and element nodes are linked by relations such as containment, reading order, and semantic neighbors. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets, then renders a compact, relevant evidence subgraph for the Large Vision-Language Model (LVLM). In the reported experiments, MAGE-RAG reaches 52.75 overall accuracy on LongDocURL and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc, which suggests a better balance between dispersed-evidence coverage and context-noise control.
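To make the offline graph concrete, here is a minimal Python sketch of a page/element evidence graph with the three relation types named above. It is an illustration under stated assumptions, not the authors' implementation: the class `EvidenceGraph`, the helpers `build_graph` and `add_semantic_edges`, the relation labels, and the token-overlap rule standing in for semantic similarity are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceGraph:
    """Toy evidence graph (hypothetical): page/element nodes, typed edges."""
    nodes: dict = field(default_factory=dict)  # node_id -> {"kind", "text"}
    edges: dict = field(default_factory=dict)  # node_id -> [(relation, neighbor_id)]

    def add_node(self, node_id, kind, text=""):
        self.nodes[node_id] = {"kind": kind, "text": text}
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, relation, dst):
        self.edges[src].append((relation, dst))

    def neighbors(self, node_id, relation):
        return [dst for rel, dst in self.edges[node_id] if rel == relation]


def build_graph(pages):
    """pages: list of pages, each a list of (kind, text) in reading order."""
    g = EvidenceGraph()
    for p, elements in enumerate(pages):
        page_id = f"page:{p}"
        g.add_node(page_id, "page")
        prev = None
        for i, (kind, text) in enumerate(elements):
            elem_id = f"{page_id}/elem:{i}"
            g.add_node(elem_id, kind, text)
            g.add_edge(page_id, "contains", elem_id)  # containment
            if prev is not None:
                g.add_edge(prev, "next", elem_id)     # reading order
            prev = elem_id
    return g


def add_semantic_edges(g, min_shared=1):
    """Assumption: shared lowercase tokens stand in for embedding similarity."""
    elems = [(nid, set(n["text"].lower().split()))
             for nid, n in g.nodes.items() if n["kind"] != "page"]
    for i, (a, tokens_a) in enumerate(elems):
        for b, tokens_b in elems[i + 1:]:
            if len(tokens_a & tokens_b) >= min_shared:
                g.add_edge(a, "semantic", b)
                g.add_edge(b, "semantic", a)


pages = [
    [("text", "Revenue grew in 2023."), ("table", "Year | Revenue")],
    [("chart", "Revenue by region")],
]
graph = build_graph(pages)
add_semantic_edges(graph)
```

A real system would presumably attach embeddings, layout coordinates, and rendered crops to each node; the sketch keeps only what is needed to show how containment, reading-order, and semantic-neighbor edges can coexist in one structure.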

Key takeaway

For machine learning engineers building multimodal RAG systems for long documents, MAGE-RAG is one way to manage context limits and noise. Consider an adaptive, graph-based evidence construction strategy that balances evidence coverage against inference cost at query time instead of fixing that trade-off in advance. The LVLM then consumes a compact, relevant evidence subgraph, which the reported results associate with higher accuracy on QA tasks that span text, tables, and images.

Key insights

MAGE-RAG uses an adaptive evidence graph and query-time control for efficient multimodal RAG in long documents.

Principles

Method

MAGE-RAG builds an offline evidence graph of page and element nodes connected by relations such as containment, reading order, and semantic neighbors. At query time, an online controller iteratively activates, opens, searches, and prunes evidence under explicit budgets to form a compact evidence subgraph for the LVLM.
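The activate, open, search, and prune cycle can be sketched as a small budgeted loop. This is a simplified illustration, not the paper's controller: the function `control_evidence`, its parameters (`page_budget`, `evidence_budget`, `max_rounds`), and the precomputed `relevance` scores are hypothetical, and a real controller would presumably score evidence against the query with a retriever or the LVLM itself.

```python
def control_evidence(page_scores, page_elements, neighbors, relevance,
                     page_budget=1, evidence_budget=2, max_rounds=2):
    """Return a budget-limited list of element ids (hypothetical sketch)."""

    def prune(candidates):
        # Keep the highest-relevance elements; break ties by id for determinism.
        ranked = sorted(candidates, key=lambda e: (-relevance[e], e))
        return ranked[:evidence_budget]

    # Activate: the top-scoring retrieved pages are the entry points.
    active = sorted(page_scores, key=page_scores.get, reverse=True)[:page_budget]

    # Open: expose the elements contained in each activated page.
    opened = [e for page in active for e in page_elements[page]]
    seen = set(opened)
    kept = prune(opened)

    for _ in range(max_rounds):
        # Search: follow graph edges from surviving evidence to unseen elements.
        found = sorted({n for e in kept for n in neighbors.get(e, ())} - seen)
        if not found:
            break
        seen.update(found)
        # Prune: re-rank old and new evidence under the same budget.
        kept = prune(kept + found)
    return kept


# Toy inputs; all ids and scores are made up for illustration.
page_scores = {"p1": 0.9, "p2": 0.4, "p3": 0.1}
page_elements = {"p1": ["p1.text", "p1.table"], "p2": ["p2.chart"], "p3": ["p3.text"]}
neighbors = {"p1.table": ["p2.chart"], "p2.chart": ["p3.text"]}
relevance = {"p1.text": 0.2, "p1.table": 0.8, "p2.chart": 0.7, "p3.text": 0.1}

evidence = control_evidence(page_scores, page_elements, neighbors, relevance)
print(evidence)  # ['p1.table', 'p2.chart']
```

In this toy run only page `p1` is activated, yet the chart on `p2` enters the final evidence because a graph edge links it to the retained table, while the low-relevance text elements are pruned. That is the coverage-versus-noise balance the budgets are meant to control.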

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.