MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, long

Summary

MARDoc, a Memory-Aware Refinement Agent framework, addresses limitations in existing iterative retrieval-reasoning agents for multimodal long-document question answering. Current systems often suffer from context noise and dilution due to a single growing context that mixes retrieval traces and reasoning. MARDoc decouples the QA process into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. This framework relies on a dynamically updated structured memory, rather than a full accumulated interaction history, to reduce context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench demonstrate that MARDoc achieves strong results, outperforming same-backbone baselines and validating the effectiveness of structured memory for agentic document QA.

Key takeaway

For machine learning engineers building multimodal long-document QA systems, MARDoc offers a way to mitigate context dilution and support multi-hop reasoning. Consider a decoupled agent architecture with structured memory for evidence distillation and iterative refinement. On MMLongBench-Doc and DocBench, the reported experiments show MARDoc outperforming same-backbone baselines.

Key insights

MARDoc uses a structured, dynamically updated memory to overcome context dilution in multimodal long-document QA agents.

Principles

Decouple retrieval, refinement, and reflection.
Structured memory reduces context noise.
Iterative assessment improves reasoning quality.

Method

MARDoc employs an Explore-Refine-Reflect loop. The Explorer retrieves, the Refiner distills traces into structured evidence and reasoning memories, and the Reflector assesses sufficiency and provides feedback.
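
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the StructuredMemory class, the three callable signatures, the max_rounds cap, and the toy agents are all hypothetical names introduced here. The point it shows is that only the distilled memory and the Reflector's feedback are carried into the next round, while the raw retrieval trace is dropped after refinement.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StructuredMemory:
    """Compact state carried between rounds instead of the full history (hypothetical)."""
    evidence: List[str] = field(default_factory=list)   # distilled answer-critical facts
    reasoning: List[str] = field(default_factory=list)  # notes on how the facts connect


# Hypothetical agent signatures; in a real system each would wrap a model call.
Explorer = Callable[[str, StructuredMemory, str], List[str]]             # -> raw trace
Refiner = Callable[[str, StructuredMemory, List[str]], StructuredMemory]
Reflector = Callable[[str, StructuredMemory], Tuple[bool, str]]          # -> (sufficient, feedback)


def explore_refine_reflect(
    question: str,
    explorer: Explorer,
    refiner: Refiner,
    reflector: Reflector,
    max_rounds: int = 4,  # assumed cap; the source does not state one
) -> StructuredMemory:
    memory = StructuredMemory()
    feedback = ""
    for _ in range(max_rounds):
        trace = explorer(question, memory, feedback)  # retrieve, guided by feedback
        memory = refiner(question, memory, trace)     # distill; raw trace is not kept
        sufficient, feedback = reflector(question, memory)
        if sufficient:
            break
    return memory


# Toy stand-ins so the sketch runs end to end without any model.
CORPUS = {"p1": "Revenue in 2023 was 10M.", "p2": "Revenue in 2024 was 12M."}


def toy_explorer(question, memory, feedback):
    target = feedback or "2023"  # first round has no feedback yet
    return [text for text in CORPUS.values() if target in text]


def toy_refiner(question, memory, trace):
    new = [t for t in trace if t not in memory.evidence]
    return StructuredMemory(
        evidence=memory.evidence + new,
        reasoning=memory.reasoning + [f"added {len(new)} fact(s)"],
    )


def toy_reflector(question, memory):
    if not any("2024" in e for e in memory.evidence):
        return False, "2024"  # targeted feedback: what is still missing
    return True, ""


if __name__ == "__main__":
    result = explore_refine_reflect(
        "How did revenue change from 2023 to 2024?",
        toy_explorer, toy_refiner, toy_reflector,
    )
    print(result.evidence)
```

In this toy run the Reflector rejects the first round because no 2024 figure is in memory, and its feedback steers the second retrieval. A real Refiner would also prune or merge entries so the memory stays small as rounds accumulate.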

In practice

Use MinerU2.5 for precise document parsing.
Generate visual descriptions with Qwen3-VL-235B-A22B-Instruct.
Implement multi-granularity toolsets for retrieval.
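
One way to picture a multi-granularity toolset is a coarse page-level search paired with finer block-level reads. The sketch below is a hypothetical illustration, not MARDoc's actual tool interface: the Block and DocumentTools names, the three tool methods, and the term-count scoring are assumptions made here, and the lexical scoring is a simple stand-in for whatever retriever a real system would use. It assumes the document has already been parsed into blocks and that visual blocks carry a generated text description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Block:
    block_id: str
    page: int
    kind: str      # e.g. "text", "table", "figure"
    content: str   # raw text, or a generated description for visual blocks


class DocumentTools:
    """Hypothetical toolset exposing coarse (page) and fine (block) access."""

    def __init__(self, blocks: List[Block]):
        self._blocks = list(blocks)
        self._by_id = {b.block_id: b for b in self._blocks}

    def search_pages(self, query: str, top_k: int = 3) -> List[int]:
        """Coarse: rank pages by how often query terms occur in their blocks."""
        terms = query.lower().split()
        scores: Dict[int, int] = {}
        for b in self._blocks:
            text = b.content.lower()
            hits = sum(text.count(t) for t in terms)
            if hits:
                scores[b.page] = scores.get(b.page, 0) + hits
        # Highest score first; ties broken by page number for determinism.
        return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]

    def read_page(self, page: int) -> List[Block]:
        """Medium: list every parsed block on one page."""
        return [b for b in self._blocks if b.page == page]

    def get_block(self, block_id: str) -> Block:
        """Fine: fetch a single block, such as one table or figure description."""
        return self._by_id[block_id]


if __name__ == "__main__":
    tools = DocumentTools([
        Block("b1", 1, "text", "Annual revenue summary"),
        Block("b2", 2, "table", "Revenue by region: EU 4M, US 8M"),
        Block("b3", 2, "figure", "Bar chart of revenue by region"),
    ])
    for page in tools.search_pages("revenue region"):
        print(page, [b.kind for b in tools.read_page(page)])
```

Splitting the tools this way lets an exploring agent locate candidate pages cheaply before paying the context cost of reading individual tables or figure descriptions.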

Topics

Multimodal QA
Long Document Processing
Agentic AI
Structured Memory
Retrieval-Augmented Generation
MMLongBench-Doc

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.