DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Summary
DocSeeker is a Multimodal Large Language Model (MLLM) built on Qwen-2.5-VL-7B that targets the performance degradation MLLMs show on long documents, in particular the low Signal-to-Noise Ratio (SNR) of long inputs and the scarcity of supervision. It follows an "Analysis, Localization and Reasoning" (ALR) visual reasoning paradigm, inspired by human cognition, in which the model explicitly grounds its reasoning in specific document pages. Training has two stages: Supervised Fine-Tuning (SFT) on high-quality ALR Chain-of-Thought (CoT) data distilled from Gemini-2.5-Flash, followed by Evidence-aware Group Relative Policy Optimization (EviGRPO), which jointly optimizes evidence localization and answer accuracy. An Evidence-Guided Resolution Allocation (EGRA) strategy eases memory constraints during training by assigning different image resolutions to different pages. In experiments, DocSeeker achieves 30-60% performance gains on five document VQA benchmarks, generalizes robustly to ultra-long documents, and complements visual Retrieval-Augmented Generation (RAG) systems.
Key takeaway
For AI engineers building MLLMs for long document understanding, DocSeeker's ALR paradigm is a concrete way to improve accuracy and generalization. Consider structured reasoning workflows with explicit evidence grounding, trained in two stages that combine supervised fine-tuning with reinforcement learning. Paired with a resolution allocation strategy, this approach is reported to improve performance on ultra-long documents and to work well alongside RAG systems.
Key insights
DocSeeker enhances long document MLLM performance via structured reasoning, two-stage training, and resolution allocation.
Principles
- Explicit evidence grounding improves interpretability and reduces noise.
- Combining SFT with RL refines reasoning and localization capabilities.
- Differentiated resolution allocation optimizes memory and signal-to-noise ratio.
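One way to make evidence grounding concrete is to have the model emit its analysis, the pages it relies on, and its answer as separate, machine-checkable fields. The sketch below is illustrative only: the tag names (`<analysis>`, `<pages>`, `<answer>`) and the `parse_alr` helper are assumptions made here, not the output format DocSeeker actually uses.

```python
import re
from dataclasses import dataclass


@dataclass
class ALRResponse:
    analysis: str
    pages: list[int]
    answer: str


def _field(tag: str, text: str) -> str:
    """Return the stripped content of <tag>...</tag>, or '' if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_alr(text: str) -> ALRResponse:
    """Split a response into analysis, cited pages, and answer (hypothetical tags)."""
    pages = sorted({int(p) for p in re.findall(r"\d+", _field("pages", text))})
    return ALRResponse(_field("analysis", text), pages, _field("answer", text))


# Made-up response text, used only to exercise the parser.
response = (
    "<analysis>The question asks for 2023 revenue.</analysis>"
    "<pages>12, 47</pages>"
    "<answer>$4.2M</answer>"
)
parsed = parse_alr(response)
print(parsed.pages)   # [12, 47]
print(parsed.answer)  # $4.2M
```

Keeping the cited pages in their own field lets them be compared against labeled evidence pages, independently of whether the final answer is correct.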
Method
DocSeeker is trained in two stages: SFT on distilled ALR CoT data, then EviGRPO to jointly optimize evidence localization and answer accuracy. During training, EGRA keeps memory use manageable by varying the resolution of individual pages.
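To illustrate what joint optimization of localization and answer accuracy can look like, the sketch below scores each sampled response on both criteria and then normalizes the scores within the group, which is the group-relative step of GRPO. The reward shape is an assumption: the page-level F1 term, the exact-match answer check, and the 0.5/0.5 weights are placeholders chosen here, not the reward defined by EviGRPO.

```python
from statistics import mean, pstdev


def page_f1(predicted: set, gold: set) -> float:
    """F1 overlap between predicted and gold evidence pages."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def reward(pred_pages, gold_pages, pred_answer, gold_answer,
           w_evidence=0.5, w_answer=0.5):
    """Weighted sum of a localization term and an answer term (assumed weights)."""
    answer_ok = float(pred_answer.strip().lower() == gold_answer.strip().lower())
    return w_evidence * page_f1(set(pred_pages), set(gold_pages)) + w_answer * answer_ok


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three made-up responses to one question: (predicted pages, predicted answer).
gold_pages, gold_answer = [12, 47], "4.2M"
samples = [([12, 47], "4.2M"), ([12], "4.2M"), ([3], "1.0M")]
rewards = [reward(pages, gold_pages, ans, gold_answer) for pages, ans in samples]
advantages = group_relative_advantages(rewards)
```

In this toy group, the response that finds both evidence pages and the right answer gets the largest advantage, and the one that finds only one of the two pages still ranks above the one that misses everything, so partially correct localization is rewarded rather than ignored.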
In practice
- Implement page-aware input representation for visual tokens.
- Use a multi-faceted reward function for evidence localization and answer accuracy.
- Apply differential resolution for evidence vs. non-evidence pages in long documents.
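The last point can be sketched as a per-page pixel budget. This is a minimal illustration that assumes the evidence pages are known from training labels; `allocate_pixels`, the low and high caps, and the even split of the leftover budget are all hypothetical choices made here, not the EGRA procedure itself.

```python
def allocate_pixels(num_pages, evidence_pages, total_budget,
                    low_res=200_000, high_res=1_000_000):
    """Assign a per-page pixel cap, giving evidence pages more within a total budget.

    Pages are 1-indexed. All numeric defaults are illustrative assumptions.
    """
    evidence = {p for p in evidence_pages if 1 <= p <= num_pages}
    # Every page starts at the low-resolution floor.
    caps = {p: low_res for p in range(1, num_pages + 1)}
    remaining = total_budget - low_res * num_pages
    if remaining < 0:
        raise ValueError("budget too small for the low-resolution floor")
    if evidence:
        # Split the leftover budget evenly over evidence pages, capped at high_res.
        extra = min(high_res - low_res, remaining // len(evidence))
        for p in evidence:
            caps[p] += extra
    return caps


caps = allocate_pixels(num_pages=10, evidence_pages=[3, 7], total_budget=3_000_000)
print(caps[3], caps[1])  # 700000 200000
```

The resulting caps would then be applied when each page image is resized before encoding. A page label such as "Page 3" placed next to each image in the prompt is one simple way to make the input page-aware, so that the model can cite pages by number.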
Topics
- DocSeeker
- Long Document Understanding
- Multimodal Large Language Models
- Analysis-Localization-Reasoning
- Evidence Grounding
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.