SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding
Summary
SlideAgent is a hierarchical agentic framework for understanding complex multi-page visual documents, particularly slide decks. Developed by researchers from Georgia Institute of Technology and J.P. Morgan AI Research, it targets fine-grained reasoning over visual elements and across pages. SlideAgent uses specialized agents and decomposes reasoning into three levels (global, page, and element) to build a structured, query-agnostic knowledge base. During inference, it selectively activates the relevant agents for multi-level reasoning and integrates their outputs into a coherent answer. In the reported experiments, SlideAgent outperforms proprietary models such as GPT-4o by 7.9 points and open-source models such as InternVL3-8B by 9.8 points. The gains hold across domains and query types, including a 9.8-point improvement in multi-hop reasoning and a 7.7-point improvement in visual/layout reasoning.
Key takeaway
AI architects designing systems for multi-page visual document understanding should consider a hierarchical agentic framework like SlideAgent. Decomposing reasoning into global, page, and element levels improves accuracy and addresses the limitations of traditional MLLMs in fine-grained and domain-specific visual semantics. The structure yields substantial gains in multi-hop reasoning and visual/layout question answering, and it makes systems more robust and interpretable on complex documents such as financial reports or technical presentations.
Key insights
Hierarchical agentic reasoning across global, page, and element levels significantly enhances multi-page visual document understanding.
Principles
- Decompose document understanding into global, page, and element levels.
- Construct query-agnostic knowledge for efficient, context-aware retrieval.
- Integrate external tools for explicit spatial and structural information.
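One way to picture the three levels is as a nested data structure that is built once per deck and reused for every question. The sketch below is illustrative only: the class names (`Element`, `Page`, `DeckKnowledge`), the fields, and the bounding-box format are assumptions made for this example, not SlideAgent's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One parsed region of a page, e.g. a title, chart, or table (hypothetical schema)."""
    kind: str
    text: str
    # (x0, y0, x1, y1); assumed to be supplied by an external layout parser.
    bbox: tuple[float, float, float, float]


@dataclass
class Page:
    number: int
    summary: str
    elements: list[Element] = field(default_factory=list)


@dataclass
class DeckKnowledge:
    """Query-agnostic knowledge for one deck: built once, reused for every question."""
    global_summary: str
    pages: list[Page] = field(default_factory=list)

    def entries(self):
        """Flatten the hierarchy into (level, page_number, text) tuples for retrieval."""
        yield ("global", None, self.global_summary)
        for page in self.pages:
            yield ("page", page.number, page.summary)
            for element in page.elements:
                yield ("element", page.number, f"{element.kind}: {element.text}")


deck = DeckKnowledge(
    global_summary="Quarterly results deck",
    pages=[
        Page(1, "Revenue overview", [Element("chart", "Revenue by region", (0.1, 0.2, 0.9, 0.8))]),
        Page(2, "Outlook"),
    ],
)
levels = [level for level, _, _ in deck.entries()]
```

Keeping the bounding box on each element is what lets a later step reason about layout explicitly instead of relying on the model to infer positions from pixels.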
Method
SlideAgent builds a hierarchical, query-agnostic knowledge base in a "Knowledge Construction" stage, then uses multi-level retrieval and specialized agents for "Retrieval and Question-Answering."
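The two stages can be sketched as a one-time construction step followed by per-query retrieval at each level. The example below is a minimal sketch under stated assumptions: the `(level, page, text)` entry format and the `retrieve` helper are hypothetical, and simple token overlap stands in for whatever retriever the actual system uses.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, entries, top_k=1):
    """Return up to top_k best-matching entries per level, ranked by shared tokens.

    Token overlap is a stand-in scorer chosen to keep the sketch dependency-free.
    """
    query_tokens = tokens(query)
    by_level = {}
    for level, page, text in entries:
        score = len(query_tokens & tokens(text))
        if score > 0:
            by_level.setdefault(level, []).append((score, page, text))
    return {
        level: [(page, text) for _, page, text in sorted(hits, key=lambda h: -h[0])[:top_k]]
        for level, hits in by_level.items()
    }


# Stage 1 (knowledge construction) would produce entries like these once per deck.
ENTRIES = [
    ("global", None, "Annual report on revenue and costs"),
    ("page", 3, "Revenue grew in every region"),
    ("page", 7, "Hiring plan for next year"),
    ("element", 3, "chart: revenue by region"),
]

# Stage 2 (retrieval and question answering) runs for each incoming query.
hits = retrieve("Which region had the highest revenue?", ENTRIES)
```

Because the entries do not depend on the query, the expensive construction step is paid once per document, and only the lightweight retrieval runs per question.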
In practice
- Implement layout parsing to decompose pages into fine-grained elements.
- Classify queries to activate specific agents for targeted reasoning.
- Generate subqueries to improve retrieval accuracy for complex queries.
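The query-routing and subquery steps above can be sketched with simple heuristics. Everything here is hypothetical: the cue-word sets, the rule that the page agent is always active, and the split on "and" are assumptions for illustration, and a real system would more likely ask a language model to classify and decompose the query.

```python
import re

# Hypothetical cue words, chosen only for this example.
ELEMENT_CUES = {"chart", "table", "figure", "logo", "color", "axis", "left", "right", "top", "bottom"}
GLOBAL_CUES = {"overall", "deck", "document", "theme", "summary", "presentation"}


def classify(query):
    """Pick which agents to activate for a query."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    agents = {"page"}  # assumption: page-level context is always useful
    if words & ELEMENT_CUES:
        agents.add("element")
    if words & GLOBAL_CUES:
        agents.add("global")
    return agents


def subqueries(query):
    """Split a compound question on ' and ' so each part can be retrieved separately."""
    parts = re.split(r"\s+and\s+", query.strip().rstrip("?"))
    return [part.strip() + "?" for part in parts if part.strip()]


chart_agents = classify("What does the chart on page 3 show?")
theme_agents = classify("What is the overall theme of the deck?")
parts = subqueries("What was revenue in 2023 and how did it change in 2024?")
```

Routing before retrieval keeps element-level reasoning, which is the most expensive, from running on questions that only need a page or deck summary.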
Topics
- SlideAgent
- Hierarchical Agents
- Multi-page Document Understanding
- Visual Question Answering
- Multimodal LLMs
- Fine-grained Reasoning
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.