SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Document Understanding · Depth: Expert, extended

Summary

SlideAgent is a hierarchical agentic framework for understanding complex multi-page visual documents, particularly slide decks. Developed by researchers from Georgia Institute of Technology and J.P. Morgan AI Research, it targets fine-grained reasoning over visual elements and across pages. SlideAgent uses specialized agents and decomposes reasoning into three levels (global, page, and element) to build a structured, query-agnostic knowledge base. During inference, it activates only the agents relevant to the query and integrates their outputs into a single coherent answer. In the reported experiments, SlideAgent gains +7.9 points over proprietary models such as GPT-4o and +9.8 points over open-source models such as InternVL3-8B. The gains hold across domains and query types, including a 9.8-point improvement in multi-hop reasoning and a 7.7-point improvement in visual/layout reasoning.

Key takeaway

AI architects designing systems for multi-page visual document understanding should consider a hierarchical agentic framework such as SlideAgent. Decomposing reasoning into global, page, and element levels improves accuracy where traditional MLLMs fall short: fine-grained and domain-specific visual semantics. The structure can yield substantial gains in multi-hop reasoning and visual/layout question answering, and it makes systems more robust and interpretable on complex documents such as financial reports or technical presentations.

Key insights

Hierarchical agentic reasoning across global, page, and element levels significantly enhances multi-page visual document understanding.

Method

SlideAgent builds a hierarchical, query-agnostic knowledge base in a "Knowledge Construction" stage, then uses multi-level retrieval and specialized agents for "Retrieval and Question-Answering."
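The two stages can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the class names (`DeckKnowledge`, `PageNote`, `ElementNote`) and helper functions are hypothetical, the knowledge base is filled in by hand rather than generated by agents, and plain keyword overlap stands in for whatever retrieval SlideAgent actually uses. It shows only the shape of the idea: a query-agnostic store with global, page, and element levels, and a query-time step that activates just the levels with relevant hits.

```python
from dataclasses import dataclass, field


@dataclass
class ElementNote:            # hypothetical element-level entry
    element_id: str
    kind: str                 # e.g. "chart", "table", "text"
    description: str


@dataclass
class PageNote:               # hypothetical page-level entry
    page_number: int
    summary: str
    elements: list = field(default_factory=list)


@dataclass
class DeckKnowledge:          # hypothetical query-agnostic knowledge base
    global_summary: str
    pages: list = field(default_factory=list)


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!:;").lower() for w in text.split()} - {""}


def overlap(query, text):
    """Toy relevance score: number of shared words (an assumption)."""
    return len(tokenize(query) & tokenize(text))


def top_matches(items, text_of, query, top_k):
    """Return up to top_k items with a non-zero score, best first."""
    scored = [(overlap(query, text_of(item)), item) for item in items]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:top_k]]


def retrieve(kb, query, top_k=1):
    """Multi-level retrieval: collect hits separately for each level."""
    all_elements = [e for page in kb.pages for e in page.elements]
    return {
        "global": [kb.global_summary] if overlap(query, kb.global_summary) else [],
        "page": top_matches(kb.pages, lambda p: p.summary, query, top_k),
        "element": top_matches(all_elements, lambda e: e.description, query, top_k),
    }


def active_levels(hits):
    """Only levels with at least one hit would have their agent invoked."""
    return [level for level, items in hits.items() if items]


# Knowledge construction happens once, before any query is seen.
kb = DeckKnowledge(
    global_summary="Quarterly earnings presentation for investors",
    pages=[
        PageNote(1, "Title and agenda"),
        PageNote(2, "Revenue by quarter", [
            ElementNote("p2-e1", "chart", "Bar chart of revenue for Q1 to Q4"),
        ]),
    ],
)

query = "Which quarter had the highest revenue in the bar chart?"
hits = retrieve(kb, query)
levels = active_levels(hits)
```

In this toy example the query shares no words with the global summary, so only the page and element levels would be activated. In a real system each active level would hand its hits to a dedicated agent, and a final step would merge the agents' outputs into one answer.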

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.