UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards
Summary
UniDoc-RL is a reinforcement learning framework that improves Retrieval-Augmented Generation (RAG) for Large Vision-Language Models (LVLMs), targeting their weak handling of fine-grained visual semantics. It unifies retrieval, reranking, active visual perception, and reasoning within a single LVLM agent, and formulates visual information acquisition as a sequential decision-making problem over a hierarchical action space. The agent refines visual evidence progressively: coarse document retrieval first, then fine-grained image selection, then active region cropping. A dense multi-reward scheme provides task-aware supervision, and Group Relative Policy Optimization (GRPO) enables end-to-end training without a separate value network. The system was trained on a curated dataset of high-quality reasoning trajectories with fine-grained action annotations, and it achieved gains of up to 17.7% over prior RL-based methods across three benchmarks.
Key takeaway
For research scientists developing visual RAG systems, UniDoc-RL demonstrates that combining hierarchical actions with dense, stage-specific rewards significantly improves performance. Consider adopting a coarse-to-fine visual processing strategy that includes precise selection and active visual perception actions. Coupled with a multi-reward optimization scheme, this approach can yield substantial gains in reasoning accuracy and visual utilization, especially on complex, visually rich documents.
Key insights
UniDoc-RL improves visual RAG by integrating hierarchical actions and dense multi-rewards for precise visual information acquisition.
Principles
- Hierarchical actions refine visual evidence from coarse to fine.
- Dense multi-rewards provide stage-specific feedback for optimization.
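As a rough illustration of the second principle, the sketch below folds per-stage scores into one scalar reward per trajectory. The stage names, the `StageRewards` type, and the weights are all hypothetical and chosen only for illustration; the summary does not specify how UniDoc-RL combines its rewards.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class StageRewards:
    """Per-stage scores in [0, 1] for one trajectory (hypothetical field names)."""
    retrieval: float   # did the coarse search surface relevant documents?
    selection: float   # did the agent select the relevant image(s)?
    perception: float  # did the crop overlap the evidence region?
    answer: float      # was the final answer correct?


# Illustrative weights only; these are assumptions, not values from the paper.
DEFAULT_WEIGHTS = {
    "retrieval": 0.2,
    "selection": 0.2,
    "perception": 0.2,
    "answer": 0.4,
}


def dense_reward(r: StageRewards,
                 weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of stage scores, so every stage contributes feedback."""
    return sum(w * getattr(r, name) for name, w in weights.items())
```

With a reward of this shape, a trajectory that retrieves and selects well but answers incorrectly still receives partial credit, which is the kind of stage-specific signal a sparse answer-only reward cannot provide.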
Method
UniDoc-RL uses a "Search-Select-Perceive" action space. External tools handle coarse retrieval, and the LVLM then performs precise selection and active visual perception (crop/zoom) to extract fine-grained evidence. The policy is optimized via GRPO with a dense multi-reward system.
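GRPO avoids a value network by scoring each sampled trajectory against the other trajectories drawn for the same prompt. A minimal sketch of that group-relative advantage step follows; the function name and the `eps` default are assumptions for illustration, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group.

    `rewards` holds one scalar reward per trajectory sampled for the same
    query. The group mean plays the role of the baseline that a learned
    value network would otherwise provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Trajectories that beat their group average get positive advantages and are reinforced; when every trajectory in a group earns the same reward, all advantages are zero and that group contributes no gradient.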
In practice
- Implement hierarchical visual search to filter irrelevant content.
- Utilize IoU-based rewards for precise visual perception actions.
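An IoU-based perception reward can be sketched as the overlap between the region the agent crops and an annotated evidence region. The box format `(x1, y1, x2, y2)` and the function name are assumptions for illustration; the summary does not describe how UniDoc-RL represents regions or scales this reward.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def box_iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Width and height of the overlap rectangle, clamped to zero if disjoint.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    area_pred = (px2 - px1) * (py2 - py1)
    area_target = (tx2 - tx1) * (ty2 - ty1)
    union = area_pred + area_target - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0
```

Used as a reward, this gives a crop action graded feedback: a crop that partially covers the evidence region scores higher than one that misses it entirely, rather than both receiving zero.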
Topics
- Visual RAG
- Reinforcement Learning
- Hierarchical Action Space
- Dense Multi-Reward System
- Large Vision-Language Models
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.