UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation) · Depth: Expert, extended

Summary

UniDoc-RL is a reinforcement learning framework that improves Retrieval-Augmented Generation (RAG) for Large Vision-Language Models (LVLMs) by addressing limitations in fine-grained visual semantics. It unifies retrieval, reranking, active visual perception, and reasoning within a single LVLM agent, and formulates visual information acquisition as a sequential decision-making problem over a hierarchical action space. The agent refines visual evidence progressively: coarse document retrieval first, then fine-grained image selection, then active region cropping. Training uses a dense multi-reward scheme for task-aware supervision and Group Relative Policy Optimization (GRPO), which allows end-to-end training without a separate value network. The system was trained on a curated dataset of high-quality reasoning trajectories with fine-grained action annotations, and it achieved gains of up to 17.7% over prior RL-based methods on three benchmarks.
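To make the "no separate value network" point concrete, here is a minimal sketch of the group-relative advantage used in GRPO-style training: several rollouts are sampled for the same query, and each rollout's reward is normalized against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    The group mean plays the role of the baseline that a learned value
    network would otherwise provide.
    Assumptions (not from the source): population std and the eps value.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for four rollouts of the same query
group = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(group)
```

Rollouts that score above the group average receive a positive advantage and those below receive a negative one, so the policy update needs only relative quality within the group.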

Key takeaway

For research scientists developing visual RAG systems, UniDoc-RL demonstrates that integrating hierarchical actions with dense, stage-specific rewards significantly improves performance. Consider adopting a coarse-to-fine visual processing strategy that includes precise selection and active visual perception actions. Coupled with a multi-reward optimization scheme, this approach can yield substantial gains in reasoning accuracy and visual utilization, especially on complex, visually rich documents.
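One way to picture dense, stage-specific rewards is as a weighted sum of per-stage scores rather than a single end-of-episode signal. The sketch below is only an illustration: the four component names, their [0, 1] range, and the weights are hypothetical and are not the reward definitions used in UniDoc-RL.

```python
from dataclasses import dataclass


@dataclass
class StageRewards:
    # All four components are hypothetical placeholders, each in [0, 1]
    retrieval: float   # e.g. whether search surfaced a relevant page
    selection: float   # e.g. whether the chosen image holds the evidence
    perception: float  # e.g. how well the crop covers the evidence region
    answer: float      # e.g. correctness of the final answer


# Illustrative weights only; the source does not report these values
DEFAULT_WEIGHTS = {
    "retrieval": 0.2,
    "selection": 0.2,
    "perception": 0.2,
    "answer": 0.4,
}


def combine_rewards(rewards, weights=None):
    """Collapse per-stage scores into one scalar reward for a rollout."""
    w = DEFAULT_WEIGHTS if weights is None else weights
    return (
        w["retrieval"] * rewards.retrieval
        + w["selection"] * rewards.selection
        + w["perception"] * rewards.perception
        + w["answer"] * rewards.answer
    )


# A rollout that retrieved well, selected partly right, skipped cropping,
# and still answered correctly
total = combine_rewards(StageRewards(1.0, 0.5, 0.0, 1.0))
```

Because every stage contributes its own term, a rollout that retrieves and selects well but answers incorrectly still receives a learning signal for the earlier decisions.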

Key insights

UniDoc-RL improves visual RAG by integrating hierarchical actions and dense multi-rewards for precise visual information acquisition.

Principles

Method

UniDoc-RL employs a "Search-Select-Perceive" action space. External tools handle coarse retrieval (Search), the LVLM performs precise selection among the retrieved images (Select), and active visual perception (crop/zoom) extracts fine-grained evidence (Perceive). The policy is optimized via GRPO with a dense multi-reward system.
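The coarse-to-fine flow of a Search-Select-Perceive action space can be sketched as a small state machine. Everything here is an assumption made for illustration: the class names, the `State` fields, the `retriever` callable, and the representation of an image as a nested list of pixel values. The actual system has an LVLM agent emit these actions and calls external retrieval tools.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple, Union

Image = List[List[int]]  # toy stand-in for a page image (rows of pixels)


@dataclass
class Search:
    query: str  # coarse step: ask an external retriever for candidates


@dataclass
class Select:
    image_index: int  # finer step: pick one retrieved image


@dataclass
class Perceive:
    box: Tuple[int, int, int, int]  # finest step: (left, top, right, bottom)


Action = Union[Search, Select, Perceive]


@dataclass
class State:
    candidates: List[Image] = field(default_factory=list)
    selected: Optional[Image] = None
    evidence: Optional[Image] = None


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the region inside box, clamped to the image bounds."""
    left, top, right, bottom = box
    height = len(image)
    width = len(image[0]) if image else 0
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def step(state: State, action: Action,
         retriever: Callable[[str], List[Image]]) -> State:
    """Apply one hierarchical action and return the updated state."""
    if isinstance(action, Search):
        state.candidates = retriever(action.query)
        state.selected = None
        state.evidence = None
    elif isinstance(action, Select):
        state.selected = state.candidates[action.image_index]
        state.evidence = state.selected
    elif isinstance(action, Perceive):
        if state.selected is None:
            raise ValueError("Perceive requires a selected image")
        state.evidence = crop(state.selected, action.box)
    return state


def toy_retriever(query: str) -> List[Image]:
    # Hypothetical retriever: always returns one 4x4 image with pixels 0..15
    return [[[r * 4 + c for c in range(4)] for r in range(4)]]


state = State()
for action in (Search("revenue table"), Select(0), Perceive((1, 1, 3, 3))):
    state = step(state, action, toy_retriever)
```

Each action narrows the evidence passed to the reasoning step: a list of candidates, then one image, then one region of that image.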

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.