UniDoc-RL: Coarse-to-Fine Visual RAG with Hierarchical Actions and Dense Rewards

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

UniDoc-RL is a reinforcement learning framework that improves Retrieval-Augmented Generation (RAG) for Large Vision-Language Models (LVLMs) by making better use of external visual knowledge. Existing visual RAG systems often miss fine-grained visual semantics; UniDoc-RL addresses this by formulating visual information acquisition as a sequential decision-making problem. Its hierarchical action space refines visual evidence progressively, from coarse document retrieval to fine-grained image selection and active region cropping, so the model can focus on information-dense regions while suppressing irrelevant content. Training is end-to-end with a dense multi-reward scheme and Group Relative Policy Optimization (GRPO), which aligns the policy with multiple objectives without a separate value network. On three benchmarks, UniDoc-RL outperforms state-of-the-art baselines, with gains of up to 17.7% over previous RL-based methods.
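To make the coarse-to-fine idea concrete, here is a minimal Python sketch of a three-level action space. It is an illustration under stated assumptions, not the paper's implementation: the action names (`Retrieve`, `SelectImage`, `CropRegion`), the dictionary standing in for a retriever, and the toy nested-list images are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Image = List[List[int]]  # toy grayscale image: rows of pixel values

@dataclass
class Retrieve:          # coarse: fetch candidate pages for a query
    query: str

@dataclass
class SelectImage:       # finer: keep one of the retrieved pages
    index: int

@dataclass
class CropRegion:        # finest: zoom into a pixel box (x0, y0, x1, y1)
    box: Tuple[int, int, int, int]

Action = Union[Retrieve, SelectImage, CropRegion]

def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the sub-image covering rows y0..y1-1 and columns x0..x1-1."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def refine(corpus: Dict[str, List[Image]], actions: List[Action]) -> Image:
    """Apply actions in order, narrowing the visual evidence at each step.

    `corpus` is a hypothetical stand-in for a retriever: query -> pages.
    """
    candidates: List[Image] = []
    current: Image = []
    for action in actions:
        if isinstance(action, Retrieve):
            candidates = corpus.get(action.query, [])
        elif isinstance(action, SelectImage):
            current = candidates[action.index]
        elif isinstance(action, CropRegion):
            current = crop(current, action.box)
    return current
```

In a trained agent the policy would emit these actions itself; here a fixed list such as `[Retrieve("q"), SelectImage(0), CropRegion((1, 0, 3, 2))]` simply shows how each step narrows the evidence passed to the next.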

Key takeaway

For research scientists building visual RAG systems, UniDoc-RL shows that treating visual information acquisition as a learned, coarse-to-fine decision process pays off: it reports gains of up to 17.7% over previous RL-based methods. Consider adding a hierarchical action space and a dense multi-reward scheme to your LVLM agents to improve fine-grained visual reasoning and to retrieve more precise, context-relevant evidence.

Key insights

UniDoc-RL enhances visual RAG by using hierarchical actions and dense rewards for fine-grained visual information acquisition.

Method

UniDoc-RL formulates visual information acquisition as a sequential decision-making problem over a hierarchical action space, and trains the policy end-to-end with a dense multi-reward scheme and Group Relative Policy Optimization (GRPO).
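The reason GRPO needs no separate value network is that it scores each sampled response against the other responses drawn for the same prompt. The sketch below shows that group-relative normalization, plus one plausible way to fold several reward signals into a scalar. The reward component names, the weights, and the use of population standard deviation are assumptions for illustration; the summary does not specify how UniDoc-RL defines or combines its rewards.

```python
from statistics import mean, pstdev
from typing import Dict, List

def total_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of reward components (names and weights are hypothetical)."""
    return sum(weights[name] * value for name, value in components.items())

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own sampling group.

    The group mean acts as the baseline, so no learned critic is required.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one query, each scored on two hypothetical components.
weights = {"answer": 0.6, "retrieval": 0.4}
group = [
    {"answer": 1.0, "retrieval": 1.0},
    {"answer": 1.0, "retrieval": 0.0},
    {"answer": 0.0, "retrieval": 1.0},
    {"answer": 0.0, "retrieval": 0.0},
]
rewards = [total_reward(c, weights) for c in group]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average receive positive advantages and the rest negative ones; if every rollout in a group scores the same, all advantages are zero and that group contributes no gradient signal.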

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.