Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

The "Decompose, Look, and Reason" (DLR) framework is a reinforced latent reasoning approach for Vision-Language Models (VLMs). It targets two limitations in complex visual reasoning: the loss of visual information in textual Chain-of-Thought (CoT), and the constraints of patch-based or tool-calling methods. DLR dynamically breaks a query down into textual premises, extracts premise-conditioned continuous visual latents, and generates answers through grounded rationales. Training follows a three-stage pipeline: pretraining for cross-modal alignment, supervised finetuning (SFT) for structured reasoning, and reinforcement learning with a novel Spherical Gaussian Latent Policy (SGLP) for effective latent space exploration. Experiments on benchmarks such as V* Bench, MathVista, MMMU-Pro, and MMStar show that DLR consistently outperforms strong baselines, including text-only, interleaved multimodal CoT, and other latent reasoning methods, while offering superior stepwise interpretability.

Key takeaway

For research scientists developing Vision-Language Models, DLR offers a framework for complex visual reasoning. Its dynamic query decomposition and premise-conditioned latent visual grounding, trained through a three-stage pipeline that ends with reinforcement learning under a Spherical Gaussian Latent Policy, are reported to outperform text-only, interleaved multimodal CoT, and other latent reasoning baselines. Consider adopting these principles when you need more accurate and interpretable multi-step multimodal reasoning, especially for tasks that require fine-grained visual detail or mathematical reasoning in visual contexts.

Key insights

DLR enhances VLM visual reasoning by dynamically decomposing queries and extracting premise-conditioned continuous visual latents.
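The decompose-then-look loop can be illustrated with a small toy sketch. This is not the paper's architecture: it assumes, purely for illustration, that each premise is already embedded as a vector, that the image is a list of patch vectors, and that a "premise-conditioned continuous visual latent" is obtained by attention-pooling the patches with the premise as the query. The function names `premise_conditioned_latent` and `decompose_look_reason` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def premise_conditioned_latent(premise_vec, patch_vecs):
    """Attention-pool patch vectors, using the premise vector as the query.

    Assumption: scaled dot-product attention stands in for whatever
    mechanism the actual model uses to extract its visual latents.
    """
    dim = len(premise_vec)
    scores = [
        sum(p * q for p, q in zip(patch, premise_vec)) / math.sqrt(dim)
        for patch in patch_vecs
    ]
    weights = softmax(scores)
    # The latent is a continuous, weighted mix of patches (no hard cropping).
    latent = [
        sum(w * patch[i] for w, patch in zip(weights, patch_vecs))
        for i in range(dim)
    ]
    return latent, weights


def decompose_look_reason(premises, patch_vecs):
    """Build a stepwise trace: one visual latent per textual premise.

    `premises` is a list of (text, embedding) pairs; in a real system the
    model itself would produce them from the query (hypothetical here).
    """
    trace = []
    for text, vec in premises:
        latent, weights = premise_conditioned_latent(vec, patch_vecs)
        top_patch = max(range(len(weights)), key=weights.__getitem__)
        trace.append({"premise": text, "latent": latent, "top_patch": top_patch})
    return trace
```

Because every step records which premise it served and which patch it attended to most, a trace of this kind is one simple way to picture the stepwise interpretability the summary describes.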

Method

DLR uses a three-stage pipeline: pretraining for cross-modal alignment, SFT for structured reasoning, and RL with Spherical Gaussian Latent Policy (SGLP) for latent exploration, guided by outcome and focus rewards.
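To make the RL stage more concrete, here is a minimal sketch of how exploration over continuous latents could work. The summary does not define SGLP, so this rests on an assumption: "spherical Gaussian" is read in its usual sense of an isotropic Gaussian (covariance sigma^2 * I) centered on the latent the model predicts. The reward combination, the `focus_weight` parameter, and all function names are hypothetical and are not taken from the paper.

```python
import math
import random


def sample_latent(mean, sigma, rng):
    """Draw one latent from N(mean, sigma^2 * I) to explore around the mean."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def log_prob(latent, mean, sigma):
    """Log-density of an isotropic Gaussian, usable in a policy-gradient loss."""
    dim = len(mean)
    sq_dist = sum((z - m) ** 2 for z, m in zip(latent, mean))
    return (
        -sq_dist / (2.0 * sigma ** 2)
        - dim * math.log(sigma)
        - 0.5 * dim * math.log(2.0 * math.pi)
    )


def total_reward(outcome, focus, focus_weight=0.5):
    """Combine outcome and focus rewards (weighted sum is an assumption)."""
    return outcome + focus_weight * focus


def reinforce_loss(latent, mean, sigma, advantage):
    """REINFORCE-style surrogate: raise log-prob of latents with high advantage."""
    return -advantage * log_prob(latent, mean, sigma)
```

A training loop would sample a latent per reasoning step, score the rollout with `total_reward`, subtract a baseline to get an advantage, and minimize `reinforce_loss` with respect to the parameters that produce `mean`. A single shared `sigma` keeps the log-density cheap to compute, which is one plausible motivation for an isotropic form.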

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.