Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs

2026-04-10 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

"Decompose, Look, and Reason" (DLR) is a reinforced latent reasoning framework for Vision-Language Models (VLMs). It targets two weaknesses in complex visual reasoning: textual Chain-of-Thought (CoT) loses visual information, and patch-based or tool-calling methods impose constraints of their own. DLR dynamically decomposes a query into textual premises, extracts continuous visual latents conditioned on each premise, and derives the answer through grounded rationales. Training proceeds in three stages: pretraining for cross-modal alignment, supervised fine-tuning (SFT) for structured reasoning, and reinforcement learning with a novel Spherical Gaussian Latent Policy (SGLP) for effective latent space exploration. Experiments on V* Bench, MathVista, MMMU-Pro, and MMStar show that DLR consistently outperforms strong baselines, including text-only CoT, interleaved multimodal CoT, and other latent reasoning methods, while offering better stepwise interpretability.

Key takeaway

For research scientists developing Vision-Language Models, DLR offers a framework for strengthening complex visual reasoning. Its dynamic query decomposition and premise-conditioned latent visual grounding, trained through a three-stage pipeline that ends with the Spherical Gaussian Latent Policy, improve on the text-only, interleaved multimodal CoT, and latent reasoning baselines it was compared against. Consider adopting DLR's principles when you need more accurate and interpretable multi-step multimodal reasoning, especially for tasks that depend on fine-grained visual detail or on mathematical reasoning in visual contexts.

Key insights

DLR enhances VLM visual reasoning by dynamically decomposing queries and extracting premise-conditioned continuous visual latents.

Principles

Dynamic decomposition improves visual grounding.
Continuous latent embeddings are more efficient than patch-based methods.
Reinforcement learning enables active latent space exploration.

Method

DLR uses a three-stage pipeline: pretraining for cross-modal alignment, SFT for structured reasoning, and RL with Spherical Gaussian Latent Policy (SGLP) for latent exploration, guided by outcome and focus rewards.
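The summary does not spell out how SGLP is parameterized. As a rough illustration only, the sketch below assumes that "spherical Gaussian" means an isotropic Gaussian centered on the latent the model would otherwise emit, with one shared standard deviation across all dimensions. The function names, the example `mean` vector, and the `sigma` value are hypothetical and are not taken from the paper. Sampling gives the exploration step, and the log-density is the quantity a policy-gradient update would typically need.

```python
import math
import random

def sample_spherical_gaussian(mean, sigma, rng):
    """Draw z ~ N(mean, sigma^2 * I): the same scale is used for every dimension."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def log_prob_spherical_gaussian(z, mean, sigma):
    """Log-density of z under N(mean, sigma^2 * I)."""
    d = len(mean)
    sq_dist = sum((zi - mi) ** 2 for zi, mi in zip(z, mean))
    return -sq_dist / (2.0 * sigma ** 2) - 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)

rng = random.Random(0)           # fixed seed so the sketch is reproducible
mean = [0.5, -1.0, 0.25, 2.0]    # hypothetical latent predicted by the model
sigma = 0.1                      # hypothetical exploration scale (assumption)

z = sample_spherical_gaussian(mean, sigma, rng)      # explored latent
logp = log_prob_spherical_gaussian(z, mean, sigma)   # score for a policy-gradient term
```

Under this assumption, a larger `sigma` widens exploration around the predicted latent, and the log-density is highest when the sample coincides with the mean.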

In practice

Use DLR for complex visual question answering.
Apply SGLP for efficient latent space exploration.
Implement multi-step premise-conditioned visual grounding.
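The multi-step grounding described above can be sketched as a simple control loop. This is a minimal sketch of the decompose, look, reason flow as the summary describes it, not the authors' implementation: `next_premise`, `look`, and `reason` are hypothetical callables standing in for model components, and `max_steps` is an assumed safety cap. The toy stand-ins at the bottom exist only so the loop runs.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Latent = List[float]
Step = Tuple[str, Latent]

def decompose_look_reason(
    question: str,
    image: object,
    next_premise: Callable[[str, Sequence[Step]], Optional[str]],
    look: Callable[[object, str], Latent],
    reason: Callable[[str, Sequence[Step]], str],
    max_steps: int = 8,
) -> Tuple[str, List[Step]]:
    """Alternate premise generation and premise-conditioned visual lookup, then answer."""
    trace: List[Step] = []
    for _ in range(max_steps):
        premise = next_premise(question, trace)   # decompose: depends on the steps so far
        if premise is None:                       # no further premise is needed
            break
        trace.append((premise, look(image, premise)))  # look: latent conditioned on premise
    return reason(question, trace), trace         # reason: answer from the grounded trace

# Toy stand-ins (hypothetical) so the control flow can be exercised end to end.
def toy_next_premise(question, trace):
    premises = ["find the sign", "read the text on the sign"]
    return premises[len(trace)] if len(trace) < len(premises) else None

def toy_look(image, premise):
    return [float(len(premise)), 0.0]

def toy_reason(question, trace):
    return f"answer grounded in {len(trace)} premises"

answer, trace = decompose_look_reason(
    "What does the sign say?", None, toy_next_premise, toy_look, toy_reason
)
```

Returning the trace alongside the answer keeps every premise paired with the latent extracted for it, which is the kind of stepwise record that makes the reasoning inspectable.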

Topics

Decompose, Look, and Reason
Vision-Language Models
Reinforced Latent Reasoning
Spherical Gaussian Latent Policy
Multimodal Chain-of-Thought

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.