Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

Source: Machine Learning · Field: Technology & Digital / Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study identifies "Silenced Visual Latents," an optimization pathology in existing latent visual reasoning methods for multimodal large language models (MLLMs). Semantically enriched visual latents are systematically suppressed during final answer prediction because the autoregressive objective favors direct visual input over latent reasoning. To counteract this, the researchers disentangle the conflicting objectives by optimizing latent reasoning at inference time while keeping the backbone parameters frozen. Their approach has two stages. Stage I warms up the visual latents through query-guided contrastive latent-visual alignment to improve their semantic quality. Stage II optimizes latent reasoning with a confidence-progression reward, which incentivizes the predicted token distributions along the latent span to become progressively more concentrated so that predictions route through latent reasoning. Experiments across eight benchmarks and four model backbones show that this inference-time optimization releases the suppressed reasoning capacity.
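To make Stage I more concrete, here is a minimal sketch of a contrastive (InfoNCE-style) alignment loss in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of cosine similarity, the temperature value, and the way positive and negative visual features are chosen are all hypothetical, since this summary does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(latent, positive, negatives, temperature=0.1):
    """InfoNCE loss for one latent vector (hypothetical formulation).

    latent    -- the visual latent being warmed up
    positive  -- a visual feature assumed to be relevant to the query
    negatives -- visual features assumed to be irrelevant to the query
    """
    sims = [cosine(latent, positive)]
    sims += [cosine(latent, neg) for neg in negatives]
    scaled = [s / temperature for s in sims]
    # Numerically stable log-sum-exp over the positive and all negatives.
    peak = max(scaled)
    log_denominator = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - scaled[0]


positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
aligned_loss = info_nce([1.0, 0.0], positive, negatives)
misaligned_loss = info_nce([0.0, 1.0], positive, negatives)
```

In this toy setup the loss is smaller when the latent points toward the query-relevant feature than when it points toward a negative, which is the behavior a warm-up stage of this kind would aim for.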

Key takeaway

AI engineers and research scientists working with multimodal large language models (MLLMs) should investigate inference-time latent optimization as a way to improve reasoning. The technique can unlock suppressed visual reasoning capacity without retraining the backbone, potentially improving performance on complex visual tasks. Consider implementing query-guided contrastive alignment and a confidence-progression reward to activate these "silenced" latents.

Key insights

Visual latents in MLLMs can be "silenced" by training objectives, but their reasoning capacity can be unleashed at inference time.

Principles

Method

The proposed method optimizes visual latents at inference time in two stages: query-guided contrastive latent-visual alignment, followed by a confidence-progression reward that pushes the predicted token distributions along the latent span to become progressively more concentrated.
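One way to picture the confidence-progression reward is as a measure of how much the entropy of the predicted token distribution falls from one latent position to the next. The sketch below is a guess at the general shape of such a reward, not the formula from the paper: the function names and the choice of "mean entropy drop between consecutive positions" are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more concentrated distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def confidence_progression_reward(logits_per_position):
    """Mean entropy drop between consecutive latent positions (assumed form).

    logits_per_position -- one list of vocabulary logits per latent position,
                           in the order the latents appear in the span.
    Positive when distributions sharpen along the span, negative when they flatten.
    """
    entropies = [entropy(softmax(logits)) for logits in logits_per_position]
    if len(entropies) < 2:
        return 0.0  # a single position has no progression to measure
    drops = [entropies[i] - entropies[i + 1] for i in range(len(entropies) - 1)]
    return sum(drops) / len(drops)


sharpening = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [6.0, 0.0, 0.0]]
reward_sharpening = confidence_progression_reward(sharpening)
reward_flattening = confidence_progression_reward(list(reversed(sharpening)))
```

Because the mean of consecutive drops reduces to the first entropy minus the last, divided by the number of steps, this simple form does not penalize temporary increases in the middle of the span. A stricter variant could weight increases more heavily, but that choice is also an assumption.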

Topics

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.