Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs
Summary
A new study identifies "Silenced Visual Latents," an optimization pathology in existing latent visual reasoning methods for multimodal models. This pathology causes semantically enriched visual latents to be systematically suppressed during final answer prediction, as the autoregressive objective favors direct visual input over latent reasoning. To counteract this, the researchers propose disentangling conflicting objectives by optimizing latent reasoning at inference time, while keeping backbone parameters frozen. Their two-stage approach involves warming up visual latents via query-guided contrastive latent-visual alignment in Stage I to improve semantic quality, and then optimizing latent reasoning in Stage II using a confidence-progression reward. This reward incentivizes predicted token distributions along the latent span to become progressively more concentrated, ensuring predictions route through latent reasoning. Experiments across eight benchmarks and four model backbones demonstrate that this inference-time optimization effectively unleashes suppressed reasoning capacity.
Key takeaway
AI engineers and research scientists working with Multimodal Large Language Models (MLLMs) should investigate inference-time latent optimization as a way to improve reasoning. Because the backbone stays frozen, the technique can unlock suppressed visual reasoning capacity without retraining, potentially improving performance on complex visual tasks. Consider implementing query-guided contrastive alignment and a confidence-progression reward to activate these "silenced" latents.
Key insights
Visual latents in MLLMs can be "silenced" by training objectives, but their reasoning capacity can be unleashed at inference time.
Principles
- Autoregressive objectives can create shortcut reliance.
- Disentangle conflicting optimization objectives.
- Inference-time optimization can activate dormant capacities.
Method
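The proposed method optimizes visual latents at inference time, with backbone parameters frozen, in two stages. Stage I warms up the latents with query-guided contrastive latent-visual alignment to improve their semantic quality. Stage II optimizes latent reasoning with a confidence-progression reward, which pushes the predicted token distributions along the latent span to become progressively more concentrated.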
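To make the Stage I idea concrete, here is a minimal sketch of a query-guided contrastive alignment loss. It is not the paper's formulation, which this summary does not give. It assumes a soft InfoNCE-style objective: the query defines a target distribution over visual patch features, and the latent is scored by how well its own similarity distribution matches that target. The function names, the cosine similarity, and the `temperature` parameter are all hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def query_weights(query, patches, temperature=0.1):
    """Softmax over query-patch similarities (assumed form of 'query guidance')."""
    scores = [cosine(query, p) / temperature for p in patches]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def alignment_loss(latent, query, patches, temperature=0.1):
    """Cross-entropy between the query-derived targets and the latent's
    softmax similarity over patches (a soft InfoNCE-style loss)."""
    targets = query_weights(query, patches, temperature)
    logits = [cosine(latent, p) / temperature for p in patches]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return -sum(t * (l - log_z) for t, l in zip(targets, logits))

# Toy data: three patch features and a query that matches the first patch.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
aligned_loss = alignment_loss([1.0, 0.0], query, patches)
misaligned_loss = alignment_loss([0.0, 1.0], query, patches)
```

In this toy setup, a latent pointing at the query-relevant patch gets a lower loss than one pointing at an irrelevant patch. In an inference-time setting, only the latent vector would be updated against such a loss, with the backbone left untouched.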
In practice
- Apply inference-time latent optimization.
- Use contrastive alignment for latent warmup.
- Incentivize concentrated token distributions.
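The last point above can be illustrated with a small sketch of a confidence-progression style reward. The summary does not give the paper's exact reward, so this is an assumed form: concentration is measured by Shannon entropy, each step-to-step entropy drop along the latent span is rewarded, and each rise is penalized more heavily. The `rise_penalty` parameter and the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_progression_reward(step_distributions, rise_penalty=2.0):
    """Reward distributions that sharpen step by step along the latent span.

    Each consecutive pair contributes its entropy drop. A rise in entropy
    (a negative drop) is scaled by `rise_penalty`, so a span that wanders
    scores lower than one that sharpens steadily (assumed design choice).
    """
    entropies = [entropy(d) for d in step_distributions]
    reward = 0.0
    for prev, cur in zip(entropies, entropies[1:]):
        drop = prev - cur
        reward += drop if drop >= 0.0 else rise_penalty * drop
    return reward

# Toy predicted token distributions over a 4-token vocabulary.
sharpening = [
    [0.25, 0.25, 0.25, 0.25],
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
]
diffusing = list(reversed(sharpening))

reward_sharpening = confidence_progression_reward(sharpening)
reward_diffusing = confidence_progression_reward(diffusing)
```

A span whose distributions sharpen monotonically gets a positive reward, while one that grows more diffuse gets a negative one. The asymmetric penalty keeps the reward from collapsing to a simple difference between the first and last step.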
Topics
- Multimodal Large Language Models
- Visual Latent Reasoning
- Silenced Visual Latents
- Inference-Time Optimization
- Contrastive Latent-Visual Alignment
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.