Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents
Summary
A study of latent visual reasoning (LVR) in vision-language models (VLMs) challenges the assumption that closer alignment between supervised latent tokens and their visual targets improves answer accuracy. Across five LVR variants, the researchers found a strong negative correlation (r = -0.94) between cosine alignment and accuracy. They introduce PRISM, a pair of inference-time diagnostics, which shows that the supervised latents are largely bypassed: corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent rather than at the latent itself, and the size of this decodability gap predicts how much a model relies on its latent. The findings suggest that auxiliary objectives reshape the language model through shared parameters, rather than primarily through the latent variable they nominally optimize.
Key takeaway
If you design or evaluate vision-language models with latent visual reasoning, re-examine the common assumption that better latent alignment directly improves accuracy. The evidence suggests that auxiliary objectives mainly reshape the language model through shared parameters, bypassing the latents. Study those model-wide effects rather than optimizing the latent variable alone, and consider diagnostics like PRISM to check whether a latent is actually used.
Key insights
Across the five LVR variants tested, higher cosine alignment went with lower accuracy (r = -0.94), inverting the core assumption that better latent alignment improves vision-language model accuracy.
Principles
- Cosine alignment negatively correlates with LVR accuracy.
- Supervised latents in LVR are often bypassed.
- Auxiliary losses reshape VLMs via shared parameters.
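To make the first principle concrete, here is a minimal sketch of how a correlation between cosine alignment and accuracy could be computed across model variants. The function names and the five placeholder value pairs are illustrative assumptions, not the paper's code or data; they only show the shape of the calculation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers for five hypothetical variants. These are NOT values
# from the paper; they only illustrate alignment rising while accuracy falls.
alignment = [0.2, 0.4, 0.5, 0.7, 0.9]    # mean cosine(latent, visual target)
accuracy = [0.70, 0.66, 0.65, 0.60, 0.58]  # answer accuracy per variant

r = pearson_r(alignment, accuracy)
print(f"Pearson r between alignment and accuracy: {r:.2f}")
```

In a real evaluation, each alignment value would be the mean of `cosine_similarity` between a variant's supervised latents and their visual targets over a held-out set. With only five variants, a correlation is suggestive rather than conclusive, which is one reason the per-model diagnostics matter.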
Method
PRISM pairs two inference-time diagnostics: a linear probe that measures where the answer becomes decodable, and a corruption test that measures whether the latent is load-bearing, that is, whether corrupting it changes accuracy.
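The corruption test can be sketched on a toy problem. The sketch assumes corruption means replacing the latent with Gaussian noise, which the summary does not specify. The two toy "models", the `(latent, context, label)` example layout, and all function names are hypothetical stand-ins, not the PRISM implementation.

```python
import random


def accuracy(predict, examples):
    """Fraction of (latent, context, label) examples predicted correctly."""
    correct = sum(
        1 for latent, context, label in examples
        if predict(latent, context) == label
    )
    return correct / len(examples)


def corruption_shift(predict, examples, noise_scale=1.0, seed=0):
    """Clean accuracy minus accuracy after replacing each latent with noise.

    Assumption: corruption = swapping the latent for Gaussian noise.
    A shift near zero suggests the latent is bypassed.
    """
    rng = random.Random(seed)
    corrupted = [
        ([rng.gauss(0.0, noise_scale) for _ in latent], context, label)
        for latent, context, label in examples
    ]
    return accuracy(predict, examples) - accuracy(predict, corrupted)


def make_examples(n=200, seed=1):
    """Toy data: the label is recoverable from both latent and context."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        signal = 1.0 if label == 1 else -1.0
        examples.append(([signal], [signal], label))
    return examples


def bypassing_model(latent, context):
    """Hypothetical model that ignores the latent and reads the context."""
    return 1 if context[0] > 0 else 0


def latent_reliant_model(latent, context):
    """Hypothetical model that reads only the latent."""
    return 1 if latent[0] > 0 else 0


examples = make_examples()
bypass_shift = corruption_shift(bypassing_model, examples)
reliant_shift = corruption_shift(latent_reliant_model, examples)
print(f"bypassing model shift: {bypass_shift:.2f}")
print(f"latent-reliant model shift: {reliant_shift:.2f}")
```

The bypassing model loses nothing when its latent is corrupted, while the latent-reliant model drops toward chance. The reported shift of at most four points puts the studied LVR variants much closer to the first case.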
In practice
- Re-evaluate LVR latent alignment metrics.
- Test latent variable reliance with corruption.
- Analyze answer decodability downstream.
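The decodability check in the last step can be sketched as a linear probe trained at two positions and compared on held-out data. It assumes a logistic-regression probe and uses synthetic "hidden states" in which the label is absent at the latent position and present downstream. The data generator and all names are hypothetical and only mirror the pattern the summary describes.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(features[0])
    weights, bias, n = [0.0] * dim, 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Held-out accuracy of a trained probe."""
    correct = 0
    for x, y in zip(features, labels):
        pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)


def make_toy_states(n, seed):
    """Synthetic states: noise at the latent, label signal downstream."""
    rng = random.Random(seed)
    labels, at_latent, downstream = [], [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        signal = 1.0 if y == 1 else -1.0
        labels.append(y)
        at_latent.append([rng.gauss(0, 1), rng.gauss(0, 1)])
        downstream.append([signal + rng.gauss(0, 0.3), rng.gauss(0, 1)])
    return labels, at_latent, downstream


y_train, latent_train, down_train = make_toy_states(200, seed=0)
y_test, latent_test, down_test = make_toy_states(200, seed=1)

latent_acc = probe_accuracy(*train_linear_probe(latent_train, y_train),
                            latent_test, y_test)
downstream_acc = probe_accuracy(*train_linear_probe(down_train, y_train),
                                down_test, y_test)
gap = downstream_acc - latent_acc  # decodability gap
print(f"at latent: {latent_acc:.2f}, downstream: {downstream_acc:.2f}, "
      f"gap: {gap:.2f}")
```

On real models, the features would be hidden states extracted at the latent token and at a later position. A large positive gap would indicate that the answer is assembled after the latent rather than in it.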
Topics
- Vision-Language Models
- Latent Visual Reasoning
- Auxiliary Losses
- Model Diagnostics
- Cosine Similarity
- Information Bottleneck
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.