Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

2026-06-04 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study on latent visual reasoning (LVR) in vision-language models (VLMs) challenges the assumption that better alignment between supervised latent tokens and visual targets improves answer accuracy. Across five LVR variants, the researchers found a strong negative correlation (r = -0.94) between cosine alignment and accuracy. They introduce PRISM, a pair of inference-time diagnostics, which shows that the supervised latents are largely bypassed: corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent rather than at it, and the size of this decodability gap predicts how much the model relies on the latent. The findings suggest that auxiliary objectives reshape the language model through shared parameters rather than primarily through the latent variable they nominally optimize.

Key takeaway

AI Scientists and Research Scientists who design or evaluate vision-language models with latent visual reasoning should re-examine the common assumption that better latent alignment directly improves VLM accuracy. The evidence suggests that auxiliary objectives mainly reshape the language model through shared parameters, while the latents themselves are largely bypassed. Study these model-wide effects rather than optimizing latent alignment alone, and consider diagnostics like PRISM to check whether a latent is actually used.

Key insights

The study inverts the core assumption that better latent alignment improves vision-language model accuracy: across the five variants tested, higher cosine alignment went with lower accuracy.

Principles

Cosine alignment negatively correlates with LVR accuracy.
Supervised latents in LVR are often bypassed.
Auxiliary losses reshape VLMs via shared parameters.
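
To make the first principle concrete, here is a minimal Python sketch of the two quantities involved: the mean cosine alignment between latents and their visual targets, and the Pearson correlation between alignment and accuracy across variants. The function names and the toy numbers are assumptions for illustration; they are not the study's code or data.

```python
import numpy as np

def mean_cosine_alignment(latents, targets):
    """Mean cosine similarity between paired rows of two (n, d) arrays."""
    latents = np.asarray(latents, dtype=float)
    targets = np.asarray(targets, dtype=float)
    dots = np.sum(latents * targets, axis=1)
    norms = np.linalg.norm(latents, axis=1) * np.linalg.norm(targets, axis=1)
    return float(np.mean(dots / norms))

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Made-up placeholder values for five hypothetical variants (NOT the study's data).
toy_alignment = [0.20, 0.35, 0.50, 0.65, 0.80]
toy_accuracy = [0.71, 0.69, 0.66, 0.64, 0.60]
print(pearson_r(toy_alignment, toy_accuracy))  # negative for these toy values
```

A negative value from `pearson_r` means that variants with better-aligned latents tend to score lower, which is the pattern the study reports.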

Method

PRISM pairs two inference-time diagnostics: a linear probe that measures where the answer is decodable, and a corruption test that measures whether the latent is load-bearing.
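
The probe half of this method can be sketched as follows. This is a minimal illustration, assuming a ridge-regularized linear probe on one-hot answers and defining the decodability gap as held-out probe accuracy downstream minus held-out probe accuracy at the latent; the summary does not specify PRISM's actual probe, and all names and the synthetic data here are hypothetical.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a ridge-regularized linear map from features to one-hot answers."""
    X = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    Y = np.eye(num_classes)[labels]                          # one-hot targets
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(weights, features, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    X = np.hstack([features, np.ones((len(features), 1))])
    return float(np.mean(np.argmax(X @ weights, axis=1) == labels))

def decodability_gap(latent_feats, downstream_feats, labels, num_classes,
                     train_frac=0.5):
    """Held-out probe accuracy downstream minus held-out accuracy at the latent."""
    split = int(len(labels) * train_frac)
    scores = []
    for feats in (latent_feats, downstream_feats):
        w = fit_linear_probe(feats[:split], labels[:split], num_classes)
        scores.append(probe_accuracy(w, feats[split:], labels[split:]))
    return scores[1] - scores[0]

# Synthetic stand-in for hidden states (assumption: a real run would extract
# these from the model at the latent position and at a later position).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
latent_feats = rng.normal(size=(400, 8))            # carries no answer signal
downstream_feats = rng.normal(size=(400, 8))
downstream_feats[:, 0] += 4.0 * (2 * labels - 1)    # answer linearly encoded
gap = decodability_gap(latent_feats, downstream_feats, labels, num_classes=2)
print(gap)
```

A large positive gap in this toy setup corresponds to the reported pattern: the answer is readable downstream of the latent but not at it.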

In practice

Re-evaluate LVR latent alignment metrics.
Test latent variable reliance with corruption.
Analyze answer decodability downstream.
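
The corruption check in the list above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes corruption means replacing the latents with Gaussian noise, and `predict_fn` is a hypothetical wrapper around a model's forward pass that accepts substituted latents. Neither detail is confirmed by the summary.

```python
import numpy as np

def corruption_shift(predict_fn, latents, labels, rng, noise_scale=1.0):
    """Accuracy with clean latents minus accuracy with noise-replaced latents."""
    clean_acc = np.mean(predict_fn(latents) == labels)
    noise = rng.normal(scale=noise_scale, size=latents.shape)
    corrupted_acc = np.mean(predict_fn(noise) == labels)
    return float(clean_acc - corrupted_acc)

# Toy setup (hypothetical): the latent encodes the answer in dimension 0.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
latents = rng.normal(size=(400, 8))
latents[:, 0] = 3.0 * (2 * labels - 1)

def reliant_model(z):
    """Toy model that reads its answer from the latent."""
    return (z[:, 0] > 0).astype(int)

def bypass_model(z):
    """Toy model that ignores the latent and answers via another pathway."""
    return labels

reliant_shift = corruption_shift(reliant_model, latents, labels, rng)
bypass_shift = corruption_shift(bypass_model, latents, labels, rng)
print(reliant_shift, bypass_shift)
```

A shift near zero, as with `bypass_model`, indicates the latent is not load-bearing; the study reports shifts of at most four points for the supervised latents it tested.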

Topics

Vision-Language Models
Latent Visual Reasoning
Auxiliary Losses
Model Diagnostics
Cosine Similarity
Information Bottleneck

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.