Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

Vision-Language Models (VLMs) fail to perceive surface slant from texture, a task on which human perception shows systematic, graded biases. Across multiple VLM families and scales, including GPT-4o and Gemini Flash-Lite, zero-shot and in-context prompting consistently produce "anchoring": predicted slant falls on a small set of discrete values such as 0°, ±25°, and ±45°, with minimal dependence on stimulus parameters such as field of view (FOV) or optical slant. Supervised fine-tuning (SFT) of Qwen2.5-VL-3B reduced mean slant error from 45.1° to 15.3° and raised curvature-sign accuracy to 86.10%, but residual anchoring persists. Layer-wise probing of vision encoders, such as that of Qwen2.5-VL-3B, shows that geometric information is well encoded (e.g., physical slant R²=0.826), suggesting a "language readout problem" rather than a lack of visual understanding.
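
One way to make "anchoring" concrete is to check how many predictions sit on a few fixed values and how strongly predictions track the stimulus. The sketch below is an illustration only, not the paper's analysis: the function name `anchoring_stats`, the tolerance, and the toy data are assumptions, and the anchor set simply reuses the values quoted above.

```python
import numpy as np

ANCHORS = np.array([-45.0, -25.0, 0.0, 25.0, 45.0])  # degrees, values quoted in the summary

def anchoring_stats(true_slant, pred_slant, anchors=ANCHORS, tol=1.0):
    """Summarise how anchored a set of slant predictions is (all angles in degrees)."""
    true_slant = np.asarray(true_slant, dtype=float)
    pred_slant = np.asarray(pred_slant, dtype=float)
    # Distance from each prediction to its nearest anchor value.
    nearest = np.abs(pred_slant[:, None] - anchors[None, :]).min(axis=1)
    # Least-squares slope of prediction against stimulus:
    # near 1 for a graded response, near 0 for a stimulus-independent one.
    slope = np.polyfit(true_slant, pred_slant, 1)[0]
    return {
        "frac_on_anchor": float(np.mean(nearest <= tol)),
        "n_distinct": int(np.unique(np.round(pred_slant)).size),
        "slope": float(slope),
    }

# Toy data (hypothetical, not results from the paper).
true = np.linspace(-60, 60, 25)
graded = 0.8 * true  # compressed but continuous response
# Every prediction snapped to the nearest anchor value.
anchored = ANCHORS[np.abs(true[:, None] - ANCHORS[None, :]).argmin(axis=1)]

stats_graded = anchoring_stats(true, graded)
stats_anchored = anchoring_stats(true, anchored)
print(stats_graded)
print(stats_anchored)
```

A graded responder yields many distinct values and few on-anchor predictions, while an anchored one collapses onto the anchor set. The slope separates the case described in the summary (predictions nearly independent of the stimulus) from mere quantization of an otherwise stimulus-driven response.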

Key takeaway

AI scientists developing or deploying Vision-Language Models in applications that demand fine-grained geometric perception should recognize that current VLMs struggle to express continuous geometric quantities and tend to "anchor" to discrete values. A model may encode geometric information effectively while its language interface fails to translate that information into graded outputs. Consider targeted fine-tuning and explicit geometric output formats to improve performance, and test rigorously with psychophysical stimuli to expose these subtle but critical limitations.

Key insights

VLMs fail to express continuous geometric perception due to a language readout bottleneck, despite encoding the information.

Method

VLMs were evaluated on synthetic dot-textured surfaces varying in optical slant, FOV, and curvature, under zero-shot, in-context, and supervised fine-tuning conditions with prompt variations. Vision encoder layers were probed via ridge regression.
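
A layer-wise ridge probe of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names (`ridge_fit`, `r2_score`, `probe_layers`), the regularization strength, the train/test split, and the synthetic "layer" features are all hypothetical stand-ins for real encoder activations.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weights.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def probe_layers(layer_features, target, alpha=1.0, train_frac=0.8, seed=0):
    """Held-out R^2 per layer, given {layer_name: array of shape (n, d)}."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(target))
    n_train = int(train_frac * len(target))
    train, test = perm[:n_train], perm[n_train:]
    scores = {}
    for name, feats in layer_features.items():
        w, b = ridge_fit(feats[train], target[train], alpha)
        scores[name] = float(r2_score(target[test], feats[test] @ w + b))
    return scores

# Synthetic stand-in for pooled encoder activations (assumption, not real data).
rng = np.random.default_rng(0)
slant = rng.uniform(-60, 60, size=400)      # target in degrees
noise_layer = rng.normal(size=(400, 16))    # carries no slant information
signal_layer = noise_layer.copy()
signal_layer[:, 0] += slant / 10.0          # one direction linearly encodes slant

scores = probe_layers({"early": noise_layer, "late": signal_layer}, slant)
print(scores)
```

A high held-out R² at some layer indicates that the target is linearly decodable from that layer's features, which is the kind of evidence the summary cites for the readout interpretation. It does not show that the language model actually uses that information.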

Best for: Computer Vision Engineer, AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.