Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
Summary
Vision-Language Models (VLMs) fail to perceive surface slant from texture in a graded way, a task on which human perception shows systematic, graded biases. Unlike humans and unsupervised Convolutional Neural Networks (CNNs), VLMs across model families and scales predict slant only at a few "anchor" values such as 0°, ±25°, and ±45°. These anchored predictions depend little on stimulus properties that should matter, such as field of view, optical slant, and surface curvature. Supervised fine-tuning partially mitigates the problem, but residual anchoring remains. The authors interpret this as a failure at the interface between the model's internal representation and its language output: the model appears unable to express its geometric encoding in graded form, rather than lacking geometric understanding altogether. The finding indicates that strong performance on high-level vision-language benchmarks does not guarantee sensitivity to low-level geometric cues.
Key takeaway
Machine Learning Engineers developing or deploying Vision-Language Models should expect that current VLMs may lack fine-grained geometric perception for tasks such as slant-from-texture. Models may default to anchored, discrete predictions (e.g., 0°, ±25°, ±45°) rather than continuous, graded responses. If your application requires precise low-level spatial understanding, apply targeted supervised fine-tuning and evaluate rigorously, keeping in mind that fine-tuning only partially reduces anchoring.
Key insights
VLMs struggle to express graded geometric perception, anchoring predictions to discrete values.
Principles
- VLMs anchor slant predictions to discrete values.
- High-level VLM success doesn't imply low-level geometric sensitivity.
- The failure is attributed to the representation-to-output interface, not to an absence of geometric encoding.
In practice
- Supervised fine-tuning partially remediates VLM slant anchoring.
- Evaluate VLMs on low-level geometric perception tasks before relying on them for spatial judgments.
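One way to check for anchoring in your own evaluation is sketched below. It is an illustration only, not the original study's protocol: the function name `anchoring_report`, the tolerance `tol`, and the example data are all hypothetical, and the anchor set is simply the values quoted in the summary. The sketch reports the share of predictions that land on an anchor, the number of distinct predicted values, and the least-squares slope of predicted against true slant.

```python
# Anchor values quoted in the summary (degrees). Treating them as the
# full anchor set is an assumption made for this sketch.
ANCHORS = (-45.0, -25.0, 0.0, 25.0, 45.0)


def anchoring_report(true_slants, predicted_slants, anchors=ANCHORS, tol=0.5):
    """Summarize how anchored a set of slant predictions is.

    `tol` (degrees) is a hypothetical tolerance for counting a
    prediction as sitting on an anchor.
    """
    n = len(predicted_slants)
    if n != len(true_slants) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    # Share of predictions within `tol` degrees of any anchor.
    on_anchor = sum(
        any(abs(p - a) <= tol for a in anchors) for p in predicted_slants
    )

    # Least-squares slope of predicted slant on true slant.
    mean_t = sum(true_slants) / n
    mean_p = sum(predicted_slants) / n
    var_t = sum((t - mean_t) ** 2 for t in true_slants)
    cov = sum(
        (t - mean_t) * (p - mean_p)
        for t, p in zip(true_slants, predicted_slants)
    )
    slope = cov / var_t if var_t > 0 else float("nan")

    return {
        "anchor_fraction": on_anchor / n,
        "distinct_predictions": len(set(predicted_slants)),
        "slope": slope,
    }


# Made-up example data: one anchored and one graded response pattern.
true_slants = [-40, -30, -20, -10, 0, 10, 20, 30, 40]
anchored = [-45, -25, -25, 0, 0, 0, 25, 25, 45]
graded = [-36, -27, -18, -9, 0, 9, 18, 27, 36]

print(anchoring_report(true_slants, anchored))
print(anchoring_report(true_slants, graded))
```

In this made-up data, every anchored prediction lands on an anchor and only five distinct values appear, while the graded responses produce nine distinct values. The regression slope can look reasonable even when responses are quantized, which is why the sketch reports it alongside the other two measures and does not rely on it alone.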
Topics
- Vision-Language Models
- Geometric Perception
- Slant from Texture
- Anchoring Bias
- Supervised Fine-tuning
- Psychophysical Biases
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.