Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

2026-06-04 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Vision-Language Models (VLMs) fail to perceive surface slant from texture, a task on which human perception shows systematic, graded biases. Unlike humans and unsupervised Convolutional Neural Networks (CNNs), VLMs across model families and scales predict slant only at a few "anchor" values such as 0°, ±25°, and ±45°. These anchored predictions depend little on key stimulus properties such as field of view, optical slant, or surface curvature. Supervised fine-tuning partially mitigates the problem, but residual anchoring remains. The authors interpret this as a failure at the interface between the VLM's internal representation and its language output: the model appears unable to express its geometric encoding in graded form, rather than lacking geometric understanding altogether. The finding indicates that strong performance on high-level vision-language benchmarks does not guarantee sensitivity to low-level geometric cues.

Key takeaway

If you develop or deploy Vision-Language Models, be aware that current VLMs may lack fine-grained geometric perception, specifically for tasks like slant-from-texture. Your models might default to anchored, discrete predictions (e.g., 0°, ±25°, ±45°) rather than continuous, graded responses. If your application requires precise low-level spatial understanding, evaluate the model on that task directly and consider targeted supervised fine-tuning, keeping in mind that fine-tuning reduces anchoring only partially.

Key insights

VLMs struggle to express graded geometric perception, anchoring predictions to discrete values.

Principles

VLMs anchor slant predictions to discrete values.
High-level VLM success doesn't imply low-level geometric sensitivity.
Graded geometric perception is a representation-to-output interface challenge.

In practice

Supervised fine-tuning partially remediates VLM slant anchoring.
Evaluate VLMs for low-level geometric perception tasks.
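
One way to make such an evaluation concrete is to check how anchored a model's slant predictions are. The sketch below is illustrative only and is not the paper's protocol: the function name `anchoring_report`, the 0.5° tolerance, and the three summary statistics are assumptions, and the anchor values are simply the examples quoted in the summary. It takes ground-truth slants and model predictions in degrees and reports the fraction of predictions that land on an anchor, the number of distinct predicted values, and the least-squares slope of predicted against true slant.

```python
# Hypothetical anchoring check; names, tolerance, and metrics are assumptions.
ANCHORS = (-45.0, -25.0, 0.0, 25.0, 45.0)  # example anchors quoted in the summary


def anchoring_report(true_slants, predicted_slants, anchors=ANCHORS, tol=0.5):
    """Summarize how anchored a set of slant predictions (in degrees) is."""
    if not true_slants or len(true_slants) != len(predicted_slants):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted_slants)

    # Fraction of predictions lying within `tol` degrees of any anchor value.
    on_anchor = sum(
        any(abs(p - a) <= tol for a in anchors) for p in predicted_slants
    )

    # Number of distinct predicted values (rounded to 0.1 degree).
    distinct = len({round(p, 1) for p in predicted_slants})

    # Least-squares slope of predicted slant regressed on true slant.
    mean_t = sum(true_slants) / n
    mean_p = sum(predicted_slants) / n
    var_t = sum((t - mean_t) ** 2 for t in true_slants)
    cov = sum(
        (t - mean_t) * (p - mean_p)
        for t, p in zip(true_slants, predicted_slants)
    )
    slope = cov / var_t if var_t > 0 else float("nan")

    return {
        "anchor_fraction": on_anchor / n,
        "distinct_values": distinct,
        "slope": slope,
    }


if __name__ == "__main__":
    true = [-40, -30, -20, -10, 0, 10, 20, 30, 40]
    anchored = [-45, -25, -25, 0, 0, 0, 25, 25, 45]  # made-up anchored responses
    graded = [0.8 * t for t in true]                 # made-up graded responses
    print(anchoring_report(true, anchored))
    print(anchoring_report(true, graded))
```

The slope alone would not reveal anchoring, since predictions snapped to a few values can still track the true slant on average; a high anchor fraction together with a small number of distinct values is the more direct signal.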

Topics

Vision-Language Models
Geometric Perception
Slant from Texture
Anchoring Bias
Supervised Fine-tuning
Psychophysical Biases

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.