[R] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families

2026-02-20 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

An experiment evaluating Vision-Language Models (VLMs) on spatial reasoning found a large performance gap between text-rendered and square-rendered binary grids. Frontier VLMs, including Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking, achieved approximately 84% F1 when transcribing 15x15 binary grids rendered as text characters ("." and "#"). Performance collapsed to 29-39% F1 when the identical grids were rendered as filled squares, even though both inputs are images processed by the same visual encoder. This 34-54 point F1 gap suggests that spatial localization degrades severely without textual anchors. Each model failed in a distinct way: Claude under-counted, ChatGPT over-counted, and Gemini produced structured hallucinations, particularly above 32% density, even though it showed stronger visual pathway performance at low densities.

Key takeaway

For AI engineers building applications that process charts, spreadsheets, or diagrams: current VLMs have a strong implicit OCR capability but are far less robust on non-textual spatial features. Expect severely degraded performance on structural content that lacks textual anchors, and consider alternative approaches, such as introducing discrete visual tokens or pre-processing visual data into symbolic representations, to bridge this gap.

Key insights

VLMs excel at text-based spatial tasks but struggle with non-textual visual patterns, indicating a strong implicit OCR pipeline.

Principles

Textual anchors significantly enhance VLM spatial localization.
VLMs exhibit diverse failure modes on non-textual spatial tasks.

Method

Binary 15x15 grids were rendered as images in two forms, text symbols and filled squares. Frontier VLMs were asked to transcribe each image, and the transcriptions were scored by F1.
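
The scoring side of this setup can be sketched in a few lines. This is a minimal illustration, not the study's actual harness: the function names (`make_grid`, `to_text`, `parse_grid`, `cell_f1`) are hypothetical, the random grid generator is a stand-in, and treating filled cells as the positive class for F1 is an assumption, since the summary does not say how F1 was computed.

```python
import random


def make_grid(size=15, density=0.25, seed=0):
    """Build a size x size binary grid with roughly `density` filled cells."""
    rng = random.Random(seed)
    n_filled = round(size * size * density)
    filled = set(rng.sample(range(size * size), n_filled))
    return [[1 if r * size + c in filled else 0 for c in range(size)]
            for r in range(size)]


def to_text(grid):
    """Serialize a grid with '#' for filled cells and '.' for empty ones."""
    return "\n".join("".join("#" if v else "." for v in row) for row in grid)


def parse_grid(text):
    """Parse a '.'/'#' transcription (e.g. a model reply) back into 0/1 rows."""
    rows = [line.strip() for line in text.strip().splitlines() if line.strip()]
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def cell_f1(truth, pred):
    """Cell-level F1, assuming filled cells are the positive class."""
    if len(truth) != len(pred) or any(len(t) != len(p) for t, p in zip(truth, pred)):
        raise ValueError("truth and prediction must have the same shape")
    tp = fp = fn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    # Both grids empty: treat as perfect agreement (a convention, not a given).
    return 1.0 if denom == 0 else 2 * tp / denom
```

A model's transcription would be passed through `parse_grid` and compared against the ground-truth grid with `cell_f1`. Malformed replies (wrong row count or row length) raise an error here; a real evaluation would need an explicit policy for them.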

In practice

Avoid relying on VLMs for precise non-textual spatial reasoning.
Consider pre-processing non-textual diagrams into symbolic representations.
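
One way to read the pre-processing advice is to turn a filled-square image into the "." and "#" form before it reaches the model. The sketch below is a hypothetical illustration under strong assumptions: the image is already loaded as a 2D list of grayscale values (0 = black, 255 = white), the grid dimensions are known, cells are evenly sized, and filled cells are dark on a light background. `squares_to_symbols` is an illustrative name, not something described in the source.

```python
def squares_to_symbols(pixels, rows=15, cols=15, threshold=128):
    """Convert a grayscale image of filled squares into '.'/'#' text.

    `pixels` is a 2D list of intensities (0-255). Each grid cell is
    classified by the mean intensity of the pixel block it covers.
    """
    height, width = len(pixels), len(pixels[0])
    if height < rows or width < cols:
        raise ValueError("image is smaller than the requested grid")
    lines = []
    for r in range(rows):
        # Integer arithmetic splits the image into near-equal bands.
        y0, y1 = r * height // rows, (r + 1) * height // rows
        chars = []
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            mean = sum(block) / len(block)
            # Assumption: dark blocks are filled cells.
            chars.append("#" if mean < threshold else ".")
        lines.append("".join(chars))
    return "\n".join(lines)
```

The resulting text can be sent to the model directly or rendered back into an image of characters, which is the condition the summary reports the models handling far better. Real diagrams rarely satisfy these assumptions, so grid detection and thresholding would need more care in practice.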

Topics

Vision-Language Models
Spatial Reasoning
Text Recognition
Visual Perception
Model Evaluation

Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Researcher, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.