[R] Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
Summary
An experiment evaluating Vision-Language Models (VLMs) on spatial reasoning found a large performance gap between text-rendered and square-rendered binary grids. Frontier VLMs, including Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking, achieved approximately 84% F1 when transcribing 15x15 binary grids rendered as text characters ("." and "#"). Performance collapsed to 29-39% F1 when the identical grids were rendered as filled squares, even though both inputs are images processed by the same visual encoder. This 34-54 point F1 gap suggests that spatial localization degrades severely without textual anchors. Each model exhibited a distinct failure mode: Claude under-counted, ChatGPT over-counted, and Gemini produced structured hallucinations, particularly above 32% density, despite performing better through the visual pathway at low densities.
Key takeaway
AI engineers building applications that process charts, spreadsheets, or diagrams should recognize that current VLMs have a strong implicit OCR capability but lack equivalent robustness for non-textual spatial features. Expect severely degraded performance on structural content without textual anchors, and consider alternatives such as introducing discrete visual tokens or pre-processing visual data into symbolic representations to bridge the gap.
Key insights
VLMs excel at text-based spatial tasks but struggle with non-textual visual patterns, indicating a strong implicit OCR pipeline.
Principles
- Textual anchors significantly enhance VLM spatial localization.
- VLMs exhibit diverse failure modes on non-textual spatial tasks.
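One way to tell under-counting from over-counting is to compare how many cells a model marks as filled against the true number. The sketch below is illustrative only: `count_bias` is a hypothetical helper, not something taken from the original study, and it assumes grids are stored as lists of lists of booleans.

```python
def count_bias(truth, pred):
    """Return predicted filled cells minus true filled cells.

    Hypothetical diagnostic (not from the original study):
    negative means under-counting, positive means over-counting.
    Grids are assumed to be lists of lists of booleans.
    """
    true_filled = sum(sum(row) for row in truth)  # True counts as 1
    pred_filled = sum(sum(row) for row in pred)
    return pred_filled - true_filled


truth = [[True, False], [True, True]]
pred = [[True, False], [False, False]]
print(count_bias(truth, pred))  # negative value: the prediction under-counts
```

A count difference alone does not show where the errors fall, so it complements a cell-level score such as F1 and does not replace it.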
Method
Binary grids (15x15) were rendered both as text symbols and as filled squares, then given to frontier VLMs to transcribe; the transcriptions were scored with F1.
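The scoring side of this setup can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the study's actual code: the function names are hypothetical, the grid is drawn at random with a chosen density, and F1 is assumed to be computed per cell with filled cells as the positive class (the summary does not give the exact definition).

```python
import random


def make_grid(size=15, density=0.25, seed=0):
    """Random binary grid; True marks a filled cell (assumed setup)."""
    rng = random.Random(seed)
    return [[rng.random() < density for _ in range(size)] for _ in range(size)]


def render_text(grid):
    """Render the grid with '.' for empty and '#' for filled cells."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


def parse_text(text):
    """Parse a '.'/'#' transcription back into a boolean grid."""
    return [[ch == "#" for ch in line.strip()] for line in text.strip().splitlines()]


def f1_score(truth, pred):
    """Cell-level F1 with filled cells as the positive class (assumption).

    zip() silently truncates mismatched shapes, so check dimensions
    separately if a transcription may have missing rows or columns.
    """
    tp = fp = fn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


grid = make_grid()
transcription = render_text(grid)  # stand-in for a model's output
print(f1_score(grid, parse_text(transcription)))
```

In a real run, `transcription` would be the model's reply to an image of the grid, and the same scoring would apply to both the text-rendered and square-rendered conditions.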
In practice
- Avoid relying on VLMs for precise non-textual spatial reasoning.
- Consider pre-processing non-textual diagrams into symbolic representations.
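As an illustration of the pre-processing idea, the sketch below converts a filled-squares image into the "." and "#" text form before it reaches a model. It rests on assumptions not stated in the article: the image is already available as a 2D list of grayscale values (0 is black, 255 is white), filled cells are dark, and the cell size in pixels is known. `pixels_to_symbols` is a hypothetical helper name.

```python
def pixels_to_symbols(pixels, cell_px, threshold=128):
    """Convert a grayscale pixel grid into '.'/'#' text.

    Assumptions (not from the article): pixels is a 2D list of
    0-255 values, filled cells are dark, and every cell is
    cell_px by cell_px pixels with no gridlines or margins.
    """
    rows = len(pixels) // cell_px
    cols = len(pixels[0]) // cell_px
    lines = []
    for r in range(rows):
        symbols = []
        for c in range(cols):
            # Average the intensity over the whole cell to tolerate noise.
            block = [
                pixels[y][x]
                for y in range(r * cell_px, (r + 1) * cell_px)
                for x in range(c * cell_px, (c + 1) * cell_px)
            ]
            mean = sum(block) / len(block)
            symbols.append("#" if mean < threshold else ".")
        lines.append("".join(symbols))
    return "\n".join(lines)


# A 2x2 grid with 2-pixel cells: dark top-left and bottom-right.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
print(pixels_to_symbols(image, cell_px=2))
```

The resulting text can be sent to the model as characters or re-rendered as a text image. Real diagrams rarely have a known, regular cell size, so this kind of conversion usually needs a layout-detection step first.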
Topics
- Vision-Language Models
- Spatial Reasoning
- Text Recognition
- Visual Perception
- Model Evaluation
Best for: Research Scientist, AI Engineer, Computer Vision Engineer, AI Researcher, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.