Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models
Summary
The Grid2Matrix (G2M) benchmark reveals "Digital Agnosia" in Vision-Language Models (VLMs): they fail to faithfully transcribe dense spatial information from images into text, even for surprisingly small grids. The benchmark presents a VLM with a color grid and a color-to-number mapping and requires it to output the corresponding matrix. Although VLMs excel on many multimodal reasoning tasks, G2M shows a sharp, early collapse in zero-shot end-to-end performance for proprietary models such as GPT-5 and Gemini-3 and for open-weight models such as InternVL3.5 and Qwen3-VL. Probing the vision encoders in isolation shows that they retain substantially more grid information than the full VLM outputs express, which points to a bottleneck in how visual features are accessed or expressed in language. The errors are structured and influenced by patch-grid alignment, and neither model scaling nor multimodal alignment fully mitigates them, a critical limitation for tasks such as table parsing or GUI interaction.
Key takeaway
For Computer Vision Engineers developing or deploying VLMs for layout-sensitive applications such as forms or charts, this research indicates that current models struggle with dense spatial transcription. Prioritize diagnostic benchmarks like G2M to identify and address "Digital Agnosia" early in the development cycle. Relying solely on high-level semantic benchmarks can hide failures to capture fine-grained visual details faithfully, which leads to unreliable performance in practical scenarios that require exact cell-by-cell readout.
Key insights
VLMs exhibit "Digital Agnosia," failing to express fine-grained visual details despite their vision encoders retaining the information.
Principles
- Dense spatial perception is a distinct VLM challenge.
- Vision encoder information is often lost in language generation.
- Patch-grid alignment critically impacts spatial fidelity.
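To make the patch-grid alignment principle concrete, the sketch below checks whether the interior boundaries of a square grid fall on the edges of a vision encoder's patches. The function name and the 224-pixel image with 14-pixel patches are illustrative assumptions, not values taken from the paper, and the sketch ignores any resizing or tiling a real VLM applies before patching.

```python
from fractions import Fraction


def misaligned_boundaries(image_px, num_cells, patch_px):
    """Return the interior cell boundaries (in pixels) that miss every patch edge.

    Hypothetical helper: assumes a square image split evenly into num_cells
    cells per side and a ViT-style encoder with square patches of patch_px.
    """
    cell = Fraction(image_px, num_cells)  # exact cell width, avoids float error
    bounds = [cell * i for i in range(1, num_cells)]
    return [float(b) for b in bounds if b % patch_px != 0]


# Illustrative numbers only: 224 px image, 14 px patches.
aligned = misaligned_boundaries(224, 16, 14)     # cell width equals patch width
misaligned = misaligned_boundaries(224, 10, 14)  # cell width is 22.4 px
```

In this toy setting a 16x16 grid places every cell boundary on a patch edge, while a 10x10 grid leaves most boundaries inside a patch, so many patches mix two colors.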
Method
Grid2Matrix (G2M) is a diagnostic benchmark using synthetic color grids and color-to-number mappings to test VLM dense spatial transcription, varying grid size and color count.
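The task setup described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the palette, the function names `make_task` and `score`, and the two metrics (cell accuracy and exact match) are hypothetical, and rendering the grid to an image is left out.

```python
import random

# Hypothetical palette; the benchmark's actual colors are not given here.
PALETTE = ["red", "green", "blue", "yellow", "purple", "orange"]


def make_task(rows, cols, num_colors, seed=0):
    """Build one synthetic task: a color grid, a mapping, and the target matrix."""
    if not 1 <= num_colors <= len(PALETTE):
        raise ValueError("num_colors must be between 1 and len(PALETTE)")
    rng = random.Random(seed)
    colors = PALETTE[:num_colors]
    numbers = list(range(num_colors))
    rng.shuffle(numbers)  # give each color a distinct number
    mapping = dict(zip(colors, numbers))
    grid = [[rng.choice(colors) for _ in range(cols)] for _ in range(rows)]
    target = [[mapping[c] for c in row] for row in grid]
    return grid, mapping, target


def score(pred, target):
    """Return (cell_accuracy, exact_match); a shape mismatch scores zero."""
    if len(pred) != len(target) or any(
        len(p) != len(t) for p, t in zip(pred, target)
    ):
        return 0.0, False
    total = sum(len(row) for row in target)
    correct = sum(
        p == t for p_row, t_row in zip(pred, target) for p, t in zip(p_row, t_row)
    )
    return correct / total, correct == total


grid, mapping, target = make_task(rows=4, cols=5, num_colors=3, seed=1)
```

Varying `rows`, `cols`, and `num_colors` mirrors the two axes the benchmark varies, grid size and color count. A model's parsed output would be passed to `score` as `pred`.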
In practice
- Use G2M to diagnose VLM spatial fidelity.
- Analyze error heatmaps for architectural biases.
- Consider patch-grid alignment in VLM design.
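One way to analyze error heatmaps, sketched here under assumptions, is to aggregate per-cell mistakes over many predictions for grids of the same size. The function name `error_heatmap` and the input format (pairs of predicted and target matrices as nested lists) are hypothetical and not taken from the paper.

```python
def error_heatmap(pairs):
    """Return the per-cell error rate over (pred, target) pairs of equal shape.

    Hypothetical helper: each matrix is a list of rows, and every pair is
    assumed to share the shape of the first target.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (pred, target) pair")
    rows, cols = len(pairs[0][1]), len(pairs[0][1][0])
    counts = [[0] * cols for _ in range(rows)]
    for pred, target in pairs:
        for r in range(rows):
            for c in range(cols):
                if pred[r][c] != target[r][c]:
                    counts[r][c] += 1  # one more mistake at this position
    n = len(pairs)
    return [[k / n for k in row] for row in counts]


# Toy example with two predictions for the same 2x2 target.
target = [[0, 1], [1, 0]]
heat = error_heatmap([([[0, 1], [1, 1]], target), ([[1, 1], [1, 1]], target)])
```

Cells with consistently high error rates in such a map would be candidates for checking against patch boundaries or other architectural biases.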
Topics
- Grid2Matrix Benchmark
- Vision-Language Models
- Digital Agnosia
- Vision Encoders
- Dense Spatial Perception
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.