Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Research Methodology & Innovation · Depth: Expert, extended

Summary

The Grid2Matrix (G2M) benchmark reveals a "Digital Agnosia" in Vision-Language Models (VLMs): they fail to faithfully transcribe dense spatial information from images into text, even for surprisingly small grids. The diagnostic presents a VLM with a color grid and a color-to-number mapping and requires it to output the corresponding matrix. Although VLMs excel on many multimodal reasoning tasks, zero-shot end-to-end performance on G2M collapses sharply and early, both for proprietary models such as GPT-5 and Gemini-3 and for open-weight models such as InternVL3.5 and Qwen3-VL. Probes of the isolated vision encoders show that they retain substantially more grid information than the full VLM outputs, which indicates a bottleneck in how visual features are accessed or expressed in language. The errors are structured and influenced by patch-grid alignment, and neither model scaling nor multimodal alignment fully mitigates them. This is a critical limitation for tasks such as table parsing or GUI interaction.

Key takeaway

For computer vision engineers developing or deploying VLMs in layout-sensitive applications such as forms or charts, this research indicates that current models struggle with dense spatial transcription. Prioritize diagnostic benchmarks like G2M to identify and address "Digital Agnosia" early in the development cycle. High-level semantic benchmarks alone may hide failures to capture fine-grained visual detail, which leads to unreliable performance in practical scenarios that require an exact, cell-by-cell readout.

Key insights

VLMs exhibit "Digital Agnosia," failing to express fine-grained visual details despite their vision encoders retaining the information.

Principles

Dense spatial perception is a distinct VLM challenge.
Vision encoder information is often lost in language generation.
Patch-grid alignment critically impacts spatial fidelity.

Method

Grid2Matrix (G2M) is a diagnostic benchmark that tests dense spatial transcription in VLMs. Each task pairs a synthetic color grid with a color-to-number mapping, and the benchmark varies grid size and color count.
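
A minimal sketch of what one such task and its scoring could look like. The palette, the function names (`make_task`, `cell_accuracy`, `exact_match`), and the two metrics are assumptions made for illustration; this summary does not specify the benchmark's actual colors, rendering, or scoring. Rendering the grid to an image is left out to keep the sketch dependency-free.

```python
import random

# Hypothetical palette: the benchmark's real colors are not given in this summary.
PALETTE = ["red", "green", "blue", "yellow", "black", "white"]


def make_task(rows, cols, num_colors, seed=0):
    """Build one synthetic task: (color_grid, mapping, target_matrix)."""
    if not 1 <= num_colors <= len(PALETTE):
        raise ValueError("num_colors must be between 1 and len(PALETTE)")
    rng = random.Random(seed)
    colors = PALETTE[:num_colors]
    # Shuffle the numbers so the mapping has to be read, not guessed.
    numbers = list(range(num_colors))
    rng.shuffle(numbers)
    mapping = dict(zip(colors, numbers))
    grid = [[rng.choice(colors) for _ in range(cols)] for _ in range(rows)]
    # Ground truth: the matrix a model should output for this grid.
    target = [[mapping[color] for color in row] for row in grid]
    return grid, mapping, target


def cell_accuracy(pred, target):
    """Fraction of target cells reproduced correctly.

    Cells missing from a malformed prediction count as errors.
    """
    total = sum(len(row) for row in target)
    correct = 0
    for i, row in enumerate(target):
        for j, value in enumerate(row):
            if i < len(pred) and j < len(pred[i]) and pred[i][j] == value:
                correct += 1
    return correct / total


def exact_match(pred, target):
    """True only if the whole matrix is transcribed without a single error."""
    return pred == target
```

Sweeping `rows`, `cols`, and `num_colors` while tracking both metrics would show where transcription starts to break down. Exact match is the stricter of the two, since one wrong cell fails the whole grid.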

In practice

Use G2M to diagnose VLM spatial fidelity.
Analyze error heatmaps for architectural biases.
Consider patch-grid alignment in VLM design.
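
The two analysis steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's analysis code: `error_heatmap` and `straddles_patch_boundary` are hypothetical helpers, and the alignment check assumes a square image split into equal cells and a vision encoder with fixed-size square patches.

```python
def error_heatmap(preds, targets):
    """Per-cell error rate over samples that share one target shape."""
    rows, cols = len(targets[0]), len(targets[0][0])
    counts = [[0] * cols for _ in range(rows)]
    for pred, target in zip(preds, targets):
        for i in range(rows):
            for j in range(cols):
                ok = (
                    i < len(pred)
                    and j < len(pred[i])
                    and pred[i][j] == target[i][j]
                )
                if not ok:
                    counts[i][j] += 1
    n = len(targets)
    return [[count / n for count in row] for row in counts]


def straddles_patch_boundary(image_px, grid_n, patch_px):
    """For each cell index along one axis, report whether a patch boundary
    falls strictly inside that cell.

    Cell k spans [k * image_px / grid_n, (k + 1) * image_px / grid_n).
    Everything is scaled by grid_n so the comparison stays in integers.
    """
    step = patch_px * grid_n  # patch boundaries in the scaled coordinates
    flags = []
    for k in range(grid_n):
        lo = k * image_px
        hi = (k + 1) * image_px
        first_boundary = (lo // step + 1) * step  # first boundary above lo
        flags.append(first_boundary < hi)
    return flags
```

For example, `straddles_patch_boundary(224, 16, 14)` returns all `False`, because each 14-pixel cell coincides with one 14-pixel patch. Comparing such flags against the heatmap is one way to check whether errors concentrate in cells that cross patch boundaries.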

Topics

Grid2Matrix Benchmark
Vision-Language Models
Digital Agnosia
Vision Encoders
Dense Spatial Perception

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.