Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

2025-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Researchers at Microsoft Research India and IIT Hyderabad introduce Mind's Eye, a multiple-choice benchmark for evaluating the visuo-cognitive and visuospatial reasoning of Multimodal Large Language Models (MLLMs). Inspired by classic human intelligence tests, the benchmark comprises eight tasks organized under a novel Abstraction-Relation-Transformation (ART) taxonomy, probing fluid-intelligence processes such as pattern induction, analogical mapping, and mental transformation. Evaluations of 18 diverse MLLMs, including GPT-4o, Gemini-2.5 Pro, LLaVA-1.6-7B, and Qwen2.5-VL-32B, reveal a large gap to human participants: humans achieve 80% accuracy, while the top MLLMs remain below 50%. Error analysis indicates that MLLMs struggle with visual attention allocation, internal perceptual manipulation, and abstraction of underlying visual concepts. Their performance curves also stay flat across difficulty levels, a pattern not seen in humans.

Key takeaway

For Computer Vision Engineers and Research Scientists developing or evaluating MLLMs, this research indicates that current models fall well short of human-like visuospatial reasoning. Prioritize architectural innovations that enable genuine internal perceptual manipulation and cognitive simulation, rather than relying solely on scaling model parameters or on prompt engineering. Consider integrating Mind's Eye or similar cognitively grounded benchmarks into your evaluation pipeline to diagnose specific reasoning deficits and guide model improvements.

Key insights

Current MLLMs lack foundational visuospatial reasoning, performing significantly below humans on cognitive benchmarks.

Principles

Visuospatial reasoning requires internal simulation, not just surface pattern matching.
Prompting effects are task-dependent, not universally beneficial for MLLMs.
Model performance does not scale monotonically with parameter size for cognitive tasks.

Method

Mind's Eye uses programmatically generated SVG stimuli with diagnostic distractors, organized by an Abstraction-Relation-Transformation (ART) taxonomy, to isolate and evaluate visuo-cognitive processes in MLLMs.
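
To make the idea of programmatically generated SVG stimuli with diagnostic distractors concrete, here is a minimal sketch of a single mental-rotation item. It is not the authors' generator: the shapes, the distractor labels (mirror_then_rotate, mirror_only, wrong_shape), and the function names are all hypothetical, chosen only to show how each wrong option can encode one specific reasoning error.

```python
import math
import random

# Hypothetical base shape: an L with unequal arms, so its mirror image
# cannot be reproduced by any in-plane rotation.
BASE_SHAPE = [(0, 0), (40, 0), (40, 15), (15, 15), (15, 60), (0, 60)]
# Same outline with the long arm shortened; used for a "wrong shape" distractor.
DISTORTED_SHAPE = [(0, 0), (40, 0), (40, 15), (15, 15), (15, 35), (0, 35)]


def rotate(points, degrees):
    """Rotate points about the origin by the given angle in degrees."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def mirror(points):
    """Reflect points across the vertical axis."""
    return [(-x, y) for x, y in points]


def to_svg(points, size=160):
    """Render a polygon as an SVG string, centred on the canvas."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    half = size / 2
    coords = " ".join(f"{x - cx + half:.1f},{y - cy + half:.1f}" for x, y in points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<polygon points="{coords}" fill="none" stroke="black" stroke-width="2"/>'
        "</svg>"
    )


def make_rotation_item(seed, angles=(45, 90, 135, 180, 225, 270)):
    """Build one multiple-choice item; every distractor carries a diagnostic label."""
    rng = random.Random(seed)  # seeded so the item is reproducible
    angle = rng.choice(angles)
    options = [
        ("correct", rotate(BASE_SHAPE, angle)),
        ("mirror_then_rotate", rotate(mirror(BASE_SHAPE), angle)),  # handedness error
        ("mirror_only", mirror(BASE_SHAPE)),                        # no rotation applied
        ("wrong_shape", rotate(DISTORTED_SHAPE, angle)),            # shape not preserved
    ]
    rng.shuffle(options)
    labels = [label for label, _ in options]
    return {
        "angle": angle,
        "reference_svg": to_svg(BASE_SHAPE),
        "option_svgs": [to_svg(points) for _, points in options],
        "option_labels": labels,
        "answer_index": labels.index("correct"),
    }
```

Because the angle and option order come from a seeded generator, the same seed always yields the same item, and the stored labels let a later analysis attribute each wrong answer to a particular kind of mistake.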

In practice

Focus MLLM development on true perceptual transformation and cognitive simulation.
Design benchmarks with diagnostic distractors to reveal specific reasoning errors.
Use synthetic, parametrically controlled stimuli to isolate cognitive abilities.
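
The practices above pay off at analysis time. The following sketch assumes each model response has been logged with the item's difficulty level and the diagnostic label of the chosen option; the record layout, the label names, and the toy responses are hypothetical placeholders, not data from the paper. It computes accuracy per difficulty level (to check whether the curve is flat) and counts which distractor types attract errors.

```python
from collections import Counter, defaultdict


def summarize(responses):
    """Return (accuracy per difficulty level, error counts per distractor label).

    Each response is a dict with keys 'difficulty' and 'chosen_label';
    a chosen_label of 'correct' marks a right answer.
    """
    per_level = defaultdict(lambda: [0, 0])  # difficulty -> [n_correct, n_total]
    errors = Counter()
    for r in responses:
        stats = per_level[r["difficulty"]]
        stats[1] += 1
        if r["chosen_label"] == "correct":
            stats[0] += 1
        else:
            errors[r["chosen_label"]] += 1
    accuracy = {level: c / n for level, (c, n) in sorted(per_level.items())}
    return accuracy, errors


# Made-up responses, for illustration only.
toy_responses = [
    {"difficulty": 1, "chosen_label": "correct"},
    {"difficulty": 1, "chosen_label": "mirror_only"},
    {"difficulty": 2, "chosen_label": "correct"},
    {"difficulty": 2, "chosen_label": "mirror_then_rotate"},
    {"difficulty": 2, "chosen_label": "mirror_then_rotate"},
    {"difficulty": 2, "chosen_label": "correct"},
]

accuracy, errors = summarize(toy_responses)
print(accuracy)
print(errors.most_common())
```

An accuracy dictionary that barely changes between levels would correspond to the flat difficulty curve described in the summary, while a dominant distractor label points to the specific operation the model is failing to carry out.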

Topics

Multimodal Large Language Models
Mind's Eye Benchmark
Visuospatial Reasoning
ART Taxonomy
Cognitive Evaluation

Code references

microsoft/Mind-s-Eye

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.