Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

Researchers at Microsoft Research India and IIT Hyderabad introduce Mind's Eye, a multiple-choice benchmark for evaluating the visuo-cognitive and visuospatial reasoning of Multimodal Large Language Models (MLLMs). Inspired by classic human intelligence tests, the benchmark comprises eight tasks organized under a novel Abstraction-Relation-Transformation (ART) taxonomy that probes fluid-intelligence processes such as pattern induction, analogical mapping, and mental transformation. Evaluations of 18 MLLMs, including GPT-4o, Gemini-2.5 Pro, LLaVA-1.6-7B, and Qwen2.5-VL-32B, show a large gap to human participants: humans achieve 80% accuracy, while the top MLLMs remain below 50%. Error analysis indicates that MLLMs struggle with allocating visual attention, manipulating percepts internally, and abstracting the underlying visual concepts. Their performance curves also stay flat across difficulty levels, a pattern not seen in humans.

Key takeaway

For Computer Vision Engineers and Research Scientists developing or evaluating MLLMs, the results indicate that current models fall well short of human-like visuospatial reasoning. Prioritize architectural innovations that enable genuine internal perceptual manipulation and cognitive simulation, rather than relying solely on scaling model parameters or on prompt engineering. Consider integrating Mind's Eye or a similar cognitively grounded benchmark into your evaluation pipeline to diagnose specific reasoning deficits and to guide model improvements beyond superficial performance gains.
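
As a rough illustration of what such a diagnostic step might look like, the Python sketch below aggregates multiple-choice results by task and by difficulty level and reports how much accuracy varies across difficulty. The record fields (`task`, `difficulty`, `answer`, `predicted`), the task names, and the helper functions are all hypothetical and not part of the Mind's Eye release; the toy records are made up purely to exercise the code.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping records by the given field."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += int(rec["predicted"] == rec["answer"])
    return {group: hits[group] / totals[group] for group in totals}


def difficulty_spread(records):
    """Gap between the best and worst per-difficulty accuracy.

    A value near zero means accuracy barely changes with difficulty,
    i.e. a flat performance curve.
    """
    acc = accuracy_by_group(records, "difficulty")
    return max(acc.values()) - min(acc.values())


# Toy, made-up records (hypothetical schema, not real benchmark output).
records = [
    {"task": "rotation", "difficulty": "easy", "answer": "A", "predicted": "A"},
    {"task": "rotation", "difficulty": "easy", "answer": "B", "predicted": "C"},
    {"task": "rotation", "difficulty": "hard", "answer": "D", "predicted": "D"},
    {"task": "analogy", "difficulty": "hard", "answer": "A", "predicted": "B"},
]

print(accuracy_by_group(records, "task"))
print(accuracy_by_group(records, "difficulty"))
print(difficulty_spread(records))
```

Grouping by task points to which reasoning process fails, while the spread across difficulty levels gives a quick check for the flat curves described in the summary.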

Key insights

Current MLLMs lack foundational visuospatial reasoning: on Mind's Eye, human participants reach 80% accuracy while the best-performing models stay below 50%.

Principles

Method

Mind's Eye uses programmatically generated SVG stimuli with diagnostic distractors, organized by an Abstraction-Relation-Transformation (ART) taxonomy, to isolate and evaluate visuo-cognitive processes in MLLMs.
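
To make the idea of programmatic SVG stimuli with diagnostic distractors concrete, here is a minimal Python sketch of one multiple-choice rotation item. It is not the benchmark's generator: the base shape, the function names, and the choice of distractors (a wrong rotation angle, a mirrored shape at the correct angle, and an unrotated mirrored shape) are assumptions made for illustration only.

```python
import math
import random

# Hypothetical asymmetric "L" outline; not taken from the benchmark.
BASE_SHAPE = [(-20, -30), (0, -30), (0, 10), (20, 10), (20, 30), (-20, 30)]


def rotate(points, degrees):
    """Rotate points about the origin.

    SVG's y-axis points down, so a positive angle appears clockwise on screen.
    """
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in points]


def mirror(points):
    """Reflect points across the vertical axis."""
    return [(-x, y) for x, y in points]


def to_svg(points, size=100):
    """Render a polygon centred on a size x size SVG canvas."""
    half = size / 2
    coords = " ".join(f"{x + half:.1f},{y + half:.1f}" for x, y in points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">'
        f'<polygon points="{coords}" fill="none" stroke="black"/></svg>'
    )


def make_rotation_item(seed):
    """Build one four-option item; each distractor targets a distinct error."""
    rng = random.Random(seed)
    angle = rng.choice([90, 180, 270])
    wrong_angle = rng.choice([a for a in (90, 180, 270) if a != angle])
    correct = to_svg(rotate(BASE_SHAPE, angle))
    options = [
        correct,
        to_svg(rotate(BASE_SHAPE, wrong_angle)),    # right shape, wrong angle
        to_svg(rotate(mirror(BASE_SHAPE), angle)),  # reflection confused with rotation
        to_svg(mirror(BASE_SHAPE)),                 # mirrored and not rotated
    ]
    rng.shuffle(options)
    return {
        "question": f"Which option shows the reference rotated {angle} degrees clockwise?",
        "reference": to_svg(BASE_SHAPE),
        "options": options,
        "answer": options.index(correct),
        "angle": angle,
    }


item = make_rotation_item(seed=0)
print(item["question"], "-> option index", item["answer"])
```

Because each distractor corresponds to one specific mistake, the option a model picks hints at which step of the mental transformation failed, and seeding the generator keeps every item reproducible.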

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.