Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs
Summary
Researchers at ShanghaiTech University introduce KidGym, a 2D grid-based benchmark for evaluating Multimodal Large Language Models (MLLMs) on five core cognitive capabilities: Execution, Perception Reasoning, Learning, Memory, and Planning. Inspired by the Wechsler Intelligence Scales for children, KidGym comprises 12 tasks, each with three difficulty levels (L1, L2, L3), diverse semantic scenes such as supermarkets and farms, and randomized layouts to prevent memorization. The benchmark also provides a "backpack" and a "hint bar" to address MLLMs' contextual consistency issues, and it uses high-level actions so that evaluation focuses on meaningful outcomes. Initial evaluations cover nine state-of-the-art MLLMs: six closed-source models (o3, GPT-5, GPT-4o, Gemini-2.5-Pro, Gemini-2.5-Flash, and Claude-3.7-Sonnet) and three open-source models (DeepseekVL-2, QwenVL-2.5, and InternVL-3). Closed-source models generally outperform open-source ones and excel in learning tasks, but all MLLMs struggle with abstract visual reasoning, item quantity identification, and composite tasks requiring multiple abilities.
Key takeaway
Research scientists developing MLLMs should prioritize three capabilities: handling non-semantic visual information, accurately identifying item quantities, and performing composite tasks that integrate multiple cognitive abilities. Even top-tier closed-source models show significant deficiencies in these areas, which indicates a need for architectural or training advances to close the gap with human performance in complex, dynamic environments.
Key insights
KidGym evaluates MLLM cognitive abilities using a 2D grid-based benchmark inspired by child intelligence tests.
Principles
- MLLM evaluation benefits from human cognitive testing frameworks.
- Dynamic, interactive tasks are crucial for assessing MLLM adaptability.
- Composite tasks reveal MLLM limitations in integrating multiple abilities.
Method
KidGym uses 12 grid-based tasks with randomized layouts and three difficulty levels to assess MLLMs on Execution, Perception Reasoning, Learning, Memory, and Planning, mirroring child cognitive growth stages.
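To make the design concrete, the following is a minimal sketch of a single grid task with a seeded random layout, a backpack, a hint, and one high-level action. It is not KidGym's actual API: the class `ToyGridTask`, its fields, the item names, and the `pick_up` action are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyGridTask:
    """Hypothetical toy task for illustration; NOT KidGym's real interface."""
    size: int = 4
    items: tuple = ("apple", "milk", "bread")    # assumed item names
    target: str = "apple"
    seed: int = 0
    layout: dict = field(default_factory=dict)    # item -> (row, col)
    backpack: list = field(default_factory=list)  # items collected so far
    agent_pos: tuple = (0, 0)

    def reset(self):
        # A seeded RNG yields a reproducible but seed-dependent layout,
        # so an agent cannot rely on memorized item positions.
        rng = random.Random(self.seed)
        cells = [(r, c) for r in range(self.size) for c in range(self.size)]
        cells.remove((0, 0))  # keep the agent's start cell free
        self.layout = dict(zip(self.items, rng.sample(cells, len(self.items))))
        self.backpack = []
        self.agent_pos = (0, 0)
        return self.observe()

    def observe(self):
        # The backpack and hint are returned with every observation, so the
        # model does not have to recall them from earlier turns.
        return {
            "layout": dict(self.layout),
            "backpack": list(self.backpack),
            "hint": f"collect the {self.target}",
        }

    def step(self, action, item):
        # High-level action: "pick_up" walks to the item and collects it
        # in a single step instead of requiring cell-by-cell moves.
        if action != "pick_up" or item not in self.layout:
            return self.observe(), False
        self.agent_pos = self.layout.pop(item)
        self.backpack.append(item)
        return self.observe(), item == self.target


task = ToyGridTask(seed=1)
obs = task.reset()
obs, done = task.step("pick_up", "apple")
print(obs["backpack"], done)
```

Changing `seed` reshuffles item positions without altering the goal, which is one simple way to keep a fixed task from being solved by memorization.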
In practice
- Use KidGym to benchmark MLLMs on dynamic, interactive tasks.
- Focus MLLM development on abstract visual reasoning and quantity perception.
- Design MLLMs to handle composite tasks requiring multiple integrated abilities.
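When benchmarking along these lines, one plausible way to report results is a success rate per capability and difficulty level. The sketch below assumes a hypothetical episode record format (`task`, `level`, `capabilities`, `success`) and assumes that a composite task counts toward every capability it involves. The records are dummy placeholders, not results from the paper.

```python
from collections import defaultdict


def success_rates(episodes):
    """Return {(capability, level): success rate} from episode records.

    The record format is an assumption for this sketch. A composite task
    lists several capabilities and is counted once under each of them.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, attempts]
    for ep in episodes:
        for capability in ep["capabilities"]:
            key = (capability, ep["level"])
            totals[key][0] += int(ep["success"])
            totals[key][1] += 1
    return {key: wins / n for key, (wins, n) in totals.items()}


# Dummy placeholder records, NOT measured results.
episodes = [
    {"task": "task_a", "level": "L1", "capabilities": ["Memory"], "success": True},
    {"task": "task_a", "level": "L1", "capabilities": ["Memory"], "success": False},
    {"task": "task_b", "level": "L1", "capabilities": ["Memory", "Planning"], "success": False},
]

for key, rate in sorted(success_rates(episodes).items()):
    print(key, round(rate, 2))
```

Breaking scores out this way makes it easier to see whether a model fails on a single ability or only when several must be combined.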
Topics
- MLLM Benchmarking
- Cognitive AI
- Visual Reasoning
- AI Planning
- Multimodal Large Language Models
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.