Children's Intelligence Tests Pose Challenges for MLLMs? KidGym: A 2D Grid-Based Reasoning Benchmark for MLLMs

2026-03-24 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

Researchers at ShanghaiTech University introduce KidGym, a 2D grid-based benchmark that evaluates Multimodal Large Language Models (MLLMs) on five core cognitive capabilities: Execution, Perception Reasoning, Learning, Memory, and Planning. Inspired by the Wechsler Intelligence Scales for children, KidGym comprises 12 tasks, each offered at three difficulty levels (L1, L2, L3) and set in diverse semantic scenes such as supermarkets and farms, with randomized layouts to prevent memorization. A "backpack" and a "hint bar" address MLLMs' contextual consistency issues, and high-level actions keep the focus on meaningful outcomes. Initial evaluations cover nine state-of-the-art MLLMs: six closed-source models (o3, GPT-5, GPT-4o, Gemini-2.5-Pro, Gemini-2.5-Flash, Claude-3.7-Sonnet) and three open-source models (DeepseekVL-2, QwenVL-2.5, InternVL-3). Closed-source models generally outperform open-source ones and excel in learning tasks, but all MLLMs struggle with abstract visual reasoning, item quantity identification, and composite tasks requiring multiple abilities.

Key takeaway

Research scientists developing MLLMs should prioritize three capabilities: handling non-semantic visual information, accurately identifying item quantities, and performing composite tasks that demand the integration of multiple cognitive abilities. Current models, even top-tier closed-source ones, show significant deficiencies in these areas, which points to a need for architectural or training advances to close the gap with human performance in complex, dynamic environments.

Key insights

KidGym evaluates MLLM cognitive abilities using a 2D grid-based benchmark inspired by child intelligence tests.

Principles

MLLM evaluation benefits from human cognitive testing frameworks.
Dynamic, interactive tasks are crucial for assessing MLLM adaptability.
Composite tasks reveal MLLM limitations in integrating multiple abilities.

Method

KidGym uses 12 grid-based tasks with randomized layouts and three difficulty levels to assess MLLMs on Execution, Perception Reasoning, Learning, Memory, and Planning, mirroring child cognitive growth stages.
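
The randomized-layout idea can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not KidGym's actual code or API: the names Episode, make_episode, and pick_up, the grid size, the item counts per level, and the reading of the "backpack" as a list of collected items are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]

# Assumed scaling of item count with difficulty; not taken from the paper.
ITEMS_PER_LEVEL = {"L1": 2, "L2": 4, "L3": 6}

# Placeholder item names for a supermarket-style scene.
SUPERMARKET = ["apple", "milk", "bread", "egg", "cheese", "juice"]


@dataclass
class Episode:
    level: str
    grid_size: int
    agent: Cell
    items: Dict[Cell, str]
    # Hypothetical "backpack": items the agent has collected so far.
    backpack: List[str] = field(default_factory=list)


def make_episode(level: str, scene_items: List[str],
                 grid_size: int = 6, seed: Optional[int] = None) -> Episode:
    """Place the agent and items on distinct, randomly chosen cells."""
    rng = random.Random(seed)
    n_items = ITEMS_PER_LEVEL[level]
    cells = [(r, c) for r in range(grid_size) for c in range(grid_size)]
    chosen = rng.sample(cells, n_items + 1)   # sampled without replacement
    agent, item_cells = chosen[0], chosen[1:]
    names = rng.sample(scene_items, n_items)  # distinct item names
    return Episode(level, grid_size, agent, dict(zip(item_cells, names)))


def pick_up(episode: Episode, name: str) -> bool:
    """High-level action: collect a named item without low-level movement."""
    for cell, item in list(episode.items.items()):
        if item == name:
            del episode.items[cell]
            episode.backpack.append(name)
            return True
    return False


episode = make_episode("L2", SUPERMARKET, seed=7)
first_item = next(iter(episode.items.values()))
pick_up(episode, first_item)
```

Drawing a fresh layout per seed means a model cannot succeed by memorizing coordinates, while a fixed seed still allows an episode to be replayed for debugging.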

In practice

Use KidGym to benchmark MLLMs on dynamic, interactive tasks.
Focus MLLM development on abstract visual reasoning and quantity perception.
Design MLLMs to handle composite tasks requiring multiple integrated abilities.
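
When benchmarking, one way to surface composite-task weaknesses is to report success rates per capability and difficulty level, counting a composite task toward every capability it exercises. The sketch below is a hedged illustration only: the task names, the task-to-capability map, and the record format are hypothetical placeholders, not KidGym's task list or output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical task-to-capability map; the task names are placeholders.
TASK_CAPABILITIES = {
    "task_a": ["Execution"],
    "task_b": ["Memory"],
    "task_c": ["Memory", "Planning"],  # a composite task
}

Record = Tuple[str, str, bool]  # (task, level, succeeded)


def success_by_capability(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Return the success rate for each (capability, level) pair."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for task, level, succeeded in records:
        for capability in TASK_CAPABILITIES[task]:
            key = (capability, level)
            totals[key] += 1
            wins[key] += int(succeeded)
    return {key: wins[key] / totals[key] for key in totals}


# Made-up records for illustration only; these are not benchmark results.
example_records = [
    ("task_a", "L1", True),
    ("task_b", "L1", True),
    ("task_c", "L1", False),
]
rates = success_by_capability(example_records)
```

Because a composite task counts toward several capabilities, a failure on it lowers each of them, which makes integration failures visible even when the single-capability tasks succeed.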

Topics

MLLM Benchmarking
Cognitive AI
Visual Reasoning
AI Planning
Multimodal Large Language Models

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.