ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
Summary
ESI-BENCH is a benchmark for evaluating embodied spatial intelligence in agents. Unlike prior formulations, it does not assume oracle observations. Built on OmniGibson and grounded in Spelke's core knowledge systems, it spans 10 task categories and 29 subcategories, and it requires agents to actively deploy perception, locomotion, and manipulation to gather task-relevant evidence. Experiments with state-of-the-art Multimodal Large Language Models (MLLMs) show that active exploration significantly outperforms passive methods and that agents develop emergent spatial strategies. A key finding is that most failures stem from "action blindness" rather than weak perception: poor action choices yield insufficient observations. Explicit 3D grounding can stabilize reasoning on depth-sensitive tasks, but imperfect 3D representations can distort spatial relations and hurt performance. Human studies reveal a metacognitive gap: models commit prematurely with high confidence, whereas humans seek falsifying viewpoints.
Key takeaway
For research scientists developing embodied AI, this work suggests that improving an agent's active exploration and action selection matters more than improving passive perception alone. Focus on designing systems that dynamically acquire task-relevant evidence and revise their beliefs, rather than relying on static observations, to address the "action blindness" and metacognitive gaps identified in current MLLMs.
Key insights
Embodied spatial intelligence requires an active perception-action loop; current models fall short of it through "action blindness" and metacognitive gaps.
Principles
- Active exploration outperforms passive sensing.
- Action choices dictate observation quality.
- Imperfect 3D models can harm spatial reasoning.
Method
ESI-BENCH evaluates embodied spatial intelligence by requiring agents to actively sequence perception, locomotion, and manipulation to accumulate task-relevant evidence across 10 categories and 29 subcategories.
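As a rough illustration of why sequencing actions to accumulate evidence matters, the toy sketch below contrasts a passive agent that answers from a single view with an active agent that moves between viewpoints before answering a counting question. Everything in it is a hypothetical stand-in: the scene, the viewpoint names, and the `Agent` class are invented for this example and are not the ESI-BENCH or OmniGibson API. The active agent here simply visits every viewpoint, which is a simplification of real action selection.

```python
from dataclasses import dataclass, field

# Toy scene (hypothetical): each viewpoint reveals only some object IDs,
# because the rest are occluded from that position.
VISIBLE_FROM = {
    "door": {"cube_1"},
    "table_left": {"cube_1", "cube_2"},
    "table_right": {"cube_2", "cube_3"},
}


@dataclass
class Agent:
    seen: set = field(default_factory=set)     # evidence accumulated so far
    trace: list = field(default_factory=list)  # viewpoints visited, in order

    def observe(self, viewpoint: str) -> int:
        """Perceive from a viewpoint; return how many new objects it revealed."""
        new = VISIBLE_FROM[viewpoint] - self.seen
        self.seen |= new
        self.trace.append(viewpoint)
        return len(new)


def passive_answer(start: str = "door") -> int:
    """Answer 'how many cubes?' from the starting view only."""
    agent = Agent()
    agent.observe(start)
    return len(agent.seen)


def active_answer(start: str = "door") -> int:
    """Move to each unvisited viewpoint and perceive before answering."""
    agent = Agent()
    agent.observe(start)
    for viewpoint in VISIBLE_FROM:
        if viewpoint not in agent.trace:
            agent.observe(viewpoint)  # locomotion followed by perception
    return len(agent.seen)


if __name__ == "__main__":
    print("passive count:", passive_answer())
    print("active count:", active_answer())
```

In this toy scene the passive agent undercounts because two cubes are never in view from the door, while the active agent recovers all three. A more faithful sketch would replace the exhaustive loop with a policy that chooses the next viewpoint and decides when the evidence is sufficient to commit to an answer.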
In practice
- Prioritize active exploration in embodied AI design.
- Focus on improving action choices for better observations.
- Validate 3D representations rigorously before deployment.
Topics
- Embodied Spatial Intelligence
- ESI-BENCH
- Perception-Action Loop
- Multimodal Large Language Models
- Action Blindness
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.