ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, medium

Summary

ESI-Bench is a comprehensive benchmark for evaluating embodied spatial intelligence in agents. It departs from prior formulations, which assumed oracle observations. Built on OmniGibson and grounded in Spelke's core knowledge systems, it spans 10 task categories and 29 subcategories, and it requires agents to actively deploy perception, locomotion, and manipulation to gather task-relevant evidence. In experiments with state-of-the-art multimodal large language models (MLLMs), active exploration significantly outperforms passive methods, and agents develop emergent spatial strategies. A key finding is that most failures stem from "action blindness" rather than weak perception: poor action choices lead to insufficient observations. Explicit 3D grounding can stabilize reasoning on depth-sensitive tasks, but imperfect 3D representations can be detrimental because they distort spatial relations. Human studies reveal a metacognitive gap: models commit prematurely with high confidence, whereas humans seek falsifying viewpoints.

Key takeaway

For research scientists developing embodied AI, this work highlights that improving an agent's active exploration and action-selection capabilities is more critical than solely enhancing passive perception. You should focus on designing systems that can dynamically acquire task-relevant evidence and revise beliefs, rather than relying on static observations, to overcome the identified "action blindness" and metacognitive gaps in current MLLMs.
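To make "revise beliefs" and "commit prematurely" concrete, here is a minimal sketch of a binary belief update with a commit threshold. It is not the paper's method. The functions `update` and `decide`, the likelihood pairs, and the thresholds are all hypothetical, and the only thing taken as given is Bayes' rule. Each observation is a pair (probability of seeing it if the hypothesis is true, probability if it is false).

```python
def update(prior, p_obs_if_true, p_obs_if_false):
    """Posterior P(H | observation) by Bayes' rule for a binary hypothesis H."""
    numerator = prior * p_obs_if_true
    return numerator / (numerator + (1.0 - prior) * p_obs_if_false)


def decide(observations, prior=0.5, threshold=0.95):
    """Fold observations into a belief and commit once it is extreme enough.

    Returns (decision, belief, n_used). decision is None when the belief
    never crosses the threshold in either direction.
    """
    belief = prior
    for n, (p_true, p_false) in enumerate(observations, start=1):
        belief = update(belief, p_true, p_false)
        if belief >= threshold:
            return True, belief, n
        if belief <= 1.0 - threshold:
            return False, belief, n
    return None, belief, len(observations)


# Illustrative numbers only: two weakly confirming views, then one
# strongly disconfirming ("falsifying") view.
views = [(0.8, 0.4), (0.8, 0.4), (0.1, 0.9)]

hasty = decide(views, threshold=0.75)    # commits after two views
careful = decide(views, threshold=0.95)  # keeps looking, sees the third view
```

With the lower threshold the agent commits to "true" after two views and never sees the third. With the higher threshold it reaches the disconfirming view, its belief falls below 0.5, and it withholds a decision. This is only a toy analogue of the premature, high-confidence commitment described above.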

Key insights

Embodied spatial intelligence requires an active perception-action loop; current models exhibit "action blindness" and a metacognitive gap.

Method

ESI-Bench evaluates embodied spatial intelligence by requiring agents to actively sequence perception, locomotion, and manipulation to accumulate task-relevant evidence across 10 task categories and 29 subcategories.
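The contrast between a passive observer and an agent that acts to gather evidence can be sketched with a toy world. Everything below is hypothetical: `VIEWS`, `MOVES`, `Agent`, and the "mug left of box" question are invented for illustration and are not the ESI-Bench or OmniGibson API. The exploration rule (prefer unvisited neighbouring poses) is an assumed stand-in for whatever policy a real agent would use.

```python
from dataclasses import dataclass, field

# Hypothetical toy world: each pose reveals the x-position of some objects.
VIEWS = {
    "start": {"mug": 2},            # the box is occluded from the start pose
    "left": {"mug": 2},
    "right": {"mug": 2, "box": 5},  # moving right reveals the box
}
# Poses reachable from each pose.
MOVES = {"start": ["left", "right"], "left": ["start"], "right": ["start"]}


@dataclass
class Agent:
    pose: str = "start"
    evidence: dict = field(default_factory=dict)
    steps: int = 0

    def perceive(self):
        # Accumulate whatever is visible from the current pose.
        self.evidence.update(VIEWS[self.pose])

    def move(self, target):
        if target not in MOVES[self.pose]:
            raise ValueError(f"cannot move from {self.pose} to {target}")
        self.pose = target
        self.steps += 1


def answer(evidence):
    """Is the mug left of the box? None means the evidence is insufficient."""
    if "mug" in evidence and "box" in evidence:
        return evidence["mug"] < evidence["box"]
    return None


def passive_episode():
    # One observation from the initial pose, no actions.
    agent = Agent()
    agent.perceive()
    return answer(agent.evidence)


def active_episode(max_steps=6):
    # Act until the question is answerable or the step budget runs out.
    agent = Agent()
    agent.perceive()
    visited = {agent.pose}
    while answer(agent.evidence) is None and agent.steps < max_steps:
        options = MOVES[agent.pose]
        unvisited = [p for p in options if p not in visited]
        agent.move(unvisited[0] if unvisited else options[0])
        visited.add(agent.pose)
        agent.perceive()
    return answer(agent.evidence), agent.steps
```

In this toy setup `passive_episode()` returns `None` because the box is never seen, while `active_episode()` returns `(True, 3)` after trying the left pose, returning, and then moving right. The wasted first move is a small illustration of how action choice, not perception, determines which observations the agent ends up with.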

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.