ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Computer Vision & Pattern Recognition · Depth: Expert, quick

Summary

ESI-Bench is a benchmark for evaluating embodied spatial intelligence in AI agents, moving beyond passive observation to active perception-action loops. Built on OmniGibson and grounded in Spelke's core knowledge systems, it comprises 10 task categories and 29 subcategories that require agents to deploy perception, locomotion, and manipulation to gather task-relevant evidence. Experiments with state-of-the-art multimodal large language models (MLLMs) show that active exploration significantly outperforms passive methods, and that agents develop emergent spatial strategies without explicit instruction. A key finding is that most failures arise from "action blindness" rather than weak perception: poor action choices yield insufficient observations, which in turn lead to errors. Explicit 3D grounding can stabilize reasoning on depth-sensitive tasks, but imperfect 3D representations can distort spatial relations and perform worse than 2D baselines. Human studies also reveal a metacognitive gap: models commit to beliefs prematurely regardless of evidence quality, whereas humans actively seek falsifying viewpoints.

Key takeaway

If you are a research scientist developing embodied AI agents, prioritize systems that actively engage in perception-action loops rather than relying solely on passive sensing. Extend your focus beyond perceptual capability to the agent's ability to choose effective actions, since "action blindness" is a primary cause of failure. Weigh the trade-offs of 3D grounding, because imperfect representations can be detrimental, and explore ways to close the metacognitive gap so that models revise their beliefs more robustly.

Key insights

Embodied spatial intelligence requires closed perception-action loops, in which agents seek out information rather than passively processing it.

Method

ESI-Bench evaluates embodied spatial intelligence across 10 task categories and 29 subcategories, requiring agents to sequence perception, locomotion, and manipulation actions to accumulate task-relevant evidence.
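
To make the contrast between passive and active evaluation concrete, here is a minimal toy sketch in Python. It is not the benchmark's protocol and does not use the OmniGibson API: `ToyScene`, `passive_answer`, and `active_answer` are hypothetical names, and the scene is reduced to a ring of viewpoints from which a target is either visible or occluded. The passive agent answers from its starting view, while the active agent spends a movement budget to find a viewpoint that actually reveals the answer.

```python
from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical scene: viewpoints on a ring, target visible from only some."""
    n_views: int
    visible_from: frozenset  # viewpoint indices where the target is not occluded
    target_left_of_anchor: bool  # ground-truth answer to the spatial question

    def observe(self, view: int):
        """Return the spatial relation if the target is visible, else None."""
        return self.target_left_of_anchor if view in self.visible_from else None


def passive_answer(scene: ToyScene, start_view: int, default: bool = False) -> bool:
    """Answer from a single fixed view; guess `default` if the target is occluded."""
    obs = scene.observe(start_view)
    return default if obs is None else obs


def active_answer(scene: ToyScene, start_view: int, budget: int, default: bool = False):
    """Move between viewpoints until evidence is found or the budget runs out.

    Returns (answer, number_of_moves_used).
    """
    view = start_view
    obs = scene.observe(view)
    moves = 0
    while obs is None and moves < budget:
        view = (view + 1) % scene.n_views  # action: step to the next viewpoint
        moves += 1
        obs = scene.observe(view)  # perception after acting
    return (default if obs is None else obs), moves


scene = ToyScene(n_views=8, visible_from=frozenset({3, 4}), target_left_of_anchor=True)
print(passive_answer(scene, start_view=0))           # occluded view, falls back to the default guess
print(active_answer(scene, start_view=0, budget=5))  # finds an informative view within budget
print(active_answer(scene, start_view=0, budget=2))  # budget too small, still guessing
```

In this toy setting the perception step is perfect, so every wrong answer comes from not having moved to an informative viewpoint, which loosely mirrors the "action blindness" failure mode described in the summary.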

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.