Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning; Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

Ego2World is an executable benchmark that compiles egocentric cooking videos from the HD-EPIC dataset into interactive symbolic worlds for evaluating embodied agents. Unlike passive video datasets or synthetic simulators, it targets planning under partial observation by separating a hidden world graph ($G^{\mathrm{w}}_{t}$), maintained by the simulator, from the agent's belief graph ($G^{\mathrm{b}}_{t}$). The benchmark compiles 101 videos, 9,130 action groups, and 426 goal-task instances into graph-transition rules, so agents can act, receive feedback, and replan without knowing the full world state. Experiments show that action-overlap scores often overestimate physical-state success, and that persistent belief memory significantly improves task completion while reducing visual exploration. Grounding the compilation in annotations matters: direct LLM graph synthesis, by comparison, shows a 48% hallucination rate, underscoring the need for annotation-grounded construction.
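
To make the gap between action-overlap scores and physical-state success concrete, here is a minimal Python sketch. The metric definitions, the `(subject, relation, object)` edge encoding, and the names `action_overlap` and `state_success` are illustrative assumptions, not the benchmark's published scoring code: overlap counts how many reference actions the prediction matches, while state success requires every goal edge to hold in the final hidden world graph.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical edge encoding: (subject, relation, object), e.g. ("egg", "in", "pan").
Edge = Tuple[str, str, str]


def action_overlap(predicted: List[str], reference: List[str]) -> float:
    """Fraction of reference actions matched by the prediction (multiset match, order ignored)."""
    if not reference:
        return 1.0
    remaining = list(predicted)
    hits = 0
    for action in reference:
        if action in remaining:
            remaining.remove(action)  # each predicted action can match only once
            hits += 1
    return hits / len(reference)


def state_success(final_world: FrozenSet[Edge], goal: FrozenSet[Edge]) -> bool:
    """Success only if every goal edge holds in the final hidden world graph."""
    return goal <= final_world


reference = ["open fridge", "take egg", "crack egg into pan", "turn on hob"]
predicted = ["open fridge", "take egg", "turn on hob"]  # skips the state-changing crack step
final_world = frozenset({("fridge", "is", "open"), ("egg", "in", "hand"), ("hob", "is", "on")})
goal = frozenset({("egg", "in", "pan"), ("hob", "is", "on")})

print(action_overlap(predicted, reference))  # 0.75
print(state_success(final_world, goal))      # False
```

Skipping a single state-changing step still yields 75% overlap while the goal fails, which illustrates how an overlap score can overstate physical-state success.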

Key takeaway

For research scientists developing embodied AI agents, Ego2World highlights the need to move beyond action prediction toward robust belief maintenance and state-change reasoning. Prioritize agents that can update and use an internal belief graph under partial observation: doing so significantly improves task completion and reduces costly visual exploration, even if it means sacrificing some local action plausibility.
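
A minimal Python sketch of why persistent belief memory can reduce visual exploration, assuming a hypothetical update policy: beliefs about entities currently in view are overwritten by the fresh observation, while beliefs about out-of-view entities are kept. The functions `memoryless_update` and `persistent_update` are illustrative names, not Ego2World's agent interface.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str, str]  # hypothetical (subject, relation, object) encoding


def memoryless_update(belief: Set[Edge], observed: Set[Edge], seen: Set[str]) -> Set[Edge]:
    """Baseline: discard prior belief and keep only the current view (extra args unused)."""
    return set(observed)


def persistent_update(belief: Set[Edge], observed: Set[Edge], seen: Set[str]) -> Set[Edge]:
    """Overwrite beliefs about entities in view; keep beliefs about entities out of view."""
    kept = {edge for edge in belief if edge[0] not in seen}
    return kept | observed


# Step 1: the agent looks into the fridge and sees the egg.
view1, seen1 = {("egg", "in", "fridge")}, {"egg"}
# Step 2: the agent turns to the hob; the egg is now out of view.
view2, seen2 = {("pan", "on", "hob")}, {"pan"}

results: Dict[str, bool] = {}
for update in (memoryless_update, persistent_update):
    belief: Set[Edge] = set()
    belief = update(belief, view1, seen1)
    belief = update(belief, view2, seen2)
    results[update.__name__] = ("egg", "in", "fridge") in belief

print(results)  # {'memoryless_update': False, 'persistent_update': True}
```

After turning away from the fridge, only the persistent agent still knows where the egg is; the memoryless agent would have to look again. A real agent would also have to handle beliefs that go stale when the world changes out of view, which this sketch ignores.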

Key insights

Ego2World enables embodied agents to plan and adapt in partially observed, dynamic environments by compiling real egocentric videos into executable symbolic worlds.

Principles

Method

Ego2World compiles HD-EPIC video annotations into graph-transition rules, creating an executable symbolic simulator (VCSS). Agents plan over a partial belief graph, receiving local observations and execution feedback, while task success is judged against the hidden world graph.
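
The separation between the simulator's hidden world graph and what the agent sees can be sketched in Python as follows. Everything here is a hypothetical illustration, not VCSS itself: `Rule` is an assumed STRIPS-style precondition/delete/add encoding of a graph-transition rule, `HiddenWorld` is an assumed simulator class, and local observation is simplified to edges whose subject is visible.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple

Edge = Tuple[str, str, str]  # hypothetical (subject, relation, object) encoding


@dataclass(frozen=True)
class Rule:
    """Hypothetical graph-transition rule: if `pre` holds, remove `delete` and add `add`."""
    name: str
    pre: FrozenSet[Edge]
    delete: FrozenSet[Edge]
    add: FrozenSet[Edge]


class HiddenWorld:
    """Simulator-owned world graph; the agent never reads `_edges` directly."""

    def __init__(self, edges: Set[Edge]) -> None:
        self._edges = set(edges)

    def execute(self, rule: Rule) -> bool:
        """Apply the rule and return execution feedback (True means it succeeded)."""
        if not rule.pre <= self._edges:
            return False  # precondition failed; the world is unchanged
        self._edges -= rule.delete
        self._edges |= rule.add
        return True

    def observe(self, visible: Set[str]) -> Set[Edge]:
        """Local observation: only edges whose subject is currently visible."""
        return {edge for edge in self._edges if edge[0] in visible}

    def satisfies(self, goal: FrozenSet[Edge]) -> bool:
        """Task success is judged against the hidden graph, not the agent's belief."""
        return goal <= self._edges


take = Rule(
    name="take egg",
    pre=frozenset({("egg", "in", "fridge"), ("fridge", "is", "open")}),
    delete=frozenset({("egg", "in", "fridge")}),
    add=frozenset({("egg", "in", "hand")}),
)
crack = Rule(
    name="crack egg into pan",
    pre=frozenset({("egg", "in", "hand"), ("pan", "on", "hob")}),
    delete=frozenset({("egg", "in", "hand")}),
    add=frozenset({("egg", "in", "pan"), ("egg", "is", "cracked")}),
)

world = HiddenWorld({("egg", "in", "fridge"), ("fridge", "is", "open"), ("pan", "on", "hob")})
goal = frozenset({("egg", "in", "pan")})

print(world.execute(crack))    # False: the egg is not in hand yet, so the agent must replan
print(world.execute(take))     # True
print(world.execute(crack))    # True
print(world.observe({"egg"}))  # only the edges about the egg
print(world.satisfies(goal))   # True
```

The first `crack` attempt fails because its precondition does not hold, and that boolean is the only execution feedback the agent receives; success is then checked with `satisfies` on the hidden graph rather than on anything the agent believes.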

Best for: Research Scientist, AI Scientist, AI Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.