Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

2026-05-15 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Ego2World is an executable benchmark that transforms egocentric cooking videos from the HD-EPIC dataset into interactive symbolic worlds for evaluating embodied agents. Unlike passive video datasets or synthetic simulators, it targets planning under partial observation by separating a hidden world graph ($G_{\mathrm{w},t}$), maintained by the simulator, from the agent's belief graph ($G_{\mathrm{b},t}$). The benchmark compiles 101 videos, 9,130 action groups, and 426 goal-task instances into graph-transition rules, so agents can act, receive feedback, and replan without access to the full world state. Experiments show that action-overlap scores often overestimate physical-state success, and that persistent belief memory significantly improves task completion while reducing visual exploration. The annotation-grounded compilation process is itself robust, whereas direct LLM graph synthesis shows a 48% hallucination rate, which underscores the need for annotation-grounded construction.

Key takeaway

For research scientists developing embodied AI agents, Ego2World highlights the critical need to move beyond action prediction towards robust belief maintenance and state-change reasoning. You should prioritize designing agents that can effectively update and utilize an internal belief graph under partial observation, as this significantly improves task completion and reduces costly visual exploration, even if it means sacrificing some local action plausibility.

Key insights

Ego2World tests whether embodied agents can plan and adapt in partially observed, dynamic environments by compiling real egocentric videos into executable symbolic worlds.

Principles

Separate hidden world state from agent belief state.
Action plausibility does not equate to state success.
Memory selection is crucial for long-horizon planning.
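
The second principle can be made concrete with a toy example. The sketch below is illustrative only: the set-based overlap score, the triple encoding of state, and the fridge scenario are assumptions made for this summary, not the benchmark's actual metric or data. It shows a plan that matches the reference action set perfectly yet fails the goal, because the order of actions breaks a precondition.

```python
# Toy state: a set of (subject, relation, object) triples. Hypothetical encoding.
START = {("milk", "in", "fridge"), ("fridge", "is", "closed")}
GOAL = {("milk", "held_by", "agent")}

# action -> (preconditions, edges removed, edges added). Hypothetical rules.
EFFECTS = {
    "open fridge": ({("fridge", "is", "closed")},
                    {("fridge", "is", "closed")},
                    {("fridge", "is", "open")}),
    "take milk": ({("fridge", "is", "open"), ("milk", "in", "fridge")},
                  {("milk", "in", "fridge")},
                  {("milk", "held_by", "agent")}),
    "close fridge": ({("fridge", "is", "open")},
                     {("fridge", "is", "open")},
                     {("fridge", "is", "closed")}),
}


def action_overlap(predicted, reference):
    """Share of distinct reference actions that appear in the prediction."""
    ref = set(reference)
    if not ref:
        return 0.0
    return len(set(predicted) & ref) / len(ref)


def execute(start, plan, effects):
    """Apply each action whose preconditions hold; skip actions that fail."""
    state = set(start)
    for action in plan:
        pre, removed, added = effects[action]
        if pre <= state:
            state = (state - removed) | added
    return state


def state_success(start, plan, effects, goal):
    """True only if every goal triple holds in the final state."""
    return goal <= execute(start, plan, effects)


reference = ["open fridge", "take milk", "close fridge"]
predicted = ["take milk", "open fridge", "close fridge"]  # same actions, wrong order

overlap = action_overlap(predicted, reference)
reference_ok = state_success(START, reference, EFFECTS, GOAL)
predicted_ok = state_success(START, predicted, EFFECTS, GOAL)
print(overlap, reference_ok, predicted_ok)
```

Here the predicted plan gets a full overlap score while leaving the milk in the fridge, which is the kind of gap between action plausibility and physical-state success the summary describes.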

Method

Ego2World compiles HD-EPIC video annotations into graph-transition rules, creating an executable symbolic simulator (VCSS). Agents plan over a partial belief graph, receiving local observations and execution feedback, while task success is judged against the hidden world graph.
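
As a rough illustration of this setup, the following sketch separates a simulator-owned world graph from an agent-owned belief graph. Every name here (Rule, Feedback, Simulator, Agent, the triple encoding, the example rules) is hypothetical and chosen for this summary; it is not the benchmark's API or rule format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical graph-transition rule over (subject, relation, object) edges."""
    pre: frozenset      # edges that must exist before the action
    delete: frozenset   # edges removed by the action
    add: frozenset      # edges created by the action


@dataclass(frozen=True)
class Feedback:
    """What the agent gets back: success flag plus the local change it can see."""
    ok: bool
    added: frozenset = frozenset()
    removed: frozenset = frozenset()


class Simulator:
    """Owns the hidden world graph; the agent never reads it directly."""

    def __init__(self, world, rules):
        self._world = set(world)
        self._rules = dict(rules)

    def step(self, action):
        rule = self._rules.get(action)
        if rule is None or not rule.pre <= self._world:
            return Feedback(ok=False)  # execution failure, world unchanged
        self._world = (self._world - rule.delete) | rule.add
        return Feedback(True, rule.add, rule.delete)

    def goal_met(self, goal):
        # Success is judged against the hidden world graph, not the belief.
        return set(goal) <= self._world


class Agent:
    """Keeps a partial belief graph built only from feedback."""

    def __init__(self):
        self.belief = set()

    def update(self, feedback):
        if feedback.ok:
            self.belief = (self.belief - feedback.removed) | feedback.added


rules = {
    "open fridge": Rule(frozenset({("fridge", "is", "closed")}),
                        frozenset({("fridge", "is", "closed")}),
                        frozenset({("fridge", "is", "open")})),
    "take milk": Rule(frozenset({("fridge", "is", "open"), ("milk", "in", "fridge")}),
                      frozenset({("milk", "in", "fridge")}),
                      frozenset({("milk", "held_by", "agent")})),
}
sim = Simulator({("milk", "in", "fridge"), ("fridge", "is", "closed")}, rules)
agent = Agent()

# The first attempt fails (fridge is closed), so the agent replans.
trace = []
for action in ["take milk", "open fridge", "take milk"]:
    fb = sim.step(action)
    agent.update(fb)
    trace.append(fb.ok)

done = sim.goal_met({("milk", "held_by", "agent")})
```

The design choice worth noting is that `goal_met` lives on the simulator: an agent whose belief graph says the goal holds still fails if the hidden graph disagrees.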

In practice

Implement belief maintenance architectures for embodied agents.
Co-optimize for both local executability and global task completion.
Design memory systems with uncertainty-aware retrieval and forgetting.
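
One possible shape for the third recommendation is sketched below. It is an assumption-laden illustration, not the memory mechanism evaluated in the paper: the exponential decay schedule, the `floor` and `min_conf` thresholds, and the BeliefMemory class are all hypothetical. Facts lose confidence as steps pass without re-observation, retrieval returns only sufficiently confident facts, and very stale facts are forgotten.

```python
class BeliefMemory:
    """Toy belief store with confidence decay, thresholded retrieval, and forgetting."""

    def __init__(self, decay=0.9, floor=0.2):
        self.decay = decay      # per-step confidence multiplier (assumed schedule)
        self.floor = floor      # facts below this confidence are dropped
        self.clock = 0
        self._last_seen = {}    # fact -> step at which it was last observed

    def observe(self, fact):
        """A fresh observation resets the fact to full confidence."""
        self._last_seen[fact] = self.clock

    def confidence(self, fact):
        """Confidence decays exponentially with steps since last observation."""
        if fact not in self._last_seen:
            return 0.0
        return self.decay ** (self.clock - self._last_seen[fact])

    def tick(self):
        """Advance one step, then forget facts whose confidence fell below the floor."""
        self.clock += 1
        self._last_seen = {
            fact: seen for fact, seen in self._last_seen.items()
            if self.decay ** (self.clock - seen) >= self.floor
        }

    def retrieve(self, min_conf=0.5):
        """Return (fact, confidence) pairs above min_conf, most confident first."""
        scored = [(fact, self.confidence(fact)) for fact in self._last_seen]
        kept = [pair for pair in scored if pair[1] >= min_conf]
        return sorted(kept, key=lambda pair: pair[1], reverse=True)


mem = BeliefMemory(decay=0.5, floor=0.2)
mem.observe(("milk", "in", "fridge"))
mem.tick()                               # milk fact now at 0.5
mem.observe(("knife", "on", "counter"))
mem.tick()                               # milk 0.25, knife 0.5
confident = mem.retrieve(min_conf=0.5)
mem.tick()                               # milk 0.125 is forgotten, knife 0.25
```

A planner could treat facts that fall out of `retrieve` as candidates for re-observation, trading a visual exploration step against the risk of acting on a stale belief.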

Topics

Ego2World Benchmark
Belief-State Planning
Egocentric Video Analysis
Graph-Transition Rules
Partial Observation

Best for: Research Scientist, AI Scientist, AI Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.