AsgardBench: A benchmark for visually grounded interactive planning
Summary
AsgardBench is a new open-source benchmark designed to evaluate the ability of embodied AI agents to adapt their plans based on visual observations in dynamic environments. Built on AI2-THOR, an interactive 3D simulation, AsgardBench features 108 controlled task instances across 12 types, focusing specifically on plan adaptation rather than navigation or manipulation. Agents are given a task, observe the environment through images, and must revise their proposed action sequences when perceptions contradict expectations. The benchmark provides minimal feedback (success/failure signals) and limits steps to prevent scripting, forcing agents to continuously re-evaluate. Testing leading vision-capable models revealed that visual input substantially improved success rates, often doubling them compared to text-only descriptions, and highlighted common weaknesses in visual distinction, state tracking, and plan revision.
Key takeaway
For research scientists developing embodied AI, AsgardBench is a diagnostic and development tool. Use it to isolate and improve your agents' visually grounded interactive planning, focusing on three areas: distinguishing objects in cluttered scenes, tracking task progress accurately, and revising plans mid-task. These are the capabilities agents need to cope with real-world environmental variability.
Key insights
AsgardBench evaluates embodied AI's ability to adapt plans using visual feedback in dynamic simulated environments.
Principles
- Visual grounding is critical for robust embodied AI performance.
- Minimal feedback forces agents to rely on perception for plan adaptation.
Method
AsgardBench positions agents in AI2-THOR, provides color images and action history, and executes only the first step of a proposed plan, forcing continuous visual re-evaluation.
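The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AsgardBench or AI2-THOR API: `ToyEnv`, `plan`, `run_episode`, the mug task, and the action names are all hypothetical, and a plain dictionary stands in for the color image the agent would actually receive. The point it shows is that the agent proposes a full plan, only the first action is executed, the environment returns just a success/failure signal, and the agent replans from a fresh observation under a step limit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, bool]
History = List[Tuple[str, bool]]


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulator (not the AsgardBench or AI2-THOR API)."""
    mug_dirty: bool = True
    mug_filled: bool = False

    def observe(self) -> Observation:
        # A real benchmark would return an image; a dict stands in for it here.
        return {"mug_dirty": self.mug_dirty, "mug_filled": self.mug_filled}

    def step(self, action: str) -> bool:
        # Returns only a success/failure signal, mirroring the minimal feedback.
        if action == "wash_mug" and self.mug_dirty:
            self.mug_dirty = False
            return True
        if action == "fill_mug" and not self.mug_dirty and not self.mug_filled:
            self.mug_filled = True
            return True
        return False

    def done(self) -> bool:
        return self.mug_filled


def plan(observation: Observation, history: History) -> List[str]:
    """Hypothetical planner: builds a full plan from the current observation."""
    steps = []
    if observation["mug_dirty"]:
        steps.append("wash_mug")
    if not observation["mug_filled"]:
        steps.append("fill_mug")
    return steps


def run_episode(
    env: ToyEnv,
    planner: Callable[[Observation, History], List[str]],
    max_steps: int = 5,
) -> Tuple[bool, History]:
    """Replan every step; execute only the first action of each proposed plan."""
    history: History = []
    for _ in range(max_steps):
        if env.done():
            break
        proposed = planner(env.observe(), history)
        if not proposed:
            break
        action = proposed[0]  # the rest of the plan is discarded
        succeeded = env.step(action)
        history.append((action, succeeded))
    return env.done(), history


if __name__ == "__main__":
    print(run_episode(ToyEnv(), plan))
    # A planner that ignores the observation keeps failing until the step limit.
    print(run_episode(ToyEnv(), lambda obs, hist: ["fill_mug"], max_steps=3))
```

In this toy setup, a planner that reads the observation washes the mug before filling it, while one that follows a fixed script exhausts its step budget. That is the behavior gap the benchmark's design is meant to expose.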
In practice
- Use AsgardBench to diagnose perception, memory, or planning weaknesses.
- Focus training on mid-task plan repair and subtle visual cues.
Topics
- AsgardBench
- Embodied AI
- Visually Grounded Planning
- AI2-THOR Simulation
- Plan Adaptation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.