AsgardBench: A benchmark for visually grounded interactive planning

Source: Microsoft Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

AsgardBench is a new open-source benchmark that evaluates whether embodied AI agents can adapt their plans to visual observations in dynamic environments. Built on AI2-THOR, an interactive 3D simulation, it comprises 108 controlled task instances across 12 task types and targets plan adaptation specifically, rather than navigation or manipulation. Agents are given a task, observe the environment through images, and must revise their proposed action sequences when what they perceive contradicts what they expected. The benchmark returns only minimal feedback (success/failure signals) and caps the number of steps to prevent scripted solutions, so agents have to re-evaluate continuously. Tests of leading vision-capable models showed that visual input substantially improved success rates, often doubling them relative to text-only descriptions, and exposed common weaknesses in visual distinction, state tracking, and plan revision.

Key takeaway

For research scientists developing embodied AI, AsgardBench is a diagnostic and development tool. Use it to isolate and improve your agents' visually grounded interactive planning, focusing on three areas: distinguishing objects visually in cluttered scenes, tracking task progress accurately, and revising plans robustly mid-task to prepare for real-world environmental variability.

Key insights

AsgardBench evaluates embodied AI's ability to adapt plans using visual feedback in dynamic simulated environments.

Method

AsgardBench positions agents in AI2-THOR, provides color images and action history, and executes only the first step of a proposed plan, forcing continuous visual re-evaluation.
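
The execute-first-step-then-replan loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AsgardBench's harness or the AI2-THOR API: `ToyKitchen`, `toy_planner`, `run_episode`, the action names, and the `max_steps` value are all hypothetical, and a small dictionary stands in for the color image an agent would actually receive. The mug's dirtiness only becomes observable once the mug is held, so the initial plan turns out to be wrong and has to be revised.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, bool]        # stand-in for an image observation
History = List[Tuple[str, bool]]     # (action, success flag) pairs


@dataclass
class ToyKitchen:
    """Hypothetical toy environment; NOT the AI2-THOR API."""
    holding_mug: bool = False
    mug_dirty: bool = True
    task_done: bool = False

    def observe(self) -> Observation:
        # Dirtiness is only visible once the agent holds the mug.
        obs = {"holding_mug": self.holding_mug}
        if self.holding_mug:
            obs["mug_dirty"] = self.mug_dirty
        return obs

    def step(self, action: str) -> bool:
        # Minimal feedback: a bare success/failure flag, no explanation.
        if action == "pickup_mug" and not self.holding_mug:
            self.holding_mug = True
            return True
        if action == "wash_mug" and self.holding_mug:
            self.mug_dirty = False
            return True
        if action == "place_mug" and self.holding_mug and not self.mug_dirty:
            self.holding_mug = False
            self.task_done = True
            return True
        return False


def toy_planner(obs: Observation, history: History) -> List[str]:
    """Hypothetical planner: proposes a full plan from the latest observation."""
    if not obs["holding_mug"]:
        return ["pickup_mug", "place_mug"]   # optimistic: assumes a clean mug
    if obs.get("mug_dirty", False):
        return ["wash_mug", "place_mug"]     # revised after seeing the dirt
    return ["place_mug"]


def run_episode(
    env: ToyKitchen,
    planner: Callable[[Observation, History], List[str]],
    max_steps: int = 6,
) -> Tuple[bool, History]:
    history: History = []
    for _ in range(max_steps):               # step cap, as in the benchmark
        if env.task_done:
            break
        plan = planner(env.observe(), history)
        if not plan:
            break
        action = plan[0]                     # execute ONLY the first step
        history.append((action, env.step(action)))
    return env.task_done, history


if __name__ == "__main__":
    success, history = run_episode(ToyKitchen(), toy_planner)
    print(success, [action for action, _ in history])
```

Because the rest of each proposed plan is discarded after its first action, the planner cannot commit to a fixed script: in this toy run the initial `pickup_mug, place_mug` plan is replaced by `wash_mug, place_mug` as soon as the observation shows a dirty mug.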

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.