GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation

2026-03-26 · Source: Microsoft Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Intermediate, quick

Summary

Vision-language models (VLMs) struggle with long, complex robot tasks because they typically express plans in natural language, which is ambiguous and leaves action planning decoupled from spatial grounding. To address this, researchers developed GroundedPlanBench, a benchmark for evaluating VLM planning and spatial grounding across diverse real-world robot scenarios. They also introduced Video-to-Spatially Grounded Planning (V2GP), a framework that converts robot demonstration videos into spatially grounded training data. V2GP lets models learn planning and grounding jointly, significantly improving task success and action accuracy compared to decoupled approaches. The benchmark includes 1,009 tasks ranging from 1 to 26 actions, derived from 308 robot manipulation scenes in the DROID dataset, with both explicit and implicit instructions.

Key takeaway

For research scientists developing robot manipulation systems, integrating planning and spatial grounding within a single model is crucial. Decoupled approaches, which separate action planning from location determination, lead to significant failures in complex, real-world tasks due to linguistic ambiguity. You should explore frameworks like V2GP to train VLMs for joint planning and grounding, enhancing task success and action recall rates in your robotic applications.

Key insights

Having a VLM plan actions and ground them spatially in a single model improves robot task success and action accuracy compared with decoupled approaches.

Principles

Decoupled planning propagates errors.
Ambiguous language hinders robot execution.
Grounded planning enhances reliability.

Method

V2GP processes robot demonstration videos in stages: it detects object interactions, generates text descriptions, tracks objects using SAM3, and constructs grounded plans by identifying grasp and placement locations.
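
As a rough illustration of the last stage, the sketch below assembles a grounded plan from interactions that have already been detected and from per-frame object tracks. It is not the V2GP implementation: the Interaction and GroundedAction schemas, the build_grounded_plan function, the pixel-coordinate convention, and the choice to read the grasp point at the start of an interaction and the placement point at its end are all assumptions made for this example. The tracks dictionary stands in for the output of an object tracker.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (x, y) in image pixels; this coordinate convention is an assumption.
Point = Tuple[float, float]


@dataclass
class Interaction:
    """One detected object interaction in a video (hypothetical schema)."""
    description: str  # generated text, e.g. "put the cup on the plate"
    object_id: str    # identifier of the tracked object
    start_frame: int  # frame where the gripper takes hold of the object
    end_frame: int    # frame where the object is released


@dataclass
class GroundedAction:
    """A plan step that pairs a text description with image locations."""
    description: str
    grasp_point: Point
    place_point: Point


def build_grounded_plan(
    interactions: List[Interaction],
    tracks: Dict[str, Dict[int, Point]],
) -> List[GroundedAction]:
    """Order interactions by time and attach grasp and placement points.

    tracks maps object_id -> frame index -> object location, standing in
    for tracker output. The grasp point is read at start_frame and the
    placement point at end_frame (an assumption of this sketch).
    """
    plan = []
    for item in sorted(interactions, key=lambda i: i.start_frame):
        track = tracks[item.object_id]
        plan.append(
            GroundedAction(
                description=item.description,
                grasp_point=track[item.start_frame],
                place_point=track[item.end_frame],
            )
        )
    return plan


if __name__ == "__main__":
    demo_interactions = [Interaction("put the cup on the plate", "cup", 10, 20)]
    demo_tracks = {"cup": {10: (120.0, 80.0), 20: (300.0, 95.0)}}
    for step in build_grounded_plan(demo_interactions, demo_tracks):
        print(step)
```

Keeping the text description and the locations in the same record is the point of the example: each plan step carries its own grounding, so no separate model has to reinterpret the language later.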

In practice

Use V2GP for robot training data.
Evaluate with GroundedPlanBench.
Integrate planning and grounding.
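
To make the evaluation idea concrete, here is a minimal sketch of how task success and action recall could be scored for a grounded plan. The benchmark's actual metric definitions are not given in this summary, so everything here is an assumption: the RefAction and PredAction types, the step-by-step alignment, and the matching rule (same verb, and the predicted point falls inside the reference box) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]              # (x, y)


@dataclass
class RefAction:
    """Reference action with a target region (hypothetical schema)."""
    verb: str
    target_box: Box


@dataclass
class PredAction:
    """Predicted action with a single grounded point (hypothetical schema)."""
    verb: str
    point: Point


def point_in_box(point: Point, box: Box) -> bool:
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def action_matches(pred: PredAction, ref: RefAction) -> bool:
    # Assumed rule: the verb agrees and the point lands in the reference box.
    return pred.verb == ref.verb and point_in_box(pred.point, ref.target_box)


def action_recall(preds: List[PredAction], refs: List[RefAction]) -> float:
    """Fraction of reference actions matched at the same step index."""
    if not refs:
        return 0.0
    hits = sum(1 for pred, ref in zip(preds, refs) if action_matches(pred, ref))
    return hits / len(refs)


def task_success(preds: List[PredAction], refs: List[RefAction]) -> bool:
    """A task counts as solved only if every reference action is matched."""
    return len(preds) == len(refs) and action_recall(preds, refs) == 1.0
```

Under this scoring, a plan that names the right verb but points at the wrong location earns no credit for that step, which is the kind of failure that a text-only plan evaluation would miss.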

Topics

GroundedPlanBench
V2GP Framework
Spatially Grounded Planning
Robot Manipulation
Vision-Language Models

Code references

QwenLM/Qwen3-VL

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.