GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation
Summary
Vision-language models (VLMs) struggle with long, complex robot tasks when they plan in natural language alone, because such plans are ambiguous and decouple action planning from spatial grounding. To address this, researchers developed GroundedPlanBench, a benchmark that evaluates VLM planning and spatial grounding together across diverse real-world robot scenarios. They also introduced Video-to-Spatially Grounded Planning (V2GP), a framework that converts robot demonstration videos into spatially grounded training data. Training on V2GP data lets models learn planning and grounding jointly, which significantly improves task success and action accuracy compared to decoupled approaches. The benchmark includes 1,009 tasks ranging from 1 to 26 actions, derived from 308 robot manipulation scenes in the DROID dataset, with both explicit and implicit instructions.
Key takeaway
For research scientists developing robot manipulation systems, integrating planning and spatial grounding within a single model is crucial. Decoupled approaches, which separate deciding what to do from deciding where to do it, fail often in complex, real-world tasks because the language passed between the two stages is ambiguous. Consider frameworks like V2GP for training VLMs to plan and ground jointly, which can raise task success and action recall rates in your robotic applications.
Key insights
Performing action planning and spatial grounding jointly, rather than in separate stages, improves VLM task success and action accuracy on robot manipulation tasks.
Principles
- Decoupled planning propagates errors.
- Ambiguous language hinders robot execution.
- Grounded planning enhances reliability.
Method
V2GP processes robot videos to detect object interactions, generates text descriptions, tracks objects using SAM3, and constructs grounded plans by identifying grasp and placement locations.
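As a rough illustration of the last step, the sketch below turns tracked object interactions into a grounded plan of alternating grasp and place actions. The schema (`Interaction`, `GroundedAction`), the box convention, and the choice of box centers as action locations are all assumptions made for this example; they are not taken from the V2GP implementation, and the tracking step is represented only by its hypothetical output.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in image pixels.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


@dataclass
class Interaction:
    """One detected object interaction (hypothetical schema)."""
    description: str  # text description of the object, e.g. "red mug"
    grasp_box: Box    # tracked object box at the frame where it is grasped
    place_box: Box    # tracked object box at the frame where it is released


@dataclass
class GroundedAction:
    """One plan step paired with the image location it refers to."""
    verb: str         # "grasp" or "place"
    target: str       # object description
    point: Point      # assumed grounding: center of the tracked box


def box_center(box: Box) -> Point:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def build_grounded_plan(interactions: List[Interaction]) -> List[GroundedAction]:
    """Emit a grasp step and a place step for each interaction, in order."""
    plan: List[GroundedAction] = []
    for it in interactions:
        plan.append(GroundedAction("grasp", it.description, box_center(it.grasp_box)))
        plan.append(GroundedAction("place", it.description, box_center(it.place_box)))
    return plan


# Illustrative input, not real data.
demo = [Interaction("red mug", (100, 200, 140, 260), (400, 180, 440, 240))]
plan = build_grounded_plan(demo)
for step in plan:
    print(step.verb, step.target, step.point)
```

Each plan step carries its own location, so a model trained on such data has to predict the action and its location together instead of handing a text description to a separate grounding stage.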
In practice
- Use V2GP for robot training data.
- Evaluate with GroundedPlanBench.
- Integrate planning and grounding.
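To make the evaluation idea concrete, here is a minimal sketch of how task success and action recall could be scored for a grounded plan. The matching rule (same verb, predicted point inside the reference box, compared position by position) is an assumption chosen for illustration and should not be read as GroundedPlanBench's official metric definition.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
Pred = Tuple[str, Point]                 # (verb, predicted location)
Ref = Tuple[str, Box]                    # (verb, reference region)


def point_in_box(p: Point, b: Box) -> bool:
    return b[0] <= p[0] <= b[2] and b[1] <= p[1] <= b[3]


def action_matches(pred: Pred, ref: Ref) -> bool:
    """Assumed rule: verbs agree and the predicted point lies in the reference box."""
    return pred[0] == ref[0] and point_in_box(pred[1], ref[1])


def action_recall(preds: List[Pred], refs: List[Ref]) -> float:
    """Fraction of reference actions matched by the prediction at the same step."""
    if not refs:
        return 1.0
    hits = sum(
        1 for i, ref in enumerate(refs)
        if i < len(preds) and action_matches(preds[i], ref)
    )
    return hits / len(refs)


def task_success(preds: List[Pred], refs: List[Ref]) -> bool:
    """A task counts as solved only if every step matches and the lengths agree."""
    return len(preds) == len(refs) and all(
        action_matches(p, r) for p, r in zip(preds, refs)
    )


# Illustrative values only: the grasp is grounded correctly, the place is not.
refs = [("grasp", (100, 200, 140, 260)), ("place", (400, 180, 440, 240))]
preds = [("grasp", (120, 230)), ("place", (300, 210))]
recall = action_recall(preds, refs)
success = task_success(preds, refs)
print(recall, success)
```

Under this rule a single misplaced step fails the whole task while recall still credits the correct steps, which gives one way to see why long tasks are hard to complete end to end.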
Topics
- GroundedPlanBench
- V2GP Framework
- Spatially Grounded Planning
- Robot Manipulation
- Vision-Language Models
Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Research.