Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
Summary
GRASP is a lightweight neuro-symbolic framework for open-vocabulary tabletop manipulation. It translates natural-language queries into neuro-symbolic goal states that are grounded through a bounding-box detection pipeline. By combining a pretrained Vision-Language Model (GroundingDINO) with an LLM (GPT-5.2), it lets robots interpret abstract spatial concepts like "top shelf" and execute tasks without additional fine-tuning or extensive training. Experiments on a differential claw arm covered 90 trials across three difficulty levels, with success rates of 86.67% (easy), 76.67% (medium), and 56.67% (hard), averaging 73.33%. Ablation studies confirmed that closed-loop control, smoothing and deadband, and highest-logit target selection are each critical for reliable grasping.
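The reported percentages are consistent with an even split of 30 trials per difficulty level. That split, and the success counts inferred from it, are assumptions made here for illustration; the summary states only the percentages and the 90-trial total. A quick arithmetic check under that assumption:

```python
# Assumption: the 90 trials are split evenly, 30 per difficulty level.
# The success counts are inferred from the reported percentages, not quoted.
TRIALS_PER_LEVEL = 30
successes = {"easy": 26, "medium": 23, "hard": 17}

# Per-level success rate in percent, rounded to two decimals.
rates = {level: round(100 * n / TRIALS_PER_LEVEL, 2) for level, n in successes.items()}

# Overall rate across all trials (equals the mean of the three rates
# because every level has the same number of trials).
overall = round(100 * sum(successes.values()) / (TRIALS_PER_LEVEL * len(successes)), 2)

print(rates)
print(overall)
```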
Key takeaway
For AI engineers building robotic manipulation systems, GRASP offers a lightweight, training-free way to add natural-language commands. Consider pairing neuro-symbolic planning with a pretrained VLM such as GroundingDINO to achieve zero-shot generalization and interpret abstract spatial concepts. The framework reduces computational overhead and training-data requirements, enabling faster deployment of language-conditioned robots in dynamic environments.
Key insights
GRASP combines VLMs and symbolic planning for zero-shot, language-conditioned robotic manipulation using bounding box goals and closed-loop control.
Principles
- Decouple high-level reasoning from low-level control.
- Closed-loop feedback is critical for robust grasping.
- Neuro-symbolic approaches enhance interpretability and efficiency.
Method
GRASP uses GPT-5.2 to generate the goal state as JSON and GroundingDINO for real-time bounding-box detection. A proportional roll-pitch-yaw (RPY) controller, with smoothing and a deadband, aligns the claw based on object-to-center distance.
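A minimal sketch of how a proportional controller with smoothing and a deadband might be structured for a single axis. This is not the authors' implementation: the class name `AxisController`, the choice of exponential-moving-average smoothing, and every gain and threshold are hypothetical, and the summary does not say how image-space error maps onto roll, pitch, or yaw.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AxisController:
    """Proportional control for one axis, with EMA smoothing and a deadband.

    All default values are illustrative, not taken from the paper.
    """
    kp: float = 0.5         # proportional gain (hypothetical)
    alpha: float = 0.3      # weight of the newest error sample in the EMA
    deadband: float = 0.05  # smoothed errors below this yield no command
    _smoothed: Optional[float] = field(default=None, init=False, repr=False)

    def update(self, error: float) -> float:
        """Return a correction for the latest object-to-center error."""
        if self._smoothed is None:
            # First sample: nothing to smooth against yet.
            self._smoothed = error
        else:
            self._smoothed = self.alpha * error + (1 - self.alpha) * self._smoothed
        if abs(self._smoothed) < self.deadband:
            # Inside the deadband: hold still to avoid jitter near the target.
            return 0.0
        return self.kp * self._smoothed


# Usage: one controller per axis, fed a fresh error on every camera frame.
yaw = AxisController()
first = yaw.update(0.4)   # large offset, proportional correction
second = yaw.update(0.0)  # smoothing keeps some correction for one more step
```

Smoothing damps frame-to-frame detection noise, while the deadband stops the arm from chasing errors too small to matter. In a closed loop, the error would be recomputed from a new detection after every correction.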
In practice
- Use GroundingDINO for open-vocabulary object detection.
- Implement proportional RPY control for precise adjustments.
- Employ smoothing and deadband to stabilize robot movements.
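The highest-logit target selection mentioned in the summary, together with the object-to-center error it feeds into the controller, could be sketched as follows. The `Detection` container is a hypothetical stand-in rather than GroundingDINO's actual output type, and normalizing the offset to [-1, 1] is an assumption made for illustration.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Detection(NamedTuple):
    # Hypothetical container; not GroundingDINO's real return type.
    box_xyxy: Tuple[float, float, float, float]  # pixel corners (x1, y1, x2, y2)
    logit: float                                 # detector confidence score
    phrase: str                                  # text phrase matched by the box


def select_target(detections: Sequence[Detection]) -> Optional[Detection]:
    """Pick the detection with the highest logit, or None if there are none."""
    return max(detections, key=lambda d: d.logit, default=None)


def center_offset(det: Detection, width: int, height: int) -> Tuple[float, float]:
    """Offset of the box center from the image center, normalized to [-1, 1]."""
    x1, y1, x2, y2 = det.box_xyxy
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return (cx - width / 2) / (width / 2), (cy - height / 2) / (height / 2)


# Usage with made-up detections on a 640x480 frame.
dets = [
    Detection((100, 100, 200, 200), 0.42, "red cup"),
    Detection((300, 220, 340, 260), 0.81, "red cup"),
]
target = select_target(dets)
if target is not None:
    dx, dy = center_offset(target, 640, 480)
```

When several boxes match the same phrase, always taking the highest-logit one gives the controller a single, unambiguous target on each frame.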
Topics
- Neuro-Symbolic AI
- Robotic Manipulation
- Vision-Language Models
- GroundingDINO
- Zero-Shot Learning
- Proportional Control
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Robotics Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.