Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Robotics & Autonomous Systems, Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

GRASP is a lightweight neuro-symbolic framework for open-vocabulary tabletop manipulation. It translates natural-language queries into neuro-symbolic goal states that are grounded through a bounding-box detection pipeline. The approach pairs a pretrained Vision-Language Model (GroundingDINO) with an LLM (GPT-5.2), which lets a robot interpret abstract spatial concepts such as "top shelf" and execute tasks without additional fine-tuning or extensive training. In 90 trials on a differential claw arm, success rates were 86.67%, 76.67%, and 56.67% on the easy, medium, and hard difficulty levels respectively, for an overall average of 73.33%. Ablation studies confirmed that closed-loop control, smoothing and deadband, and highest-logit target selection are each critical to reliable grasping.
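The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes the 90 trials were split evenly, 30 per difficulty level; the summary does not state the split, so the per-level success counts it derives are implied values under that assumption, not numbers quoted from the paper.

```python
# Per-level success rates (percent) as reported in the summary.
reported = {"easy": 86.67, "medium": 76.67, "hard": 56.67}

# ASSUMPTION: 90 trials split evenly across the three levels.
TRIALS_PER_LEVEL = 30


def implied_successes(rate_percent: float, trials: int) -> int:
    """Nearest whole number of successes consistent with a reported rate."""
    return round(rate_percent / 100 * trials)


successes = {
    level: implied_successes(rate, TRIALS_PER_LEVEL)
    for level, rate in reported.items()
}
total_trials = TRIALS_PER_LEVEL * len(reported)
overall = 100 * sum(successes.values()) / total_trials

print(successes)
print(f"overall: {overall:.2f}%")
```

Under the even-split assumption, the pooled rate matches the reported overall average, which suggests the 73.33% figure is a trial-weighted rate and not a separately measured one.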

Key takeaway

For AI engineers building robotic manipulation systems, GRASP offers a lightweight, training-free way to accept natural-language commands. Consider adopting neuro-symbolic planning with pretrained VLMs such as GroundingDINO to get zero-shot generalization and to interpret abstract spatial concepts. Because the framework needs no task-specific training, it reduces computational overhead and training-data requirements, which shortens the path to deploying language-conditioned robots in dynamic environments.

Key insights

GRASP combines VLMs and symbolic planning for zero-shot, language-conditioned robotic manipulation using bounding box goals and closed-loop control.
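To make "bounding box goals" concrete, here is a minimal sketch of how a goal state could be checked against detections. The JSON schema (`target`, `region`), the `Box` class, and the `goal_satisfied` function are all hypothetical names chosen for illustration; the summary does not describe GRASP's actual goal format. The check assumes a goal counts as met when the center of the target's box falls inside the region's box.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self) -> tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    def contains(self, point: tuple[float, float]) -> bool:
        x, y = point
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


def goal_satisfied(goal_json: str, detections: dict[str, Box]) -> bool:
    """Return True when the target's box center lies inside the region's box.

    ASSUMPTION: the goal is a JSON object with "target" and "region" keys
    naming detected phrases. This schema is illustrative only.
    """
    goal = json.loads(goal_json)
    target = detections.get(goal["target"])
    region = detections.get(goal["region"])
    if target is None or region is None:
        return False  # cannot confirm the goal without both detections
    return region.contains(target.center())


goal = '{"target": "red mug", "region": "top shelf"}'
detections = {
    "red mug": Box(120, 40, 180, 100),
    "top shelf": Box(0, 0, 640, 120),
}
print(goal_satisfied(goal, detections))
```

Keeping the goal as a small symbolic record means the same check can be re-run on every new frame, which is what lets the planner stay decoupled from the low-level controller.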

Principles

Decouple high-level reasoning from low-level control.
Closed-loop feedback is critical for robust grasping.
Neuro-symbolic approaches enhance interpretability and efficiency.

Method

GRASP uses GPT-5.2 to generate the goal state as JSON and GroundingDINO for real-time bounding-box detection. A proportional RPY (roll-pitch-yaw) controller, with smoothing and a deadband, aligns the claw based on object-to-center distance.
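A proportional controller with smoothing and a deadband can be sketched as follows. This is an illustration under stated assumptions, not the paper's controller: the class name `ProportionalAligner`, the gain, smoothing factor, and deadband values, the use of an exponential moving average as the smoothing step, the choice of image center as the reference point, and the mapping of horizontal error to yaw and vertical error to pitch are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProportionalAligner:
    """Hypothetical P-controller that centers a detected box in the image."""
    gain: float = 0.5       # proportional gain (assumed value)
    alpha: float = 0.3      # EMA smoothing factor in (0, 1] (assumed value)
    deadband: float = 0.05  # ignore normalized errors smaller than this
    _smoothed: tuple[float, float] = (0.0, 0.0)

    def step(self, box, image_size):
        """Return (yaw, pitch) commands for one detection.

        box is (x_min, y_min, x_max, y_max) in pixels; image_size is
        (width, height). Command signs depend on the robot's frame
        conventions, which are not specified here.
        """
        x_min, y_min, x_max, y_max = box
        width, height = image_size

        # Offset of the box center from the image center, scaled to [-1, 1].
        err_x = ((x_min + x_max) / 2 - width / 2) / (width / 2)
        err_y = ((y_min + y_max) / 2 - height / 2) / (height / 2)

        # Exponential moving average damps frame-to-frame detection jitter.
        sx = self.alpha * err_x + (1 - self.alpha) * self._smoothed[0]
        sy = self.alpha * err_y + (1 - self.alpha) * self._smoothed[1]
        self._smoothed = (sx, sy)

        # Deadband: output zero when the smoothed error is negligible.
        yaw = 0.0 if abs(sx) < self.deadband else self.gain * sx
        pitch = 0.0 if abs(sy) < self.deadband else self.gain * sy
        return yaw, pitch


aligner = ProportionalAligner()
for box in [(400, 220, 440, 260), (380, 225, 420, 265), (350, 230, 390, 270)]:
    print(aligner.step(box, image_size=(640, 480)))
```

The deadband keeps the arm from chattering around the setpoint when the error is within detector noise, and the smoothing prevents a single noisy box from producing a large correction. Both are tuning choices that would need adjusting on real hardware.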

In practice

Use GroundingDINO for open-vocabulary object detection.
Implement proportional RPY control for precise adjustments.
Employ smoothing and deadband to stabilize robot movements.
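The summary also credits highest-logit target selection for reliable grasping. A minimal sketch of that selection step is below. It operates on plain dictionaries and does not call GroundingDINO; the detection format (`label`, `logit`, `box`) and the `select_target` function with its optional `min_logit` threshold are hypothetical and chosen only for illustration.

```python
from typing import Optional


def select_target(
    detections: list[dict],
    min_logit: float = float("-inf"),
) -> Optional[dict]:
    """Pick the detection with the highest logit, or None if none qualify.

    ASSUMPTION: each detection is a dict with "label", "logit", and "box"
    keys. This is an illustrative format, not a library's output type.
    """
    candidates = [d for d in detections if d["logit"] >= min_logit]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["logit"])


detections = [
    {"label": "mug", "logit": 0.41, "box": (120, 40, 180, 100)},
    {"label": "mug", "logit": 0.78, "box": (300, 200, 360, 260)},
    {"label": "mug", "logit": 0.52, "box": (500, 310, 560, 370)},
]
target = select_target(detections, min_logit=0.35)
print(target["box"] if target else "no target")
```

Committing to a single highest-scoring box gives the controller one stable target to track when a prompt matches several objects in the scene.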

Topics

Neuro-Symbolic AI
Robotic Manipulation
Vision-Language Models
GroundingDINO
Zero-Shot Learning
Proportional Control

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Robotics Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.