3D Instruction Ambiguity Detection

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Peking University researchers Jiayu Ding, Haoran Tang, and Ge Li introduce "Open-Vocabulary 3D Instruction Ambiguity Detection," a new task in which an embodied agent must decide whether a command is ambiguous with respect to a given 3D scene, a capability that matters most in safety-critical applications. They built Ambi3D, a large-scale benchmark of over 700 diverse 3D scenes and approximately 22,000 human-annotated instructions, categorized into referential ambiguities (instance, attribute, spatial) and execution ambiguities. Their analysis shows that existing state-of-the-art 3D Large Language Models (LLMs) struggle with this task. To address this, the team proposes AmbiVer, a two-stage framework that first collects explicit visual evidence from multiple views and then uses a Vision-Language Model (VLM) for zero-shot ambiguity adjudication. On the Ambi3D benchmark, AmbiVer significantly outperforms both zero-shot and LoRA fine-tuned 3D LLM baselines.

Key takeaway

For research scientists developing embodied AI systems, recognizing and handling linguistic ambiguity in 3D environments is a safety requirement. Consider integrating an explicit ambiguity detection step, such as the AmbiVer framework, rather than relying on implicit resolution or assuming that instructions are unambiguous. An agent that detects ambiguity before acting can ask for clarification, which reduces the risk of serious errors in safety-critical domains and makes the system more trustworthy.

Key insights

Detecting instruction ambiguity in 3D environments is critical for safe embodied AI, and current 3D LLMs do not handle it well.

Principles

Ambiguity is a joint property of instruction and 3D scene.
Objective ambiguity detection precedes execution.
Decouple perception from logical reasoning.
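
To illustrate the first principle, here is a minimal sketch in which the same instruction is clear in one scene and ambiguous in another. The `SceneObject` type, the attribute-matching rule, and the two toy scenes are assumptions made for illustration; they are not part of Ambi3D or AmbiVer, and they cover only the simplest instance-level case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical, heavily simplified description of one object in a scene."""
    category: str
    color: str


def matching_objects(scene: List[SceneObject], category: str,
                     color: Optional[str] = None) -> List[SceneObject]:
    """Return every object the referring expression could point to."""
    return [
        obj for obj in scene
        if obj.category == category and (color is None or obj.color == color)
    ]


def is_referentially_ambiguous(scene: List[SceneObject], category: str,
                               color: Optional[str] = None) -> bool:
    """Toy rule: the reference is ambiguous if more than one object matches."""
    return len(matching_objects(scene, category, color)) > 1


# The instruction "pick up the red mug" applied to two different scenes.
scene_a = [SceneObject("mug", "red"), SceneObject("chair", "black")]
scene_b = [SceneObject("mug", "red"), SceneObject("mug", "red"),
           SceneObject("chair", "black")]

clear_in_a = not is_referentially_ambiguous(scene_a, "mug", "red")
ambiguous_in_b = is_referentially_ambiguous(scene_b, "mug", "red")
```

The instruction text never changes between the two calls; only the scene does, which is why ambiguity cannot be judged from the instruction alone.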

Method

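AmbiVer is a two-stage framework. First, a perception engine extracts structured visual evidence, namely a bird's-eye-view (BEV) map and a set of object instances, from video streams. Second, a VLM receives that evidence together with the instruction in a multimodal prompt and adjudicates, zero-shot, whether the instruction is ambiguous.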
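
The separation between the two stages can be sketched as follows. This is a structural outline under stated assumptions, not the authors' implementation: the `Evidence` and `Instance` types, the prompt wording, and the yes/no answer format are hypothetical, and both the perception engine and the VLM are passed in as plain callables and replaced by stubs here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class Instance:
    label: str
    position: Tuple[float, float, float]  # world coordinates (x, y, z)


@dataclass
class Evidence:
    bev_map: str  # placeholder for a rendered BEV image, e.g. a file path
    instances: List[Instance] = field(default_factory=list)


def build_prompt(instruction: str, evidence: Evidence) -> str:
    """Assemble the text part of a multimodal prompt (hypothetical wording)."""
    lines = [f"Instruction: {instruction}",
             f"BEV map: {evidence.bev_map}",
             "Detected instances:"]
    for i, inst in enumerate(evidence.instances):
        x, y, z = inst.position
        lines.append(f"  [{i}] {inst.label} at ({x:.1f}, {y:.1f}, {z:.1f})")
    lines.append("Is the instruction ambiguous in this scene? Answer yes or no.")
    return "\n".join(lines)


def detect_ambiguity(instruction: str,
                     frames: Sequence[object],
                     perceive: Callable[[Sequence[object], str], Evidence],
                     adjudicate: Callable[[str], str]) -> bool:
    """Stage 1 gathers evidence; stage 2 reasons over it. Neither sees the other's internals."""
    evidence = perceive(frames, instruction)
    answer = adjudicate(build_prompt(instruction, evidence))
    return answer.strip().lower().startswith("yes")


# Stubs standing in for a real perception engine and a real VLM call.
def fake_perceive(frames: Sequence[object], instruction: str) -> Evidence:
    return Evidence("bev.png", [Instance("chair", (0.0, 0.0, 2.0)),
                                Instance("chair", (3.0, 0.0, 2.0))])


def fake_vlm(prompt: str) -> str:
    listed = sum(1 for line in prompt.splitlines() if line.startswith("  ["))
    return "yes" if listed > 1 else "no"


result = detect_ambiguity("Bring me the chair", [], fake_perceive, fake_vlm)
```

Because `perceive` and `adjudicate` are injected, either stage can be swapped or tested in isolation, which is the practical benefit of decoupling perception from reasoning.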

In practice

Use Grounding DINO for open-vocabulary 2D detection.
Employ ray-based fusion for 3D instance unification.
Qwen3-VL can perform zero-shot ambiguity reasoning.
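
One way to picture ray-based fusion is the sketch below, which back-projects each 2D detection along its camera ray using a depth value and then merges same-label points that land close together. The summary does not describe AmbiVer's actual fusion procedure, so everything here is an assumption: the pinhole camera model, the per-detection depth, the greedy merge, the 0.3 m radius, and all type and function names. The 2D detections are assumed to come from an open-vocabulary detector and are hard-coded.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Detection2D:
    label: str
    u: float                              # pixel column of the box center
    v: float                              # pixel row of the box center
    depth: float                          # depth along the optical axis, in meters
    rotation: Sequence[Sequence[float]]   # 3x3 camera-to-world rotation
    translation: Vec3                     # camera position in world coordinates


@dataclass
class Instance3D:
    label: str
    centroid: Vec3
    count: int = 1                        # number of 2D detections merged so far


def backproject(det: Detection2D, fx: float, fy: float,
                cx: float, cy: float) -> Vec3:
    """Cast a ray through pixel (u, v) and walk along it to the given depth."""
    pc = ((det.u - cx) / fx * det.depth,
          (det.v - cy) / fy * det.depth,
          det.depth)
    return tuple(
        sum(det.rotation[r][c] * pc[c] for c in range(3)) + det.translation[r]
        for r in range(3)
    )


def fuse(detections: Sequence[Detection2D], fx: float, fy: float,
         cx: float, cy: float, radius: float = 0.3) -> List[Instance3D]:
    """Greedily merge same-label detections whose 3D points lie within `radius`."""
    instances: List[Instance3D] = []
    for det in detections:
        p = backproject(det, fx, fy, cx, cy)
        for inst in instances:
            if inst.label == det.label and math.dist(inst.centroid, p) <= radius:
                n = inst.count
                inst.centroid = tuple(
                    (inst.centroid[i] * n + p[i]) / (n + 1) for i in range(3)
                )
                inst.count = n + 1
                break
        else:
            instances.append(Instance3D(det.label, p))
    return instances


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
detections = [
    # The same chair seen from two camera positions, plus a second chair.
    Detection2D("chair", 320.0, 240.0, 2.0, IDENTITY, (0.0, 0.0, 0.0)),
    Detection2D("chair", 70.0, 240.0, 2.0, IDENTITY, (1.0, 0.0, 0.0)),
    Detection2D("chair", 1070.0, 240.0, 2.0, IDENTITY, (0.0, 0.0, 0.0)),
]
instances = fuse(detections, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

In this toy input the first two detections resolve to the same world point and are unified, while the third becomes a separate instance, so an instruction such as "bring me the chair" would face two candidates.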

Topics

3D Instruction Ambiguity Detection
Ambi3D Benchmark
AmbiVer Framework
Embodied AI Safety
Vision-Language Models

Best for: Research Scientist, AI Scientist, Robotics Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.