Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization
Summary
Semantic Flip is a framework for robust refusal in embodied agents. It targets a specific failure mode: overconfident vision-language models (VLMs) give misleading answers when the agent's visual memory does not contain enough information to answer the query. The method synthesizes auxiliary out-of-distribution (OOD) samples by independently transforming user queries and video memory, producing query-memory pairs that lack sufficient visual grounding. These synthesized pairs train a lightweight rejection module that attaches to any existing VLM-based pipeline without retraining the underlying pretrained VLM. Semantic Flip consistently outperforms strong prompting baselines across two complementary benchmarks. The work also introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, on which Semantic Flip achieves an F1 score of 0.9559. Source code and datasets are publicly available.
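For readers less familiar with the metric quoted above, F1 is the harmonic mean of precision and recall. The sketch below is a generic helper, not the benchmark's evaluation script. The counts are hypothetical and chosen only to make the arithmetic easy to follow. Treating "refuse" as the positive class is also an assumption, since the summary does not say which class is scored.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 'positive' is assumed to mean 'should refuse'."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, NOT taken from the paper:
# 90 correct refusals, 10 unnecessary refusals, 10 missed refusals.
example_f1 = f1_score(tp=90, fp=10, fn=10)
```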
Key takeaway
For machine learning engineers deploying embodied agents, Semantic Flip offers a way to add robust refusal. If your VLM-based pipeline gives overconfident answers to unanswerable queries, consider integrating this lightweight rejection module. It makes embodied question answering and spatial localization more reliable by withholding answers that lack visual grounding, without retraining your core VLM. This in turn supports user trust and agent safety.
Key insights
Semantic Flip synthesizes OOD samples by transforming queries and video memory to train a lightweight refusal module for embodied VLMs.
Principles
- VLMs need robust refusal for unanswerable queries.
- Synthesize OOD samples without external annotations.
- Decouple refusal module from core VLM training.
Method
Semantic Flip independently transforms user queries and video memory to create auxiliary OOD pairs lacking visual grounding. These pairs train a lightweight rejection module on a frozen pretrained VLM.
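The pair-construction idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not specify which transformations are used, so the two "flip" functions below (swapping in a query or a memory from a different episode) are hypothetical stand-ins. Frame captions stand in for video memory, and the names `Sample`, `flip_query`, `flip_memory`, and `build_training_set` are invented for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    query: str
    memory: tuple      # frame captions standing in for video memory
    answerable: bool   # label for the rejection module


def flip_query(sample, query_pool, rng):
    """Hypothetical query transform: swap in a query from another episode."""
    others = [q for q in query_pool if q != sample.query]
    return Sample(rng.choice(others), sample.memory, answerable=False)


def flip_memory(sample, memory_pool, rng):
    """Hypothetical memory transform: swap in memory from another episode."""
    others = [m for m in memory_pool if m != sample.memory]
    return Sample(sample.query, rng.choice(others), answerable=False)


def build_training_set(in_dist, seed=0):
    """Keep the original pairs and add two synthetic OOD pairs per original.

    Assumption: a swapped query or memory is treated as ungrounded. On real
    data a swap could still be answerable, so a real pipeline would need a
    stronger guarantee than this toy rule provides.
    """
    rng = random.Random(seed)
    queries = [s.query for s in in_dist]
    memories = [s.memory for s in in_dist]
    out = list(in_dist)
    for s in in_dist:
        out.append(flip_query(s, queries, rng))
        out.append(flip_memory(s, memories, rng))
    return out


episodes = [
    Sample("Where is the red mug?", ("kitchen counter with red mug",), True),
    Sample("Where is the laptop?", ("desk with open laptop",), True),
    Sample("Where is the umbrella?", ("hallway with umbrella stand",), True),
]
training_set = build_training_set(episodes)
```

The resulting labeled pairs would then be fed, as features extracted by the frozen VLM, to a small classifier that serves as the rejection module. That training step is omitted here because the summary gives no details about the module's architecture.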
In practice
- Integrate refusal module into existing VLM pipelines.
- Improve reliability of embodied QA agents.
- Enhance spatial localization with refusal capabilities.
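Integration into an existing pipeline can be pictured as a gate in front of the answer step. The sketch below assumes the pipeline exposes an answer function and that the rejection module returns a score where higher means "more likely unanswerable". The threshold, the refusal message, and both toy functions are hypothetical placeholders, not part of Semantic Flip's published interface.

```python
from typing import Callable, Sequence

REFUSAL = "I cannot answer that from what I have observed."


def make_guarded_pipeline(
    answer_fn: Callable[[str, Sequence[str]], str],
    reject_score_fn: Callable[[str, Sequence[str]], float],
    threshold: float = 0.5,
) -> Callable[[str, Sequence[str]], str]:
    """Wrap an existing answer function with a refusal gate.

    The wrapped VLM pipeline is left untouched; only its output is gated.
    """
    def guarded(query: str, memory: Sequence[str]) -> str:
        if reject_score_fn(query, memory) >= threshold:
            return REFUSAL
        return answer_fn(query, memory)
    return guarded


# Toy stand-ins so the sketch runs without a real VLM.
def toy_answer(query, memory):
    return memory[0]


def toy_reject_score(query, memory):
    """Score 1.0 when no content word of the query appears in memory."""
    words = {w.strip("?.,").lower() for w in query.split()}
    content = words - {"where", "is", "the"}
    seen = any(w in frame.lower() for frame in memory for w in content)
    return 0.0 if seen else 1.0


guarded = make_guarded_pipeline(toy_answer, toy_reject_score)
memory = ["kitchen counter with red mug"]
grounded_reply = guarded("Where is the red mug?", memory)
ungrounded_reply = guarded("Where is the umbrella?", memory)
```

Keeping the gate outside the answer function mirrors the decoupling principle listed above: the rejection score can be retrained or re-thresholded without touching the core VLM.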
Topics
- Embodied AI
- Vision-Language Models
- Out-of-Distribution Detection
- Refusal Systems
- Spatial Localization
- SpaceReject Benchmark
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.