LaViSA: A Language and Vision Structural Ambiguity Benchmark
Summary
The LaViSA (Language and Vision Structural Ambiguity) benchmark evaluates the ability of vision-and-language models (VLMs) to resolve structural ambiguity using visual scenes. It comprises 1,503 samples across seven categories, each pairing an ambiguous sentence with its disambiguated forms and corresponding images. The authors evaluated a range of VLMs at varying parameter scales, including proprietary models such as GPT-5.2, Gemini 3.1 Pro, and Gemini 3.1 Flash-Lite, and open-source models such as LLaVA-OneVision-1.5, Qwen3-VL, and Gemma3. The results indicate that VLMs can leverage visual cues to some extent but still struggle with specific ambiguity types, notably Conjunction and Ellipsis, and with subtle visual semantic distinctions.
Key takeaway
For AI Scientists and NLP Engineers developing Vision and Language Models, this research shows that current VLMs, although they make some use of visual cues, remain limited in resolving structural ambiguities such as Conjunction and Ellipsis. You should prioritize two capabilities: consistently tracking semantic differences across minimally contrasted visual scenes, and correctly integrating visual objects into predicate-argument structures.
Key insights
VLMs struggle with visual disambiguation of structural ambiguity, especially subtle semantic distinctions.
Principles
- Structural ambiguity is a core challenge for language understanding.
- Visual scenes are critical for resolving linguistic ambiguity.
- VLM performance varies significantly across ambiguity types.
Method
LaViSA evaluates VLMs by requiring them to select the correct interpretation from candidate disambiguated sentences, given an ambiguous sentence and a clarifying image.
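The selection task described above can be sketched as a multiple-choice scoring loop. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the `Sample` fields, the prompt wording, the letter-based answer format, and the `model` callable are all hypothetical names introduced here, and the example sentences are invented placeholders rather than LaViSA data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    """One benchmark item (field names are assumptions, not the real schema)."""
    category: str          # ambiguity type, e.g. "Conjunction" or "Ellipsis"
    ambiguous: str         # the ambiguous sentence
    candidates: List[str]  # candidate disambiguated sentences
    image_path: str        # image that supports exactly one reading
    gold_index: int        # index of the reading the image supports


def build_prompt(sample: Sample) -> str:
    """Format the sentence and its candidate readings as lettered options."""
    options = "\n".join(
        f"{chr(ord('A') + i)}. {text}" for i, text in enumerate(sample.candidates)
    )
    return (
        f"Sentence: {sample.ambiguous}\n"
        "Which reading matches the image?\n"
        f"{options}\n"
        "Answer with a single letter."
    )


def parse_choice(reply: str, n_options: int) -> int:
    """Map a reply that starts with an option letter to its index, else -1."""
    text = reply.strip().upper()
    if not text:
        return -1
    index = ord(text[0]) - ord("A")
    return index if 0 <= index < n_options else -1


def accuracy_by_category(
    samples: List[Sample], model: Callable[[str, str], str]
) -> Dict[str, float]:
    """Score a model callable (image_path, prompt) -> reply, per category."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sample in samples:
        reply = model(sample.image_path, build_prompt(sample))
        total[sample.category] += 1
        if parse_choice(reply, len(sample.candidates)) == sample.gold_index:
            correct[sample.category] += 1
    return {category: correct[category] / total[category] for category in total}


if __name__ == "__main__":
    # Placeholder items and a stub model that always answers "A".
    demo = [
        Sample("Conjunction", "old men and women", ["reading 1", "reading 2"], "a.png", 0),
        Sample("Conjunction", "old men and women", ["reading 1", "reading 2"], "b.png", 1),
        Sample("Ellipsis", "placeholder sentence", ["reading 1", "reading 2"], "c.png", 0),
    ]
    print(accuracy_by_category(demo, lambda image_path, prompt: "A"))
```

Reporting accuracy per category rather than as a single aggregate makes it easier to see weaknesses confined to particular ambiguity types, such as the Conjunction and Ellipsis cases the summary singles out.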
In practice
- Prioritize VLM research on Conjunction and Ellipsis ambiguities.
- Analyze VLM performance across different image styles.
- Focus on predicate-argument structure grounding in VLMs.
Topics
- Vision-Language Models
- Structural Ambiguity
- Visual Disambiguation
- Benchmark Datasets
- Natural Language Processing
- Multimodal AI
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.