LaViSA: A Language and Vision Structural Ambiguity Benchmark

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The LaViSA (Language and Vision Structural Ambiguity) benchmark evaluates how well Vision and Language Models (VLMs) resolve structural ambiguity when given a visual scene. It comprises 1,503 samples across seven categories, each pairing an ambiguous sentence with its disambiguated forms and corresponding images. The authors evaluated a diverse set of VLMs: proprietary models such as GPT-5.2, Gemini 3.1 Pro, and Gemini 3.1 Flash-Lite, and open-source models of varying parameter scales such as LLaVA-OneVision-1.5, Qwen3-VL, and Gemma3. The results indicate that VLMs can leverage visual cues to some extent but still struggle with specific ambiguity types, notably Conjunction and Ellipsis, and with subtle visual semantic distinctions.

Key takeaway

For AI scientists and NLP engineers developing Vision and Language Models: current VLMs, despite leveraging visual cues, still show significant limitations in resolving complex structural ambiguities such as Conjunction and Ellipsis. Prioritize improving models' ability to track semantic differences consistently across minimally contrasted visual scenes and to integrate visual objects correctly into predicate-argument structures, which matters for robust real-world applications.

Key insights

VLMs struggle with visual disambiguation of structural ambiguity, especially subtle semantic distinctions.

Principles

Method

LaViSA evaluates VLMs by requiring them to select the correct interpretation from candidate disambiguated sentences, given an ambiguous sentence and a clarifying image.
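The selection task described above can be sketched as a small scoring loop. This is only an illustration under stated assumptions: the `Sample` field names, the `Chooser` signature, the `always_first` stand-in, and the toy sentences are all hypothetical and are not taken from the LaViSA release or the authors' evaluation code. The sketch shows one plausible way to report accuracy per ambiguity category, which is how weaknesses on types such as Conjunction and Ellipsis would surface.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Sample:
    """One benchmark item. Field names are hypothetical, not the released schema."""
    category: str               # ambiguity type, e.g. "Conjunction" or "Ellipsis"
    ambiguous: str              # the structurally ambiguous sentence
    candidates: Sequence[str]   # disambiguated readings to choose from
    image_path: str             # image that supports exactly one reading
    gold: int                   # index of the correct candidate


# A chooser maps (ambiguous sentence, image path, candidates) to a candidate index.
# In a real run this would wrap a VLM call; here it is just a callable.
Chooser = Callable[[str, str, Sequence[str]], int]


def accuracy_by_category(samples: List[Sample], choose: Chooser) -> Dict[str, float]:
    """Return the fraction of correctly selected readings for each category."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for s in samples:
        pred = choose(s.ambiguous, s.image_path, s.candidates)
        total[s.category] += 1
        correct[s.category] += int(pred == s.gold)
    return {cat: correct[cat] / total[cat] for cat in total}


def always_first(ambiguous: str, image_path: str, candidates: Sequence[str]) -> int:
    """Stand-in for a model: ignores the image and always picks candidate 0."""
    return 0


# Toy items invented for illustration only; they are not LaViSA data.
toy_samples = [
    Sample("Conjunction", "The old men and women sat down.",
           ["The old men and the old women sat down.",
            "The women and the old men sat down."],
           "scene_a.png", 0),
    Sample("Conjunction", "The old men and women sat down.",
           ["The old men and the old women sat down.",
            "The women and the old men sat down."],
           "scene_b.png", 1),
    Sample("Ellipsis", "Sam likes the dog more than Alex.",
           ["Sam likes the dog more than Alex likes the dog.",
            "Sam likes the dog more than Sam likes Alex."],
           "scene_c.png", 0),
]

scores = accuracy_by_category(toy_samples, always_first)
print(scores)
```

The two Conjunction items share one sentence but differ in image and gold reading, so a chooser that ignores the image cannot get both right. That mirrors, in miniature, why minimally contrasted scenes are a useful probe of whether a model is actually using the visual input.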

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Scientist, NLP Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.