Disparities In Negation Understanding Across Languages In Vision-Language Models

Source: Computation and Language · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Computational Linguistics · Depth: Expert, quick

Summary

Vision-language models (VLMs) frequently exhibit an "affirmation bias," favoring positive captions over negative ones, a phenomenon previously studied primarily in English. A new human-verified multilingual negation benchmark has been introduced, covering seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. This benchmark was used to evaluate three VLMs: CLIP, SigLIP, and MultiCLIP. Results indicate that standard CLIP performs at or below chance for non-Latin-script languages, whereas MultiCLIP demonstrates the highest and most consistent accuracy across all languages. Additionally, the SpaceVLM negation correction method was assessed, showing significant improvements for English, Greek, Spanish, and Tagalog, but varied effectiveness in other languages. This variability highlights how linguistic features like morphology, script, and negation structure influence model performance and the efficacy of corrective measures, underscoring the need for multilingual benchmarks as VLMs are deployed globally.

Key takeaway

If you develop or deploy vision-language models globally, prioritize multilingual evaluation beyond English. Negation understanding varies significantly across languages because of differences in morphology and script, and that variation affects fairness. Use benchmarks like the one presented to verify that corrections such as SpaceVLM actually work for your target linguistic communities, rather than assuming they apply universally.

Key insights

VLMs show affirmation bias, with negation understanding varying significantly across diverse languages and models.

Method

A human-verified multilingual negation benchmark was built across seven languages and used to evaluate negation understanding in three VLMs (CLIP, SigLIP, and MultiCLIP) and to assess the SpaceVLM negation correction method.
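
As a rough illustration of how such an evaluation can be scored, the sketch below computes per-language accuracy and an affirmation-bias rate from image-caption similarity scores. The summary does not describe the benchmark's data format or scoring protocol, so everything here is an assumption: the `Trial` structure, the two-caption forced-choice setup (where chance would be 0.5), and the function names are hypothetical, and the scores are toy values, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trial:
    """One hypothetical benchmark item: an image scored against two captions."""
    language: str
    sim_correct: float        # similarity for the caption that matches the image
    sim_distractor: float     # similarity for the caption that contradicts it
    correct_is_negated: bool  # True when the matching caption is the negated one


def per_language_accuracy(trials):
    """Return {language: fraction of trials where the matching caption scores higher}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t in trials:
        totals[t.language] += 1
        if t.sim_correct > t.sim_distractor:
            hits[t.language] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


def affirmation_bias_rate(trials):
    """Fraction of negated-answer trials where the affirmative caption wins or ties."""
    negated = [t for t in trials if t.correct_is_negated]
    if not negated:
        return 0.0
    wrong = sum(1 for t in negated if t.sim_distractor >= t.sim_correct)
    return wrong / len(negated)


# Toy similarity scores for illustration only (assumed, not from the benchmark).
toy = [
    Trial("en", 0.31, 0.27, True),
    Trial("en", 0.22, 0.29, True),
    Trial("el", 0.18, 0.24, True),
    Trial("el", 0.30, 0.21, False),
]

acc = per_language_accuracy(toy)
bias = affirmation_bias_rate(toy)
print(acc, round(bias, 3))
```

Reporting accuracy per language rather than pooled keeps a weak language from being hidden by a strong one, and restricting the bias rate to trials whose correct caption is negated isolates the failure the summary calls affirmation bias.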

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, NLP Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.