Disparities In Negation Understanding Across Languages In Vision-Language Models
Summary
Vision-language models (VLMs) exhibit "affirmation bias," a tendency to select positive captions even when the correct description contains negation. This study introduces the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating CLIP, SigLIP, and MultiCLIP, the research found that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieved the highest and most uniform accuracy. Applying SpaceVLM, a proposed negation correction, yielded substantial improvements for several languages, particularly English, Greek, Spanish, and Tagalog, but showed varied effectiveness across typologically different languages. This variation indicates that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways, highlighting the necessity of multilingual benchmarks for equitable VLM deployment.
Key takeaway
Research scientists and engineers deploying VLMs globally should recognize that current models exhibit significant cross-lingual negation gaps, particularly in non-Latin-script languages. Fairness audits must consider linguistic structures beyond English, because solutions like SpaceVLM vary in effectiveness depending on a language's negation morphology. Prioritize curating negation-rich multilingual pretraining data and developing context-aware tokenization strategies to ensure equitable performance across diverse linguistic communities.
Key insights
VLMs exhibit affirmation bias, with negation understanding varying significantly across typologically diverse languages.
Principles
- Affirmation bias is a systematic VLM failure mode.
- Linguistic typology influences VLM negation understanding.
- Multilingual benchmarks are crucial for equitable VLM deployment.
Method
A multilingual negation benchmark was constructed by extending the English NegBench to cover seven languages in total (English included), using Google Translate followed by human verification. Three VLMs (CLIP, SigLIP, MultiCLIP) were evaluated on top-1 accuracy, both as-is and with SpaceVLM applied as a negation correction.
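The top-1 accuracy metric mentioned above can be sketched as follows. This is a minimal illustration, not the study's evaluation code: the function names, the toy similarity scores, and the choice of four candidate captions per image are all assumptions made for the example.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], correct: Sequence[int]) -> float:
    """Fraction of images whose highest-scoring caption is the correct one.

    scores[i][j] is a (hypothetical) image-text similarity between image i
    and candidate caption j; correct[i] is the index of the right caption.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    if not scores:
        raise ValueError("at least one item is required")
    hits = 0
    for row, gold in zip(scores, correct):
        # Index of the caption with the highest similarity for this image.
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += predicted == gold
    return hits / len(scores)


def chance_accuracy(num_choices: int) -> float:
    """Accuracy expected from picking a caption uniformly at random."""
    return 1.0 / num_choices


# Toy scores, invented for illustration only (not results from the study).
toy_scores = [
    [0.31, 0.27, 0.22, 0.20],  # predicted caption 0
    [0.18, 0.35, 0.25, 0.22],  # predicted caption 1
    [0.40, 0.10, 0.30, 0.20],  # predicted caption 0
]
toy_correct = [0, 2, 0]

accuracy = top1_accuracy(toy_scores, toy_correct)
baseline = chance_accuracy(num_choices=4)
```

Comparing `accuracy` against `baseline` is what makes a statement such as "at or below chance" checkable for a given language.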
In practice
- MultiCLIP offers the most consistent cross-lingual negation performance.
- SpaceVLM improves negation understanding mainly for languages that mark negation adverbially.
- Consider language-specific tuning for negation correction methods.
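One way to act on these points in a fairness audit is to track, per language, how much a correction helps and how far apart the best and worst languages are before and after it. The sketch below assumes per-language accuracies are already available; the helper names and every number in it are placeholders invented for illustration, not figures from the study.

```python
from typing import Dict


def accuracy_spread(accuracy: Dict[str, float]) -> float:
    """Gap between the best- and worst-performing language."""
    return max(accuracy.values()) - min(accuracy.values())


def per_language_gain(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Accuracy change per language after applying a correction."""
    if before.keys() != after.keys():
        raise ValueError("both runs must cover the same languages")
    return {lang: after[lang] - before[lang] for lang in before}


# Placeholder accuracies (hypothetical values, not reported results).
before_correction = {"en": 0.50, "el": 0.25, "zh": 0.25}
after_correction = {"en": 0.75, "el": 0.50, "zh": 0.25}

gains = per_language_gain(before_correction, after_correction)
spread_before = accuracy_spread(before_correction)
spread_after = accuracy_spread(after_correction)
```

In this invented example the correction helps two languages yet widens the gap between best and worst, which is the kind of outcome an English-only audit would miss.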
Topics
- Vision-Language Models
- Affirmation Bias
- Multilingual Negation Benchmark
- Linguistic Typology
- SpaceVLM
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.