Disparities In Negation Understanding Across Languages In Vision-Language Models

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation) · Depth: Expert, medium

Summary

Vision-language models (VLMs) exhibit "affirmation bias": a tendency to select affirmative captions even when the correct description contains negation. The study introduces what it describes as the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating CLIP, SigLIP, and MultiCLIP, the authors found that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieved the highest and most uniform accuracy. Applying SpaceVLM, a proposed negation correction, yielded substantial improvements for several languages, particularly English, Greek, Spanish, and Tagalog, but its effectiveness varied across typologically different languages. This variation indicates that linguistic properties such as morphology, script, and negation structure interact with model improvements in fairness-relevant ways, which is why multilingual benchmarks are necessary for equitable VLM deployment.

Key takeaway

If you are a research scientist or engineer deploying VLMs globally, recognize that current models exhibit significant cross-lingual negation gaps, particularly in non-Latin-script languages. Your fairness audits must consider linguistic structures beyond English, because corrections such as SpaceVLM vary in effectiveness with a language's negation morphology. Prioritize curating negation-rich multilingual pretraining data and developing context-aware tokenization strategies to ensure equitable performance across diverse linguistic communities.

Key insights

VLMs exhibit affirmation bias, with negation understanding varying significantly across typologically diverse languages.

Method

A multilingual negation benchmark was constructed by extending the English NegBench to cover seven languages, using Google Translate followed by human verification. Three VLMs (CLIP, SigLIP, MultiCLIP) were evaluated on top-1 accuracy, and SpaceVLM was then applied as a negation correction.
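
The top-1 evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each benchmark item pairs one image embedding with several candidate caption embeddings, and that the model's choice is the caption with the highest cosine similarity (the usual scoring for CLIP-style models). The function names, the item fields (`lang`, `image`, `captions`, `answer`), and the toy two-dimensional vectors are all hypothetical and carry no real results.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_accuracy_by_language(items):
    """Per-language top-1 accuracy.

    Each item is a dict with hypothetical fields:
      lang     - language code of the captions
      image    - image embedding (list of floats)
      captions - candidate caption embeddings
      answer   - index of the correct caption
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        scores = [cosine(item["image"], c) for c in item["captions"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct[item["lang"]] += int(predicted == item["answer"])
        total[item["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def chance_level(num_candidates):
    """Expected accuracy of uniform random guessing among the candidates."""
    return 1.0 / num_candidates


# Toy items with made-up embeddings, two candidate captions each.
toy_items = [
    {"lang": "en", "image": [1.0, 0.0], "captions": [[0.9, 0.1], [0.1, 0.9]], "answer": 0},
    {"lang": "en", "image": [0.0, 1.0], "captions": [[0.8, 0.2], [0.2, 0.8]], "answer": 1},
    {"lang": "el", "image": [1.0, 0.0], "captions": [[0.2, 0.8], [0.7, 0.3]], "answer": 0},
    {"lang": "el", "image": [0.0, 1.0], "captions": [[0.1, 0.9], [0.9, 0.1]], "answer": 0},
]

accuracy = top1_accuracy_by_language(toy_items)
baseline = chance_level(2)
for lang, acc in sorted(accuracy.items()):
    status = "above chance" if acc > baseline else "at or below chance"
    print(f"{lang}: {acc:.2f} ({status})")
```

Reporting accuracy per language next to the chance level, rather than as one pooled number, is what makes an "at or below chance" result visible for an individual language. In a real evaluation the embeddings would come from the model's image and text encoders, and the chance level would follow from the benchmark's actual number of candidate captions.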

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.