Disparities In Negation Understanding Across Languages In Vision-Language Models

2026-04-22 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, medium

Summary

Vision-language models (VLMs) exhibit "affirmation bias," a tendency to select positive captions even when the correct description contains negation. This study introduces the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating CLIP, SigLIP, and MultiCLIP, the research found that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieved the highest and most uniform accuracy. Applying SpaceVLM, a proposed negation correction, yielded substantial improvements for several languages, particularly English, Greek, Spanish, and Tagalog, but showed varied effectiveness across typologically different languages. This variation indicates that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways, highlighting the necessity of multilingual benchmarks for equitable VLM deployment.

Key takeaway

Research scientists and engineers deploying VLMs globally should recognize that current models exhibit significant cross-lingual negation gaps, particularly in non-Latin-script languages. Fairness audits must consider linguistic structures beyond English, because corrections like SpaceVLM vary in effectiveness with a language's negation morphology. Prioritize curating negation-rich multilingual pretraining data and developing context-aware tokenization strategies to ensure equitable performance across diverse linguistic communities.
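As a rough illustration of what a per-language audit might look like, the sketch below takes top-1 accuracies keyed by language code and reports the spread between the best and worst language, plus any language at or below chance. The function name `audit_languages`, the language codes, and every number are hypothetical placeholders for illustration; none of them are results from the paper.

```python
from typing import Dict, List, Union

AuditReport = Dict[str, Union[float, str, List[str]]]


def audit_languages(accuracy_by_language: Dict[str, float], chance: float) -> AuditReport:
    """Summarize cross-lingual spread for one model (hypothetical helper)."""
    if not accuracy_by_language:
        raise ValueError("need at least one language")
    best = max(accuracy_by_language, key=accuracy_by_language.get)
    worst = min(accuracy_by_language, key=accuracy_by_language.get)
    # Languages whose accuracy does not exceed random guessing.
    flagged = sorted(
        lang for lang, acc in accuracy_by_language.items() if acc <= chance
    )
    return {
        "best": best,
        "worst": worst,
        "gap": accuracy_by_language[best] - accuracy_by_language[worst],
        "at_or_below_chance": flagged,
    }


# Placeholder accuracies, NOT figures reported by the study.
placeholder_accuracy = {"en": 0.50, "es": 0.45, "el": 0.25, "ru": 0.20}
report = audit_languages(placeholder_accuracy, chance=0.25)
print(report)
```

Reporting the gap alongside the flagged languages keeps a high average from hiding a language where the model is effectively guessing.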

Key insights

VLMs exhibit affirmation bias, with negation understanding varying significantly across typologically diverse languages.

Principles

Affirmation bias is a systematic VLM failure mode.
Linguistic typology influences VLM negation understanding.
Multilingual benchmarks are crucial for equitable VLM deployment.

Method

A multilingual negation benchmark was constructed by extending the English NegBench to seven languages, with Google Translate output checked through human verification. Three VLMs (CLIP, SigLIP, and MultiCLIP) were evaluated on top-1 accuracy, with SpaceVLM then applied as a negation correction.
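The top-1 accuracy metric can be sketched as follows, assuming each image is scored against a fixed set of candidate captions and the model "selects" the caption with the highest image-text similarity. The four-candidate setup, the helper names, and the toy scores are assumptions made for illustration, not details confirmed by this summary.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], correct: Sequence[int]) -> float:
    """Fraction of examples whose highest-scoring caption is the correct one."""
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    if not scores:
        raise ValueError("need at least one example")
    hits = 0
    for row, gold in zip(scores, correct):
        # Index of the caption with the largest similarity score.
        predicted = max(range(len(row)), key=row.__getitem__)
        hits += int(predicted == gold)
    return hits / len(scores)


def chance_level(num_candidates: int) -> float:
    """Expected accuracy of uniform random guessing among the candidates."""
    return 1.0 / num_candidates


# Toy similarity scores (illustrative only): one row per image,
# one column per candidate caption.
toy_scores = [
    [0.31, 0.27, 0.22, 0.20],
    [0.18, 0.35, 0.30, 0.17],
    [0.29, 0.24, 0.26, 0.21],
    [0.25, 0.28, 0.21, 0.26],
]
toy_correct = [0, 2, 0, 3]  # index of the correct caption for each image

accuracy = top1_accuracy(toy_scores, toy_correct)
print(accuracy, chance_level(4))
```

Comparing the accuracy against `chance_level` is what gives a statement like "at or below chance" its meaning: with four candidates, any score of 0.25 or lower is no better than guessing.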

In practice

MultiCLIP offers the most consistent cross-lingual negation performance.
SpaceVLM improves negation handling for languages with adverbial negation.
Consider language-specific tuning for negation correction methods.

Topics

Vision-Language Models
Affirmation Bias
Multilingual Negation Benchmark
Linguistic Typology
SpaceVLM

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.