Ukrainian Visual Word Sense Disambiguation Benchmark

2026-03-26 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

A new benchmark targets the Visual Word Sense Disambiguation (Visual-WSD) task in Ukrainian: given an ambiguous word with minimal context, a system must select the image that depicts the intended sense from a set of ten candidates. The benchmark follows a methodology similar to that of Raganato et al. (2023) for English, Italian, and Farsi, so model performance can be compared across languages. Data was collected semi-automatically and refined by Ukrainian philology experts, with a focus on high-frequency noun homonyms. Eight multilingual and multimodal large language models were evaluated, and all performed worse than the zero-shot CLIP-based baseline model used for English Visual-WSD. The analysis revealed a significant Visual-WSD performance gap between Ukrainian and English, highlighting the weakness of MLLMs in low-resource languages and their susceptibility to hallucination.
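The ranking step behind a zero-shot, CLIP-style approach to this task can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the text (ambiguous word plus context) and the ten candidate images have already been embedded into a shared vector space by some encoder, and the function names `cosine` and `rank_candidates` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def rank_candidates(text_emb, image_embs):
    """Return candidate image indices, most similar to the text first.

    text_emb   -- embedding of the ambiguous word plus its context (assumed precomputed)
    image_embs -- list of embeddings, one per candidate image (assumed precomputed)
    """
    scores = [cosine(text_emb, emb) for emb in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
```

The first index in the returned list is the predicted image; the full ordering is what rank-based metrics consume.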

Key takeaway

AI scientists developing or deploying multimodal LLMs for low-resource languages should prioritize creating and using language-specific benchmarks such as U-VWSD. The observed performance gap between Ukrainian and English indicates that directly applying models trained on high-resource languages is insufficient. Focus on domain adaptation and data augmentation strategies tailored to the semantic nuances of the target language to mitigate hallucination and improve accuracy.

Key insights

Multimodal LLMs perform significantly worse on Ukrainian Visual-WSD than on English, revealing a critical language resource gap.

Principles

Low-resource languages challenge MLLM performance.
Homonym frequency impacts hallucination generation.

Method

The benchmark construction involved semi-automatic data collection from a digitized Ukrainian homonym dictionary, expert refinement, and generation of positive and negative image samples from Wikipedia, along with challenging trigger words.

In practice

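Use MRR (mean reciprocal rank of the correct image) and HIT@1 (share of instances where the top-ranked image is correct) for Visual-WSD evaluation.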
Consider language-specific polysemy in model assessment.
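As a rough sketch of how the two metrics can be computed, assume each instance yields a ranking of candidate image indices (best first) and one gold index. The function names and the dictionary keys are illustrative assumptions, not part of any benchmark toolkit.

```python
def reciprocal_rank(ranking, gold):
    """1 / (1-based position of the gold image in the ranking)."""
    return 1.0 / (ranking.index(gold) + 1)


def evaluate(rankings, golds):
    """Compute MRR and HIT@1 over a set of instances.

    rankings -- list of rankings, each a list of candidate indices, best first
    golds    -- list of gold (correct) candidate indices, one per instance
    """
    if not rankings or len(rankings) != len(golds):
        raise ValueError("rankings and golds must be non-empty and equal in length")
    n = len(rankings)
    mrr = sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds)) / n
    hit1 = sum(1 for r, g in zip(rankings, golds) if r[0] == g) / n
    return {"mrr": mrr, "hit@1": hit1}
```

HIT@1 only rewards an exact top choice, while MRR gives partial credit when the correct image is ranked near the top, so reporting both shows whether a model is close or far off when it misses.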

Topics

Visual Word Sense Disambiguation
Multimodal Large Language Models
Ukrainian NLP
Low-Resource Languages
AI Benchmarking

Best for: AI Scientist, AI Researcher, NLP Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.