Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation
Summary
A new benchmark, PuMVR (Punjabi Multimodal Visual Reasoning), exposes a significant "Script Gap" in state-of-the-art Vision-Language Models (VLMs). It uses 1,000 parallel image-text instances across Punjabi's Gurmukhi, Shahmukhi, and Roman scripts, challenging the "One Language, One Script" evaluation paradigm. An evaluation of 10 VLMs revealed accuracy deltas of up to 16.26% (Llama-3.2-11B-Vision) and Script Consistency Rates (SCR) as low as 24.8%. Visual input boosts performance but does not close the orthographic gap, and cross-script in-context transfer proved brittle. McNemar tests confirmed that these degradations are statistically robust for 8 of the 10 models. The study proposes SCR as a mandatory metric for script-agnostic VLM evaluation to ensure equitable AI access.
Key takeaway
For AI Scientists and ML Engineers deploying multilingual Vision-Language Models, this research highlights a critical blind spot: current evaluations overlook script consistency. Move beyond the "One Language, One Script" assumption and explicitly test for orthographic robustness using metrics such as Script Consistency Rate (SCR). Skipping this step risks deploying models that are unreliable for billions of multi-script users, which leads to fragmented performance and inequitable AI access. Prioritize script-agnostic evaluation for true multilingual competence.
Key insights
VLMs exhibit substantial script-dependent bias, failing to consistently process identical content across different orthographies.
Principles
- The "One Language, One Script" paradigm masks VLM orthographic bias.
- Visual grounding improves performance but does not close script gaps.
- VLM knowledge can be script-locked, hindering cross-script transfer.
Method
The PuMVR benchmark uses 1,000 parallel image-reasoning tasks in Gurmukhi, Shahmukhi, and Roman scripts to isolate the effect of orthography. It reports three metrics: Script Accuracy, Script Consistency Rate (SCR), and Performance Delta.
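The three metrics can be illustrated with a minimal Python sketch. The summary does not give their formal definitions, so the ones used here are assumptions: Script Accuracy is taken as per-script accuracy against gold answers, SCR as the fraction of items that receive the same answer in every script, and Performance Delta as the gap between the best and worst script accuracy. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def script_accuracy(preds: Dict[str, List[str]], gold: List[str]) -> Dict[str, float]:
    """Per-script accuracy: share of items answered correctly in each script."""
    return {
        script: sum(p == g for p, g in zip(preds[script], gold)) / len(gold)
        for script in SCRIPTS
    }


def script_consistency_rate(preds: Dict[str, List[str]]) -> float:
    """ASSUMED definition: share of items with an identical answer in all scripts."""
    per_item = list(zip(*(preds[script] for script in SCRIPTS)))
    consistent = sum(1 for answers in per_item if len(set(answers)) == 1)
    return consistent / len(per_item)


def performance_delta(accuracy: Dict[str, float]) -> float:
    """ASSUMED definition: best script accuracy minus worst script accuracy."""
    return max(accuracy.values()) - min(accuracy.values())


# Toy data: five parallel items, one answer per item per script (hypothetical).
gold = ["A", "B", "C", "D", "A"]
preds = {
    "gurmukhi": ["A", "B", "C", "D", "B"],
    "shahmukhi": ["A", "B", "D", "D", "B"],
    "roman": ["A", "C", "D", "A", "B"],
}

accuracy = script_accuracy(preds, gold)
print("accuracy:", accuracy)
print("SCR:", script_consistency_rate(preds))
print("delta:", performance_delta(accuracy))
```

Under this definition SCR counts agreement rather than correctness, so an item answered identically but wrongly in all three scripts still counts as consistent. A stricter variant would require the shared answer to be correct as well.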
In practice
- Evaluate VLMs using parallel-script benchmarks like PuMVR.
- Incorporate Script Consistency Rate (SCR) as a mandatory metric.
- Assess cross-script transfer with few-shot in-context prompting.
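When comparing one model's results on two scripts over the same items, a paired test such as McNemar's, which the summary mentions, is a natural way to check whether an accuracy gap is more than noise. The sketch below implements the exact (binomial) two-sided form using only the standard library. The summary does not say which variant of the test the study used, so the choice of the exact form, the function name, and the toy inputs are all assumptions.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-item correctness.

    correct_a[i] and correct_b[i] say whether the model answered item i
    correctly in script A and in script B respectively.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both sequences must cover the same items")

    # Only discordant pairs carry information about a difference.
    a_only = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    b_only = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference

    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example (hypothetical): 10 items right only in script A, none only in B.
script_a = [True] * 10 + [True] * 5
script_b = [False] * 10 + [True] * 5
print(mcnemar_exact(script_a, script_b))
```

Because the test looks only at items where the two scripts disagree, it suits parallel benchmarks where every item exists in every script.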
Topics
- Vision-Language Models
- Multilingual AI Evaluation
- Orthographic Bias
- Script Consistency Rate
- PuMVR Benchmark
- Cross-Script Transfer
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.