Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation
Summary
A new study introduces PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark for evaluating how Vision-Language Models (VLMs) handle a language written in more than one script. It comprises 1,000 strictly parallel image-text instances in Punjabi, covering its three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, the researchers identified a significant "Script Gap": models frequently solve a visual task in one script but fail the identical task in another, with accuracy differences reaching 16%. Visual input uniformly improves absolute performance but does not close this orthographic disparity. Cross-script in-context transfer is also highly brittle, which indicates script-locked knowledge representation. These findings, supported by McNemar tests, show that current "multilingual" VLMs are not truly multi-script. The authors propose the Script Consistency Rate (SCR), which fell as low as 24.8% on PuMVR, as a metric for script-agnostic evaluation and more equitable AI access.
Key takeaway
Machine Learning Engineers developing or evaluating multilingual Vision-Language Models should recognize that current models are not truly multi-script. Incorporate benchmarks like PuMVR and metrics such as the Script Consistency Rate (SCR) into your evaluations to expose and address "Script Gaps." Doing so helps your models provide equitable access and perform reliably across diverse orthographies, rather than keeping knowledge locked to a single script.
Key insights
Current "multilingual" VLMs exhibit a significant "Script Gap," failing identical visual tasks across different scripts of the same language.
Principles
- VLMs implicitly assume each language maps to a single script.
- Visual input doesn't resolve script-specific failures.
- Cross-script knowledge transfer is highly brittle.
Method
The authors built PuMVR, a benchmark of 1,000 parallel image-text instances across Punjabi's three scripts, and used it to evaluate 10 state-of-the-art VLMs. They measure the "Script Gap" between scripts and propose the Script Consistency Rate (SCR) as a script-agnostic evaluation metric.
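A minimal sketch of how these two quantities might be computed from per-instance correctness records. The summary does not give a formula for either one, so the definitions here are assumptions: the Script Gap is taken as the difference between the best and worst per-script accuracy, and SCR as the share of instances answered correctly in all three scripts. The function names and the toy records are hypothetical and are not data from the study.

```python
SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def accuracy(results, script):
    """Fraction of instances answered correctly in one script."""
    return sum(r[script] for r in results) / len(results)


def script_gap(results):
    """Assumed definition: best per-script accuracy minus worst."""
    accs = [accuracy(results, s) for s in SCRIPTS]
    return max(accs) - min(accs)


def script_consistency_rate(results):
    """Assumed definition: share of instances correct in every script."""
    consistent = sum(all(r[s] for s in SCRIPTS) for r in results)
    return consistent / len(results)


# Toy records (illustrative only): one dict per parallel instance,
# mapping each script to whether the model answered correctly.
toy_results = [
    {"gurmukhi": True, "shahmukhi": True, "roman": True},
    {"gurmukhi": True, "shahmukhi": False, "roman": True},
    {"gurmukhi": True, "shahmukhi": False, "roman": False},
    {"gurmukhi": False, "shahmukhi": False, "roman": False},
]

gap = script_gap(toy_results)
scr = script_consistency_rate(toy_results)
print(f"Script Gap: {gap:.2%}, SCR: {scr:.2%}")
```

Under this reading, a model can post reasonable accuracy in each script and still have a low SCR, because the instances it gets right differ from script to script.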
In practice
- Evaluate VLMs using multi-script benchmarks.
- Implement the Script Consistency Rate (SCR) metric.
- Focus VLM development on script-agnostic learning.
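When comparing one model on the same instances in two scripts, a paired test is more appropriate than comparing raw accuracies. The summary says the study's findings are supported by McNemar tests but does not say which variant was used. The sketch below assumes the exact (binomial) two-sided form, uses only the standard library, and introduces a hypothetical helper name, `mcnemar_exact`.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness lists.

    Only discordant pairs matter:
      b = instances correct in script A but wrong in script B
      c = instances wrong in script A but correct in script B
    Returns (b, c, p_value).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        # No disagreement between scripts, so no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # One-sided binomial tail under H0 (discordant pairs split 50/50),
    # doubled for a two-sided p-value and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy paired outcomes (illustrative only): the same 10 instances,
# answered in two scripts.
script_a = [True] * 8 + [True, False]
script_b = [False] * 8 + [True, False]

b, c, p = mcnemar_exact(script_a, script_b)
print(f"b={b}, c={c}, p={p:.4f}")
```

Because the test looks only at instances where the two scripts disagree, it can flag a script effect even when overall accuracies look similar.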
Topics
- Vision-Language Models
- Multilingual Evaluation
- Script Consistency
- PuMVR Benchmark
- Script Consistency Rate
- Cross-script Transfer
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.