Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study introduces PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark for evaluating how Vision-Language Models (VLMs) perform on languages written in multiple scripts. The benchmark comprises 1,000 strictly parallel image-text instances in Punjabi, covering its three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, the researchers identified a significant "Script Gap": models frequently solve a visual task in one script but fail the identical task in another, with accuracy differences reaching 16%. Visual input uniformly improves absolute performance but does not close this orthographic disparity. The study also found that cross-script in-context transfer is highly brittle, indicating script-locked knowledge representation. These findings, supported by McNemar tests, indicate that current "multilingual" VLMs are not truly multi-script. The authors propose the Script Consistency Rate (SCR) as a metric for script-agnostic evaluation and equitable AI access; SCR fell as low as 24.8% on PuMVR.

Key takeaway

Machine Learning Engineers developing or evaluating multilingual Vision-Language Models should recognize that current models are not truly multi-script. Incorporate benchmarks like PuMVR and metrics such as the Script Consistency Rate (SCR) into your evaluations to expose and address "Script Gaps." Doing so helps your models provide equitable access and perform reliably across diverse orthographies, rather than relying on script-locked knowledge representation.

Key insights

Current "multilingual" VLMs exhibit a significant "Script Gap": they solve a visual task in one script of a language but fail the identical task in another script of the same language.
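One way to check whether per-item disagreement between two scripts is more than noise is McNemar's test on the paired outcomes. The source says its findings are supported by McNemar tests but does not say which variant was used. The sketch below assumes the exact (binomial) form. The function name `mcnemar_exact` and the toy outcome lists are hypothetical illustrations, not PuMVR data.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-item outcomes.

    Assumption: the exact binomial variant; the study may use another form.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("outcome lists must be paired item by item")
    # Discordant pairs: the item is solved in one script but not in the other.
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if not a and o)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis, discordant pairs split 50/50 between directions.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy outcomes for one model on the same six items (not PuMVR results).
gurmukhi = [True, True, True, True, True, False]
roman = [False, False, False, False, False, False]
p_value = mcnemar_exact(gurmukhi, roman)
```

Only the discordant items drive the test, which matches the question the Script Gap raises: items that are solved in one script and failed in another.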

Method

The study evaluates 10 state-of-the-art VLMs on the PuMVR benchmark, which contains 1,000 parallel image-text instances across Punjabi's three scripts. It measures the "Script Gap" between scripts and proposes the Script Consistency Rate (SCR) for script-agnostic evaluation.
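A minimal sketch of how these two quantities might be computed from per-item correctness follows. The source does not give formulas, so both definitions here are assumptions: the Script Gap is taken as the difference between the best and worst per-script accuracy, and SCR as the share of items answered correctly in every script. The paper's actual definitions may differ. All names and the toy data are hypothetical.

```python
from typing import Dict, List

SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def accuracy(correct: List[bool]) -> float:
    """Fraction of items answered correctly."""
    return sum(correct) / len(correct)


def script_gap(results: Dict[str, List[bool]]) -> float:
    """Assumed definition: best minus worst per-script accuracy."""
    accs = [accuracy(results[s]) for s in SCRIPTS]
    return max(accs) - min(accs)


def script_consistency_rate(results: Dict[str, List[bool]]) -> float:
    """Assumed definition: share of items correct in all scripts."""
    per_item = zip(*(results[s] for s in SCRIPTS))  # one tuple of outcomes per item
    outcomes = [all(item) for item in per_item]
    return sum(outcomes) / len(outcomes)


# Hypothetical toy results for five parallel items (not PuMVR data).
toy = {
    "gurmukhi": [True, True, False, True, False],
    "shahmukhi": [True, False, False, True, False],
    "roman": [True, True, True, False, False],
}
gap = script_gap(toy)
scr = script_consistency_rate(toy)
```

In this toy case the per-script accuracies are 0.6, 0.4, and 0.6, yet only one of five items is solved in all three scripts. Under these assumed definitions, a per-item consistency metric can therefore expose failures that per-script accuracy averages hide.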

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.