Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

2026-06-15 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study introduces PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark for evaluating how Vision-Language Models (VLMs) perform on languages written in more than one script. It comprises 1,000 strictly parallel image-text instances in Punjabi, covering its three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, the researchers identified a significant "Script Gap": models frequently solve a visual task in one script but fail the identical task in another, with accuracy differences reaching 16%. Visual input uniformly improves absolute performance but does not close this orthographic disparity. Cross-script in-context transfer is also highly brittle, which points to script-locked knowledge representation. These findings, supported by McNemar tests, indicate that current "multilingual" VLMs are not truly multi-script. The authors propose the Script Consistency Rate (SCR), which recorded values as low as 24.8% on PuMVR, as a metric for script-agnostic evaluation and more equitable AI access.

Key takeaway

Machine Learning Engineers developing or evaluating multilingual Vision-Language Models should recognize that current models are not truly multi-script. Evaluations should incorporate benchmarks like PuMVR and metrics such as the Script Consistency Rate (SCR) to expose and address "Script Gaps." Doing so helps models provide equitable access and perform reliably across orthographies instead of relying on script-locked knowledge representation.

Key insights

Current "multilingual" VLMs exhibit a significant "Script Gap": they solve a visual task in one script of a language but fail the identical task in another script of the same language.

Principles

VLMs implicitly treat each language as having a single script.
Visual input doesn't resolve script-specific failures.
Cross-script knowledge transfer is highly brittle.

Method

The study evaluates 10 state-of-the-art VLMs on the PuMVR benchmark, which contains 1,000 parallel image-text instances across Punjabi's three scripts. It measures the "Script Gap" between scripts, and the authors propose the Script Consistency Rate (SCR) for script-agnostic evaluation.
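
The two quantities named above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the summary does not give the exact formula for SCR, so the sketch assumes SCR is the share of instances a model answers correctly in all three scripts, and that the Script Gap is the difference between the best and worst per-script accuracy. The record layout, the function names, and the toy data are all hypothetical.

```python
# Hypothetical record layout: one dict per parallel instance, mapping
# each script name to whether the model answered correctly in that script.
SCRIPTS = ("gurmukhi", "shahmukhi", "roman")


def accuracy(results, script):
    """Fraction of instances answered correctly in one script."""
    return sum(1 for r in results if r[script]) / len(results)


def script_gap(results):
    """ASSUMED definition: best minus worst per-script accuracy."""
    accs = [accuracy(results, s) for s in SCRIPTS]
    return max(accs) - min(accs)


def script_consistency_rate(results):
    """ASSUMED definition: share of instances correct in every script."""
    consistent = sum(1 for r in results if all(r[s] for s in SCRIPTS))
    return consistent / len(results)


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    {"gurmukhi": True, "shahmukhi": True, "roman": True},
    {"gurmukhi": True, "shahmukhi": False, "roman": True},
    {"gurmukhi": True, "shahmukhi": False, "roman": False},
    {"gurmukhi": False, "shahmukhi": False, "roman": False},
]

if __name__ == "__main__":
    for s in SCRIPTS:
        print(f"{s}: accuracy = {accuracy(toy_results, s):.2f}")
    print(f"script gap = {script_gap(toy_results):.2f}")
    print(f"SCR = {script_consistency_rate(toy_results):.2f}")
```

Under these assumptions, a model can have respectable accuracy in each script and still score a low SCR, because SCR only credits instances solved in every script at once. If the paper defines SCR differently (for example, as agreement of answers regardless of correctness), only `script_consistency_rate` would need to change.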

In practice

Evaluate VLMs using multi-script benchmarks.
Implement the Script Consistency Rate (SCR) metric.
Focus VLM development on script-agnostic learning.
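
When comparing two scripts on the same parallel instances, a paired test can indicate whether an accuracy difference is more than noise. The summary says the findings are supported by McNemar tests but does not say which variant was used, so the sketch below assumes the exact (binomial) two-sided form. The function name and the toy inputs are hypothetical, and only the Python standard library is used.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired binary outcomes.

    correct_a[i] and correct_b[i] say whether the model solved instance i
    in script A and in script B. Only discordant pairs carry information.
    Returns (b, c, p_value), where b counts "A right, B wrong" and
    c counts "A wrong, B right".
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("inputs must be paired and of equal length")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a gap
    # Under the null hypothesis, discordant pairs split 50/50 between b and c.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    script_a = [True] * 10 + [False] * 2
    script_b = [False] * 8 + [True] * 2 + [True, False]
    b, c, p = mcnemar_exact(script_a, script_b)
    print(f"A-only correct: {b}, B-only correct: {c}, p = {p:.4f}")
```

The test ignores instances where both scripts agree, which suits this setting: the question is whether the model fails more often in one script than the other on the same items, not how accurate it is overall.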

Topics

Vision-Language Models
Multilingual Evaluation
Script Consistency
PuMVR Benchmark
Script Consistency Rate
Cross-script Transfer

Code references

prabhjotschugh/Not-Truly-Multilingual-PuMVR

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.