UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

2026-05-23 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics · Depth: Expert, extended

Summary

UrduMMLU is a massive multitask benchmark for evaluating Large Language Models (LLMs) on Urdu language understanding. It comprises 26,431 multiple-choice questions (MCQs) across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based benchmarks, it covers both standard academic subjects and Urdu- and region-specific content. The exam-derived portion was labeled through dual human annotation with strict consensus filtering. In an evaluation of 30 LLMs, Gemini-3.5-Flash performed best, reaching 90.20% accuracy with English prompts and 90.34% with Urdu prompts; no other model exceeded 85%. Open-source models trailed by 7.79 to 8.92 points. Accuracy dropped by 25 to 40 points on Urdu-centered Humanities subjects relative to STEM, which indicates that current LLMs' Urdu knowledge is uneven, particularly for culturally grounded content. Few-shot prompting yielded only modest gains.

Key takeaway

NLP engineers developing or deploying LLMs for Urdu-speaking populations should evaluate with native, culturally grounded benchmarks such as UrduMMLU. Expect markedly lower accuracy on Urdu-centered Humanities subjects than on STEM (a drop of 25 to 40 points in the reported results), even for models with strong English performance. Focus fine-tuning efforts on Urdu literary and cultural datasets to narrow this gap and make model capabilities more equitable.

Key insights

Native, culturally grounded benchmarks reveal significant LLM knowledge gaps in non-English, humanities-focused content.

Principles

Multilingual LLM evaluation requires native educational and cultural context.
English-centric benchmark performance does not reliably transfer to regional knowledge.
LLMs struggle far more with Urdu-centered Humanities than with STEM (a 25 to 40 point drop in UrduMMLU).

Method

UrduMMLU was constructed by extracting 26,431 MCQs from native Urdu educational PDFs and MCQ websites, followed by dual human annotation with strict consensus filtering for exam-derived questions.
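
The consensus step can be illustrated with a minimal sketch. It assumes "strict consensus" means an exam-derived question is kept only when both annotators independently assign the same answer key; the summary does not spell out the exact rule, and the `AnnotatedItem` fields and the `consensus_filter` function are hypothetical names chosen for illustration, not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnnotatedItem:
    """One exam-derived MCQ with two independent answer labels (hypothetical schema)."""
    question: str
    options: tuple[str, ...]
    label_a: Optional[str]  # answer key from annotator A, e.g. "B"; None if left unlabeled
    label_b: Optional[str]  # answer key from annotator B


def consensus_filter(items):
    """Keep only items where both annotators gave the same, non-missing label.

    Assumption: disagreement or a missing label discards the item outright,
    with no adjudication step.
    Returns a list of (item, agreed_label) pairs.
    """
    kept = []
    for item in items:
        if item.label_a is not None and item.label_a == item.label_b:
            kept.append((item, item.label_a))
    return kept
```

Discarding disagreements rather than adjudicating them trades dataset size for label reliability, which is the usual motivation for a strict filter.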

In practice

Utilize native benchmarks like UrduMMLU to assess LLM cultural knowledge.
Prioritize Urdu-specific fine-tuning for humanities and culturally-grounded content.
Anticipate lower LLM accuracy on non-STEM, regionally-specific subjects.
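
To act on these points, it helps to report accuracy per domain instead of a single aggregate score. The sketch below computes per-group accuracy and the gap in percentage points between two groups. The input format (pairs of group name and correctness flag) and the function names `accuracy_by_group` and `gap_in_points` are assumptions for illustration; UrduMMLU's actual data schema is not described in this summary.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy (in percent) per group.

    `records` is an iterable of (group, is_correct) pairs, where `group` is
    e.g. a domain or subject name and `is_correct` is a bool.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: 100.0 * correct[group] / totals[group] for group in totals}


def gap_in_points(acc, reference, target):
    """Percentage-point gap: accuracy on `reference` minus accuracy on `target`."""
    return acc[reference] - acc[target]


# Toy data, not results from the paper.
toy_records = [
    ("STEM", True), ("STEM", True), ("STEM", True), ("STEM", True),
    ("Humanities", True), ("Humanities", False),
]
toy_acc = accuracy_by_group(toy_records)
toy_gap = gap_in_points(toy_acc, "STEM", "Humanities")
```

Grouping by subject instead of domain uses the same function; only the first element of each pair changes.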

Topics

Urdu Language
LLM Evaluation
Multilingual NLP
Benchmark Datasets
Cultural Bias
Natural Language Understanding

Best for: Research Scientist, AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.