UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
Summary
UrduMMLU is a multitask benchmark for evaluating Large Language Models (LLMs) on Urdu language understanding. It comprises 26,431 multiple-choice questions (MCQs) across 26 subjects and five domains. The questions were collected from native Urdu MCQ banks and public examination PDFs, so the benchmark covers standard academic subjects as well as Urdu- and region-specific content that translation-based benchmarks miss. The exam-derived portion was labeled through dual human annotation with strict consensus filtering. In evaluations of 30 LLMs, Gemini-3.5-Flash performed best, reaching 90.20% accuracy with English prompts and 90.34% with Urdu prompts; no other model exceeded 85%. Open-source models trailed by 7.79 to 8.92 points. Accuracy dropped by 25 to 40 points on Urdu-centered Humanities subjects relative to STEM, which indicates that current LLMs hold uneven Urdu knowledge, particularly for culturally grounded content. Few-shot prompting yielded only modest gains.
Key takeaway
NLP engineers developing or deploying LLMs for Urdu-speaking populations should evaluate with native, culturally grounded benchmarks such as UrduMMLU. Expect markedly lower accuracy on Urdu-centered Humanities subjects than on STEM, even from models that perform well in English. Fine-tuning on Urdu literary and cultural data is a plausible way to narrow this gap.
Key insights
Native, culturally grounded benchmarks reveal significant LLM knowledge gaps in non-English, humanities-focused content.
Principles
- Multilingual LLM evaluation requires native educational and cultural context.
- English-centric benchmark performance does not reliably transfer to regional knowledge.
- On UrduMMLU, LLMs score 25 to 40 points lower on Urdu-centered Humanities than on STEM.
Method
UrduMMLU was constructed by extracting 26,431 MCQs from native Urdu educational PDFs and MCQ websites, followed by dual human annotation with strict consensus filtering for exam-derived questions.
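The consensus step can be illustrated with a minimal sketch. It assumes that "strict consensus" means a question is kept only when both annotators independently chose the same answer key; the summary does not spell out the exact rule, and the `AnnotatedQuestion` record and `consensus_filter` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class AnnotatedQuestion:
    """Hypothetical record: one MCQ with two independent annotations."""
    question_id: str
    label_a: Optional[str]  # answer key from annotator A, None if unanswered
    label_b: Optional[str]  # answer key from annotator B, None if unanswered


def consensus_filter(items: Iterable[AnnotatedQuestion]) -> Dict[str, str]:
    """Keep only questions where both annotators gave the same answer key.

    Assumption: any disagreement or missing label drops the question.
    """
    kept: Dict[str, str] = {}
    for item in items:
        if item.label_a is not None and item.label_a == item.label_b:
            kept[item.question_id] = item.label_a
    return kept


# Toy data, not taken from the benchmark.
sample = [
    AnnotatedQuestion("q1", "B", "B"),   # agreement: kept
    AnnotatedQuestion("q2", "A", "C"),   # disagreement: dropped
    AnnotatedQuestion("q3", None, "D"),  # missing label: dropped
]
gold = consensus_filter(sample)
print(gold)  # {'q1': 'B'}
```

Dropping disagreements instead of adjudicating them trades dataset size for label reliability, which is one reasonable reading of "strict" filtering.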
In practice
- Use native benchmarks such as UrduMMLU to assess LLM cultural knowledge.
- Prioritize Urdu-specific fine-tuning for humanities and culturally-grounded content.
- Anticipate lower LLM accuracy on non-STEM, regionally-specific subjects.
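To act on the points above, it helps to report accuracy per domain and not only as a single aggregate, so that a gap between STEM and Humanities is visible. The sketch below assumes predictions are available as `(domain, predicted, gold)` tuples; the function names and the toy records are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (domain, predicted answer key, gold answer key)


def accuracy_by_domain(records: Iterable[Record]) -> Dict[str, float]:
    """Return accuracy in percent for each domain present in the records."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        correct[domain] += int(predicted == gold)
    return {domain: 100.0 * correct[domain] / total[domain] for domain in total}


def domain_gap(acc: Dict[str, float], reference: str, target: str) -> float:
    """Accuracy of the reference domain minus the target domain, in points."""
    return acc[reference] - acc[target]


# Toy predictions for illustration only.
records = [
    ("STEM", "A", "A"), ("STEM", "B", "B"), ("STEM", "C", "C"), ("STEM", "D", "A"),
    ("Humanities", "A", "A"), ("Humanities", "B", "C"),
    ("Humanities", "D", "A"), ("Humanities", "C", "B"),
]
acc = accuracy_by_domain(records)
print(acc)                                    # {'STEM': 75.0, 'Humanities': 25.0}
print(domain_gap(acc, "STEM", "Humanities"))  # 50.0
```

Running the same breakdown separately for English-prompt and Urdu-prompt runs would show whether the gap depends on the prompt language.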
Topics
- Urdu Language
- LLM Evaluation
- Multilingual NLP
- Benchmark Datasets
- Cultural Bias
- Natural Language Understanding
Best for: Research Scientist, AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.