UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
Summary
UrduMMLU is a new benchmark of 26,431 Urdu multiple-choice questions spanning 26 subjects and five domains, designed to evaluate how well large language models understand Urdu. It is built from native Urdu MCQ banks and public examination PDFs, which avoids the limitations of translation-based benchmarks and lets it cover both standard academic and region-specific content. Exam-derived questions went through dual human annotation. The evaluation covered 30 LLMs and included 60 zero-shot tests, plus few-shot settings for four open-source models. Gemini-3.5-Flash was the top performer, with reported accuracies of 90.20% and 90.34%. No other model exceeded 85%, and the strongest open-source model trailed by 7.79 to 8.92 points. Many models scored 25 to 40 points lower on Urdu-centered Humanities subjects than on STEM.
Key takeaway
For NLP engineers developing or deploying LLMs for Urdu-speaking populations, this benchmark highlights critical performance gaps. Favor models such as Gemini-3.5-Flash that demonstrate strong native-language understanding, especially for culturally specific content. Many current LLMs score 25 to 40 points lower on Urdu-centered Humanities subjects than on STEM, so consider fine-tuning, or selecting models that have been evaluated on diverse, native-sourced benchmarks like this one, to get robust application performance.
Key insights
UrduMMLU reveals that current LLMs have uneven Urdu knowledge, particularly on regionally grounded content, even though the top models perform strongly overall.
Principles
- Multilingual evaluation needs native context.
- Translation-based benchmarks are insufficient.
- Region-specific content challenges LLMs.
Method
UrduMMLU was built from native Urdu MCQ banks and public examination PDFs, covering 26 subjects and five domains, with exam-derived questions undergoing dual human annotation and strict consensus filtering.
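As a rough illustration of what dual annotation with strict consensus filtering could look like, the sketch below keeps a question only when two annotators independently select the same answer key. This is an assumption made for illustration: the summary does not say what the annotators labeled or how consensus was defined, and the names `AnnotatedQuestion` and `consensus_filter` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedQuestion:
    """One exam-derived question with two independent annotations (hypothetical schema)."""
    question_id: str
    label_a: str  # answer key chosen by the first annotator
    label_b: str  # answer key chosen by the second annotator


def consensus_filter(items):
    """Return (question, agreed_label) pairs for items where both annotators agree.

    Assumption: "strict consensus" is read here as exact agreement on the
    answer key; any disagreement drops the question.
    """
    kept = []
    for item in items:
        if item.label_a == item.label_b:
            kept.append((item, item.label_a))
    return kept


# Toy data, made up for this example only.
sample = [
    AnnotatedQuestion("q1", "A", "A"),
    AnnotatedQuestion("q2", "B", "C"),  # annotators disagree, so this is dropped
    AnnotatedQuestion("q3", "D", "D"),
]
kept = consensus_filter(sample)
kept_ids = [item.question_id for item, _ in kept]
```

Dropping disagreements outright trades dataset size for label reliability; a softer scheme would send disagreements to a third annotator instead.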
In practice
- Prioritize native-sourced data for benchmarks.
- Test LLMs on region-specific humanities.
- Evaluate models with Urdu prompts.
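One way to act on these points is to break accuracy down by domain and measure the STEM-to-Humanities gap for each model you are considering. The sketch below is a minimal example under assumed inputs: the record format `(domain, predicted, gold)`, the helper names, and the toy data are all hypothetical and are not taken from the benchmark's release.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Compute accuracy in percent per domain.

    `records` is an iterable of (domain, predicted, gold) tuples,
    an assumed format for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        if predicted == gold:
            correct[domain] += 1
    return {domain: 100.0 * correct[domain] / total[domain] for domain in total}


def domain_gap(accuracies, high="STEM", low="Humanities"):
    """Return the accuracy difference in percentage points between two domains."""
    return accuracies[high] - accuracies[low]


# Toy predictions, made up for this example only.
records = [
    ("STEM", "A", "A"),
    ("STEM", "B", "B"),
    ("Humanities", "A", "B"),
    ("Humanities", "C", "C"),
]
accuracies = accuracy_by_domain(records)
gap = domain_gap(accuracies)
```

Reporting the gap alongside overall accuracy makes it harder for a strong STEM score to mask weak performance on region-specific content.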
Topics
- Urdu Language Understanding
- Multilingual LLM Evaluation
- Benchmark Datasets
- Gemini-3.5-Flash
- Zero-shot Learning
- Few-shot Learning
Best for: AI Engineer, Machine Learning Engineer, Research Scientist, AI Scientist, NLP Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.