UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

2026-06-05 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, quick

Summary

UrduMMLU is a new benchmark of 26,431 Urdu multiple-choice questions spanning 26 subjects and five domains, designed to evaluate how well large language models understand Urdu. It is built from native Urdu MCQ banks and public examination PDFs rather than translated material, so it covers both standard academic content and region-specific content and avoids the limitations of translation-based benchmarks. Exam-derived questions went through dual human annotation. The authors evaluated 30 LLMs in 60 zero-shot runs, plus few-shot settings for four open-source models. Gemini-3.5-Flash ranked first, with 90.20% and 90.34% accuracy. No other model exceeded 85%, and the strongest open-source model trailed by 7.79 to 8.92 points. Many models scored 25 to 40 points lower on Urdu-centered Humanities subjects than on STEM.
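The Humanities-versus-STEM gap described above is a difference between per-domain accuracies. A minimal sketch of that arithmetic follows, assuming predictions are available as `(domain, predicted, gold)` tuples. The record layout, the function names, and the toy data are illustrative assumptions, not the benchmark's own format or results.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return accuracy in percent for each domain.

    records: iterable of (domain, predicted_label, gold_label) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        if predicted == gold:
            correct[domain] += 1
    return {d: 100.0 * correct[d] / total[d] for d in total}


def domain_gap(acc, high="STEM", low="Humanities"):
    """Accuracy difference in points between two domains."""
    return acc[high] - acc[low]


# Toy data for illustration only; these are not UrduMMLU results.
toy_records = [
    ("STEM", "A", "A"),
    ("STEM", "B", "B"),
    ("STEM", "C", "C"),
    ("STEM", "D", "A"),
    ("Humanities", "A", "A"),
    ("Humanities", "B", "C"),
    ("Humanities", "C", "D"),
    ("Humanities", "D", "D"),
]

toy_acc = accuracy_by_domain(toy_records)
toy_gap = domain_gap(toy_acc)
print(toy_acc, toy_gap)
```

Reporting the gap per model, rather than only overall accuracy, is what exposes a shortfall like the one the summary describes.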

Key takeaway

For NLP engineers building or deploying LLMs for Urdu-speaking populations, this benchmark exposes critical performance gaps. Favor models such as Gemini-3.5-Flash that show strong native-language understanding, especially on culturally specific content. Many current LLMs score 25 to 40 points lower on Urdu-centered Humanities subjects than on STEM, so an overall score can hide weak regional knowledge. Consider fine-tuning, or choosing models that have been evaluated on diverse, natively sourced benchmarks like this one, before relying on them in an application.

Key insights

UrduMMLU shows that current LLMs' Urdu knowledge is uneven, especially on regionally grounded content, even though the top models score well overall.

Principles

Multilingual evaluation needs native context.
Translation-based benchmarks are insufficient.
Region-specific content challenges LLMs.

Method

UrduMMLU was built from native Urdu MCQ banks and public examination PDFs, covering 26 subjects and five domains. Exam-derived questions underwent dual human annotation followed by strict consensus filtering.
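The summary does not spell out how the consensus filter works. One plausible reading, sketched below, keeps an exam-derived question only when both annotators chose the same answer. The `AnnotatedItem` type, its field names, and the agreement rule are assumptions made for illustration, not the authors' documented pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedItem:
    # Hypothetical record: one question with two independent answer labels.
    question_id: str
    annotator_a: str
    annotator_b: str


def strict_consensus(items):
    """Keep (question_id, label) pairs only where both annotators agree.

    Assumed rule: any disagreement drops the item entirely.
    """
    kept = []
    for item in items:
        if item.annotator_a == item.annotator_b:
            kept.append((item.question_id, item.annotator_a))
    return kept


# Toy annotations for illustration only.
toy_items = [
    AnnotatedItem("q1", "A", "A"),
    AnnotatedItem("q2", "B", "C"),  # disagreement, so it is dropped
    AnnotatedItem("q3", "D", "D"),
]

toy_kept = strict_consensus(toy_items)
print(toy_kept)
```

Dropping disagreements outright trades dataset size for label reliability. A softer variant would send disputed items to a third annotator instead.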

In practice

Prioritize native-sourced data for benchmarks.
Test LLMs on region-specific humanities.
Evaluate models with Urdu prompts.
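As a rough illustration of the last point, the sketch below formats a multiple-choice item as an Urdu-labelled prompt, with optional solved exemplars for a few-shot setting. The template, the labels "سوال" (question) and "جواب" (answer), the A to D option letters, and the sample question are all assumptions made for this example. The benchmark's actual prompt wording is not given in this summary.

```python
QUESTION_LABEL = "سوال"  # Urdu for "question" (assumed label)
ANSWER_LABEL = "جواب"  # Urdu for "answer" (assumed label)
OPTION_LETTERS = "ABCD"


def format_item(question, options, answer=None):
    """Render one MCQ; leave the answer slot empty when answer is None."""
    lines = [f"{QUESTION_LABEL}: {question}"]
    for letter, option in zip(OPTION_LETTERS, options):
        lines.append(f"{letter}. {option}")
    lines.append(f"{ANSWER_LABEL}:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)


def build_prompt(question, options, exemplars=()):
    """Zero-shot when exemplars is empty; few-shot otherwise.

    exemplars: iterable of (question, options, answer_letter) tuples.
    """
    blocks = [format_item(q, o, a) for q, o, a in exemplars]
    blocks.append(format_item(question, options))
    return "\n\n".join(blocks)


# Toy item written for this sketch; it is not taken from the benchmark.
toy_question = "پاکستان کا دارالحکومت کیا ہے؟"
toy_options = ["کراچی", "لاہور", "اسلام آباد", "پشاور"]

zero_shot_prompt = build_prompt(toy_question, toy_options)
few_shot_prompt = build_prompt(
    toy_question,
    toy_options,
    exemplars=[(toy_question, toy_options, "C")],
)
print(zero_shot_prompt)
```

Keeping the labels in module-level constants makes it easy to run the same items with Urdu and English templates and compare the resulting scores.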

Topics

Urdu Language Understanding
Multilingual LLM Evaluation
Benchmark Datasets
Gemini-3.5-Flash
Zero-shot Learning
Few-shot Learning

Best for: AI Engineer, Machine Learning Engineer, Research Scientist, AI Scientist, NLP Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.