UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics · Depth: Expert, extended

Summary

UrduMMLU is a massive multitask benchmark for evaluating Large Language Models (LLMs) on Urdu language understanding. It comprises 26,431 multiple-choice questions (MCQs) across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based benchmarks, it covers both standard academic subjects and Urdu- and region-specific content. The exam-derived portion was labeled through dual human annotation with strict consensus filtering. In evaluations of 30 LLMs, Gemini-3.5-Flash performed best, reaching 90.20% accuracy with English prompts and 90.34% with Urdu prompts; no other model exceeded 85%. Open-source models trailed by 7.79 to 8.92 points. Accuracy on Urdu-centered Humanities subjects was 25 to 40 points lower than on STEM, which indicates that current LLMs' Urdu knowledge is uneven, particularly for culturally grounded content. Few-shot prompting yielded only modest gains.

Key takeaway

NLP engineers developing or deploying LLMs for Urdu-speaking populations should evaluate with native, culturally grounded benchmarks such as UrduMMLU. Models are likely to score markedly lower on Urdu-centered Humanities subjects than on STEM, even when their English performance is strong. Focus fine-tuning efforts on Urdu literary and cultural datasets to narrow this knowledge gap and make model capabilities more equitable.

Key insights

Native, culturally-grounded benchmarks reveal significant LLM knowledge gaps in non-English, humanities-focused content.
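The kind of gap described above can be measured by scoring predictions per domain and subtracting. The following is a minimal sketch, not the authors' evaluation code: the record fields (`domain`, `prediction`, `gold`), the function names, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return accuracy in percent for each value of `key` (e.g. "domain")."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        # A prediction counts as correct only if it matches the gold option label.
        correct[group] += int(record["prediction"] == record["gold"])
    return {group: 100.0 * correct[group] / totals[group] for group in totals}


def gap_in_points(accuracies, reference, target):
    """Percentage-point difference: reference accuracy minus target accuracy."""
    return accuracies[reference] - accuracies[target]


# Toy records (hypothetical, not taken from UrduMMLU).
toy_records = [
    {"domain": "STEM", "prediction": "A", "gold": "A"},
    {"domain": "STEM", "prediction": "B", "gold": "B"},
    {"domain": "STEM", "prediction": "C", "gold": "C"},
    {"domain": "STEM", "prediction": "D", "gold": "A"},
    {"domain": "Humanities", "prediction": "A", "gold": "A"},
    {"domain": "Humanities", "prediction": "B", "gold": "C"},
    {"domain": "Humanities", "prediction": "C", "gold": "D"},
    {"domain": "Humanities", "prediction": "D", "gold": "A"},
]

toy_accuracies = accuracy_by_group(toy_records, key="domain")
toy_gap = gap_in_points(toy_accuracies, reference="STEM", target="Humanities")
```

Reporting the gap in percentage points, rather than as a ratio, matches how the summary states the 25 to 40 point drop.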

Method

The authors built UrduMMLU by extracting 26,431 MCQs from native Urdu educational PDFs and MCQ websites. Exam-derived questions were then labeled through dual human annotation with strict consensus filtering.
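One plausible reading of "dual human annotation with strict consensus filtering" is that a question is kept only when two independent annotators choose the same answer. The sketch below illustrates that reading; the summary does not specify the exact rule, so the `Annotation` type, the `consensus_filter` function, and the two-annotator agreement criterion are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    question_id: str
    annotator: str
    answer: str  # chosen option label, e.g. "A"


def consensus_filter(annotations):
    """Map question_id -> label, keeping only questions where exactly two
    distinct annotators gave the same answer (assumed consensus rule)."""
    votes_by_question = {}
    for ann in annotations:
        votes_by_question.setdefault(ann.question_id, {})[ann.annotator] = ann.answer

    kept = {}
    for question_id, votes in votes_by_question.items():
        labels = set(votes.values())
        # Drop questions with a missing second annotation or any disagreement.
        if len(votes) == 2 and len(labels) == 1:
            kept[question_id] = labels.pop()
    return kept


# Toy annotations (hypothetical).
toy_annotations = [
    Annotation("q1", "annotator_1", "B"),
    Annotation("q1", "annotator_2", "B"),  # agreement: kept
    Annotation("q2", "annotator_1", "A"),
    Annotation("q2", "annotator_2", "C"),  # disagreement: dropped
    Annotation("q3", "annotator_1", "D"),  # only one annotation: dropped
]

toy_labels = consensus_filter(toy_annotations)
```

A strict rule like this trades dataset size for label reliability: disputed questions are discarded rather than adjudicated.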

Best for: Research Scientist, AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.