Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

A new research paper introduces a method and dataset for evaluating Large Language Models (LLMs) on culturally grounded and dialectal Arabic content. The method translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, converts them into open-ended questions (OEQs), and benchmarks a range of zero-shot and fine-tuned LLMs. It also generates Chain-of-Thought (CoT) rationales for fine-tuning models on step-by-step reasoning. The work extends an existing dataset to create the first QA resource for Arabic cultural OEQs that is aligned in parallel across multiple language varieties. Experiments with models such as Falcon3-10B-Instruct, Qwen2.5-7B, GPT-4.1, and GPT-5 show three things: LLMs underperform on Arabic dialects; they struggle with OEQs despite performing well on MCQs; and CoT improves judged correctness while producing mixed results on n-gram-based metrics.

Key takeaway

AI engineers and research scientists developing multilingual LLMs should recognize that current models, even fine-tuned ones, perform significantly worse on Arabic dialects and open-ended cultural questions than on Modern Standard Arabic. Prioritize building and evaluating models for dialectal and open-ended reasoning, because traditional MCQ benchmarks may not reflect real-world performance. Consider CoT fine-tuning to improve semantic correctness in generative tasks, bearing in mind that it may not raise n-gram overlap scores.
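The gap between judged correctness and n-gram overlap can be illustrated with a toy metric. The sketch below is not the paper's evaluation code: `unigram_f1` is a hypothetical helper that computes a simple token-overlap F1, and the example strings are invented for illustration. It shows one plausible reason a step-by-step answer can be correct yet score poorly on overlap: the extra reasoning tokens dilute precision against a short reference.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference (toy metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Invented example strings (assumptions, not taken from the dataset).
reference = "mansaf"
short_answer = "mansaf"
cot_answer = (
    "the dish is traditionally served at weddings so the answer is mansaf"
)

short_score = unigram_f1(short_answer, reference)
cot_score = unigram_f1(cot_answer, reference)
print(f"short answer F1: {short_score:.3f}")
print(f"CoT answer F1:   {cot_score:.3f}")
```

Both answers contain the reference, so a judge focused on meaning could accept either, yet the longer answer receives a much lower overlap score. Whitespace tokenization is a simplification here; Arabic text would usually need proper normalization and tokenization first.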

Key insights

LLMs show significant performance gaps on culturally grounded, dialectal Arabic, especially on open-ended questions.

Method

The method translates MSA MCQs into English and Arabic dialects, converts them into OEQs, benchmarks LLMs on the result, and generates CoT rationales for fine-tuning. The outcome is a parallel, culturally aligned QA dataset.
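The data side of this pipeline can be sketched as a pair of record types. This is a minimal illustration under stated assumptions, not the paper's schema: `MCQItem`, `OEQItem`, `mcq_to_oeq`, the field names, and the variety codes are all hypothetical. The conversion here simply drops the answer options and keeps the correct option as the reference answer, whereas the actual method may also rephrase the question. Translations are passed in as ready-made strings, and English placeholders stand in for the Arabic text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MCQItem:
    """A multiple-choice item in MSA (hypothetical schema)."""
    item_id: str
    question: str
    options: list[str]
    answer_index: int


@dataclass
class OEQItem:
    """An open-ended item with parallel question variants (hypothetical schema)."""
    item_id: str
    variants: dict[str, str]  # variety code -> question text
    gold_answer: str
    rationale: str = ""       # optional CoT rationale for fine-tuning


def mcq_to_oeq(mcq: MCQItem, translations: dict[str, str]) -> OEQItem:
    """Drop the options and keep the correct one as the reference answer."""
    if not 0 <= mcq.answer_index < len(mcq.options):
        raise ValueError(f"answer_index out of range for item {mcq.item_id}")
    # The MSA source and every translation share one item_id, which is
    # what keeps the varieties aligned in parallel.
    variants = {"msa": mcq.question, **translations}
    return OEQItem(
        item_id=mcq.item_id,
        variants=variants,
        gold_answer=mcq.options[mcq.answer_index],
    )


# Placeholder content; real items would hold Arabic text.
mcq = MCQItem(
    item_id="q-001",
    question="<MSA question text>",
    options=["<option A>", "<option B>", "<option C>"],
    answer_index=1,
)
oeq = mcq_to_oeq(
    mcq,
    translations={"en": "<English question>", "egy": "<Egyptian question>"},
)
print(oeq.item_id, sorted(oeq.variants), oeq.gold_answer)
```

Keying every variant under a single `item_id` lets scores be compared across varieties for the same underlying question, which is the point of a parallel resource.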

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.