Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Summary
A new research paper introduces a method and dataset for evaluating large language models (LLMs) on culturally grounded and dialectal Arabic content. The method translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, converts them into open-ended questions (OEQs), and benchmarks a range of zero-shot and fine-tuned LLMs. It also generates chain-of-thought (CoT) rationales that are used to fine-tune models for step-by-step reasoning. The work extends an existing dataset and produces the first QA resource for Arabic cultural OEQs in which items are aligned in parallel across multiple language varieties. Experiments with models such as Falcon3-10B-Instruct, Qwen2.5-7B, GPT-4.1, and GPT-5 show that LLMs underperform on Arabic dialects, that they struggle with OEQs even when they perform well on MCQs, and that CoT improves judged correctness but gives mixed results on n-gram-based metrics.
Key takeaway
AI engineers and research scientists developing multilingual LLMs should recognize that current models, even fine-tuned ones, significantly underperform on Arabic dialects and on open-ended cultural questions compared with Modern Standard Arabic. Prioritize building and evaluating models with robust dialectal and open-ended reasoning capabilities, because traditional MCQ benchmarks may not reflect real-world performance. Consider CoT fine-tuning to improve semantic correctness in generative tasks, keeping in mind that it may not raise n-gram overlap metrics.
Key insights
LLMs show significant performance gaps on culturally grounded, dialectal Arabic content, especially when questions are open-ended.
Principles
- MCQ evaluations can mask LLM reasoning deficiencies.
- CoT improves semantic acceptability but not always lexical overlap.
- Dialectal Arabic poses greater challenges than MSA for LLMs.
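The second principle can be illustrated with a toy overlap metric. The sketch below is not the paper's evaluation code: `unigram_f1` is a hypothetical helper that computes whitespace-token F1, and the reference and answers are made-up English strings, not items from the dataset. It shows only why a longer, reasoned answer that contains the right entity can still score low on lexical overlap.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased whitespace tokens (illustrative metric only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example strings (assumption: not taken from the benchmark).
reference = "mansaf"
short_answer = "mansaf"
cot_answer = (
    "the question asks about the national dish of jordan "
    "so the answer is mansaf"
)

short_score = unigram_f1(short_answer, reference)
cot_score = unigram_f1(cot_answer, reference)
print(f"short: {short_score:.3f}  cot-style: {cot_score:.3f}")
```

Both answers name the same entity, so a judge would likely accept both, yet the longer one is penalized on precision. That is one plausible reason judged correctness and n-gram metrics can move in different directions.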
Method
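The method has four stages: it translates MSA MCQs into English and several Arabic dialects, converts the MCQs into OEQs, benchmarks LLMs on the result, and generates CoT rationales for fine-tuning. The outcome is a culturally grounded QA dataset whose items are aligned in parallel across language varieties.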
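As a rough picture of what a parallel-aligned resource might look like in code, here is a minimal sketch. This summary does not describe how the authors actually convert MCQs to OEQs, so the conversion below assumes the simplest possible approach (drop the options, keep the gold option text as the reference answer). All class names, field names, variety labels, and strings are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class MCQItem:
    item_id: str          # shared across varieties so items can be aligned
    variety: str          # hypothetical labels, e.g. "MSA", "English", "dialect_a"
    question: str
    options: list[str]
    answer_index: int


@dataclass
class OEQItem:
    item_id: str
    variety: str
    question: str
    reference_answer: str


def mcq_to_oeq(item: MCQItem) -> OEQItem:
    """Assumed conversion: remove the options, keep the gold option as reference."""
    return OEQItem(
        item_id=item.item_id,
        variety=item.variety,
        question=item.question,
        reference_answer=item.options[item.answer_index],
    )


def align_by_id(items: list[OEQItem]) -> dict[str, dict[str, OEQItem]]:
    """Group items so each id maps to its versions in every language variety."""
    aligned: dict[str, dict[str, OEQItem]] = {}
    for item in items:
        aligned.setdefault(item.item_id, {})[item.variety] = item
    return aligned


# Placeholder content only; real items would carry Arabic and English text.
mcqs = [
    MCQItem("q1", "MSA", "<question in MSA>", ["<a>", "<b>", "<c>"], 1),
    MCQItem("q1", "English", "<question in English>", ["<a-en>", "<b-en>", "<c-en>"], 1),
]
oeqs = [mcq_to_oeq(m) for m in mcqs]
aligned = align_by_id(oeqs)
```

Keying every variety on a shared `item_id` is what makes cross-variety comparison possible: the same underlying question can be scored in MSA, English, and each dialect.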
In practice
- Use OEQs for rigorous LLM evaluation.
- Apply CoT fine-tuning for improved semantic correctness.
- Benchmark LLMs on dialectal content to reveal knowledge gaps.
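For the last point, a per-variety breakdown is often enough to expose a gap. The sketch below assumes judged results are available as `(variety, is_correct)` pairs. The function names, the variety labels, and the toy judgements are illustrative assumptions, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_variety(judgements):
    """Compute accuracy per language variety from (variety, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for variety, is_correct in judgements:
        totals[variety] += 1
        correct[variety] += int(is_correct)
    return {variety: correct[variety] / totals[variety] for variety in totals}


def gap_to_baseline(accuracies, baseline="MSA"):
    """Difference between each variety's accuracy and the baseline variety's."""
    base = accuracies[baseline]
    return {v: acc - base for v, acc in accuracies.items() if v != baseline}


# Toy judgements, invented purely to exercise the functions.
toy_judgements = [
    ("MSA", True), ("MSA", True), ("MSA", False), ("MSA", True),
    ("dialect_a", True), ("dialect_a", False),
    ("dialect_a", False), ("dialect_a", True),
]

accuracies = accuracy_by_variety(toy_judgements)
gaps = gap_to_baseline(accuracies)
print(accuracies, gaps)
```

A negative gap for a dialect means the model answered fewer of the parallel items correctly there than in MSA, which is the kind of difference an aggregate score would hide.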
Topics
- Arabic Cultural QA
- Dialectal Arabic
- Open-Ended Questions
- Multiple-Choice Questions
- Large Language Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.