Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

A new research paper introduces a method and dataset for evaluating Large Language Models (LLMs) on culturally grounded and dialectal Arabic content. The method translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, converts them into open-ended questions (OEQs), and benchmarks a range of zero-shot and fine-tuned LLMs. It also generates Chain-of-Thought (CoT) rationales that are used to fine-tune models for step-by-step reasoning. The work extends an existing dataset and, according to the authors, yields the first QA resource for Arabic cultural OEQs that is aligned in parallel across multiple language varieties. Experiments with models such as Falcon3-10B-Instruct, Qwen2.5-7B, GPT-4.1, and GPT-5 show three things: LLMs underperform on Arabic dialects, they struggle with OEQs even when they do well on the corresponding MCQs, and CoT improves judged correctness while giving mixed results on n-gram-based metrics.

Key takeaway

AI engineers and research scientists developing multilingual LLMs should recognize that current models, even fine-tuned ones, underperform on Arabic dialects relative to Modern Standard Arabic, and on open-ended cultural questions relative to MCQs. Prioritize developing and evaluating models with robust dialectal and open-ended reasoning capabilities, because traditional MCQ benchmarks may not reflect real-world performance. Consider CoT fine-tuning to improve judged correctness in generative tasks, keeping in mind that it may not raise n-gram overlap metrics.

Key insights

LLMs exhibit significant performance gaps on culturally grounded, dialectal Arabic, especially on open-ended questions.

Principles

MCQ evaluations can mask LLM reasoning deficiencies.
CoT improves semantic acceptability but not always lexical overlap.
Dialectal Arabic poses greater challenges than MSA for LLMs.
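
The second principle is easier to see with a toy overlap metric. The sketch below is a minimal illustration, not the paper's scoring code: `ngram_f1` is a hypothetical helper that computes a plain n-gram F1 with whitespace tokenization, and the three English strings are invented for this example. A longer, step-by-step answer that reaches the right conclusion shares fewer tokens with a terse reference than a short answer does, so its overlap score drops even though a judge would accept it.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Hypothetical helper: clipped n-gram F1 with naive whitespace tokens."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented strings, used only to illustrate the effect.
reference = "the dish is served during ramadan"
short_answer = "it is served during ramadan"
cot_answer = (
    "people traditionally eat it in the holy month of fasting "
    "so the answer is ramadan"
)

short_score = ngram_f1(short_answer, reference)
cot_score = ngram_f1(cot_answer, reference)
print(f"short answer F1: {short_score:.3f}")
print(f"CoT-style answer F1: {cot_score:.3f}")
```

Both answers name the same fact, yet the CoT-style answer scores lower because the extra reasoning tokens dilute precision. This is one plausible reading of why judged correctness and n-gram metrics can diverge.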

Method

The method translates MSA MCQs to English and Arabic dialects, converts them to OEQs, benchmarks LLMs, and generates CoT rationales for fine-tuning, creating a parallel, culturally-aligned QA dataset.
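
The data shape this pipeline implies can be sketched in a few lines. Everything here is an assumption made for illustration: the class names, the variety keys, and above all the conversion rule, which simply drops the options and keeps the gold option as the reference answer. The summary does not say how the paper rewrites questions, so treat `mcq_to_oeq` as a placeholder for that step.

```python
from dataclasses import dataclass, field


@dataclass
class MCQItem:
    """A multiple-choice question with one gold option (hypothetical schema)."""
    question: str
    options: list[str]
    answer_index: int


@dataclass
class OEQItem:
    """An open-ended question with a free-text reference answer."""
    question: str
    reference_answer: str


@dataclass
class ParallelRecord:
    """One question aligned across language varieties, keyed by variety name."""
    item_id: str
    variants: dict[str, OEQItem] = field(default_factory=dict)


def mcq_to_oeq(item: MCQItem) -> OEQItem:
    # Assumed rule: discard the distractors, keep the gold option text.
    return OEQItem(item.question, item.options[item.answer_index])


# Placeholder content; a real record would hold the MSA original, its English
# translation, and one entry per dialect.
msa_mcq = MCQItem("<MSA question text>", ["<A>", "<B>", "<C>"], answer_index=1)
en_mcq = MCQItem("<English question text>", ["<A-en>", "<B-en>", "<C-en>"], answer_index=1)

record = ParallelRecord(item_id="q-0001")
record.variants["msa"] = mcq_to_oeq(msa_mcq)
record.variants["en"] = mcq_to_oeq(en_mcq)
print(sorted(record.variants))
```

Keeping every variety under one `item_id` is what makes the resource parallel: the same question can be scored per variety and compared directly.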

In practice

Use OEQs for rigorous LLM evaluation.
Apply CoT fine-tuning for improved semantic correctness.
Benchmark LLMs on dialectal content to reveal knowledge gaps.
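
One way to act on these three points is to report judged correctness per language variety and question type rather than as a single number. The sketch below assumes judgments already exist as `(variety, question_type, is_correct)` tuples; the function name, the tuple layout, and the sample values are all hypothetical, and the values are toy placeholders rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_group(judgments):
    """Return accuracy per (variety, question_type) from boolean judgments."""
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for variety, question_type, is_correct in judgments:
        bucket = totals[(variety, question_type)]
        bucket[0] += int(is_correct)
        bucket[1] += 1
    return {group: correct / total for group, (correct, total) in totals.items()}


# Toy placeholder judgments, not measurements.
sample = [
    ("msa", "mcq", True),
    ("msa", "mcq", True),
    ("msa", "oeq", True),
    ("msa", "oeq", False),
    ("dialect", "oeq", False),
    ("dialect", "oeq", False),
]

report = accuracy_by_group(sample)
for (variety, question_type), acc in sorted(report.items()):
    print(f"{variety:8s} {question_type:4s} {acc:.2f}")
```

Splitting the report this way keeps a strong MSA MCQ score from hiding weaker dialect or open-ended results.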

Topics

Arabic Cultural QA
Dialectal Arabic
Open-Ended Questions
Multiple-Choice Questions
Large Language Models

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.