HalluScore: Large Language Model Hallucination Question Answering Benchmark

2025-07-02 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, extended

Summary

HalluScore is a new Arabic question answering benchmark designed to evaluate and mitigate hallucination in large language models (LLMs). Its 827 carefully curated questions address a significant gap: Arabic-specific hallucination evaluation has received little attention because annotated resources are scarce and the language is morphologically complex. HalluScore assesses LLMs across reasoning difficulty levels, knowledge domains, historical timelines, and culturally grounded Arabic scenarios. Each item in the dataset includes verified ground-truth evidence, an answer explanation, and multi-label annotations. An empirical analysis of 17 Arabic, multilingual, and reasoning LLMs on HalluScore shows that hallucination in Arabic LLMs extends beyond factual inaccuracies to failures of cultural understanding, linguistic reasoning, and logical consistency. GPT-5 and Claude models generally exhibited lower hallucination rates, while other models were more vulnerable to adversarial phrasing and culturally specific knowledge.

Key takeaway

Research scientists developing or deploying Arabic LLMs should integrate HalluScore into their evaluation pipelines to assess hallucination risk thoroughly. The benchmark shows that cultural understanding, linguistic reasoning, and logical validation are as critical as factual accuracy. They should test specifically for "reality violation" and "anthropomorphism hallucination" to ensure that models do not fabricate impossible scenarios or human-like traits, either of which can undermine trustworthiness in sensitive applications such as healthcare or law.

Key insights

HalluScore is a new Arabic QA benchmark for evaluating LLM hallucination, emphasizing cultural and linguistic nuances.

Principles

Hallucination extends beyond factual errors to cultural and linguistic reasoning.
Adversarial phrasing and false premises consistently trigger hallucinations.
Culturally grounded knowledge is a significant challenge for LLMs.

Method

HalluScore was constructed via crowdsourcing, quality assurance, hallucination-driven selection, and manual refinement, ensuring diverse, hallucination-relevant QA pairs with multi-label annotations and ground-truth evidence.
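
This summary does not say how the hallucination-driven selection step was implemented. One plausible reading, sketched below purely as an assumption, is a filter that keeps a candidate question only if enough reference models were judged to hallucinate on it. The function name, the `verdicts` layout, and the `min_models` threshold are all hypothetical and are not taken from the paper.

```python
def select_hallucination_prone(verdicts, min_models=1):
    """Keep question IDs on which at least `min_models` models hallucinated.

    `verdicts` maps question_id -> {model_name: True if judged hallucinated}.
    Both the layout and the threshold are illustrative assumptions.
    """
    selected = []
    for question_id, by_model in verdicts.items():
        # Count how many reference models were judged to hallucinate.
        n_hallucinating = sum(1 for flagged in by_model.values() if flagged)
        if n_hallucinating >= min_models:
            selected.append(question_id)
    return selected


# Toy verdicts for three candidate questions and two hypothetical models.
verdicts = {
    "q1": {"model_a": True, "model_b": False},
    "q2": {"model_a": False, "model_b": False},
    "q3": {"model_a": True, "model_b": True},
}
kept_any = select_hallucination_prone(verdicts)
kept_both = select_hallucination_prone(verdicts, min_models=2)
```

A higher threshold yields a smaller, harder set; the manual refinement stage mentioned above would then apply to whatever survives the filter.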

In practice

Test LLMs with adversarial and culturally specific questions.
Prioritize models with lower hallucination rates in reasoning tasks.
Consider prompt sensitivity when evaluating LLM responses.
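
The recommendations above can be turned into two simple measurements: a hallucination rate per annotation label, and the share of questions whose verdict changes when the prompt is reworded. The sketch below is a minimal illustration, not the benchmark's own tooling. The `JudgedAnswer` record, its field names, and the label strings are assumptions, and the `hallucinated` flag is taken to come from a separate human or automatic judge.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedAnswer:
    """One judged model answer. All field names are hypothetical."""
    question_id: str
    labels: tuple[str, ...]  # multi-label annotations attached to the question
    hallucinated: bool       # verdict supplied by an external judge


def hallucination_rate_by_label(results):
    """Return {label: fraction of answers under that label judged hallucinated}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for result in results:
        for label in result.labels:
            totals[label] += 1
            if result.hallucinated:
                flagged[label] += 1
    return {label: flagged[label] / totals[label] for label in totals}


def prompt_sensitivity(verdicts_by_prompt):
    """Fraction of shared questions whose verdict differs across prompt variants.

    `verdicts_by_prompt` maps a prompt-variant name to {question_id: hallucinated}.
    """
    variants = list(verdicts_by_prompt.values())
    if not variants:
        return 0.0
    # Only compare questions that every prompt variant answered.
    shared = set(variants[0]).intersection(*variants[1:])
    if not shared:
        return 0.0
    unstable = sum(1 for qid in shared if len({v[qid] for v in variants}) > 1)
    return unstable / len(shared)


results = [
    JudgedAnswer("q1", ("reality violation", "cultural"), True),
    JudgedAnswer("q2", ("cultural",), False),
    JudgedAnswer("q3", ("reality violation",), True),
]
rates = hallucination_rate_by_label(results)
sensitivity = prompt_sensitivity({
    "plain": {"q1": True, "q2": False},
    "reworded": {"q1": False, "q2": False},
})
```

Reporting rates per label rather than as a single aggregate makes it easier to see whether a model that looks acceptable overall is still weak on culturally grounded or adversarial questions.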

Topics

HalluScore
LLM Hallucination
Arabic Language Models
Question Answering Benchmarks
Cultural Competence

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.