A multilingual hallucination benchmark: MultiWikiQHalluA
Summary
MultiWikiQHalluA is a new multilingual hallucination benchmark for assessing the faithfulness of large language models across 306 languages, filling a gap left by English-centric evaluations. The benchmark uses the MultiWikiQA dataset and the LettuceDetect framework to generate synthetic hallucination data. Token-level hallucination classifiers were trained for 30 European languages, and evaluations were conducted on English, Danish, German, and Icelandic. The study assessed Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, and cogito-v1-preview-llama-70B. Smaller models hallucinated substantially more often: for Qwen3-0.6B, up to 60% of answers contained at least one hallucination, most notably in Icelandic. The larger models, specifically cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B, generally showed lower hallucination rates across most languages, while lower-resource languages consistently showed higher rates.
Key takeaway
For AI engineers deploying multilingual LLMs, understanding how hallucination tendencies vary across languages is critical. Model selection should account for the higher hallucination rates observed in smaller models and in lower-resource languages such as Icelandic. For production use, prefer larger models such as cogito-v1-preview-qwen-32B or cogito-v1-preview-llama-70B to reduce faithfulness issues, especially when supporting a linguistically broad user base.
Key insights
The multilingual benchmark shows higher hallucination rates for smaller models and for lower-resource languages.
Principles
- Hallucination rates increase in lower-resource languages.
- Larger models generally exhibit lower hallucination rates.
Method
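The MultiWikiQHalluA benchmark uses MultiWikiQA and LettuceDetect to create synthetic hallucination datasets for 306 languages. Token-level hallucination classifiers, trained for 30 European languages, are then used to evaluate model answers in English, Danish, German, and Icelandic.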
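The headline metric in the summary, the share of answers containing at least one hallucination, can be derived from token-level predictions. The following is a minimal sketch, not the authors' implementation: it assumes each answer comes with a list of per-token labels (1 for a token flagged as hallucinated, 0 otherwise), and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def answer_has_hallucination(token_labels):
    """Return True if at least one token in the answer is flagged (label 1)."""
    return any(label == 1 for label in token_labels)


def hallucination_rate_by_language(predictions):
    """Compute the answer-level hallucination rate per language.

    predictions: iterable of (language, token_labels) pairs, where
    token_labels is a list of 0/1 token-level classifier outputs
    (an assumed format, chosen for illustration).
    Returns a dict mapping language -> fraction of answers with >= 1 flagged token.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for language, token_labels in predictions:
        totals[language] += 1
        if answer_has_hallucination(token_labels):
            flagged[language] += 1
    return {language: flagged[language] / totals[language] for language in totals}


# Toy labels for illustration only; these are not results from the benchmark.
toy_predictions = [
    ("en", [0, 0, 0]),
    ("en", [0, 1, 0]),
    ("is", [1, 1, 0]),
    ("is", [0, 0, 1]),
]
rates = hallucination_rate_by_language(toy_predictions)
```

Aggregating at the answer level means a single flagged token counts the whole answer as hallucinated, which matches the "at least one hallucination" phrasing above but is stricter than a per-token rate.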
In practice
- Prioritize larger models for multilingual applications.
- Focus evaluation on lower-resource languages.
Topics
- MultiWikiQHalluA
- Multilingual Hallucination
- LettuceDetect Framework
- Lower-Resource Languages
- Qwen3
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.