A multilingual hallucination benchmark: MultiWikiQHalluA

2026-05-04 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

MultiWikiQHalluA is a new multilingual hallucination benchmark that assesses the faithfulness of large language models across 306 languages, addressing a gap left by English-centric evaluations. It builds on the MultiWikiQA dataset and uses the LettuceDetect framework to generate synthetic hallucination data. Token-level hallucination classifiers were trained for 30 European languages, and evaluations were run on English, Danish, German, and Icelandic. The study assessed Qwen3-0.6B, Qwen3-14B, Gemma-3-12B-IT, cogito-v1-preview-qwen-32B, and cogito-v1-preview-llama-70B. The smallest model, Qwen3-0.6B, hallucinated most: up to 60% of its answers contained at least one hallucination, particularly in Icelandic. The larger models, specifically cogito-v1-preview-qwen-32B and cogito-v1-preview-llama-70B, generally showed lower hallucination rates across most languages, while lower-resource languages consistently showed higher rates.

Key takeaway

AI engineers deploying multilingual LLMs should account for how hallucination rates vary with language and model size: the benchmark found higher rates in smaller models and in lower-resource languages such as Icelandic. Where faithfulness matters in production, favor larger models such as cogito-v1-preview-qwen-32B or cogito-v1-preview-llama-70B, especially when supporting a broad range of languages.

Key insights

The benchmark reveals higher hallucination rates in smaller models and in lower-resource languages.

Principles

Hallucination rates increase in lower-resource languages.
Larger models generally exhibit lower hallucination rates.

Method

The MultiWikiQHalluA benchmark uses MultiWikiQA and LettuceDetect to create synthetic hallucination datasets for 306 languages. Token-level hallucination classifiers, trained for 30 European languages, are then used to evaluate model answers.
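
To make the reported metric concrete, here is a minimal Python sketch of how token-level classifier output could be rolled up into the share of answers containing at least one hallucination, grouped by model and language. It does not reproduce the paper's pipeline or LettuceDetect's API: the function names, the record layout, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def answer_hallucination_rate(token_flags_per_answer):
    """Return the fraction of answers with at least one flagged token.

    Each element is a sequence of 0/1 flags, one per answer token,
    where 1 means the token was classified as hallucinated.
    """
    if not token_flags_per_answer:
        raise ValueError("need at least one answer")
    flagged = sum(1 for flags in token_flags_per_answer if any(flags))
    return flagged / len(token_flags_per_answer)


def rates_by_group(records):
    """Group (model, language, token_flags) records and compute each rate."""
    grouped = defaultdict(list)
    for model, language, flags in records:
        grouped[(model, language)].append(flags)
    return {key: answer_hallucination_rate(answers)
            for key, answers in grouped.items()}


# Toy data, invented for illustration (not results from the benchmark).
records = [
    ("model-a", "is", [0, 0, 1, 0]),  # one flagged token: answer counts
    ("model-a", "is", [0, 0, 0]),     # clean answer
    ("model-a", "en", [0, 0]),        # clean answer
    ("model-a", "en", [0, 0, 0]),     # clean answer
]
rates = rates_by_group(records)
for (model, language), rate in sorted(rates.items()):
    print(f"{model} [{language}]: {rate:.0%} of answers flagged")
```

Under this definition a single flagged token is enough to count the whole answer, so the metric is strict: it measures how often any unsupported content appears, not how much of the answer is unsupported.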

In practice

Prioritize larger models for multilingual applications.
Focus evaluation on lower-resource languages.

Topics

MultiWikiQHalluA
Multilingual Hallucination
LettuceDetect Framework
Lower-Resource Languages
Qwen3

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.