BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

Source: Artificial Intelligence · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Natural Language Processing) · Depth: Expert, quick

Summary

BenHalluEval is introduced as the first fine-grained hallucination evaluation framework for large language models (LLMs) in Bengali, the sixth most spoken language globally. The framework covers four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. Using GPT-5.4, the researchers constructed 12,000 hallucinated candidates from three existing Bengali datasets, covering twelve task-specific hallucination types. Seven LLMs, categorized as reasoning-oriented, multilingual, or Bengali-centric, were evaluated with a dual-track protocol: Track A measures the false-positive rate on ground-truth instances, and Track B measures the hallucination detection rate on hallucinated candidates. The study proposes BenHalluScore, a dual-track calibration metric. Reported scores range from 7.72% to 55.42%, showing that hallucination calibration varies substantially across models and tasks. Chain-of-thought prompting, tested as a mitigation strategy, shifted response distributions but did not consistently improve hallucination discrimination.

Key takeaway

If you are an NLP engineer developing or deploying large language models for low-resource languages such as Bengali, your current hallucination evaluation methods may be inadequate. Consider adopting a multi-task, dual-track framework such as BenHalluEval, which measures both false positives on ground-truth instances and detection of hallucinated candidates, to assess model reliability accurately and avoid inflated scores. Chain-of-thought prompting alone may not consistently improve hallucination discrimination, so plan for mitigation strategies that go beyond prompting.

Key insights

Hallucination evaluation for Bengali LLMs requires a multi-task, dual-track framework, because hallucination calibration varies widely across models and tasks.

Method

BenHalluEval uses 12,000 GPT-5.4-generated hallucinated candidates spanning 12 hallucination types across four tasks (GQA, Code-Mixed QA, Summarization, and Reasoning), evaluated with a dual-track protocol.
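
To make the dual-track protocol concrete, here is a minimal sketch of how the two per-track rates could be computed from a model's yes/no hallucination judgments. The `Judgment` record and the `dual_track_rates` function are hypothetical names chosen for illustration and are not taken from the paper. The sketch stops at the two rates, because this summary does not state how BenHalluScore combines them.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Judgment:
    """One model verdict (hypothetical record layout, not from the paper)."""
    track: str      # "A" = ground-truth instance, "B" = hallucinated candidate
    flagged: bool   # True if the model labelled the instance as hallucinated


def dual_track_rates(judgments: Iterable[Judgment]) -> Tuple[float, float]:
    """Return (false_positive_rate, detection_rate).

    Track A: share of ground-truth instances wrongly flagged as hallucinated.
    Track B: share of hallucinated candidates correctly flagged.
    """
    track_a = [j.flagged for j in judgments if j.track == "A"]
    track_b = [j.flagged for j in judgments if j.track == "B"]
    if not track_a or not track_b:
        raise ValueError("both tracks need at least one judgment")
    false_positive_rate = sum(track_a) / len(track_a)
    detection_rate = sum(track_b) / len(track_b)
    return false_positive_rate, detection_rate


# Toy data for illustration only (not results from the study).
toy = (
    [Judgment("A", False)] * 3 + [Judgment("A", True)]      # 1 of 4 wrongly flagged
    + [Judgment("B", True)] * 3 + [Judgment("B", False)]    # 3 of 4 detected
)
fpr, detection = dual_track_rates(toy)
print(f"Track A false-positive rate: {fpr:.2f}")
print(f"Track B detection rate:      {detection:.2f}")
```

Reporting both rates side by side is what guards against inflated scores: a model that flags everything as hallucinated would reach a perfect Track B detection rate while also showing a Track A false-positive rate of 1.0.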

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.