Zero-source LLM Hallucination Detection with Human-like Criteria Probing

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Natural Language Processing) · Depth: Expert, extended

Summary

Human-like Criteria Probing for Hallucination Detection (HCPD) is an interpretable, zero-source method for detecting factually incorrect or unfaithful content generated by Large Language Models. Working only from query-answer pairs, HCPD uses an LLM agent that adaptively decomposes each truthfulness judgment into a weighted set of interpretable criteria, such as factual accuracy and logical consistency. The agent scores the answer against each criterion and aggregates the criterion-specific scores into a final truthfulness score. It learns this adaptive decomposition through reward-based alignment that uses semantic consistency metrics such as BLEURT as weak supervision. At inference time, HCPD aggregates over multiple sampled evaluations to make its decisions more robust. HCPD consistently outperforms state-of-the-art baselines, reaching an average AUROC of 88.19% on LLaMA-3.1-8b and 88.02% on Qwen-3-8b across datasets including TriviaQA, SciQ, NQ Open, and CoQA, while its explicit criteria and weights make each judgment explainable.
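
The weighted scoring and multi-sampling aggregation summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the HCPD implementation. The `Criterion` type, the normalized weighted average, and the plain mean over sampled criteria sets are hypothetical choices, since the exact aggregation rules are not given here. The `auroc` helper shows how such truthfulness scores are evaluated with the metric reported above, using the rank-based definition in which ties count as one half.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Criterion:
    """One criterion proposed by the agent (hypothetical structure)."""
    name: str      # e.g. "factual accuracy" or "logical consistency"
    weight: float  # non-negative importance weight for this query-answer pair
    score: float   # criterion-specific truthfulness score in [0, 1]


def aggregate_criteria(criteria: list[Criterion]) -> float:
    """Weighted average of criterion scores, normalizing weights to sum to 1."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("criterion weights must sum to a positive value")
    return sum(c.weight * c.score for c in criteria) / total_weight


def multi_sample_score(samples: list[list[Criterion]]) -> float:
    """Assumed multi-sampling rule: mean of the aggregated scores of
    several independently sampled criteria sets for the same answer."""
    return mean(aggregate_criteria(s) for s in samples)


def auroc(scores: list[float], labels: list[int]) -> float:
    """Rank-based AUROC: chance that a truthful answer (label 1) outscores
    a hallucinated one (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one truthful and one hallucinated example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Two hypothetical sampled criteria sets for one query-answer pair.
    samples = [
        [Criterion("factual accuracy", 0.7, 0.9), Criterion("logical consistency", 0.3, 0.6)],
        [Criterion("factual accuracy", 0.6, 0.8), Criterion("logical consistency", 0.4, 0.7)],
    ]
    print(multi_sample_score(samples))
    print(auroc([0.9, 0.2, 0.8, 0.3], [1, 0, 0, 1]))
```

In this example the two sampled sets aggregate to 0.81 and 0.76, so the multi-sample score is about 0.785. Averaging over samples is one simple way to damp the variance that any single sampled set of criteria and weights would introduce.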

Key takeaway

For Machine Learning Engineers deploying LLMs in safety-critical applications or auditing black-box models, HCPD offers hallucination detection that is both accurate and interpretable. Its zero-source, multi-criteria approach needs only query-answer pairs, so you can assess a model's truthfulness without access to its internals or to external knowledge sources, and it reports high AUROC scores against state-of-the-art baselines. Consider adding HCPD to your CI/CD pipeline for pre-deployment audits or continuous monitoring: it flags likely hallucinations, and its per-criterion scores help you see why a given output was flagged.

Key insights

Zero-source LLM hallucination detection improves when adaptive, weakly supervised agents emulate human multi-criteria reasoning.

Method

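An LLM agent adaptively generates context-aware criteria and their weights for each query-answer pair, scores the response against each criterion, and aggregates the weighted scores into a final truthfulness score. The agent is trained with Group Relative Policy Optimization (GRPO), using semantic consistency metrics as weak supervision.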
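
The GRPO training signal can be illustrated with its group-relative advantage: for each query-answer pair, several outputs are sampled, each is rewarded, and every reward is standardized against its own group's mean and standard deviation, so no separate value network is needed. The weak label and reward below are hypothetical stand-ins (a BLEURT-style score thresholded at an assumed 0.5, and a reward of one minus the gap between the agent's aggregated score and that label); the paper's actual reward design may differ. The policy-gradient update that consumes these advantages is omitted.

```python
from statistics import mean, pstdev


def weak_label(bleurt_score: float, threshold: float = 0.5) -> float:
    """Hypothetical weak label: 1.0 (treated as truthful) if the BLEURT-style
    consistency score reaches an assumed threshold, else 0.0."""
    return 1.0 if bleurt_score >= threshold else 0.0


def reward(predicted_truthfulness: float, label: float) -> float:
    """Hypothetical reward: 1 minus the absolute gap between the agent's
    aggregated truthfulness score in [0, 1] and the weak label."""
    return 1.0 - abs(predicted_truthfulness - label)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize each reward against the mean and
    (population) standard deviation of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    label = weak_label(0.72)            # hypothetical BLEURT score for this pair
    predictions = [0.9, 0.6, 0.3, 0.8]  # aggregated scores from four sampled rollouts
    rewards = [reward(p, label) for p in predictions]
    print(group_relative_advantages(rewards))
```

With a weak label of 1.0, the rollout predicting 0.9 gets the largest advantage and the one predicting 0.3 the smallest, and the advantages sum to roughly zero, so the update favors rollouts whose scores agree with the weak label relative to their siblings.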

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.