Zero-source LLM Hallucination Detection with Human-like Criteria Probing
Summary
Human-like Criteria Probing for Hallucination Detection (HCPD) is an interpretable, zero-source method for identifying factually incorrect or unfaithful content generated by Large Language Models (LLMs). Operating solely on query-answer pairs, HCPD employs an LLM agent that adaptively decomposes a truthfulness judgment into a weighted set of interpretable criteria, such as factual accuracy and logical consistency, and then aggregates the criterion-specific scores. The agent learns this adaptive behavior through a reward-based alignment scheme that uses weak supervision from semantic consistency metrics such as BLEURT. At inference, a multi-sampling aggregation strategy makes the final decision more robust. HCPD consistently outperforms state-of-the-art baselines, achieving an average AUROC of 88.19% on LLaMA-3.1-8b and 88.02% on Qwen-3-8b across datasets such as TriviaQA, SciQ, NQ Open, and CoQA, while exposing the criteria behind each judgment.
Key takeaway
For Machine Learning Engineers deploying LLMs in safety-critical applications or auditing black-box models, HCPD provides a robust and interpretable hallucination detection solution. Because its multi-criteria approach is zero-source and is validated with high AUROC scores, you can assess model truthfulness without access to model internals or external knowledge sources. Consider integrating HCPD into your CI/CD pipeline for pre-deployment auditing or continuous monitoring, both to build trust in model outputs and to debug them.
Key insights
Zero-source LLM hallucination detection is enhanced by emulating human multi-criteria reasoning with adaptive, weakly-supervised agents.
Principles
- Decompose LLM evaluation into weighted, interpretable criteria.
- Align LLM evaluators using weak semantic consistency supervision.
- Stabilize stochastic LLM judgments with multi-sampling aggregation.
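The second principle can be illustrated with a minimal sketch. The summary does not spell out the reward used for alignment, so the agreement-based reward below is an assumption: a semantic consistency score (for example from BLEURT) is taken as a precomputed float, thresholded into a pseudo-label, and compared with the agent's thresholded judgment. The function names and the 0.5 threshold are hypothetical. The last function shows the group-relative normalization that gives Group Relative Policy Optimization its name, in which each reward is standardized against the other samples drawn for the same input.

```python
from statistics import mean, pstdev


def weak_label(similarity: float, threshold: float = 0.5) -> bool:
    """Pseudo-label (assumption): truthful when similarity meets the threshold."""
    return similarity >= threshold


def agreement_reward(agent_score: float, similarity: float, threshold: float = 0.5) -> float:
    """Reward 1.0 when the agent's thresholded judgment matches the weak label, else 0.0."""
    agent_says_truthful = agent_score >= threshold
    return 1.0 if agent_says_truthful == weak_label(similarity, threshold) else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled judgments for one query-answer pair whose weak label is "truthful".
sampled_scores = [0.9, 0.2, 0.7, 0.4]
rewards = [agreement_reward(s, similarity=0.8) for s in sampled_scores]
advantages = group_relative_advantages(rewards)
```

Samples that agree with the weak label receive positive advantages and the others negative ones, so the policy update favors judgments consistent with the supervision signal without requiring ground-truth labels.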
Method
An LLM agent adaptively generates context-aware criteria and weights, scores the response against each criterion, and aggregates the scores into a final truthfulness measure. The agent is trained via Group Relative Policy Optimization (GRPO) with weak semantic consistency supervision.
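As a rough illustration of the aggregation step, the sketch below combines criterion-specific scores into a single truthfulness value. The summary does not give the exact aggregation formula, so a normalized weighted average is assumed here. The `Criterion` class, the `aggregate_truthfulness` function, and the example weights and scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str      # e.g. "factual accuracy"
    weight: float  # relative importance assigned by the agent (non-negative)
    score: float   # criterion-specific score, assumed to lie in [0, 1]


def aggregate_truthfulness(criteria: list[Criterion]) -> float:
    """Weighted average of criterion scores, with weights normalized to sum to 1."""
    if not criteria:
        raise ValueError("at least one criterion is required")
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(c.weight * c.score for c in criteria) / total_weight


# Illustrative values only; in HCPD the agent would generate these per query.
example = [
    Criterion("factual accuracy", weight=0.6, score=0.9),
    Criterion("logical consistency", weight=0.4, score=0.5),
]
truthfulness = aggregate_truthfulness(example)
```

Keeping the per-criterion scores alongside the aggregate is what makes the judgment inspectable: a low overall value can be traced back to the criterion that caused it.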
In practice
- Instantiate a Qwen-2.5-7b agent for zero-source detection.
- Apply multi-sampling (e.g., K=5) for robust hallucination scores.
- Use BLEURT or DeepSeek-V3 as the source of weak supervision.
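The multi-sampling step can be sketched as follows. The summary does not say how the K samples are combined, so averaging is an assumption, as are the 0.5 decision threshold and the `judge` callable, which stands in for one stochastic call to the detection agent. `toy_judge` is a hypothetical placeholder that returns a noisy score rather than querying a model.

```python
import random
from statistics import mean
from typing import Callable


def multi_sample_score(
    judge: Callable[[str, str], float],
    query: str,
    answer: str,
    k: int = 5,
) -> float:
    """Call a stochastic judge k times and return the mean truthfulness score."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return mean(judge(query, answer) for _ in range(k))


def is_hallucination(score: float, threshold: float = 0.5) -> bool:
    """Flag the answer when the aggregated truthfulness score is below the threshold."""
    return score < threshold


# Hypothetical stand-in for the agent: a noisy score around 0.3, clipped to [0, 1].
rng = random.Random(0)


def toy_judge(query: str, answer: str) -> float:
    return min(1.0, max(0.0, rng.gauss(0.3, 0.1)))


score = multi_sample_score(toy_judge, "Who wrote Hamlet?", "Charles Dickens", k=5)
flagged = is_hallucination(score)
```

Averaging reduces the variance of a single sampled judgment at the cost of K agent calls per query-answer pair, so K is a trade-off between stability and inference cost.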
Topics
- LLM Hallucination Detection
- Zero-source Evaluation
- Human-like Criteria Probing
- Reward-based Alignment
- Group Relative Policy Optimization
- Explainable AI
Code references
- TRISKEL10N/HCPD
- huggingface/open-r1
- D2I-ai/eigenscore
- jlko/semantic_uncertainty
- collin-burns/discovering_latent_knowledge
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.