Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

The Evidence Graph Consistency (EGC) framework detects hallucination in Retrieval-Augmented Generation (RAG) by constructing a local evidence graph for each response and computing five structural consistency measures over it. Evaluated on 5,767 responses from the RAGTruth question answering split across six LLMs, EGC revealed a consistent model-family split. For Llama-2 models, the graph consistency features pointed in the expected diagnostic direction: grounded answers received higher EGC scores. For GPT-4, GPT-3.5, and Mistral-7B, the direction systematically reversed, and hallucinated answers scored higher. This indicates qualitatively different hallucination patterns across model families. It also suggests that embedding-based graph consistency alone cannot serve as a model-independent hallucination detection signal, particularly for models such as GPT-4 that produce fluent, evidence-proximate hallucinations.

Key takeaway

AI scientists and machine learning engineers who design or evaluate RAG hallucination detection systems should adopt a model-aware strategy. For Llama-2 models, structural graph probes like EGC are effective at identifying lexically divergent hallucinations. For GPT-class models, where hallucinations are often fluent and semantically close to the evidence, prioritize semantic verification or natural language inference (NLI) methods. Do not rely on a single, universal detection signal across diverse LLM families.
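
One way to picture the model-aware strategy is a small routing function. This is only a sketch: the function name `choose_detector`, the returned strategy labels, and the prefix matching on model names are all hypothetical, and the family lists simply mirror the split reported in the summary above.

```python
# Families for which the summary reports the expected EGC direction
# (grounded answers score higher).
EXPECTED_DIRECTION = ("llama-2",)

# Families for which the summary reports a reversed direction
# (hallucinated answers score higher).
REVERSED_DIRECTION = ("gpt-4", "gpt-3.5", "mistral-7b")


def choose_detector(model_name: str) -> str:
    """Return a hypothetical strategy label for a given generator model."""
    name = model_name.lower()
    if name.startswith(EXPECTED_DIRECTION):
        return "graph_probe"            # structural probe such as EGC
    if name.startswith(REVERSED_DIRECTION):
        return "semantic_verification"  # e.g. an NLI-based check
    # Unknown family: no direction has been established, so measure first.
    return "evaluate_per_model"


print(choose_detector("Llama-2-13b-chat"))  # graph_probe
print(choose_detector("gpt-4-0613"))        # semantic_verification
```

Any model outside the listed families falls through to per-model evaluation, since the summary gives no basis for assuming either direction.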

Key insights

Graph-based RAG hallucination detection is model-dependent: the signal reverses for GPT-4, GPT-3.5, and Mistral-7B, whose hallucination style differs from that of Llama-2.

Principles

LLM hallucination patterns vary qualitatively across model families.
Embedding similarity alone is insufficient for universal hallucination detection.
Aggregate benchmark scores can obscure model-specific diagnostic signals.

Method

EGC constructs a local graph from the question, the retrieved passages, and the answer claims, using all-MiniLM-L6-v2 embeddings and a cosine similarity threshold of $\tau=0.4$, then computes five structural consistency features over that graph.
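
The graph construction step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes embeddings have already been computed (for example with all-MiniLM-L6-v2) and are passed in as plain lists of floats, it uses toy 3-dimensional vectors, and `claim_support_ratio` is an illustrative stand-in feature, since this summary does not enumerate the five features EGC actually computes.

```python
import math
from itertools import combinations

TAU = 0.4  # cosine similarity threshold stated in the summary


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_evidence_graph(nodes, tau=TAU):
    """nodes maps a node name to (kind, vector), with kind in
    {"question", "passage", "claim"}. Returns {(a, b): similarity}
    for every node pair whose cosine similarity is at least tau."""
    edges = {}
    for a, b in combinations(sorted(nodes), 2):
        sim = cosine(nodes[a][1], nodes[b][1])
        if sim >= tau:
            edges[(a, b)] = sim
    return edges


def claim_support_ratio(nodes, edges):
    """Illustrative feature (an assumption, not from the paper):
    fraction of claims linked to at least one passage."""
    claims = [n for n, (kind, _) in nodes.items() if kind == "claim"]
    if not claims:
        return 0.0
    supported = set()
    for a, b in edges:
        kinds = {nodes[a][0]: a, nodes[b][0]: b}
        if "claim" in kinds and "passage" in kinds:
            supported.add(kinds["claim"])
    return len(supported) / len(claims)


# Toy vectors invented for illustration only.
nodes = {
    "q": ("question", [1.0, 0.0, 0.0]),
    "p1": ("passage", [0.9, 0.1, 0.0]),
    "c1": ("claim", [0.8, 0.2, 0.0]),
    "c2": ("claim", [0.0, 0.0, 1.0]),
}
edges = build_evidence_graph(nodes)
print(sorted(edges))                      # c1-p1, c1-q, p1-q
print(claim_support_ratio(nodes, edges))  # 0.5
```

In the toy example, claim `c2` is orthogonal to everything else, so it ends up isolated in the graph and only one of the two claims counts as supported.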

In practice

Apply EGC to Llama-2 outputs to detect structural disconnects.
For GPT-class models, combine EGC with semantic verification.
Evaluate RAG detectors per-model to avoid masking heterogeneous signals.
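
The point about per-model evaluation can be illustrated with a small AUROC computation. This is a sketch under stated assumptions: the record layout `(model_name, egc_score, is_grounded)` and the helper names are hypothetical, and the numbers are toy values invented to show the effect, not results from the paper. AUROC is defined here as the probability that a grounded answer scores higher than a hallucinated one, so a value below 0.5 means the signal is reversed.

```python
from collections import defaultdict


def auroc(scores, labels):
    """P(score of a positive > score of a negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_model(records):
    """records: iterable of (model_name, egc_score, is_grounded)."""
    grouped = defaultdict(lambda: ([], []))
    for model, score, grounded in records:
        grouped[model][0].append(score)
        grouped[model][1].append(int(grounded))
    return {m: auroc(s, y) for m, (s, y) in grouped.items()}


# Toy data: model_a has the expected direction, model_b is reversed.
records = [
    ("model_a", 0.9, True), ("model_a", 0.8, True),
    ("model_a", 0.3, False), ("model_a", 0.2, False),
    ("model_b", 0.3, True), ("model_b", 0.2, True),
    ("model_b", 0.9, False), ("model_b", 0.8, False),
]

per_model = auroc_by_model(records)
pooled = auroc([r[1] for r in records], [int(r[2]) for r in records])
print(per_model)  # {'model_a': 1.0, 'model_b': 0.0}
print(pooled)     # 0.5
```

The pooled score of 0.5 looks like an uninformative detector, while the per-model view shows a perfect signal for one model and a perfectly reversed one for the other.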

Topics

Retrieval-Augmented Generation
Hallucination Detection
Evidence Graphs
Large Language Models
Model-Dependent Analysis
Llama-2
GPT-4

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.