SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

2026-05-19 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Natural Language Processing · Depth: Expert, extended

Summary

SKG-Eval is a quasi-deterministic, interpretable evaluation framework for multi-turn dialogue systems. It targets a gap in existing methods, which often miss cross-turn inconsistencies such as contradiction, topic drift, and entity inconsistency. The framework models a dialogue as an evolving Semantic Knowledge Graph (SKG), updating entities, relations, and commitments at each turn through structured triple extraction. From this graph it computes three signals (local relevance, historical consistency, and logical coherence), fuses them with a regime-adaptive mechanism, and aggregates them into a length-invariant session score via recency-weighted trend analysis. On benchmarks such as MT-Bench and MultiChallenge, SKG-Eval is reported to correlate more closely with human judgments and to recall more long-range inconsistencies, particularly in extended conversations where other evaluators degrade. It also provides explicit contradiction certificates and deterministic scores, enabling reproducible and auditable evaluation.

Key takeaway

Research scientists developing or evaluating multi-turn dialogue systems should consider SKG-Eval as an alternative to turn-isolated or LLM-as-a-judge evaluation. The framework is designed to detect long-range inconsistencies such as contradictions and semantic drift, and it produces auditable, deterministic results. Integrating it can make model development more robust by surfacing failure modes that implicit, black-box evaluators often miss, especially in extended conversations.

Key insights

SKG-Eval uses evolving knowledge graphs and geometric reasoning for stateful, interpretable multi-turn dialogue evaluation.

Principles

Dialogue quality is intrinsically stateful and temporal.
Explicit state tracking improves long-horizon consistency detection.
Geometric reasoning enhances contradiction recall over NLI models.
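
This summary does not describe how the geometric engine works internally. One possible reading, offered purely as an illustration, is to map numeric claims to intervals on the number line and flag two claims about the same attribute as contradictory when their intervals cannot overlap. The function names, the supported qualifiers, and the closed-interval simplification below are all assumptions, not the paper's method; antonym handling is omitted.

```python
import re

INF = float("inf")

# Hypothetical phrase grammar: an optional qualifier followed by one number.
_PATTERN = re.compile(
    r"\s*(under|below|over|above|at least|at most)?\s*(\d+(?:\.\d+)?)\s*"
)


def to_interval(phrase):
    """Map a phrase such as 'under 30', 'over 40', or '35' to (low, high).

    Simplification (assumption): every interval is treated as closed, so
    'under 30' and '30' are NOT reported as conflicting here.
    """
    match = _PATTERN.fullmatch(phrase.lower())
    if match is None:
        raise ValueError(f"unsupported numeric phrase: {phrase!r}")
    qualifier, value = match.group(1), float(match.group(2))
    if qualifier in ("under", "below", "at most"):
        return (-INF, value)
    if qualifier in ("over", "above", "at least"):
        return (value, INF)
    return (value, value)  # a bare number is a single point


def intervals_conflict(a, b):
    """Two closed intervals conflict when they share no point."""
    return a[1] < b[0] or b[1] < a[0]


def numeric_contradiction(earlier_phrase, later_phrase):
    """Return True when two numeric claims cannot both hold."""
    return intervals_conflict(to_interval(earlier_phrase), to_interval(later_phrase))
```

Under these assumptions, `numeric_contradiction("under 30", "over 40")` flags a conflict, while `numeric_contradiction("at least 18", "under 30")` does not, because the two ranges overlap. A check of this kind is rule-based and repeatable, which fits the deterministic scoring the summary describes.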

Method

SKG-Eval incrementally builds a Semantic Knowledge Graph, computes local relevance, historical consistency, and logical coherence signals from it, then fuses them and aggregates the result into a recency-weighted session score.
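
The incremental graph update can be pictured with a small sketch. It assumes triples have already been extracted for each turn (the extraction step is not shown) and that some relations are single-valued, so a new value for the same subject and relation conflicts with an earlier one. The class names, the `functional_relations` set, and the certificate format are hypothetical illustrations, not the paper's data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    turn: int  # dialogue turn in which the claim was made


@dataclass
class DialogueGraph:
    # Assumption: these relations admit only one value per subject.
    functional_relations: frozenset = frozenset({"lives_in", "age"})
    # Maps (subject, relation) to every triple asserted so far.
    facts: dict = field(default_factory=dict)

    def update(self, triples):
        """Add one turn's triples and return contradiction certificates.

        A certificate is the pair (earlier_triple, new_triple) that disagree
        on a single-valued relation.
        """
        certificates = []
        for new in triples:
            earlier = self.facts.setdefault((new.subject, new.relation), [])
            if new.relation in self.functional_relations:
                for prev in earlier:
                    if prev.obj != new.obj:
                        certificates.append((prev, new))
            earlier.append(new)
        return certificates


graph = DialogueGraph()
graph.update([Triple("user", "lives_in", "Paris", turn=1)])
graph.update([Triple("user", "likes", "jazz", turn=3)])
certificates = graph.update([Triple("user", "lives_in", "Berlin", turn=7)])
```

Here `certificates` holds one pair linking the turn 1 claim to the turn 7 claim. Because the graph persists across turns, the conflict is found however many turns separate the two statements, and the returned pair shows exactly which statements disagree.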

In practice

Use structured triple extraction for dialogue state.
Employ geometric contradiction detection for numeric/antonymic conflicts.
Apply recency-weighted aggregation for session-level scores.
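
Recency-weighted aggregation can be sketched as a normalized, exponentially decayed mean of per-turn scores. This is only one plausible form: the summary mentions "recency-weighted trend analysis" without giving a formula, so the exponential weighting, the `decay` parameter, and the function name are assumptions, and the trend component is not modeled here.

```python
def session_score(turn_scores, decay=0.8):
    """Recency-weighted mean of per-turn scores.

    The most recent turn gets weight 1, the one before it `decay`, then
    `decay ** 2`, and so on. Dividing by the total weight keeps the result
    on the same scale as the inputs regardless of dialogue length.
    """
    if not turn_scores:
        raise ValueError("need at least one turn score")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    n = len(turn_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    weighted_sum = sum(w * s for w, s in zip(weights, turn_scores))
    return weighted_sum / sum(weights)


# Same set of turn scores, but the failure happens at different points.
early_failure = session_score([0.0, 1.0, 1.0, 1.0])
late_failure = session_score([1.0, 1.0, 1.0, 0.0])
```

With this weighting, `late_failure` comes out lower than `early_failure`: a dialogue that ends badly is penalized more than one that recovers. Setting `decay=1.0` reduces the function to a plain average.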

Topics

Semantic Knowledge Graphs
Multi-Turn Dialogue Evaluation
Geometric Contradiction Engine
Cross-Turn Consistency
Deterministic Evaluation

Best for: Research Scientist, AI Scientist, NLP Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.