SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs
Summary
SKG-Eval is a quasi-deterministic, interpretable evaluation framework for multi-turn dialogue systems. It targets a weakness of existing methods, which often miss cross-turn inconsistencies such as contradiction, topic drift, and entity inconsistency. The framework models a dialogue as an evolving Semantic Knowledge Graph (SKG), incrementally updating entities, relations, and commitments at each turn via structured triple extraction. From this graph it computes three signals (local relevance, historical consistency, and logical coherence), fuses them with a regime-adaptive mechanism, and aggregates them into a length-invariant session score via recency-weighted trend analysis. On benchmarks such as MT-Bench and MultiChallenge, SKG-Eval achieves higher correlation with human judgments and significantly improves recall of long-range inconsistencies, particularly in extended conversations where other evaluators degrade. It also provides explicit contradiction certificates and deterministic scores, enabling reproducible and auditable evaluation.
Key takeaway
Research scientists developing or evaluating multi-turn dialogue systems should consider adopting SKG-Eval to overcome the limitations of turn-isolated or LLM-as-a-judge evaluation. The framework detects long-range inconsistencies such as contradictions and semantic drift more reliably, and its results are auditable and deterministic. Integrating SKG-Eval can make model development more robust by surfacing critical failure modes that implicit, black-box evaluators often miss, especially in extended conversations.
Key insights
SKG-Eval uses evolving knowledge graphs and geometric reasoning for stateful, interpretable multi-turn dialogue evaluation.
Principles
- Dialogue quality is intrinsically stateful and temporal.
- Explicit state tracking improves long-horizon consistency detection.
- Geometric reasoning improves contradiction recall relative to NLI models.
Method
SKG-Eval incrementally builds a Semantic Knowledge Graph, extracts local relevance, historical consistency, and logical coherence signals, then fuses and aggregates them into a session score with recency weighting.
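The incremental graph update described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `DialogueGraph` class, the `(subject, relation, object)` triple format, and the rule that a relation holds a single value are all assumptions made for illustration, and the triples are supplied by hand rather than extracted from text.

```python
from dataclasses import dataclass, field

# Hypothetical triple format: (subject, relation, object).
Triple = tuple[str, str, str]


@dataclass
class DialogueGraph:
    """Toy dialogue state: (subject, relation) -> (object, turn that asserted it)."""

    facts: dict[tuple[str, str], tuple[str, int]] = field(default_factory=dict)

    def update(self, turn: int, triples: list[Triple]) -> list[str]:
        """Merge one turn's triples and return notes for any conflicts found."""
        conflicts = []
        for subj, rel, obj in triples:
            key = (subj, rel)
            if key in self.facts and self.facts[key][0] != obj:
                old_obj, old_turn = self.facts[key]
                conflicts.append(
                    f"turn {turn}: ({subj}, {rel}, {obj}) conflicts with "
                    f"({subj}, {rel}, {old_obj}) from turn {old_turn}"
                )
            # Assumption: the most recent statement replaces the stored one.
            self.facts[key] = (obj, turn)
        return conflicts


graph = DialogueGraph()
graph.update(1, [("user", "lives_in", "Paris")])
graph.update(2, [("user", "owns", "cat")])
# Six turns later, a statement that disagrees with turn 1.
notes = graph.update(7, [("user", "lives_in", "Berlin")])
```

Because the state persists across turns, the conflict at turn 7 is traced back to turn 1, and the returned note plays the role of a simple, human-readable record of which two statements disagree.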
In practice
- Use structured triple extraction for dialogue state.
- Employ geometric contradiction detection for numeric/antonymic conflicts.
- Apply recency-weighted aggregation for session-level scores.
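Two of the practices above lend themselves to a short sketch. The summary does not specify how the geometric engine or the aggregation is defined, so both functions below are assumptions: `intervals_conflict` reads a numeric claim as an interval on the number line and flags disjoint intervals, and `session_score` uses a normalized, exponentially decayed mean with a hypothetical `decay` parameter. The trend-analysis component mentioned in the summary is not modeled here.

```python
def intervals_conflict(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Treat two numeric claims about the same attribute as (low, high) ranges.

    Assumption: the claims conflict exactly when the ranges do not overlap.
    """
    return a[1] < b[0] or b[1] < a[0]


def session_score(turn_scores: list[float], decay: float = 0.8) -> float:
    """Recency-weighted mean of per-turn scores.

    The last turn has weight 1 and each earlier turn is scaled by `decay`.
    Dividing by the weight sum keeps the result on the same scale as the
    per-turn scores, whatever the dialogue length.
    """
    if not turn_scores:
        raise ValueError("need at least one turn score")
    n = len(turn_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, turn_scores)) / sum(weights)


# "I am 30" vs. "I am 25": point intervals that do not overlap.
age_conflict = intervals_conflict((30, 30), (25, 25))
# "between 20 and 40" vs. "30": overlapping, so no conflict.
age_compatible = intervals_conflict((20, 40), (30, 30))

steady = session_score([0.9, 0.9, 0.9])
late_drop = session_score([0.9, 0.9, 0.2])
early_drop = session_score([0.2, 0.9, 0.9])
```

With this weighting, a quality drop in the final turn lowers the session score more than the same drop in the first turn, which is the intended effect of favoring recent behavior.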
Topics
- Semantic Knowledge Graphs
- Multi-Turn Dialogue Evaluation
- Geometric Contradiction Engine
- Cross-Turn Consistency
- Deterministic Evaluation
Best for: Research Scientist, AI Scientist, NLP Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.