MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

· Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

MORTAR (Metamorphic multi-TuRn diAlogue testing appRoach) addresses the oracle problem in evaluating LLM-based dialogue systems, focusing on multi-turn interactions, which remain largely underexplored. Unlike existing single-turn methods, MORTAR automates the generation of follow-up question-answer test cases using five novel dialogue-level perturbations and metamorphic relations. It employs a knowledge graph-based dialogue information model for low-cost test dataset generation and bug detection, and it avoids LLM judges so that judge-induced evaluation bias does not affect the results. Experiments on multiple LLM-based dialogue systems, including Meta-Llama-3-8B-Instruct and Gemma-2-9b-it, show that MORTAR finds more unique bugs and detects up to four times more severe bugs than the most effective existing metamorphic testing approaches.

Key takeaway

If you are an AI scientist or machine learning engineer developing or deploying LLM-based dialogue systems, consider integrating multi-turn metamorphic testing to uncover critical defects. MORTAR's dialogue-level perturbations and knowledge graph-based answerability checks generate challenging test cases without an LLM judge, and in the reported experiments they reveal up to four times more severe bugs than the most effective existing metamorphic testing approaches. Prioritize testing with the metamorphic relation labeled MR3 for larger models, as it effectively exposes issues related to knowledge retrieval versus contextual reasoning.

Key insights

MORTAR offers a cost-effective metamorphic testing approach for multi-turn LLM dialogue systems that avoids the evaluation biases of LLM judges.

Method

MORTAR generates perturbed QA dialogue test cases, performs semantic and ontology-based answerability checks, and detects bugs by identifying metamorphic relation conflicts without an LLM judge.
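
The workflow described above can be illustrated with a minimal sketch. It is not MORTAR's implementation: the `drop_turn` perturbation, the entity-set stand-in for the knowledge graph, the `requires`/`introduces` annotations, and the convention that an unanswerable question should yield "unknown" are all assumptions made for illustration. The sketch perturbs a dialogue, checks which follow-up questions remain answerable, and flags a conflict whenever the system's answer disagrees with what the relation predicts, using only string comparison rather than an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Dict, List

UNKNOWN = "unknown"  # assumed convention for "this question cannot be answered"

@dataclass(frozen=True)
class Turn:
    question: str
    requires: FrozenSet[str] = frozenset()    # entities the question refers to
    introduces: FrozenSet[str] = frozenset()  # entities this turn brings into the dialogue

# A dialogue system maps (earlier questions, current question) to an answer.
System = Callable[[List[str], str], str]

def drop_turn(dialogue: List[Turn], index: int) -> List[Turn]:
    """Hypothetical dialogue-level perturbation: remove one turn."""
    return dialogue[:index] + dialogue[index + 1:]

def answerable(dialogue: List[Turn], position: int, context: FrozenSet[str]) -> bool:
    """A turn is answerable if every entity it needs comes from the context
    or from an earlier turn (a stand-in for a knowledge-graph lookup)."""
    known = set(context)
    for turn in dialogue[:position]:
        known |= turn.introduces
    return dialogue[position].requires <= known

def run(system: System, dialogue: List[Turn]) -> Dict[str, str]:
    """Feed the dialogue turn by turn; assumes questions are unique."""
    history: List[str] = []
    answers: Dict[str, str] = {}
    for turn in dialogue:
        answers[turn.question] = system(history, turn.question).strip().lower()
        history.append(turn.question)
    return answers

def find_conflicts(system: System, dialogue: List[Turn], dropped: int,
                   context: FrozenSet[str]) -> List[str]:
    """Return questions whose follow-up answer violates the assumed relation:
    still-answerable questions keep their original answer, and questions that
    became unanswerable should be answered with UNKNOWN."""
    perturbed = drop_turn(dialogue, dropped)
    source = run(system, dialogue)
    followup = run(system, perturbed)
    conflicts = []
    for pos, turn in enumerate(perturbed):
        expected = source[turn.question] if answerable(perturbed, pos, context) else UNKNOWN
        if followup[turn.question] != expected:
            conflicts.append(turn.question)
    return conflicts

# Toy system under test: it ignores the dialogue history entirely.
def toy_system(history: List[str], question: str) -> str:
    table = {
        "Who wrote the report?": "Alice",
        "Where does she work?": "Acme",
        "When was the report published?": "2020",
    }
    return table.get(question, UNKNOWN)

dialogue = [
    Turn("Who wrote the report?", frozenset({"report"}), frozenset({"author"})),
    Turn("Where does she work?", frozenset({"author"})),
    Turn("When was the report published?", frozenset({"report"})),
]
context = frozenset({"report"})
conflicts = find_conflicts(toy_system, dialogue, dropped=0, context=context)
print(conflicts)
```

In this toy run, removing the first turn leaves "she" without a referent, so the history-blind system is flagged for still answering "Where does she work?". The question about the publication date remains answerable from the context and is not flagged.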

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.