MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

2024-10-29 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

MORTAR, a Metamorphic multi-TuRn diAlogue testing appRoach, addresses the persistent test oracle problem in evaluating LLM-based dialogue systems, particularly in multi-turn interactions, which remain largely underexplored. Unlike existing single-turn methods, MORTAR automates the generation of follow-up question-answer test cases using five novel dialogue-level perturbations and metamorphic relations. It employs a knowledge graph-based dialogue information model for low-cost test dataset generation and bug detection, and it does not rely on LLM judges, which avoids the evaluation biases such judges introduce. In experiments on multiple LLM-based dialogue systems, including Meta-Llama-3-8B-Instruct and Gemma-2-9b-it, MORTAR uncovers more unique bugs and detects up to four times more severe bugs than the most effective existing metamorphic testing approaches.

Key takeaway

AI scientists and machine learning engineers developing or deploying LLM-based dialogue systems should integrate multi-turn metamorphic testing to uncover critical defects. MORTAR's dialogue-level perturbations and knowledge graph-based answerability checks generate challenging test cases without an LLM judge, revealing up to four times more severe bugs than the most effective existing metamorphic testing approaches. Prioritize MR3, one of MORTAR's metamorphic relations, when testing larger models, as it effectively exposes issues related to knowledge retrieval versus contextual reasoning.

Key insights

MORTAR offers a low-cost metamorphic testing approach for multi-turn LLM dialogue systems that avoids the biases of LLM judges.

Principles

Multi-turn dialogue testing requires strong context reliance verification.
Metamorphic testing mitigates the test oracle problem.
LLM judges introduce bias and non-determinism in evaluation.

Method

MORTAR generates perturbed QA dialogue test cases, performs semantic and ontology-based answerability checks, and detects bugs by identifying metamorphic relation conflicts without an LLM judge.
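
The judge-free detection step can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not MORTAR's implementation: `similarity` uses a lexical ratio from Python's `difflib` as a stand-in for a real semantic similarity measure, and the function names, the `threshold` value, and the rule that an unanswerable question should be answered with "unknown" are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1].

    Assumption: a lexical ratio replaces a real semantic similarity model here.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def find_mr_conflicts(
    expected: List[str],
    generated: List[str],
    answerable: List[bool],
    threshold: float = 0.6,  # hypothetical cut-off, not taken from the paper
) -> List[int]:
    """Return indices of rounds whose generated answer conflicts with the expectation."""
    conflicts = []
    for i, (exp, gen, ok) in enumerate(zip(expected, generated, answerable)):
        if not ok:
            # Hypothetical relation: an unanswerable question should yield "unknown".
            exp = "unknown"
        if similarity(exp, gen) < threshold:
            conflicts.append(i)
    return conflicts
```

Because the check is a deterministic comparison against an expectation derived from the perturbation, the same inputs always produce the same verdict, which is the property an LLM judge cannot guarantee.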

In practice

Apply dialogue-level perturbations (shuffle, reduce, duplicate rounds).
Use knowledge graphs for context-aware answerability checks.
Detect bugs by measuring semantic similarity (MSS) between expected and generated answers.
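
The three perturbations named above, together with a context-aware answerability check, might look like the sketch below. It assumes a dialogue is a list of (question, answer) rounds and that each round is annotated with the knowledge graph triples it introduces and the triples its question depends on. All names and the triple representation are hypothetical illustrations, not MORTAR's actual data model.

```python
import random
from typing import List, Set, Tuple

Round = Tuple[str, str]         # (question, answer)
Triple = Tuple[str, str, str]   # (subject, relation, object)


def shuffle_rounds(dialogue: List[Round], seed: int = 0) -> List[Round]:
    """Return a copy of the dialogue with its rounds in a seeded random order."""
    rng = random.Random(seed)
    out = list(dialogue)
    rng.shuffle(out)
    return out


def reduce_rounds(dialogue: List[Round], drop_index: int) -> List[Round]:
    """Return a copy of the dialogue with one round removed."""
    return [r for i, r in enumerate(dialogue) if i != drop_index]


def duplicate_round(dialogue: List[Round], index: int) -> List[Round]:
    """Return a copy of the dialogue with one round repeated immediately after itself."""
    return dialogue[: index + 1] + [dialogue[index]] + dialogue[index + 1 :]


def answerable_rounds(
    facts_per_round: List[Set[Triple]],
    needs_per_round: List[Set[Triple]],
) -> List[bool]:
    """Mark a round answerable if every triple it needs was introduced earlier.

    Assumption: context dependence is modelled purely as triple availability.
    """
    seen: Set[Triple] = set()
    result = []
    for facts, needs in zip(facts_per_round, needs_per_round):
        result.append(needs <= seen)
        seen |= facts
    return result
```

For example, if the second round's question depends on a triple introduced in the first round, removing the first round with `reduce_rounds` makes that question unanswerable, so the expected answer for the perturbed dialogue changes accordingly.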

Topics

Metamorphic Testing
LLM Dialogue Systems
Multi-turn QA
Knowledge Graphs
Software Quality Assurance
Bug Detection

Code references

meta-llama/llama3

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.