MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

2026-04-21 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Technology · Depth: Expert, long

Summary

MTR-DuplexBench is a new benchmark designed to comprehensively evaluate Full-Duplex Speech Language Models (FD-SLMs) in multi-round conversational settings. Unlike existing benchmarks that primarily focus on single-round interactions and basic conversational features, MTR-DuplexBench addresses the complexities of continuous full-duplex dialogues, including blurred turn boundaries and context inconsistency. The benchmark introduces a novel methodology to segment continuous dialogues into discrete turns, enabling turn-by-turn evaluation across four critical dimensions: dialogue quality, conversational dynamics (smooth turn-taking, interruption, pause handling, background speech, backchanneling), instruction following, and safety. Initial experimental results using the Moshi FD-SLM indicate that current models struggle to maintain consistent performance across multiple rounds and various evaluation dimensions, underscoring the necessity and effectiveness of this new comprehensive evaluation framework.

Key takeaway

Research scientists developing or deploying Full-Duplex Speech Language Models should prioritize multi-round evaluation with benchmarks like MTR-DuplexBench. Multi-round evaluation reveals performance inconsistencies in dialogue quality, instruction following, and safety that single-round evaluations miss, which helps in building more robust and reliable conversational AI systems.

Key insights

MTR-DuplexBench offers a comprehensive, multi-round evaluation for Full-Duplex Speech Language Models.

Principles

Multi-round evaluation is crucial for FD-SLM reliability.
Blurred turn boundaries and context inconsistency are key challenges.
FD-SLMs need evaluation beyond basic conversational features.

Method

MTR-DuplexBench segments continuous full-duplex dialogues into discrete turns using GPT-4o and voice activity detection (VAD), then evaluates FD-SLMs turn by turn across four dimensions: dialogue quality, conversational features, instruction following, and safety.
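
To make the VAD half of the segmentation step concrete, here is a minimal sketch of one way speech intervals from a VAD could be grouped into per-speaker turns. It is not the benchmark's actual procedure: the GPT-4o step is not reproduced, and the names `Segment` and `merge_into_turns` and the `max_gap` threshold are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A speech interval for one speaker (hypothetical structure)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def merge_into_turns(segments, max_gap=0.5):
    """Merge each speaker's VAD segments into turns.

    Two segments from the same speaker are joined when the silence
    between them is at most `max_gap` seconds (an assumed threshold).
    Speakers are tracked independently, so overlapping speech from the
    other channel does not split a turn.
    """
    last_by_speaker = {}
    turns = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = last_by_speaker.get(seg.speaker)
        if last is not None and seg.start - last.end <= max_gap:
            # Short pause: extend the speaker's current turn.
            last.end = max(last.end, seg.end)
        else:
            # Long pause or first segment: open a new turn.
            last = Segment(seg.speaker, seg.start, seg.end)
            last_by_speaker[seg.speaker] = last
            turns.append(last)
    return turns


# Illustrative input: the user briefly speaks over the model at 3.0 s.
example_segments = [
    Segment("user", 0.0, 1.0),
    Segment("user", 1.2, 2.0),
    Segment("model", 2.3, 4.0),
    Segment("user", 3.0, 3.2),
    Segment("model", 4.1, 5.0),
]
example_turns = merge_into_turns(example_segments)
```

In this toy input the short user segment overlaps the model's turn, which is the kind of blurred boundary that a pure silence-gap rule cannot label as a backchannel or an interruption. That limitation is presumably one reason a language model is used alongside VAD.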

In practice

Combine GPT-4o with VAD to segment full-duplex audio into turns.
Evaluate FD-SLMs on instruction following and safety in multi-round contexts.
Consider multi-round performance consistency as a key metric.
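
As a rough illustration of the last point, multi-round consistency could be summarized from per-round scores as sketched below. The function name `round_consistency`, the two summary statistics, and the example scores are all assumptions made here for illustration; they are not the benchmark's metrics or results.

```python
from statistics import mean, pstdev


def round_consistency(scores_by_round):
    """Summarize how a score changes across conversation rounds.

    `scores_by_round[r]` holds the scores collected at round r + 1
    across all evaluated dialogues (hypothetical input layout).
    """
    per_round = [mean(round_scores) for round_scores in scores_by_round]
    return {
        "per_round_mean": per_round,
        # Positive value: quality degraded between first and last round.
        "drop_first_to_last": per_round[0] - per_round[-1],
        # Spread of the per-round means; 0 means perfectly stable.
        "std_across_rounds": pstdev(per_round),
    }


# Made-up scores for three rounds of two dialogues (not paper results).
summary = round_consistency([[4.0, 5.0], [4.0, 4.0], [3.0, 3.0]])
print(summary["per_round_mean"])      # [4.5, 4.0, 3.0]
print(summary["drop_first_to_last"])  # 1.5
```

The same summary can be computed separately per evaluation dimension, so that a model that stays stable on dialogue quality but degrades on safety is not hidden by an aggregate score.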

Topics

Full-Duplex Speech Language Models
Multi-Round Conversation Evaluation
Turn Segmentation Methodology
Dialogue Quality Assessment
Conversational Features

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.