MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Speech Technology) · Depth: Expert, long

Summary

MTR-DuplexBench is a new benchmark designed to comprehensively evaluate Full-Duplex Speech Language Models (FD-SLMs) in multi-round conversational settings. Unlike existing benchmarks that primarily focus on single-round interactions and basic conversational features, MTR-DuplexBench addresses the complexities of continuous full-duplex dialogues, including blurred turn boundaries and context inconsistency. The benchmark introduces a novel methodology to segment continuous dialogues into discrete turns, enabling turn-by-turn evaluation across four critical dimensions: dialogue quality, conversational dynamics (smooth turn-taking, interruption, pause handling, background speech, backchanneling), instruction following, and safety. Initial experimental results using the Moshi FD-SLM indicate that current models struggle to maintain consistent performance across multiple rounds and various evaluation dimensions, underscoring the necessity and effectiveness of this new comprehensive evaluation framework.

Key takeaway

Research scientists developing or deploying Full-Duplex Speech Language Models should prioritize multi-round evaluation with benchmarks such as MTR-DuplexBench. Turn-by-turn evaluation can reveal performance inconsistencies in dialogue quality, instruction following, and safety that single-round evaluations miss, which helps in building more robust and reliable conversational AI systems.

Key insights

MTR-DuplexBench offers a comprehensive, multi-round evaluation for Full-Duplex Speech Language Models.

Method

MTR-DuplexBench segments continuous full-duplex dialogues into discrete turns using GPT-4o and VAD, then evaluates FD-SLMs turn-by-turn across dialogue quality, conversational features, instruction following, and safety.
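The segmentation and turn-by-turn scoring idea can be illustrated with a minimal sketch. It is not the benchmark's implementation: the GPT-4o step is omitted entirely, and the gap-merging rule, the `max_gap` threshold, and the names `Turn`, `merge_intervals`, `segment_turns`, and `mean_score_by_round` are all hypothetical. The sketch assumes a VAD has already produced speech intervals (in seconds) for the user channel, merges intervals separated by short pauses into one user turn, and closes each turn's evaluation window when the next user turn begins.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumption: a VAD has already produced (start_sec, end_sec) speech intervals.
Interval = Tuple[float, float]


@dataclass
class Turn:
    index: int          # 1-based round number within the dialogue
    user_start: float   # when the user starts speaking in this round
    user_end: float     # when the user stops speaking in this round
    window_end: float   # evaluation window closes at the next user turn


def merge_intervals(intervals: List[Interval], max_gap: float) -> List[Interval]:
    """Merge speech intervals separated by pauses of at most max_gap seconds."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Short pause: treat as a continuation of the same utterance.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def segment_turns(user_vad: List[Interval], dialogue_end: float,
                  max_gap: float = 0.5) -> List[Turn]:
    """Cut a continuous dialogue into discrete turns from user-side VAD output.

    max_gap is a hypothetical threshold, not a value taken from the benchmark.
    """
    merged = merge_intervals(user_vad, max_gap)
    turns: List[Turn] = []
    for i, (start, end) in enumerate(merged):
        next_start = merged[i + 1][0] if i + 1 < len(merged) else dialogue_end
        turns.append(Turn(i + 1, start, end, next_start))
    return turns


def mean_score_by_round(scores: List[Tuple[int, float]]) -> Dict[int, float]:
    """Average per-turn scores by round index to expose drift across rounds."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for round_index, score in scores:
        buckets[round_index].append(score)
    return {r: sum(v) / len(v) for r, v in sorted(buckets.items())}


if __name__ == "__main__":
    vad = [(0.0, 1.0), (1.2, 2.0), (5.0, 6.5)]  # illustrative values only
    for turn in segment_turns(vad, dialogue_end=10.0):
        print(turn)
```

Grouping scores by round index rather than averaging over a whole dialogue is what lets degradation in later rounds show up; any real scorer (for example, an LLM judge for dialogue quality) would supply the per-turn scores that `mean_score_by_round` aggregates.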

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.