Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Source: cs.AI updates on arXiv.org · Field: Technology & Digital / Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, long

Summary

Researchers from Amazon AGI and IIT Kharagpur introduce ModeratorLM, a role-playing voice agent for real-time multi-party conversations. The system conditions its turn-taking behavior on an explicitly assigned role and is built on a speech large language model that processes audio in a chunk-wise streaming manner. A reasoning-augmented variant, ModeratorLM-Think, adds chain-of-thought reasoning over the conversational context and the assigned role. For training, the team constructed RolePlayConv, a large-scale synthetic dataset of approximately 75,000 multi-party conversations with diverse assistant roles. In experiments on real-world meeting data (NOTSOFAR-1) and on RolePlayConv, turn-taking precision improved by over 40% and recall by more than 70%, with substantially fewer false-positive interruptions than non-role-conditioned baselines such as Moshi. ModeratorLM-Think achieved the highest role-fidelity scores in LLM-as-a-Judge evaluations.
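To make the reported metrics concrete, here is a minimal sketch of how precision, recall, and false-positive interruptions could be computed over binary turn-taking decisions. It assumes one decision point per chunk, with a reference label saying whether the agent should have spoken. The function name `turn_taking_metrics` and this per-chunk framing are illustrative assumptions; the paper's exact evaluation protocol may differ.

```python
from typing import Dict, Sequence


def turn_taking_metrics(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> Dict[str, float]:
    """Score binary turn-taking decisions (hypothetical helper).

    predicted[i] is True if the agent took the turn at decision point i;
    reference[i] is True if taking the turn there was appropriate.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")

    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)       # spoke when it should
    fp = sum(1 for p, r in pairs if p and not r)   # interrupted wrongly
    fn = sum(1 for p, r in pairs if not p and r)   # stayed silent wrongly

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_interruptions": float(fp),
    }


if __name__ == "__main__":
    predicted = [True, True, False, False, True]
    reference = [True, False, False, True, True]
    print(turn_taking_metrics(predicted, reference))
```

Under this framing, a false positive is exactly an unwarranted interruption, so higher precision and fewer false-positive interruptions move together.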

Key takeaway

Machine learning engineers building multi-party voice agents should consider integrating explicit role conditioning and dynamic chunking into their speech LLM architectures. In the reported experiments, this approach improved turn-taking precision by over 40% and recall by more than 70% while reducing false-positive interruptions. Adding chain-of-thought reasoning, as in ModeratorLM-Think, further improved role fidelity, helping agents engage more naturally and contextually in complex conversational settings.

Key insights

Role-conditioned speech LLMs significantly improve turn-taking and response generation in multi-party voice agents by adapting to explicit roles.

Method

ModeratorLM uses a speech LLM that processes streaming audio chunk by chunk, making turn-taking decisions and generating responses according to an assigned role; the ModeratorLM-Think variant adds chain-of-thought reasoning over the context and role.
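The control flow described above can be sketched as a per-chunk decision loop. This is a toy illustration, not the authors' implementation: text strings stand in for audio chunks, and `StreamingAgent`, `TurnDecision`, and `toy_policy` are hypothetical names, with `toy_policy` standing in for the speech LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TurnDecision:
    speak: bool          # whether the agent takes the turn after this chunk
    response: str = ""   # what it says if it does


# Hypothetical policy signature: (role, context so far) -> TurnDecision
Policy = Callable[[str, List[str]], TurnDecision]


@dataclass
class StreamingAgent:
    role: str
    policy: Policy
    context: List[str] = field(default_factory=list)

    def step(self, chunk: str) -> TurnDecision:
        """Consume one incoming chunk, then decide whether to speak."""
        self.context.append(chunk)
        decision = self.policy(self.role, self.context)
        if decision.speak:
            # The agent's own output becomes part of the shared context.
            self.context.append(f"[assistant] {decision.response}")
        return decision


def toy_policy(role: str, context: List[str]) -> TurnDecision:
    """Stand-in for the model: a moderator speaks only when addressed."""
    last = context[-1].lower()
    if role == "moderator" and "moderator" in last:
        return TurnDecision(True, "Let's move to the next agenda item.")
    return TurnDecision(False)


def run(agent: StreamingAgent, chunks: List[str]) -> List[TurnDecision]:
    """Feed chunks in arrival order, collecting one decision per chunk."""
    return [agent.step(c) for c in chunks]


if __name__ == "__main__":
    agent = StreamingAgent(role="moderator", policy=toy_policy)
    stream = ["A: I think we're done.", "B: Moderator, what's next?", "A: OK."]
    for chunk, d in zip(stream, run(agent, stream)):
        print(chunk, "->", "SPEAK" if d.speak else "stay silent")
```

The point of the sketch is the shape of the loop: the role is an explicit input to every decision, and the decision is revisited after each chunk rather than once per utterance. A reasoning-augmented variant would insert an intermediate reasoning step inside the policy before it returns its decision.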

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.