Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
Summary
Researchers from Amazon AGI and IIT Kharagpur introduced ModeratorLM, a role-playing voice agent designed for real-time multi-party conversations. Built on a speech large language model that processes audio in a chunk-wise streaming manner, the system conditions its turn-taking behavior on an explicitly assigned role. A reasoning-augmented variant, ModeratorLM-Think, additionally performs chain-of-thought reasoning over the conversational context and the assigned role. To support training, the team constructed RolePlayConv, a large-scale synthetic dataset of approximately 75,000 multi-party conversations with diverse assistant roles. In experiments on real-world meeting data (NOTSOFAR-1) and on RolePlayConv, turn-taking precision increased by over 40% and recall by more than 70%, with substantially fewer false-positive interruptions than non-role-conditioned baselines such as Moshi. ModeratorLM-Think achieved the highest role fidelity scores in LLM-as-a-Judge evaluations.
Key takeaway
Machine Learning Engineers building multi-party voice agents should consider adding explicit role conditioning and dynamic chunking to their speech LLM architectures. In the reported experiments, this approach raised turn-taking precision by over 40% and recall by more than 70%, while reducing false-positive interruptions relative to non-role-conditioned baselines such as Moshi. Adding chain-of-thought reasoning, as in ModeratorLM-Think, further improved role fidelity, which should help agents participate more naturally and contextually in complex conversational settings.
Key insights
Role-conditioned speech LLMs significantly improve turn-taking and response generation in multi-party voice agents by adapting to explicit roles.
Principles
- Role conditioning enhances multi-party turn-taking precision.
- Chain-of-thought reasoning improves role fidelity and recall.
- Dynamic chunking prevents overfitting to chunk length.
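The precision and recall figures cited above can be made concrete with a small sketch. This summary does not say how the paper scores turn-taking, so the code below assumes the standard definitions applied to per-chunk binary decisions (speak or stay silent); the function name and the per-chunk framing are illustrative assumptions. Under this framing, a false positive is an interruption: the agent spoke where the reference says it should have stayed silent.

```python
from typing import Sequence, Tuple


def turn_taking_precision_recall(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> Tuple[float, float]:
    """Score per-chunk speak/stay-silent decisions (assumed framing).

    predicted[i] is True if the agent chose to speak at chunk i;
    reference[i] is True if the agent should have spoken at chunk i.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")

    pairs = list(zip(predicted, reference))
    true_pos = sum(1 for p, r in pairs if p and r)
    false_pos = sum(1 for p, r in pairs if p and not r)  # unwanted interruptions
    false_neg = sum(1 for p, r in pairs if not p and r)  # missed turns

    # Guard against division by zero when the agent never speaks
    # or the reference contains no agent turns.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall
```

Fewer false-positive interruptions raise precision, while fewer missed turns raise recall, which is why the two numbers are reported separately.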
Method
ModeratorLM employs a speech LLM processing chunk-wise streaming audio to make turn-taking decisions and generate responses based on an assigned role, with a variant adding chain-of-thought reasoning.
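The control flow described above can be sketched as a loop that consumes one chunk at a time and asks a role-conditioned decision function whether to respond. This is a minimal illustration, not the authors' implementation: `AgentState`, `run_streaming_agent`, and `toy_decide` are hypothetical names, chunks are plain text strings rather than audio, and the keyword rule only stands in for the speech LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class AgentState:
    """Running state for one conversation (hypothetical structure)."""
    role: str
    context: List[str] = field(default_factory=list)


def run_streaming_agent(
    chunks: Iterable[str],
    role: str,
    decide: Callable[[AgentState, str], Optional[str]],
) -> List[Optional[str]]:
    """Process chunks in arrival order; one decision per chunk.

    `decide` returns a response string if the agent takes the turn,
    or None if it stays silent.
    """
    state = AgentState(role=role)
    outputs: List[Optional[str]] = []
    for chunk in chunks:
        state.context.append(chunk)  # accumulate conversational context
        outputs.append(decide(state, chunk))
    return outputs


def toy_decide(state: AgentState, chunk: str) -> Optional[str]:
    """Stand-in for the model: speak only when directly asked a question."""
    if "assistant" in chunk.lower() and chunk.strip().endswith("?"):
        return f"[{state.role}] responding after {len(state.context)} chunks"
    return None
```

Because the role lives in the state passed to every decision, swapping in a different role changes behavior without changing the loop itself.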
In practice
- Synthesize large-scale datasets for role-playing agents.
- Use dynamic chunking for robust turn-taking models.
- Integrate ASR hypotheses for real-time textual context.
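The dynamic chunking recommendation can be illustrated with a short sketch. This summary does not describe the paper's mechanism, so the code assumes that dynamic chunking means splitting each training stream into chunks of randomly varying length, so that a model never sees a single fixed chunk size. The function name, its parameters, and the use of frame counts as the unit are all assumptions.

```python
import random
from typing import List, Sequence


def dynamic_chunks(
    frames: Sequence[float],
    min_len: int,
    max_len: int,
    rng: random.Random,
) -> List[List[float]]:
    """Split a frame sequence into consecutive chunks of random length.

    Each chunk length is drawn uniformly from [min_len, max_len];
    the final chunk may be shorter if the stream runs out.
    """
    if not 1 <= min_len <= max_len:
        raise ValueError("require 1 <= min_len <= max_len")

    chunks: List[List[float]] = []
    start = 0
    while start < len(frames):
        size = rng.randint(min_len, max_len)  # inclusive on both ends
        chunks.append(list(frames[start:start + size]))
        start += size
    return chunks
```

Passing a seeded `random.Random` keeps the split reproducible for debugging, while a fresh seed per epoch would expose the model to different chunk boundaries each time.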
Topics
- Multi-party Conversation
- Speech LLMs
- Turn-taking
- Role-playing AI
- Chain-of-Thought Reasoning
- ModeratorLM
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.