TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving
Summary
TurnGuide is a planning-inspired approach for improving the conversational ability of end-to-end Full-Duplex Speech Language Models (e2e FD-SLMs). It targets two problems in text-guided speech generation, namely when to insert text guidance and how long that guidance should be, which often leave dialogue quality below that of pure-text models. TurnGuide dynamically segments assistant speech into dialogue turns and generates turn-level text guidance before the corresponding speech output. The method uses GLM-4-Voice as its backbone and is trained on the Fisher dataset for 2 epochs on 8 A800 GPUs, with a global batch size of 256, a learning rate of 4e-6, and 20 warmup steps. Experiments report a performance gain of over 30% in semantic evaluation by GPT-4o while preserving natural conversational flow, as validated on the Candor dataset.
Key takeaway
AI scientists and ML engineers developing full-duplex speech language models should consider integrating TurnGuide's planning-inspired text guidance. The reported results show gains of over 30% in semantic meaningfulness and coherence without sacrificing natural conversational flow. The approach addresses text insertion timing and length issues, which supports more human-like and effective spoken interaction in real-time dialogue systems.
Key insights
TurnGuide enhances full-duplex SLMs by planning turn-level text guidance before speech, improving semantic coherence and natural flow.
Principles
- Mimic human conversational planning for dialogue.
- Dynamically segment speech into coherent dialogue turns.
- Generate turn-level text guidance prior to speech output.
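The third principle can be illustrated with a minimal sketch of how a training sequence might be assembled so that each turn's text guidance precedes its speech tokens. This is not the authors' implementation: the function name `interleave_turns`, the marker tokens `<text>` and `<speech>`, and the string token format are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Each turn pairs its text guidance tokens with its speech tokens.
Turn = Tuple[Sequence[str], Sequence[str]]


def interleave_turns(
    turns: Sequence[Turn],
    text_start: str = "<text>",      # hypothetical marker token
    speech_start: str = "<speech>",  # hypothetical marker token
) -> List[str]:
    """Build one flat sequence where every turn's text comes before its speech."""
    sequence: List[str] = []
    for text_tokens, speech_tokens in turns:
        sequence.append(text_start)
        sequence.extend(text_tokens)
        sequence.append(speech_start)
        sequence.extend(speech_tokens)
    return sequence


example_turns = [
    (["hi"], ["s1", "s2"]),
    (["ok"], ["s3"]),
]
print(interleave_turns(example_turns))
```

With this layout a model reads the plan for a turn before producing any of that turn's audio tokens, which is the ordering the principle describes.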
Method
TurnGuide uses voice activity detection (VAD) and ASR with word-level timestamps to segment speech into Inter-Pausal Units (IPUs), then splits each IPU at punctuation to obtain dialogue turns. It applies channel-wise and text-speech chunk interleaving (5:5 ratio) with GLM-4-Voice as the backbone.
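The segmentation step can be sketched as follows, starting from word-level timestamps such as those an ASR system returns. This is a rough illustration under stated assumptions, not the paper's code: the 0.2 s pause threshold, the choice of `.`, `?`, and `!` as split points, and all names (`Word`, `split_into_ipus`, `split_ipu_into_turns`, `segment`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def split_into_ipus(words: List[Word], max_pause: float = 0.2) -> List[List[Word]]:
    """Group words into IPUs; a silence longer than max_pause starts a new IPU.

    The 0.2 s default is an assumed value, not one taken from the paper.
    """
    ipus: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        if current and word.start - current[-1].end > max_pause:
            ipus.append(current)
            current = []
        current.append(word)
    if current:
        ipus.append(current)
    return ipus


def split_ipu_into_turns(ipu: List[Word], enders: str = ".?!") -> List[List[Word]]:
    """Split one IPU after every word that ends with sentence punctuation (assumed set)."""
    turns: List[List[Word]] = []
    current: List[Word] = []
    for word in ipu:
        current.append(word)
        if word.text and word.text[-1] in enders:
            turns.append(current)
            current = []
    if current:
        turns.append(current)
    return turns


def segment(words: List[Word], max_pause: float = 0.2) -> List[Tuple[str, float, float]]:
    """Return (text, start, end) for every turn, in time order."""
    result = []
    for ipu in split_into_ipus(words, max_pause):
        for turn in split_ipu_into_turns(ipu):
            text = " ".join(w.text for w in turn)
            result.append((text, turn[0].start, turn[-1].end))
    return result


words = [
    Word("Hi", 0.0, 0.2), Word("there.", 0.25, 0.6),
    Word("How", 0.65, 0.8), Word("are", 0.85, 1.0), Word("you?", 1.05, 1.3),
    Word("Fine", 2.5, 2.8),
]
for turn in segment(words):
    print(turn)
```

In this example the long silence before "Fine" closes the first IPU, and the punctuation inside that IPU splits it into two turns, so three turns come out in total.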
In practice
- Use the Whisper medium model for ASR and timestamping.
- Train on the Fisher dataset for 2 epochs on 8 A800 GPUs.
- Use GPT-4o for semantic evaluation (0-5 scale).
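The training setup reported in the summary (2 epochs, 8 GPUs, global batch size 256, learning rate 4e-6, 20 warmup steps) can be written down as a small configuration sketch. The per-device batch size of 4, the linear warmup shape, and the constant rate after warmup are assumptions made here for illustration; the summary does not state them, and the names `TrainConfig` and `lr_at` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 2
    num_gpus: int = 8
    global_batch_size: int = 256
    peak_lr: float = 4e-6
    warmup_steps: int = 20
    per_device_batch_size: int = 4  # assumption: not given in the summary

    @property
    def grad_accum_steps(self) -> int:
        """Accumulation steps needed to reach the global batch size."""
        per_step = self.num_gpus * self.per_device_batch_size
        if self.global_batch_size % per_step != 0:
            raise ValueError("global batch size must be divisible by GPUs x per-device batch")
        return self.global_batch_size // per_step


def lr_at(step: int, cfg: TrainConfig) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Assumes linear warmup followed by a constant rate; the actual
    schedule after warmup is not specified in the summary.
    """
    if step < cfg.warmup_steps:
        return cfg.peak_lr * (step + 1) / cfg.warmup_steps
    return cfg.peak_lr


cfg = TrainConfig()
print(cfg.grad_accum_steps)
print(lr_at(0, cfg), lr_at(19, cfg), lr_at(100, cfg))
```

With the assumed per-device batch size of 4, each of the 8 GPUs would accumulate gradients over 8 micro-batches to reach the global batch size of 256.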
Topics
- Full-Duplex Speech Language Models
- Text-Guided Speech Generation
- Dialogue Systems
- Speech-Text Alignment
- Conversational AI
- GLM-4-Voice
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.