TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

Source: cs.CL updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Conversational AI · Depth: Expert, extended

Summary

TurnGuide is a planning-inspired approach for improving the conversational abilities of end-to-end Full-Duplex Speech Language Models (e2e FD-SLMs). Text-guided speech generation in these models faces two problems, when to insert the text and how long it should be, and both often degrade dialogue quality compared to pure-text models. TurnGuide dynamically segments assistant speech into dialogue turns and generates turn-level text guidance before the corresponding speech output. The method uses GLM-4-Voice as its backbone and is trained on the Fisher dataset for 2 epochs on 8 A800 GPUs, with a global batch size of 256, a learning rate of 4e-6, and 20 warmup steps. Experiments show a gain of over 30% in semantic evaluation scored by GPT-4o while preserving natural conversational flow, as validated on the Candor dataset.
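The training setup reported above can be written down as a small configuration sketch. The summary gives only the epoch count, GPU count, global batch size, learning rate, and warmup steps; the class and field names, the gradient-accumulation setting, and the linear warmup shape are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values reported in the summary.
    epochs: int = 2
    num_gpus: int = 8
    global_batch_size: int = 256
    learning_rate: float = 4e-6
    warmup_steps: int = 20
    # Assumption: gradient accumulation is not stated in the summary.
    grad_accum_steps: int = 1

    def per_device_batch_size(self) -> int:
        """Samples each GPU processes per forward/backward pass."""
        denom = self.num_gpus * self.grad_accum_steps
        if self.global_batch_size % denom != 0:
            raise ValueError("global batch size must divide evenly across GPUs")
        return self.global_batch_size // denom

    def lr_at(self, step: int) -> float:
        """Linear warmup to the peak rate (assumed shape), then constant."""
        if step < self.warmup_steps:
            return self.learning_rate * (step + 1) / self.warmup_steps
        return self.learning_rate
```

With the defaults, `TrainConfig().per_device_batch_size()` works out to 32 samples per GPU per step (256 / 8), provided no gradient accumulation is used.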

Key takeaway

AI scientists and ML engineers developing full-duplex speech language models should consider integrating TurnGuide's planning-inspired text guidance. The reported results show a gain of over 30% in GPT-4o semantic evaluation, reflecting higher semantic meaningfulness and coherence, without sacrificing natural conversational flow. The approach targets the text insertion timing and length issues of text-guided speech generation, leading to more human-like and effective spoken interactions in real-time dialogue systems.

Key insights

TurnGuide enhances full-duplex SLMs by planning turn-level text guidance before speech, improving semantic coherence while preserving natural conversational flow.
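The idea of placing text guidance ahead of each turn's speech can be pictured as a sequence layout. This is a minimal sketch only: the marker tokens `<turn_text>` and `<turn_speech>` and the function name are hypothetical, since the summary does not describe the actual token format.

```python
from typing import List, Sequence, Tuple

# Hypothetical marker tokens; the real special tokens are not given in the summary.
TEXT_START = "<turn_text>"
SPEECH_START = "<turn_speech>"


def plan_then_speak(turns: Sequence[Tuple[List[str], List[str]]]) -> List[str]:
    """Flatten (text_tokens, speech_tokens) turns so each turn's text precedes its speech."""
    sequence: List[str] = []
    for text_tokens, speech_tokens in turns:
        sequence.append(TEXT_START)
        sequence.extend(text_tokens)      # turn-level text guidance comes first
        sequence.append(SPEECH_START)
        sequence.extend(speech_tokens)    # speech tokens for the same turn follow
    return sequence
```

Each turn contributes its text plan and then its speech, so the model conditions on what it intends to say before producing the audio for that turn.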

Method

TurnGuide uses voice activity detection (VAD) and automatic speech recognition (ASR) with word-level timestamps to segment speech into Inter-Pausal Units (IPUs), then splits these at punctuation to form dialogue turns. It applies channel-wise interleaving and text-speech chunk interleaving (5:5 ratio), with GLM-4-Voice as the backbone.
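The segmentation step can be sketched as follows, assuming ASR has already produced words with start and end times. The `Word` type, the function names, the 0.2-second pause threshold, and the set of punctuation marks are all assumptions for illustration; the summary does not state the values the authors use. The interleaving step is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def split_into_ipus(words: List[Word], max_pause: float = 0.2) -> List[List[Word]]:
    """Group words into IPUs; a silence longer than max_pause starts a new IPU."""
    ipus: List[List[Word]] = []
    for word in words:
        if ipus and word.start - ipus[-1][-1].end <= max_pause:
            ipus[-1].append(word)
        else:
            ipus.append([word])
    return ipus


def split_at_punctuation(ipu: List[Word], marks: str = ".?!") -> List[List[Word]]:
    """Cut an IPU after every word that ends with a sentence-final mark."""
    turns: List[List[Word]] = []
    current: List[Word] = []
    for word in ipu:
        current.append(word)
        if word.text and word.text[-1] in marks:
            turns.append(current)
            current = []
    if current:  # trailing words without closing punctuation
        turns.append(current)
    return turns


def segment_turns(words: List[Word], max_pause: float = 0.2) -> List[List[Word]]:
    """Pause-based IPU segmentation followed by punctuation-based turn splitting."""
    return [
        turn
        for ipu in split_into_ipus(words, max_pause)
        for turn in split_at_punctuation(ipu)
    ]
```

Pauses set the coarse boundaries and punctuation refines them, so a long uninterrupted stretch of speech still breaks into sentence-sized turns that can each receive their own text guidance.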

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.