TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Conversational AI · Depth: Expert, extended

Summary

TurnGuide is a planning-inspired approach for improving the conversational abilities of end-to-end Full-Duplex Speech Language Models (e2e FD-SLMs). Text-guided speech generation in these models suffers from insertion timing and length issues, which often degrade dialogue quality relative to pure-text models. TurnGuide dynamically segments assistant speech into dialogue turns and generates turn-level text guidance before the corresponding speech output. The method uses GLM-4-Voice as its backbone and is trained on the Fisher dataset for 2 epochs on 8 A800 GPUs with a global batch size of 256, a learning rate of 4e-6, and 20 warmup steps. Experiments show a gain of over 30% in semantic evaluation by GPT-4o while preserving natural conversational flow, as validated on the Candor dataset.

Key takeaway

For AI scientists and ML engineers developing full-duplex speech language models, TurnGuide's planning-inspired text guidance is worth considering. The reported results show a gain of over 30% in semantic meaningfulness and coherence without sacrificing natural conversational flow. The approach targets the text insertion timing and length issues of text-guided speech generation, which should make real-time spoken dialogue systems more human-like and effective.

Key insights

TurnGuide enhances full-duplex SLMs by planning turn-level text guidance before speech, improving semantic coherence and natural flow.

Principles

Mimic human conversational planning for dialogue.
Dynamically segment speech into coherent dialogue turns.
Generate turn-level text guidance prior to speech output.
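
The third principle can be pictured as a token ordering in which each turn's text plan comes before that turn's speech. The following is a minimal sketch, not the authors' implementation: the marker tokens `<text>` and `<speech>` and the function name `plan_then_speak` are hypothetical, and the sketch ignores the user channel and the chunk-level interleaving described under Method.

```python
from typing import List, Sequence, Tuple

# Hypothetical marker tokens; the model's real special tokens are not given here.
TEXT_START = "<text>"
SPEECH_START = "<speech>"

# One assistant turn: (text guidance tokens, speech tokens).
Turn = Tuple[Sequence[str], Sequence[str]]


def plan_then_speak(turns: Sequence[Turn]) -> List[str]:
    """Flatten turns so that each turn's text guidance precedes its speech."""
    sequence: List[str] = []
    for text_tokens, speech_tokens in turns:
        sequence.append(TEXT_START)
        sequence.extend(text_tokens)      # the plan for this turn
        sequence.append(SPEECH_START)
        sequence.extend(speech_tokens)    # the speech that realizes the plan
    return sequence
```

Because the text for a turn is complete before any of its speech tokens appear, the speech can be conditioned on the full turn-level plan rather than on text that arrives too early or too late.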

Method

TurnGuide uses VAD and ASR with word-level timestamps to segment speech into Inter-Pausal Units (IPUs), then splits the IPUs at punctuation to obtain dialogue turns. It applies channel-wise interleaving and text-speech chunk interleaving (5:5 ratio) with GLM-4-Voice.
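
The segmentation step can be sketched as follows, assuming the ASR output is already available as words with start and end times. This is an illustration under stated assumptions rather than the paper's code: the `Word` type, the 0.2-second pause threshold, and the set of turn-ending punctuation marks are all hypothetical choices not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str     # token as transcribed, punctuation attached
    start: float  # start time in seconds
    end: float    # end time in seconds


def segment_ipus(words: List[Word], max_pause: float = 0.2) -> List[List[Word]]:
    """Group words into IPUs; a silence longer than max_pause starts a new IPU.

    max_pause is an assumed threshold, not a value from the source.
    """
    ipus: List[List[Word]] = []
    for word in words:
        if ipus and word.start - ipus[-1][-1].end <= max_pause:
            ipus[-1].append(word)
        else:
            ipus.append([word])
    return ipus


def split_turns(ipu: List[Word], enders: str = ".?!") -> List[List[Word]]:
    """Split one IPU into turns after each word ending in sentence punctuation."""
    turns: List[List[Word]] = []
    current: List[Word] = []
    for word in ipu:
        current.append(word)
        if word.text and word.text[-1] in enders:
            turns.append(current)
            current = []
    if current:  # trailing words without closing punctuation
        turns.append(current)
    return turns


def turns_from_words(words: List[Word], max_pause: float = 0.2) -> List[List[Word]]:
    """Full pipeline: words -> IPUs -> punctuation-delimited turns."""
    return [turn for ipu in segment_ipus(words, max_pause) for turn in split_turns(ipu)]
```

Each resulting turn carries the start time of its first word, which is the natural place to insert that turn's text guidance.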

In practice

Use the Whisper medium model for ASR and word-level timestamps.
Train on the Fisher dataset for 2 epochs on 8 A800 GPUs.
Use GPT-4o for semantic evaluation (0-5 scale).
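
The reported training settings can be captured in a small configuration object, which also makes the batch arithmetic explicit: a global batch size of 256 across 8 GPUs means 32 samples per GPU per optimizer step when no gradient accumulation is used. The sketch below is illustrative only. The `TrainConfig` class, the gradient accumulation parameter, and the linear shape of the warmup followed by a constant rate are assumptions; the summary gives only the peak learning rate and the number of warmup steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values reported in the summary.
    epochs: int = 2
    num_gpus: int = 8
    global_batch_size: int = 256
    learning_rate: float = 4e-6
    warmup_steps: int = 20

    def per_gpu_batch(self, grad_accum_steps: int = 1) -> int:
        """Samples per GPU per forward pass for a given accumulation setting."""
        denom = self.num_gpus * grad_accum_steps
        if self.global_batch_size % denom != 0:
            raise ValueError("global batch size must divide evenly across GPUs and accumulation steps")
        return self.global_batch_size // denom

    def lr_at(self, step: int) -> float:
        """Learning rate at a 0-indexed optimizer step.

        Assumes linear warmup, then a constant rate; the actual
        post-warmup schedule is not stated in the source.
        """
        if step < self.warmup_steps:
            return self.learning_rate * (step + 1) / self.warmup_steps
        return self.learning_rate
```

With the defaults, `TrainConfig().per_gpu_batch()` gives 32, and setting `grad_accum_steps=4` would lower the per-GPU micro-batch to 8 while keeping the global batch at 256.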

Topics

Full-Duplex Speech Language Models
Text-Guided Speech Generation
Dialogue Systems
Speech-Text Alignment
Conversational AI
GLM-4-Voice

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.