Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues
Summary
Syn-TurnTurk is a synthetic Turkish dialogue dataset designed to improve turn-taking prediction in voice-based chatbots, addressing the lack of high-quality, labeled conversational data for Turkish. It was generated with five Qwen large language models (LLMs), including qwen3-max-2026-01-23 and qwen3.5-397b-a17b, and comprises 1,625 dialogues across 79 unique topics, with 12,560 speaker changes and 5,305 speech overlaps. The generation process incorporated characteristics of human speech such as strategic silences and interjections, and used a temperature of 0.7 for most outputs. The dataset was evaluated with traditional and deep learning models, including Decision Tree, Random Forest, Logistic Regression, and BI-LSTM. BI-LSTM achieved the highest accuracy (0.839), while an Ensemble (LR+RF) model reached the highest AUC (0.910), which suggests the dataset is effective for training models to pick up linguistic turn-taking cues.
Key takeaway
For research scientists developing conversational AI for less-resourced languages, this work shows that synthetic data generation with LLMs can produce usable labeled datasets for complex tasks like turn-taking prediction. Consider BI-LSTM, which gave the highest accuracy on Syn-TurnTurk, or an LR+RF ensemble, which gave the highest AUC, when building more natural and responsive voice-based systems, especially for languages with distinctive grammatical structures such as Turkish.
Key insights
Synthetic datasets can effectively address data scarcity for turn-taking prediction in under-resourced languages like Turkish.
Principles
- Dialogue timing is crucial for natural chatbot interaction.
- Linguistic flow is key for managing speech transitions.
- Synthetic data can mirror real-life verbal exchanges.
Method
Generate diverse two-person dialogues using multiple LLMs with varied topics and controlled temperature, incorporating human speech characteristics like overlaps and silences, then label turn transitions for classification.
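The labeling step can be illustrated with a minimal sketch. It assumes a simple binary scheme in which each utterance is labeled 1 if the next utterance comes from the other speaker (a turn shift) and 0 if the same speaker continues (a hold). The `Utterance` class, the `label_turn_transitions` helper, and the sample sentences are hypothetical and are not taken from the dataset or the paper, whose actual annotation scheme may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    speaker: str  # hypothetical speaker tag, e.g. "A" or "B"
    text: str


def label_turn_transitions(dialogue: List[Utterance]) -> List[Tuple[str, int]]:
    """Pair each utterance with 1 if the next one has a different speaker, else 0.

    The last utterance has no successor, so it produces no example.
    """
    examples = []
    for current, following in zip(dialogue, dialogue[1:]):
        label = 1 if following.speaker != current.speaker else 0
        examples.append((current.text, label))
    return examples


# Invented two-person dialogue, for illustration only.
dialogue = [
    Utterance("A", "Yarın toplantı saat kaçta?"),  # "What time is the meeting tomorrow?"
    Utterance("B", "Hmm, sanırım onda."),          # "Hmm, I think at ten."
    Utterance("B", "Ama emin değilim."),           # "But I'm not sure."
    Utterance("A", "Tamam, teşekkürler."),         # "Okay, thanks."
]

examples = label_turn_transitions(dialogue)
for text, label in examples:
    print(label, text)
```

The resulting (text, label) pairs are the kind of input a classifier would consume. Overlaps and silences would need additional fields that this sketch omits.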
In practice
- Use Qwen LLMs for synthetic dialogue generation.
- Cover a broad range of topics (Syn-TurnTurk uses 79) to ensure dataset diversity.
- Apply BI-LSTM for high accuracy in turn prediction.
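To make the reported metrics concrete, here is a rough sketch of how an LR+RF ensemble could be scored with accuracy and AUC. It assumes soft voting (averaging the two models' predicted probabilities), which the summary does not confirm as the combination rule used in the paper. The probability values are made up for illustration, and the helpers `soft_vote`, `accuracy`, and `roc_auc` are hypothetical names written in plain Python rather than calls to any library.

```python
from typing import List


def soft_vote(prob_a: List[float], prob_b: List[float]) -> List[float]:
    """Average two models' positive-class probabilities (assumed ensemble rule)."""
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]


def accuracy(y_true: List[int], probs: List[float], threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true: List[int], probs: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [p for p, y in zip(probs, y_true) if y == 1]
    neg = [p for p, y in zip(probs, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = turn shift) and model outputs, for illustration only.
y_true = [1, 0, 1, 0, 1, 0]
lr_probs = [0.9, 0.4, 0.6, 0.7, 0.8, 0.1]  # stand-in for Logistic Regression
rf_probs = [0.7, 0.2, 0.3, 0.5, 0.9, 0.3]  # stand-in for Random Forest

ensemble_probs = soft_vote(lr_probs, rf_probs)
print("accuracy:", accuracy(y_true, ensemble_probs))
print("AUC:", roc_auc(y_true, ensemble_probs))
```

Accuracy depends on the chosen threshold, while AUC measures ranking quality across all thresholds, which is why one model can lead on accuracy and another on AUC.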
Topics
- Turn-Taking Prediction
- Synthetic Dataset
- Turkish Dialogues
- Qwen LLMs
- BI-LSTM
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.