Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues

Source: cs.AI updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

Syn-TurnTurk is a synthetic Turkish dialogue dataset designed to improve turn-taking prediction in voice-based chatbots, addressing the lack of high-quality, labeled conversational data for Turkish. The dataset was generated with five Qwen Large Language Models (LLMs), including qwen3-max-2026-01-23 and qwen3.5-397b-a17b, and comprises 1,625 dialogues across 79 unique topics, with 12,560 speaker changes and 5,305 instances of overlapping speech. The generation process incorporated human-like speech characteristics such as strategic silences and interjections, and used a temperature of 0.7 for most outputs. The dataset was evaluated with traditional and deep learning models, including Decision Tree, Random Forest, Logistic Regression, and BI-LSTM. BI-LSTM achieved the highest accuracy (0.839), while an Ensemble (LR+RF) model reached the highest AUC (0.910), which suggests the dataset is effective for training models to pick up linguistic turn-taking cues.
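The summary reports the best accuracy for one model and the best AUC for another. The two metrics can rank models differently, because accuracy depends on a fixed decision threshold while AUC only depends on how well positives are ordered above negatives. The sketch below illustrates this with invented toy scores; the labels, scores, and the 0.5 threshold are assumptions for illustration and are not taken from the paper.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Fraction of examples classified correctly at a fixed threshold."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half), which equals the area under the ROC curve."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = turn shift, 0 = speaker holds the turn.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.90, 0.80, 0.40, 0.60, 0.20, 0.10]  # confident, one ranking error
model_b = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]  # perfect ranking, low scores

for name, scores in [("A", model_a), ("B", model_b)]:
    print(name, round(accuracy(y_true, scores), 3), round(roc_auc(y_true, scores), 3))
```

In this toy case model A has the higher accuracy and model B the higher AUC, so which model counts as "best" depends on whether a fixed threshold or the overall ranking matters more for the application.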

Key takeaway

For research scientists developing conversational AI for less-resourced languages, this work suggests that synthetic data generation with LLMs can produce usable datasets for complex tasks like turn-taking prediction. On Syn-TurnTurk, BI-LSTM gave the highest accuracy (0.839) and an LR+RF ensemble gave the highest AUC (0.910), so both are reasonable starting points for building more natural and responsive voice-based systems, especially for languages with distinctive grammatical structures, such as Turkish.

Key insights

Synthetic datasets can effectively address data scarcity for turn-taking prediction in under-resourced languages like Turkish.

Method

Generate diverse two-person dialogues using multiple LLMs with varied topics and controlled temperature, incorporating human speech characteristics like overlaps and silences, then label turn transitions for classification.
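The labeling step at the end of this method can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes each generated utterance carries a speaker tag and start/end times, and the `Segment` type, the `label_transitions` helper, the field names, and the sample utterances are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    speaker: str   # hypothetical speaker tag, e.g. "A" or "B"
    start: float   # assumed start time in seconds
    end: float     # assumed end time in seconds
    text: str


def label_transitions(segments: List[Segment]) -> List[Dict]:
    """For each consecutive pair of segments, record whether the speaker
    changed and whether the next segment starts before the current one ends."""
    ordered = sorted(segments, key=lambda s: s.start)
    rows = []
    for cur, nxt in zip(ordered, ordered[1:]):
        rows.append({
            "text": cur.text,
            "turn_shift": int(cur.speaker != nxt.speaker),  # classification label
            "overlap": int(nxt.start < cur.end),            # overlapping speech
            "gap": round(nxt.start - cur.end, 3),           # silence (negative if overlap)
        })
    return rows


# Invented example dialogue, for illustration only.
dialogue = [
    Segment("A", 0.0, 2.1, "Yarın toplantı saat kaçta?"),
    Segment("B", 2.4, 4.0, "Sanırım onda, ama emin değilim."),
    Segment("B", 4.6, 5.5, "Bir bakayım."),
    Segment("A", 5.3, 6.0, "Tamam."),
]

rows = label_transitions(dialogue)
for row in rows:
    print(row)
```

Each row pairs an utterance with a binary `turn_shift` label, which is the kind of target a classifier such as Logistic Regression or a BI-LSTM could be trained on; the `overlap` and `gap` fields show how overlaps and silences could be kept as additional signals.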

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.