Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues

2026-01-23 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, medium

Summary

Syn-TurnTurk is a synthetic Turkish dialogue dataset designed to improve turn-taking prediction in voice-based chatbots, addressing the lack of high-quality, labeled conversational data for Turkish. It was generated with five Qwen Large Language Models (LLMs), including qwen3-max-2026-01-23 and qwen3.5-397b-a17b, and comprises 1,625 dialogues across 79 unique topics, with 12,560 speaker changes and 5,305 speech overlaps. The generation process incorporated human-like speech characteristics such as strategic silences and interjections, using a temperature of 0.7 for most outputs. The dataset was evaluated with traditional and deep learning models: Decision Tree, Random Forest, Logistic Regression, and BI-LSTM. BI-LSTM achieved the highest accuracy (0.839), while an Ensemble (LR+RF) model reached the peak AUC of 0.910, suggesting that models trained on the dataset can learn the linguistic cues that signal a turn change.

Key takeaway

For research scientists developing conversational AI for less-resourced languages, this work shows that synthetic data generation with LLMs can produce usable datasets for complex tasks like turn-taking prediction. Consider BI-LSTM, which reached the highest accuracy (0.839) on Syn-TurnTurk, or an Ensemble (LR+RF), which reached the highest AUC (0.910), when building more natural and responsive voice-based systems, especially for languages with distinctive grammatical structures like Turkish.

Key insights

Synthetic datasets can effectively address data scarcity for turn-taking prediction in under-resourced languages like Turkish.

Principles

Dialogue timing is crucial for natural chatbot interaction.
Linguistic flow is key for managing speech transitions.
Synthetic data can mirror real-life verbal exchanges.

Method

Generate diverse two-person dialogues using multiple LLMs with varied topics and controlled temperature, incorporating human speech characteristics like overlaps and silences, then label turn transitions for classification.
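The labeling step can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes a dialogue is stored as a list of (speaker, text) pairs and that an utterance is labeled 1 when the next utterance comes from the other speaker and 0 when the same speaker continues. The function name `label_turn_transitions` and the sample Turkish dialogue are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: one (speaker, text) pair per utterance.
Utterance = Tuple[str, str]


def label_turn_transitions(dialogue: List[Utterance]) -> List[Tuple[str, int]]:
    """Pair each utterance with 1 if the next utterance is by the other
    speaker (a turn shift), or 0 if the same speaker keeps the floor.

    The final utterance has no successor, so it yields no example.
    """
    examples = []
    for (speaker, text), (next_speaker, _) in zip(dialogue, dialogue[1:]):
        examples.append((text, int(speaker != next_speaker)))
    return examples


# Hypothetical two-person dialogue, invented for illustration only.
dialogue = [
    ("A", "Yarın toplantı saat kaçta?"),
    ("B", "Sanırım onda."),
    ("B", "Hmm, bir dakika, takvime bakayım."),
    ("A", "Tamam, bekliyorum."),
]

labeled = label_turn_transitions(dialogue)
for text, is_shift in labeled:
    print(is_shift, text)
```

The resulting (text, label) pairs are the kind of input a binary classifier for turn prediction would consume. Overlaps and silences would need extra fields that this sketch leaves out.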

In practice

Use Qwen LLMs for synthetic dialogue generation.
Cover a broad range of topics (the dataset spans 79) to ensure diversity.
Apply BI-LSTM for high accuracy in turn prediction.
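Because the summary reports both accuracy and AUC, it may help to see how the two metrics differ. The sketch below uses only the standard library and makes several assumptions: the source does not say how the LR+RF ensemble is combined, so simple probability averaging (soft voting) is used here, and the labels and probabilities are made-up toy values, not results from the paper.

```python
from typing import List


def soft_vote(probs_a: List[float], probs_b: List[float]) -> List[float]:
    """Average two models' predicted probabilities (assumed ensemble rule)."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


def accuracy(y_true: List[int], probs: List[float], threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [int(p >= threshold) for p in probs]
    return sum(int(p == y) for p, y in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true: List[int], probs: List[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative, with ties counted as half a win."""
    pos = [p for p, y in zip(probs, y_true) if y == 1]
    neg = [p for p, y in zip(probs, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values, invented for illustration (1 = turn shift, 0 = same speaker).
y_true = [1, 0, 1, 0]
lr_probs = [0.9, 0.4, 0.45, 0.2]   # hypothetical Logistic Regression outputs
rf_probs = [0.7, 0.5, 0.65, 0.1]   # hypothetical Random Forest outputs
ensemble_probs = soft_vote(lr_probs, rf_probs)

print("LR accuracy:", accuracy(y_true, lr_probs))
print("Ensemble accuracy:", accuracy(y_true, ensemble_probs))
print("Ensemble AUC:", roc_auc(y_true, ensemble_probs))
```

Accuracy depends on the decision threshold, while AUC depends only on how well the scores rank turn shifts above non-shifts. That difference is one reason a model can lead on one metric and not the other.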

Topics

Turn-Taking Prediction
Synthetic Dataset
Turkish Dialogues
Qwen LLMs
BI-LSTM

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.