Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language
Summary
Yunkai Xu and Saeed Abdullah investigate whether multilingual mental health dialogue datasets can be created through persona-based localization. They modified the nationality and language parameters of clinically validated English personas to generate dialogues in Mandarin, Bengali, and Hindi. LLM judges, including GPT-4o-mini, DeepSeek-V3.2, LLaMA3.1-8B, Qwen3-8B, and DeepSeek-R1-8B, then evaluated the dialogues for depression severity. The findings indicate that merely changing nationality and language parameters introduces clinical inconsistency across languages. The judges were often less accurate on non-English text, and performance varied significantly by model: smaller models such as DeepSeek-R1-8B and LLaMA3.1-8B showed substantial accuracy drops and more cross-severity errors in non-English contexts. This points to systemic limitations in applying English-centric personas to multilingual settings and underscores the need for culturally responsive data generation.
Key takeaway
For AI scientists and NLP engineers building LLM-based mental health support systems for global populations, localizing personas by changing only the nationality and language in English-centric templates is insufficient. Treat multilingual persona construction as a distinct design and validation process that incorporates culturally grounded expression and rigorous output-level evaluation. Doing so helps preserve clinical consistency and reduce systemic bias, which supports more equitable and effective digital mental health solutions.
Key insights
Simple nationality and language parameter changes in English-centric personas fail to preserve clinical consistency in multilingual mental health dialogues.
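To make "minimal localization" concrete, here is a small illustrative sketch. The `Persona` fields, the `localize` helper, and the sample values are all hypothetical; the paper's actual persona schema is not described in this summary. The point is only that the clinical fields are copied unchanged while two surface parameters are swapped.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Persona:
    # Hypothetical schema, for illustration only.
    severity: str                 # intended depression severity label
    symptoms: tuple[str, ...]     # clinical content authored in English
    nationality: str
    language: str


def localize(persona: Persona, nationality: str, language: str) -> Persona:
    """Minimal localization: change only nationality and language.

    Severity and symptoms are carried over verbatim, which is exactly the
    shortcut the study finds insufficient.
    """
    return replace(persona, nationality=nationality, language=language)


# Illustrative values, not taken from the paper.
base = Persona(
    severity="moderate",
    symptoms=("low mood", "poor sleep"),
    nationality="American",
    language="English",
)
hindi_persona = localize(base, nationality="Indian", language="Hindi")
```

Nothing in this transformation checks whether a dialogue generated from `hindi_persona` still expresses the intended severity, which is why validation has to happen on the generated output.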
Principles
- Minimal persona localization introduces clinical inconsistency across languages.
- LLM judge performance varies significantly across languages and models for mental health assessment.
- Multilingual clinical personas require output-level validation, not just template extension.
Method
An LLM-based therapist agent generated dialogues from personas whose nationality and language had been modified. Independent LLM judges then performed blind pairwise severity comparisons, which were scored by overall accuracy, same-level error rate, and tie distance.
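The three metrics named above could be computed roughly as follows. This summary does not give their formal definitions, so the readings used here are assumptions: overall accuracy is the share of verdicts that match the ground-truth ordering, same-level error rate is the share of equal-severity pairs that the judge did not call a tie, and tie distance is the mean severity gap among pairs the judge called a tie. The `Comparison` type and the integer severity scale are also hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    level_a: int   # ground-truth severity of dialogue A (assumed integer scale)
    level_b: int   # ground-truth severity of dialogue B
    verdict: str   # judge output: "A", "B", or "tie" (which one is more severe)


def expected_verdict(c: Comparison) -> str:
    if c.level_a > c.level_b:
        return "A"
    if c.level_b > c.level_a:
        return "B"
    return "tie"


def overall_accuracy(cs: list[Comparison]) -> float:
    # Assumed definition: fraction of verdicts matching the true ordering.
    return sum(c.verdict == expected_verdict(c) for c in cs) / len(cs)


def same_level_error_rate(cs: list[Comparison]) -> float:
    # Assumed definition: among equal-severity pairs, fraction not judged a tie.
    same = [c for c in cs if c.level_a == c.level_b]
    return sum(c.verdict != "tie" for c in same) / len(same) if same else 0.0


def mean_tie_distance(cs: list[Comparison]) -> float:
    # Assumed definition: mean true severity gap among pairs judged a tie.
    ties = [c for c in cs if c.verdict == "tie"]
    return sum(abs(c.level_a - c.level_b) for c in ties) / len(ties) if ties else 0.0


# Toy data, not results from the paper.
sample = [
    Comparison(2, 1, "A"),    # correct
    Comparison(1, 1, "B"),    # same-level error
    Comparison(0, 3, "tie"),  # wrong, and a tie across a gap of 3
    Comparison(1, 1, "tie"),  # correct
]
```

Under these assumed definitions, a higher tie distance means the judge is collapsing clearly different severity levels into "no difference".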
In practice
- Rigorously validate multilingual synthetic data outputs for clinical consistency.
- Avoid direct translation or minimal parameter changes for culturally sensitive data.
- Employ multiple LLM judges and human review for robust cross-lingual evaluation.
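One way to put the points above into practice is to break judge accuracy down by judge and language and flag any language that falls well below the same judge's English accuracy. The sketch below assumes judge verdicts have already been scored as correct or incorrect; the record format, the function names, and the 0.10 threshold are illustrative choices, not values from the paper.

```python
from collections import defaultdict


def accuracy_by_judge_and_language(records):
    """records: iterable of (judge, language, is_correct) tuples."""
    totals = defaultdict(lambda: [0, 0])  # (judge, language) -> [correct, total]
    for judge, language, is_correct in records:
        cell = totals[(judge, language)]
        cell[0] += int(is_correct)
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


def flag_drops(accuracy, reference="English", max_drop=0.10):
    """Return (judge, language, drop) where accuracy falls more than
    max_drop below the same judge's accuracy on the reference language."""
    flagged = []
    for (judge, language), value in accuracy.items():
        ref = accuracy.get((judge, reference))
        if language != reference and ref is not None and ref - value > max_drop:
            flagged.append((judge, language, round(ref - value, 3)))
    return sorted(flagged)


# Toy records with placeholder judge names, not results from the paper.
records = [
    ("judge_a", "English", True),
    ("judge_a", "English", True),
    ("judge_a", "Hindi", True),
    ("judge_a", "Hindi", False),
    ("judge_b", "English", True),
    ("judge_b", "Hindi", True),
]
accuracy = accuracy_by_judge_and_language(records)
flagged = flag_drops(accuracy)
```

A flagged cell does not say whether the generated dialogue or the judge is at fault, so flagged languages are natural candidates for human review.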
Topics
- Multilingual LLMs
- Mental Health Datasets
- Synthetic Data Generation
- Persona-based AI
- Cross-cultural Bias
- Depression Severity Assessment
Best for: AI Scientist, NLP Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.