Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking
Summary
HEALTHDIAL is a large-scale, multilingual, multi-parallel dataset for developing and evaluating spoken dialogue systems based on retrieval-augmented generation (RAG). It contains 6,000 information-seeking dialogues, 1,500 in each of four official WHO languages: Arabic, Chinese, English, and Spanish. The dialogues are grounded in trusted content from the World Health Organization and include 163 hours of user speech recorded from native speakers across diverse dialects. Each speaker is annotated with demographic details (gender and age) and sociolinguistic variables (primary language and region of origin). Initial benchmark results on HEALTHDIAL show consistent performance disparities across the four languages, even though all of them are considered high-resource. To support further research, the creators are releasing the complete dataset, a prototype system, and a toolkit for data collection and system evaluation.
Key takeaway
NLP engineers and AI scientists building multilingual RAG-based spoken dialogue systems should consider adding HEALTHDIAL to their training and evaluation pipelines. Its multilingual, multi-parallel structure, grounded in WHO content, makes it possible to compare system behavior on the same dialogues across languages. Use the provided toolkit for data collection and evaluation, and measure performance separately for Arabic, Chinese, English, and Spanish so that the reported disparities can be identified and addressed.
Key insights
Creating a large-scale, multilingual, multi-parallel spoken dialogue dataset is methodologically challenging, and benchmarks on the resulting data reveal performance disparities across languages.
Principles
- Multilingual dataset creation is complex.
- Performance varies across languages, even high-resource ones.
- Grounding in trusted content is key.
Method
The work describes creating a large-scale, multilingual, multi-parallel spoken dialogue dataset by recording 163 hours of user speech across four languages, grounding dialogues in WHO content, and annotating speakers with demographic and sociolinguistic variables.
In practice
- Use HEALTHDIAL for RAG system training.
- Evaluate systems across diverse languages.
- Analyze sociolinguistic impact on dialogue.
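The cross-language evaluation suggested above can be sketched as a per-language average followed by a best-to-worst gap. This is a minimal illustration, not the released toolkit's evaluation code: the (language, score) pair layout, the function names, and the toy scores are all assumptions made for this example, and the score stands in for whatever metric a system is evaluated with.

```python
from collections import defaultdict
from statistics import mean


def per_language_scores(results):
    """Average a per-dialogue score by language.

    `results` is an iterable of (language, score) pairs. This layout is
    hypothetical and does not reflect the dataset's actual schema.
    """
    by_lang = defaultdict(list)
    for lang, score in results:
        by_lang[lang].append(score)
    return {lang: mean(scores) for lang, scores in by_lang.items()}


def disparity(lang_means):
    """Gap between the best and worst per-language average."""
    return max(lang_means.values()) - min(lang_means.values())


# Made-up scores for illustration only; these are not results from the paper.
toy_results = [
    ("ar", 0.5), ("ar", 0.75),
    ("zh", 0.5),
    ("en", 1.0), ("en", 0.75),
    ("es", 0.75),
]

means = per_language_scores(toy_results)
gap = disparity(means)
print(means)
print(gap)
```

Reporting the gap alongside the per-language averages keeps a strong overall mean from hiding a weak language.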
Topics
- Spoken Dialogue Systems
- Multilingual Datasets
- Retrieval-Augmented Generation
- WHO Health Information
- Language Performance Disparities
- Dataset Annotation
Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.