Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

Source: Computation and Language · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, quick

Summary

HEALTHDIAL is a large-scale, multilingual, multi-parallel dataset for developing and evaluating spoken dialogue systems based on retrieval-augmented generation (RAG). It contains 6,000 information-seeking dialogues, 1,500 in each of four official WHO languages: Arabic, Chinese, English, and Spanish. The dialogues are grounded in trusted content from the World Health Organization and include 163 hours of user speech recorded from native speakers across diverse dialects. Each speaker is annotated with demographic details (gender and age) and sociolinguistic variables (primary language and region of origin). Initial benchmark results on HEALTHDIAL show consistent performance disparities across these languages, even though all four are considered high-resource. To support further research, the creators are releasing the complete dataset, a prototype system, and a toolkit for data collection and system evaluation.

Key takeaway

NLP engineers and AI scientists building multilingual RAG-based spoken dialogue systems should consider adding HEALTHDIAL to their training and evaluation pipelines. Its multilingual, multi-parallel structure, grounded in WHO content, allows the same information-seeking dialogues to be compared across languages. Use the released toolkit for data collection and evaluation, and report results separately for Arabic, Chinese, English, and Spanish so that the performance disparities identified by the benchmarks can be measured and addressed.
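One way to act on the per-language reporting suggested above is to average an evaluation score by language and look at the gap between the best and worst language. The following is a minimal sketch, not the toolkit's API: the function names, the language codes, and the `(language, score)` input shape are all assumptions made for illustration, and the toy values are invented rather than taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def per_language_scores(results):
    """Average a per-dialogue score by language.

    `results` is an iterable of (language_code, score) pairs.
    This input shape is an assumption, not the released toolkit's format.
    """
    buckets = defaultdict(list)
    for lang, score in results:
        buckets[lang].append(score)
    return {lang: mean(scores) for lang, scores in buckets.items()}


def disparity(averages):
    """Gap between the best and worst per-language averages."""
    return max(averages.values()) - min(averages.values())


# Toy values for illustration only; these are NOT results from the paper.
toy_results = [
    ("en", 1.0), ("en", 0.5),
    ("ar", 0.5), ("ar", 0.25),
    ("zh", 0.5),
    ("es", 0.75),
]

averages = per_language_scores(toy_results)
gap = disparity(averages)
```

Reporting the per-language averages alongside the gap keeps a single aggregate number from hiding a weak language.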

Key insights

Building a large-scale, multilingual, multi-parallel spoken dialogue dataset is methodologically challenging, and benchmarks on the resulting data reveal consistent performance disparities across languages, even among high-resource ones.

Method

The work describes creating a large-scale, multilingual, multi-parallel spoken dialogue dataset by recording 163 hours of user speech across four languages, grounding dialogues in WHO content, and annotating speakers with demographic and sociolinguistic variables.
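To make the multi-parallel structure concrete, the sketch below models one dialogue record with its speaker annotations and groups the language versions of the same dialogue together. It is a hypothetical schema: the class names, field names, the `parallel_id` key, and the language codes are assumptions for illustration, and the released dataset may organize its files differently.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Assumed codes for Arabic, Chinese, English, and Spanish.
LANGUAGES = ("ar", "zh", "en", "es")


@dataclass
class Speaker:
    # The four annotation types named in the summary; value formats are assumed.
    gender: str
    age: str  # could be a number or a band; the summary does not say
    primary_language: str
    region: str


@dataclass
class Dialogue:
    parallel_id: str  # hypothetical key shared by the aligned language versions
    language: str
    speaker: Speaker
    user_turns: list = field(default_factory=list)


def group_parallel(dialogues):
    """Map each parallel_id to a {language: Dialogue} dictionary."""
    groups = defaultdict(dict)
    for dialogue in dialogues:
        groups[dialogue.parallel_id][dialogue.language] = dialogue
    return dict(groups)


def is_fully_parallel(group, languages=LANGUAGES):
    """True when a group has one version in every expected language."""
    return all(lang in group for lang in languages)
```

A check like `is_fully_parallel` is useful before cross-language comparison, since a gap measured on non-aligned dialogues mixes language effects with content effects.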

Best for: Research Scientist, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.