Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

2026-06-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Audio and Speech Processing · Depth: Expert, quick

Summary

A study evaluates zero-shot voice cloning as a low-burden data augmentation strategy for Automatic Speech Recognition (ASR) on dysarthric speech, a domain held back by data scarcity and high inter-speaker variability. The researchers used Higgs Audio V2 to clone speakers from the TORGO dataset, then fine-tuned Whisper-medium on cloned, real, and hybrid datasets. On held-out real speech, the Clone Fine-Tuning (FT) model reached a 26.00% Word Error Rate (WER), close to the 24.44% of Real FT and the 25.12% of Hybrid FT, and well below the 31.62% of the zero-shot baseline. Notably, the Clone FT and Hybrid FT models showed superior performance for moderate-severe dysarthric speakers. In cross-corpus evaluation on the SAP-1102 dataset, Clone FT gave the best result, an 11.45% relative improvement, which suggests that zero-shot cloning offers a scalable way around the expensive data collection bottleneck.
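As a quick check on the figures above, the relative WER reduction of each fine-tuned model over the zero-shot baseline can be computed from the reported numbers. This is a minimal sketch: the helper name `relative_wer_reduction` is illustrative and not from the study, and it covers only the held-out real-speech results. The 11.45% cross-corpus figure on SAP-1102 is measured against a reference point the summary does not report, so it is not reproduced here.

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Relative WER reduction in percent; positive means the model improved."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - model_wer) / baseline_wer


# WERs (in %) as reported in the summary for held-out real speech.
ZERO_SHOT_WER = 31.62
REPORTED_WER = {"Real FT": 24.44, "Hybrid FT": 25.12, "Clone FT": 26.00}

if __name__ == "__main__":
    for name, wer in REPORTED_WER.items():
        gain = relative_wer_reduction(ZERO_SHOT_WER, wer)
        print(f"{name}: {wer:.2f}% WER, {gain:.2f}% relative reduction")
```

On these numbers, Clone FT gives a relative reduction of roughly 18% over the zero-shot baseline, against roughly 23% for Real FT, so training on cloned speech alone recovers most of the gain obtained from real recordings.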

Key takeaway

If you build ASR systems for dysarthric speech, consider adding zero-shot voice cloning to your data augmentation pipeline. In the study, which paired Higgs Audio V2 with Whisper-medium, fine-tuning on cloned speech alone reached a Word Error Rate close to that of fine-tuning on real recordings, and it did so with a much lower data collection burden. The cloned and hybrid models were strongest for moderate-severe speakers. Hybrid datasets are worth testing as well: Hybrid FT reached 25.12% WER, between Real FT and Clone FT. The approach could let your team build more robust, inclusive speech recognition without extensive manual data acquisition.

Key insights

Zero-shot voice cloning effectively augments dysarthric speech data, improving ASR performance and bypassing costly data collection bottlenecks.

Principles

Zero-shot cloning scales data augmentation.
Cloned and hybrid data help most for moderate-severe dysarthria.
Data scarcity is a primary bottleneck for dysarthric ASR.

Method

Clone TORGO speakers zero-shot with Higgs Audio V2 to generate synthetic dysarthric speech. Fine-tune Whisper-medium on cloned, real, or hybrid data, then evaluate on held-out real speech.
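The evaluation step rests on Word Error Rate, which is conventionally the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below implements that standard definition in plain Python. The text normalization (lowercasing and whitespace splitting) is an assumption, since the summary does not describe how the study normalized transcripts, and the function names are illustrative.

```python
from typing import List, Sequence


def _edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumed normalization: lowercase and split on whitespace.
        ref_words = ref.lower().split()
        hyp_words = hyp.lower().split()
        total_edits += _edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

Pooling edits over the whole test set, as here, weights long utterances more heavily than averaging per-utterance WER would. Which of the two the study reports is not stated in the summary.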

In practice

Use Higgs Audio V2 for voice cloning.
Fine-tune Whisper-medium with augmented data.
Evaluate on TORGO and SAP-1102 datasets.
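The three training conditions can be laid out as a small data-preparation step. The sketch below is hypothetical throughout: the `Utterance` record, its fields, and `build_conditions` are illustrative names, and it assumes that "hybrid" means the real training set concatenated with the cloned set, which the summary does not spell out. It does not call Higgs Audio V2 or Whisper; it only shows how the sets relate and guards against held-out real speech leaking into training.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Utterance:
    """Hypothetical manifest record for one audio clip."""
    utt_id: str
    speaker: str
    text: str
    source: str  # "real" or "clone"


def build_conditions(real_train: Sequence[Utterance],
                     cloned: Sequence[Utterance],
                     held_out: Sequence[Utterance]) -> Dict[str, List[Utterance]]:
    """Return the Real, Clone, and Hybrid training sets.

    Raises ValueError if the held-out set is not purely real speech or if
    any held-out utterance also appears in a training set.
    """
    if any(u.source != "real" for u in held_out):
        raise ValueError("held-out evaluation set must contain only real speech")
    held_out_ids = {u.utt_id for u in held_out}
    leaked = [u.utt_id for u in list(real_train) + list(cloned)
              if u.utt_id in held_out_ids]
    if leaked:
        raise ValueError(f"held-out utterances found in training data: {leaked}")
    return {
        "real": list(real_train),
        "clone": list(cloned),
        # Assumption: hybrid is a plain concatenation of real and cloned data.
        "hybrid": list(real_train) + list(cloned),
    }
```

Each returned set would then be used to fine-tune a separate Whisper-medium model, with all three models scored on the same held-out real speech so that their WERs are directly comparable.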

Topics

Dysarthric Speech
Automatic Speech Recognition
Zero-Shot Voice Cloning
Data Augmentation
Whisper-medium
Higgs Audio V2

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.