Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation
Summary
Dysarthric speech recognition aims to ease communication for people with dysarthria, but the task is complicated by wide variation in severity and by scarce training data. This study fine-tuned pre-trained End-to-End Wav2Vec2 models separately for each severity level, using four data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP). Wav2Vec2 models fine-tuned individually for each severity class, without augmentation, served as baselines. The effectiveness of each technique differed across severity levels. The best Word Error Rates (WERs) were achieved with SRM ($s$=0.8) for low (9.02%) and medium (38.11%) severities, and with PM ($τ$=0.8) for high severity (55.15%), corresponding to relative improvements over the baselines of 30.02%, 16.64%, and 15.47%, respectively. These findings support the augmentation methods' effectiveness in improving dysarthric ASR performance.
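The relative improvements quoted above are consistent with the usual definition, (baseline WER - new WER) / baseline WER. The sketch below assumes that definition; the summary does not state it, nor does it give the baseline WERs. It back-computes the baseline that each reported figure would imply. The function names are illustrative, and the implied baselines are derived values, not numbers reported by the study.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, in percent, with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer: float, rel_improvement_pct: float) -> float:
    """Invert the formula above: the baseline WER implied by a reported
    WER and its relative improvement (both in percent)."""
    return new_wer / (1.0 - rel_improvement_pct / 100.0)


# Figures reported in the summary: (best WER %, relative improvement %).
reported = {
    "low": (9.02, 30.02),
    "medium": (38.11, 16.64),
    "high": (55.15, 15.47),
}

# Assumption: improvement is measured against the per-severity baseline.
implied = {sev: implied_baseline(wer, rel) for sev, (wer, rel) in reported.items()}
for severity, baseline in implied.items():
    print(f"{severity}: implied baseline WER ~ {baseline:.2f}%")
```

Under this assumption the implied baselines come out at roughly 12.9%, 45.7%, and 65.2% for low, medium, and high severity. The absolute gain is therefore largest for high severity even though its relative gain is the smallest.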
Key takeaway
Machine learning engineers developing Automatic Speech Recognition systems for dysarthric speech should implement severity-specific data augmentation strategies. Tailoring techniques such as Speaking-Rate Modification ($s$=0.8) for low and medium severities, and Pitch Modification ($τ$=0.8) for high severity, can substantially reduce Word Error Rates. This targeted approach can improve the accuracy and accessibility of communication technologies for individuals with dysarthria.
Key insights
Severity-specific data augmentation reduced Word Error Rates for fine-tuned Wav2Vec2 models at every dysarthria severity level, with relative improvements of roughly 15% to 30% over the per-severity baselines.
Principles
- Data augmentation efficacy varies by dysarthria severity.
- Severity-specific fine-tuning enhances ASR performance.
Method
Fine-tuning pre-trained Wav2Vec2 models for dysarthric ASR using Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP) tailored to severity.
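As a rough illustration of the data side of this method, the sketch below groups utterances by severity label and adds one augmented copy of each, so that every severity class gets its own enlarged training set. It assumes that originals and augmented copies are pooled, which the summary does not confirm. `build_severity_sets`, the `Waveform` placeholder type, and the toy amplitude-scaling "augmentation" are all hypothetical stand-ins rather than the authors' implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Waveform = List[float]            # placeholder type for raw audio samples
Utterance = Tuple[Waveform, str]  # (audio, transcript)


def build_severity_sets(
    data: Sequence[Tuple[str, Waveform, str]],
    augment: Callable[[Waveform], Waveform],
) -> Dict[str, List[Utterance]]:
    """Group utterances by severity label and add one augmented copy of each.

    Assumption: originals and augmented copies are pooled; the transcript
    is unchanged because the augmentations alter acoustics, not words.
    """
    sets: Dict[str, List[Utterance]] = defaultdict(list)
    for severity, audio, text in data:
        sets[severity].append((audio, text))           # original
        sets[severity].append((augment(audio), text))  # augmented copy
    return dict(sets)


# Toy usage with a stand-in "augmentation" that only scales amplitude.
# A real pipeline would plug in SRM, PM, FM, or VTLP here.
toy = [("low", [0.1, 0.2], "hello"), ("high", [0.3], "yes")]
sets = build_severity_sets(toy, augment=lambda x: [0.5 * v for v in x])
```

Each resulting set would then be used to fine-tune a separate pre-trained Wav2Vec2 model, mirroring the per-severity setup described above.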
In practice
- Apply SRM ($s$=0.8) for low and medium severity.
- Use PM ($τ$=0.8) for high severity.
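The two recommendations above can be captured as a small lookup table. This is a minimal sketch: the `Augmentation` tuple and `pick_augmentation` helper are hypothetical names, and the code only records which setting was reported best. It does not implement the signal processing, since the summary does not specify how $s$ and $τ$ are applied to the waveform.

```python
from typing import NamedTuple


class Augmentation(NamedTuple):
    method: str   # "SRM" or "PM"
    symbol: str   # parameter name used in the summary
    value: float  # reported best-performing value


# Best-performing settings reported in the summary, keyed by severity.
BEST_AUGMENTATION = {
    "low": Augmentation("SRM", "s", 0.8),
    "medium": Augmentation("SRM", "s", 0.8),
    "high": Augmentation("PM", "tau", 0.8),
}


def pick_augmentation(severity: str) -> Augmentation:
    """Return the reported best augmentation for a severity class."""
    key = severity.strip().lower()
    if key not in BEST_AUGMENTATION:
        raise ValueError(f"unknown severity: {severity!r}")
    return BEST_AUGMENTATION[key]
```

Keeping the mapping in one place makes it easy to revisit if a different dataset or severity labelling favours other settings.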
Topics
- Dysarthric Speech Recognition
- Wav2Vec2 Fine-tuning
- Data Augmentation
- Speech Modification Techniques
- Automatic Speech Recognition
- Word Error Rate
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.