Towards Personalized Federated Learning for Dysarthric Speech Recognition
Summary
This paper introduces two novel personalized federated learning (FL) aggregation strategies for dysarthric speech recognition, addressing data scarcity and speaker heterogeneity while preserving privacy. Both methods, parameter-based averaging and embedding-based averaging, divide the model into speaker-independent (SI) and speaker-dependent (SD) components. The SI part is aggregated with standard FedAvg, while the SD part uses similarity-based averaging guided by either the model parameters or the output embeddings of the SI component. In experiments on the UASpeech and TORGO dysarthric speech corpora, the methods achieved statistically significant Word Error Rate (WER) reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO, compared to a regularized FedAvg baseline. The experiments used a HuBERT model with 24 Transformer blocks, trained over 100 communication rounds on 16 UASpeech and 8 TORGO dysarthric speakers using 2 NVIDIA A40 GPUs.
Key takeaway
Machine learning engineers developing ASR systems for dysarthric speakers should consider personalized federated learning to handle data heterogeneity and privacy concerns. Similarity-aware aggregation of the speaker-dependent model components, using either parameter-based or embedding-based averaging, can yield significant WER reductions. The approach offers performance comparable to centralized training while preserving data privacy, and it especially benefits users with very low speech intelligibility. Tune the trade-off weight β for each dataset, since the best value may differ between corpora.
Key insights
Personalized federated learning significantly improves dysarthric speech recognition by addressing speaker heterogeneity while maintaining privacy.
Principles
- Separate speaker-independent and speaker-dependent model components.
- Use similarity-based aggregation for speaker-dependent parts.
- Amplify privacy by subsampling the private data.
Method
The approach splits the ASR model into SI and SD components. SI parts use FedAvg, while SD parts use parameter-based or embedding-based similarity averaging, guided by a trade-off weight β.
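The split aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not give the exact formulas, so the cosine similarity measure, the softmax-style weighting, and the way β blends a client's own SD parameters with its peers' are all assumptions, and the function names (`fedavg`, `personalized_sd`) are hypothetical. Parameters are represented as flat lists of floats for simplicity.

```python
import math


def fedavg(client_params, client_sizes):
    """FedAvg: average parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def personalized_sd(sd_params, beta):
    """Parameter-based similarity averaging for the SD component.

    ASSUMPTION: each client keeps a fraction `beta` of its own SD parameters
    and takes the remaining (1 - beta) from a similarity-weighted average of
    the other clients. The paper's actual use of beta may differ.
    """
    personalized = []
    for k, own in enumerate(sd_params):
        peers = [p for j, p in enumerate(sd_params) if j != k]
        if not peers:
            personalized.append(list(own))
            continue
        # ASSUMPTION: softmax over cosine similarities gives the peer weights.
        scores = [math.exp(cosine(own, p)) for p in peers]
        z = sum(scores)
        weights = [s / z for s in scores]
        peer_avg = [
            sum(w * p[i] for w, p in zip(weights, peers))
            for i in range(len(own))
        ]
        personalized.append(
            [beta * o + (1.0 - beta) * a for o, a in zip(own, peer_avg)]
        )
    return personalized
```

In this sketch the server would call `fedavg` once on the SI parameters and `personalized_sd` once on the SD parameters per communication round, returning a shared SI part and a per-speaker SD part to each client.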
In practice
- Apply parameter-based averaging for SD model components.
- Implement embedding-based averaging for SD model components.
- Subsample 20% of private data for embedding calculation.
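For the embedding-based variant, the steps above might look like the sketch below. It assumes each utterance has already been passed through the SI component to produce a fixed-size embedding (represented here as a plain list of floats), that a client summarizes its 20% subsample by the mean embedding, and that the server compares clients with cosine similarity. These choices and the helper names (`subsample`, `mean_embedding`, `similarity_matrix`) are illustrative assumptions rather than details confirmed by the paper.

```python
import math
import random


def subsample(items, fraction=0.2, seed=0):
    """Draw a random subset of a client's private items (at least one)."""
    rng = random.Random(seed)
    k = max(1, round(len(items) * fraction))
    return rng.sample(items, k)


def mean_embedding(embeddings):
    """ASSUMPTION: a client is summarized by the mean of its SI embeddings."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_matrix(client_means):
    """Pairwise client similarity, usable as weights for SD averaging."""
    return [[cosine(u, v) for v in client_means] for u in client_means]


# Hypothetical usage: two clients with toy 2-dimensional embeddings.
client_a = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.8, 0.0], [1.0, 0.1]]
client_b = [[0.0, 1.0], [0.1, 0.9], [0.2, 1.0], [0.0, 0.8], [0.1, 1.0]]
means = [mean_embedding(subsample(c)) for c in (client_a, client_b)]
sims = similarity_matrix(means)
```

Only the per-client summary vectors leave the device in this sketch; the raw audio and per-utterance embeddings stay local.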
Topics
- Federated Learning
- Dysarthric Speech Recognition
- Personalized ASR
- Model Aggregation Strategies
- HuBERT Model
- Word Error Rate
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.