Towards Personalized Federated Learning for Dysarthric Speech Recognition

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Speech Recognition & Processing · Depth: Expert, long

Summary

This paper introduces two personalized federated learning (FL) aggregation strategies for dysarthric speech recognition that address data scarcity and speaker heterogeneity while preserving privacy. Both strategies, parameter-based averaging and embedding-based averaging, split the model into speaker-independent (SI) and speaker-dependent (SD) components. The SI component is aggregated with standard FedAvg; the SD component is aggregated with similarity-based averaging, where similarity is measured either on model parameters or on output embeddings from the SI component. On the UASpeech and TORGO dysarthric speech corpora, the methods achieve statistically significant Word Error Rate (WER) reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO over a regularized FedAvg baseline. The experiments use a HuBERT model with 24 Transformer blocks, trained for 100 communication rounds on 16 UASpeech and 8 TORGO dysarthric speakers with 2 Nvidia A40 GPUs.
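The absolute and relative WER figures above are linked by simple arithmetic: relative reduction = absolute reduction / baseline WER. The short sketch below back-computes the baseline WER implied by each pair. The function name is hypothetical, and the resulting baselines are not stated in this summary; they are approximations derived from rounded published figures.

```python
def implied_baseline_wer(absolute_drop: float, relative_drop_pct: float) -> float:
    """Baseline WER (in %) implied by an absolute drop (in % points)
    and the matching relative drop (in %).

    Assumption: relative = absolute / baseline, the usual definition.
    """
    return absolute_drop / (relative_drop_pct / 100.0)


# Figures quoted in the summary; outputs are approximate because inputs are rounded.
uaspeech_baseline = implied_baseline_wer(0.99, 3.15)
torgo_baseline = implied_baseline_wer(0.56, 4.73)

print(f"UASpeech implied baseline WER: {uaspeech_baseline:.2f}%")
print(f"TORGO implied baseline WER: {torgo_baseline:.2f}%")
```

Read this way, the larger relative gain on TORGO comes from a smaller absolute drop applied to a much lower baseline error rate.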

Key takeaway

Machine learning engineers building ASR systems for dysarthric speakers should consider personalized federated learning to handle data heterogeneity and privacy concerns. Similarity-aware aggregation of the speaker-dependent model components, whether parameter-based or embedding-based, can yield statistically significant WER reductions over a regularized FedAvg baseline. The approach offers performance comparable to centralized training while preserving data privacy, and it especially benefits users with very low speech intelligibility. Tune the trade-off weight β for your specific dataset.

Key insights

Personalized federated learning significantly improves dysarthric speech recognition by addressing speaker heterogeneity while maintaining privacy.

Method

The approach splits the ASR model into SI and SD components. SI parts use FedAvg, while SD parts use parameter-based or embedding-based similarity averaging, guided by a trade-off weight β.
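The split aggregation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: parameters are flattened into lists of floats, similarity is cosine similarity turned into weights with a softmax, and β mixes a client's own SD parameters with the similarity-weighted average of its peers. The summary does not specify the similarity function or exactly how β enters the update, so all three choices, and every function name, are hypothetical.

```python
import math


def fedavg(vectors, weights):
    """Weighted average of equal-length parameter vectors (FedAvg-style)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def personalized_sd(sd_params, keys, beta):
    """Per-client SD aggregation (assumed form).

    keys[i] is the vector used to compare client i with others:
    pass sd_params itself for a parameter-based variant, or per-client
    SI output embeddings for an embedding-based variant.
    beta = 1 keeps each client's own SD parameters; beta = 0 uses only
    the similarity-weighted average of the other clients.
    """
    n = len(sd_params)
    out = []
    for i in range(n):
        peers = [j for j in range(n) if j != i]
        if not peers:  # single client: nothing to aggregate with
            out.append(list(sd_params[i]))
            continue
        w = softmax([cosine(keys[i], keys[j]) for j in peers])
        peer_avg = fedavg([sd_params[j] for j in peers], w)
        out.append([beta * own + (1.0 - beta) * p
                    for own, p in zip(sd_params[i], peer_avg)])
    return out


# Toy example: SI parts are averaged globally, weighted by client data size.
si_global = fedavg([[1.0, 2.0], [3.0, 4.0]], weights=[1, 3])

# Three clients; clients 0 and 1 have similar SD parameters, client 2 differs.
sd = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sd_personal = personalized_sd(sd, keys=sd, beta=0.0)
```

In the toy example, client 0's aggregated SD vector leans toward client 1 rather than client 2, which is the intended effect of similarity-aware averaging: speakers with similar characteristics share more with each other.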

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.