Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure
Summary
Availability-Weighted Probabilistic Synchronous Parallel (AW-PSP) is a synchronization technique for federated learning (FL) that targets the robustness and fairness limitations of Probabilistic Synchronous Parallel (PSP). PSP samples a subset of nodes per round and assumes static, independent device behavior. When device availability and data distribution are correlated, that assumption leads to unfair participation and under-representation of certain data classes. AW-PSP instead adjusts node sampling probabilities dynamically, using real-time availability predictions, historical behavior, and failure correlation metrics. It incorporates a Markov-based predictor to distinguish transient from chronic failures and uses a Distributed Hash Table (DHT) for decentralized sharing of metadata, including latency and utility scores. In trace-driven evaluations, AW-PSP improves resilience to both independent and correlated failures, increases label coverage, and reduces fairness variance compared to standard PSP, which makes it suitable for large-scale, heterogeneous, failure-prone FL environments.
Key takeaway
Research scientists developing federated learning systems should consider dynamic, availability-aware node selection mechanisms such as AW-PSP. Static or uniformly random sampling in FL deployments can cause significant accuracy drops and fairness issues, especially where device failures are correlated and data is heterogeneous. Integrating real-time availability predictions and failure-correlation penalties into the client selection strategy improves model robustness, increases label coverage, and gives more equitable participation across diverse client populations.
Key insights
AW-PSP improves federated learning robustness by dynamically weighting node selection based on availability and failure correlation.
Principles
- Device availability and data distribution can be correlated.
- Correlated failures introduce abrupt system instability.
- Dynamic availability modeling enhances FL fairness and resilience.
Method
AW-PSP predicts device availability in real time with Markov chains, combines historical and runtime co-failure correlations, and shares metadata through a decentralized DHT. It then adjusts node sampling probabilities based on these factors.
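The summary does not give the exact weighting formula, so the following is only a minimal sketch of how availability and co-failure correlation might be combined into sampling probabilities. The multiplicative form, the `penalty` parameter, and the function names `sampling_weights` and `sample_nodes` are assumptions for illustration, not the paper's definitions.

```python
import random


def sampling_weights(availability, co_failure, penalty=0.5):
    """Return normalized per-node sampling probabilities.

    availability: dict node -> predicted probability of being online, in [0, 1]
    co_failure:   dict node -> co-failure correlation score, in [0, 1]
    penalty:      hypothetical knob for how strongly correlation is penalized
    """
    # Assumed form: scale availability down as co-failure correlation grows.
    raw = {
        node: availability[node] * (1.0 - penalty * co_failure.get(node, 0.0))
        for node in availability
    }
    total = sum(raw.values())
    if total <= 0:
        # Degenerate case: fall back to uniform sampling.
        return {node: 1.0 / len(raw) for node in raw}
    return {node: weight / total for node, weight in raw.items()}


def sample_nodes(probs, k, rng=random):
    """Draw up to k distinct nodes, without replacement, proportional to probs."""
    remaining = dict(probs)
    chosen = []
    for _ in range(min(k, len(remaining))):
        nodes = list(remaining)
        weights = [remaining[n] for n in nodes]
        if sum(weights) <= 0:
            break  # only zero-weight nodes are left
        pick = rng.choices(nodes, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


if __name__ == "__main__":
    probs = sampling_weights(
        availability={"a": 0.9, "b": 0.9, "c": 0.1},
        co_failure={"a": 0.0, "b": 1.0},
    )
    print(probs)
    print(sample_nodes(probs, k=2, rng=random.Random(0)))
```

In this sketch, nodes "a" and "b" have the same predicted availability, but "b" receives a lower probability because its failures are assumed to be strongly correlated with those of other nodes. Whether availability should raise or lower a node's weight is itself a design choice that the summary leaves open.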
In practice
- Use Markov chains to predict device availability.
- Implement a DHT for decentralized metadata sharing.
- Adjust sampling based on node availability and failure correlation.
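As one way to picture the first item, here is a minimal sketch of a two-state (up/down) first-order Markov chain fitted to a device's observed trace. The class name `MarkovAvailability`, the Laplace smoothing, and the use of expected downtime as a transient-versus-chronic signal are assumptions for illustration; the summary does not describe the paper's predictor in this detail.

```python
class MarkovAvailability:
    """Two-state first-order Markov chain estimated from an up/down trace."""

    def __init__(self, smoothing=1.0):
        # counts[previous_state][next_state]; True means the device was up.
        self.smoothing = smoothing
        self.counts = {True: {True: 0, False: 0}, False: {True: 0, False: 0}}
        self.last = None

    def observe(self, is_up):
        """Record one round's observation and update transition counts."""
        is_up = bool(is_up)
        if self.last is not None:
            self.counts[self.last][is_up] += 1
        self.last = is_up

    def _p_to_up(self, state):
        # Laplace-smoothed estimate of P(next = up | current = state).
        row = self.counts[state]
        s = self.smoothing
        return (row[True] + s) / (row[True] + row[False] + 2 * s)

    def p_up_next(self):
        """Predicted probability that the device is up in the next round."""
        if self.last is None:
            return 0.5  # no observations yet
        return self._p_to_up(self.last)

    def expected_downtime(self):
        """Mean length of an outage in rounds, i.e. 1 / P(down -> up)."""
        return 1.0 / self._p_to_up(False)


if __name__ == "__main__":
    model = MarkovAvailability()
    for state in [True, True, False, True, True, True]:
        model.observe(state)
    print(model.p_up_next())
    print(model.expected_downtime())
```

A short expected downtime suggests transient failures, while a long one suggests a chronically unavailable device. Any threshold separating the two would need to be tuned per deployment.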
Topics
- Federated Learning
- Probabilistic Synchronous Parallel
- Availability-Weighted PSP
- Correlated Device Failure
- Node Availability Modeling
Best for: Research Scientist, Machine Learning Engineer, AI Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.