Robust Synchronisation for Federated Learning in The Face of Correlated Device Failure

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

Probabilistic Synchronous Parallel (PSP) is a distributed learning technique used in Federated Learning (FL) to ease synchronization bottlenecks: each round synchronizes only a sampled subset of nodes. PSP improves throughput when edge devices are unreliable, but it assumes that device behavior is static and independent. When that assumption fails, synchronization can become unfair, with highly available nodes dominating training. If device availability correlates with data distribution, some data classes end up under-represented. Availability-Weighted PSP (AW-PSP) extends PSP by adjusting node sampling probabilities dynamically, using real-time availability predictions, historical behavior, and failure correlation metrics. A Markov-based predictor distinguishes transient failures from chronic ones, and a Distributed Hash Table (DHT) layer decentralizes the storage of metadata such as latency, freshness, and utility scores. In trace-driven evaluation, AW-PSP is more robust than standard PSP to both independent and correlated failures, covers more labels, and shows lower fairness variance.
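To make the idea of availability-weighted sampling concrete, here is a minimal Python sketch. The summary does not give AW-PSP's actual weighting formula, so the inverse-availability weight, the function name `sample_round`, and its parameters are all assumptions chosen for illustration only.

```python
import random


def sample_round(online, availability, k, rng):
    """Pick up to k distinct online nodes, favouring rarely available ones.

    ASSUMPTION: the weight 1 / availability is a hypothetical choice for
    illustration; it is not the weighting defined by AW-PSP.
    """
    candidates = list(online)
    # Clamp availability so a node that has never been seen online
    # does not cause a division by zero.
    weights = [1.0 / max(availability[n], 1e-6) for n in candidates]
    chosen = []
    for _ in range(min(k, len(candidates))):
        # Weighted draw without replacement: pick one index, then remove it.
        i = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(i))
        weights.pop(i)
    return chosen
```

Under this assumed weighting, a node that is online only 10% of the time is far more likely to be picked when it does appear than a node that is online 90% of the time, which counteracts the dominance of highly available nodes described above.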

Key takeaway

For research scientists building federated learning systems, AW-PSP addresses the unfair sampling caused by correlated device failures. Consider integrating an availability-aware sampling protocol such as AW-PSP to give diverse devices more equitable participation and to improve model fairness and data representation, especially in heterogeneous, failure-prone environments.

Key insights

AW-PSP enhances federated learning by dynamically adjusting node sampling based on availability and failure correlation.

Method

AW-PSP uses a Markov-based predictor to distinguish transient from chronic failures and a DHT to hold node metadata. It adjusts node sampling probabilities dynamically from availability predictions, historical behavior, and failure correlations.
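One way to picture a Markov-based availability predictor is a two-state (up/down) chain fitted to each device's history. The sketch below is an assumption-laden illustration, not the article's model: the two-state structure, the Laplace smoothing, the function names, and the `chronic_below` threshold are all hypothetical.

```python
def fit_two_state_chain(history):
    """Estimate P(up->down) and P(down->up) from a 0/1 trace (1 = up)."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(history, history[1:]):
        counts[(prev, cur)] += 1
    # Add-one (Laplace) smoothing keeps estimates away from 0 and 1
    # when the trace is short.
    p_fail = (counts[(1, 0)] + 1) / (counts[(1, 0)] + counts[(1, 1)] + 2)
    p_recover = (counts[(0, 1)] + 1) / (counts[(0, 1)] + counts[(0, 0)] + 2)
    return p_fail, p_recover


def predict_up(last_state, p_fail, p_recover):
    """Probability that the device is up in the next round."""
    return 1.0 - p_fail if last_state == 1 else p_recover


def failure_kind(p_recover, chronic_below=0.2):
    """Label a device's outages; the 0.2 threshold is a hypothetical value."""
    return "chronic" if p_recover < chronic_below else "transient"
```

In this sketch, a device that drops out and returns quickly has a high recovery probability and is labelled transient, while a device that stays down for many rounds is labelled chronic. A sampler could then use `predict_up` as one input to its per-node sampling weight.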

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.