Decoupled DiLoCo: A new frontier for resilient, distributed AI training
Summary
Google DeepMind and Google Research have introduced Decoupled DiLoCo (Distributed Low-Communication), a distributed architecture designed to make large language model (LLM) training more resilient and flexible across geographically dispersed data centers. The system splits a large training run into asynchronous "islands" of compute, called learner units, so that a local disruption stays contained while the rest of the system keeps learning. Building on Pathways and DiLoCo, the architecture sharply reduces bandwidth requirements: it trained a 12 billion parameter model across four U.S. regions on 2-5 Gbps of wide-area bandwidth, and it maintains high "goodput" (useful training progress) even under substantial hardware failures. Decoupled DiLoCo also supports mixing hardware generations, such as TPU v6e and TPU v5p, within a single training run, which extends the useful life of older hardware and increases available compute without compromising ML performance.
Key takeaway
For MLOps engineers managing large-scale LLM training, Decoupled DiLoCo offers a path to better system resilience and efficiency. Teams can consider training models across globally distributed, potentially heterogeneous hardware without prohibitive communication delays or the risk that a localized failure interrupts the whole run. The approach also extends the useful life of existing hardware and expands available compute capacity.
Key insights
Decoupled DiLoCo enables resilient, low-bandwidth, and asynchronous LLM training across distributed, heterogeneous hardware.
Principles
- Asynchronous data flow enhances fault tolerance.
- Decoupling compute isolates failures.
- Reduced bandwidth enables global-scale training.
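To put the bandwidth principle in perspective, the back-of-envelope sketch below estimates how long one full copy of a 12 billion parameter model would take to cross a 2-5 Gbps link. The 2 bytes per parameter (bf16) and the assumption that a full parameter copy is exchanged are illustrative guesses, not figures from the article, and `sync_seconds` is a hypothetical helper.

```python
def sync_seconds(num_params: float, bytes_per_param: int, link_gbps: float) -> float:
    """Seconds to move one full parameter copy over a link of `link_gbps` Gbit/s."""
    bits = num_params * bytes_per_param * 8   # payload size in bits
    return bits / (link_gbps * 1e9)           # link speed in bits per second


# Assumption: 12e9 parameters stored as bf16 (2 bytes each).
slow = sync_seconds(12e9, 2, 2.0)   # 96.0 seconds at 2 Gbps
fast = sync_seconds(12e9, 2, 5.0)   # about 38.4 seconds at 5 Gbps
print(f"{slow:.1f} s at 2 Gbps, {fast:.1f} s at 5 Gbps")
```

Under these assumptions a single exchange takes tens of seconds, which suggests why a scheme that synchronizes rarely, and does not make learners wait on the exchange, suits wide-area links better than per-step synchronization.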
Method
Decoupled DiLoCo passes data asynchronously between "islands" of compute (learner units). This isolates failures to the island where they occur and folds the required communication into longer stretches of computation, so no part of the system blocks while waiting for another.
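The failure-isolation idea can be illustrated with a toy sketch. It is not the Decoupled DiLoCo algorithm: it uses a one-parameter quadratic loss, plain SGD, simple averaging of learner updates, and lockstep rounds in place of true asynchrony. All names (`local_steps`, `train`, `fail_prob`) are hypothetical. What it shows is that each learner runs many local steps between exchanges, and a learner that is down in a given round is skipped while the others still move the shared parameters forward.

```python
import random


def local_steps(theta: float, target: float, steps: int, lr: float) -> float:
    """Run `steps` SGD updates on the toy loss 0.5 * (theta - target)**2."""
    for _ in range(steps):
        grad = theta - target
        theta -= lr * grad
    return theta


def train(num_learners: int = 4, rounds: int = 20, inner_steps: int = 10,
          fail_prob: float = 0.25, seed: int = 0) -> float:
    rng = random.Random(seed)
    target = 3.0          # optimum of the toy loss
    global_theta = 0.0    # shared parameters
    for _ in range(rounds):
        deltas = []
        for _ in range(num_learners):
            if rng.random() < fail_prob:
                continue  # this learner is down this round; the others proceed
            local = local_steps(global_theta, target, inner_steps, lr=0.1)
            deltas.append(local - global_theta)  # change produced by local work
        if deltas:
            # Apply the average update from whichever learners reported.
            global_theta += sum(deltas) / len(deltas)
    return global_theta


print(train())  # approaches 3.0 even though learners randomly drop out
```

Because communication happens once per `inner_steps` local updates and never waits on a missing learner, a failure costs only that learner's contribution for the round, not the progress of the whole run.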
In practice
- Train LLMs across distant data centers.
- Utilize mixed-generation hardware for training.
- Maintain training progress despite chip failures.
Topics
- Decoupled DiLoCo
- Distributed AI Training
- Hardware Resiliency
- Asynchronous Training
- Low-Bandwidth Communication
Best for: MLOps Engineer, Research Scientist, CTO, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Google DeepMind News.