Decoupled DiLoCo: A new frontier for resilient, distributed AI training

Source: Google DeepMind News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

Google DeepMind and Google Research have introduced Decoupled DiLoCo (Distributed Low-Communication), a distributed architecture designed to make large language model (LLM) training more resilient and flexible across geographically dispersed data centers. The system divides a large training run into asynchronous "islands" of compute, called learner units, so that a local disruption stays isolated and the rest of the system keeps learning efficiently. Building on Pathways and DiLoCo, the architecture significantly reduces bandwidth requirements, operating at 2-5 Gbps for a 12-billion-parameter model trained across four U.S. regions, and it maintains high "goodput" (useful training throughput) even under substantial hardware failures. Decoupled DiLoCo also supports mixing hardware generations, such as TPU v6e and TPU v5p, within a single training run, which extends the useful life of older hardware and increases available compute without compromising ML performance.
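To put the 2-5 Gbps figure in perspective, the short calculation below estimates how long one full copy of a 12-billion-parameter model would take to cross such a link. It is a back-of-the-envelope sketch only: the 2 bytes per parameter (bf16), the absence of compression, and the idea of sending a complete copy at once are assumptions made for illustration, not details reported in the source, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(num_params: int, bytes_per_param: int, gbps: float) -> float:
    """Seconds needed to send one uncompressed copy of the parameters over a link."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / (gbps * 1e9)


NUM_PARAMS = 12_000_000_000  # model size reported in the summary
BYTES_PER_PARAM = 2          # assumption: bf16 weights, no compression

for link_gbps in (2.0, 5.0):  # bandwidth range reported in the summary
    seconds = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, link_gbps)
    print(f"{link_gbps:.0f} Gbps: {seconds:.1f} s per full copy")
```

Under these assumptions a single copy takes roughly 96 seconds at 2 Gbps and roughly 38 seconds at 5 Gbps, which suggests why exchanges at this bandwidth have to be infrequent and overlapped with computation rather than performed at every training step.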

Key takeaway

For MLOps engineers managing large-scale LLM training, Decoupled DiLoCo offers a path to significantly better system resilience and efficiency. Your teams can now consider training models across geographically distributed, potentially heterogeneous hardware without incurring prohibitive communication delays or risking widespread interruptions from localized failures. This approach extends the useful life of existing hardware and expands your available compute capacity.

Key insights

Decoupled DiLoCo enables resilient, low-bandwidth, and asynchronous LLM training across distributed, heterogeneous hardware.

Principles

Method

Decoupled DiLoCo uses asynchronous data flow between "islands" of compute (learner units) to isolate failures. Communication is folded into longer periods of computation, so no learner unit sits blocked waiting on another and communication does not become a bottleneck.
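The toy simulation below illustrates the idea described above; it is not DeepMind's implementation. Each learner unit runs many local steps on its own snapshot of the shared parameters, then merges only the resulting parameter change and immediately starts again, without waiting for the others. Everything specific here is an assumption for illustration: the quadratic objective, the `simulate` and `local_steps` names, the tick-based clock, the per-learner durations (standing in for mixed hardware speeds), and the failure probability.

```python
import random

TARGET = [1.0, -2.0, 3.0]  # toy optimum standing in for "the trained model"


def local_steps(params, steps, lr=0.1):
    """Run `steps` of gradient descent on 0.5 * (p - target)^2 per coordinate."""
    p = list(params)
    for _ in range(steps):
        p = [x - lr * (x - t) for x, t in zip(p, TARGET)]
    return p


def simulate(ticks=300, durations=(3, 4, 5, 7), inner_steps=10,
             outer_lr=0.2, fail_prob=0.05, seed=0):
    """Simulate learner units that merge updates asynchronously.

    `durations` gives how many ticks each learner needs per local round, so
    learners finish at different times (hypothetical heterogeneous hardware).
    """
    rng = random.Random(seed)
    global_params = [0.0] * len(TARGET)
    learners = [{"start": list(global_params), "done_at": d} for d in durations]
    merges = 0
    for tick in range(1, ticks + 1):
        for learner, duration in zip(learners, durations):
            if tick != learner["done_at"]:
                continue  # still computing; nobody waits on this learner
            if rng.random() >= fail_prob:
                # Healthy round: send only the change produced by local training.
                local = local_steps(learner["start"], inner_steps)
                delta = [l - s for l, s in zip(local, learner["start"])]
                global_params = [g + outer_lr * d
                                 for g, d in zip(global_params, delta)]
                merges += 1
            # Failed or not, the learner restarts from the latest shared state.
            learner["start"] = list(global_params)
            learner["done_at"] = tick + duration
    return global_params, merges


if __name__ == "__main__":
    params, merges = simulate()
    print("merged rounds:", merges)
    print("final parameters:", [round(p, 3) for p in params])
```

A failed round in this sketch simply contributes nothing, and the other learners keep merging on their own schedules. That is the failure-isolation property the method relies on, shown here in a single process rather than across data centers.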

Best for: MLOps Engineer, Research Scientist, CTO, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Google DeepMind News.