Decoupled DiLoCo: A new frontier for resilient, distributed AI training

2026-04-23 · Source: Google DeepMind News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

Google DeepMind and Google Research have introduced Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture designed to make large language model (LLM) training more resilient and flexible across geographically dispersed data centers. The system divides a large training run into asynchronous "islands" of compute, called learner units, so that a local disruption stays contained while the rest of the system keeps learning. Building on Pathways and DiLoCo, the architecture sharply reduces bandwidth requirements, operating at 2-5 Gbps for a 12-billion-parameter model across four U.S. regions, and maintains high "goodput" (useful training throughput) even with substantial hardware failures. Decoupled DiLoCo also supports mixing hardware generations, such as TPU v6e and TPU v5p, within a single training run, which extends the useful life of older hardware and increases available compute without compromising ML performance.
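
To put the 2-5 Gbps figure in perspective, here is a back-of-the-envelope calculation of how long one full copy of a 12-billion-parameter model would take to cross such a link. The summary does not say what is actually exchanged between regions or at what precision, so the 2 bytes per parameter, the lack of compression, and the helper name `transfer_seconds` are all assumptions made for illustration.

```python
def transfer_seconds(num_params: float, bytes_per_param: float, link_gbps: float) -> float:
    """Seconds needed to send one full copy of the parameters over a link."""
    bits = num_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)


PARAMS = 12e9          # 12 billion parameters (from the summary)
BYTES_PER_PARAM = 2    # assumption: 16-bit values, sent uncompressed

for gbps in (2, 5):
    secs = transfer_seconds(PARAMS, BYTES_PER_PARAM, gbps)
    print(f"{gbps} Gbps: {secs:.1f} s per full parameter exchange")
```

Under these assumptions a single exchange takes tens of seconds, which suggests why communication at this bandwidth has to be infrequent and overlapped with computation rather than performed at every training step.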

Key takeaway

For MLOps Engineers managing large-scale LLM training, Decoupled DiLoCo offers a path to better system resilience and efficiency. Your teams can now consider training models across globally distributed and potentially heterogeneous hardware without incurring prohibitive communication delays or risking run-wide interruptions from localized failures. This approach extends the useful life of existing hardware and expands your available compute capacity.

Key insights

Decoupled DiLoCo enables resilient, low-bandwidth, and asynchronous LLM training across distributed, heterogeneous hardware.

Principles

Asynchronous data flow enhances fault tolerance.
Decoupling compute isolates failures.
Reduced bandwidth enables global-scale training.
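
A toy probability model can illustrate why decoupling isolates failures. It is not taken from the original article: it assumes each island is healthy independently with the same probability, and the function names `synchronous_goodput` and `decoupled_goodput` are hypothetical.

```python
def synchronous_goodput(p_up: float, n_islands: int) -> float:
    """Expected fraction of useful compute when all islands train in lockstep.

    Assumption: one unhealthy island stalls everyone, so progress is made
    only when all n_islands are healthy at the same time.
    """
    return p_up ** n_islands


def decoupled_goodput(p_up: float, n_islands: int) -> float:
    """Expected fraction of useful compute when islands progress independently.

    Assumption: a healthy island keeps training regardless of the others,
    so the expected useful fraction equals p_up for any number of islands.
    """
    return p_up


for n in (1, 4, 16):
    sync = synchronous_goodput(0.95, n)
    dec = decoupled_goodput(0.95, n)
    print(f"islands={n:2d}  synchronous={sync:.3f}  decoupled={dec:.3f}")
```

In this simplified model the lockstep figure shrinks as islands are added, while the decoupled figure stays flat. Real systems have correlated failures and recovery costs that the model ignores.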

Method

Decoupled DiLoCo passes data asynchronously between "islands" of compute (learner units). A failure in one island therefore stays local, and communication is folded into longer stretches of computation instead of becoming a blocking bottleneck.
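
The method above can be sketched with a toy low-communication training loop. This is a minimal illustration under stated assumptions, not the authors' implementation: each learner unit runs many local steps on its own objective, then contributes only the change in its parameters, and a learner that is down is simply skipped so the others are not blocked. The quadratic objective, plain gradient descent, round-based scheduling (a simplification of true asynchrony), and the names `local_steps`, `train`, and `fail_prob` are all hypothetical.

```python
import random


def local_steps(params, target, steps, lr):
    """Run `steps` of plain gradient descent on 0.5 * ||params - target||^2."""
    w = list(params)
    for _ in range(steps):
        w = [wi - lr * (wi - ti) for wi, ti in zip(w, target)]
    return w


def train(targets, rounds=50, inner_steps=20, inner_lr=0.1,
          outer_lr=1.0, fail_prob=0.25, seed=0):
    """Toy low-communication loop; each entry of `targets` is one learner's optimum."""
    rng = random.Random(seed)
    dim = len(targets[0])
    global_params = [0.0] * dim
    for _ in range(rounds):
        deltas = []
        for target in targets:
            if rng.random() < fail_prob:
                continue  # this learner unit is down; the others are not blocked
            local = local_steps(global_params, target, inner_steps, inner_lr)
            # Only the parameter change is communicated, once per round.
            deltas.append([l - g for l, g in zip(local, global_params)])
        if not deltas:
            continue  # nobody reported this round; keep the current parameters
        avg = [sum(d[i] for d in deltas) / len(deltas) for i in range(dim)]
        global_params = [g + outer_lr * a for g, a in zip(global_params, avg)]
    return global_params


# Four hypothetical learner units, each pulling toward its own data optimum.
targets = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [2.0, 2.0]]
print("with failures:   ", train(targets))
print("without failures:", train(targets, fail_prob=0.0))
```

With no failures the shared parameters settle at the average of the learners' optima. With failures they keep moving toward the average of whichever learners reported, so progress continues even when some units are unavailable.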

In practice

Train LLMs across distant data centers.
Utilize mixed-generation hardware for training.
Maintain training progress despite chip failures.

Topics

Decoupled DiLoCo
Distributed AI Training
Hardware Resiliency
Asynchronous Training
Low-Bandwidth Communication

Best for: MLOps Engineer, Research Scientist, CTO, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Google DeepMind News.