The Counterintuitive Networking Decisions Behind OpenAI’s 131,000-GPU Training Fabric
Summary
The Multipath Reliable Connection (MRC) protocol, developed by a consortium including OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA, was released through the Open Compute Project on May 5, 2026. It is deployed across OpenAI's NVIDIA GB200 supercomputers, including the Stargate and Fairwater sites, and has been used to train frontier models like ChatGPT and Codex. MRC re-architects high-performance data center networks for AI training at scales exceeding 100,000 GPUs. It removes the dynamic Layer 3 control plane, disabling routing protocols such as OSPF and BGP in favor of static SRv6 source routing. It also sprays packets across eight independent network planes, runs on lossy Ethernet with Priority Flow Control (PFC) disabled, and repurposes Explicit Congestion Notification (ECN) for per-path load balancing rather than rate control. Together, these choices enable microsecond-level failure recovery and predictable bandwidth, which improves training performance by reducing tail latency and keeping network failures from interrupting jobs.
Key takeaway
For CTOs and VPs of Engineering designing or operating large-scale AI training infrastructure, MRC's departure from conventional networking offers a production-deployed way to reduce tail latency and improve fault tolerance. Evaluate the MRC specification and research paper to determine whether its multi-plane, static-routing, lossy-Ethernet approach fits your workload profiles and hardware platforms. The design targets synchronous pretraining on single-tenant fabrics, so weigh the trade-offs carefully before applying it to multi-tenant or inference workloads.
Key insights
MRC inverts traditional networking principles to achieve predictable performance at extreme AI training scale.
Principles
- Endpoint intelligence beats network intelligence at scale.
- Tail latency dominates performance in large-scale synchronous training.
- Simpler control planes enhance operational manageability.
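The tail-latency principle can be illustrated with a small calculation. This is a minimal sketch, not taken from the article: it assumes each worker independently hits a slow network event in a given step with probability `p_slow` (a hypothetical parameter), and that a synchronous step is delayed whenever any one worker is slow.

```python
def prob_step_delayed(num_workers: int, p_slow: float) -> float:
    """Probability that at least one of num_workers is slow in a step.

    Assumes slow events are independent across workers, which is a
    simplification; real fabrics show correlated slowdowns.
    """
    return 1.0 - (1.0 - p_slow) ** num_workers


if __name__ == "__main__":
    # 131,000 GPUs comes from the article title; the per-worker
    # probabilities below are illustrative assumptions only.
    for p in (1e-7, 1e-6, 1e-5):
        print(f"p_slow={p:.0e}: P(step delayed) = {prob_step_delayed(131_000, p):.3f}")
```

Under these assumptions, even a one-in-100,000 chance per worker delays roughly 73% of steps at 131,000 workers, which is why rare per-path events matter so much at this scale.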
Method
MRC manages network traffic and failures with a multi-plane topology, packet spraying with entropy values, static SRv6 source routing, lossy Ethernet with selective retransmission, and ECN used for load balancing.
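The interaction between packet spraying, entropy values, and ECN-based load balancing can be sketched as a toy sender-side path selector. This is an illustrative model only, not the MRC specification: the `Sprayer` class, the number of paths per plane, the `ecn_threshold` value, and the mark-decay rule are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass

NUM_PLANES = 8  # the summary describes eight independent network planes


@dataclass
class PathState:
    entropy: int          # hypothetical entropy value that selects one path
    plane: int            # plane the path belongs to
    ecn_marks: int = 0    # recent ECN marks observed on this path
    failed: bool = False  # set when the sender decides the path is dead


class Sprayer:
    """Toy sender that sprays packets round-robin over usable paths."""

    def __init__(self, paths_per_plane: int = 4, ecn_threshold: int = 3):
        self.ecn_threshold = ecn_threshold
        # Entropy values double as indices into self.paths.
        self.paths = [
            PathState(entropy=plane * paths_per_plane + i, plane=plane)
            for plane in range(NUM_PLANES)
            for i in range(paths_per_plane)
        ]
        self._next = 0

    def pick_path(self) -> PathState:
        alive = [p for p in self.paths if not p.failed]
        if not alive:
            raise RuntimeError("no usable path")
        # Prefer uncongested paths; fall back to any live path.
        cool = [p for p in alive if p.ecn_marks < self.ecn_threshold]
        candidates = cool or alive
        path = candidates[self._next % len(candidates)]
        self._next += 1
        return path

    def on_ack(self, entropy: int, ecn_marked: bool) -> None:
        # ECN steers traffic away from a path instead of lowering the rate.
        path = self.paths[entropy]
        if ecn_marked:
            path.ecn_marks += 1
        else:
            path.ecn_marks = max(0, path.ecn_marks - 1)

    def on_path_failure(self, entropy: int) -> None:
        # The endpoint stops using the path; no routing protocol is involved.
        self.paths[entropy].failed = True


if __name__ == "__main__":
    sprayer = Sprayer()
    print(Counter(sprayer.pick_path().plane for _ in range(32)))
```

The point of the sketch is that recovery is a purely local decision at the endpoint: a failed or congested path is dropped from the candidate set on the next packet, with no reconvergence in the network.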
In practice
- Split 800 Gb/s NICs into 8x100 Gb/s links for multi-plane fabrics.
- Disable dynamic routing for fixed-topology, single-tenant AI clusters.
- Prefer selective retransmission to PFC for faster recovery.
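To make the selective-retransmission item concrete, the sketch below compares which packets a sender would resend after losses on a lossy fabric. It is a generic illustration, not MRC's wire format or recovery logic, which the summary does not describe; the function names and the go-back-N baseline are assumptions chosen for contrast.

```python
def selective_resend(received: set[int], highest_sent: int) -> list[int]:
    """Resend only the sequence numbers that never arrived."""
    return [seq for seq in range(highest_sent + 1) if seq not in received]


def go_back_n_resend(received: set[int], highest_sent: int) -> list[int]:
    """Baseline: resend everything from the first gap onward."""
    for seq in range(highest_sent + 1):
        if seq not in received:
            return list(range(seq, highest_sent + 1))
    return []


if __name__ == "__main__":
    # Packets 3 and 6 were dropped; spraying means the rest may
    # arrive out of order, so gaps alone are not an error.
    arrived = {0, 1, 2, 4, 5, 7}
    print("selective:", selective_resend(arrived, 7))
    print("go-back-N:", go_back_n_resend(arrived, 7))
```

With two losses out of eight packets, the selective sender resends two packets while the baseline resends five, and neither approach needs PFC to pause the link.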
Topics
- Multipath Reliable Connection
- AI Training Fabrics
- GPU Supercomputers
- Packet Spraying
- SRv6 Static Routing
Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Architect, MLOps Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.