The Counterintuitive Networking Decisions Behind OpenAI’s 131,000-GPU Training Fabric

2026-05-14 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

The Multipath Reliable Connection (MRC) protocol, developed by a consortium including OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA, was released on May 5, 2026, through the Open Compute Project. MRC is deployed across OpenAI's NVIDIA GB200 supercomputers, including the Stargate and Fairwater sites, where it has been used to train frontier models such as those powering ChatGPT and Codex. The protocol re-architects high-performance data center networks for AI training at scales exceeding 100,000 GPUs. It eliminates the Layer 3 control plane: dynamic routing protocols such as OSPF and BGP are disabled in favor of static SRv6 source routing. MRC also sprays packets across eight independent network planes, runs on lossy Ethernet with Priority Flow Control (PFC) disabled, and repurposes Explicit Congestion Notification (ECN) for per-path load balancing rather than rate control. Together, these design choices enable microsecond-level failure recovery and predictable bandwidth, which improves training performance by reducing tail latency and keeping network failures from interrupting jobs.

Key takeaway

For CTOs and VPs of Engineering designing or operating large-scale AI training infrastructure, MRC's departure from conventional networking offers a production-proven way to reduce tail latency and improve fault tolerance. Evaluate the MRC specification and research paper to determine whether its multi-plane, static-routing, lossy-Ethernet approach fits your workload profiles and hardware platforms, particularly for synchronous pretraining on single-tenant fabrics. Weigh the trade-offs carefully for multi-tenant or inference workloads, where the fixed-topology, single-tenant assumptions may not hold.

Key insights

MRC inverts traditional networking principles to achieve predictable performance at extreme AI training scale.

Principles

Endpoint intelligence beats network intelligence at scale.
Tail latency dominates performance in large-scale synchronous training.
Simpler control planes enhance operational manageability.
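
To make the tail-latency principle concrete, here is a minimal Python sketch. It assumes that flows are slowed independently and uses a hypothetical per-flow slowdown probability of 1e-5; neither assumption comes from the article, and the function names are illustrative only. The point is that a synchronous step waits for its slowest participant, so a rare per-flow slowdown becomes a common per-step slowdown at 131,000 GPUs.

```python
def prob_step_delayed(p_slow: float, n_flows: int) -> float:
    """Chance that at least one of n_flows independent flows is slow.

    Assumption (not from the article): slowdowns are independent and
    each flow is slow with the same probability p_slow.
    """
    if not 0.0 <= p_slow <= 1.0:
        raise ValueError("p_slow must be a probability in [0, 1]")
    if n_flows < 0:
        raise ValueError("n_flows must be non-negative")
    return 1.0 - (1.0 - p_slow) ** n_flows


def step_time(flow_times):
    """A synchronous step completes only when the slowest flow completes."""
    return max(flow_times)


if __name__ == "__main__":
    # Hypothetical numbers: one flow per GPU, 1-in-100,000 chance of a slow flow.
    print(f"{prob_step_delayed(1e-5, 131_000):.2f}")  # roughly 0.73
    # Two fast flows and one straggler: the straggler sets the step time.
    print(step_time([1.0, 1.0, 5.0]))
```

Under these assumed numbers, most steps would be delayed by at least one straggler, which is why the design targets the tail rather than the average.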

Method

MRC uses multi-plane topology, packet spraying with entropy values, static SRv6 source routing, lossy Ethernet with selective retransmission, and ECN for load balancing to manage network traffic and failures.
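
The interaction between packet spraying and ECN-based load balancing can be sketched as a toy sender-side model in Python. This is an illustration under stated assumptions, not the MRC specification: the class name `PathSprayer`, the round-robin policy, and the rule of dropping a path on a single ECN mark are all hypothetical. Each entropy value is assumed to stand for one path through the fabric.

```python
from collections import deque


class PathSprayer:
    """Toy model: spray packets over paths, steer away from ECN-marked ones.

    Hypothetical sketch. Each entropy value is assumed to select one path;
    the real protocol's selection and recovery rules are not shown here.
    """

    def __init__(self, entropy_values):
        self.active = deque(entropy_values)  # paths currently in rotation
        self.avoided = set()                 # paths taken out after an ECN mark

    def next_entropy(self):
        """Pick the entropy value for the next packet (simple round-robin)."""
        entropy = self.active[0]
        self.active.rotate(-1)
        return entropy

    def on_ack(self, entropy, ecn_marked):
        """Treat an ECN mark as a per-path signal, not a rate-control signal."""
        if ecn_marked and entropy in self.active and len(self.active) > 1:
            self.active.remove(entropy)
            self.avoided.add(entropy)


if __name__ == "__main__":
    sprayer = PathSprayer(range(8))          # one entropy value per plane
    first = [sprayer.next_entropy() for _ in range(8)]
    sprayer.on_ack(3, ecn_marked=True)       # path 3 reports congestion
    later = [sprayer.next_entropy() for _ in range(14)]
    print(first)
    print(3 in later)                        # path 3 is no longer used
```

Note that the sending rate is unchanged after the mark; only the set of paths changes, which is the sense in which ECN is used for load balancing rather than rate control.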

In practice

Split 800 Gb/s NICs into 8x100 Gb/s links for multi-plane fabrics.
Disable dynamic routing for fixed-topology, single-tenant AI clusters.
Implement selective retransmission over PFC for faster recovery.
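
The selective retransmission item above can be illustrated with a short Python sketch. The function names are hypothetical, and the go-back-N baseline is included only for comparison; the article contrasts selective retransmission with PFC, not with go-back-N. On a lossy, sprayed fabric packets arrive out of order, so resending only the missing sequence numbers avoids resending data that already arrived.

```python
def selective_retransmit(sent_seqs, acked_seqs):
    """Return only the sequence numbers that were never acknowledged."""
    acked = set(acked_seqs)
    return [seq for seq in sent_seqs if seq not in acked]


def go_back_n_retransmit(sent_seqs, acked_seqs):
    """Comparison baseline: resend everything from the first gap onward."""
    acked = set(acked_seqs)
    for index, seq in enumerate(sent_seqs):
        if seq not in acked:
            return list(sent_seqs[index:])
    return []


if __name__ == "__main__":
    sent = list(range(10))
    acked = [0, 1, 2, 4, 5, 6, 7, 8, 9]       # packet 3 was dropped
    print(selective_retransmit(sent, acked))  # [3]
    print(go_back_n_retransmit(sent, acked))  # [3, 4, 5, 6, 7, 8, 9]
```

In this example a single drop costs one retransmitted packet instead of seven.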

Topics

Multipath Reliable Connection
AI Training Fabrics
GPU Supercomputers
Packet Spraying
SRv6 Static Routing

Best for: CTO, VP of Engineering/Data, Director of AI/ML, AI Architect, MLOps Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.