Unlocking large scale AI training networks with MRC (Multipath Reliable Connection)

Source: OpenAI News · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure) · Depth: Advanced, long

Summary

OpenAI, working with AMD, Broadcom, Intel, Microsoft, and NVIDIA, has developed Multipath Reliable Connection (MRC), a network protocol designed to improve GPU networking performance and resilience in large-scale AI training clusters. MRC was released through the Open Compute Project (OCP) on May 5, 2026. It targets two problems in supercomputer networking that grow with cluster size: congestion and frequent link and device failures. MRC enables multi-plane high-speed networks in which more than 100,000 GPUs can be connected with only two tiers of Ethernet switches, which reduces power consumption and cost. Its adaptive packet spraying spreads a single data transfer across hundreds of paths, virtually eliminating congestion in the network core, while SRv6-based source routing steers traffic around failures within microseconds, keeping performance predictable during network disruptions. MRC is already deployed across OpenAI's largest NVIDIA GB200 supercomputers, including those with Oracle Cloud Infrastructure (OCI) in Abilene, Texas, and Microsoft's Fairwater supercomputers.
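
The two-tier claim can be checked with back-of-the-envelope arithmetic. The sketch below uses the standard sizing rule for a non-blocking two-tier leaf-spine fabric (half of each leaf's ports face hosts, and each spine port reaches one leaf). The switch capacity, NIC speed, and plane count are illustrative assumptions, not figures from the source; the point is only that running each NIC as several slower ports raises the effective switch radix, and capacity grows with the square of the radix.

```python
def two_tier_endpoints(switch_radix: int) -> int:
    """Max endpoints in a non-blocking two-tier leaf-spine fabric.

    Each leaf uses half its ports for hosts and half for uplinks. Each spine
    port connects to one leaf, so there can be at most `switch_radix` leaves.
    """
    if switch_radix < 2 or switch_radix % 2:
        raise ValueError("switch_radix must be a positive even number")
    hosts_per_leaf = switch_radix // 2
    max_leaves = switch_radix
    return hosts_per_leaf * max_leaves


def ports_per_switch(switch_gbps: int, port_gbps: int) -> int:
    """Number of ports a switch exposes when every port runs at port_gbps."""
    return switch_gbps // port_gbps


# Assumed, illustrative hardware figures (not taken from the source).
SWITCH_GBPS = 51_200   # total switch capacity
NIC_GBPS = 800         # per-GPU network interface speed
PLANES = 8             # independent planes each NIC is split across

# One plane: every switch port runs at full NIC speed.
single_plane = two_tier_endpoints(ports_per_switch(SWITCH_GBPS, NIC_GBPS))

# Multi-plane: each NIC contributes one slower port to each plane, so every
# plane's switches expose PLANES times as many ports.
multi_plane = two_tier_endpoints(
    ports_per_switch(SWITCH_GBPS, NIC_GBPS // PLANES)
)

print(f"single plane: {single_plane} endpoints")
print(f"{PLANES} planes:     {multi_plane} endpoints per plane")
```

Under these assumptions a single-plane fabric tops out at 2,048 endpoints in two tiers, while each of the eight planes can reach 131,072, which is consistent with the "over 100,000 GPUs" figure.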

Key takeaway

For MLOps engineers and infrastructure architects who build or operate large-scale AI training environments, MRC offers a way to improve network reliability and efficiency. Reducing the training-job interruptions caused by network failures and congestion makes model development faster and more consistent. Consider evaluating MRC-compatible hardware and software when planning supercomputing infrastructure for frontier model training.

Key insights

MRC improves the reliability and performance of large-scale AI training networks through multipath packet spraying and SRv6-based source routing over static routing tables.

Method

MRC extends RoCE (RDMA over Converged Ethernet) with SRv6-based source routing. It splits each network interface into multiple planes, sprays packets across hundreds of paths, and uses static routing tables to bypass failures.
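
To make the spraying and failure-bypass idea concrete, here is a toy model of a sender that spreads one transfer over many paths and stops using a path once it is marked as failed. It is a minimal sketch, not MRC's actual algorithm: the `Sprayer` class, the `plane/spine` path labels, and the plain round-robin policy are all hypothetical stand-ins, and real adaptive spraying would also react to congestion signals.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Sprayer:
    """Toy sender that spreads one transfer's packets over many paths."""

    paths: list[str]                      # hypothetical path identifiers
    failed: set[str] = field(default_factory=set)
    _next: int = 0                        # round-robin cursor

    def healthy(self) -> list[str]:
        """Paths that have not been marked as failed."""
        return [p for p in self.paths if p not in self.failed]

    def mark_failed(self, path: str) -> None:
        """Exclude a path from all later packets (source-side bypass)."""
        self.failed.add(path)

    def pick(self) -> str:
        """Choose the path for the next packet, round-robin over live paths."""
        live = self.healthy()
        if not live:
            raise RuntimeError("no healthy path left")
        path = live[self._next % len(live)]
        self._next += 1
        return path

    def send(self, num_packets: int) -> Counter:
        """Send packets and return how many went over each path."""
        return Counter(self.pick() for _ in range(num_packets))


# 4 planes x 4 spine switches = 16 paths (illustrative sizes only).
paths = [f"plane{p}/spine{s}" for p in range(4) for s in range(4)]
sprayer = Sprayer(paths)

before = sprayer.send(1600)            # load spreads evenly over 16 paths
sprayer.mark_failed("plane0/spine0")   # simulate a link or switch failure
after = sprayer.send(1500)             # traffic continues on the other 15

print(len(before), "paths used before the failure")
print(len(after), "paths used after the failure")
```

Because the sender chooses the path for every packet, avoiding a failure only requires the sender to stop selecting that path. No switch has to recompute routes, which is one way to read the role of static routing tables in the design.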

Best for: MLOps Engineer, CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.