Technical Dive into AMD’s MLPerf Training v6.0 Submission

2026-06-16 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

AMD published its MLPerf Training v6.0 results on June 16, 2026, covering the MI325X, MI350X, and MI355X Instinct GPUs. The round includes three benchmarks: Llama 2 70B LoRA fine-tuning, Llama 3.1 8B pretraining, and FLUX.1 Schnell text-to-image pretraining. It marks the debut of a production-ready MXFP4 (FP4) training recipe, the first use of AMD's Primus training framework, and AMD's first multi-node submissions. AMD reports that MI355X is 3.5x faster than its MI300X result from v5.0. Since v5.1, Llama 2 70B fine-tuning improved by 19% on MI355X and 16% on MI350X, and Llama 3.1 8B pretraining improved by 13% on MI355X and 11% on MI350X. MI355X results land within 5% of NVIDIA B200, and the multi-node submissions span 8-node MI325X (64 GPUs) and 64-node MI300X (512 GPUs) configurations.

Key takeaway

For AI engineers evaluating GPU infrastructure for large-scale model training, AMD's MLPerf Training v6.0 results indicate that MI355X GPUs with MXFP4 precision and the Primus framework perform within 5% of NVIDIA B200. Consider AMD Instinct GPUs for next-generation AI workloads, especially if you need efficient multi-node scaling and low-precision training. The results also support the case that AMD's software stack is ready for production-grade deployments.

Key insights

AMD's MLPerf Training v6.0 submission validates MXFP4 training, the Primus framework, and multi-node scalability for competitive AI training on Instinct GPUs.

Principles

MXFP4 offers approximately 2x the compute density of FP8.
Deterministic Hadamard rotation stabilizes MXFP4 training.
Multi-node scaling is critical for production AI workloads.
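
To make the Hadamard principle concrete, here is a minimal pure-Python sketch of a fixed 16-point rotation. It assumes the Sylvester construction and an orthonormal 1/sqrt(16) scaling, neither of which the summary specifies, and `hadamard` and `rotate_blocks` are hypothetical names rather than Primus or ROCm APIs. What it illustrates is that a single outlier is spread evenly across its block, which narrows the range a 4-bit format has to cover.

```python
def hadamard(n):
    """Build an n x n Hadamard matrix via the Sylvester construction.

    Assumption: the summary only says "16-point Hadamard rotation";
    the Sylvester form is one standard way to build such a matrix.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def rotate_blocks(values, n=16):
    """Apply the orthonormal n-point Hadamard transform to each block of n values.

    The matrix is fixed (no random sign flips), which is one reading of
    "deterministic" in the summary.
    """
    if len(values) % n:
        raise ValueError("length must be a multiple of n")
    h = hadamard(n)
    scale = n ** -0.5  # makes the transform orthonormal
    out = []
    for start in range(0, len(values), n):
        block = values[start:start + n]
        out.extend(
            scale * sum(h[i][j] * block[j] for j in range(n)) for i in range(n)
        )
    return out


# One large outlier in an otherwise empty block.
x = [0.0] * 16
x[3] = 8.0
y = rotate_blocks(x)
print(max(abs(v) for v in x), max(abs(v) for v in y))
```

In this example every rotated entry has magnitude 2.0 instead of one entry holding 8.0. Applying `rotate_blocks` a second time recovers the input, because the normalized Sylvester matrix is its own inverse.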

Method

The MXFP4 recipe quantizes linear layers to the E2M1 format with E8M0 block scaling, applies a deterministic 16-point Hadamard rotation, and fuses the pipeline into a single HIP kernel.
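
The quantization step can be sketched as an unfused, pure-Python reference, assuming the conventions of the OCP Microscaling (MX) specification: a block of 32 elements shares one power-of-two E8M0 scale, and each element is stored as a signed E2M1 value. The block size, the scale rule, and the rounding mode are assumptions on top of the summary, and `e8m0_scale` and `quantize_block` are hypothetical names. This is not AMD's fused HIP kernel, only an illustration of the number format.

```python
import math

# Magnitudes representable in E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2   # largest power of two in the grid is 4.0 = 2**2
BLOCK = 32      # assumption: MX block size; the summary does not state it


def e8m0_scale(block):
    """Pick a shared power-of-two scale for the block (E8M0 has no mantissa)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    exp = (e - 1) - E2M1_EMAX
    exp = max(-127, min(127, exp))  # E8M0 exponent range
    return 2.0 ** exp


def quantize_block(block):
    """Return (scale, dequantized values) for one block of BLOCK floats.

    Simplification: rounds to the nearest grid point, with ties going to
    the smaller magnitude, and saturates at the largest E2M1 value.
    """
    if len(block) != BLOCK:
        raise ValueError(f"expected {BLOCK} values")
    scale = e8m0_scale(block)
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest, v) * scale)
    return scale, out


scale, q = quantize_block([6.0, -3.0, 0.7, 0.2] + [0.0] * 28)
print(scale, q[:4])
```

Here the block maximum of 6.0 yields a scale of 1.0, so 6.0 and -3.0 are represented exactly while 0.7 snaps to 0.5 and 0.2 snaps to 0.0. That loss of small values next to a large one is the kind of error a rotation applied before quantization is meant to reduce.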

In practice

Use Primus for unified large-scale model training.
Implement MXFP4 for higher throughput on MI355X.
Employ a healing transition for MXFP4-to-FP8 convergence.
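
The summary does not say when or how the healing transition happens. One plausible reading is that most of the run uses MXFP4 and a final stretch switches to FP8 so the model can reach its convergence target. The sketch below encodes only that reading: `precision_for_step`, the string labels, and the 10% default are all hypothetical and are not taken from Primus or from AMD's submission.

```python
def precision_for_step(step, total_steps, heal_fraction=0.1):
    """Return the GEMM precision label to use at a given optimizer step.

    Assumption: training runs in MXFP4 first and switches to FP8 for the
    last `heal_fraction` of the steps. The real switch criterion is not
    given in the summary.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    if not 0.0 <= heal_fraction <= 1.0:
        raise ValueError("heal_fraction must be in [0, 1]")
    heal_steps = int(round(total_steps * heal_fraction))
    switch_step = total_steps - heal_steps
    return "fp8" if step >= switch_step else "mxfp4"


# With 1000 steps and the default fraction, the last 100 steps use FP8.
schedule = [precision_for_step(s, 1000) for s in range(1000)]
print(schedule.count("mxfp4"), schedule.count("fp8"))
```

A step-based switch keeps the example simple. A real recipe might instead trigger the change from an evaluation metric or a loss threshold.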

Topics

MLPerf Training
AMD Instinct GPUs
MXFP4 Precision
Primus Framework
Llama 2 70B
Llama 3.1 8B
Multi-node Training

Best for: MLOps Engineer, NLP Engineer, CTO, Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.