Technical Dive into AMD’s MLPerf Training v6.0 Submission
Summary
AMD released its MLPerf Training v6.0 results on June 16, 2026, covering the MI325X, MI350X, and MI355X Instinct GPUs. The round includes three benchmarks: Llama 2 70B LoRA fine-tuning, Llama 3.1 8B pretraining, and FLUX.1 Schnell text-to-image pretraining. Key milestones are the debut of a production-ready MXFP4 (FP4) training recipe, the first use of AMD's Primus training framework, and AMD's first multi-node submissions. The MI355X is 3.5x faster than the MI300X result from v5.0. Since v5.1, Llama 2 70B fine-tuning improved by 19% on MI355X and 16% on MI350X, and Llama 3.1 8B pretraining improved by 13% on MI355X and 11% on MI350X. The MI355X came within 5% of NVIDIA B200 GPUs, and the multi-node submissions ran on 8 nodes of MI325X (64 GPUs) and 64 nodes of MI300X (512 GPUs).
Key takeaway
For AI engineers evaluating GPU infrastructure for large-scale model training, AMD's MLPerf Training v6.0 results indicate that MI355X GPUs with MXFP4 precision and the Primus framework perform within 5% of NVIDIA B200. Consider AMD Instinct GPUs for next-generation AI workloads, especially if you need efficient multi-node scaling and low-precision training. The results also support the readiness of AMD's software stack for production-grade deployments.
Key insights
AMD's MLPerf v6.0 submission validates MXFP4 training, the Primus framework, and multi-node scalability for competitive AI training on Instinct GPUs.
Principles
- MXFP4 offers approximately 2x the compute density of FP8.
- Deterministic Hadamard rotation stabilizes MXFP4 training.
- Multi-node scaling is critical for production AI workloads.
Method
The MXFP4 recipe quantizes linear layers to the E2M1 element format with E8M0 block scaling, applies a deterministic 16-point Hadamard rotation, and fuses the whole pipeline into a single HIP kernel.
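The pieces named above can be sketched in plain Python to show the numerics. This is an illustrative reference, not AMD's fused HIP kernel. It assumes the block size of 32 from the OCP Microscaling (MX) format, applies the rotation before quantization, uses a sign-free Walsh-Hadamard matrix as the "deterministic" rotation, and breaks rounding ties toward the smaller magnitude. The function names are hypothetical.

```python
import math

# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2   # largest E2M1 exponent: 6.0 == 1.5 * 2**2
BLOCK = 32      # assumption: MX block size, one E8M0 scale per 32 elements
HAD = 16        # rotation width named in the text


def hadamard16(x):
    """Orthonormal 16-point Walsh-Hadamard transform with no random signs."""
    y = list(x)
    h = 1
    while h < HAD:
        for i in range(0, HAD, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return [v / 4.0 for v in y]  # 1/sqrt(16) keeps the transform orthonormal


def quantize_block(block):
    """Return (shared exponent, dequantized values) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    exp = max(-127, min(127, exp))  # E8M0 encodes a power-of-two scale only
    scale = 2.0 ** exp
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])        # saturate at 6.0
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # nearest grid value
        out.append(math.copysign(q, v) * scale)
    return exp, out


def mxfp4_fake_quant(row):
    """Rotate in 16-wide groups, then quantize in 32-wide blocks."""
    if len(row) % BLOCK != 0:
        raise ValueError("row length must be a multiple of 32")
    rotated = []
    for i in range(0, len(row), HAD):
        rotated.extend(hadamard16(row[i:i + HAD]))
    exps, values = [], []
    for i in range(0, len(rotated), BLOCK):
        e, q = quantize_block(rotated[i:i + BLOCK])
        exps.append(e)
        values.extend(q)
    return exps, values
```

Calling `hadamard16([16.0] + [0.0] * 15)` returns sixteen values of 4.0, which shows why the rotation helps: a single outlier is spread evenly across the group, so the shared block scale is no longer dominated by one element. Because the matrix is orthonormal, rotating both operands of a matrix multiply along the shared dimension leaves the exact product unchanged.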
In practice
- Use Primus for unified large-scale model training.
- Implement MXFP4 for higher throughput on MI355X.
- Employ a healing transition for MXFP4-to-FP8 convergence.
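One way to read the healing transition is as a precision schedule: run most steps in MXFP4, then switch to FP8 for a final window so the model can recover convergence. The sketch below is a hypothetical illustration of that reading. The function name, the `healing_fraction` parameter, and its default are assumptions, not values from AMD's submission or part of the Primus API.

```python
def precision_for_step(step, total_steps, healing_fraction=0.1):
    """Pick the linear-layer precision for a training step.

    Hypothetical schedule: MXFP4 for the bulk of training, FP8 for the
    final `healing_fraction` of steps.
    """
    if not 0.0 <= healing_fraction <= 1.0:
        raise ValueError("healing_fraction must be in [0, 1]")
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    # First step of the FP8 healing window.
    switch_step = total_steps - int(total_steps * healing_fraction)
    return "mxfp4" if step < switch_step else "fp8"
```

In a training loop, the returned string would select which quantized kernel path the linear layers use for that step. The right window length is workload-specific and would need to be tuned against the benchmark's convergence target.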
Topics
- MLPerf Training
- AMD Instinct GPUs
- MXFP4 Precision
- Primus Framework
- Llama 2 70B
- Llama 3.1 8B
- Multi-node Training
Best for: MLOps Engineer, NLP Engineer, CTO, Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.