Reproducing AMD MLPerf Training v6.0 Submission Result
Summary
AMD has released a step-by-step guide for reproducing its MLPerf Training v6.0 submission results, which were achieved on AMD Instinct MI325X, MI350X, and MI355X GPUs. The guide covers three benchmarks: Llama 2 70B LoRA fine-tuning, Llama 3.1 8B pretraining, and Flux.1-schnell text-to-image training. The LLM benchmarks use AMD's Primus training framework, which abstracts over Megatron-LM and TorchTitan. Reproduction requires ROCm 7.2.2 or later, Docker, and benchmark-specific disk space (for example, 6 TB for Flux.1-schnell). For each benchmark, the article details environment setup, dataset preparation, training configuration, execution, and result validation. Expected time-to-train scores are 8.27 minutes on MI355X and 10.25 minutes on MI350X for Llama 2 70B LoRA, 86.84 minutes on MI355X and 109.76 minutes on MI350X for Llama 3.1 8B pretraining, and 92.36 minutes on an 8-node MI325X configuration for Flux.1-schnell.
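When checking a reproduction against the expected scores quoted above, it can help to express the gap as a percentage. The following is a minimal sketch, not part of AMD's guide: the benchmark and platform keys, the function names, and the 5% default tolerance are all illustrative assumptions. Only the minute values come from the summary.

```python
# Expected time-to-train scores in minutes, as quoted in the summary.
# The dictionary keys are hypothetical labels chosen for this sketch.
EXPECTED_MINUTES = {
    ("llama2_70b_lora", "MI355X"): 8.27,
    ("llama2_70b_lora", "MI350X"): 10.25,
    ("llama31_8b_pretrain", "MI355X"): 86.84,
    ("llama31_8b_pretrain", "MI350X"): 109.76,
    ("flux1_schnell", "MI325X-8node"): 92.36,
}


def percent_deviation(benchmark, platform, measured_minutes):
    """Return how far a measured score is from the expected one, in percent.

    Positive values mean the reproduction was slower than expected.
    """
    expected = EXPECTED_MINUTES[(benchmark, platform)]
    return (measured_minutes - expected) / expected * 100.0


def within_tolerance(benchmark, platform, measured_minutes, tolerance_pct=5.0):
    """Check a measured score against an assumed tolerance (5% by default)."""
    deviation = percent_deviation(benchmark, platform, measured_minutes)
    return abs(deviation) <= tolerance_pct
```

For example, `within_tolerance("llama2_70b_lora", "MI355X", 8.5)` compares a hypothetical 8.5-minute result against the expected 8.27 minutes. The acceptable tolerance depends on your own criteria; the source does not specify one.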
Key takeaway
For AI engineers evaluating AMD Instinct GPUs for large-scale model training, this guide offers a clear path to validating AMD's MLPerf Training v6.0 performance claims. Follow the detailed environment setup and dataset preparation steps to reproduce the Llama 2 70B LoRA, Llama 3.1 8B, or Flux.1-schnell benchmarks. Your team can use the Primus framework to streamline LLM workflows, and can keep result validation MLPerf-compliant by averaging 8 of 10 runs.
Key insights
AMD provides a detailed guide to reproduce its MLPerf Training v6.0 benchmark results on Instinct GPUs using the Primus framework.
Principles
- MLPerf results require rigorous, multi-run validation.
- Unified frameworks simplify large-scale model training.
- Hardware-specific configurations optimize performance.
Method
The reproduction method involves setting up a Docker environment with ROCm 7.2.2+, preparing specific datasets, configuring platform-specific hyperparameters, and executing training runs, followed by MLPerf-compliant result validation.
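Before starting the steps above, the stated prerequisites (ROCm 7.2.2 or later and enough free disk space) can be checked up front. The sketch below is an illustrative pre-flight check rather than anything taken from AMD's guide. It assumes the installed ROCm version is supplied as a plain string such as "7.2.2" and that "TB" means decimal terabytes; the function names are hypothetical.

```python
import shutil

# Minimum ROCm version stated in the summary.
MIN_ROCM = (7, 2, 2)


def parse_version(text):
    """Turn a string like '7.2.2' into (7, 2, 2).

    Anything after a '-' suffix is ignored, and short versions such as
    '7.2' are padded with zeros to three components.
    """
    core = text.strip().split("-")[0]
    parts = tuple(int(part) for part in core.split("."))
    return parts + (0,) * (3 - len(parts))


def rocm_is_supported(version_text, minimum=MIN_ROCM):
    """Return True if the given ROCm version meets the minimum."""
    return parse_version(version_text) >= minimum


def has_free_space(path, required_tb):
    """Return True if `path` has at least `required_tb` decimal TB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_tb * 10**12
```

For Flux.1-schnell, `has_free_space("/data", 6)` would check the 6 TB figure mentioned in the summary, where `/data` stands in for whatever dataset directory you use.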
In practice
- Use primus-cli for LLM pretraining/fine-tuning.
- Download pre-tokenized datasets for efficiency.
- Average 8 of 10 runs for MLPerf scores.
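The "8 of 10 runs" averaging can be sketched in a few lines. The summary does not say which two runs are excluded; this example assumes they are the fastest and the slowest, which is the usual MLPerf Training convention. The function name is hypothetical.

```python
def mlperf_score(run_minutes):
    """Average 8 of 10 time-to-train results, in minutes.

    Assumption: the two excluded runs are the fastest and the slowest.
    """
    if len(run_minutes) != 10:
        raise ValueError("expected exactly 10 runs")
    ordered = sorted(run_minutes)
    kept = ordered[1:-1]  # drop the fastest and the slowest run
    return sum(kept) / len(kept)
```

Dropping the extremes keeps a single outlier run, such as one slowed by a transient node issue, from skewing the reported score.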
Topics
- MLPerf Training v6.0
- AMD Instinct GPUs
- Primus Training Framework
- LLM Training
- Text-to-Image Models
- Benchmark Reproduction
Best for: Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.