AMD Instinct™ GPUs MLPerf Inference v6.0 Submission
Summary
AMD announced its MLPerf Inference v6.0 benchmark results on April 1, 2026, showcasing the performance of its Instinct MI355X GPU and the ROCm software stack. The MI355X, an Instinct GPU launched in 2025 on the CDNA 4 architecture, demonstrated improved performance for standard and pruned MLPerf models in the Open category. Key achievements included FP4-optimized large language models, first-ever results for the new gpt-oss-120b and Wan2.2 benchmarks, and distributed inference on up to 12 nodes for Llama 2 70B and gpt-oss-120b, exceeding 1 million tokens per second in multi-node inference. The MI355X offers 20 petaflops of FP4 performance, 288 GB of HBM3E memory, and 8 TB/s of memory bandwidth, and is liquid-cooled. AMD's results show competitive performance against NVIDIA B200 and B300 GPUs, particularly in server and offline modes, and strong scaling efficiency in multi-node configurations.
Key takeaway
For AI engineers and MLOps teams evaluating GPU platforms for large-scale generative AI inference, AMD's MI355X with ROCm delivers competitive performance, particularly for LLMs such as Llama 2 70B and gpt-oss-120b. Consider the MI355X for deployments that need high throughput, efficient multi-node scaling, and advanced quantization support, especially where interactive or server-mode performance is critical. Review AMD's optimization guides for system-level tuning to maximize inference throughput.
Key insights
AMD's MI355X GPU and ROCm stack deliver competitive MLPerf Inference v6.0 results, especially for LLMs and multi-node setups.
Principles
- FP4 quantization significantly boosts LLM inference performance.
- Automated kernel optimization (GEAK) improves GPU code efficiency.
- Multi-node inference scales effectively with optimized software stacks.
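As a rough illustration of what "scales effectively" means, scaling efficiency is commonly computed as the measured multi-node throughput divided by the ideal linear throughput (single-node throughput times node count). The sketch below assumes that definition; the function name and the throughput figures in the usage lines are hypothetical and are not taken from AMD's submission.

```python
def scaling_efficiency(single_node_tps: float, multi_node_tps: float, num_nodes: int) -> float:
    """Ratio of measured multi-node throughput to ideal linear scaling.

    1.0 means perfectly linear scaling; lower values indicate overhead
    from communication, load imbalance, or scheduling.
    """
    if single_node_tps <= 0:
        raise ValueError("single_node_tps must be positive")
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    ideal_tps = single_node_tps * num_nodes
    return multi_node_tps / ideal_tps


# Hypothetical numbers, for illustration only (not MLPerf results).
efficiency = scaling_efficiency(
    single_node_tps=100_000.0,
    multi_node_tps=1_080_000.0,
    num_nodes=12,
)
print(f"scaling efficiency: {efficiency:.0%}")
```

When comparing platforms, the same calculation applied to each vendor's single-node and multi-node submissions gives a like-for-like view of how much throughput is lost to scaling overhead.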
Method
Optimizations included unified FP8 attention kernels, GEAK for automated kernel tuning, MoE GEMM kernel enhancements, AMD Quark for MXFP4/FP8 quantization, and vLLM scheduler tuning for efficient request dispatch and KV-cache awareness.
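To make the MXFP4 idea concrete, here is a minimal pure-Python sketch of block-scaled FP4 quantization. It is not AMD Quark's implementation or API. It follows the general scheme of the OCP Microscaling (MX) formats as understood here: values are grouped into blocks of 32, each block shares one power-of-two scale, and each element is stored as a 4-bit E2M1 number. All function and constant names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # exponent of the largest E2M1 power of two (4.0 == 2**2)
BLOCK_SIZE = 32    # elements that share one scale (assumed MX block size)


def quantize_block(block):
    """Return (scale, codes) such that each input x is approximately scale * code."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest value lands near the top of the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        # Scale down, clamp to the largest E2M1 magnitude, round to the nearest representable value.
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a shared scale and E2M1 codes."""
    return [scale * c for c in codes]


def quantize(values):
    """Split a flat list into BLOCK_SIZE chunks and quantize each chunk independently."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]
```

Because each block carries its own scale, an outlier in one block does not degrade the resolution of the others, which is the main reason block-scaled formats tolerate 4-bit elements better than a single per-tensor scale would.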
In practice
- Utilize FP4/FP8 precision for large-scale generative AI models.
- Implement multi-path routing strategies for mixed inference workloads.
- Optimize BIOS and OS settings for maximum AI workload performance.
Topics
- MLPerf Inference v6.0
- AMD Instinct MI355X GPU
- ROCm Software Stack
- Large Language Model Inference
- GPU Performance Optimization
Best for: MLOps Engineer, AI Engineer, NLP Engineer, Machine Learning Engineer, AI Architect, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.