AMD Instinct™ GPUs MLPerf Inference v6.0 Submission

2026-04-01 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, long

Summary

AMD announced its MLPerf Inference v6.0 benchmark results on April 1, 2026, showcasing the performance of its Instinct MI355X GPU and the ROCm software stack. The MI355X, launched in 2025 on the CDNA 4 architecture, demonstrated improved performance for standard and pruned MLPerf models in the Open category. Key achievements included FP4-optimized large language models, first-ever results for the new gpt-oss-120b and Wan2.2 benchmarks, and distributed inference on up to 12 nodes for Llama 2 70B and gpt-oss-120b, with multi-node throughput exceeding 1 million tokens per second. The MI355X is liquid-cooled and offers up to 20 petaflops of FP4 compute, 288 GB of HBM3E memory, and 8 TB/s of memory bandwidth. AMD's results show competitive performance against NVIDIA B200 and B300 GPUs, particularly in the Server and Offline scenarios, and strong scaling efficiency in multi-node configurations.
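
Scaling efficiency is usually read as measured multi-node throughput divided by ideal linear scaling from one node. The sketch below shows that arithmetic only; the function name and the throughput figures are placeholders chosen for illustration and are not AMD's published results.

```python
def scaling_efficiency(single_node_tps: float, multi_node_tps: float, nodes: int) -> float:
    """Return measured throughput as a fraction of ideal linear scaling."""
    if nodes < 1 or single_node_tps <= 0:
        raise ValueError("nodes must be >= 1 and single_node_tps must be positive")
    ideal_tps = nodes * single_node_tps
    return multi_node_tps / ideal_tps


if __name__ == "__main__":
    # Placeholder numbers, NOT benchmark results: 100k tokens/s on one node,
    # 1.08M tokens/s on 12 nodes.
    eff = scaling_efficiency(100_000, 1_080_000, 12)
    print(f"scaling efficiency: {eff:.0%}")
```

A value near 1.0 means the extra nodes are contributing close to their full single-node throughput.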

Key takeaway

For AI engineers and MLOps teams evaluating GPU platforms for large-scale generative AI inference, AMD's MI355X with ROCm demonstrates competitive performance, particularly for LLMs such as Llama 2 70B and gpt-oss-120b. Consider the MI355X for deployments that need high throughput, efficient multi-node scaling, and FP4/FP8 quantization support, especially where interactive or server-mode performance is critical. Review AMD's optimization guides for system-level tuning to maximize inference throughput.

Key insights

AMD's MI355X GPU and ROCm stack deliver competitive MLPerf Inference v6.0 results, especially for LLMs and multi-node setups.

Principles

FP4 quantization significantly boosts LLM inference performance.
Automated kernel optimization (GEAK) improves GPU code efficiency.
Multi-node inference scales effectively with optimized software stacks.
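
To make the FP4 principle concrete, here is a minimal pure-Python sketch of microscaling-style (MXFP4-like) block quantization: each block of values shares one power-of-two scale, and each element is rounded to the nearest FP4 E2M1 magnitude. It is an illustration under stated assumptions, not AMD Quark's implementation; the function names are hypothetical, the rounding is simple nearest-value, and the caller is assumed to split tensors into blocks (32 elements per block in the MX formats).

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign bit handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2  # exponent of the largest power of two in E2M1 (4.0 == 2**2)


def quantize_block(values):
    """Quantize one block to (shared_exponent, list of E2M1-valued codes)."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Shared power-of-two scale chosen so the block maximum lands near the top
    # of the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for v in values:
        # Scale, clamp to the largest representable magnitude, round to nearest.
        mag = min(abs(v) / scale, E2M1_MAGNITUDES[-1])
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - mag))
        codes.append(math.copysign(nearest, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate values from the shared exponent and codes."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


if __name__ == "__main__":
    exp, codes = quantize_block([1.0, -3.0, 0.4, 6.0])
    print(exp, codes, dequantize_block(exp, codes))
```

With 4 bits per element plus one 8-bit scale per 32 elements, storage works out to about 4.25 bits per value, which is where the memory and bandwidth savings over FP8 or FP16 weights come from.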

Method

Optimizations included unified FP8 attention kernels, GEAK for automated kernel tuning, MoE GEMM kernel enhancements, AMD Quark for MXFP4/FP8 quantization, and vLLM scheduler tuning for efficient request dispatch and KV-cache awareness.
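
As a rough illustration of what KV-cache-aware request dispatch can mean, the sketch below sends each request to the replica with the most free KV-cache blocks that can still hold the request's full token budget. This is not vLLM's scheduler or its API; `Replica`, `blocks_needed`, `dispatch`, and the 16-token block size are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    free_kv_blocks: int  # KV-cache blocks currently unallocated on this replica


def blocks_needed(prompt_tokens: int, max_new_tokens: int, block_size: int = 16) -> int:
    """Blocks required to hold the prompt plus the full generation budget."""
    total_tokens = prompt_tokens + max_new_tokens
    return (total_tokens + block_size - 1) // block_size  # ceiling division


def dispatch(replicas: List[Replica], prompt_tokens: int,
             max_new_tokens: int, block_size: int = 16) -> Optional[str]:
    """Pick the replica with the most free KV blocks that fits the request.

    Returns the replica name, or None if no replica has enough room
    (the caller would then queue or reject the request).
    """
    need = blocks_needed(prompt_tokens, max_new_tokens, block_size)
    candidates = [r for r in replicas if r.free_kv_blocks >= need]
    if not candidates:
        return None
    target = max(candidates, key=lambda r: r.free_kv_blocks)
    target.free_kv_blocks -= need  # reserve the blocks for this request
    return target.name


if __name__ == "__main__":
    pool = [Replica("gpu-node-a", 10), Replica("gpu-node-b", 100)]
    print(dispatch(pool, prompt_tokens=100, max_new_tokens=28))
```

Reserving the whole generation budget up front is a conservative choice that avoids mid-request preemption; a production scheduler may instead allocate blocks incrementally.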

In practice

Utilize FP4/FP8 precision for large-scale generative AI models.
Implement multi-path routing strategies for mixed inference workloads.
Optimize BIOS and OS settings for maximum AI workload performance.
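
One OS-level setting that GPU-server tuning guidance commonly covers is Linux automatic NUMA balancing, exposed at `/proc/sys/kernel/numa_balancing`. The sketch below only reads and reports that setting; the helper name is hypothetical, and whether to disable it for a given deployment should follow AMD's own optimization guide rather than this example.

```python
from pathlib import Path
from typing import Optional


def numa_balancing_enabled(path: str = "/proc/sys/kernel/numa_balancing") -> Optional[bool]:
    """Report whether automatic NUMA balancing is on.

    Returns True for any non-zero mode, False when the value is 0, and None
    when the setting cannot be read (for example on a non-Linux host).
    """
    try:
        text = Path(path).read_text().strip()
    except OSError:
        return None
    if not text:
        return None
    return text != "0"


if __name__ == "__main__":
    state = numa_balancing_enabled()
    if state is None:
        print("numa_balancing: not available on this system")
    else:
        print(f"numa_balancing enabled: {state}")
```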

Topics

MLPerf Inference v6.0
AMD Instinct MI355X GPU
ROCm Software Stack
Large Language Model Inference
GPU Performance Optimization

Best for: MLOps Engineer, AI Engineer, NLP Engineer, Machine Learning Engineer, AI Architect, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.