Accelerating ComfyUI Workflows on AMD Instinct™ MI355X GPUs with ROCm

· Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, AI Hardware Acceleration · Depth: Advanced, long

Summary

AMD's blog post, dated May 11, 2026, describes how ROCm 7.2.0 accelerates ComfyUI generative AI workflows on AMD Instinct MI355X GPUs. The article benchmarks three production-relevant ComfyUI workflows against an NVIDIA B200 GPU: Wan2.2 (Text-to-Video, 1280x1280), FLUX.1-dev (Text-to-Image, 2560x2560), and Hunyuan3D v2.1 (Image-to-3D, 4096). The MI355X, built on the CDNA4 (gfx950) architecture, achieved up to 1.439x faster end-to-end execution. The article attributes this advantage to PyTorch Attention optimizations for gfx950, including AOTriton kernel support, occupancy tuning, pipelining, hipBLASLt GEMM tuning, and ThinLTO compiler optimizations. All of these are integrated into ROCm 7.2.0 and available via a pre-built Docker image.

Key takeaway

For AI engineers evaluating hardware for generative AI workloads, the AMD Instinct MI355X with ROCm 7.2.0 ran the benchmarked ComfyUI workflows up to 1.439x faster end to end than the NVIDIA B200. The pre-built Docker image is the quickest way to pick up the optimized PyTorch Attention and GEMM kernels, which may reduce inference times for compute-intensive diffusion models.
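To put the headline figure in terms of time saved, the short sketch below converts a speedup factor into a fractional reduction in wall-clock time. It assumes that "1.439x faster" means the ratio of B200 time to MI355X time, which is an interpretation rather than something this summary states. The function name is hypothetical.

```python
def time_reduction_from_speedup(speedup: float) -> float:
    """Fraction of the baseline wall-clock time saved at a given speedup.

    Assumption: speedup = baseline_time / candidate_time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# 1.439 is the best-case ("up to") figure quoted in this summary.
best_case_reduction = time_reduction_from_speedup(1.439)
print(f"best-case time reduction: {best_case_reduction:.1%}")
```

Under that assumption, the best case corresponds to roughly 30.5% less end-to-end time. Because the figure is an upper bound, other workflows may see a smaller reduction.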

Key insights

AMD Instinct MI355X GPUs ran the benchmarked ComfyUI workflows up to 1.439x faster end to end than the NVIDIA B200.

Method

The benchmark ran three attention-heavy ComfyUI workflows on an AMD Instinct MI355X (ROCm 7.2.0) and an NVIDIA B200 (CUDA 12.x), using default parameters and measuring end-to-end execution time.
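A minimal sketch of how an end-to-end timing comparison of this kind could be structured is shown below. It is not the harness used in the original article. The repeat count, the use of the median, and the `placeholder_workflow` function are all assumptions made for illustration, and a real measurement would call the actual ComfyUI workflow instead.

```python
import statistics
import time
from typing import Callable


def time_end_to_end(run_workflow: Callable[[], object], repeats: int = 3) -> float:
    """Return the median wall-clock seconds over `repeats` complete runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_workflow()  # one full workflow execution, start to finish
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Ratio of baseline time to candidate time (above 1.0: candidate is faster)."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


def placeholder_workflow() -> None:
    # Hypothetical stand-in for a real workflow run; it only burns a little CPU.
    sum(i * i for i in range(10_000))


seconds = time_end_to_end(placeholder_workflow)
print(f"median end-to-end time: {seconds:.6f} s")
```

Each platform would be timed separately with the same workflow and parameters, and the two medians passed to `speedup`. When the timed function launches GPU work asynchronously, it has to wait for that work to finish before returning, otherwise the timer stops too early.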

Best for: AI Hardware Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.