QuickReduce FP4 Quantization and Benchmarking on MI355

2026-05-20 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, long

Summary

AMD's QuickReduce library, a high-performance all-reduce solution for ROCm, now supports FP4 quantization and has been benchmarked on the MI355 platform. The library, which previously ran up to 2.25x faster than RCCL on MI300X configurations, uses MI355's native FP4 assembly instructions to accelerate quantization and dequantization. In benchmarks against RCCL and Custom AllReduce (CR) across message sizes from 4 KB to 1 GB, QuickReduce delivered its largest speedups on large messages, particularly with FP4 and INT4 quantization, reaching up to 4.14x over RCCL at TP=2 for 1 GB messages. CR is the faster option for small messages (below ~512 KB), while QuickReduce is consistently faster once the message volume exceeds ~1 MB. End-to-end evaluations of Qwen3-30B-A3B-Instruct-2507 and DeepSeek-R1-0528 on vLLM showed FP4 and INT4 quantization reducing Time To First Token (TTFT) and Time Per Output Token (TPOT) with minimal accuracy impact; for example, DeepSeek-R1 at TP=4 saw a 1.758x TTFT speedup.

Key takeaway

AI engineers optimizing LLM inference on AMD MI355 platforms should prioritize QuickReduce with FP4 or INT4 quantization for multi-GPU setups. This approach reduces communication latency for large tensor-parallel messages, improving TTFT and TPOT without notable accuracy loss. Make sure the inference framework (for example, vLLM) is configured to use QuickReduce, especially for prefill phases or large batch sizes, where communication volume is high.

Key insights

QuickReduce with FP4/INT4 quantization on MI355 significantly accelerates large-message all-reduce operations for LLM inference.

Principles

Inline compression in all-reduce reduces communication latency.
Dedicated hardware instructions accelerate low-bit quantization.
Optimal all-reduce strategy depends on message size and GPU count.
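
To make the first principle concrete, the following sketch estimates how many bytes a block-quantized payload occupies compared with plain FP16. The layout is an assumption made for illustration only (4-bit values packed two per byte, plus one FP16 scale per group of 32 elements); the summary does not state QuickReduce's actual group size or wire format, and the function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def fp16_bytes(num_elements: int) -> int:
    """Uncompressed FP16 payload: 2 bytes per element."""
    return 2 * num_elements


def quantized_bytes(num_elements: int, bits_per_value: int = 4,
                    group_size: int = 32, scale_bytes: int = 2) -> int:
    """Payload size for an ASSUMED block-quantized layout.

    Packed low-bit values plus one scale per group. group_size=32 and a
    2-byte (FP16) scale are illustrative assumptions, not QuickReduce's
    documented format.
    """
    packed = ceil_div(num_elements * bits_per_value, 8)
    scales = ceil_div(num_elements, group_size) * scale_bytes
    return packed + scales


def compression_ratio(num_elements: int) -> float:
    """How many times smaller the quantized payload is than FP16."""
    return fp16_bytes(num_elements) / quantized_bytes(num_elements)
```

Under these assumptions, each group of 32 elements shrinks from 64 bytes to 18 bytes, so the volume sent over the interconnect drops by roughly 3.6x. The end-to-end speedup also depends on quantization cost, link bandwidth, and GPU count, so this ratio is not a latency prediction.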

Method

QuickReduce quantizes FP16 data to FP4 using MI355 native instructions, with the scale computed in FP16, and dequantizes back to FP16 on the receive path.
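
The quantize/dequantize round trip described above can be emulated in NumPy to see what the numerics look like. This is a minimal reference sketch, not the QuickReduce kernel: it assumes the common E2M1 FP4 encoding (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), an absmax scale per group of 32 elements, and nearest-level rounding. The group size, rounding rule, and the function names `quantize_fp4` and `dequantize_fp4` are illustrative assumptions; the MI355 hardware instructions may differ in all three.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
                           dtype=np.float32)
FP4_MAX = 6.0


def quantize_fp4(x, group_size=32):
    """Quantize FP16 data to 4-bit codes with one FP16 scale per group.

    Assumes len(x) is a multiple of group_size (an illustrative choice).
    Returns (codes, scale): codes is uint8 with the sign in bit 3 and the
    level index in bits 0-2; scale has shape (num_groups, 1), dtype float16.
    """
    groups = np.asarray(x, dtype=np.float16).astype(np.float32)
    groups = groups.reshape(-1, group_size)
    absmax = np.abs(groups).max(axis=1, keepdims=True)
    scale = (absmax / FP4_MAX).astype(np.float16)      # scale kept in FP16
    scale_f32 = scale.astype(np.float32)
    safe = np.where(scale_f32 == 0, 1.0, scale_f32)    # avoid divide-by-zero
    scaled = groups / safe
    # Pick the nearest representable magnitude (ties go to the smaller one).
    dist = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_LEVELS)
    idx = dist.argmin(axis=-1).astype(np.uint8)
    sign = np.signbit(scaled).astype(np.uint8)
    codes = idx | (sign << 3)
    return codes, scale


def dequantize_fp4(codes, scale):
    """Map 4-bit codes back to FP16 using the per-group scale."""
    magnitude = FP4_E2M1_LEVELS[codes & 0x7]
    sign = np.where((codes & 0x8) != 0, -1.0, 1.0)
    out = sign * magnitude * scale.astype(np.float32)
    return out.astype(np.float16).reshape(-1)
```

A quick way to use the sketch is to round-trip a tensor and inspect the error, for example `codes, scale = quantize_fp4(x)` followed by `x_hat = dequantize_fp4(codes, scale)`. Because the widest gap between E2M1 levels is 2 (between 4 and 6), the per-element error in this emulation stays at about one scale unit or less.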

In practice

Use QuickReduce for large-message all-reduce on AMD MI355.
Configure inference frameworks to use FP4 or INT4 quantization.
Implement adaptive all-reduce based on message volume.
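
The last recommendation can be sketched as a size-based dispatcher. The thresholds below come from the approximate crossover points in the summary (CR faster below ~512 KB, QuickReduce faster above ~1 MB); the summary does not say which backend wins in between, so the mid-range default is a placeholder to be set from measurements. `Backend` and `pick_backend` are hypothetical names, not part of vLLM or QuickReduce.

```python
from enum import Enum


class Backend(Enum):
    CUSTOM_ALLREDUCE = "custom_allreduce"
    QUICKREDUCE = "quickreduce"


# Approximate crossover points taken from the summary; the real values
# depend on TP degree, dtype, and hardware, and should be benchmarked.
SMALL_MESSAGE_MAX_BYTES = 512 * 1024
LARGE_MESSAGE_MIN_BYTES = 1024 * 1024


def pick_backend(message_bytes: int,
                 mid_range_default: Backend = Backend.CUSTOM_ALLREDUCE) -> Backend:
    """Choose an all-reduce backend from the message size in bytes."""
    if message_bytes < 0:
        raise ValueError("message_bytes must be non-negative")
    if message_bytes < SMALL_MESSAGE_MAX_BYTES:
        return Backend.CUSTOM_ALLREDUCE
    if message_bytes > LARGE_MESSAGE_MIN_BYTES:
        return Backend.QUICKREDUCE
    # Between the two thresholds the summary gives no clear winner.
    return mid_range_default
```

For instance, `pick_backend(4 * 1024)` selects Custom AllReduce and `pick_backend(1 << 30)` selects QuickReduce. A production dispatcher would likely also key on GPU count, since the summary notes that the optimal strategy depends on both message size and GPU count.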

Topics

QuickReduce
FP4 Quantization
AMD MI355
LLM Inference
All-Reduce
Tensor Parallelism
vLLM

Best for: MLOps Engineer, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.