QuickReduce FP4 Quantization and Benchmarking on MI355
Summary
AMD's QuickReduce library, a high-performance all-reduce solution for ROCm, now supports FP4 quantization and has been benchmarked on the MI355 platform. The library, previously shown to run up to 2.25x faster than RCCL on MI300X configurations, uses MI355's native FP4 assembly instructions to accelerate quantization and dequantization. In benchmarks against RCCL and Custom AllReduce (CR) across message sizes from 4 KB to 1 GB, QuickReduce with FP4 or INT4 quantization delivered its largest speedups on large messages, reaching up to 4.14x over RCCL at TP=2 for 1 GB messages. CR is faster at small message sizes (below ~512 KB), while QuickReduce is consistently faster once message volume exceeds ~1 MB. End-to-end evaluations of Qwen3-30B-A3B-Instruct-2507 and DeepSeek-R1-0528 on vLLM showed FP4 and INT4 quantization reducing Time To First Token (TTFT) and Time Per Output Token (TPOT) with minimal accuracy impact, including a 1.758x TTFT speedup for DeepSeek-R1 at TP=4.
Key takeaway
AI engineers optimizing LLM inference on AMD MI355 platforms should prioritize QuickReduce with FP4 or INT4 quantization for multi-GPU setups. This approach reduces communication latency for large tensor-parallel messages, improving Time To First Token (TTFT) and Time Per Output Token (TPOT) without notable accuracy loss. Make sure your inference framework, such as vLLM, is configured to use QuickReduce, especially for prefill phases or large batch sizes, where communication volume is high.
Key insights
QuickReduce with FP4/INT4 quantization on MI355 significantly accelerates large-message all-reduce operations for LLM inference.
Principles
- Inline compression in all-reduce reduces communication latency.
- Dedicated hardware instructions accelerate low-bit quantization.
- Optimal all-reduce strategy depends on message size and GPU count.
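The first principle can be made concrete with a rough payload calculation. The sketch below estimates how many bytes an FP16 message occupies once its values are stored in 4 bits with a shared scale per group. The group size of 32 and the one-FP16-scale-per-group layout are illustrative assumptions, not QuickReduce's documented wire format, and `payload_bytes` is a hypothetical helper.

```python
def payload_bytes(num_elements, bits_per_value, group_size=32, scale_bits=16):
    """Estimate bytes on the wire for group-quantized data.

    Assumed layout (hypothetical): each group of `group_size` values
    shares one scale of `scale_bits` bits.
    """
    value_bytes = num_elements * bits_per_value // 8
    num_groups = -(-num_elements // group_size)  # ceiling division
    scale_bytes = num_groups * scale_bits // 8
    return value_bytes + scale_bytes


# A 1 GB message of FP16 values holds 2**29 elements.
elements = 2**30 // 2
fp16_bytes = elements * 2
fp4_bytes = payload_bytes(elements, bits_per_value=4)
ratio = fp16_bytes / fp4_bytes

print(f"FP16: {fp16_bytes} bytes, FP4 + scales: {fp4_bytes} bytes")
print(f"volume reduction: {ratio:.3f}x")
```

Under these assumptions the payload shrinks by 32/9, roughly 3.56x. This is a ratio of bytes moved, not a latency prediction; measured speedups also depend on quantization cost, GPU count, and interconnect behavior.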
Method
QuickReduce quantizes FP16 data to FP4 using MI355 native instructions, computing the scale in FP16, and dequantizes back to FP16 on the receive path.
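To illustrate the quantize and dequantize round trip described above, here is a minimal NumPy simulation. It is a sketch under stated assumptions, not the QuickReduce kernel: it assumes the common E2M1 FP4 value grid, a group size of 32, a scale chosen so the group maximum maps to 6.0, and one 4-bit code stored per byte for readability. The hardware path uses native MI355 instructions and may group, round, and pack differently.

```python
import numpy as np

# Magnitudes representable in E2M1 FP4 (assumed format for this sketch).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def quantize_fp4(x, group_size=32):
    """Return (codes, scales): 4-bit codes (one per uint8) and FP16 scales."""
    x = np.asarray(x, dtype=np.float16).reshape(-1, group_size)
    max_abs = np.abs(x).max(axis=1, keepdims=True).astype(np.float32)
    # Scale is kept in FP16, as the source describes.
    scale = (max_abs / FP4_GRID[-1]).astype(np.float16)
    # Avoid dividing by zero for all-zero groups.
    safe = np.where(scale == 0, np.float16(1.0), scale).astype(np.float32)
    scaled = x.astype(np.float32) / safe
    # Nearest grid magnitude; exact ties go to the smaller magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    sign = (scaled < 0).astype(np.uint8)
    codes = (sign << 3) | idx.astype(np.uint8)  # bit 3 = sign, bits 0-2 = index
    return codes, scale


def dequantize_fp4(codes, scale):
    """Map codes back to FP16 values using the per-group scale."""
    magnitude = FP4_GRID[codes & 0x7]
    sign = np.where((codes & 0x8) != 0, -1.0, 1.0)
    return (sign * magnitude * scale.astype(np.float32)).astype(np.float16)


values = np.array([6.0, -3.0, 0.5, 0.0], dtype=np.float16)
codes, scale = quantize_fp4(values, group_size=4)
restored = dequantize_fp4(codes, scale)
print(codes, scale, restored)
```

Values that already sit on the scaled grid, as in this example, survive the round trip exactly; other values are rounded to the nearest grid point, which is where the quantization error comes from.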
In practice
- Use QuickReduce for large-message all-reduce on AMD MI355.
- Configure inference frameworks to use FP4 or INT4 quantization.
- Implement adaptive all-reduce based on message volume.
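The last recommendation can be sketched as a simple size-based dispatcher. The thresholds come from the benchmark summary (CR ahead below ~512 KB, QuickReduce ahead above ~1 MB); the function name, the backend labels, and the handling of the band in between are hypothetical and are not identifiers from vLLM or QuickReduce. Thresholds should be re-measured for a given GPU count and topology.

```python
KB = 1024
MB = 1024 * KB


def choose_allreduce(message_bytes,
                     small_max=512 * KB,
                     large_min=1 * MB,
                     mid_backend="custom_allreduce"):
    """Pick an all-reduce backend label from the message size.

    The labels are placeholders for whatever the framework exposes.
    The band between `small_max` and `large_min` is not characterized
    in the summary, so its backend is left as a tunable parameter.
    """
    if message_bytes < small_max:
        return "custom_allreduce"
    if message_bytes > large_min:
        return "quickreduce"
    return mid_backend


for size in (4 * KB, 768 * KB, 64 * MB, 1024 * MB):
    print(f"{size:>12d} bytes -> {choose_allreduce(size)}")
```

Keeping the thresholds as parameters makes it straightforward to tune the crossover point from benchmark runs instead of hard-coding the approximate values reported here.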
Topics
- QuickReduce
- FP4 Quantization
- AMD MI355
- LLM Inference
- All-Reduce
- Tensor Parallelism
- vLLM
Best for: MLOps Engineer, NLP Engineer, Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.