Advanced MXFP4 Quantization: Combining Fine-Tuned Rotations with SmoothQuant for Near-Lossless Compression

2026-02-17 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, long

Summary

This article, published on February 17, 2026, describes advanced MXFP4 quantization techniques for language models that combine fine-tuned online rotations with SmoothQuant input-channel scaling to achieve near-lossless compression. The approach targets the accuracy degradation seen in small to mid-sized models (8B to 32B parameters) when they are aggressively quantized to MXFP4, an effective 4.25-bit format supported by AMD Instinct MI350X and MI355X accelerators. The method recovers 45-55% of the accuracy drop on n-shot tasks for Qwen3-8B and Qwen3-14B, allowing both models to retain over 98% of their original BF16 accuracy. It adds a lightweight online operation at inference time, but the block-diagonal structure of the rotations keeps the overhead small, especially for larger models. The techniques also apply to other low-precision data types such as INT4 and MXFP6, and the code is available in AMD Quark 0.11.
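The 4.25-bit figure can be reproduced with simple arithmetic. The sketch below assumes the usual microscaling layout of 32 elements sharing one 8-bit scale, which the summary itself does not spell out; the helper name is hypothetical.

```python
def mx_bits_per_element(element_bits: int, scale_bits: int = 8, block_size: int = 32) -> float:
    """Average storage cost per element for a microscaling (MX) format.

    Assumption: each block of `block_size` elements shares a single
    `scale_bits`-wide scale, so the scale cost is amortized over the block.
    """
    return element_bits + scale_bits / block_size


mxfp4_bits = mx_bits_per_element(4)  # 4-bit elements plus amortized scale
mxfp6_bits = mx_bits_per_element(6)  # 6-bit elements plus amortized scale
print(mxfp4_bits, mxfp6_bits)
```

Under these assumptions the shared scale adds a quarter of a bit per element on top of the 4-bit payload.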

Key takeaway

For NLP engineers and AI scientists deploying language models on AMD Instinct MI350X and MI355X accelerators, combining fine-tuned online rotations with SmoothQuant scaling can substantially reduce the accuracy loss caused by MXFP4 quantization. The approach enables near-lossless compression for small to mid-sized models, with Qwen3-8B and Qwen3-14B retaining over 98% of their BF16 accuracy. Explore the AMD Quark 0.11 release for the implementation, and consider integrating it into your inference pipelines to balance model size reduction against accuracy and runtime overhead.

Key insights

Combining fine-tuned online rotations and SmoothQuant scales significantly improves MXFP4 quantization accuracy for language models.

Principles

Outlier redistribution reduces quantization error.
Jointly optimizing transforms and scales enhances accuracy.
Block-diagonal rotations balance accuracy and runtime.
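The first principle can be illustrated with a fixed Hadamard rotation, used here as a stand-in for the fine-tuned rotations the article describes. This is a minimal sketch under assumptions: the block size of 32, the input values, and the helper names are illustrative and not taken from the article. It only shows the redistribution itself (a lower peak at unchanged energy); it does not measure MXFP4 error.

```python
import math


def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = 1.0 / math.sqrt(n)
    return [[v * norm for v in row] for row in h]


def rotate(x, r):
    """Compute y = x @ R for a row vector x."""
    n = len(x)
    return [sum(x[i] * r[i][j] for i in range(n)) for j in range(n)]


n = 32                      # illustrative block size
x = [0.1] * n
x[0] = 8.0                  # a single outlier channel
y = rotate(x, hadamard(n))

peak_before = max(abs(v) for v in x)
peak_after = max(abs(v) for v in y)
energy_before = sum(v * v for v in x)
energy_after = sum(v * v for v in y)

print(f"peak:   {peak_before:.3f} -> {peak_after:.3f}")
print(f"energy: {energy_before:.3f} -> {energy_after:.3f}")
```

Because the rotation is orthogonal, the vector's energy is preserved while the outlier's magnitude is spread across all channels in the block.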

Method

Apply block-diagonal orthogonal transforms (rotations) to activations online, fusing inverse transforms into weights offline. Combine with SmoothQuant channel rescaling, jointly learning both rotation and scaling parameters to minimize model output error.
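The fusion step described above can be checked numerically. The following is a minimal sketch, not the AMD Quark implementation: it assumes a linear layer y = xW, applies SmoothQuant scaling before the rotation (the ordering is an assumption), uses a fixed 4x4 Hadamard block in place of a learned rotation, and omits quantization entirely so that only the algebraic equivalence is shown. All names and values are hypothetical.

```python
import math


def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = 1.0 / math.sqrt(n)
    return [[v * norm for v in row] for row in h]


def linear(x, w):
    """Compute y = x @ W for a row vector x and a (d_in x d_out) matrix W."""
    return [sum(x[i] * w[i][k] for i in range(len(x))) for k in range(len(w[0]))]


def block_rotate(x, rb):
    """Online step: apply blockdiag(R_b, ..., R_b) to x one block at a time."""
    b = len(rb)
    out = []
    for start in range(0, len(x), b):
        chunk = x[start:start + b]
        out.extend(sum(chunk[i] * rb[i][j] for i in range(b)) for j in range(b))
    return out


def fuse_into_weight(w, s, rb):
    """Offline step: W' = R^T diag(s) W, so that ((x / s) R) W' equals x W."""
    b = len(rb)
    scaled = [[s[i] * v for v in row] for i, row in enumerate(w)]
    fused = []
    for start in range(0, len(w), b):
        for j in range(b):
            # Row j of R_b^T applied to the scaled weight rows of this block.
            fused.append([
                sum(rb[i][j] * scaled[start + i][k] for i in range(b))
                for k in range(len(w[0]))
            ])
    return fused


x = [4.0, -0.5, 0.25, 1.0, 0.1, 12.0, -0.3, 0.7]           # activations with outliers
w = [[((i * 3 + k * 5) % 7 - 3) / 4 for k in range(2)] for i in range(8)]
s = [2.0, 1.0, 0.5, 1.0, 1.0, 4.0, 1.0, 1.0]                # per-channel smoothing scales
rb = hadamard(4)                                            # one rotation block

x_online = block_rotate([v / si for v, si in zip(x, s)], rb)
w_fused = fuse_into_weight(w, s, rb)

y_ref = linear(x, w)
y_fused = linear(x_online, w_fused)
print(y_ref, y_fused)
```

In an actual pipeline, both `x_online` and `w_fused` would be quantized to MXFP4 before the matrix multiply, and the rotation and scales would be learned jointly to minimize output error rather than fixed as they are here. The block-diagonal form means the online cost grows with the block size rather than with the full hidden dimension.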

In practice

Use AMD Quark 0.11 for rotation training.
Integrate with vLLM for serving quantized models.
Investigate fused kernels for reduced overhead.

Topics

MXFP4 Quantization
Online Rotations
SmoothQuant
LLM Inference
AMD Instinct Accelerators

Best for: NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.