ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

2026-05-30 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ScaleSweep is a scale optimization method for NVFP4 post-training quantization of large language models (LLMs). NVFP4 is a hardware-supported 4-bit format that uses fine-grained block scales, and common ways of initializing those scales, such as AbsMax, are suboptimal. ScaleSweep instead sweeps over the feasible block scale candidates and selects the one that minimizes a target objective. A theoretical analysis derives lower and upper bounds on the sweep range under mean squared error (MSE) and weighted mean squared error (WMSE), which significantly reduces the search space and keeps the overhead negligible. In experiments on Llama and Qwen models, ScaleSweep consistently improves quantization performance and preserves over 93% of full-precision performance even under aggressive end-to-end quantization of weights, activations, KV cache, and query states.

Key takeaway

For machine learning engineers deploying LLMs with 4-bit quantization, particularly NVFP4, ScaleSweep improves quantization accuracy by optimizing block scales, which narrows the gap to full precision. Consider integrating it into your 4-bit quantization pipeline: it preserves over 93% of full-precision performance even under aggressive end-to-end quantization.

Key insights

ScaleSweep optimizes NVFP4 quantization by sweeping block scale candidates within derived theoretical bounds to minimize reconstruction error.

Principles

AbsMax initialization is suboptimal for NVFP4 quantization.
Optimizing block scales significantly improves 4-bit quantization fidelity.
Theoretical bounds can efficiently constrain the search space for optimal scales.
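
To make the AbsMax baseline concrete, here is a minimal sketch of AbsMax block quantization onto the 4-bit E2M1 value grid. It is an illustration under stated assumptions, not the paper's code: the helper names (absmax_scale, fake_quantize) are hypothetical, the scale is kept in full precision rather than stored in a low-precision format as a real NVFP4 pipeline would, and ties snap to the smaller magnitude.

```python
# Magnitudes representable by a 4-bit E2M1 floating-point element.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def absmax_scale(block):
    """AbsMax rule: map the largest magnitude in the block onto 6.0, the top FP4 value."""
    return max(abs(x) for x in block) / FP4_GRID[-1]


def fake_quantize(block, scale):
    """Divide by the scale, snap each magnitude to the nearest FP4 value, multiply back.

    Assumption: the scale stays in full precision here for simplicity.
    """
    if scale == 0.0:
        return [0.0 for _ in block]
    out = []
    for x in block:
        # Nearest grid magnitude; values above 6.0 are clipped to 6.0.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out


block = [6.0, 3.0, 0.4, -1.1]
scale = absmax_scale(block)          # 6.0 / 6.0 = 1.0
print(fake_quantize(block, scale))   # [6.0, 3.0, 0.5, -1.0]
```

Because AbsMax ties the scale to the single largest element, the remaining elements of the block land wherever the grid happens to fall, which is the degree of freedom a scale search can exploit.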

Method

ScaleSweep sweeps feasible block scale candidates, selecting the one that minimizes a target objective (MSE/WMSE), guided by theoretically derived lower and upper bounds for the sweep range.
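
The sweep described above can be sketched for a single block as follows. This is a hedged illustration rather than the authors' implementation: the function names, the linear candidate grid, and the lo/hi multipliers of the AbsMax scale are all hypothetical stand-ins, since this summary does not state the paper's derived bounds or its exact candidate set. The per-element weights used for WMSE are likewise an assumed input.

```python
# Magnitudes representable by a 4-bit E2M1 floating-point element.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize(block, scale):
    """Snap each scaled magnitude to the nearest FP4 value, then rescale."""
    if scale <= 0.0:
        return [0.0 for _ in block]
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out


def objective(block, scale, weights=None):
    """MSE, or weighted MSE when per-element weights are supplied."""
    if weights is None:
        weights = [1.0] * len(block)
    quantized = fake_quantize(block, scale)
    total = sum(w * (x - q) ** 2 for w, x, q in zip(weights, block, quantized))
    return total / sum(weights)


def sweep_scale(block, weights=None, lo=0.5, hi=1.5, steps=33):
    """Return (best_scale, best_error) over a grid of candidate block scales.

    Assumption: lo and hi are placeholder multipliers of the AbsMax scale,
    not the bounds derived in the paper.
    """
    base = max(abs(x) for x in block) / FP4_GRID[-1]
    if base == 0.0:
        return 0.0, 0.0
    # Always include the AbsMax scale so the result is never worse than the baseline.
    candidates = [base] + [
        base * (lo + (hi - lo) * i / (steps - 1)) for i in range(steps)
    ]
    scored = ((s, objective(block, s, weights)) for s in candidates)
    return min(scored, key=lambda pair: pair[1])


block = [0.1 * i - 0.8 for i in range(16)]
best_scale, best_err = sweep_scale(block)
print(best_scale, best_err)
```

Tighter bounds matter because every candidate costs one quantize-and-score pass per block; shrinking the range is what keeps the search overhead small.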

In practice

Apply ScaleSweep to Llama and Qwen models for improved 4-bit quantization.
Use NVFP4 with ScaleSweep for aggressive end-to-end LLM quantization of weights, activations, KV cache, and query states.

Topics

NVFP4 Quantization
Post-Training Quantization
LLM Compression
ScaleSweep
Block Scale Initialization
Llama Models
Qwen Models

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.