ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ScaleSweep is a scale optimization method for NVFP4 post-training quantization of large language models (LLMs). NVFP4 is a hardware-supported 4-bit format with fine-grained block scales, and prior ways of initializing those scales, such as AbsMax, are suboptimal. ScaleSweep instead sweeps over the feasible block scale candidates and selects the one that minimizes a target objective. A theoretical analysis derives lower and upper bounds on the sweep range under mean squared error (MSE) and weighted mean squared error (WMSE), which shrinks the search space and keeps the overhead negligible. In experiments on Llama and Qwen models, ScaleSweep consistently improves quantization performance and preserves over 93% of full-precision performance even when weights, activations, KV cache, and query states are all quantized end to end.

Key takeaway

For machine learning engineers deploying LLMs in 4-bit precision, particularly with NVFP4, ScaleSweep improves quantization accuracy by optimizing block scales instead of relying on an initialization such as AbsMax, which narrows the gap to full precision. It is worth evaluating for NVFP4 deployments: the reported results preserve over 93% of full-precision performance even under aggressive end-to-end quantization.

Key insights

ScaleSweep optimizes NVFP4 quantization by sweeping block scale candidates within derived theoretical bounds to minimize reconstruction error.
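
This summary does not spell out the objective, so the following is only one plausible way to write it. The notation is an assumption for illustration: a block x of B values, a candidate scale s drawn from a feasible set S, a round-to-nearest quantizer Q onto the FP4 grid, and nonnegative weights w_i whose choice is not specified here.

```latex
s^{\star} = \arg\min_{s \in \mathcal{S}} \; \sum_{i=1}^{B} w_i \left( x_i - s \, Q\!\left( \frac{x_i}{s} \right) \right)^{2}
```

Setting every w_i = 1 gives the MSE objective up to the constant factor 1/B. The AbsMax baseline corresponds to fixing s = max_i |x_i| / 6, where 6 is the largest magnitude representable in the FP4 (E2M1) element format.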

Principles

Method

ScaleSweep sweeps feasible block scale candidates, selecting the one that minimizes a target objective (MSE/WMSE), guided by theoretically derived lower and upper bounds for the sweep range.
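
The sweep described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the multiplier range `lo`/`hi`, and the number of steps are hypothetical stand-ins for the bounds derived in the paper, and the sketch keeps scales in ordinary floating point instead of modeling how NVFP4 stores block scales.

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1, the NVFP4 element format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0
BLOCK = 16  # NVFP4 groups elements into blocks of 16


def fake_quant(block, scale):
    """Quantize then dequantize one block with the given scale.

    Each scaled magnitude is mapped to the nearest grid value (ties go to the
    smaller value); magnitudes above 6 end up at 6.
    """
    if scale == 0.0:
        return np.zeros_like(block)
    scaled = np.abs(block) / scale
    idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_GRID[idx] * scale


def absmax_scale(block):
    """AbsMax baseline: map the largest magnitude in the block to FP4_MAX."""
    return np.abs(block).max() / FP4_MAX


def sweep_scale(block, weights=None, lo=0.5, hi=1.5, steps=41):
    """Return the candidate scale with the lowest (weighted) squared error.

    lo, hi and steps are hypothetical; they are NOT the paper's derived bounds.
    Passing weights=None gives the unweighted (MSE-style) objective.
    """
    w = np.ones_like(block) if weights is None else weights
    base = absmax_scale(block)
    best_scale = base
    best_err = np.sum(w * (block - fake_quant(block, base)) ** 2)
    for m in np.linspace(lo, hi, steps):
        s = base * m
        err = np.sum(w * (block - fake_quant(block, s)) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale, best_err
```

Because the search starts from the AbsMax scale and only accepts strictly better candidates, the returned error is never worse than the AbsMax baseline on the same block. A tighter sweep range, such as the one the paper derives, would reduce the number of candidates evaluated per block.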

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.