ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization
Summary
ScaleSweep is a scale optimization method for NVFP4 post-training quantization of Large Language Models (LLMs). NVFP4 is a hardware-supported 4-bit format that uses fine-grained block scales, and prior initialization methods such as AbsMax choose those scales suboptimally. ScaleSweep instead sweeps over the feasible block scale candidates and selects the one that minimizes a target objective. A theoretical analysis derives lower and upper bounds on the sweep range under Mean Squared Error (MSE) and Weighted Mean Squared Error (WMSE), which substantially reduces the search space at negligible overhead. In experiments on Llama and Qwen models, ScaleSweep consistently improves quantization performance and preserves over 93% of full-precision performance even under aggressive end-to-end quantization of weights, activations, KV cache, and query states.
Key takeaway
For Machine Learning Engineers deploying LLMs with 4-bit quantization, particularly NVFP4, ScaleSweep improves quantization accuracy by optimizing block scales instead of relying on AbsMax initialization, which narrows the gap to full precision. It is worth considering for 4-bit quantized LLMs where fidelity matters: the reported results preserve over 93% of full-precision performance even under aggressive end-to-end quantization.
Key insights
ScaleSweep optimizes NVFP4 quantization by sweeping block scale candidates within derived theoretical bounds to minimize reconstruction error.
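One way to write the per-block objective described above is given here as a sketch. The notation is assumed for illustration and is not taken from the paper: $x_i$ are the values in block $b$, $\mathcal{S}_b$ is the set of candidate scales inside the sweep range, $Q(\cdot)$ rounds to the nearest representable FP4 value, and $w_i$ are non-negative weights whose definition this summary does not specify. Setting $w_i = 1$ gives the plain MSE case.

```latex
s_b^{*} \;=\; \arg\min_{s \in \mathcal{S}_b} \; \sum_{i \in b} w_i \left( x_i - s \, Q\!\left( \frac{x_i}{s} \right) \right)^{2}
```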
Principles
- AbsMax initialization is suboptimal for NVFP4 quantization.
- Optimizing block scales significantly improves 4-bit quantization fidelity.
- Theoretical bounds can efficiently constrain the search space for optimal scales.
Method
ScaleSweep sweeps feasible block scale candidates, selecting the one that minimizes a target objective (MSE/WMSE), guided by theoretically derived lower and upper bounds for the sweep range.
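The sweep described above can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names, the linear candidate grid, and the `lo`/`hi` multipliers around the AbsMax scale are assumptions chosen for illustration, and they stand in for the theoretically derived bounds, which this summary does not state. The sketch also keeps scales as ordinary floats, whereas NVFP4 stores each block scale in FP8 (E4M3) alongside a per-tensor FP32 scale.

```python
import random

# Magnitudes representable in FP4 E2M1, the element format used by NVFP4.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def quantize_block(block, scale):
    """Quantize-dequantize a block: divide by scale, round to FP4, multiply back."""
    if scale == 0.0:
        return [0.0] * len(block)
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_MAX)  # clip to the FP4 range
        # Nearest grid point; ties go to the smaller magnitude.
        q = min(FP4_GRID, key=lambda g: abs(g - mag))
        out.append((q if x >= 0 else -q) * scale)
    return out


def mse(block, scale):
    """Mean squared reconstruction error of a block for a given scale."""
    deq = quantize_block(block, scale)
    return sum((a - b) ** 2 for a, b in zip(block, deq)) / len(block)


def absmax_scale(block):
    """AbsMax initialization: map the largest magnitude onto the FP4 maximum."""
    return max(abs(x) for x in block) / FP4_MAX


def sweep_scale(block, lo=0.5, hi=1.5, steps=41):
    """Pick the candidate scale with the lowest MSE.

    lo and hi are hypothetical multipliers of the AbsMax scale, not the
    bounds derived in the paper.
    """
    base = absmax_scale(block)
    if base == 0.0:
        return base
    candidates = [base * (lo + (hi - lo) * i / (steps - 1)) for i in range(steps)]
    candidates.append(base)  # AbsMax is always a candidate, so the sweep never does worse
    return min(candidates, key=lambda s: mse(block, s))


# Usage: compare AbsMax with the swept scale on one 16-element block.
random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(16)]
print("AbsMax MSE:", mse(block, absmax_scale(block)))
print("Swept MSE: ", mse(block, sweep_scale(block)))
```

Because the AbsMax scale is included in the candidate set, the swept scale can only match or lower the block's MSE. Swapping `mse` for a weighted error would give a WMSE variant of the same loop.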
In practice
- Apply ScaleSweep to Llama and Qwen models for improved 4-bit quantization.
- Utilize NVFP4 with ScaleSweep for aggressive end-to-end LLM quantization.
Topics
- NVFP4 Quantization
- Post-Training Quantization
- LLM Compression
- ScaleSweep
- Block Scale Initialization
- Llama Models
- Qwen Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.