MosaicQuant: Inlier-Outlier Disaggregation for Unified 4-Bit LLM Quantization
Summary
MosaicQuant introduces a unified 4-bit LLM quantization paradigm designed to overcome the accuracy degradation typically seen when 4-bit representations struggle with both common inlier values and rare large-magnitude outliers. Unlike existing mixed-precision methods that retain outliers in higher precision, MosaicQuant employs an "inlier-outlier disaggregation" principle. It quantizes the entire weight matrix into a dense 4-bit base component, primarily capturing inliers, and then introduces a sparse 4-bit residual component to correct quantization errors, focusing on critical weight blocks. To ensure a truly unified low-bit inference pipeline, MosaicQuant integrates ZipperEngine, which fuses sparse block computation directly into the dense 4-bit GEMM kernel using an overlapped pipeline. Extensive experiments on LLaMA3 and Qwen3 models demonstrate that MosaicQuant maintains near-FP16 accuracy while delivering up to a 1.24x speedup compared to the W16A16 baseline.
Key takeaway
Machine learning engineers deploying large language models under strict memory or latency requirements should consider MosaicQuant. It enables unified 4-bit quantization that preserves near-FP16 accuracy while delivering inference speedups of up to 1.24x over W16A16. It balances precision for inliers and outliers without breaking the low-bit execution pipeline, making it a practical option for efficient LLM deployment.
Key insights
Unified 4-bit LLM quantization can achieve near-FP16 accuracy and a speedup over the W16A16 baseline by disaggregating inliers and outliers.
Principles
- Disaggregate inliers and outliers for 4-bit weight quantization.
- Use a sparse 4-bit residual to compensate for quantization errors.
- Fuse sparse and dense computations for unified low-bit execution.
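The first two principles can be illustrated with a minimal NumPy sketch. It is not the paper's algorithm: the symmetric per-group quantizer, the block size, the `keep_frac` budget, and the rule of picking blocks by squared quantization error are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def quantize_int4(w, group=32):
    """Symmetric per-group 4-bit quantization (assumed scheme).
    Returns integer codes in [-8, 7] and one scale per group."""
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

def disaggregate(w, block=32, keep_frac=0.1, group=32):
    """Dense 4-bit base for the whole matrix, plus a sparse 4-bit residual
    on the blocks with the largest base quantization error (assumed criterion)."""
    rows, cols = w.shape
    q_base, s_base = quantize_int4(w, group)
    err = w - dequantize_int4(q_base, s_base, w.shape)

    # View the error as a grid of (block x block) tiles and score each tile.
    br, bc = rows // block, cols // block
    tiles = err.reshape(br, block, bc, block).transpose(0, 2, 1, 3)
    energy = (tiles ** 2).sum(axis=(2, 3))

    k = max(1, int(keep_frac * br * bc))
    residual = {}
    for idx in np.argsort(energy.ravel())[-k:]:
        i, j = divmod(int(idx), bc)
        residual[(i, j)] = quantize_int4(tiles[i, j], group)
    return (q_base, s_base), residual

def reconstruct(base, residual, shape, block=32):
    """Dequantize the base and add back the sparse residual tiles."""
    w_hat = dequantize_int4(*base, shape)
    for (i, j), (q_r, s_r) in residual.items():
        rs = slice(i * block, (i + 1) * block)
        cs = slice(j * block, (j + 1) * block)
        w_hat[rs, cs] += dequantize_int4(q_r, s_r, (block, block))
    return w_hat
```

In this toy setup, a single large outlier inflates the scale of its group, so the neighbouring inliers are quantized coarsely; the residual tile covering that region recovers most of the lost precision while both components stay in 4-bit codes.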
Method
Quantize the full weight matrix to a dense 4-bit base, then add a sparse 4-bit residual for error compensation, fusing sparse blocks into the dense 4-bit GEMM kernel via an overlapped pipeline.
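The fusion idea can be sketched at loop level: rather than running a dense GEMM followed by a separate sparse pass, each weight tile is visited once and any residual tile is folded in before the multiply. This is only a CPU analogy under assumed data layouts (per-tile scales, a dictionary of residual tiles); it does not model ZipperEngine's actual GPU kernel or its overlapped pipeline, and `fused_gemm` is a hypothetical name.

```python
import numpy as np

def fused_gemm(x, q_base, s_base, residual, tile=32):
    """Compute y = x @ (W_base + W_residual) in a single pass over weight tiles.

    x        : (M, K) float32 activations
    q_base   : (K, N) int8 array holding 4-bit codes in [-8, 7]
    s_base   : (K // tile, N // tile) float32 per-tile scales (assumed layout)
    residual : dict mapping (ti, tj) -> (codes, scale) for the sparse tiles
    """
    K, N = q_base.shape
    y = np.zeros((x.shape[0], N), dtype=np.float32)
    for ti in range(K // tile):
        rs = slice(ti * tile, (ti + 1) * tile)
        x_tile = x[:, rs]
        for tj in range(N // tile):
            cs = slice(tj * tile, (tj + 1) * tile)
            # Dense path: dequantize the base tile.
            w_tile = q_base[rs, cs].astype(np.float32) * s_base[ti, tj]
            # Sparse path: fold the residual in while the tile is already loaded.
            hit = residual.get((ti, tj))
            if hit is not None:
                q_r, s_r = hit
                w_tile = w_tile + q_r.astype(np.float32) * s_r
            y[:, cs] += x_tile @ w_tile
    return y
```

Because the residual is added inside the same tile loop, no second kernel launch or separate higher-precision path is needed, which is the property the unified pipeline is meant to preserve.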
In practice
- Quantize LLaMA3 and Qwen3 models to 4-bit with high accuracy.
- Improve LLM inference speed by up to 1.24x over W16A16.
Topics
- LLM Quantization
- 4-bit Quantization
- MosaicQuant
- ZipperEngine
- Inference Acceleration
- LLaMA3
- Qwen3
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.