MosaicQuant: Inlier-Outlier Disaggregation for Unified 4-Bit LLM Quantization

2026-06-14 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

MosaicQuant is a unified 4-bit LLM quantization method that targets the accuracy loss that occurs when a single 4-bit representation must cover both common inlier values and rare large-magnitude outliers. Unlike existing mixed-precision methods, which keep outliers in higher precision, MosaicQuant applies an "inlier-outlier disaggregation" principle: it quantizes the entire weight matrix into a dense 4-bit base component that primarily captures inliers, then adds a sparse 4-bit residual component that corrects quantization errors in critical weight blocks. To keep the inference pipeline low-bit end to end, MosaicQuant integrates ZipperEngine, which fuses the sparse block computation directly into the dense 4-bit GEMM kernel using an overlapped pipeline. Experiments on LLaMA3 and Qwen3 models show that MosaicQuant maintains near-FP16 accuracy while delivering up to a 1.24x speedup over the W16A16 baseline (16-bit weights and activations).

Key takeaway

Machine learning engineers deploying large language models under strict memory or latency requirements should consider MosaicQuant. It enables unified 4-bit quantization that preserves near-FP16 accuracy while delivering inference speedups of up to 1.24x over W16A16. Because both inliers and outliers are handled in 4-bit form, precision is balanced without breaking the low-bit execution pipeline, which makes the approach practical for efficient LLM deployment.

Key insights

Unified 4-bit LLM quantization can reach near-FP16 accuracy and run faster than W16A16 when inliers and outliers are disaggregated into a dense base and a sparse residual.

Principles

Disaggregate inliers and outliers for 4-bit weight quantization.
Use a sparse 4-bit residual to compensate for quantization errors.
Fuse sparse and dense computations for unified low-bit execution.
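
The first two principles can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the tile size, the one-scale-per-tile symmetric quantizer, the squared-error ranking used to pick "critical" blocks, and the names `quantize_int4`, `disaggregate`, and `keep_fraction` are all assumptions made for illustration.

```python
import numpy as np

def quantize_int4(x, axis=-1):
    """Symmetric 4-bit quantization: integers in [-8, 7], one scale per slice."""
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)  # avoid dividing by zero
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def disaggregate(w, block=16, keep_fraction=0.25):
    """Dense 4-bit base plus a sparse 4-bit residual on the worst tiles.

    Assumption: "critical" tiles are those with the largest squared
    quantization error. The source does not specify the selection rule.
    """
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    nbr, nbc = rows // block, cols // block

    # Rearrange the matrix into (tile_row, tile_col, block * block).
    tiles = w.reshape(nbr, block, nbc, block).transpose(0, 2, 1, 3)
    flat = tiles.reshape(nbr, nbc, block * block)

    # Dense base: every tile is quantized to 4 bits.
    base_q, base_scale = quantize_int4(flat)
    base = dequantize(base_q, base_scale)

    # Residual: what the base failed to represent.
    residual = flat - base
    err = np.sum(residual ** 2, axis=-1)

    # Keep a 4-bit residual only for the k tiles with the largest error.
    k = max(1, int(round(keep_fraction * nbr * nbc)))
    mask = np.zeros(nbr * nbc, dtype=bool)
    mask[np.argsort(err.ravel())[-k:]] = True
    mask = mask.reshape(nbr, nbc)

    res_q, res_scale = quantize_int4(residual)
    sparse_res = dequantize(res_q, res_scale) * mask[..., None]

    def untile(t):
        t = t.reshape(nbr, nbc, block, block).transpose(0, 2, 1, 3)
        return t.reshape(rows, cols)

    return untile(base), untile(base + sparse_res), mask

# Toy weights: mostly small values plus a few large-magnitude outliers.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64)).astype(np.float32)
outlier_idx = rng.choice(w.size, size=20, replace=False)
w.reshape(-1)[outlier_idx] *= 30.0

base, w_hat, mask = disaggregate(w)
err_base = float(np.mean((w - base) ** 2))
err_hat = float(np.mean((w - w_hat) ** 2))
print("MSE, dense 4-bit base only:", err_base)
print("MSE, base + sparse 4-bit residual:", err_hat)
```

Under this toy quantizer the residual scale is set by the tile's largest residual, so no value is clipped and rounding never moves an element further from its target than leaving it uncorrected would. The second error is therefore lower than the first whenever the selected tiles have a nonzero residual.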

Method

Quantize the full weight matrix to a dense 4-bit base, then add a sparse 4-bit residual for error compensation, fusing sparse blocks into the dense 4-bit GEMM kernel via an overlapped pipeline.
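
The arithmetic behind fusing sparse blocks into the dense product can be sketched as follows. This is only a numerical illustration under stated assumptions: it works on already-dequantized floating-point arrays on the CPU, the dictionary layout of `residual_blocks` and the function `fused_matmul` are hypothetical, and it does not model ZipperEngine's packed 4-bit GEMM kernel or its overlapped pipeline.

```python
import numpy as np

def fused_matmul(x, base, residual_blocks, block):
    """Compute x @ (base + sparse residual) by accumulating into one output.

    residual_blocks maps (block_row, block_col) -> a (block, block) tile.
    This layout is an assumption made for illustration.
    """
    y = x @ base  # dense pass over the base weights
    for (bi, bj), tile in residual_blocks.items():
        r0, c0 = bi * block, bj * block
        # A weight tile at rows r0.., cols c0.. reads x columns r0..
        # and adds into output columns c0..
        y[:, c0:c0 + block] += x[:, r0:r0 + block] @ tile
    return y

rng = np.random.default_rng(1)
block = 4
base = rng.normal(size=(8, 12)).astype(np.float32)  # stand-in for the dequantized base
residual_blocks = {
    (0, 2): rng.normal(size=(block, block)).astype(np.float32),
    (1, 0): rng.normal(size=(block, block)).astype(np.float32),
}
x = rng.normal(size=(3, 8)).astype(np.float32)

# Reference: scatter the tiles into a dense residual, then do one matmul.
dense_residual = np.zeros_like(base)
for (bi, bj), tile in residual_blocks.items():
    dense_residual[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block] = tile

y_ref = x @ (base + dense_residual)
y_fused = fused_matmul(x, base, residual_blocks, block)
print("max abs difference:", float(np.max(np.abs(y_fused - y_ref))))
```

The point of the sketch is that the sparse residual never needs to be materialized as a second full matrix: each selected block contributes a small partial product that is added to the same output the dense pass produces.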

In practice

Quantize LLaMA3 and Qwen3 models to 4-bit while keeping accuracy near FP16.
Improve LLM inference speed by up to 1.24x over W16A16.

Topics

LLM Quantization
4-bit Quantization
MosaicQuant
ZipperEngine
Inference Acceleration
LLaMA3
Qwen3

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.