OffQ: Taming Structured Outliers in LLM Quantization by Offsetting
Summary
OffQ is a post-training quantization method that mitigates activation outliers in low-bit large language model (LLM) inference, targeting W4A4KV4 quantization (4-bit weights, activations, and KV cache). Developed by researchers from EPFL, Huawei, and ETHZ, it works by offsetting. It first identifies a low-dimensional outlier subspace in the activations using a tailored top-1 PCA, then rotates the activations so that high-magnitude values concentrate in a single channel. The magnitude of that outlier channel is then converted into a shared offset, which reduces the standard deviation of the activations. This makes W4A4KV4 quantization effective on a deployment-friendly uniform grid at uniform precision. Experiments on Llama 2, Llama 3, Llama 3.2, and Qwen 2.5 models from 1B to 72B parameters show better perplexity and 0-shot accuracy than state-of-the-art baselines. On Llama 3-8B, for example, OffQ reaches a perplexity of 6.98 against 166.3 for GPTQ.
Key takeaway
For MLOps engineers deploying LLMs on resource-constrained edge devices or cost-sensitive cloud platforms, OffQ offers a way to run W4A4KV4 quantization with better perplexity and accuracy than prior baselines, without the overhead of mixed precision. Consider integrating its top-1 PCA and offsetting technique to reduce the memory footprint and compute cost of 4-bit LLM deployments.
Key insights
OffQ effectively tames LLM activation outliers for W4A4KV4 quantization by converting them into absorbable offsets.
Principles
- Activation outliers in LLMs exhibit a low-dimensional structure.
- Concentrating outliers into fewer channels enables targeted suppression.
- Offsetting outliers into zero-points reduces activation standard deviation.
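The third principle can be illustrated with a small NumPy sketch. It is not taken from the paper: the data is synthetic, and the helper names `quantize_symmetric` and `quantize_asymmetric` are hypothetical. It shows that when a group of values shares a large common offset, an asymmetric quantizer absorbs that offset in its zero-point and spends its 16 levels on the small residual spread, while a symmetric quantizer wastes them on the offset itself.

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Fake-quantize with a zero-centered grid (no zero-point)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def quantize_asymmetric(x, bits=4):
    """Fake-quantize with a scale and a zero-point fitted to [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    zero_point = np.round(-lo / scale)  # the shared offset ends up here
    q = np.clip(np.round(x / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
residual = rng.normal(0.0, 1.0, size=128)  # low-variance part of one group
shifted = residual + 50.0                  # same group with a large shared offset (synthetic)

err_sym = np.mean((quantize_symmetric(shifted) - shifted) ** 2)
err_asym = np.mean((quantize_asymmetric(shifted) - shifted) ** 2)
print(f"symmetric MSE:  {err_sym:.4f}")
print(f"asymmetric MSE: {err_asym:.4f}")
```

The asymmetric error depends only on the spread of the residual, not on the size of the offset, which is why turning outlier magnitude into an offset is cheap under asymmetric quantization.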
Method
OffQ uses top-1 PCA to identify the dominant outlier direction and concentrate the outliers into a single channel. It then applies a Hadamard rotation to convert this outlier energy into group-wise offsets, which the zero-point of asymmetric quantization absorbs.
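The concentration step can be sketched as follows, under several assumptions that do not come from the paper: the top-1 direction is taken from a plain SVD of synthetic activations, and the rotation is a Householder reflection (an orthogonal map chosen here for simplicity) that sends that direction to channel 0. The paper's tailored PCA and its subsequent Hadamard step are not reproduced, and the function names are hypothetical.

```python
import numpy as np

def top1_direction(x):
    """Leading right singular vector of the (uncentered) activation matrix."""
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]

def reflect_to_channel0(v):
    """Orthogonal Householder matrix H with H @ v = e0 for a unit vector v."""
    if v[0] > 0:
        v = -v  # either sign is a valid principal direction; this one avoids cancellation
    e0 = np.zeros_like(v)
    e0[0] = 1.0
    u = v - e0
    u = u / np.linalg.norm(u)
    return np.eye(v.shape[0]) - 2.0 * np.outer(u, u)

rng = np.random.default_rng(0)
n_tokens, d = 256, 64

# Synthetic activations: unit noise plus one strong shared direction
# spread over three channels (a stand-in for structured outliers).
direction = np.zeros(d)
direction[[3, 17, 42]] = 1.0 / np.sqrt(3.0)
magnitude = rng.normal(40.0, 1.0, size=n_tokens)
x = rng.normal(size=(n_tokens, d)) + np.outer(magnitude, direction)

h = reflect_to_channel0(top1_direction(x))
rotated = x @ h

energy_fraction = (rotated[:, 0] ** 2).sum() / (rotated ** 2).sum()
print(f"energy in channel 0 after rotation: {energy_fraction:.3f}")
print(f"max |value| before rotation:   {np.abs(x).max():.2f}")
print(f"max |value| outside channel 0: {np.abs(rotated[:, 1:]).max():.2f}")
```

After the rotation, channel 0 carries a nearly constant large value per token, so it behaves like an offset, while the remaining channels keep only the low-magnitude residual.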
In practice
- Apply top-1 PCA to identify extreme outlier tokens.
- Use group-wise asymmetric quantization with a group size of 128.
- Fuse rotation matrices into weight matrices to reduce overhead.
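The last two points can be sketched together. This is a minimal illustration on random data, not the authors' implementation: `quantize_groupwise_asym` is a hypothetical helper, a random orthogonal matrix from a QR decomposition stands in for the actual rotation, and the per-group minimum is stored as a floating-point offset in place of an integer zero-point.

```python
import numpy as np

def quantize_groupwise_asym(x, bits=4, group_size=128):
    """Fake-quantize in groups along the last axis, one scale and offset per group."""
    levels = 2 ** bits - 1
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)  # per-group offset (plays the zero-point role)
    hi = g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((g - lo) / scale), 0, levels)
    return (q * scale + lo).reshape(x.shape)

rng = np.random.default_rng(0)
d_in, d_out = 256, 64
x = rng.normal(size=(8, d_in))      # activations
w = rng.normal(size=(d_in, d_out))  # weights of the next linear layer

# Random orthogonal matrix as a stand-in rotation (assumption).
rot, _ = np.linalg.qr(rng.normal(size=(d_in, d_in)))

# Fusing: rotate the weights once offline, so inference only sees w_fused.
w_fused = rot.T @ w
x_rot = x @ rot

y_ref = x @ w
y_rot = x_rot @ w_fused  # equal up to floating-point error, since rot @ rot.T = I

x_q = quantize_groupwise_asym(x_rot, bits=4, group_size=128)
max_err = np.abs(x_q - x_rot).max()
print(f"max |y_ref - y_rot|: {np.abs(y_ref - y_rot).max():.2e}")
print(f"max quantization error per element: {max_err:.3f}")
```

Because the rotation is orthogonal, folding its transpose into the weights leaves the layer output unchanged, so the only runtime cost is whatever is needed to produce the rotated activations.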
Topics
- LLM Quantization
- Post-Training Quantization
- Activation Outlier Mitigation
- W4A4KV4
- Top-1 PCA
- Hadamard Rotation
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.