OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

2024-09-25 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

OffQ is a post-training quantization method that mitigates activation outliers in low-bit large language model (LLM) inference, targeting W4A4KV4 quantization (4-bit weights, activations, and KV cache). Developed by researchers from EPFL, Huawei, and ETHZ, OffQ relies on an offsetting mechanism. It first identifies a low-dimensional outlier subspace in activations using a tailored top-1 PCA, then rotates activations to concentrate high-magnitude values into a single channel. The magnitude of this concentrated outlier channel is then converted into a shared offset, which significantly reduces the activations' standard deviation. This strategy enables W4A4KV4 quantization on a deployment-friendly uniform grid at uniform precision. Experiments across Llama 2, Llama 3, Llama 3.2, and Qwen 2.5 architectures, ranging from 1B to 72B parameters, show OffQ outperforming state-of-the-art baselines in perplexity and 0-shot accuracy; on Llama 3-8B, for example, GPTQ reaches a perplexity of 166.3 versus 6.98 for OffQ.

Key takeaway

For MLOps engineers deploying LLMs on resource-constrained edge devices or cost-sensitive cloud platforms, OffQ offers a way to reach W4A4KV4 quantization with better perplexity and accuracy than the baselines it was compared against, without mixed-precision overhead. Consider integrating its top-1 PCA and offsetting technique to cut the memory footprint and compute cost of 4-bit LLM deployments.

Key insights

OffQ effectively tames LLM activation outliers for W4A4KV4 quantization by converting them into absorbable offsets.

Principles

Activation outliers in LLMs exhibit a low-dimensional structure.
Concentrating outliers into fewer channels enables targeted suppression.
Offsetting outliers into zero-points reduces activation standard deviation.
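
The third principle can be illustrated with a small NumPy sketch. It is not taken from the paper: the helper names `quant_asym` and `quant_sym`, the float-valued offset, and the synthetic data are all assumptions made for illustration. The point it shows is that a constant shift costs an asymmetric quantizer nothing, because the shift lands in the offset term, whereas a symmetric quantizer has to stretch its grid to cover it.

```python
import numpy as np

def quant_asym(x, bits=4):
    """Asymmetric uniform fake-quantization; the minimum acts as a float offset."""
    levels = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

def quant_sym(x, bits=4):
    """Symmetric uniform fake-quantization; the grid is centred on zero."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=128)      # hypothetical activations with unit variance
shifted = x + 20.0            # same activations plus a large shared offset

err_asym_base = mse(quant_asym(x), x)
err_asym_shift = mse(quant_asym(shifted), shifted)
err_sym_shift = mse(quant_sym(shifted), shifted)

print(f"asymmetric, no offset:   {err_asym_base:.4f}")
print(f"asymmetric, with offset: {err_asym_shift:.4f}")
print(f"symmetric,  with offset: {err_sym_shift:.4f}")
```

In this toy setting the two asymmetric errors match, while the symmetric error is far larger. That is why turning an outlier into an offset is attractive when the quantizer already carries a zero-point.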

Method

OffQ uses top-1 PCA to identify the dominant outlier direction and concentrate the outliers into a single channel. It then applies a Hadamard rotation to convert this outlier energy into group-wise offsets, which the zero-point of asymmetric quantization absorbs.
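
A minimal NumPy sketch of this two-step idea follows. It is an illustration under stated assumptions, not the authors' implementation. The uncentred SVD stands in for the paper's "tailored" top-1 PCA, whose details are not given here. The Householder reflection is one convenient orthogonal map that sends the top direction to channel 0. The helper names and the synthetic activations are hypothetical.

```python
import numpy as np

def top1_direction(x):
    """Leading right-singular vector of x (an uncentred top-1 PCA direction)."""
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]

def householder_to_e0(u):
    """Symmetric orthogonal matrix that maps the unit vector u to basis vector e0."""
    e0 = np.zeros_like(u)
    e0[0] = 1.0
    v = u - e0
    norm = np.linalg.norm(v)
    if norm < 1e-12:              # u already equals e0
        return np.eye(len(u))
    v = v / norm
    return np.eye(len(u)) - 2.0 * np.outer(v, v)

def hadamard(n):
    """Normalised Sylvester Hadamard matrix; n must be a power of two."""
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

rng = np.random.default_rng(0)
n_tokens, dim = 256, 64
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
amp = 40.0 + rng.normal(size=n_tokens)
# Synthetic activations: unit noise plus a strong rank-1 outlier component.
x = rng.normal(size=(n_tokens, dim)) + np.outer(amp, direction)

u = top1_direction(x)
r = x @ householder_to_e0(u)   # outlier energy is concentrated in channel 0
y = r @ hadamard(dim)          # channel 0 is spread as a shared per-token offset
baseline = x @ hadamard(dim)   # Hadamard rotation alone, for comparison

print("per-token std, Hadamard only:  ", baseline.std(axis=1).mean())
print("per-token std, PCA + Hadamard: ", y.std(axis=1).mean())
print("per-token offset (mean of y):  ", y.mean(axis=1)[:3])
```

Because the first row of the normalised Hadamard matrix is constant, the value in channel 0 reappears as the same additive term in every output channel. The spread around that term is much smaller than under the Hadamard rotation alone, and the term itself is exactly what an asymmetric zero-point can absorb.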

In practice

Apply top-1 PCA to identify extreme outlier tokens.
Use group-wise asymmetric quantization with a group size of 128.
Fuse rotation matrices into weight matrices to reduce overhead.
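
The last two recommendations can be sketched as follows. This is a hedged NumPy illustration, not OffQ's actual kernels. The function name `quantize_groupwise_asym`, the unclamped integer zero-point, and the random orthogonal matrix used as a stand-in rotation are assumptions made for the example.

```python
import numpy as np

def quantize_groupwise_asym(x, bits=4, group_size=128):
    """Fake-quantize the last axis in groups, each with its own scale and zero-point."""
    levels = 2**bits - 1
    g = x.reshape(*x.shape[:-1], -1, group_size)
    lo = g.min(axis=-1, keepdims=True)
    hi = g.max(axis=-1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    # Assumption: the zero-point is an integer that is not itself limited to 4 bits.
    zero = np.round(-lo / scale)
    q = np.clip(np.round(g / scale) + zero, 0, levels)
    return ((q - zero) * scale).reshape(x.shape)

rng = np.random.default_rng(0)

# Two groups of 128 channels, each with a different shared offset.
offsets = np.repeat([5.0, -3.0], 128)
x = rng.normal(size=(4, 256)) + offsets
x_hat = quantize_groupwise_asym(x)
mse = float(np.mean((x - x_hat) ** 2))
print(f"group-wise asymmetric 4-bit MSE: {mse:.4f}")

# Fusing a rotation into the weights: (x R)(R^T W) equals x W for orthogonal R,
# so R^T W can be precomputed offline and stored in place of W.
w = rng.normal(size=(256, 32))
rot, _ = np.linalg.qr(rng.normal(size=(256, 256)))   # stand-in orthogonal matrix
w_fused = rot.T @ w
same = np.allclose((x @ rot) @ w_fused, x @ w)
print("fused rotation preserves the layer output:", same)
```

Each group's zero-point soaks up that group's offset, so the 16 available levels are spent on the narrow spread around it. The fusion check shows why a rotation on the weight side adds no inference cost: only the activation-side rotation remains to be applied at runtime.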

Topics

LLM Quantization
Post-Training Quantization
Activation Outlier Mitigation
W4A4KV4
Top-1 PCA
Hadamard Rotation

Code references

NVIDIA/cutlass

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.