Channel-Wise Mixed-Precision Quantization for Large Language Models

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Channel-Wise Mixed-Precision Quantization (CMPQ) is a method for reducing the substantial memory requirements of Large Language Models (LLMs) so that they can be deployed on edge devices. Unlike existing approaches designed for integer bit-widths, CMPQ allocates precision channel-wise based on activation distributions and adapts to any bit-width constraint, including fractional bits. It employs a non-uniform quantization strategy and incorporates two outlier extraction techniques to preserve critical information and minimize quantization loss. Experiments on OPT-2.7B, OPT-6.7B, LLaMA2-7B, and LLaMA2-13B show that CMPQ improves performance in integer-bit quantization tasks and achieves significant gains with only a modest increase in memory usage. For instance, at 2.2-bit quantization LLaMA2-7B's perplexity fell by roughly 30% (from 15.97 to 11.11) for a 10% increase in storage. The quantization process itself also needs only about one quarter of the memory required by gradient-based methods such as SqueezeLLM, making it more efficient in resource-constrained scenarios.
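
The headline figures can be checked with a few lines of arithmetic. The sketch below assumes the 10% storage figure is measured against a 2-bit baseline and counts weight bits only; that baseline is an assumption of this example, not something the summary states explicitly. The helper names are illustrative.

```python
def relative_improvement(before: float, after: float) -> float:
    """Fractional reduction in perplexity (lower perplexity is better)."""
    return (before - after) / before


def storage_increase(base_bits: float, target_bits: float) -> float:
    """Fractional growth in weight storage when the average bit-width rises."""
    return (target_bits - base_bits) / base_bits


# Perplexity figures quoted above for LLaMA2-7B.
ppl_gain = relative_improvement(15.97, 11.11)

# Assumption: the comparison point is plain 2-bit quantization.
extra_storage = storage_increase(2.0, 2.2)

print(f"perplexity reduction: {ppl_gain:.1%}")
print(f"storage increase:     {extra_storage:.1%}")
```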

Key takeaway

For Machine Learning Engineers deploying LLMs on resource-constrained edge devices, CMPQ offers a practical way to reduce memory footprint without significant performance degradation. Consider it for its adaptability to fractional bit-widths and for its efficiency: the quantization process needs only about one quarter of the memory that gradient-based methods require. This allows substantial performance gains for minimal storage overhead, especially where integer-only quantization is insufficient or memory is severely limited.

Key insights

CMPQ quantizes LLMs adaptively, channel by channel, combining mixed precision with outlier protection for efficient edge deployment.

Principles

Channel-wise mixed-precision quantization outperforms layer-wise precision allocation for LLMs.
Non-uniform quantization is crucial for LLMs' non-uniform weight distributions.
Protecting activation-based and quantization-aware outliers minimizes information loss.
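
The second principle can be illustrated on synthetic data. The sketch below compares a uniform 3-bit grid with a non-uniform codebook obtained by refining that same grid with Lloyd's (K-means) iterations. The Gaussian weights, the 3-bit setting, and all function names are assumptions made for illustration; this is not the paper's experimental setup.

```python
import numpy as np


def uniform_levels(w: np.ndarray, bits: int) -> np.ndarray:
    """Evenly spaced quantization levels spanning the weight range."""
    return np.linspace(w.min(), w.max(), 2 ** bits)


def nearest(w: np.ndarray, levels: np.ndarray):
    """Map each weight to its closest level; return values and indices."""
    idx = np.argmin(np.abs(w[:, None] - levels[None, :]), axis=1)
    return levels[idx], idx


def lloyd_refine(w: np.ndarray, levels: np.ndarray, iters: int = 25) -> np.ndarray:
    """1-D K-means (Lloyd's algorithm) started from the given levels."""
    levels = levels.copy()
    for _ in range(iters):
        _, idx = nearest(w, levels)
        for j in range(levels.size):
            members = w[idx == j]
            if members.size:  # leave empty clusters where they are
                levels[j] = members.mean()
    return levels


rng = np.random.default_rng(0)
w = rng.normal(size=4096)  # synthetic, bell-shaped "weights"

grid = uniform_levels(w, bits=3)
w_uniform, _ = nearest(w, grid)
w_kmeans, _ = nearest(w, lloyd_refine(w, grid))

mse_uniform = float(np.mean((w - w_uniform) ** 2))
mse_kmeans = float(np.mean((w - w_kmeans) ** 2))
print(f"uniform MSE: {mse_uniform:.5f}  k-means MSE: {mse_kmeans:.5f}")
```

Because the K-means codebook starts from the uniform grid and each Lloyd step cannot increase the squared error, the non-uniform codebook is never worse than the grid on this data; on bell-shaped distributions it moves levels toward the dense center and is typically noticeably better.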

Method

CMPQ allocates channel-wise precision based on activation L2-norm, applies non-uniform K-means clustering, and uses two outlier extraction methods (activation-based and quantization-aware) to preserve critical weights in FP16.
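
A minimal sketch of the first two steps, under stated assumptions: a "channel" is taken to be an input channel (a column of the weight matrix), a fractional budget such as 2.2 bits is met by giving one extra bit to the channels with the largest activation L2-norm, and each channel is then clustered with a small 1-D K-means. The allocation rule, the quantile initialization, and every function name here are hypothetical simplifications, not the paper's exact procedure. Outlier handling is omitted.

```python
import numpy as np


def allocate_bits(act_norms: np.ndarray, avg_bits: float) -> np.ndarray:
    """Give floor(avg_bits) to every channel and one extra bit to the
    highest-norm channels so the mean bit-width matches the budget."""
    low = int(np.floor(avg_bits))
    n = act_norms.size
    n_high = int(round((avg_bits - low) * n))
    bits = np.full(n, low, dtype=int)
    if n_high > 0:
        bits[np.argsort(act_norms)[-n_high:]] = low + 1
    return bits


def kmeans_1d(values: np.ndarray, k: int, iters: int = 20):
    """Lloyd's algorithm in one dimension with quantile initialization."""
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            members = values[assign == j]
            if members.size:
                centroids[j] = members.mean()
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, assign


def quantize_channelwise(W: np.ndarray, X: np.ndarray, avg_bits: float):
    """W: (out_features, in_features) weights; X: (samples, in_features)
    calibration activations. Returns dequantized weights and per-channel bits."""
    act_norms = np.linalg.norm(X, axis=0)  # one L2-norm per input channel
    bits = allocate_bits(act_norms, avg_bits)
    W_q = np.empty_like(W)
    for c in range(W.shape[1]):
        centroids, assign = kmeans_1d(W[:, c], 2 ** bits[c])
        W_q[:, c] = centroids[assign]
    return W_q, bits


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 20))
# Synthetic activations with per-channel scales so some channels are salient.
X = rng.normal(size=(128, 20)) * rng.uniform(0.5, 3.0, size=20)

W_q, bits = quantize_channelwise(W, X, avg_bits=2.2)
print("bits per channel:", bits)
print("average bits:", bits.mean())
```

With 20 channels and a 2.2-bit budget, this rule assigns 3 bits to the four channels with the largest activation norm and 2 bits to the rest, which is how a fractional average bit-width can arise from integer per-channel widths.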

In practice

Quantize salient channels to higher precision based on activation L2-norm.
Retain approximately 0.5% of outliers in FP16 to reduce quantization error.
Prioritize non-uniform quantization for better low-bit LLM performance.
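
The outlier guideline can be sketched as follows. This example keeps the 0.5% of weights with the largest quantization error in FP16 and uses the quantized value everywhere else. Selecting by absolute quantization error is an assumed stand-in for the quantization-aware criterion, and the 2-bit uniform quantizer is only a placeholder to produce an error to work with; the activation-based technique is not shown, and the paper's exact criteria may differ.

```python
import numpy as np


def uniform_quantize(W: np.ndarray, bits: int) -> np.ndarray:
    """Placeholder round-to-nearest quantizer over the full weight range."""
    steps = 2 ** bits - 1
    lo, hi = W.min(), W.max()
    scale = (hi - lo) / steps
    return np.round((W - lo) / scale) * scale + lo


def protect_outliers(W: np.ndarray, W_q: np.ndarray, keep_frac: float = 0.005):
    """Keep the keep_frac weights with the largest quantization error in FP16.

    Returns the mixed weight matrix and a boolean mask of protected entries.
    """
    err = np.abs(W - W_q).ravel()
    n_keep = max(1, int(round(keep_frac * err.size)))
    keep_idx = np.argpartition(err, -n_keep)[-n_keep:]
    mask = np.zeros(err.size, dtype=bool)
    mask[keep_idx] = True
    mask = mask.reshape(W.shape)
    # Protected entries go through an FP16 round trip, as they would if stored
    # in a sparse FP16 side structure.
    fp16_values = W.astype(np.float16).astype(W.dtype)
    return np.where(mask, fp16_values, W_q), mask


rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W_q = uniform_quantize(W, bits=2)
W_mixed, mask = protect_outliers(W, W_q, keep_frac=0.005)

mse_before = float(np.mean((W - W_q) ** 2))
mse_after = float(np.mean((W - W_mixed) ** 2))
print(f"protected weights: {int(mask.sum())} of {W.size}")
print(f"MSE before: {mse_before:.5f}  after: {mse_after:.5f}")
```

In a real deployment the protected entries would be stored separately in a sparse format, so their indices add some storage on top of the FP16 values themselves.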

Topics

Large Language Models
Quantization
Mixed-Precision Quantization
Post-Training Quantization
Edge AI
Model Compression

Code references

meta-llama/llama3

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.