AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

AAAC (Activation-Aware Adaptive Codebooks) is a lightweight post-training quantization method that cuts the memory and compute cost of large language model (LLM) inference by quantizing weights to 4 bits. Where existing methods such as AWQ and GPTQ map weights onto a fixed 4-bit grid, AAAC uses two small learned scalar codebooks (64 bytes) per layer. Quantization completes in 3–30 minutes on a single GPU and needs no memory beyond the model itself. For each weight group, AAAC selects the codebook that minimizes activation-weighted reconstruction error and encodes that choice in the unused sign bit of the group's positive scale, so the selection incurs zero storage overhead. Evaluated on Llama and Qwen models in NVFP4 and INT4 settings, AAAC consistently outperforms gradient-free baselines and matches or exceeds gradient-assisted methods, which typically require hours of quantization time and more memory. Combined with AWQ, AAAC recovers over 70% of the quantization gap.

Key takeaway

For machine learning engineers optimizing LLM inference, AAAC is a practical 4-bit quantization option. It matches or exceeds the accuracy of gradient-assisted methods in 3–30 minutes on a single GPU, with no memory beyond the model itself. Consider pairing it with AWQ, a combination that recovers 70.5% of the quantization gap. The approach offers a practical direction for closing the quality gap of 4-bit quantized language models.

Key insights

Adaptive codebooks learned via activation-weighted k-means improve 4-bit LLM quantization over fixed grids while keeping quantization time to minutes on a single GPU.

Principles

Fixed-grid quantization leaves quality on the table.
Activation importance guides reconstruction error minimization.
Codebook selection can be stored with zero overhead.
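
The three principles can be read as one per-group objective. The notation below is assumed for illustration and does not come from the source: w_i are the weights in group g, a_i is a non-negative activation importance for each weight's input channel, s_g is the group's positive scale, and C_0, C_1 are the layer's two codebooks.

```latex
% Assumed notation, not the paper's: a sketch of the per-group selection rule.
E_g(k) = \sum_{i \in g} a_i \left( w_i - s_g \, Q_{C_k}\!\left( \frac{w_i}{s_g} \right) \right)^{2},
\qquad
Q_{C}(x) = \operatorname*{arg\,min}_{c \in C} \lvert x - c \rvert,
\qquad
k_g^{\star} = \operatorname*{arg\,min}_{k \in \{0, 1\}} E_g(k)
```

Because s_g is always positive, its sign bit carries no information and can hold the one-bit choice k_g^\star.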

Method

AAAC learns two scalar codebooks per layer via activation-weighted k-means, assigning each weight group to the codebook minimizing reconstruction error, and stores the selection in the scale's unused sign bit.
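
The method description above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the max-abs group scale, the group size of 16, the quantile initialisation, and the alternating "refit, then reassign" loop are all hypothetical choices. The only fixed point is that a 4-bit index addresses 16 codebook entries.

```python
import numpy as np

def weighted_kmeans_1d(x, w, k=16, iters=25):
    """Weighted Lloyd iterations on scalars x with importance w; returns k sorted centroids."""
    c = np.quantile(x, (np.arange(k) + 0.5) / k)  # quantile initialisation (assumption)
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        num = np.bincount(idx, weights=w * x, minlength=k)
        den = np.bincount(idx, weights=w, minlength=k)
        filled = den > 0
        c[filled] = num[filled] / den[filled]  # empty clusters keep their old centroid
    return np.sort(c)

def snap(x, codebook):
    """Map every entry of x to its nearest codebook value."""
    return codebook[np.abs(x[..., None] - codebook).argmin(axis=-1)]

def group_error(wn, wt, codebook):
    """Importance-weighted squared error of each group under one codebook."""
    return (wt * (wn - snap(wn, codebook)) ** 2).sum(axis=1)

def fit_two_codebooks(W, importance, group_size=16, k=16, rounds=4):
    """W: (out, in) weights; importance: (in,) per-input-channel activation statistic."""
    groups = W.reshape(-1, group_size)
    imp = np.broadcast_to(importance, W.shape).reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) + 1e-12  # positive per-group scale
    wn = groups / scale        # weights normalised to [-1, 1]
    wt = imp * scale ** 2      # so errors are measured in the original weight space
    # Start from one shared codebook, then split off the worst-fitting half of the groups.
    cb0 = weighted_kmeans_1d(wn.ravel(), wt.ravel(), k)
    err0 = group_error(wn, wt, cb0)
    select = (err0 > np.median(err0)).astype(np.uint8)
    cbs = [cb0, cb0.copy()]
    for _ in range(rounds):
        for j in (0, 1):  # refit each codebook on the groups currently assigned to it
            m = select == j
            if m.any():
                cbs[j] = weighted_kmeans_1d(wn[m].ravel(), wt[m].ravel(), k)
        errs = np.stack([group_error(wn, wt, cb) for cb in cbs])  # shape (2, n_groups)
        select = errs.argmin(axis=0).astype(np.uint8)  # each group takes its better codebook
    return cbs, select, scale, errs

# Toy usage on random data (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
importance = rng.uniform(0.1, 2.0, size=128)
cbs, select, scale, errs = fit_two_codebooks(W, importance)
chosen = errs[select, np.arange(errs.shape[1])]  # error of the selected codebook per group
```

By construction, the per-group error of the selected codebook is never worse than that of either codebook applied to every group, which is the sense in which a second codebook can only help at selection time.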

In practice

Combine AAAC with AWQ for 70.5% quantization gap recovery.
Use AAAC for 4-bit LLM quantization in 3–30 minutes on a single GPU.
Store codebook selection bits in the scale's unused sign bit.
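
The sign-bit trick in the last item can be illustrated with a short NumPy sketch. It assumes float16 scales purely for illustration (NVFP4 and INT4 kernels may use other scale formats, which NumPy does not provide), and the helper names `pack_selection` and `unpack_selection` are hypothetical.

```python
import numpy as np

def pack_selection(scales, select):
    """Store a 0/1 codebook choice in the sign of a strictly positive scale."""
    s = np.asarray(scales, dtype=np.float16)
    if not np.all(s > 0):
        raise ValueError("scales must be strictly positive for the sign bit to be free")
    # Negating a positive float sets its sign bit and changes nothing else.
    return np.where(np.asarray(select, dtype=bool), -s, s)

def unpack_selection(packed):
    """Recover (scales, select) from sign-packed scales."""
    packed = np.asarray(packed, dtype=np.float16)
    select = np.signbit(packed).astype(np.uint8)  # sign bit -> codebook index
    scales = np.abs(packed)                       # magnitude -> original scale
    return scales, select

# Toy usage: three groups, the last two use codebook 1.
scales = np.array([0.5, 1.25, 0.03125], dtype=np.float16)
select = np.array([0, 1, 1], dtype=np.uint8)
packed = pack_selection(scales, select)
scales_rt, select_rt = unpack_selection(packed)
```

The packed array has the same dtype and byte size as the original scales, which is what "zero storage overhead" means here: the selection travels inside a bit the scale never used.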

Topics

LLM Quantization
4-bit Quantization
Adaptive Codebooks
Post-Training Quantization
Activation-Aware Quantization
Model Compression

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.