AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization
Summary
AAAC (Activation-Aware Adaptive Codebooks) is a lightweight post-training quantization method that reduces the memory and compute cost of large language model (LLM) inference by quantizing weights to 4 bits. Where existing methods such as AWQ and GPTQ map weights onto a fixed 4-bit grid, AAAC replaces the grid with two small learned scalar codebooks (64 bytes) per layer. For each weight group, it selects the codebook that minimizes activation-weighted reconstruction error and encodes that choice in the otherwise unused sign bit of the group's positive scale, so the selection incurs zero storage overhead. Quantization completes in 3–30 minutes on a single GPU and requires no memory beyond the model itself. Evaluated on Llama and Qwen models in NVFP4 and INT4 settings, AAAC consistently outperforms gradient-free baselines and matches or exceeds gradient-assisted methods, which typically require hours of quantization time and more memory. When combined with AWQ, AAAC recovers over 70% of the quantization gap.
Key takeaway
For machine learning engineers optimizing LLM inference, AAAC is a practical 4-bit quantization option. It reaches accuracy that matches or exceeds gradient-assisted methods in minutes on a single GPU, with no memory required beyond the model itself. Consider integrating AAAC, especially in combination with AWQ, to deploy high-quality quantized models efficiently. The approach offers a practical direction for closing the quality gap of 4-bit quantized language models.
Key insights
Adaptive codebooks learned via activation-weighted k-means improve 4-bit LLM quantization over fixed grids, at a quantization cost of minutes on a single GPU rather than hours.
Principles
- Fixed-grid quantization leaves quality on the table.
- Activation importance guides reconstruction error minimization.
- Codebook selection can be stored with zero overhead.
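The second principle can be written as a per-group objective. The formulation below is a sketch and not taken from the original article: the symbols are assumptions introduced for illustration, with $w_i$ the weights in group $g$, $s_g > 0$ the group's scale, $a_i \ge 0$ an activation-derived importance for weight $i$, and $c_k(\cdot)$ the nearest-entry lookup in codebook $k$.

```latex
E_k(g) = \sum_{i \in g} a_i \,\bigl(w_i - s_g \, c_k(w_i / s_g)\bigr)^2,
\qquad
k^{*}(g) = \arg\min_{k \in \{0, 1\}} E_k(g)
```

Under this reading, weights that multiply large activations contribute more to the error, so the chosen codebook is the one that best preserves the weights that matter most for the layer's output.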
Method
AAAC learns two scalar codebooks per layer via activation-weighted k-means. Each weight group is assigned to the codebook that minimizes its activation-weighted reconstruction error, and the selection is stored in the unused sign bit of the group's scale.
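The method described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the absmax group scale, the group size of 16, the random initial assignment, the alternating fit/reassign loop, and the use of a per-input-channel importance vector are all hypothetical choices made for the example. With 16 entries per codebook, two codebooks at 2 bytes per entry would match the 64 bytes mentioned in the summary, though the entry width is also an assumption.

```python
import numpy as np

def weighted_kmeans_1d(values, weights, k=16, iters=20):
    """Weighted Lloyd iterations on scalars; returns k sorted centroids."""
    # Assumed initialisation: evenly spaced quantiles of the data.
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            m = assign == j
            if weights[m].sum() > 0:  # leave empty clusters where they are
                centroids[j] = np.average(values[m], weights=weights[m])
    return np.sort(centroids)

def group_errors(normed, scale, imp, codebooks):
    """Importance-weighted squared error of every group under each codebook."""
    errs = []
    for cb in codebooks:
        deq = cb[np.abs(normed[..., None] - cb).argmin(axis=-1)]
        errs.append((imp * (scale * (normed - deq)) ** 2).sum(axis=-1))
    return np.stack(errs)  # shape: (num_codebooks, out_features, groups)

def fit_two_codebooks(W, act_importance, group_size=16, rounds=3, seed=0):
    """W: (out, in) weights; act_importance: (in,) non-negative importances."""
    rng = np.random.default_rng(seed)
    out_f, _ = W.shape
    G = W.reshape(out_f, -1, group_size)
    imp = np.broadcast_to(act_importance.reshape(1, -1, group_size), G.shape)
    scale = np.abs(G).max(axis=-1, keepdims=True) + 1e-12  # positive scale
    normed = G / scale
    choice = rng.integers(0, 2, size=G.shape[:2])  # assumed random start
    codebooks = np.tile(np.linspace(-1.0, 1.0, 16), (2, 1))
    for _ in range(rounds):
        for c in range(2):
            m = choice == c
            if m.any():
                # Error in the original space is imp * scale^2 * (normed - cb)^2,
                # so imp * scale^2 is the k-means sample weight.
                w = (imp * scale ** 2)[m].ravel()
                codebooks[c] = weighted_kmeans_1d(normed[m].ravel(), w)
        errs = group_errors(normed, scale, imp, codebooks)
        choice = errs.argmin(axis=0)  # each group picks its better codebook
    return codebooks, choice, scale, errs

# Usage on random data (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))
importance = rng.uniform(0.1, 2.0, size=64)
codebooks, choice, scale, errs = fit_two_codebooks(W, importance)
```

Because each group takes the argmin over both codebooks, the total weighted error with adaptive selection can never exceed the error of using either codebook alone for the whole layer.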
In practice
- Combine AAAC with AWQ to recover 70.5% of the quantization gap.
- Use AAAC for 4-bit LLM weight quantization that completes in 3–30 minutes on a single GPU.
- Store each group's codebook selection in the unused sign bit of its scale.
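The sign-bit trick in the last item can be illustrated with a short NumPy sketch. It assumes float16 scales purely for demonstration (the article's NVFP4 and INT4 settings may use other scale formats), and `pack_choice` and `unpack_choice` are hypothetical helper names, not part of any library.

```python
import numpy as np

SIGN_BIT = np.uint16(0x8000)   # top bit of an IEEE 754 half-precision value
MAGNITUDE = np.uint16(0x7FFF)  # remaining 15 bits (exponent and mantissa)

def pack_choice(scales, choice):
    """Hide a 0/1 codebook choice in the sign bit of non-negative scales."""
    bits = np.asarray(scales, dtype=np.float16).view(np.uint16).copy()
    if np.any(bits & SIGN_BIT):
        raise ValueError("scales must be non-negative so the sign bit is free")
    bits[np.asarray(choice, dtype=bool)] |= SIGN_BIT
    return bits.view(np.float16)

def unpack_choice(packed):
    """Recover (scales, choice) from sign-tagged scales."""
    bits = np.asarray(packed, dtype=np.float16).view(np.uint16)
    choice = (bits >> 15).astype(np.uint8)
    scales = (bits & MAGNITUDE).view(np.float16)
    return scales, choice

# Usage: the packed array has the same size as the original scales.
scales = np.array([0.5, 1.25, 3.0], dtype=np.float16)
choice = np.array([0, 1, 1], dtype=np.uint8)
packed = pack_choice(scales, choice)
restored_scales, restored_choice = unpack_choice(packed)
```

Since a scale is always positive, its sign carries no information, so reusing that bit stores one selection bit per group without growing the tensor.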
Topics
- LLM Quantization
- 4-bit Quantization
- Adaptive Codebooks
- Post-Training Quantization
- Activation-Aware Quantization
- Model Compression
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.