Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study finds that codebook initialization is the primary bottleneck preventing effective extreme quantization of Large Language Models (LLMs) to 2-bit precision, a technique crucial for edge deployment. While additive quantization offers O(1) lookup-table dequantization, it often fails catastrophically at 2 bits per parameter (bpp) despite extensive search and finetuning. The research introduces OA-EM, an output-aware Expectation-Maximization (EM) initialisation method that uses a Hessian-weighted Mahalanobis distance. After PV-tuning, this method consistently yields better solutions across compression rates, search budgets, and architectures, including Llama 3.2 3B, Llama 3.1 8B, and Qwen 2.5 3B. The bottleneck's severity increases with the representational ratio ρ = N/KM, becoming extreme at 2 bpp, where poor initialization can degrade perplexity by orders of magnitude.
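To make the lookup-table claim and the 2 bpp figure concrete, here is a minimal NumPy sketch of additive-quantization dequantization. It is an illustration under stated assumptions, not the paper's code: the names `dequantize`, `bits_per_parameter`, and `representational_ratio` are hypothetical, the group size `g` is an assumed parameter, and the reading of ρ = N/KM as N/(K·M) with N counting weight groups is an assumption, since the summary does not define these symbols.

```python
import numpy as np

def dequantize(codes, codebooks):
    """Rebuild weight groups from additive codes.

    codes:     (N, M) integer indices, one per codebook (assumed layout)
    codebooks: (M, K, g) array of M codebooks with K codewords of length g
    returns:   (N, g) reconstructed weight groups
    """
    M = codebooks.shape[0]
    # One table lookup per codebook, then a sum: the per-group cost
    # does not depend on the codebook size K.
    return sum(codebooks[m, codes[:, m]] for m in range(M))

def bits_per_parameter(M, K, g):
    # Index bits only; codebook storage overhead is ignored here.
    return M * np.log2(K) / g

def representational_ratio(N, K, M):
    # ASSUMPTION: rho = N / (K * M), with N the number of weight groups.
    return N / (K * M)

# Tiny worked example: 2 codebooks, 2 codewords each, groups of 3 weights.
codebooks = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 2.0], [0.5, 0.5, 0.5]],
])
codes = np.array([[0, 1], [1, 0]])
W_hat = dequantize(codes, codebooks)
```

Under these assumptions, two 8-bit codebooks (K = 256) over groups of 8 weights give `bits_per_parameter(2, 256, 8) == 2.0`, which is one way to reach the 2 bpp regime discussed above.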

Key takeaway

For AI engineers deploying LLMs to edge devices, codebook initialization is the critical factor in extreme quantization. When targeting 2-bit precision, an initialisation method such as OA-EM can prevent catastrophic performance degradation and significantly improve model quality; the study reports that it dominates the quality-compute frontier for highly compressed models.

Key insights

Codebook initialization is critical for extreme LLM quantization, especially at 2-bit precision.

Principles

Method

OA-EM uses Hessian-weighted Mahalanobis distance for output-aware EM initialisation, improving codebook optimization for extreme LLM quantization.
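The idea of an output-aware distance can be sketched with a hard-assignment EM loop (k-means) in which the Euclidean distance is replaced by (w - c)^T H (w - c). This is a simplified illustration, not the authors' OA-EM: it fits a single codebook rather than an additive set, and the names `hessian_distance` and `output_aware_kmeans`, the proxy Hessian H = X^T X / n built from random stand-in activations, and all sizes are assumptions.

```python
import numpy as np

def hessian_distance(W, C, H):
    """Pairwise (w - c)^T H (w - c) for rows w of W (N, g) and c of C (K, g)."""
    D = W[:, None, :] - C[None, :, :]              # (N, K, g) differences
    return np.einsum("nkg,gh,nkh->nk", D, H, D)    # (N, K) distances

def output_aware_kmeans(W, H, K, iters=20, seed=0):
    """Hard-EM codebook fit under a Hessian-weighted distance (sketch)."""
    rng = np.random.default_rng(seed)
    # Start from K distinct weight groups chosen at random.
    C = W[rng.choice(len(W), size=K, replace=False)].copy()
    for _ in range(iters):
        # E-step: assign each group to its nearest codeword under H.
        assign = hessian_distance(W, C, H).argmin(axis=1)
        # M-step: with one H shared by all groups, the plain mean is a
        # minimiser of the weighted objective, so H only changes assignments.
        for k in range(K):
            members = W[assign == k]
            if len(members):
                C[k] = members.mean(axis=0)
    assign = hessian_distance(W, C, H).argmin(axis=1)
    return C, assign

# Usage with synthetic data (all values here are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))     # stand-in calibration activations
H = X.T @ X / len(X)              # proxy Hessian, symmetric PSD
W = rng.normal(size=(512, 8))     # weight groups of length 8
C, assign = output_aware_kmeans(W, H, K=16)
```

Setting H to the identity recovers ordinary k-means, which makes the contrast clear: the Hessian weighting changes which codeword a weight group is matched to, favouring directions that matter more for the layer's output.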

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.