GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

2026-05-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

GAMMA is a quantizer-agnostic framework that automates mixed-precision quantization for large language models (LLMs) by assigning different bit-widths to different modules. It targets two limitations of existing methods: quantization-aware training is infeasible for billion-parameter models, and search-based approaches are computationally expensive. GAMMA runs inside a post-training pipeline, optimizing a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian budget constraint. It then projects the learned precision preferences into discrete assignments that meet the budget exactly, using integer programming. A key feature is score reuse: a single training run produces stable sensitivity rankings, so adapting to a new deployment target takes minutes and requires re-solving only the integer program. Across Llama and Qwen models (8B-32B), GAMMA is reported to outperform fixed-precision baselines by up to +12.99 Avg. and search-based mixed-precision methods by up to +7.00 Avg.

Key takeaway

For AI engineers optimizing LLM deployment, GAMMA reduces memory footprint without quantization-aware retraining. It can match 3-bit quality at 2.5-bit average precision. It also adapts to a new memory budget by re-solving an integer program rather than rerunning training, cutting per-budget adjustment time from hours to minutes.
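To put the 3-bit versus 2.5-bit comparison in concrete terms, the following minimal sketch estimates weight storage at a given average bit-width. It is plain arithmetic under stated assumptions: it counts weights only and ignores quantizer metadata (scales, zero-points), activations, and the KV cache. The helper name weight_memory_gb is hypothetical and does not come from the original article.

```python
def weight_memory_gb(num_params: float, avg_bits: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Assumption: counts quantized weights only; scales, zero-points,
    activations, and KV cache are not included.
    """
    return num_params * avg_bits / 8 / 1e9


# Model sizes at the two ends of the range mentioned in the summary.
for num_params in (8e9, 32e9):
    uniform_3bit = weight_memory_gb(num_params, 3.0)
    mixed_2p5bit = weight_memory_gb(num_params, 2.5)
    saving = 1 - mixed_2p5bit / uniform_3bit
    print(f"{num_params / 1e9:.0f}B params: "
          f"3-bit = {uniform_3bit:.1f} GB, "
          f"2.5-bit avg = {mixed_2p5bit:.1f} GB, "
          f"saving = {saving:.1%}")
```

Under these assumptions, moving from a uniform 3 bits to a 2.5-bit average removes one sixth of the weight storage regardless of model size.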

Key insights

GAMMA automates mixed-precision quantization for LLMs, optimizing bit allocation post-training for arbitrary memory budgets.

Principles

Learned precision preferences encode sensitivity rankings that stay stable across budgets.
Integer programming ensures exact budget compliance.

Method

GAMMA learns module-wise precision preferences by optimizing a hidden-state reconstruction objective under an augmented Lagrangian budget constraint, then projects these preferences into discrete bit-width assignments using integer programming, so the budget is met exactly.
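The projection step can be illustrated with a small sketch. The article does not give GAMMA's actual integer program, so this example assumes a simple form: each module picks exactly one bit-width, the total of size times bits must not exceed the budget, and the sum of per-module preference scores is maximized. It solves that multiple-choice knapsack exactly with dynamic programming as a stand-in for an integer-programming solver. The function name allocate_bits, the module sizes, and the scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def allocate_bits(
    sizes: List[int],                # weight count per module (arbitrary integer units)
    scores: List[Dict[int, float]],  # scores[i][b]: preference for module i at b bits (higher is better)
    budget_bits: int,                # constraint: sum(sizes[i] * bits[i]) <= budget_bits
) -> Tuple[float, List[int]]:
    """Pick one bit-width per module, maximizing total score within the budget."""
    neg_inf = float("-inf")
    # best[c] = best total score with exactly c bits spent on the modules seen so far
    best = [neg_inf] * (budget_bits + 1)
    best[0] = 0.0
    choices = []  # choices[i][c] = bit-width chosen for module i on the best path to cost c
    for size, options in zip(sizes, scores):
        nxt = [neg_inf] * (budget_bits + 1)
        pick = [0] * (budget_bits + 1)
        for cost in range(budget_bits + 1):
            if best[cost] == neg_inf:
                continue
            for bits, score in options.items():
                new_cost = cost + size * bits
                if new_cost <= budget_bits and best[cost] + score > nxt[new_cost]:
                    nxt[new_cost] = best[cost] + score
                    pick[new_cost] = bits
        best = nxt
        choices.append(pick)
    end = max(range(budget_bits + 1), key=lambda c: best[c])
    if best[end] == neg_inf:
        raise ValueError("budget is too small for any assignment")
    total = best[end]
    # Walk back through the recorded choices to recover the assignment.
    assignment = []
    cost = end
    for i in range(len(sizes) - 1, -1, -1):
        bits = choices[i][cost]
        assignment.append(bits)
        cost -= sizes[i] * bits
    assignment.reverse()
    return total, assignment


# Hypothetical example: three modules, candidate bit-widths 2, 3, and 4.
example_sizes = [4, 2, 2]
example_scores = [
    {2: -5.0, 3: -1.0, 4: -0.5},  # module 0 degrades sharply at 2 bits
    {2: -1.0, 3: -0.8, 4: -0.7},  # module 1 is nearly insensitive
    {2: -3.0, 3: -1.0, 4: -0.2},  # module 2 is moderately sensitive
]
# A 2.5-bit average over 8 weight units is a budget of 20 bits.
example_total, example_bits = allocate_bits(example_sizes, example_scores, budget_bits=20)
print(example_bits, example_total)
```

In this toy case the whole budget goes to the most sensitive module, which gets 3 bits while the other two stay at 2 bits. A production setup would more likely hand the same formulation to a dedicated integer-programming solver, since the table here grows with the budget.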

In practice

Achieve 2.5-bit average precision with 3-bit quality.
Adapt a single training run to multiple deployment budgets.
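Score reuse can be sketched as follows: the per-module scores are computed once and then held fixed, and only the discrete selection is re-solved for each new budget. This is an illustration under assumptions, not GAMMA's implementation. The scores and sizes are hypothetical, and the tiny problem is solved by brute-force enumeration, where a real pipeline would use an integer-programming solver.

```python
from itertools import product
from typing import Dict, List, Tuple

BIT_CHOICES = (2, 3, 4)

# Hypothetical outputs of a single training run: one score per module and
# bit-width (higher is better). They are never recomputed below.
SIZES: List[int] = [4, 2, 2]
SCORES: List[Dict[int, float]] = [
    {2: -5.0, 3: -1.0, 4: -0.5},
    {2: -1.0, 3: -0.8, 4: -0.7},
    {2: -3.0, 3: -1.0, 4: -0.2},
]


def project(avg_bits_target: float) -> Tuple[int, ...]:
    """Re-solve only the discrete selection for a new average-bit budget."""
    budget = avg_bits_target * sum(SIZES)
    feasible = [
        assignment
        for assignment in product(BIT_CHOICES, repeat=len(SIZES))
        if sum(n * b for n, b in zip(SIZES, assignment)) <= budget
    ]
    if not feasible:
        raise ValueError("no assignment fits this budget")
    return max(
        feasible,
        key=lambda assignment: sum(SCORES[i][b] for i, b in enumerate(assignment)),
    )


# Three deployment targets served from the same fixed scores.
plans = {target: project(target) for target in (2.5, 3.0, 3.5)}
for target, assignment in plans.items():
    avg_bits = sum(n * b for n, b in zip(SIZES, assignment)) / sum(SIZES)
    print(f"target {target}: bits per module = {assignment}, average = {avg_bits}")
```

The design point is that the expensive step (learning the scores) sits outside the loop over budgets, which is what makes per-budget adaptation cheap.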

Topics

Mixed-Precision Quantization
Large Language Models
Bit Allocation
Post-Training Quantization
Integer Programming

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.