Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

2024-07-25 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

SAGE-PTQ is an ultra-low-bit post-training quantization (PTQ) framework for large language models (LLMs) designed to minimize the hidden storage cost of scaling factors. It averages 1.03 weight bits and only 0.004 scaling bits per matrix. The method separates salient from unsalient weights using distributional statistics, models a subsample of the unsalient weights as a sparse graph to estimate the optimal number of groups, and then applies dual-mode quantization: multi-bit precision for salient weights and binarization for unsalient ones. On LLaMA-3-8B it reaches a WikiText2 perplexity of 6.74, against 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. On LLaMA-2-70B it decodes 1.5x faster on a single NVIDIA L40 GPU.
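To make the bit budget concrete, the sketch below adds the two quoted figures and converts the result into an approximate storage footprint. It assumes both figures are averages per weight. The 8e9 parameter count is a round illustrative number, not a figure from the paper, and the function names are hypothetical.

```python
def effective_bits(weight_bits: float, scale_bits: float) -> float:
    """Total storage cost per weight, counting the scale metadata."""
    return weight_bits + scale_bits


def footprint_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given bit budget."""
    return num_params * bits_per_weight / 8 / 2**30


# Figures quoted in the summary.
sage_bits = effective_bits(1.03, 0.004)

# Assumption: a round 8e9 quantized parameters, for illustration only.
NUM_PARAMS = 8e9

print(f"effective bits per weight: {sage_bits:.3f}")
print(f"footprint at that budget:  {footprint_gib(NUM_PARAMS, sage_bits):.2f} GiB")
print(f"footprint at 16 bits:      {footprint_gib(NUM_PARAMS, 16.0):.2f} GiB")
```

The point of the comparison is that at roughly one bit per weight, even a small per-group scale overhead becomes a visible share of the total, which is the cost the framework targets.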

Key takeaway

For machine learning engineers deploying large language models under severe memory or latency constraints, SAGE-PTQ is worth evaluating for ultra-low-bit quantization. It uses less than half of BiLLM's GPU memory and decodes 1.5x faster on LLaMA-2-70B, while reaching a WikiText2 perplexity of 6.74 on LLaMA-3-8B against BiLLM's 55.8.

Key insights

SAGE-PTQ minimizes hidden scaling costs in ultra-low-bit LLM quantization via saliency-aware, graph-guided weight partitioning.

Principles

Salient weights require higher precision for accuracy.
Magnitude-based saliency identifies salient weights more reliably than Hessian-based saliency.
Efficient group index restoration is critical for deployment.

Method

SAGE-PTQ partitions weights into salient and unsalient sets based on their distribution, uses a sparse k-nearest-neighbor (KNN) graph to choose the number of unsalient groups, then applies dual-mode quantization with adaptive saliency thresholding.
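The dual-mode idea can be illustrated with a minimal sketch. This is not the paper's algorithm. The saliency rule (magnitude above mean plus k standard deviations), the 4-bit symmetric quantizer, and all function names are assumptions chosen for illustration, and the thresholding here is fixed rather than adaptive.

```python
import random
import statistics


def split_by_saliency(weights, k=2.0):
    """Flag weights whose magnitude exceeds mean + k * std of |w| (assumed rule)."""
    mags = [abs(w) for w in weights]
    threshold = statistics.fmean(mags) + k * statistics.pstdev(mags)
    return [m > threshold for m in mags]


def binarize(values):
    """1-bit quantization: sign(w) * alpha, with alpha = mean |w|."""
    if not values:
        return []
    alpha = statistics.fmean([abs(v) for v in values])
    return [alpha if v >= 0 else -alpha for v in values]


def uniform_quantize(values, bits=4):
    """Symmetric uniform quantization, returned in dequantized form."""
    if not values:
        return []
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0.0 for _ in values]
    scale = peak / levels
    return [max(-levels, min(levels, round(v / scale))) * scale for v in values]


def dual_mode_quantize(weights, k=2.0, bits=4):
    """Multi-bit for salient weights, binarization for the rest."""
    mask = split_by_saliency(weights, k)
    salient_q = iter(uniform_quantize([w for w, s in zip(weights, mask) if s], bits))
    unsalient_q = iter(binarize([w for w, s in zip(weights, mask) if not s]))
    return [next(salient_q) if s else next(unsalient_q) for s in mask], mask


rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(256)]
row[10], row[100] = 0.5, -0.4  # two injected outliers
dequantized, mask = dual_mode_quantize(row)
print(sum(mask), "salient weights out of", len(row))
```

In this sketch every unsalient weight collapses to one of two values, plus or minus a single scale, so the scale cost is one number for the whole group. How many such groups to use is the trade-off the graph step is meant to settle.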

In practice

Assign multi-bit precision to salient weights.
Binarize unsalient weights for maximum compression.
Use randomized subsampling for efficient graph construction.
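The subsampling and graph step can be sketched as follows. The source states only that a sparse KNN graph over subsampled unsalient weights is used to estimate group counts. Counting connected components as the estimate, the brute-force neighbor search, and every name and default below are assumptions made for illustration.

```python
import random


def knn_edges(points, k):
    """Brute-force k-nearest-neighbour edges over 1-D points."""
    edges = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: abs(points[j] - p))
        edges.extend((i, j) for j in others[:k])
    return edges


def count_components(n, edges):
    """Count connected components with a union-find structure."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)})


def estimate_group_count(weights, sample_size=200, k=3, seed=0):
    """Subsample, build a kNN graph, and use its component count as the estimate."""
    rng = random.Random(seed)
    sample = rng.sample(weights, min(sample_size, len(weights)))
    return count_components(len(sample), knn_edges(sample, k))


# Two well-separated clusters of three values each.
toy = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
print(estimate_group_count(toy, k=2))  # prints 2
```

Subsampling keeps the quadratic neighbor search affordable: the graph is built on a few hundred values rather than on every weight in the matrix.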

Topics

Post-training Quantization
Large Language Models
Ultra-low-bit Quantization
Graph Algorithms
Saliency Allocation
Model Compression
GPU Memory Optimization

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.