Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

SAGE-PTQ is an ultra-low-bit post-training quantization (PTQ) framework for large language models (LLMs) that targets the hidden storage cost of scaling factors. It averages 1.03 weight bits and only 0.004 scaling bits per matrix. The method first separates salient from unsalient weights using distributional statistics, then models a subsample of the unsalient weights as a sparse graph to estimate the optimal number of groups. Quantization is dual-mode: salient weights receive multi-bit precision, and unsalient weights are binarized. On LLaMA-3-8B, SAGE-PTQ reaches a WikiText2 perplexity of 6.74 versus 55.8 for BiLLM, while using less than 50% of BiLLM's GPU memory. It also provides 1.5x faster decoding on LLaMA-2-70B on a single NVIDIA L40 GPU.

Key takeaway

For machine learning engineers deploying large language models under severe memory or latency constraints, SAGE-PTQ is worth considering for ultra-low-bit quantization. It uses less than half the GPU memory of BiLLM and decodes 1.5x faster on LLaMA-2-70B, while keeping accuracy high (6.74 WikiText2 perplexity on LLaMA-3-8B).

Key insights

SAGE-PTQ minimizes hidden scaling costs in ultra-low-bit LLM quantization via saliency-aware, graph-guided weight partitioning.
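To make the "hidden cost" concrete: in group-wise quantization, every group of weights stores its own scale, so a scale of `scale_bits` shared by `group_size` weights adds `scale_bits / group_size` bits to each weight. The sketch below shows that arithmetic only. The function name, the 16-bit scale, and the group sizes are illustrative assumptions, not settings taken from the paper.

```python
def effective_bits_per_weight(weight_bits: float, group_size: int, scale_bits: int = 16) -> float:
    """Average storage per weight when each group of `group_size` weights shares one scale."""
    return weight_bits + scale_bits / group_size


# Hypothetical group sizes, for illustration only.
for group_size in (64, 128, 1024):
    overhead = effective_bits_per_weight(1.0, group_size) - 1.0
    print(f"group_size={group_size}: +{overhead} scale bits per weight")
```

With binarized weights, small groups can make the scales cost a sizeable fraction of the weights themselves, which is why reducing the number of groups matters at this bit width.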

Method

SAGE-PTQ partitions weights into salient and unsalient sets based on their distribution, uses a sparse k-nearest-neighbor (KNN) graph to optimize the number of unsalient groups, then applies dual-mode quantization with adaptive saliency thresholding.
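The dual-mode step can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the saliency rule (magnitude more than `k` standard deviations above the mean magnitude), the single shared binarization scale, and the name `dual_mode_quantize` are all assumptions made for illustration, and the KNN-graph estimation of group counts is omitted entirely.

```python
from statistics import fmean, pstdev


def dual_mode_quantize(weights, k=2.0, salient_bits=4):
    """Hypothetical sketch: multi-bit for salient weights, 1-bit for the rest."""
    mags = [abs(w) for w in weights]
    # Assumed saliency rule: magnitude far above the mean magnitude.
    threshold = fmean(mags) + k * pstdev(mags)
    salient = [m > threshold for m in mags]

    # Unsalient weights are binarized to {-alpha, +alpha};
    # alpha = mean |w| minimizes the squared error for a sign code.
    rest = [m for m, s in zip(mags, salient) if not s]
    alpha = fmean(rest) if rest else 0.0

    # Salient weights use symmetric uniform quantization with one shared scale.
    qmax = 2 ** (salient_bits - 1) - 1
    top = max((m for m, s in zip(mags, salient) if s), default=0.0)
    scale = top / qmax if top > 0 else 1.0

    quantized = []
    for w, s in zip(weights, salient):
        if s:
            quantized.append(round(w / scale) * scale)
        else:
            quantized.append(alpha if w >= 0 else -alpha)
    return quantized, salient


# Nine small weights and one outlier: only the outlier is treated as salient.
w = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, 7.0]
w_hat, mask = dual_mode_quantize(w)
print(mask)
```

A single scale for all unsalient weights is the cheapest possible grouping. A real implementation would choose the number of groups, which is the part the method delegates to the graph.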

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.