LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs
Summary
Logit-aware Final-block Quantization (LFQ) is a simple yet effective enhancement for low-bit, weight-only post-training quantization (PTQ) of large language models. Existing block-wise PTQ methods enable memory-efficient deployment and match full-precision (FP) baselines on basic language modeling, but their performance degrades significantly on generative tasks, particularly for longer responses and complex chains of thought. The degradation is attributed to two factors: block-wise optimization omits the unembedding layer, and it relies on a mean squared error (MSE) objective, which misaligns token probability distributions. LFQ addresses both by quantizing the final Transformer block so as to minimize the cross-entropy between the logits of the FP model and those of its quantized counterpart. This logit-level alignment consistently boosts accuracy on complex generation tasks across diverse model families, while preserving FP baseline performance on language modeling and understanding.
Key takeaway
For Machine Learning Engineers deploying low-bit quantized LLMs for generative tasks: consider integrating Logit-aware Final-block Quantization (LFQ). Existing block-wise PTQ methods often degrade quality on complex generation. LFQ quantizes the final Transformer block against a logit-level cross-entropy objective, which boosts accuracy on longer responses and chains of thought while keeping basic language modeling performance at the full-precision baseline.
Key insights
LFQ improves low-bit LLM generation by aligning token probabilities at the final Transformer block's logit level.
Principles
- Block-wise PTQ can match FP on basic language modeling yet degrade generative quality.
- Logit-level alignment is crucial for generation.
- The MSE objective can misalign token probability distributions.
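The last principle can be illustrated with a toy example. The sketch below is not from the original article: it applies MSE directly to a three-token logit vector for simplicity (block-wise PTQ applies it to block outputs instead), and the logit values and helper names are made up for illustration. It builds two perturbed logit vectors with the same MSE against the FP logits, one of which flips the most likely token.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(fp_logits, q_logits):
    # Cross-entropy of the quantized distribution against the FP distribution.
    p = softmax(fp_logits)
    q = softmax(q_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def argmax(z):
    return max(range(len(z)), key=lambda i: z[i])

# Hypothetical FP logits: two near-tied candidates and one unlikely token.
fp = [2.0, 1.9, -3.0]
# Same-size error placed on the unlikely token (harmless for the ranking).
q_tail = [2.0, 1.9, -2.7]
# Same-size error placed on the top token (flips the ranking).
q_top = [1.7, 1.9, -3.0]

for name, q in [("tail", q_tail), ("top", q_top)]:
    print(name, "mse:", mse(fp, q), "ce:", cross_entropy(fp, q), "argmax:", argmax(q))
```

Both perturbations look equally good under MSE, but the cross-entropy is higher for the one that changes the predicted token, which is the kind of mismatch a logit-level objective is meant to expose.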
Method
LFQ quantizes the final Transformer block by minimizing the cross-entropy between the full-precision model's logits and the quantized model's logits, aligning token probability distributions.
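A minimal sketch of the idea, under loud assumptions: the "final block" is reduced to a single random linear layer `W`, the unembedding to a random matrix `U`, and the optimization to a grid search over one quantization scale. None of these names, sizes, or the grid search come from the original article, which does not describe the optimization procedure here. The sketch only contrasts selecting the quantization parameter by block-output MSE with selecting it by cross-entropy on the logits.

```python
import math
import random

def matvec(m, v):
    # Matrix-vector product on plain Python lists.
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def log_softmax(z):
    # Numerically stable log-softmax.
    m = max(z)
    lse = m + math.log(sum(math.exp(x - m) for x in z))
    return [x - lse for x in z]

def cross_entropy(fp_logits, q_logits):
    # Cross-entropy between the FP and quantized token distributions.
    log_p = log_softmax(fp_logits)
    log_q = log_softmax(q_logits)
    return -sum(math.exp(lp) * lq for lp, lq in zip(log_p, log_q))

def quantize(w, scale, bits=3):
    # Symmetric round-to-nearest quantization with clipping (an assumed scheme).
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [[scale * max(lo, min(hi, round(x / scale))) for x in row] for row in w]

rng = random.Random(0)
dim, vocab, n = 4, 6, 16
W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]    # toy final block
U = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]  # toy unembedding
X = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]      # toy calibration inputs

def evaluate(scale):
    # Returns (block-output MSE, logit cross-entropy), averaged over X.
    Wq = quantize(W, scale)
    mse = ce = 0.0
    for x in X:
        h_fp, h_q = matvec(W, x), matvec(Wq, x)
        mse += sum((a - b) ** 2 for a, b in zip(h_fp, h_q)) / dim
        ce += cross_entropy(matvec(U, h_fp), matvec(U, h_q))
    return mse / n, ce / n

scales = [0.1 + 0.05 * i for i in range(20)]
results = {s: evaluate(s) for s in scales}
scale_mse = min(scales, key=lambda s: results[s][0])  # block-wise MSE criterion
scale_ce = min(scales, key=lambda s: results[s][1])   # logit-aware criterion

print("MSE-selected scale:", scale_mse, "-> logit CE:", results[scale_mse][1])
print("CE-selected scale: ", scale_ce, "-> logit CE:", results[scale_ce][1])
```

The MSE criterion never sees `U`, so it cannot account for how errors in the block output propagate to token probabilities; the cross-entropy criterion does, because it is evaluated after the unembedding. On a toy problem the two criteria may pick the same scale, so this shows the mechanism rather than the size of the effect.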
In practice
- Apply LFQ to low-bit PTQ for generative LLMs.
- Focus on final Transformer block quantization.
- Use cross-entropy for logit alignment.
Topics
- Large Language Models
- Post-Training Quantization
- Low-Bit Quantization
- Generative AI
- Logit Alignment
- Transformer Architecture
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.