CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs
Summary
CoQuant is a post-training quantization (PTQ) method for mixed-precision Large Language Models (LLMs) that accounts for both activation and weight quantization noise when selecting the subspace kept in high precision. Developed by Zhe Ding, Su Pan, and Duowei Pan, it models the expected output error theoretically and derives a closed-form weighted PCA solution that optimally balances activation and weight covariances to select the high-precision subspace. Prior mixed-precision methods, by contrast, rely solely on activation statistics. In extensive experiments on Llama-3.2 and Qwen2.5 models, CoQuant outperforms strong PTQ baselines, with consistent improvements in WikiText perplexity and zero-shot common-sense reasoning accuracy. The source code for CoQuant is available on GitHub.
Key takeaway
For NLP engineers and research scientists optimizing LLM inference costs, CoQuant offers a principled way to decide which subspace stays in high precision under mixed-precision PTQ. Because it models weight and activation noise jointly, it is a more robust approach than activation-only methods. Consider evaluating CoQuant in your quantization workflows, especially for Llama-3.2 and Qwen2.5 models, where the reported gains are in WikiText perplexity and zero-shot reasoning accuracy.
Key insights
CoQuant improves LLM quantization by jointly optimizing weight and activation subspaces for reduced output error.
Principles
- Output error is driven by both activation and weight quantization noise.
- Balancing activation and weight covariances is key for optimal subspace selection.
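Both principles can be seen in a first-order error expansion. The sketch below is a standard argument under stated assumptions, not the paper's exact derivation: a single linear layer y = Wx, quantization perturbations ΔW and Δx that are zero-mean, independent of each other and of x, and a dropped second-order term ΔW Δx. The symbols Σ_x and Σ_Δx are introduced here for illustration.

```latex
\begin{aligned}
\hat{y} - y &= (W + \Delta W)(x + \Delta x) - W x \;\approx\; W\,\Delta x + \Delta W\,x, \\
\mathbb{E}\,\lVert \hat{y} - y \rVert^2 &\approx
  \operatorname{tr}\!\left(W\,\Sigma_{\Delta x}\,W^\top\right)
  + \mathbb{E}\,\operatorname{tr}\!\left(\Delta W\,\Sigma_x\,\Delta W^\top\right),
\qquad
\Sigma_x = \mathbb{E}[x x^\top],\quad
\Sigma_{\Delta x} = \mathbb{E}[\Delta x\,\Delta x^\top].
\end{aligned}
```

The cross term vanishes under the independence assumption. The activation-noise term is shaped by W^T W and the weight-noise term by Σ_x, so both the weight and the activation second moments enter the expected error.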
Method
CoQuant models the expected output error under joint weight and activation quantization noise and derives a closed-form weighted PCA solution that balances the activation and weight covariances to select the optimal high-precision subspace.
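A minimal NumPy sketch of what a weighted PCA over the two covariances could look like. This is not the authors' implementation: the function name `weighted_pca_subspace`, the trace normalization, and the mixing weight `alpha` are assumptions made for illustration, whereas the paper derives its weighting in closed form from the error model.

```python
import numpy as np

def weighted_pca_subspace(X, W, rank, alpha=0.5):
    """Split the input space into a high- and a low-precision subspace.

    X: (num_samples, d_in) calibration activations.
    W: (d_out, d_in) weight matrix of a linear layer.
    alpha: hypothetical mixing weight (assumption, not from the paper).
    """
    # Second-moment matrices of activations and weights over the input dim.
    cov_act = X.T @ X / X.shape[0]
    cov_wgt = W.T @ W / W.shape[0]
    # Normalize by trace so neither term dominates purely by scale.
    cov_act = cov_act / np.trace(cov_act)
    cov_wgt = cov_wgt / np.trace(cov_wgt)
    mixed = alpha * cov_act + (1.0 - alpha) * cov_wgt
    # eigh returns ascending eigenvalues for a symmetric matrix; sort descending.
    eigvals, eigvecs = np.linalg.eigh(mixed)
    order = np.argsort(eigvals)[::-1]
    basis = eigvecs[:, order]
    return basis[:, :rank], basis[:, rank:]

# Synthetic stand-in data; a real run would use calibration activations.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16)) * np.linspace(0.1, 3.0, 16)
W = rng.normal(size=(32, 16))
U_hi, U_lo = weighted_pca_subspace(X, W, rank=4)
```

Setting `alpha=1.0` in this sketch reduces it to PCA on activation statistics alone, which is the kind of activation-only selection the summary contrasts CoQuant with.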
In practice
- Apply CoQuant to Llama-3.2 and Qwen2.5 for improved low-bit quantization.
- Model weight and activation noise jointly, rather than activation statistics alone, for better PTQ accuracy.
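To make the mixed-precision idea concrete, here is a hedged sketch of how a chosen orthogonal basis could be used at inference time: rotate activations and weights into the basis, keep the leading coordinates at a higher bit width, and quantize the rest more aggressively. The helper names, the symmetric per-tensor quantizer, the 8/4-bit split, the equal-weight covariance sum, and applying the same split to weights are all illustrative assumptions, not details taken from the paper or its code.

```python
import numpy as np

def fake_quant(t, bits):
    """Symmetric per-tensor uniform quantizer (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(t)) / qmax
    if scale == 0:
        return t.copy()
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

def mixed_precision_linear(X, W, Q, rank, hi_bits=8, lo_bits=4):
    """Approximate X @ W.T with a high/low precision split in the basis Q.

    Q is orthogonal, so (X @ Q) @ (W @ Q).T equals X @ W.T before quantization.
    The first `rank` rotated coordinates use hi_bits, the rest lo_bits.
    """
    X_rot, W_rot = X @ Q, W @ Q
    hi = fake_quant(X_rot[:, :rank], hi_bits) @ fake_quant(W_rot[:, :rank], hi_bits).T
    lo = fake_quant(X_rot[:, rank:], lo_bits) @ fake_quant(W_rot[:, rank:], lo_bits).T
    return hi + lo

# Synthetic stand-in data; a real run would use calibration activations.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16)) * np.linspace(0.1, 3.0, 16)
W = rng.normal(size=(32, 16))

# Illustrative basis: eigenvectors of an equal-weight sum of the two
# trace-normalized second-moment matrices, sorted by descending eigenvalue.
cov_act = X.T @ X / np.trace(X.T @ X)
cov_wgt = W.T @ W / np.trace(W.T @ W)
eigvals, eigvecs = np.linalg.eigh(cov_act + cov_wgt)
Q = eigvecs[:, np.argsort(eigvals)[::-1]]

Y_ref = X @ W.T
Y_mixed = mixed_precision_linear(X, W, Q, rank=4)
rel_err = np.linalg.norm(Y_mixed - Y_ref) / np.linalg.norm(Y_ref)
print(f"relative output error: {rel_err:.4f}")
```

Comparing `rel_err` across different choices of `Q` and `rank` on real calibration data is one way to check whether a joint weight-activation basis helps for a given layer.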
Topics
- Post-Training Quantization
- Large Language Models
- Mixed-Precision Quantization
- Weight-Activation Subspace
- Weighted PCA
Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.