CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs

2026-04-29 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

CoQuant is a post-training quantization (PTQ) method for mixed-precision Large Language Models (LLMs) that accounts for both activation and weight quantization noise when choosing which subspace to keep in high precision. Developed by Zhe Ding, Su Pan, and Duowei Pan, it models the expected output error theoretically and derives a closed-form weighted PCA solution that balances the activation and weight covariances to select that subspace. Prior mixed-precision methods, by contrast, rely solely on activation statistics. In experiments on Llama-3.2 and Qwen2.5 models, CoQuant outperforms strong PTQ baselines, with consistent improvements in WikiText perplexity and zero-shot common-sense reasoning accuracy. The source code for CoQuant is available on GitHub.

Key takeaway

For NLP engineers and research scientists optimizing LLM inference costs, CoQuant offers a principled way to choose the high-precision subspace in low-bit, mixed-precision quantization. Because it models weight and activation noise jointly, it is better grounded than activation-only methods. Consider evaluating it in your quantization workflow, particularly for Llama-3.2 and Qwen2.5 models, where the reported gains are in WikiText perplexity and zero-shot reasoning accuracy.

Key insights

CoQuant improves LLM quantization by jointly optimizing weight and activation subspaces for reduced output error.

Principles

Output error is driven by both activation and weight quantization noise.
Balancing activation and weight covariances is key for optimal subspace selection.
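
To make the two principles concrete, here is a standard first-order error expansion for a single linear layer. It is an illustrative sketch, not the paper's own derivation: the symbols \(W\), \(x\), \(\Delta W\), \(\Delta x\), and the assumption that the two noise terms are zero-mean and independent of each other and of the signal, are introduced here for illustration only.

```latex
\begin{aligned}
\hat{y} - y &= (W + \Delta W)(x + \Delta x) - Wx
             = W\,\Delta x + \Delta W\,x + \Delta W\,\Delta x, \\
\mathbb{E}\,\lVert \hat{y} - y \rVert^2
  &\approx \mathbb{E}\,\lVert W\,\Delta x \rVert^2
         + \mathbb{E}\,\lVert \Delta W\,x \rVert^2 \\
  &= \operatorname{tr}\!\left(W^\top W\,\Sigma_{\Delta x}\right)
   + \operatorname{tr}\!\left(\mathbb{E}[\Delta W^\top \Delta W]\,\Sigma_x\right),
\end{aligned}
\qquad
\Sigma_x = \mathbb{E}[x x^\top],\quad
\Sigma_{\Delta x} = \mathbb{E}[\Delta x\,\Delta x^\top].
```

Under these assumptions the cross term vanishes and the second-order term \(\Delta W\,\Delta x\) is dropped. Activation noise is weighted by the weight Gram matrix \(W^\top W\), and weight noise is weighted by the activation covariance \(\Sigma_x\), which is why a criterion built on activation statistics alone sees only part of the error.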

Method

CoQuant formulates a closed-form weighted PCA solution by theoretically modeling expected output error, balancing activation and weight covariances to select the optimal high-precision subspace.
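
The idea of a weighted PCA over both covariances can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_high_precision_subspace`, the mixing weight `alpha`, and the trace normalization are all hypothetical choices made here, and the actual weighting in CoQuant's closed-form solution may differ.

```python
import numpy as np


def select_high_precision_subspace(X, W, k, alpha=0.5):
    """Return a (d, k) orthonormal basis for a subspace to keep in high precision.

    X: (n, d) calibration activations feeding a linear layer.
    W: (out, d) weight matrix of that layer.
    alpha: hypothetical mixing weight between the two covariances
           (an assumption of this sketch, not a CoQuant parameter).
    """
    n, _ = X.shape
    cov_x = X.T @ X / n            # activation second-moment matrix, (d, d)
    cov_w = W.T @ W / W.shape[0]   # weight Gram matrix over the input dim, (d, d)

    # Normalize by trace so neither term dominates purely because of scale.
    cov_x = cov_x / np.trace(cov_x)
    cov_w = cov_w / np.trace(cov_w)

    # Weighted combination of the two symmetric matrices.
    M = alpha * cov_x + (1.0 - alpha) * cov_w

    # eigh returns eigenvalues in ascending order for symmetric input.
    eigvals, eigvecs = np.linalg.eigh(M)
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top]


# Toy usage: synthetic data with one outlier activation channel.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))
X[:, 3] *= 10.0
W = rng.normal(size=(32, 16))
U = select_high_precision_subspace(X, W, k=4)
```

Setting `alpha=1.0` reduces this sketch to an activation-only PCA, the kind of criterion the summary attributes to prior methods, while intermediate values let directions that matter to the weights compete for the high-precision budget.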

In practice

Apply CoQuant to Llama-3.2 and Qwen2.5 for improved low-bit quantization.
Utilize joint weight-activation modeling for better PTQ accuracy.

Topics

Post-Training Quantization
Large Language Models
Mixed-Precision Quantization
Weight-Activation Subspace
Weighted PCA

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.