KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

2026-02-13 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, extended

Summary

KBVQ-MoE is a vector quantization (VQ) framework for ultra-low-bit compression of Mixture-of-Experts (MoE) large language models (LLMs), whose enormous parameter counts and memory demands make them hard to deploy in resource-constrained environments. It targets two problems: redundant representations across experts, which waste limited VQ codebook capacity, and cumulative output bias, which is amplified when expert outputs are aggregated. The framework combines two techniques. Input-driven Redundancy Elimination (IDRE) uses Karhunen–Loève Transform (KLT)-guided Singular Value Decomposition (SVD) to extract the dominant weight components, share them across experts, and keep them at full precision. Bias-Corrected Output Stabilization (BCOS) applies VQ to the expert-specific representations and corrects the quantized outputs with channel-wise affine compensation. In experiments on MoE LLMs including Qwen1.5-MoE-A2.7B, Qwen3-30B-A3B, Mixtral-8x7B, and DeepseekV2-Lite, KBVQ-MoE largely preserves accuracy: 3-bit quantization of Qwen1.5-MoE-A2.7B reaches an average accuracy of 67.99, close to the FP16 baseline of 68.07, with over 1.5x inference speedup.

Key takeaway

For AI Engineers and Research Scientists deploying MoE LLMs in resource-constrained environments such as edge devices, KBVQ-MoE is a strong option for extreme compression. Because it keeps accuracy close to FP16 at 2-3 bit quantization while cutting memory use and speeding up inference, it lets you deploy larger, more capable models on the same hardware. Consider integrating KBVQ-MoE where the memory and compute cost of MoE architectures is the bottleneck and a small accuracy loss is acceptable.

Key insights

KBVQ-MoE significantly improves ultra-low-bit quantization for MoE LLMs by eliminating expert redundancy and correcting output bias.

Principles

Redundancy elimination improves codebook utilization.
Bias correction stabilizes quantized output distributions.
KLT-guided SVD extracts input-coherent shared structures.
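
The third principle can be illustrated with a minimal NumPy sketch. It shows one possible reading of "KLT-guided SVD", not the paper's exact algorithm: the KLT is taken as the eigendecomposition of the calibration-input covariance, expert weights are rescaled by input energy in that basis, and an SVD over all experts yields a shared low-rank basis. The function names, the `rank` and `eps` parameters, and the choice to share the basis while keeping per-expert coefficients are all assumptions made for illustration.

```python
import numpy as np


def klt_basis(X):
    """Return KLT directions (eigenvectors) and energies (eigenvalues) of inputs X."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / max(X.shape[0] - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to descending energy
    return eigvecs[:, order], np.clip(eigvals[order], 0.0, None)


def split_shared_and_specific(expert_weights, X, rank, eps=1e-8):
    """Split each (d_out, d_in) expert weight into a shared low-rank part and a residual.

    Hypothetical helper: the shared basis would stay at full precision,
    and only the residuals would be handed to the quantizer.
    """
    U, lam = klt_basis(X)
    scale = np.sqrt(lam + eps)                  # per-direction input energy
    # Weights in KLT coordinates, scaled so high-energy input directions dominate.
    scaled = [(W @ U) * scale for W in expert_weights]
    # SVD over all experts stacked row-wise finds directions common to them.
    _, _, Vt = np.linalg.svd(np.concatenate(scaled, axis=0), full_matrices=False)
    V_r = Vt[:rank].T                           # (d_in, rank)
    shared_basis = (V_r.T / scale) @ U.T        # (rank, d_in), shared by all experts
    coeffs = [M @ V_r for M in scaled]          # (d_out, rank), one per expert
    residuals = [W - A @ shared_basis for W, A in zip(expert_weights, coeffs)]
    return shared_basis, coeffs, residuals


rng = np.random.default_rng(0)
# Anisotropic calibration inputs: some input directions carry far more energy.
X = rng.normal(size=(512, 16)) * np.linspace(3.0, 0.1, 16)
experts = [rng.normal(size=(8, 16)) for _ in range(4)]
basis, coeffs, residuals = split_shared_and_specific(experts, X, rank=4)
```

Under these assumptions, each expert weight equals its coefficients times the shared basis plus its residual, and the residuals carry less output energy on the calibration inputs than the original weights, which is what leaves less for a small codebook to represent.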

Method

KBVQ-MoE applies KLT-guided SVD for input-driven redundancy elimination, followed by vector quantization on expert-specific weights and channel-wise affine compensation for bias-corrected output stabilization.
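
The last two stages of this pipeline, vector quantization and channel-wise affine compensation, can be sketched as follows. This is a simplified illustration under stated assumptions rather than the paper's implementation: the codebook is built with plain k-means, the affine correction is a per-output-channel least-squares fit on calibration outputs, and the helper names (`kmeans_codebook`, `vq_quantize`, `fit_channel_affine`) and toy sizes are hypothetical.

```python
import numpy as np


def nearest_code(vectors, codebook):
    """Index of the closest codebook entry for each sub-vector."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Plain k-means codebook (assumption: the paper may build codebooks differently)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = nearest_code(vectors, codebook)
        for j in range(k):
            members = vectors[assign == j]
            if len(members) > 0:                # keep the old centroid if a cluster is empty
                codebook[j] = members.mean(axis=0)
    return codebook


def vq_quantize(W, k=16, dim=4):
    """Replace each length-`dim` slice of W with its nearest codebook entry."""
    d_out, d_in = W.shape
    if d_in % dim != 0:
        raise ValueError("d_in must be divisible by the sub-vector length")
    vectors = W.reshape(-1, dim)
    codebook = kmeans_codebook(vectors, k)
    indices = nearest_code(vectors, codebook)
    return codebook[indices].reshape(d_out, d_in), indices, codebook


def fit_channel_affine(y_quant, y_ref, eps=1e-12):
    """Per-channel least squares so that gain * y_quant + offset approximates y_ref."""
    mean_q, mean_r = y_quant.mean(axis=0), y_ref.mean(axis=0)
    var_q = ((y_quant - mean_q) ** 2).mean(axis=0)
    cov_qr = ((y_quant - mean_q) * (y_ref - mean_r)).mean(axis=0)
    gain = cov_qr / (var_q + eps)
    offset = mean_r - gain * mean_q
    return gain, offset


rng = np.random.default_rng(0)
W = rng.normal(size=(32, 32))                   # stand-in for one expert-specific weight
X = rng.normal(size=(256, 32)) + 0.5            # calibration inputs with non-zero mean
W_q, indices, codebook = vq_quantize(W, k=16, dim=4)   # log2(16) / 4 = 1 index bit per weight
y_ref, y_quant = X @ W.T, X @ W_q.T
gain, offset = fit_channel_affine(y_quant, y_ref)
y_corrected = y_quant * gain + offset
```

The correction adds only two values per output channel, and by construction it removes the mean output error on the calibration set for every channel. Whether it generalizes beyond the calibration data depends on how representative that data is.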

In practice

Achieves near-FP16 accuracy at 2-3 bit quantization.
Reduces memory footprint by up to 87% for MoE LLMs.
Provides over 1.5x inference speedup on 2-bit models.
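
As a rough sanity check on the memory figure, the short calculation below shows the ideal saving from bit width alone. It is back-of-the-envelope arithmetic, not the paper's accounting: it ignores codebook storage, the shared components kept at full precision, and any layers left unquantized, all of which lower the real saving. The function names are illustrative.

```python
import math


def vq_bits_per_weight(codebook_size, subvector_len):
    """Index bits per weight when each sub-vector stores one codebook index."""
    return math.log2(codebook_size) / subvector_len


def memory_reduction(bits_quantized, bits_baseline=16.0):
    """Fraction of weight storage removed relative to the baseline precision."""
    return 1.0 - bits_quantized / bits_baseline


bits_example = vq_bits_per_weight(256, 4)   # 8 index bits over 4 weights = 2.0 bits per weight
reduction_2bit = memory_reduction(2.0)      # 0.875, i.e. 87.5% before overheads
reduction_3bit = memory_reduction(3.0)      # 0.8125, i.e. 81.25% before overheads
```

The 2-bit case gives an ideal ceiling of 87.5% relative to FP16, which is in line with the reported "up to 87%".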

Topics

Mixture-of-Experts
Vector Quantization
Large Language Models
Model Compression
Singular Value Decomposition

Code references

EleutherAI/lm-evaluation-harness

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.