A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models
Summary
This work introduces a lightweight, backpropagation-free sensitivity analysis framework for mixed-precision quantization of hybrid Structured State Space Model (SSM)–Transformer architectures. The method uses only forward-pass metrics to identify the components most susceptible to quantization-induced degradation, avoiding expensive gradient computations and retraining. A formal analysis shows that Kullback–Leibler (KL) divergence captures quantization sensitivity for language modeling tasks better than mean squared error (MSE) or signal-to-quantization-noise ratio (SQNR). Experiments on Hymba hybrid models confirm that KL-based rankings align with observed performance drops. On-device profiling on Intel Lunar Lake hardware shows that KL-guided mixed precision achieves near-FP16 perplexity with model sizes and throughput competitive with uniform INT4. Specifically, Mamba-1.4B was reduced from 5.2 GB to 723 MB (7.2x compression) with minimal perplexity loss, and Mamba2-130M GPU latency was cut by up to 17.6x over the FP16 baseline.
Key takeaway
For AI engineers deploying hybrid SSM-Transformer LLMs on edge devices, this research suggests that KL-guided mixed-precision quantization is worth adopting. The reported results include model compression of up to 7.2x and latency reductions of up to 17.6x while keeping perplexity near FP16. Rank layers by KL divergence and keep critical layers such as "mamba.x_proj" at higher precision, so that efficiency gains do not come at the cost of accuracy.
Key insights
KL divergence is a superior metric for identifying quantization sensitivity in hybrid SSM-Transformer LLMs.
Principles
- Quantization sensitivity is highly non-uniform across model components.
- Forward-pass metrics can effectively guide mixed-precision quantization.
- SQNR is not monotonic with perplexity in language models.
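A small numeric illustration of why MSE and SQNR can disagree with KL divergence. This is a minimal sketch under stated assumptions, not the paper's formal analysis: it perturbs a toy logit vector directly rather than quantizing real weights, and the vector and both perturbations are made up for illustration. The two perturbations have identical MSE and SQNR, yet one leaves the softmax output unchanged while the other shifts it.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def sqnr_db(signal, noisy):
    """Signal-to-quantization-noise ratio in decibels."""
    power = sum(x * x for x in signal)
    noise = sum((x - y) ** 2 for x, y in zip(signal, noisy))
    return 10.0 * math.log10(power / noise)

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy logits (assumption: illustrative values, not from the paper).
logits = [5.0, 0.0, 0.0, 0.0]

# Perturbation A: a constant shift. Softmax is shift-invariant, so the
# predicted distribution is unchanged.
pert_a = [z + 1.0 for z in logits]

# Perturbation B: same per-element error magnitude, but the signs push
# probability mass away from the top token.
pert_b = [logits[0] - 1.0] + [z + 1.0 for z in logits[1:]]

p_ref = softmax(logits)
mse_a, mse_b = mse(logits, pert_a), mse(logits, pert_b)
sqnr_a, sqnr_b = sqnr_db(logits, pert_a), sqnr_db(logits, pert_b)
kl_a = kl_divergence(p_ref, softmax(pert_a))
kl_b = kl_divergence(p_ref, softmax(pert_b))

print(f"A: MSE={mse_a:.3f} SQNR={sqnr_a:.2f} dB KL={kl_a:.4f}")
print(f"B: MSE={mse_b:.3f} SQNR={sqnr_b:.2f} dB KL={kl_b:.4f}")
```

Because perplexity depends on the predicted distribution rather than on raw activation error, a metric computed on that distribution can separate cases that MSE and SQNR treat as equivalent.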
Method
A backpropagation-free, surrogate-based sensitivity analysis framework uses forward-pass metrics to identify sensitive layers. KL divergence guides mixed-precision assignment, retaining higher precision for sensitive components and aggressively quantizing others.
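The forward-only ranking described above can be sketched as follows. This is a hypothetical toy, not the authors' implementation: the three-layer model, the layer names (`block0`, `block1`, `head`), the symmetric round-to-nearest quantizer, and the random calibration inputs are all assumptions chosen to keep the example self-contained. The idea it illustrates is to quantize one layer at a time, run forward passes only, and score each layer by the mean KL divergence between the full-precision and perturbed output distributions.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def fake_quant(w, bits):
    """Symmetric per-tensor round-to-nearest fake quantization (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in w for v in row)
    if max_abs == 0.0:
        return [row[:] for row in w]
    scale = max_abs / qmax
    return [[round(v / scale) * scale for v in row] for row in w]

def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def forward(weights, x):
    """Toy model: linear layers with tanh between them; the last layer emits logits."""
    names = list(weights)
    for i, name in enumerate(names):
        x = matvec(weights[name], x)
        if i < len(names) - 1:
            x = [math.tanh(v) for v in x]
    return x

def rank_by_kl(weights, calib, bits=4):
    """Quantize one layer at a time and rank layers by mean output KL (forward passes only)."""
    ref = [softmax(forward(weights, x)) for x in calib]
    scores = {}
    for name in weights:
        trial = dict(weights)                      # shallow copy; other layers stay full precision
        trial[name] = fake_quant(weights[name], bits)
        outs = [softmax(forward(trial, x)) for x in calib]
        scores[name] = sum(kl_divergence(p, q) for p, q in zip(ref, outs)) / len(calib)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy weights and calibration inputs (seeded for repeatability).
rng = random.Random(0)
def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

weights = {"block0": rand_matrix(4, 4), "block1": rand_matrix(4, 4), "head": rand_matrix(6, 4)}
calib = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(8)]

ranking = rank_by_kl(weights, calib, bits=4)
for name, score in ranking:
    print(f"{name}: mean KL = {score:.6f}")
```

No gradients or retraining are involved: the cost is one reference pass plus one forward pass per layer over the calibration set.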
In practice
- Use KL divergence to rank layer sensitivity when assigning mixed precision in LLMs.
- Prioritize higher precision for "mamba.x_proj" components.
- Exclude SSM conv1d layers from quantization.
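The recommendations above could be turned into a precision plan along these lines. This is a minimal sketch under assumptions: the KL scores are invented, the layer names other than "mamba.x_proj" and the conv1d layer are hypothetical, and the choice of INT8 as the "higher precision" tier and of a 25% keep fraction are illustrative defaults that the summary does not specify.

```python
import math

def assign_precision(kl_scores, keep_fraction=0.25, high_bits=8, low_bits=4,
                     skip_substrings=("conv1d",)):
    """Map each layer name to a precision label based on its KL sensitivity score.

    Layers whose name contains any of skip_substrings are left unquantized (fp16).
    Of the rest, the top keep_fraction by KL score get high_bits; others get low_bits.
    """
    plan = {}
    candidates = {}
    for name, score in kl_scores.items():
        if any(s in name for s in skip_substrings):
            plan[name] = "fp16"          # excluded from quantization
        else:
            candidates[name] = score
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    n_high = math.ceil(keep_fraction * len(ranked)) if ranked else 0
    for i, name in enumerate(ranked):
        plan[name] = f"int{high_bits}" if i < n_high else f"int{low_bits}"
    return plan

# Hypothetical scores for illustration only; real values would come from
# a forward-pass sensitivity sweep on calibration data.
example_scores = {
    "mamba.x_proj": 0.31,
    "mamba.conv1d": 0.02,
    "mamba.in_proj": 0.04,
    "attn.q_proj": 0.01,
    "mlp.down_proj": 0.005,
}

plan = assign_precision(example_scores)
for layer, precision in plan.items():
    print(f"{layer}: {precision}")
```

Raising `keep_fraction` trades model size for accuracy; the right value depends on the perplexity budget and would need to be checked empirically.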
Topics
- Mixed-Precision Quantization
- SSM-Transformer Models
- Kullback-Leibler Divergence
- Quantization Sensitivity Analysis
- Edge Device Deployment
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.