A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation, Internet of Things (IoT) & Connected Devices) · Depth: Expert, extended

Summary

This work introduces a lightweight, backpropagation-free sensitivity analysis framework for mixed-precision quantization of hybrid Structured State Space Model (SSM)–Transformer architectures. The method identifies components most susceptible to quantization-induced degradation using only forward-pass metrics, avoiding expensive gradient computations and retraining. A formal analysis demonstrates that Kullback–Leibler (KL) divergence better captures quantization sensitivity for language modeling tasks than mean squared error (MSE) or signal-to-quantization-noise ratio (SQNR). Experiments on Hymba hybrid models confirm KL-based rankings align with observed performance drops. On-device profiling on Intel Lunar Lake hardware shows KL-guided mixed-precision achieves near-FP16 perplexity with model sizes and throughput competitive with Uniform INT4. Specifically, Mamba-1.4B was reduced from 5.2 GB to 723 MB (7.2x compression) with minimal perplexity loss, and Mamba2-130M GPU latency was cut by up to 17.6x over the FP16 baseline.
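For reference, the three metrics compared above have the following standard textbook definitions. The notation is an assumption made here, not taken from the paper: z and \hat{z} are the full-precision and quantized output logits, x and \hat{x} are a full-precision tensor and its quantized counterpart with n elements, and the full-precision distribution is used as the reference argument of the KL divergence.

```latex
% Standard definitions; symbols are illustrative, not the paper's notation.
\begin{align}
  p &= \operatorname{softmax}(z), \qquad q = \operatorname{softmax}(\hat{z}) \\
  D_{\mathrm{KL}}(p \,\|\, q) &= \sum_{v} p(v) \log \frac{p(v)}{q(v)} \\
  \mathrm{MSE}(x, \hat{x}) &= \frac{1}{n} \lVert x - \hat{x} \rVert_2^2 \\
  \mathrm{SQNR}(x, \hat{x}) &= 10 \log_{10} \frac{\lVert x \rVert_2^2}{\lVert x - \hat{x} \rVert_2^2} \ \text{dB}
\end{align}
```

MSE and SQNR depend only on the size of the error relative to the signal, whereas the KL divergence measures how much the predicted token distribution changes.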

Key takeaway

For AI engineers deploying hybrid SSM-Transformer LLMs on edge devices, this research suggests that KL-guided mixed-precision quantization is a strong option. The reported results include model compression of up to 7.2x and latency reductions of up to 17.6x while staying near FP16 perplexity. Rank layers by KL divergence, keep higher precision in the most sensitive ones such as "mamba.x_proj", and quantize the rest more aggressively.

Key insights

KL divergence identifies quantization-sensitive components in hybrid SSM-Transformer LLMs more reliably than MSE or SQNR.
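One way to see why an error-magnitude metric can mislead for language modeling is a toy example, which is an illustration added here and not the paper's formal analysis. Two perturbations of the same logits have identical MSE: one shifts every logit by the same amount, which leaves the softmax output unchanged, while the other moves probability mass between tokens. The logit values and perturbations are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

# Hypothetical full-precision logits for a 4-token vocabulary.
logits = np.array([2.0, 1.0, 0.0, -1.0])

# Perturbation A: the same offset on every logit (softmax is shift-invariant).
shift = np.full(4, 0.5)
# Perturbation B: same per-element magnitude, but it redistributes probability.
swap = np.array([-0.5, 0.5, -0.5, 0.5])

mse_shift = mse(logits, logits + shift)
mse_swap = mse(logits, logits + swap)
kl_shift = kl_divergence(softmax(logits), softmax(logits + shift))
kl_swap = kl_divergence(softmax(logits), softmax(logits + swap))

print(f"MSE: shift={mse_shift:.4f}  swap={mse_swap:.4f}")
print(f"KL:  shift={kl_shift:.6f}  swap={kl_swap:.6f}")
```

Both perturbations score the same under MSE, but only the second changes the predicted distribution, and only the KL divergence tells them apart.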

Method

A backpropagation-free, surrogate-based sensitivity analysis framework uses forward-pass metrics to identify sensitive layers. KL divergence guides mixed-precision assignment, retaining higher precision for sensitive components and aggressively quantizing others.
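The procedure described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the toy tanh network, the symmetric per-tensor round-to-nearest quantizer, the INT4/INT8 bit widths, the `keep_high` budget, and every layer name other than "mamba.x_proj" are hypothetical. Each layer is quantized alone, the model is run forward on calibration inputs, and the layer is scored by the mean KL divergence between the full-precision and perturbed output distributions.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor round-to-nearest quantization, returned as float."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def forward(weights, x):
    """Toy stand-in model: tanh layers, then a final head producing logits."""
    names = list(weights)
    h = x
    for name in names[:-1]:
        h = np.tanh(h @ weights[name])
    return h @ weights[names[-1]]

def mean_kl(ref_logits, test_logits):
    """Mean KL(reference || test) over the calibration batch."""
    lp, lq = log_softmax(ref_logits), log_softmax(test_logits)
    return float(np.mean(np.sum(np.exp(lp) * (lp - lq), axis=-1)))

def rank_by_kl(weights, calib, bits=4):
    """Quantize one layer at a time; forward passes only, no gradients."""
    ref = forward(weights, calib)
    scores = {}
    for name in weights:
        trial = dict(weights)  # shallow copy; only one entry is replaced
        trial[name] = fake_quantize(weights[name], bits)
        scores[name] = mean_kl(ref, forward(trial, calib))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def assign_precision(ranking, keep_high=1, high_bits=8, low_bits=4):
    """Keep the most sensitive layers at higher precision, the rest at low."""
    return {name: (high_bits if i < keep_high else low_bits)
            for i, (name, _) in enumerate(ranking)}

rng = np.random.default_rng(0)
dim, vocab = 16, 32
weights = {  # hypothetical layer names and random weights
    "mamba.in_proj": rng.normal(size=(dim, dim)) / np.sqrt(dim),
    "mamba.x_proj": rng.normal(size=(dim, dim)) / np.sqrt(dim),
    "attn.qkv": rng.normal(size=(dim, dim)) / np.sqrt(dim),
    "lm_head": rng.normal(size=(dim, vocab)) / np.sqrt(dim),
}
calib = rng.normal(size=(64, dim))  # stand-in calibration batch

ranking = rank_by_kl(weights, calib)
plan = assign_precision(ranking, keep_high=1)
for name, score in ranking:
    print(f"{name:15s} KL={score:.5f} -> INT{plan[name]}")
```

With random weights the resulting order carries no meaning; on a real model the same loop would run over the actual modules and a calibration set drawn from representative text. The cost is one forward pass per candidate layer, which is what makes the approach cheaper than gradient-based sensitivity estimates.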

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.