How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models
Summary
A study investigates how quantization, a standard large language model deployment technique, impacts interpretable features identified by sparse autoencoders (SAEs). Using a frozen SAE as a fixed measurement basis, researchers encoded full-precision and round-to-nearest (RTN) quantized activations on identical tokens for Pythia-70M and Gemma-2-2B models, sweeping bit-widths from INT8 to INT4. The analysis revealed that feature survival is graded, with 62.4 percent of active features surviving at INT6 on Pythia-70M and 51.3 percent on Gemma-2-2B; most non-survivors were blurred rather than destroyed. Feature survival proved predictable from full-precision statistics, achieving cross-validated AUCs of 0.92 to 0.97. Critically, task metrics like perplexity can mask this damage; for instance, INT7 improved perplexity on Gemma-2-2B while degrading 18.7 percent of features. The study also found significant overlap (Jaccard 0.79-0.86) between features damaged by quantization and magnitude pruning, suggesting a shared vulnerability mode.
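The Jaccard overlap quoted above is the size of the intersection of two sets divided by the size of their union. A minimal sketch of that calculation follows; the feature indices are hypothetical and are not data from the study, and the function name `jaccard` is chosen here for illustration.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of feature ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical feature indices, not results from the study.
damaged_by_quantization = {3, 7, 12, 40, 41}
damaged_by_pruning = {3, 7, 12, 41, 58}

overlap = jaccard(damaged_by_quantization, damaged_by_pruning)
print(f"Jaccard overlap: {overlap:.3f}")
```

A value near 1 means the two compression methods damage nearly the same features; a value near 0 means they damage mostly different ones.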
Key takeaway
For machine learning engineers deploying quantized large language models where interpretability matters for safety or steering, perplexity and accuracy metrics alone are insufficient. Interpretability findings from the full-precision model may not transfer: quantization can degrade a substantial share of features (18.7 percent at INT7 on Gemma-2-2B) without hurting task performance. Run feature-level audits to verify that interpretable features remain faithful in the quantized deployment.
Key insights
Quantization systematically degrades LLM interpretable features, a loss often masked by stable task performance.
Principles
- Feature degradation is graded, not abrupt.
- Full-precision statistics predict feature survival.
- Task metrics are insufficient for interpretability audits.
Method
Encode full-precision and quantized activations of the same tokens with a frozen sparse autoencoder, then measure each feature's survival by the Pearson correlation between its full-precision and quantized activations.
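The comparison can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the study's code: it uses symmetric per-tensor round-to-nearest quantization applied directly to a synthetic activation matrix, a plain ReLU encoder with random weights standing in for the frozen SAE, and a hypothetical survival threshold of r >= 0.9. The summary does not specify the study's quantization granularity, SAE architecture, or threshold.

```python
import numpy as np


def rtn_quantize(x, bits):
    """Symmetric per-tensor round-to-nearest quantization (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for INT8, 7 for INT4
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return x.copy()
    scale = max_abs / qmax              # one scale for the whole tensor
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def sae_encode(acts, w_enc, b_enc):
    """Frozen ReLU encoder: rows are tokens, columns are SAE features."""
    return np.maximum(acts @ w_enc + b_enc, 0.0)


def feature_correlations(f_fp, f_q):
    """Pearson r per feature across tokens; NaN if either side is constant."""
    a = f_fp - f_fp.mean(axis=0)
    b = f_q - f_q.mean(axis=0)
    num = (a * b).sum(axis=0)
    denom = np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))
    r = np.full(num.shape, np.nan)
    ok = denom > 0
    r[ok] = num[ok] / denom[ok]
    return r


def survival_rate(f_fp, f_q, threshold=0.9):
    """Share of features active in full precision whose r meets the threshold.

    The 0.9 default is a hypothetical cutoff chosen for this sketch.
    """
    active = (f_fp > 0).any(axis=0)
    r = feature_correlations(f_fp, f_q)
    survived = active & (np.nan_to_num(r, nan=0.0) >= threshold)
    return survived.sum() / max(active.sum(), 1)


# Synthetic stand-ins for model activations and SAE weights (not real data).
rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 32))
w_enc = rng.normal(size=(32, 128)) / np.sqrt(32)
b_enc = np.full(128, -0.5)

f_fp = sae_encode(acts, w_enc, b_enc)   # the same encoder is reused for every bit-width
for bits in (8, 6, 4):
    f_q = sae_encode(rtn_quantize(acts, bits), w_enc, b_enc)
    print(f"INT{bits}: survival = {survival_rate(f_fp, f_q):.3f}")
```

Keeping the encoder frozen is what makes the two runs comparable: any change in a feature's activations then comes from the quantization, not from a retrained dictionary.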
In practice
- Audit quantized models at the feature level.
- Use peak activation to predict feature vulnerability.
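One way to check whether peak activation carries predictive signal is to score each feature by its maximum full-precision activation and compute the AUC of that score against observed survival labels. The sketch below assumes a feature matrix with tokens as rows and features as columns; the toy values and survival labels are hypothetical. It computes a single-statistic AUC without cross-validation, so it is not the study's predictor.

```python
import numpy as np


def peak_activation(f_fp):
    """Maximum full-precision activation of each feature (column) over tokens."""
    return f_fp.max(axis=0)


def auc(scores, labels):
    """Probability that a positive-labeled item outscores a negative one (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one feature in each class")
    diff = pos[:, None] - neg[None, :]      # all positive/negative pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)


# Hypothetical toy data: 3 tokens x 4 features, plus made-up survival labels.
f_fp = np.array([
    [0.0, 2.0, 0.1, 5.0],
    [0.3, 0.0, 0.2, 4.0],
    [0.1, 3.0, 0.0, 0.0],
])
survived = np.array([False, True, False, True])

peaks = peak_activation(f_fp)
print("peak activations:", peaks)
print("AUC of peak activation vs. survival:", auc(peaks, survived))
```

An AUC far from 0.5 in either direction indicates that the statistic separates surviving from non-surviving features; the direction tells you whether high or low peaks are the vulnerable ones.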
Topics
- Large Language Models
- Model Quantization
- Sparse Autoencoders
- Interpretable AI
- Feature Analysis
- Model Compression
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.