Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales
Summary
A systematic empirical study ran more than 40 experiments on transformer compression across GPT-2 (124M parameters) and Mistral 7B (7.24B parameters), covering spectral compression, block-level function replacement, rotation-based quantization, activation geometry, and adaptive early exit. Its main findings: high-variance activation directions are largely uncorrelated with predictive directions; transformer blocks are conditionally linear, with R^2 of about 0.95 for GPT-2 and 0.93 for Mistral block 31 when fitted under the correct upstream distribution; direct quantization is superior to factoring weights due to error amplification; and linearity increases with depth, from R^2 = 0.17 at Mistral block 0 to 0.93 at block 31. Approximately 30 percent of tokens were identified as computationally easy. Replacing a single block with a linear map achieved 34x compression of Mistral 7B's final block with a 1.71 perplexity increase, while multi-block replacement failed.
Key takeaway
For AI engineers optimizing large language models for deployment, these structural properties matter. Static post-training compression may face fundamental limits because residual error accumulates and the activation distribution shifts across blocks. Adaptive, per-token computation, especially for the roughly 30 percent of tokens identified as computationally easy, may offer more effective and stable compression without significant perplexity degradation.
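As a rough illustration of what per-token adaptive computation can look like, the sketch below marks a token as "easy" when an intermediate prediction is already confident. The confidence criterion, the `threshold` value, and the `early_exit_mask` helper are all assumptions made for this example; the summary does not say how the study decided which tokens were easy.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def early_exit_mask(intermediate_logits, threshold=0.9):
    # Hypothetical rule: a token may skip the remaining blocks when the
    # top-1 probability from an intermediate layer reaches the threshold.
    probs = softmax(intermediate_logits)
    return probs.max(axis=-1) >= threshold

# Three toy tokens over a four-word vocabulary (made-up logits).
logits = np.array([
    [8.0, 0.0, 0.0, 0.0],   # confident
    [1.0, 0.9, 1.1, 1.0],   # uncertain
    [0.0, 6.0, 0.5, 0.0],   # confident
])
mask = early_exit_mask(logits)
print(mask, f"exit fraction = {mask.mean():.2f}")
```

In a real model the intermediate logits would come from applying the output head to a hidden state partway through the stack, and the threshold would need tuning against perplexity.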
Key insights
Variance in transformer activations does not equate to importance for predictive performance.
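A toy example can make the distinction concrete. The sketch below builds synthetic "activations" in which the target depends only on the lowest-variance dimension, then compares the top principal direction with the least-squares predictive direction. The data, dimensions, and noise level are invented for illustration and are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 16

# Synthetic activations: dimension 0 has the most variance, dimension d-1 the least.
scales = np.linspace(10.0, 0.5, d)
X = rng.normal(size=(n, d)) * scales

# Assumption of this toy: the target depends only on the lowest-variance dimension.
y = X[:, -1] + 0.1 * rng.normal(size=n)

# Highest-variance direction: first right singular vector of the centered data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
top_pc = Vt[0]

# Predictive direction: normalized least-squares weights.
w, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
w_dir = w / np.linalg.norm(w)

cosine = abs(float(top_pc @ w_dir))
print(f"|cos(top PC, predictive direction)| = {cosine:.3f}")
```

Here the two directions are nearly orthogonal by construction, so truncating to the top principal components would discard exactly the signal the predictor needs.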
Principles
- High-variance directions are 96% uncorrelated with predictive directions.
- Transformer block linearity is conditional on upstream distribution.
- Linearity increases with model depth.
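The idea of linearity that is conditional on the input distribution can be sketched with a stand-in block. The example below fits an affine map to a toy residual block on one input distribution, then scores R^2 both on that distribution and on a shifted one. The `toy_block` function and the input scales are assumptions for illustration; they are not the study's models or measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W = rng.normal(size=(d, d)) / np.sqrt(d)

def toy_block(x):
    # Stand-in for a transformer block: residual path plus a tanh nonlinearity.
    return x + np.tanh(x @ W)

def with_bias(X):
    # Append a constant column so the fitted map is affine.
    return np.hstack([X, np.ones((len(X), 1))])

def fit_affine(X, Y):
    M, *_ = np.linalg.lstsq(with_bias(X), Y, rcond=None)
    return M

def r2(M, X, Y):
    resid = Y - with_bias(X) @ M
    total = Y - Y.mean(axis=0)
    return 1.0 - float((resid ** 2).sum() / (total ** 2).sum())

# Fit on the "correct" upstream distribution (small-scale inputs).
X_in = 0.3 * rng.normal(size=(4000, d))
M = fit_affine(X_in, toy_block(X_in))

# Evaluate on fresh in-distribution inputs and on shifted (larger-scale) inputs.
X_test = 0.3 * rng.normal(size=(4000, d))
X_shift = 3.0 * rng.normal(size=(4000, d))
r2_in = r2(M, X_test, toy_block(X_test))
r2_shift = r2(M, X_shift, toy_block(X_shift))
print(f"R^2 in-distribution: {r2_in:.3f}, under shift: {r2_shift:.3f}")
```

The same fitted map scores well on the inputs it was fitted for and noticeably worse once the inputs move, which is one way a linear replacement that works for a single block can degrade when upstream blocks are also replaced.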
Method
The study ran over 40 experiments on GPT-2 and Mistral 7B, analyzing transformer compressibility through spectral compression, block-level function replacement, rotation-based quantization, activation geometry analysis, and adaptive early exit.
In practice
- Direct quantization is strictly superior to factored weight approaches.
- Single-block linear replacement can achieve 34x compression of Mistral 7B's final block, with a 1.71 perplexity increase.
- Adaptive, per-token computation is a more promising compression direction than static post-training compression.
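The quantization point can be illustrated with a small numerical sketch. It compares quantizing a weight matrix directly against quantizing two full-rank SVD factors and multiplying them back. The random matrix, the 4-bit per-tensor min-max quantizer, and the square-root split of the singular values are assumptions chosen for this example, not the study's setup.

```python
import numpy as np

def quantize(M, bits=4):
    # Uniform min-max quantization with one scale per tensor, then dequantize.
    lo, hi = M.min(), M.max()
    step = (hi - lo) / (2 ** bits - 1)
    return np.round((M - lo) / step) * step + lo

def rel_error(W, W_hat):
    # Relative Frobenius-norm reconstruction error.
    return float(np.linalg.norm(W - W_hat) / np.linalg.norm(W))

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))

# Direct: quantize W itself.
err_direct = rel_error(W, quantize(W))

# Factored: W = A @ B from a full-rank SVD, quantize each factor separately.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U * np.sqrt(s)             # scales the columns of U
B = np.sqrt(s)[:, None] * Vt   # scales the rows of Vt
err_factored = rel_error(W, quantize(A) @ quantize(B))

print(f"direct: {err_direct:.3f}, factored: {err_factored:.3f}")
```

In the factored case the rounding error in each factor is multiplied by the other factor, so the product carries more error than a single direct rounding of W, even though no rank was truncated.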
Topics
- Transformer Compression
- Model Quantization
- Block Linearity
- Adaptive Early Exit
- GPT-2
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.