Variance Is Not Importance: Structural Analysis of Transformer Compressibility Across Model Scales

2026-04-22 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A systematic empirical study ran more than 40 experiments on transformer compression across GPT-2 (124M parameters) and Mistral 7B (7.24B parameters). The analysis covered spectral compression, block-level function replacement, rotation-based quantization, activation geometry, and adaptive early exit. Key findings: high-variance activation directions are largely uncorrelated with predictive directions, and transformer blocks exhibit conditional linearity, with R^2 values around 0.95 for GPT-2 and 0.93 for Mistral block 31 under the correct upstream distribution. Linearity also increases with model depth, from R^2 = 0.17 in Mistral's block 0 to 0.93 in block 31. Direct quantization proved superior to factoring weights, because factoring amplifies error. Approximately 30 percent of tokens were identified as computationally easy. Single-block linear replacement achieved 34x compression on Mistral 7B's final block at the cost of a 1.71 perplexity increase, while multi-block replacement failed.

Key takeaway

For AI engineers optimizing large language models for deployment, these structural properties matter. Static post-training compression may face fundamental limits, because residual error accumulates and the activation distribution shifts across blocks. Adaptive, per-token computation strategies, especially for the approximately 30 percent of tokens identified as computationally easy, may deliver more effective and stable compression without significant perplexity degradation.

Key insights

Variance in transformer activations does not equate to importance for predictive performance.
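
A toy illustration of this point, not taken from the study: the sketch below builds hypothetical two-dimensional "activations" in which one direction has very large variance but no relation to the target, while a low-variance direction carries all of the signal. The data, the scales (10.0 and 0.1), and the helper names are assumptions chosen only to make the contrast visible.

```python
import random

def variance(xs):
    """Population variance of a sequence of floats."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical activation directions (assumed scales, not measured values):
loud = [rng.gauss(0.0, 10.0) for _ in range(n)]   # high variance, irrelevant
quiet = [rng.gauss(0.0, 0.1) for _ in range(n)]   # low variance, predictive

# Toy target that depends only on the quiet direction, plus small noise.
target = [q + rng.gauss(0.0, 0.01) for q in quiet]

var_loud, var_quiet = variance(loud), variance(quiet)
corr_loud, corr_quiet = pearson(loud, target), pearson(quiet, target)

print(f"variance    loud={var_loud:.3f}  quiet={var_quiet:.5f}")
print(f"correlation loud={corr_loud:.3f}  quiet={corr_quiet:.3f}")
```

In this construction, a compression scheme that keeps only the highest-variance direction would retain `loud` and discard the one direction that predicts the target.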

Principles

High-variance directions are 96% uncorrelated with predictive directions.
Transformer block linearity is conditional on upstream distribution.
Linearity increases with model depth.
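
To make "conditional on upstream distribution" concrete, here is a minimal sketch under stated assumptions: a scalar `tanh` stands in for a nonlinear block (it is not a transformer layer), and the two input distributions are hypothetical. The same function is fit with a least-squares line under a narrow and a wide input distribution, and R^2 is compared.

```python
import math
import random

def linear_fit_r2(xs, ys):
    """Fit y ~ a*x + b by ordinary least squares and return R^2 of the fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def toy_block(x):
    # Hypothetical stand-in for a nonlinear block; an assumption for illustration.
    return math.tanh(x)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Two assumed "upstream" input distributions feeding the same function.
narrow = [rng.gauss(0.0, 0.3) for _ in range(5000)]  # stays in the near-linear regime
wide = [rng.gauss(0.0, 3.0) for _ in range(5000)]    # pushes into saturation

r2_narrow = linear_fit_r2(narrow, [toy_block(x) for x in narrow])
r2_wide = linear_fit_r2(wide, [toy_block(x) for x in wide])

print(f"R^2 under narrow inputs: {r2_narrow:.3f}")
print(f"R^2 under wide inputs:   {r2_wide:.3f}")
```

The function is identical in both cases; only the input distribution changes. That is one plausible reading of why a linear replacement fitted for one block can stop fitting once upstream blocks are also replaced and the distribution shifts.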

Method

The study used spectral compression, block-level function replacement, rotation-based quantization, activation geometry, and adaptive early exit to analyze transformer compressibility.

In practice

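Direct quantization is strictly superior to factored weight approaches, which amplify error.
Single-block linear replacement achieved 34x compression on Mistral 7B's final block at a 1.71 perplexity increase; multi-block replacement failed.
Adaptive, per-token computation is a more effective compression direction than static post-training compression.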
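
The summary does not say which exit criterion the study used, so the following is only a sketch of one common form of adaptive early exit: stop at the first layer whose top softmax probability reaches a threshold. The function name `early_exit`, the 0.9 threshold, and the per-layer logits are all hypothetical; a real model would also need a readout head applied to intermediate layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit(per_layer_logits, threshold=0.9):
    """Return (layer_index, token_id) for the first layer whose top
    probability reaches `threshold`; fall back to the last layer."""
    if not per_layer_logits:
        raise ValueError("need logits from at least one layer")
    top = 0
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold:
            return layer, top
    # No layer was confident enough: use the final layer's prediction.
    return len(per_layer_logits) - 1, top

# Hypothetical logits over a 3-token vocabulary at 3 successive layers.
easy_token = [[4.0, 0.0, 0.0], [6.0, 0.0, 0.0], [8.0, 0.0, 0.0]]
hard_token = [[0.2, 0.1, 0.0], [1.0, 0.5, 0.0], [0.0, 5.0, 0.0]]

print(early_exit(easy_token))  # confident at the first layer
print(early_exit(hard_token))  # needs the full depth
```

Under this scheme the saved computation comes from tokens like `easy_token`, which exit before the remaining layers run, while harder tokens still use the full stack.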

Topics

Transformer Compression
Model Quantization
Block Linearity
Adaptive Early Exit
GPT-2

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.