Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale
Summary
The paper introduces a diagnostic for LayerNorm transformers that identifies "algebraic dead directions" in parameter space, tied to the inverse-scale direction of the LayerNorm affine. This direction is an exact algebraic kernel of the post-final-norm centred activation covariance. Because it can be read from the LayerNorm scale parameter alone, the prediction itself needs no forward pass, backward pass, or eigensolve, which the authors present as the cheapest dead-direction diagnostic available. The method was tested on 14 pretrained transformers (9 LayerNorm, 5 RMSNorm; 160M-35B parameters; language and vision objectives). At random initialization, the predicted direction matched the measured bottom singular direction to four decimal places on all 9 LayerNorm models and was correctly absent in the 5 RMSNorm models. On trained checkpoints, the covariance eigenvalue along this direction deepened by roughly 10^3x. The diagnostic also classifies a transformer's normalization type from parameters alone, and it shows that the residual stream's smallest singular value is preserved block-to-block in 13 of the 14 transformers; the single exception, Gemma-31B, is identified explicitly.
Key takeaway
For machine learning engineers working on large language models, this diagnostic offers a low-cost way to locate dead directions and probe singular minima in LayerNorm transformers. The normalization type can be classified from model parameters alone, and the accompanying analysis singles out models such as Gemma-31B, where the residual stream's smallest singular value is not preserved block-to-block. Because the predicted direction needs no forward or backward pass, it can inform architectural changes and fine-tuning strategies at little cost.
Key insights
A novel diagnostic identifies algebraic dead directions in LayerNorm transformers from parameters alone, without forward/backward passes.
Principles
- Pretrained transformers operate near singular minima.
- LayerNorm's inverse-scale direction indicates dead directions.
- Mean-subtraction projectors create these dead directions.
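The link between mean subtraction and the dead direction can be written out in a few lines. The derivation below is a sketch under stated assumptions rather than the paper's own proof: the symbols x̂ (the mean-subtracted, variance-normalized activation), γ and β (LayerNorm scale and shift), and 1 (the all-ones vector) are notation chosen here, γ⁻¹ is taken to be the elementwise inverse, and every entry of γ is assumed nonzero.

```latex
\begin{align*}
y &= \gamma \odot \hat{x} + \beta, \qquad \mathbf{1}^\top \hat{x} = 0
  && \text{(mean subtraction)} \\
(\gamma^{-1})^\top (y - \beta)
  &= \sum_i \gamma_i^{-1} \gamma_i \hat{x}_i = \mathbf{1}^\top \hat{x} = 0 \\
\operatorname{Cov}(y)\, \gamma^{-1}
  &= \operatorname{diag}(\gamma)\, \operatorname{Cov}(\hat{x})\, \operatorname{diag}(\gamma)\, \gamma^{-1}
   = \operatorname{diag}(\gamma)\, \operatorname{Cov}(\hat{x})\, \mathbf{1} = 0
\end{align*}
```

The last equality holds because every sample of x̂ sums to zero, so 1 lies in the kernel of Cov(x̂). RMSNorm performs no mean subtraction, so the same argument does not go through, which is consistent with the direction being absent in the RMSNorm models.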
Method
Identify dead directions by reading the inverse-scale direction γ⁻¹/||γ⁻¹|| directly from the LayerNorm scale parameter, bypassing forward/backward passes or eigensolves.
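A small numerical check can make the method concrete. The NumPy sketch below is illustrative only and is not the authors' code: the helper names (`inverse_scale_direction`, `layer_norm`, `rms_norm`, `bottom_eigpair`), the toy dimensions, and the random Gaussian inputs standing in for residual-stream activations are all assumptions. It reads the predicted direction from γ alone, then compares it with the bottom eigenvector of the centred output covariance, using an RMSNorm-style layer as a contrast.

```python
import numpy as np

def inverse_scale_direction(gamma):
    """Predicted dead direction: elementwise 1/gamma, normalized to unit length."""
    inv = 1.0 / gamma
    return inv / np.linalg.norm(inv)

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over the last axis (mean subtraction, then scaling)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm over the last axis (no mean subtraction, no shift)."""
    ms = (x ** 2).mean(axis=-1, keepdims=True)
    return gamma * x / np.sqrt(ms + eps)

def bottom_eigpair(y):
    """Smallest eigenvalue and its eigenvector of the centred sample covariance."""
    yc = y - y.mean(axis=0, keepdims=True)
    cov = yc.T @ yc / (y.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvals[0], eigvecs[:, 0]

# Toy setup (assumed values, not taken from any real checkpoint).
rng = np.random.default_rng(0)
d, n = 16, 4096
gamma = rng.uniform(0.5, 2.0, size=d)
beta = rng.normal(size=d)
x = rng.normal(size=(n, d))

# Prediction uses the scale parameter only; no activations are needed.
predicted = inverse_scale_direction(gamma)

# Measurement, for verification only: run the layer and eigendecompose.
ln_eigval, ln_eigvec = bottom_eigpair(layer_norm(x, gamma, beta))
ln_alignment = abs(ln_eigvec @ predicted)

# Contrast: without mean subtraction there is no exact kernel.
rms_eigval, _ = bottom_eigpair(rms_norm(x, gamma))

print(f"LayerNorm: bottom eigenvalue {ln_eigval:.2e}, alignment {ln_alignment:.6f}")
print(f"RMSNorm:   bottom eigenvalue {rms_eigval:.2e}")
```

In this toy setting the LayerNorm covariance has a bottom eigenvalue at numerical zero whose eigenvector aligns with the predicted direction, while the RMSNorm covariance stays full rank. The forward pass and `eigh` call appear only to verify the prediction; the diagnostic itself is the single call to `inverse_scale_direction`.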
In practice
- Classify transformer normalization from parameters.
- Diagnose singular structure in trained LLMs.
- Flag models whose residual stream does not preserve its smallest singular value block-to-block, as with Gemma-31B.
Topics
- Layer Normalization
- Transformer Architectures
- Dead Directions
- Singular Minima
- LLM Diagnostics
- Parameter Analysis
Best for: AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.