Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The paper introduces a diagnostic for LayerNorm transformers that identifies "algebraic dead directions" in parameter space, tied to the inverse-scale direction of the LayerNorm affine, γ⁻¹/||γ⁻¹||. This direction is an exact algebraic kernel of the centred activation covariance after the final norm. It can be read from the LayerNorm scale parameter alone, with no forward pass, backward pass, or eigensolve, so the diagnostic costs almost nothing to compute.

The method was tested on 14 pretrained transformers (9 LayerNorm, 5 RMSNorm; 160M to 35B parameters; language and vision objectives). At random initialization, the predicted direction matched the measured bottom singular direction to four decimal places on all 9 LayerNorm models and was correctly absent in the RMSNorm models. On trained checkpoints, the covariance eigenvalue along this direction deepened by roughly 10^3x. The same diagnostic classifies a transformer's normalization type from parameters alone and shows that the residual stream's smallest singular value is preserved from block to block in 13 of the 14 transformers, with Gemma-31B as the single identified exception.
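The kernel claim follows from the standard LayerNorm definition. A short derivation is sketched here, assuming every entry of γ is nonzero; the symbols x̂ (normalized activation), β (LayerNorm bias), d (model width), and Σ (centred covariance of the output y) are notation introduced for illustration and are not taken from the summary:

```latex
\begin{aligned}
y &= \gamma \odot \hat{x} + \beta,
  \qquad \hat{x} = \frac{x - \mu(x)\,\mathbf{1}}{\sqrt{\sigma^2(x) + \epsilon}},
  \qquad \mathbf{1}^\top \hat{x} = 0, \\
(\gamma^{-1})^\top (y - \beta)
  &= \sum_{i=1}^{d} \frac{\gamma_i\,\hat{x}_i}{\gamma_i}
   = \mathbf{1}^\top \hat{x} = 0 \quad \text{for every token}, \\
\Sigma\,\gamma^{-1}
  &= \mathbb{E}\!\left[(y - \bar{y})\,(\gamma^{-1})^\top (y - \bar{y})\right] = 0,
  \qquad \bar{y} = \mathbb{E}[y].
\end{aligned}
```

The last line uses the fact that (γ⁻¹)ᵀ(y − β) vanishes for every token, so it also vanishes for the mean ȳ. RMSNorm divides by the root mean square without subtracting the mean, so 1ᵀx̂ is generally nonzero and the same argument does not go through, which is consistent with the direction being absent in the RMSNorm models.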

Key takeaway

For machine learning engineers working on large language models, this diagnostic is a low-cost way to identify singular minima and dead directions in LayerNorm transformers. It classifies the normalization type and flags outliers such as Gemma-31B directly from model parameters. That supports more targeted architectural and fine-tuning decisions without expensive forward or backward passes.

Key insights

A novel diagnostic identifies algebraic dead directions in LayerNorm transformers from parameters alone, without forward/backward passes.

Principles

Method

Identify dead directions by reading the inverse-scale direction γ⁻¹/||γ⁻¹|| directly from the LayerNorm scale parameter, bypassing forward/backward passes or eigensolves.
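A minimal NumPy sketch of this step, checked on synthetic data rather than on any of the models in the paper. The function names (`inverse_scale_direction`, `layer_norm`, `rms_norm`), the random inputs, and the dimensions are all hypothetical choices for illustration. The forward pass below exists only to verify the prediction; the direction itself is computed from `gamma` alone.

```python
import numpy as np

def inverse_scale_direction(gamma):
    """Unit vector along 1/gamma, read from the LayerNorm scale alone."""
    inv = 1.0 / np.asarray(gamma, dtype=np.float64)
    return inv / np.linalg.norm(inv)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last (feature) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # RMSNorm: rescales by the root mean square, no mean subtraction.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

# Synthetic stand-ins for activations and norm parameters (assumptions).
rng = np.random.default_rng(0)
n_tokens, d_model = 512, 64
x = rng.normal(size=(n_tokens, d_model))
gamma = rng.uniform(0.5, 2.0, size=d_model)
beta = rng.normal(size=d_model)

# Prediction: needs only the scale parameter.
v_pred = inverse_scale_direction(gamma)

# Verification: bottom right-singular vector of the centred LayerNorm output.
y = layer_norm(x, gamma, beta)
yc = y - y.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(yc, full_matrices=False)
v_meas = vt[-1]  # singular values are sorted in descending order

ln_alignment = abs(float(v_pred @ v_meas))        # |cosine| between the two
ln_variance = float(np.mean((yc @ v_pred) ** 2))  # variance along v_pred

# Contrast: the same direction carries real variance under RMSNorm.
z = rms_norm(x, gamma)
zc = z - z.mean(axis=0, keepdims=True)
rms_variance = float(np.mean((zc @ v_pred) ** 2))

print(f"LayerNorm alignment: {ln_alignment:.6f}")
print(f"LayerNorm variance along v_pred: {ln_variance:.3e}")
print(f"RMSNorm variance along v_pred: {rms_variance:.3e}")
```

On a real checkpoint, `gamma` would be the scale of the final normalization layer. How that tensor is named depends on the model implementation and is not specified in the summary.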

Best for: AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.