Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

Source: stat.ML updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A new forward-pass-only diagnostic identifies "dead directions" in LayerNorm (LN) transformers: parameter space directions along which the Fisher information metric degenerates. The method, developed by Tejas Pradeep Shirodkar and P. J. Narayanan, shows that the inverse-scale direction γ⁻¹/||γ⁻¹|| of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance. This particular direction can be read directly from the LN scale parameter γ, with no forward pass, backward pass, or eigensolve, so detecting it costs essentially nothing. The diagnostic was validated on 14 pretrained transformers (9 LN, 5 RMSNorm; 160M to 35B parameters) across language and vision tasks. At random initialization, the predicted direction matched the measured bottom singular direction to four decimal places on all 9 LN models, and it was correctly absent in the RMSNorm models. Training deepens the covariance eigenvalue along this direction by approximately 10³× and opens further dead directions. The work also shows that the residual stream's smallest singular value is preserved block-to-block on 13 of the 14 transformers, with Gemma 4-31B being the notable exception.
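
The kernel claim can be checked by hand. The short derivation below is a sketch in our own notation, not the paper's: it assumes a standard LayerNorm with elementwise scale γ (all entries nonzero) and shift β, writes z for the normalized input, bars for averages over tokens, and Σ for the centred covariance of the output y.

```latex
\begin{align*}
  y &= \gamma \odot z + \beta,
  \qquad z = \frac{x - \mu(x)\,\mathbf{1}}{\sigma(x)},
  \qquad \mathbf{1}^{\top} z = 0, \\
  (\gamma^{-1})^{\top} (y - \bar{y})
    &= \sum_i \frac{1}{\gamma_i}\,\gamma_i\,(z_i - \bar{z}_i)
     = \mathbf{1}^{\top} z - \mathbf{1}^{\top} \bar{z} = 0, \\
  \Sigma\,\gamma^{-1}
    &= \mathbb{E}\!\left[(y - \bar{y})\,(y - \bar{y})^{\top} \gamma^{-1}\right] = 0 .
\end{align*}
```

Under these assumptions the mean subtraction is what forces 1ᵀz = 0. RMSNorm rescales without centring, so 1ᵀz is generally nonzero and the same argument gives no kernel, which is consistent with the direction being absent in the RMSNorm models.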

Key takeaway

For Machine Learning Engineers diagnosing model pathologies or optimizing transformer architectures, this research provides a low-cost diagnostic. You can identify the architecturally guaranteed dead direction in a LayerNorm model by computing γ⁻¹/||γ⁻¹|| from the LN scale parameter γ, which gives a known reference direction for sanity-checking a measurement protocol. You can also exclude such directions from importance-based LoRA adapter placement to avoid wasting rank, and use the LN/RMSNorm dichotomy to screen new normalization schemes for inherent singular structure.
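
As an illustration of the screening idea, the following is a minimal NumPy sketch on synthetic data, not the authors' code. The helper names (`layer_norm`, `rms_norm`, `relative_min_eigenvalue`) and the Gaussian inputs are our assumptions. It reports how small the bottom covariance eigenvalue is relative to the top one for each normalization.

```python
import numpy as np

def layer_norm(x, gamma, eps=1e-5):
    # Standard LayerNorm without shift: centre, rescale, apply elementwise gamma.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps)

def rms_norm(x, gamma, eps=1e-5):
    # RMSNorm: rescale by the root mean square, with no centring step.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def relative_min_eigenvalue(acts):
    # Ratio |lambda_min| / lambda_max of the centred activation covariance.
    centred = acts - acts.mean(axis=0, keepdims=True)
    cov = centred.T @ centred / (acts.shape[0] - 1)
    evals = np.linalg.eigvalsh(cov)  # ascending order
    return abs(evals[0]) / evals[-1]

rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 64))            # synthetic "tokens" (assumption)
gamma = rng.uniform(0.5, 1.5, size=64)     # hypothetical scale parameter

ln_ratio = relative_min_eigenvalue(layer_norm(x, gamma))
rms_ratio = relative_min_eigenvalue(rms_norm(x, gamma))
print(f"LayerNorm ratio: {ln_ratio:.2e}")
print(f"RMSNorm ratio:   {rms_ratio:.2e}")
```

A ratio at floating-point noise level indicates a built-in singular direction; a ratio well above it indicates none in this toy setting. A candidate normalization scheme can be dropped in as a third function and screened the same way.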

Key insights

The LayerNorm inverse-scale parameter directly reveals a transformer's algebraic dead direction without complex computation.

Principles

Method

Read the inverse-scale direction γ⁻¹/||γ⁻¹|| from the LayerNorm affine scale parameter γ, where γ⁻¹ is the elementwise reciprocal. Compare it with the bottom singular direction of the post-final-norm centred activation covariance.
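
The two steps above can be sketched as follows. This is a toy NumPy example under stated assumptions, not the authors' implementation: the activations are synthetic Gaussian vectors passed through a hand-written LayerNorm, and `gamma`, `beta`, and the helper names are hypothetical stand-ins for a real model's final-norm parameters and captured activations.

```python
import numpy as np

def inverse_scale_direction(gamma):
    # Unit vector along the elementwise reciprocal of the LN scale.
    inv = 1.0 / gamma
    return inv / np.linalg.norm(inv)

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last axis with affine scale and shift.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def bottom_direction(acts):
    # Smallest eigenpair of the centred activation covariance.
    centred = acts - acts.mean(axis=0, keepdims=True)
    cov = centred.T @ centred / (acts.shape[0] - 1)
    evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return evals[0], evecs[:, 0]

rng = np.random.default_rng(0)
d_model, n_tokens = 64, 4096
gamma = rng.uniform(0.5, 1.5, size=d_model)    # hypothetical LN scale
beta = rng.normal(scale=0.1, size=d_model)     # hypothetical LN shift
hidden = rng.normal(size=(n_tokens, d_model))  # synthetic pre-norm activations

acts = layer_norm(hidden, gamma, beta)
predicted = inverse_scale_direction(gamma)     # step 1: read from the parameter
lam_min, measured = bottom_direction(acts)     # step 2: measure from activations

# The sign of an eigenvector is arbitrary, so compare by absolute cosine.
cosine = abs(predicted @ measured)
print(f"|cos(predicted, measured)| = {cosine:.6f}, lambda_min = {lam_min:.2e}")
```

With a real model, `gamma` would come from the final LayerNorm's weight tensor and `acts` from activations captured after that layer; only the second step needs a forward pass.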

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.