Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
Summary
A study of Transformer optimization finds that different module types benefit from different weight-space geometries. The researchers applied Manifold Muon to the pretraining of GPT-2 small (124M parameters), placing Stiefel and DGram constraints on the attention and MLP blocks in different combinations. The best configuration, termed "Hetero," assigned Stiefel geometry to attention layers and DGram geometry to MLP layers and reached the lowest validation loss, 3.3544. Configurations with DGram constraints on attention layers, including the inverted "Hetero-Inv" and "All-DGram" setups, became unstable under shared hyperparameters. The instability is attributed to singular value growth in DGram-constrained attention weights, which amplifies attention logits and saturates the softmax. The findings support module-specific, geometry-aware optimization in Transformers.
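The saturation mechanism described above can be illustrated with a small NumPy sketch. It is not the paper's experiment: it uses a single attention head with no mask, random inputs, and a uniform scale factor `s` standing in for singular value growth in the query and key projections. All names (`mean_attention_entropy`, `w_q`, `w_k`) and sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mean_attention_entropy(x, w_q, w_k):
    """Mean entropy (in nats) of the row-wise attention distributions."""
    d_head = w_q.shape[1]
    logits = (x @ w_q) @ (x @ w_k).T / np.sqrt(d_head)
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 16, 64, 64  # illustrative sizes only
x = rng.standard_normal((seq_len, d_model))

# Start from orthogonal projections, so every singular value equals 1.
w_q, _ = np.linalg.qr(rng.standard_normal((d_model, d_head)))
w_k, _ = np.linalg.qr(rng.standard_normal((d_model, d_head)))

# Scaling both projections by s multiplies every logit by s**2.
entropies = {s: mean_attention_entropy(x, s * w_q, s * w_k)
             for s in (1.0, 2.0, 4.0, 8.0)}

for s, h in entropies.items():
    print(f"singular value scale {s}: mean attention entropy {h:.4f}")
```

As the scale grows, the entropy of each attention row falls toward zero, meaning each query attends almost entirely to one key. That is the saturated regime in which softmax gradients become very small.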
Key takeaway
AI scientists and machine learning engineers optimizing Transformer models should consider module-specific manifold constraints rather than a uniform approach. In the reported GPT-2 small experiments, assigning Stiefel geometry to attention layers and DGram geometry to MLP layers gave stable training and the lowest validation loss (3.3544). Ignoring this asymmetry, particularly by applying DGram to attention, risks singular value growth, softmax saturation, and unstable training.
Key insights
Transformer optimization benefits from module-specific manifold constraints, with Stiefel for attention and DGram for MLP layers.
Principles
- Attention layers require spectrally bounded geometry.
- MLP layers can benefit from scale-preserving freedom.
- Uniform manifold constraints are suboptimal for Transformers.
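One way to see why a spectral bound matters for attention is the standard inequality below. It is a generic linear-algebra fact, not a derivation taken from the paper, and the symbols ($x_i$ for a token representation, $W_Q$ and $W_K$ for the query and key projections, $\sigma_{\max}$ for the largest singular value) are assumptions introduced here.

```latex
\[
  \lvert q_i^\top k_j \rvert
  = \lvert x_i^\top W_Q W_K^\top x_j \rvert
  \le \sigma_{\max}(W_Q)\,\sigma_{\max}(W_K)\,\lVert x_i \rVert\,\lVert x_j \rVert
\]
```

A matrix on the Stiefel manifold has orthonormal columns, $W^\top W = I$, so all of its singular values equal 1 and the bound reduces to $\lVert x_i \rVert\,\lVert x_j \rVert$. If the singular values are free to grow, the bound on the logits grows with them.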
Method
The study compares layer-wise assignments of Stiefel and DGram constraints to attention and MLP blocks, using Manifold Muon in GPT-2 small pretraining, and evaluates each configuration by validation loss and training stability.
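The configurations named in the summary can be written down as a small lookup table. The sketch below only records which geometry label goes with which module type; it does not implement Manifold Muon or either constraint. The parameter names follow a GPT-2-style naming pattern that is assumed here for illustration, and the helpers `module_type` and `assign_geometry` are hypothetical. Parameters outside attention and MLP blocks are left unassigned because the summary does not say how they are treated.

```python
STIEFEL, DGRAM = "stiefel", "dgram"

# Configurations named in the summary, keyed by module type.
CONFIGS = {
    "Hetero":     {"attn": STIEFEL, "mlp": DGRAM},
    "Hetero-Inv": {"attn": DGRAM,   "mlp": STIEFEL},
    "All-DGram":  {"attn": DGRAM,   "mlp": DGRAM},
}

def module_type(param_name):
    """Classify a parameter by substring (assumed naming convention)."""
    if ".attn." in param_name:
        return "attn"
    if ".mlp." in param_name:
        return "mlp"
    return None  # embeddings, norms, etc. are not covered by the summary

def assign_geometry(param_names, config):
    """Map each attention/MLP parameter name to its geometry label."""
    table = CONFIGS[config]
    plan = {}
    for name in param_names:
        kind = module_type(name)
        if kind is not None:
            plan[name] = table[kind]
    return plan

# Hypothetical parameter names for one Transformer block.
names = [
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_proj.weight",
    "h.0.mlp.c_fc.weight",
    "h.0.mlp.c_proj.weight",
    "wte.weight",
]

plan = assign_geometry(names, "Hetero")
for name, geometry in plan.items():
    print(f"{name}: {geometry}")
```

Keeping the assignment in one table makes it straightforward to run every configuration under the same hyperparameters, which is the comparison the summary describes.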
In practice
- Implement Stiefel constraints for Transformer attention weights.
- Use DGram constraints for Transformer MLP/FFN weights.
- Analyze singular value growth in attention projections.
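As a rough starting point for the first and third items, the sketch below projects a weight matrix onto the Stiefel manifold and reports its extreme singular values. It uses the generic SVD-based projection (the polar factor), which is an assumption made for illustration and not the Manifold Muon update from the paper. DGram is not sketched because the summary does not define it. The function names and the matrix shape are hypothetical.

```python
import numpy as np

def project_to_stiefel(w):
    """Nearest matrix in Frobenius norm with orthonormal columns.

    For a tall matrix with thin SVD w = U S V^T, this is U V^T.
    """
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

def spectral_stats(w):
    """Largest and smallest singular values, for monitoring during training."""
    s = np.linalg.svd(w, compute_uv=False)
    return {"sigma_max": float(s.max()), "sigma_min": float(s.min())}

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 16))  # tall matrix; the shape is illustrative
w_st = project_to_stiefel(w)

before = spectral_stats(w)
after = spectral_stats(w_st)
print("before projection:", before)
print("after projection: ", after)
```

Logging `sigma_max` for the query, key, value, and output projections at regular intervals is one simple way to notice the singular value growth the summary associates with unstable attention.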
Topics
- Transformer Optimization
- Manifold Muon
- Weight-Space Geometry
- Stiefel Constraints
- DGram Constraints
- Attention Mechanisms
- MLP Layers
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.