Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
Summary
A study of Transformer optimization finds that different modules prefer different weight-space manifold geometries. The researchers used Manifold Muon for GPT-2 pretraining and compared Stiefel and DGram constraints across attention and MLP blocks. Among the configurations compared, assigning Stiefel geometry to attention layers and DGram geometry to MLP layers performed best. The inverted assignment and an all-DGram configuration were unstable under shared hyperparameters. The authors attribute this instability to singular value growth in DGram-constrained attention weights, which can amplify attention logits and saturate the softmax. The work concludes that geometry-aware optimization for Transformers should be module-specific rather than uniformly applied.
Key takeaway
AI scientists and machine learning engineers optimizing Transformer models should consider module-specific weight-space geometry: Stiefel constraints on attention layers and DGram constraints on MLP layers. In the study, this assignment performed best and avoided the singular value growth and softmax saturation that made the all-DGram and inverted assignments unstable.
Key insights
Transformer optimization benefits from module-specific manifold geometry, with Stiefel for attention and DGram for MLP layers.
Principles
- Manifold constraints should be module-specific.
- A uniform all-DGram geometry was unstable under shared hyperparameters.
- Singular value growth in attention weights can amplify attention logits and saturate the softmax.
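The link between singular value growth and softmax saturation can be illustrated with a small NumPy sketch. This is not the study's experiment: the shapes, the random weights, and the uniform `scale` factor (standing in for growth of the singular values of the query and key projections) are all assumptions chosen for illustration. Scaling both projections by `s` scales the attention logits by `s**2`, and the attention distribution collapses toward one-hot as its entropy falls.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(p):
    # Average Shannon entropy (in nats) of each row of attention weights.
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 16, 64, 32
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
w_k = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)

def attention_stats(scale):
    # Multiplying a matrix by `scale` multiplies all its singular values by `scale`.
    q = x @ (scale * w_q)
    k = x @ (scale * w_k)
    logits = q @ k.T / np.sqrt(d_head)
    probs = softmax(logits)
    return float(np.abs(logits).max()), mean_entropy(probs)

stats = {s: attention_stats(s) for s in (1.0, 4.0, 16.0)}
for s, (max_logit, entropy) in stats.items():
    print(f"scale={s:5.1f}  max|logit|={max_logit:10.2f}  entropy={entropy:.4f}")
```

As the scale grows, the largest logit grows quadratically and the mean entropy of the attention rows drops toward zero, which is what saturation looks like numerically.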
Method
The study compared layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks during GPT-2 pretraining using Manifold Muon.
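For readers unfamiliar with the Stiefel constraint, the sketch below shows what it means for a weight matrix: orthonormal columns, so every singular value equals 1. The summary does not describe how Manifold Muon enforces the constraint, so the polar projection via SVD used here is a standard stand-in, not the study's update rule, and the helper name `project_to_stiefel` is hypothetical. DGram is not sketched because the summary does not define it.

```python
import numpy as np

def project_to_stiefel(w):
    # Polar projection: replace every singular value of w with 1.
    # The result is the closest matrix (in Frobenius norm) with orthonormal columns.
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

# Hypothetical tall weight matrix, for illustration only.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))
q = project_to_stiefel(w)

# On the Stiefel manifold, q.T @ q is the identity and all singular values are 1.
singular_values = np.linalg.svd(q, compute_uv=False)
print(np.allclose(q.T @ q, np.eye(32)), singular_values.min(), singular_values.max())
```

Because the singular values are pinned at 1, a Stiefel-constrained attention projection cannot exhibit the singular value growth described above.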
In practice
- Apply Stiefel geometry to Transformer attention layers.
- Assign DGram geometry to Transformer MLP layers.
- Avoid all-DGram or inverted (DGram attention, Stiefel MLP) assignments, which were unstable under shared hyperparameters.
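The recommendations above amount to a per-parameter routing rule. A minimal sketch follows, assuming GPT-2-style parameter names containing `.attn.` and `.mlp.`; the function `assign_geometry`, the string labels, and the choice to leave all other parameters unconstrained are illustrative assumptions, not details from the study.

```python
def assign_geometry(param_name):
    """Return the manifold label for a parameter, or None if unconstrained.

    Assumes GPT-2-style names such as 'h.0.attn.c_attn.weight'. Only 2-D
    weight matrices are routed; biases and everything outside the attention
    and MLP blocks are left unconstrained (an assumption of this sketch).
    """
    if not param_name.endswith(".weight"):
        return None
    if ".attn." in param_name:
        return "stiefel"  # attention projections
    if ".mlp." in param_name:
        return "dgram"    # MLP projections
    return None

# Hypothetical parameter names, for illustration only.
names = [
    "wte.weight",
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_attn.bias",
    "h.0.mlp.c_fc.weight",
    "ln_f.weight",
]
assignment = {name: assign_geometry(name) for name in names}
for name, geometry in assignment.items():
    print(f"{name:28s} -> {geometry}")
```

An optimizer wrapper could use such a mapping to build separate parameter groups, one per geometry, so that each group receives its own constraint.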
Topics
- Transformer Optimization
- Weight-Space Geometry
- Manifold Constraints
- Stiefel Geometry
- DGram Geometry
- GPT-2
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.