Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
Summary
A study of Transformer optimization finds that different modules prefer distinct weight-space manifold geometries, challenging the common practice of applying one constraint uniformly. The researchers ran Manifold Muon during GPT-2 pretraining and compared assignments of Stiefel and DGram constraints across attention and MLP blocks. Among the configurations compared, Stiefel geometry on attention layers with DGram geometry on MLP layers performed best. The inverted assignment and an all-DGram configuration were both unstable under shared hyperparameters. The authors attribute this instability to singular value growth in DGram-constrained attention weights, which can amplify attention logits and lead to softmax saturation. The work suggests that geometry-aware optimization for Transformers should be module-specific.
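The saturation mechanism described above can be illustrated with a small numerical sketch. It does not reproduce the study's experiments: the logits are random, and the `scales` list is a hypothetical stand-in for the effect of growing singular values in the query and key projections, which multiply the magnitude of the attention logits.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; the small epsilon guards against log(0).
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
logits = rng.standard_normal(16)  # hypothetical attention logits for one query

# Assumption: larger singular values act like a multiplicative scale on the logits.
scales = [1.0, 4.0, 16.0, 64.0]
max_probs = [float(softmax(s * logits).max()) for s in scales]
entropies = [entropy(softmax(s * logits)) for s in scales]

for s, p, h in zip(scales, max_probs, entropies):
    print(f"scale={s:5.1f}  max prob={p:.3f}  entropy={h:.3f}")
```

As the scale grows, the largest attention probability approaches 1 and the entropy falls toward 0, which is the saturated regime in which softmax gradients become very small.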
Key takeaway
Machine learning engineers optimizing Transformer models should consider module-specific weight-space geometry constraints rather than a uniform approach. In the study's GPT-2 pretraining runs, assigning Stiefel geometry to attention layers and DGram geometry to MLP layers was the most stable and best-performing configuration, which makes it a reasonable starting point. Ignoring these module-specific preferences risks optimization instability, for example through softmax saturation in attention layers.
Key insights
Transformer optimization benefits from module-specific manifold constraints rather than a single geometry applied uniformly.
Principles
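- Different Transformer modules prefer distinct manifold geometries.
- A uniform or mismatched assignment (all-DGram, or the inverted Stiefel/DGram assignment) can be unstable under shared hyperparameters.
- Singular value growth in DGram-constrained attention weights can amplify attention logits and cause softmax saturation.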
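To make the singular value point concrete, here is a minimal sketch of the Stiefel constraint only. It relies on the standard definition of the Stiefel manifold (matrices with orthonormal columns, so that W^T W = I) and uses an SVD-based projection. This is not the Manifold Muon update rule, and DGram is not sketched because this summary does not define it. The matrix shape and scale are arbitrary choices for illustration.

```python
import numpy as np

def project_to_stiefel(w):
    # Polar factor U @ Vt from the thin SVD: the closest matrix to w
    # (in Frobenius norm) whose columns are orthonormal.
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 16)) * 3.0  # hypothetical unconstrained weight
w_st = project_to_stiefel(w)

sv_raw = np.linalg.svd(w, compute_uv=False)    # singular values before projection
sv_st = np.linalg.svd(w_st, compute_uv=False)  # singular values after projection

print("largest singular value before:", sv_raw.max())
print("singular value range after:", sv_st.min(), sv_st.max())
```

Every singular value of a matrix on the Stiefel manifold equals 1, so a weight held on it cannot develop the kind of singular value growth described in the last principle.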
Method
The study applied Manifold Muon to GPT-2 pretraining and compared assignments of Stiefel and DGram constraints across attention and MLP blocks, assessing both performance and training stability.
In practice
- Apply Stiefel geometry to Transformer attention layers.
- Apply DGram geometry to Transformer MLP layers.
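One way to express this assignment is a small name-based routing function, sketched below. The parameter names are hypothetical examples modeled on common GPT-2 naming, the labels "stiefel" and "dgram" are placeholders for whatever the optimizer expects, and leaving embeddings and layer norms unconstrained is an assumption, since this summary does not say how the study treated them.

```python
def assign_geometry(param_name):
    """Return a manifold label for a parameter, or None to leave it unconstrained."""
    is_weight = param_name.endswith(".weight")
    if is_weight and ".attn." in param_name:
        return "stiefel"  # attention projections
    if is_weight and ".mlp." in param_name:
        return "dgram"    # MLP projections
    return None           # assumption: everything else is unconstrained

# Hypothetical parameter names, for illustration only.
example_names = [
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_proj.weight",
    "h.0.mlp.c_fc.weight",
    "h.0.mlp.c_proj.weight",
    "h.0.ln_1.weight",
    "wte.weight",
]
plan = {name: assign_geometry(name) for name in example_names}

for name, geometry in plan.items():
    print(f"{name:28s} -> {geometry}")
```

Keeping the routing in one function makes it easy to swap in the inverted or uniform assignments when reproducing the comparison.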
Topics
- Transformer Optimization
- Weight-Space Geometry
- Manifold Muon
- GPT-2
- Stiefel Geometry
- DGram Geometry
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.