Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A study of Transformer optimization finds that different module types benefit from different weight-space geometries. The researchers studied Manifold Muon while pretraining GPT-2 small (124M parameters), applying Stiefel and DGram constraints to the attention and MLP blocks. The best configuration, termed "Hetero," assigned Stiefel geometry to attention layers and DGram geometry to MLP layers and reached the lowest validation loss, 3.3544. By contrast, configurations with DGram constraints on attention layers, including the inverted "Hetero-Inv" and "All-DGram" setups, became unstable under shared hyperparameters. The authors attribute this instability to singular value growth in DGram-constrained attention weights, which amplifies attention logits and saturates the softmax. The findings argue for module-specific, geometry-aware optimization in Transformers.
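The saturation mechanism described above can be illustrated with a small, self-contained sketch. It is not taken from the paper: the logits are made up, and the `scale` factor is only a stand-in for growth in the singular values of the query and key projections. It shows how multiplying a fixed set of logits by a larger factor pushes the softmax toward a one-hot distribution with near-zero entropy.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical attention logits for one query over four keys.
base_logits = [1.0, 0.5, 0.0, -0.5]

# `scale` stands in for growth in the product of the largest singular
# values of the query and key projections (an illustrative assumption).
for scale in (1.0, 4.0, 16.0):
    probs = softmax([scale * z for z in base_logits])
    print(f"scale={scale:>4}: max_prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

As the scale grows, nearly all probability mass lands on a single key, which is the regime in which softmax gradients become very small.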

Key takeaway

AI scientists and machine learning engineers optimizing Transformer models should consider module-specific manifold constraints rather than a single uniform constraint. In the reported GPT-2 small experiments, assigning Stiefel geometry to attention layers and DGram geometry to MLP layers gave the lowest validation loss (3.3544), while configurations that applied DGram to attention became unstable under the same hyperparameters. Ignoring this asymmetry, particularly by applying DGram to attention, risks singular value growth, softmax saturation, and unstable training trajectories.

Key insights

Transformer optimization benefits from module-specific manifold constraints, with Stiefel for attention and DGram for MLP layers.

Principles

Attention layers require spectrally bounded geometry.
MLP layers can benefit from scale-preserving freedom.
Uniform manifold constraints are suboptimal for Transformers.
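
One way to read the first principle is through a standard norm bound on attention logits. The notation here is an assumption for illustration, not the paper's: token states x_i in R^d, projections W_Q and W_K in R^{d x d_h}, queries q_i = W_Q^T x_i, and keys k_j = W_K^T x_j.

```latex
\[
\frac{\lvert q_i^\top k_j \rvert}{\sqrt{d_h}}
= \frac{\lvert x_i^\top W_Q W_K^\top x_j \rvert}{\sqrt{d_h}}
\le \frac{\sigma_{\max}(W_Q)\,\sigma_{\max}(W_K)\,\lVert x_i \rVert_2\,\lVert x_j \rVert_2}{\sqrt{d_h}}
\]
```

If a projection has orthonormal columns (W^T W = I, the defining property of the Stiefel manifold), all of its singular values equal 1, so the bound depends only on the input norms. If the singular values are free to grow, the bound grows with them. How the study applies the constraint in detail (per head or per fused projection) is not stated in this summary.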

Method

The study compares layer-wise assignments of Stiefel and DGram constraints to attention and MLP blocks, using Manifold Muon in GPT-2 small pretraining, and evaluates each assignment for validation loss and training stability.
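
The comparison can be pictured as a small lookup from module type to geometry. The sketch below is a hypothetical illustration, not the authors' code: the parameter names follow the common GPT-2 naming scheme (`h.0.attn.c_attn.weight` and similar), "All-Stiefel" is an assumed label for the remaining uniform baseline, and the geometry strings are placeholders that an optimizer would have to interpret. It also does not address how a fused query/key/value matrix would be split.

```python
# Geometry per module type for each configuration named in the summary.
# "All-Stiefel" is an assumed label for the remaining uniform baseline.
CONFIGS = {
    "Hetero":      {"attn": "stiefel", "mlp": "dgram"},
    "Hetero-Inv":  {"attn": "dgram",   "mlp": "stiefel"},
    "All-DGram":   {"attn": "dgram",   "mlp": "dgram"},
    "All-Stiefel": {"attn": "stiefel", "mlp": "stiefel"},
}

def module_type(param_name):
    """Classify a parameter by name; returns 'attn', 'mlp', or None."""
    if not param_name.endswith(".weight"):
        return None  # biases are left unconstrained in this sketch
    parts = param_name.split(".")
    if "attn" in parts:
        return "attn"
    if "mlp" in parts:
        return "mlp"
    return None  # embeddings, layer norms, and anything else

def assign_geometry(param_names, config_name):
    """Map each constrained weight name to its geometry label."""
    geometry_by_type = CONFIGS[config_name]
    assignment = {}
    for name in param_names:
        kind = module_type(name)
        if kind is not None:
            assignment[name] = geometry_by_type[kind]
    return assignment

# Hypothetical parameter names from one GPT-2 style block.
names = [
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_attn.bias",
    "h.0.mlp.c_fc.weight",
    "h.0.ln_1.weight",
    "wte.weight",
]
for config in CONFIGS:
    print(config, assign_geometry(names, config))
```

Keeping the assignment in one table makes it easy to run every configuration under the same hyperparameters, which is the comparison the summary describes.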

In practice

Implement Stiefel constraints for Transformer attention weights.
Use DGram constraints for Transformer MLP/FFN weights.
Analyze singular value growth in attention projections.
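
The first and third items above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the Manifold Muon update itself: `project_to_stiefel` uses the standard SVD-based polar projection onto matrices with orthonormal columns, and `spectral_stats` and `max_logit_bound` are hypothetical helper names for monitoring. Weights are assumed to have shape (d_model, d_head) with queries computed as `x @ w_q`. DGram is left out because this summary does not define it.

```python
import numpy as np

def project_to_stiefel(w):
    """Return the nearest matrix with orthonormal columns (polar factor).

    Assumes w has shape (n, p) with n >= p and full column rank.
    """
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

def spectral_stats(w):
    """Largest and smallest singular values of a weight matrix."""
    s = np.linalg.svd(w, compute_uv=False)  # sorted in descending order
    return {"sigma_max": float(s[0]), "sigma_min": float(s[-1])}

def max_logit_bound(w_q, w_k, x_norm):
    """Upper bound on |q . k| / sqrt(d_head) for inputs with norm <= x_norm."""
    d_head = w_q.shape[1]
    s_q = spectral_stats(w_q)["sigma_max"]
    s_k = spectral_stats(w_k)["sigma_max"]
    return s_q * s_k * x_norm ** 2 / np.sqrt(d_head)

# Hypothetical unconstrained projections (random, for illustration only).
rng = np.random.default_rng(0)
w_q = rng.normal(size=(64, 16))
w_k = rng.normal(size=(64, 16))

print("unconstrained:", spectral_stats(w_q), max_logit_bound(w_q, w_k, 2.0))

# After projection every singular value is 1 up to rounding, so the
# logit bound depends only on the input norm and the head dimension.
w_q_st = project_to_stiefel(w_q)
w_k_st = project_to_stiefel(w_k)
print("stiefel:", spectral_stats(w_q_st), max_logit_bound(w_q_st, w_k_st, 2.0))
```

Logging `sigma_max` for the attention projections over training steps is one simple way to spot the singular value growth that the summary links to instability.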

Topics

Transformer Optimization
Manifold Muon
Weight-Space Geometry
Stiefel Constraints
DGram Constraints
Attention Mechanisms
MLP Layers

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.