Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

2026-06-11 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study on Transformer optimization reveals that different modules prefer distinct weight-space manifold geometries, challenging the common practice of uniform constraint application. Researchers investigated Manifold Muon during GPT-2 pretraining, comparing layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. The findings indicate that applying Stiefel geometry to attention layers and DGram geometry to MLP layers yields the best performance. Conversely, inverted assignments or an all-DGram configuration proved unstable under shared hyperparameters. This instability is attributed to singular value growth in DGram-constrained attention weights, which can amplify attention logits and lead to softmax saturation. The work suggests that geometry-aware optimization for transformers should be module-specific.

Key takeaway

For machine learning engineers training Transformers with manifold-constrained optimizers such as Manifold Muon, the study argues for module-specific weight-space geometry rather than one uniform constraint. In the GPT-2 pretraining experiments, assigning Stiefel geometry to attention layers and DGram geometry to MLP layers gave the best performance, while the inverted and all-DGram configurations were unstable under shared hyperparameters. Ignoring these module-specific preferences risks optimization instability, for example through softmax saturation in attention layers.

Key insights

Transformer optimization benefits from module-specific manifold geometry constraints, not uniform application.

Principles

Different transformer modules prefer distinct manifold geometries.
Uniform manifold constraints can lead to optimization instability.
Singular value growth in attention weights can cause softmax saturation.
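
The saturation mechanism in the last principle can be illustrated with a small NumPy sketch. It is not the paper's experiment: the token count, model width, random inputs, and the `scale` factor (a stand-in for uniformly grown singular values of the query and key projections) are all assumptions chosen for illustration. Scaling both projections by `scale` multiplies every attention logit by `scale ** 2`, and the mean entropy of the attention rows drops as the softmax concentrates on single tokens.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_attention_entropy(scale, seed=0, n_tokens=16, d_model=32):
    """Mean entropy of attention rows when all singular values of W_Q, W_K equal `scale`."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_tokens, d_model))  # toy token representations
    # Orthogonal matrices (all singular values 1), then scaled uniformly.
    w_q = scale * np.linalg.qr(rng.standard_normal((d_model, d_model)))[0]
    w_k = scale * np.linalg.qr(rng.standard_normal((d_model, d_model)))[0]
    logits = (x @ w_q) @ (x @ w_k).T / np.sqrt(d_model)
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)  # small epsilon avoids log(0)
    return float(entropy.mean())

# Same inputs and directions each time; only the spectral scale changes.
entropies = [mean_attention_entropy(s) for s in (1.0, 2.0, 4.0, 8.0)]
print(entropies)
```

A Stiefel-constrained projection keeps every singular value at 1, which corresponds to the `scale = 1.0` case here; the larger scales mimic unchecked singular value growth.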

Method

The study ran Manifold Muon during GPT-2 pretraining and compared assignments of Stiefel and DGram constraints across attention and MLP blocks, assessing performance and stability under shared hyperparameters.
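
For readers unfamiliar with the Stiefel constraint, here is a minimal sketch of what it means for a weight matrix: the columns are orthonormal, so W^T W = I and every singular value equals 1. The projection below uses the polar factor from an SVD, a standard way to find the nearest such matrix. It is an illustration of the constraint only, not Manifold Muon's update rule, and the function name `project_to_stiefel` and the matrix shape are hypothetical. The summary does not define DGram, so no counterpart is sketched for it.

```python
import numpy as np

def project_to_stiefel(w):
    """Return the nearest matrix with orthonormal columns (polar factor of w)."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt  # drop the singular values, keep only the directions

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 16))  # tall toy weight matrix: 64 rows, 16 columns
w_st = project_to_stiefel(w)

gram = w_st.T @ w_st                                      # should be the 16x16 identity
singular_values = np.linalg.svd(w_st, compute_uv=False)   # should all be 1

print(np.allclose(gram, np.eye(16)), np.allclose(singular_values, 1.0))
```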

In practice

Apply Stiefel geometry to Transformer attention layers.
Apply DGram geometry to Transformer MLP layers.
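
One way the assignment above might be wired into a training setup is a name-based routing function. This is a sketch under stated assumptions: the geometry labels are plain strings rather than a real optimizer API, the parameter names follow a GPT-2-style naming scheme (`.attn.` and `.mlp.` substrings), and leaving biases, embeddings, and layer norms unconstrained is this sketch's choice, not something the summary specifies.

```python
from collections import Counter

# Hypothetical labels; the summary names the geometries but not an API for them.
STIEFEL, DGRAM, UNCONSTRAINED = "stiefel", "dgram", "unconstrained"

def assign_geometry(param_name: str) -> str:
    """Pick a geometry label from a parameter name (naming scheme is assumed)."""
    if not param_name.endswith(".weight"):
        return UNCONSTRAINED   # biases are left alone in this sketch
    if ".attn." in param_name:
        return STIEFEL         # attention projections
    if ".mlp." in param_name:
        return DGRAM           # MLP projections
    return UNCONSTRAINED       # embeddings, layer norms, and anything else

# Example parameter names in an assumed GPT-2-style layout.
example_names = [
    "transformer.wte.weight",
    "transformer.h.0.ln_1.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.attn.c_proj.weight",
    "transformer.h.0.mlp.c_fc.weight",
    "transformer.h.0.mlp.c_proj.weight",
    "transformer.h.0.mlp.c_fc.bias",
]

plan = {name: assign_geometry(name) for name in example_names}
counts = Counter(plan.values())

for name, geometry in plan.items():
    print(f"{geometry:>13}  {name}")
```

Keeping the routing in one function makes it easy to try the inverted or uniform assignments the study compared against by changing two return values.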

Topics

Transformer Optimization
Weight-Space Geometry
Manifold Muon
GPT-2
Stiefel Geometry
DGram Geometry

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.