Gating Enables Curvature: A Geometric Expressivity Gap in Attention
Summary
Multiplicative gating, increasingly applied to attention layers in large language models, improves model performance and training stability. This study examines what gating changes mathematically: it models attention outputs as the mean parameters of Gaussian distributions and analyzes the Fisher-Rao geometry those distributions induce. The analysis shows that ungated attention operators, because of their affine structure, are confined to intrinsically flat statistical manifolds. Multiplicative gating removes that restriction and enables non-flat geometries, including positively curved manifolds that ungated attention cannot realize. The result is a geometric expressivity gap between the two mechanisms. Empirically, gated models show higher representation curvature and perform better on tasks that demand nonlinear decision boundaries, with no consistent advantage on tasks whose boundaries are linear. The study also identifies a structured regime in which curvature accumulates under composition, producing a systematic depth amplification effect.
Key takeaway
For research scientists developing or optimizing large language models, the geometric expressivity gap between gated and ungated attention has a practical consequence. If a model struggles on tasks that require complex, nonlinear decision boundaries, adding multiplicative gating to its attention layers could improve performance by enabling non-flat, curved representations. Prioritize gated architectures where representational curvature helps; on tasks with linear boundaries the study found no consistent advantage.
Key insights
Gating enables non-flat geometric representations in attention, overcoming the flat manifold limitation of ungated attention.
Principles
- Ungated attention yields flat statistical manifolds.
- Gating enables non-flat manifolds, including positively curved ones.
- Higher representation curvature accompanies better performance on tasks with nonlinear decision boundaries.
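The flat versus non-flat distinction can be checked numerically on a toy example. The sketch below is not the study's construction: it assumes a two-parameter family of Gaussians in R^3 with unit isotropic covariance, in which case the Fisher-Rao metric coincides with the Euclidean metric induced on the surface of means, so the intrinsic curvature is the Gaussian curvature of that surface. The weights `A`, `b`, `G` and the sigmoid gate are hypothetical choices made only for illustration.

```python
import numpy as np

# Hypothetical toy weights (not taken from the article).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, 1.0])
G = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, -1.0]])

def ungated(theta):
    """Affine mean map: theta in R^2 -> mean in R^3."""
    return A @ theta + b

def gated(theta):
    """Same affine map, multiplied elementwise by a sigmoid gate."""
    gate = 1.0 / (1.0 + np.exp(-(G @ theta)))
    return gate * (A @ theta + b)

def gaussian_curvature(mean_map, theta, h=1e-3):
    """Gaussian curvature of the surface theta -> mean_map(theta) in R^3,
    estimated with central finite differences."""
    du = np.array([h, 0.0])
    dv = np.array([0.0, h])
    r = mean_map
    r_u = (r(theta + du) - r(theta - du)) / (2 * h)
    r_v = (r(theta + dv) - r(theta - dv)) / (2 * h)
    r_uu = (r(theta + du) - 2 * r(theta) + r(theta - du)) / h**2
    r_vv = (r(theta + dv) - 2 * r(theta) + r(theta - dv)) / h**2
    r_uv = (r(theta + du + dv) - r(theta + du - dv)
            - r(theta - du + dv) + r(theta - du - dv)) / (4 * h**2)
    # First fundamental form; equals the Fisher-Rao metric under the
    # unit-isotropic-covariance assumption stated above.
    E, F, G_ = r_u @ r_u, r_u @ r_v, r_v @ r_v
    normal = np.cross(r_u, r_v)
    normal /= np.linalg.norm(normal)
    # Second fundamental form.
    L, M, N = r_uu @ normal, r_uv @ normal, r_vv @ normal
    return (L * N - M**2) / (E * G_ - F**2)

theta0 = np.array([0.0, 0.0])
k_ungated = gaussian_curvature(ungated, theta0)
k_gated = gaussian_curvature(gated, theta0)
print(f"ungated curvature: {k_ungated:.2e}")
print(f"gated curvature:   {k_gated:.2e}")
```

The affine map traces out a plane, so its curvature vanishes up to floating-point noise, while the gated map yields a clearly nonzero value. The sign and size of that value depend on the chosen weights; this toy is not meant to reproduce the positively curved manifolds the study describes.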
Method
Attention outputs are modeled as the mean parameters of Gaussian distributions, and the induced Fisher-Rao geometry is analyzed to characterize the representational geometry of attention.
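For orientation, here is the standard form this metric takes in the simplest case. It is a sketch under assumptions the summary does not state: the covariance is fixed and isotropic, only the mean depends on the parameters, and the symbols (theta, mu, J, A, b, sigma) are introduced here for illustration. J is the Jacobian of the mean map and is assumed to have full column rank.

```latex
% Assumption: fixed isotropic covariance \sigma^2 I; only the mean depends on \theta.
p_\theta = \mathcal{N}\big(\mu(\theta),\, \sigma^2 I\big),
\qquad
g_{ij}(\theta) = \frac{1}{\sigma^2}\,
  \left(\frac{\partial \mu(\theta)}{\partial \theta_i}\right)^{\!\top}
  \frac{\partial \mu(\theta)}{\partial \theta_j}
\quad\Longleftrightarrow\quad
g(\theta) = \frac{1}{\sigma^2}\, J(\theta)^\top J(\theta).

% Affine mean map: constant Jacobian, hence constant metric and zero curvature.
\mu(\theta) = A\theta + b
\;\Longrightarrow\;
g(\theta) = \frac{1}{\sigma^2}\, A^\top A,
\qquad
\Gamma^{k}_{ij} = 0,
\qquad
R^{l}{}_{ijk} = 0.
```

Under these assumptions an affine mean map gives a metric that does not vary with theta, which is one way to read the claim that affine structure forces flatness. A multiplicative gate makes J depend on theta, which is necessary, though not by itself sufficient, for nonzero curvature.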
In practice
- Use gated attention for tasks needing nonlinear boundaries.
- Consider gating in deeper models: in the structured regime the study identifies, curvature accumulates as layers are composed.
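As a concrete reference point, a gated attention layer can be sketched in a few lines of NumPy. The summary does not say where the gate is applied or how it is parameterized, so the choices here are assumptions: a single head, an elementwise sigmoid gate computed from the layer input, applied to the attention output. The names `Wg` and `gated_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (ungated)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def gated_attention(x, Wq, Wk, Wv, Wg):
    """Attention output multiplied elementwise by a sigmoid gate.
    Gate placement and parameterization are assumptions of this sketch."""
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))
    return gate * attention(x, Wq, Wk, Wv)

# Small random example: 4 tokens, model width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv, Wg = (rng.normal(size=(8, 8)) for _ in range(4))

plain = attention(x, Wq, Wk, Wv)
out = gated_attention(x, Wq, Wk, Wv, Wg)
print(plain.shape, out.shape)
```

Because the gate lies in (0, 1), each gated output entry is a rescaled copy of the ungated one, with a scale that depends on the input. Whether this placement matches the variant the study analyzes is not confirmed by the summary.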
Topics
- Multiplicative Gating
- Attention Mechanisms
- Geometric Expressivity
- Fisher-Rao Geometry
- Statistical Manifolds
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.