Gating Enables Curvature: A Geometric Expressivity Gap in Attention
Summary
A new study by Anand A. Joshi and Satwik Bathula, published on April 16, 2026, investigates the mathematical implications of multiplicative gating in attention layers, a technique increasingly used in large language models to enhance performance and training stability. The research models attention outputs as mean parameters of Gaussian distributions and analyzes their induced Fisher-Rao geometry. It demonstrates that ungated attention operators are confined to intrinsically flat statistical manifolds due to their affine structure. In contrast, multiplicative gating enables non-flat geometries, including positively curved manifolds that are otherwise unattainable. This finding establishes a geometric expressivity gap between ungated and gated attention, with empirical evidence showing gated models achieve higher representation curvature and improved performance on tasks requiring nonlinear decision boundaries, while offering no consistent advantage for linear decision boundaries. The study also identifies a structured regime where curvature accumulates under composition, leading to a systematic depth amplification effect.
Key takeaway
For research scientists developing or optimizing large language models, the geometric view helps explain when gated attention pays off. According to the study, multiplicative gating lets attention layers represent curved geometries that ungated attention cannot reach, and gated models showed higher representation curvature and better performance on tasks requiring nonlinear decision boundaries. The study reports no consistent advantage on tasks with linear decision boundaries, so gating is most worth trying when a model struggles with nonlinear structure.
Key insights
Multiplicative gating in attention enables non-flat geometric representations, enhancing expressivity for nonlinear tasks.
Principles
- Ungated attention yields flat statistical manifolds.
- Gating introduces non-flat, curved geometries.
- Curvature correlates with nonlinear task performance.
Method
Attention outputs are modeled as mean parameters of Gaussian distributions, and their induced Fisher-Rao geometry is analyzed to compare ungated and gated mechanisms.
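The flat-versus-curved distinction can be illustrated with a toy calculation. For Gaussians with a fixed isotropic covariance sigma^2 * I, the Fisher information in the mean is I / sigma^2, so pulling the Fisher-Rao metric back through a map from a coordinate theta to a mean vector gives J^T J / sigma^2, where J is the Jacobian of that map. The sketch below is an illustration under these assumptions and is not the paper's construction: the 2-D coordinate, the matrix `A`, the self-gating form `sigmoid(z) * z`, and all function names are hypothetical. It computes the Gaussian curvature of the resulting 2-D manifold by finite differences.

```python
import math

# Hypothetical toy setup (not from the paper): a 2-D coordinate theta is mapped
# to a 3-D mean vector, assuming a fixed isotropic covariance sigma^2 * I.
A = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ungated(theta):
    """Affine stand-in for an ungated output: z = A @ theta."""
    return [row[0] * theta[0] + row[1] * theta[1] for row in A]

def gated(theta):
    """Elementwise multiplicative gate on the same output: sigmoid(z) * z."""
    return [sigmoid(z) * z for z in ungated(theta)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def combine(terms, scale):
    """Weighted sum of 3-vectors divided by scale; terms = [(weight, vector), ...]."""
    return [sum(w * vec[i] for w, vec in terms) / scale for i in range(3)]

def gaussian_curvature(mean_fn, theta, sigma=1.0, h=1e-3):
    """Gaussian curvature of the metric J^T J / sigma^2 at theta (finite differences)."""
    u, v = theta
    r = lambda du, dv: mean_fn([u + du, v + dv])
    r_u = combine([(1, r(h, 0)), (-1, r(-h, 0))], 2 * h)
    r_v = combine([(1, r(0, h)), (-1, r(0, -h))], 2 * h)
    r_uu = combine([(1, r(h, 0)), (-2, r(0, 0)), (1, r(-h, 0))], h * h)
    r_vv = combine([(1, r(0, h)), (-2, r(0, 0)), (1, r(0, -h))], h * h)
    r_uv = combine([(1, r(h, h)), (-1, r(h, -h)),
                    (-1, r(-h, h)), (1, r(-h, -h))], 4 * h * h)
    n = cross(r_u, r_v)              # unnormalized normal; |n|^2 = EG - F^2
    det_first = dot(n, n)
    k_embedded = (dot(r_uu, n) * dot(r_vv, n) - dot(r_uv, n) ** 2) / det_first ** 2
    return sigma ** 2 * k_embedded   # scaling the metric by 1/sigma^2 scales K by sigma^2

print(gaussian_curvature(ungated, [0.0, 0.0]))  # affine map: flat
print(gaussian_curvature(gated, [0.0, 0.0]))    # gated map: curved
```

In this toy the affine map gives zero curvature up to rounding, because its Jacobian is constant, while the gated map gives a positive value close to 4/3 at the origin, which matches a hand calculation for this particular choice of `A`. Other choices of `A` can produce negative curvature, so the sign here is a property of the example and not a general claim.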
In practice
- Use gated attention for tasks needing nonlinear decision boundaries.
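- Consider gating where training stability matters: stability is a stated motivation for its growing use in LLMs, though the study itself focuses on geometric expressivity.
- Do not expect a consistent gain from gating on tasks with linear decision boundaries.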
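As a concrete picture of what adding a gate involves, here is a minimal single-head sketch in plain Python. It assumes one particular form of multiplicative gating, a sigmoid gate computed from the layer input and applied elementwise to the attention output; the summary does not specify the paper's exact gate placement, and the names `attention`, `gated_attention`, and `wg` are hypothetical.

```python
import math

def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv):
    """Ungated scaled dot-product attention for one head."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def gated_attention(x, wq, wk, wv, wg):
    """Assumed gating form: sigmoid(x @ wg) multiplied elementwise into the output."""
    out = attention(x, wq, wk, wv)
    gate_logits = matmul(x, wg)       # wg must map to the same width as the output
    return [[o / (1.0 + math.exp(-g)) for o, g in zip(o_row, g_row)]
            for o_row, g_row in zip(out, gate_logits)]

# Tiny usage example with hypothetical weights.
x = [[1.0, 0.0], [0.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
wg = [[2.0, -1.0], [0.5, 0.0]]
print(attention(x, eye, eye, eye))
print(gated_attention(x, eye, eye, eye, wg))
```

With `wg` set to all zeros the gate is constant at 0.5 and the layer reduces to a scaled copy of ungated attention, so any extra expressivity comes from the gate varying with the input.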
Topics
- Gated Attention
- Fisher-Rao Geometry
- Geometric Expressivity
- Statistical Manifolds
- Representation Curvature
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.