Generalization at the Edge of Stability
Summary
This work introduces a theoretical framework for understanding generalization in neural networks trained in the "edge of stability" (EoS) regime, where large learning rates lead to oscillatory and chaotic optimization dynamics. The authors model stochastic optimizers as random dynamical systems that converge to a "random pullback attractor." They define a new complexity measure, the "sharpness dimension" (SD), inspired by Lyapunov dimension theory. Unlike prior measures, SD depends on the complete Hessian spectrum and its partial determinants. The core contribution is a generalization bound that links a smaller SD to improved generalization, offering a theoretical explanation for the benefits of chaotic dynamics at EoS. Experiments on various MLPs and transformers show that SD correlates better with the generalization gap than existing measures, and they offer new insight into "grokking," where SD decreases sharply during the transition to generalization.
Key takeaway
For AI scientists and research scientists investigating deep learning generalization, this work suggests that the "sharpness dimension" of the optimization dynamics is a useful quantity to understand and control. The paper's bound and experiments tie generalization in the "edge of stability" regime to this new complexity measure. Consider adding SD analysis to your training diagnostics, especially when observing phenomena like "grokking," where a sharp decrease in SD accompanied the transition to better generalization in the reported experiments.
Key insights
Generalization at the "edge of stability" is linked to the "sharpness dimension" of chaotic optimization attractors.
Principles
- Stochastic optimizers can be modeled as random dynamical systems.
- Chaotic dynamics at EoS can lead to improved generalization.
- A smaller generalization gap correlates with a lower "sharpness dimension."
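As a toy illustration of the stability threshold behind the EoS regime (a standard textbook example, not taken from the paper), gradient descent on a one-dimensional quadratic with curvature λ multiplies the iterate by (1 - ηλ) at every step. It therefore contracts when η < 2/λ and oscillates with growing amplitude when η > 2/λ. The function name and the specific values below are illustrative assumptions.

```python
def gd_on_quadratic(curvature, learning_rate, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    Returns |x| after `steps` updates. Hypothetical helper for illustration.
    """
    x = x0
    for _ in range(steps):
        gradient = curvature * x
        x = x - learning_rate * gradient  # same as x * (1 - learning_rate * curvature)
    return abs(x)


curvature = 4.0  # stability threshold is 2 / curvature = 0.5

# Just below the threshold: per-step multiplier is about -0.8, so |x| shrinks.
below = gd_on_quadratic(curvature, learning_rate=0.45)

# Just above the threshold: per-step multiplier is about -1.2, so |x| grows.
above = gd_on_quadratic(curvature, learning_rate=0.55)

print(f"|x| after 50 steps, lr=0.45: {below:.3e}")
print(f"|x| after 50 steps, lr=0.55: {above:.3e}")
```

In both runs the sign of x flips at every step; only the amplitude trend differs. In a real network the curvature changes along the trajectory, which is what allows training to hover near the threshold instead of simply diverging.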
Method
The "sharpness dimension" (SD) is computed by analyzing the complete Hessian spectrum and its partial determinants, quantifying the balance between expansion and contraction in the optimization dynamics on a random pullback attractor.
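This summary does not reproduce the paper's exact definition of SD, so the sketch below is only an analogy. It computes the classical Kaplan-Yorke (Lyapunov) dimension, the kind of quantity SD is said to be inspired by, from a list of expansion rates. As an assumption made purely for illustration, the rates are taken to be log|1 - ηλ_i|, the local expansion rates of a plain gradient-descent step with learning rate η at a point whose Hessian eigenvalues are λ_i. All function names and numbers are hypothetical.

```python
import numpy as np


def kaplan_yorke_dimension(rates):
    """Classical Kaplan-Yorke dimension from expansion rates (Lyapunov exponents)."""
    mu = np.sort(np.asarray(rates, dtype=float))[::-1]  # largest rate first
    partial_sums = np.cumsum(mu)
    # With rates sorted in descending order, the indices whose partial sum is
    # non-negative form a prefix, so counting them gives j.
    j = int(np.sum(partial_sums >= 0))
    if j == 0:
        return 0.0  # every direction contracts
    if j == mu.size:
        return float(mu.size)  # volume never contracts in any dimension
    # Integer part j plus a fractional part balancing the accumulated
    # expansion against the next (contracting) rate.
    return float(j + partial_sums[j - 1] / abs(mu[j]))


def gd_local_rates(hessian_eigenvalues, learning_rate):
    """Assumed local rates log|1 - lr * lambda_i| for one gradient-descent step."""
    eigs = np.asarray(hessian_eigenvalues, dtype=float)
    multipliers = np.abs(1.0 - learning_rate * eigs)
    return np.log(np.maximum(multipliers, 1e-300))  # guard against log(0)


# Hand-checkable case: partial sums are 0.5, 0.3, -0.7, so j = 2 and the
# dimension is 2 + 0.3 / 1.0.
dim_example = kaplan_yorke_dimension([0.5, -0.2, -1.0])

# Made-up Hessian spectrum with one eigenvalue above 2 / lr = 4.0.
dim_gd = kaplan_yorke_dimension(gd_local_rates([4.2, 1.0, 0.1], learning_rate=0.5))

print(dim_example, dim_gd)
```

The partial sums of the sorted rates are the log growth rates of j-dimensional volumes, which is where partial determinants enter in Lyapunov dimension theory. How the paper's SD accounts for stochasticity and the random pullback attractor is not captured by this sketch.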
In practice
- Monitor "sharpness dimension" during training as an indicator of the generalization gap.
- Analyze Hessian spectrum for insights into EoS dynamics.
- SD can quantify "grokking" phase transitions.
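One common way to start analyzing the Hessian spectrum without forming the Hessian is power iteration on Hessian-vector products. The sketch below is a minimal version under stated assumptions: `grad_fn` is a hypothetical function returning the loss gradient at a parameter vector, and the Hessian-vector product is approximated by a central finite difference of gradients (an autodiff Hessian-vector product would normally replace it). It is not the paper's procedure for computing SD, which needs more of the spectrum than the top eigenvalue.

```python
import numpy as np


def top_hessian_eigenvalue(grad_fn, params, iters=100, eps=1e-4, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration.

    grad_fn: hypothetical callable mapping a parameter vector to its gradient.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(params.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(iters):
        # Central finite-difference approximation of the product H @ v.
        hv = (grad_fn(params + eps * v) - grad_fn(params - eps * v)) / (2 * eps)
        eigenvalue = float(v @ hv)  # Rayleigh quotient for the unit vector v
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return eigenvalue


# Toy check on a quadratic loss 0.5 * w^T A w, whose Hessian is A.
A = np.diag([3.0, 1.0, 0.5])
grad_fn = lambda w: A @ w
params = np.zeros(3)

sharpness = top_hessian_eigenvalue(grad_fn, params)

learning_rate = 0.7
above_threshold = sharpness > 2.0 / learning_rate  # compare with 2 / lr

print(f"estimated top eigenvalue: {sharpness:.4f}, above 2/lr: {above_threshold}")
```

Power iteration converges to the eigenvalue of largest magnitude, so if the Hessian has a large negative eigenvalue the estimate will reflect that one instead of the largest positive curvature.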
Topics
- Sharpness Dimension
- Edge of Stability
- Random Dynamical Systems
- Neural Network Generalization
- Grokking Phenomenon
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.