Generalization at the Edge of Stability

2026-04-22 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This work introduces a theoretical framework for understanding generalization in neural networks trained at the "edge of stability" (EoS), the regime in which large learning rates produce oscillatory and chaotic optimization dynamics. The authors model stochastic optimizers as random dynamical systems that converge to a "random pullback attractor." They define a new complexity measure, the "sharpness dimension" (SD), inspired by Lyapunov dimension theory. Unlike prior measures, SD depends on the complete Hessian spectrum and its partial determinants. The core contribution is a generalization bound that links a smaller sharpness dimension to improved generalization, offering a theoretical explanation for the benefits of chaotic dynamics at EoS. Experiments on several MLPs and transformers indicate that SD correlates better with the generalization gap than existing measures. They also offer new insight into "grokking," where SD decreases sharply during the transition to generalization.

Key takeaway

For researchers studying deep learning generalization, this work suggests that the "sharpness dimension" of the optimization dynamics is a quantity worth understanding and tracking. In the "edge of stability" regime, the paper's bound ties generalization to this new complexity measure. Consider adding SD analysis to your training diagnostics, especially when studying phenomena like "grokking," where the reported experiments show SD decreasing sharply during the transition to generalization.

Key insights

Generalization at the "edge of stability" is linked to the "sharpness dimension" of chaotic optimization attractors.

Principles

Stochastic optimizers can be modeled as random dynamical systems.
Chaotic dynamics at EoS can lead to improved generalization.
A lower "sharpness dimension" is associated with a smaller generalization gap.
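
To make the first principle concrete, one mini-batch SGD step can be written as a random map on parameter space. The notation below (the map \varphi, the batch B_k, the empirical loss \hat{L}) is an illustrative assumption and is not taken from the paper.

```latex
\theta_{k+1} = \varphi_{B_k}(\theta_k) := \theta_k - \eta \, \nabla \hat{L}_{B_k}(\theta_k),
\qquad
D\varphi_{B_k}(\theta) = I - \eta \, \nabla^2 \hat{L}_{B_k}(\theta).
```

Here B_k is the randomly drawn mini-batch at step k and \eta is the learning rate. Along an eigen-direction of the mini-batch Hessian with eigenvalue \lambda, the linearized step multiplies perturbations by 1 - \eta\lambda, so that direction expands when |1 - \eta\lambda| > 1 (that is, \lambda > 2/\eta or \lambda < 0) and contracts otherwise.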

Method

The "sharpness dimension" (SD) is computed by analyzing the complete Hessian spectrum and its partial determinants, quantifying the balance between expansion and contraction in the optimization dynamics on a random pullback attractor.
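
The summary does not give the paper's formula for SD, so the following is only a sketch of the classical Kaplan-Yorke (Lyapunov) dimension that this kind of measure is inspired by. It is applied to a hypothetical fixed Hessian spectrum under a plain gradient-descent step. The function name, the eigenvalue list, and the use of a single frozen Hessian are all assumptions made for illustration.

```python
import math


def lyapunov_style_dimension(hessian_eigs, lr, floor=1e-300):
    """Kaplan-Yorke-style dimension of the linearized GD map x -> (I - lr*H) x.

    Illustrative only: NOT the paper's sharpness dimension, whose exact
    definition is not stated in this summary.
    """
    # Per-direction log growth rate: log|1 - lr * lambda|.
    # For lr > 0, directions with lambda > 2/lr or lambda < 0 expand.
    # `floor` avoids log(0) when lr * lambda is exactly 1.
    exponents = sorted(
        (math.log(max(abs(1.0 - lr * lam), floor)) for lam in hessian_eigs),
        reverse=True,
    )

    # The running sum of the top-j exponents is the log of a "partial
    # determinant": the product of the j largest singular values of I - lr*H.
    partial = 0.0
    for j, e in enumerate(exponents):
        if partial + e < 0.0:
            # Expansion no longer outweighs contraction: interpolate.
            return j + partial / abs(e)
        partial += e
    # Every partial sum is non-negative: the dimension saturates.
    return float(len(exponents))


if __name__ == "__main__":
    # Hypothetical spectrum: one direction above 2/lr = 20, two below it.
    print(lyapunov_style_dimension([30.0, 5.0, 1.0], lr=0.1))
```

In this toy setting the result depends on the whole spectrum, not only on the largest eigenvalue or the trace: one strongly expanding direction can offset several weakly contracting ones, which is the balance between expansion and contraction described above.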

In practice

Monitor "sharpness dimension" to predict generalization.
Analyze Hessian spectrum for insights into EoS dynamics.
SD can quantify "grokking" phase transitions.
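
As a starting point for the Hessian analysis suggested above, a common diagnostic is to estimate the largest Hessian eigenvalue (the sharpness) by power iteration on Hessian-vector products and compare it with 2/lr, the classical stability threshold for full-batch gradient descent on a quadratic. The sketch below uses a hypothetical 2x2 Hessian in place of a real model. In practice `hvp` would come from an autodiff framework, and stochastic or adaptive optimizers have different thresholds.

```python
import math
import random


def top_eigenvalue(hvp, dim, iters=200, seed=0):
    """Estimate the dominant Hessian eigenvalue by power iteration.

    `hvp` maps a vector (list of floats) to the Hessian-vector product.
    Power iteration converges to the eigenvalue of largest magnitude,
    which may be negative if the Hessian is strongly indefinite.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(iters):
        hv = hvp(v)
        eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return eig


def exceeds_stability_threshold(sharpness, lr):
    """True if sharpness is above 2/lr (full-batch GD on a quadratic)."""
    return sharpness > 2.0 / lr


# Hypothetical toy Hessian with eigenvalues 4 and 2.
H = [[3.0, 1.0], [1.0, 3.0]]


def toy_hvp(v):
    return [sum(h * x for h, x in zip(row, v)) for row in H]


if __name__ == "__main__":
    sharpness = top_eigenvalue(toy_hvp, dim=2)
    print(sharpness, exceeds_stability_threshold(sharpness, lr=0.6))
```

This tracks only the top eigenvalue. The summary states that SD depends on the complete Hessian spectrum, so a sharpness monitor like this is a cheaper, coarser signal and not a substitute for SD.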

Topics

Sharpness Dimension
Edge of Stability
Random Dynamical Systems
Neural Network Generalization
Grokking Phenomenon

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.