Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction
Summary
This research formalizes the "grokking" phenomenon in neural networks, where fitting the training data and learning the underlying rule occur on distinct time scales, by introducing "two training clocks." For deep linear networks, the study shows that the cross-entropy loss decays to level epsilon on a logarithmic time scale, driven by a post-margin gap-growth or one-step tail-contraction condition. In contrast, the structural energy, expressed as a Schatten-type penalty induced by layerwise weight decay, closes on a polynomial time scale under a sharp late-time Kurdyka-Łojasiewicz tail. This separation distinguishes fitting from representation simplification. The paper extends the mechanism to ReLU MLPs by showing that, within regions where the activation pattern is fixed, the network behaves as a linear model. Furthermore, in a two-layer ReLU embedding model, the classifier head receives larger effective gradients, which supports a two-stage process: the classifier fits first, and representation simplification follows. Modular addition serves as the primary experimental setting.
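As a rough illustration of why the two clocks separate, consider a toy pair of scalar decay laws. These are not the paper's actual inequalities: the rate c, the exponent alpha, and the initial values L_0 and E_0 are placeholders introduced here. A quantity that decays in proportion to itself reaches level epsilon in logarithmic time, whereas a quantity that obeys a power-law decay inequality (the kind a Kurdyka-Łojasiewicz-type tail typically produces) needs polynomial time:

```latex
% Toy "fast clock": linear decay law for the loss
\frac{dL}{dt} = -c\,L
\;\Longrightarrow\;
L(t) = L_0 e^{-ct},
\qquad
T_{\mathrm{loss}}(\varepsilon) = \frac{1}{c}\,\log\frac{L_0}{\varepsilon}.

% Toy "slow clock": power-law decay for the structural energy, \alpha > 0
\frac{dE}{dt} = -c\,E^{1+\alpha}
\;\Longrightarrow\;
E(t) = \bigl(E_0^{-\alpha} + \alpha c\,t\bigr)^{-1/\alpha},
\qquad
T_{\mathrm{struct}}(\varepsilon) = \frac{\varepsilon^{-\alpha} - E_0^{-\alpha}}{\alpha c}.
```

In this toy picture, T_loss grows like log(1/epsilon) while T_struct grows like epsilon^(-alpha), so for small epsilon the loss reaches its target long before the structural energy does.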
Key takeaway
AI scientists investigating model generalization and grokking should consider the "two training clocks" framework to distinguish rapid loss fitting from slower representation learning. Because the two proceed at different rates, tracking loss reduction alone may not capture the full picture of a model's generalization capabilities. It is also worth analyzing how regularization such as layerwise weight decay drives representation simplification, since that process runs on a polynomial time scale rather than the logarithmic time scale of loss decay.
Key insights
Grokking involves "two training clocks": fast loss decay and slower representation simplification.
Principles
- Loss decays on a logarithmic time scale; representation simplification proceeds on a polynomial one.
- Layerwise weight decay induces Schatten-type regularization on the end-to-end map.
- Fixed ReLU activation patterns can reduce MLPs to linear models.
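The link between layerwise weight decay and a Schatten-type penalty can be checked numerically in the simplest case. The sketch below is not taken from the paper. It relies on a standard two-layer fact: among all factorizations W = W2 @ W1, the balanced one built from the SVD attains a weight-decay penalty equal to twice the nuclear norm (Schatten-1 norm) of W. The helper names `balanced_factors` and `weight_decay_penalty` are hypothetical, and deeper networks are not covered here.

```python
import numpy as np

def balanced_factors(W):
    """Factor W as W2 @ W1, splitting each singular value evenly (illustrative helper)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s)
    W2 = U * root              # scales column j of U by sqrt(s_j)
    W1 = root[:, None] * Vt    # scales row j of Vt by sqrt(s_j)
    return W1, W2

def weight_decay_penalty(factors):
    """Sum of squared Frobenius norms, i.e. the layerwise weight-decay term."""
    return sum(float(np.sum(F ** 2)) for F in factors)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))          # arbitrary end-to-end map (assumption)
W1, W2 = balanced_factors(W)

nuclear_norm = float(np.linalg.svd(W, compute_uv=False).sum())
penalty_balanced = weight_decay_penalty([W1, W2])

# Rescaling the layers keeps the end-to-end map but raises the penalty.
scale = 3.0
penalty_unbalanced = weight_decay_penalty([scale * W1, W2 / scale])

print("same map:", np.allclose(W2 @ W1, W))
print("balanced penalty / nuclear norm:", penalty_balanced / nuclear_norm)
print("unbalanced penalty is larger:", penalty_unbalanced > penalty_balanced)
```

The rescaled factorization represents exactly the same function, so the difference in penalty reflects only how the weights are distributed across layers. This is the sense in which weight decay acts on the end-to-end map through a spectral (Schatten-type) quantity.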
Method
The study uses deep linear network theory with conditional ReLU reduction to analyze grokking, separating loss decay from representation simplification via two training clocks.
In practice
- Apply the "two training clocks" framework to analyze model generalization.
- Use layerwise weight decay to induce Schatten-type regularization.
- Examine ReLU activation patterns for effective linear model behavior.
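To make the last item concrete, here is a minimal sketch of how a fixed activation pattern turns a ReLU network into a linear map. It assumes a bias-free two-layer MLP with random weights; the function names and sizes are illustrative and not taken from the paper.

```python
import numpy as np

def relu_mlp(x, W1, W2):
    """Bias-free two-layer ReLU network (an assumption for this sketch)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def activation_pattern(x, W1):
    """0/1 vector marking which hidden units are active at x."""
    return (W1 @ x > 0).astype(float)

def effective_linear_map(pattern, W1, W2):
    """W2 @ diag(pattern) @ W1: the linear map the network equals on this pattern's region."""
    return (W2 * pattern) @ W1

rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 3))
W2 = rng.standard_normal((2, 8))
x = rng.standard_normal(3)

pattern = activation_pattern(x, W1)
A = effective_linear_map(pattern, W1, W2)
print("network equals A @ x at x:", np.allclose(relu_mlp(x, W1, W2), A @ x))

# Without biases, positive rescaling should leave the pattern unchanged,
# so the same matrix A is expected to describe the network at the scaled point.
x_scaled = 2.5 * x
same_region = np.array_equal(pattern, activation_pattern(x_scaled, W1))
if same_region:
    print("same map at scaled point:",
          np.allclose(relu_mlp(x_scaled, W1, W2), A @ x_scaled))
```

Logging `activation_pattern` over training inputs is one simple way to see how often the patterns change. While they stay fixed, the linear-network analysis applies to the matrix `A`.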
Topics
- Grokking
- Deep Linear Networks
- ReLU MLPs
- Training Dynamics
- Weight Decay
- Representation Learning
- Modular Addition
Best for: AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.