Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction
Summary
This paper introduces "two training clocks" to formalize grokking, a phenomenon in which models generalize long after achieving low training error. The classifier clock measures the fast, logarithmic-time decay of cross-entropy loss, driven by post-margin gap growth. The representation clock tracks the slower, polynomial-time simplification of the learned representation, such as reaching low effective rank under a Schatten-type penalty induced by layerwise weight decay. This temporal mismatch is rigorously demonstrated in deep linear networks and conditionally extended to ReLU MLPs, with modular addition experiments providing qualitative support. The work clarifies that continued training reshapes internal representations rather than merely reducing error, which explains the delayed generalization.
Key takeaway
If you are a machine learning engineer observing grokking or delayed generalization, consider extending training beyond the point where training loss appears to saturate. The model's internal representation may still be simplifying on a slower, polynomial time scale, even after the classifier has fit the data on the fast, logarithmic time scale. Monitor structural metrics such as stable rank alongside traditional loss curves to see when the representation has settled into a simpler, more generalizable form.
Key insights
Grokking arises from two distinct training clocks: fast classifier fitting and slow representation simplification.
Principles
- Classifier fitting proceeds on a fast, logarithmic time scale; representation simplification proceeds on a slower, polynomial one.
- Layerwise weight decay biases deep linear networks toward low-rank maps.
- Stable ReLU activation patterns enable local linear subsystem analysis.
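As background for the weight decay principle, the following is a minimal NumPy sketch of how a layerwise squared-norm penalty relates to a Schatten-type quantity in a deep linear network. It relies on a standard identity, stated here as an assumption rather than as the paper's exact result: over all depth-L factorizations of a matrix W, the smallest achievable sum of squared Frobenius norms is L times the sum of the singular values raised to the power 2/L, and a balanced factorization attains it. The function names (`balanced_factors`, `weight_decay_cost`, `schatten_penalty`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def balanced_factors(W, depth):
    """Split W into `depth` factors that share its singular values evenly.

    Assumes depth >= 2. The factors multiply back to W (left to right).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.diag(s ** (1.0 / depth))  # each factor carries s^(1/depth)
    factors = [U @ root] + [root] * (depth - 2) + [root @ Vt]
    return factors, s

def weight_decay_cost(factors):
    """Sum of squared Frobenius norms, i.e. the layerwise weight decay term."""
    return sum(float(np.sum(F ** 2)) for F in factors)

def schatten_penalty(s, depth):
    """depth * sum_i s_i^(2/depth), a Schatten-(2/depth) quasi-norm power."""
    return depth * float(np.sum(s ** (2.0 / depth)))

# Illustrative rank-2 target map (random data, not from the paper).
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))

factors, s = balanced_factors(W, depth=3)
product = np.linalg.multi_dot(factors)
cost = weight_decay_cost(factors)
penalty = schatten_penalty(s, depth=3)
```

For exponents 2/L below 1, this penalty favors concentrating mass in a few singular values, which is one way to read the low-rank bias described above.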
In practice
- Extend training beyond apparent loss saturation; generalization may still improve.
- Monitor representation metrics (e.g., stable rank) alongside loss.
- Consider weight decay's role in shaping representation geometry.
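The monitoring advice above can be sketched as a small helper that computes stable rank for a weight matrix. This is a minimal illustration, assuming the usual definition of stable rank as the squared Frobenius norm divided by the squared spectral norm; the function name `stable_rank` and the idea of calling it on each layer at logging time are assumptions, not the paper's procedure.

```python
import numpy as np

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a 2-D weight matrix.

    The value lies between 1 and rank(W) for any nonzero matrix.
    Returns 0.0 for an all-zero matrix to avoid dividing by zero.
    """
    s = np.linalg.svd(np.asarray(W, dtype=float), compute_uv=False)
    top = s[0]  # singular values come back sorted in descending order
    if top == 0.0:
        return 0.0
    return float(np.sum(s ** 2) / top ** 2)

# Two reference points: an identity map and a rank-one map.
full = stable_rank(np.eye(4))
rank_one = stable_rank(np.outer([1.0, 2.0, 3.0], [4.0, 5.0]))
```

Logged next to the training loss at regular intervals, a stable rank that keeps falling after the loss has flattened would be one sign that the representation is still simplifying.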
Topics
- Grokking
- Time-Scale Separation
- Deep Linear Networks
- Implicit Regularization
- Representation Learning
- Stable Rank
Best for: AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.