What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy
Summary
Grokking is the delayed transition from memorization to generalization. It is usually linked to the weight norm: smaller norms lead to earlier generalization. This research asks what the weight norm actually controls. With the weight norm clamped and only an output temperature varied, the grokking delay under cross-entropy loss can be shifted across the full range that changing the norm induces. Matching the effective logit scale back to baseline recovers approximately 85% of this delay. Across norms and temperatures, the delay collapses onto the logit scale alone (R² = 0.97), with the norm contributing only 1-2% beyond it. The effect is loss-dependent: under mean-squared error the logit scale remains fixed, implying a different control route. Audits using a memorization control, a float64 softmax-collapse check, and a no-LayerNorm transformer confirm the logit scale as the primary channel. The weight norm is an upstream handle; the proximal variable is the logit scale and the softmax saturation it drives.
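The link between logit scale and softmax saturation can be illustrated with a small NumPy sketch. This is not the study's code: the logits, the label, and the temperature values are made-up numbers chosen for illustration. Dividing logits by a temperature T rescales them by 1/T. For an example that is already classified correctly, a larger scale pushes the softmax toward one-hot and shrinks the cross-entropy gradient with respect to the logits, which equals (p - y) / T.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def saturation_and_grad(logits, label, temperature):
    """Return (max softmax probability, gradient norm) at a given temperature.

    The gradient is d CE(softmax(logits / T), label) / d logits = (p - y) / T.
    """
    p = softmax(logits / temperature)
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    grad = (p - onehot) / temperature
    return float(p.max()), float(np.linalg.norm(grad))

# Illustrative values only (assumed, not taken from the study).
logits = np.array([4.0, 1.0, 0.0, -1.0])
label = 0

results = {T: saturation_and_grad(logits, label, T) for T in (0.25, 1.0, 4.0)}
for T, (p_max, grad_norm) in results.items():
    print(f"T={T:<4} max prob={p_max:.6f} grad norm={grad_norm:.2e}")
```

In this toy case, lowering T from 1.0 to 0.25 multiplies the logits by 4, moves the top probability very close to 1, and reduces the gradient norm by several orders of magnitude.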
Key takeaway
If you tune models for generalization, note that the weight norm influences grokking delay mainly through the logit scale, at least under cross-entropy loss. When you observe grokking, consider managing the effective logit scale, for example by adjusting the output temperature, rather than only manipulating the weight norm. Treating the logit scale as the variable to tune can make the timing of generalization more predictable.
Key insights
Weight norm controls grokking delay primarily through logit scale and softmax saturation, especially under cross-entropy.
Principles
- Weight norm's effect on grokking is largely indirect.
- Logit scale is the proximal variable for grokking delay.
- Loss function dictates the mechanism of norm control.
Method
The study clamped the weight norm and varied only the output temperature to observe the effect on grokking delay. It then matched the effective logit scale back to baseline to quantify how much of the delay the logit scale accounts for.
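A minimal sketch of this kind of intervention, under stated assumptions: the two-layer bias-free ReLU network, the global L2 norm as the clamped quantity, the root-mean-square of the logits as a scale measure, and the names `clamp_to_norm`, `forward`, and `logit_rms` are all hypothetical choices for illustration, not the study's setup. The point is only that the weight norm can be held fixed while the temperature alone moves the logit scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_norm(params):
    # L2 norm over all weight arrays treated as one flat vector.
    return float(np.sqrt(sum(np.sum(w ** 2) for w in params)))

def clamp_to_norm(params, target_norm):
    # Rescale every array by the same factor so the global norm hits the target.
    # In a training loop this would be reapplied after each optimizer step.
    scale = target_norm / global_norm(params)
    return [w * scale for w in params]

def forward(params, x, temperature=1.0):
    # Toy two-layer ReLU network without biases; logits are divided by T.
    w1, w2 = params
    hidden = np.maximum(x @ w1, 0.0)
    return (hidden @ w2) / temperature

def logit_rms(logits):
    # One possible scalar summary of the logit scale (an assumption).
    return float(np.sqrt(np.mean(logits ** 2)))

x = rng.normal(size=(32, 8))
params = [rng.normal(size=(8, 16)), rng.normal(size=(16, 5))]
params = clamp_to_norm(params, target_norm=3.0)

baseline = logit_rms(forward(params, x))
sweep = {T: logit_rms(forward(params, x, T)) for T in (0.5, 1.0, 2.0)}

print(f"weight norm = {global_norm(params):.3f}")
for T, rms in sweep.items():
    print(f"T={T}: logit RMS = {rms:.4f} ({rms / baseline:.2f}x baseline)")
```

The weight norm stays at the clamped value for every temperature, while the logit RMS scales as 1/T.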
In practice
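- Track the logit scale, not only the weight norm, when tuning for grokking delay.
- Adjust the output temperature to shift the delay.
- Check which loss you are using: the logit-scale route is reported for cross-entropy, not for mean-squared error.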
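One way to act on the points above is to measure the logit scale on a batch and pick the temperature that brings it back to a reference value. The sketch below is hedged: the summary does not define how the effective logit scale is measured, so the root-mean-square of the logits is an assumption here, the helper `matching_temperature` is hypothetical, and the random arrays merely stand in for logits from a reference run and a current run.

```python
import numpy as np

def logit_rms(logits):
    # Assumed scale measure: root-mean-square of all logits in the batch.
    return float(np.sqrt(np.mean(np.square(logits))))

def matching_temperature(logits, reference_rms):
    # Temperature T such that logits / T has the reference RMS.
    return logit_rms(logits) / reference_rms

rng = np.random.default_rng(1)
# Stand-in data (not real model outputs): a reference run and a run
# whose logits have grown larger.
reference_logits = rng.normal(scale=2.0, size=(64, 10))
current_logits = rng.normal(scale=7.0, size=(64, 10))

T = matching_temperature(current_logits, logit_rms(reference_logits))
matched = current_logits / T

print(f"temperature = {T:.3f}")
print(f"reference RMS = {logit_rms(reference_logits):.4f}")
print(f"matched RMS   = {logit_rms(matched):.4f}")
```

A temperature above 1 means the current logits are larger than the reference and are being scaled down. Other scale measures, such as the mean margin between the top two logits, would fit the same pattern.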
Topics
- Grokking
- Weight Norm
- Logit Scale
- Cross-Entropy Loss
- Generalization
- Machine Learning Theory
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.