What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Source: Artificial Intelligence · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Grokking, the delayed transition from memorization to generalization, is typically associated with the weight norm: smaller norms lead to earlier generalization. This research asks what the weight norm actually controls. With the weight norm clamped and only an output temperature varied, the grokking delay can be shifted across its full norm-induced range under cross-entropy loss. Matching the effective logit scale back to baseline recovers approximately 85% of this delay. Across norms and temperatures, the delay collapses onto the logit scale alone (R² = 0.97), with the norm contributing only 1-2% beyond this. The effect is loss-dependent: under mean-squared error the logit scale remains fixed, implying a different control route. Audits using a memorization control, a float64 softmax-collapse check, and a no-LayerNorm transformer confirm the logit scale as the primary channel. The weight norm thus acts as an upstream handle, while the logit scale, and the softmax saturation it drives, is the proximal variable.

Key takeaway

Machine learning engineers tuning for generalization should note that the weight norm influences grokking delay mainly through the logit scale, particularly under cross-entropy loss. If you observe grokking, consider managing the effective logit scale directly, for example by adjusting the output temperature, rather than manipulating only the weight norm. This can make hyperparameter tuning and training outcomes more predictable.

Key insights

Weight norm controls grokking delay primarily through logit scale and softmax saturation, especially under cross-entropy.
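
To make the link between logit scale and softmax saturation concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the logits in `base_logits` and the scale factors are hypothetical values chosen for illustration. It multiplies a fixed logit vector by a growing scale and reports the largest softmax probability together with the norm of the cross-entropy gradient with respect to the logits (softmax minus one-hot), which shrinks as the softmax saturates on the correct class.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad_norm(logits, target, temperature=1.0):
    """L2 norm of d(cross-entropy)/d(scaled logits) = softmax - one_hot."""
    probs = softmax(logits, temperature)
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical logits for one training example; class 0 is the correct label.
base_logits = [2.0, 0.5, -1.0, 0.0]
target = 0

rows = []
for scale in (0.5, 1.0, 4.0, 16.0):
    logits = [scale * z for z in base_logits]
    max_prob = max(softmax(logits))
    rows.append((scale, max_prob, ce_grad_norm(logits, target)))

for scale, max_prob, grad_norm in rows:
    print(f"scale={scale:5.1f}  max_prob={max_prob:.6f}  grad_norm={grad_norm:.3e}")
```

In this toy setting, dividing the logits by a temperature T has the same effect as multiplying the logit scale by 1/T, which is why an output temperature can stand in for a change in weight norm as far as the softmax is concerned.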

Principles

Method

The study varied output temperature with clamped weight norms to observe grokking delay, then matched effective logit scale to baseline to quantify its influence.
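
The two manipulations described above can be sketched on a toy linear readout. This is an illustration under stated assumptions, not the study's code: the weights `W`, the input `x`, the target norm of 4.0, and the use of the root-mean-square logit as the "effective logit scale" are all hypothetical choices, and the paper may define the scale differently. The sketch clamps the weight norm to a fixed value, then picks the output temperature that brings the logit scale back to its baseline.

```python
import math

def frobenius_norm(W):
    """Global L2 norm over all entries of a weight matrix (list of rows)."""
    return math.sqrt(sum(w * w for row in W for w in row))

def clamp_norm(W, target):
    """Rescale W so that its global L2 norm equals `target`."""
    n = frobenius_norm(W)
    if n == 0.0:
        raise ValueError("cannot rescale an all-zero weight matrix")
    return [[w * target / n for w in row] for row in W]

def logits(W, x, temperature=1.0):
    """Toy linear readout: one logit per row of W, divided by the temperature."""
    return [sum(w * xi for w, xi in zip(row, x)) / temperature for row in W]

def logit_scale(z):
    """RMS of the logits; an assumed proxy for the effective logit scale."""
    return math.sqrt(sum(v * v for v in z) / len(z))

# Hypothetical weights and input, for illustration only.
W = [[0.8, -0.3], [0.1, 0.9], [-0.5, 0.2]]
x = [1.0, 2.0]

baseline_scale = logit_scale(logits(W, x))

# Clamp the norm to a larger value: the raw logit scale grows with it.
W_clamped = clamp_norm(W, target=4.0)
raw_scale = logit_scale(logits(W_clamped, x))

# Choose the temperature that restores the baseline logit scale.
matching_temperature = raw_scale / baseline_scale
matched_scale = logit_scale(logits(W_clamped, x, matching_temperature))

print(f"baseline={baseline_scale:.4f} raw={raw_scale:.4f} "
      f"T={matching_temperature:.4f} matched={matched_scale:.4f}")
```

For this linear readout the logit scale is exactly proportional to the weight norm, so the matching temperature is simply the ratio of the two scales. In a deeper network the relationship between norm and logit scale is generally not linear, so the scale would have to be measured on the model's actual logits rather than inferred from the norm.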

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.