Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

This research formalizes the "grokking" phenomenon in neural networks, in which fitting the training data and learning the underlying rule occur on distinct time scales, by introducing "two training clocks." For deep linear networks, the study shows that the cross-entropy loss decays to a level epsilon on a logarithmic time scale, under either a post-margin gap-growth condition or a one-step tail-contraction condition. In contrast, the structural energy, a Schatten-type penalty induced by layerwise weight decay, closes on a polynomial time scale under a sharp late-time Kurdyka-Łojasiewicz tail condition. This separation of time scales distinguishes fitting from representation simplification. The paper extends the mechanism to ReLU MLPs conditionally: within regions where the activation pattern is fixed, the network behaves as a linear model. Furthermore, in a two-layer ReLU embedding model, the classifier head receives larger effective gradients, which supports a two-stage process in which the classifier fits first and the representation simplifies afterwards. Modular addition serves as the primary experimental setting.
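The link between layerwise weight decay and a Schatten-type penalty can be illustrated numerically. The sketch below relies on a standard result for deep linear networks: minimizing the sum of squared Frobenius norms over all depth-L factorizations of a fixed end-to-end matrix gives L times the sum of its singular values raised to the power 2/L. Whether the paper's "structural energy" is defined exactly this way is an assumption, and every function name here is hypothetical.

```python
import numpy as np

def layerwise_weight_decay_energy(layers):
    """Sum of squared Frobenius norms over all layers."""
    return sum(float(np.sum(W ** 2)) for W in layers)

def schatten_penalty(W, depth):
    """depth * sum_i sigma_i^(2/depth), a Schatten-(2/depth) type penalty."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return depth * float(np.sum(sigma ** (2.0 / depth)))

def balanced_factorization(W, depth):
    """Factor a square matrix W into `depth` layers that share singular values."""
    if depth < 2:
        raise ValueError("this sketch assumes depth >= 2")
    U, s, Vt = np.linalg.svd(W)
    root = np.diag(s ** (1.0 / depth))
    # The first list entry is the layer applied first to the input.
    return [root @ Vt] + [root] * (depth - 2) + [U @ root]

def end_to_end(layers):
    """Multiply the layers into the end-to-end linear map."""
    product = layers[0]
    for W in layers[1:]:
        product = W @ product
    return product

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 4))
    layers = balanced_factorization(W, depth=3)
    print("weight decay energy:", layerwise_weight_decay_energy(layers))
    print("Schatten penalty:   ", schatten_penalty(W, depth=3))
```

For the balanced factorization the two printed values coincide. Rescaling one layer up and another down leaves the end-to-end map unchanged but raises the weight decay energy, which is why weight decay acts as a pressure toward this simpler, balanced structure.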

Key takeaway

AI scientists investigating model generalization and grokking should consider the "two training clocks" framework to distinguish rapid loss fitting from slower representation simplification. Optimizing solely for loss reduction may not capture the full picture of a model's generalization behavior. Analyze how regularization such as layerwise weight decay affects representation simplification, since it operates on a polynomial time scale, distinct from the logarithmic time scale of loss decay.

Key insights

Grokking involves "two training clocks": fast loss decay and slower representation simplification.
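To give a feel for how far apart a logarithmic and a polynomial clock can be, the sketch below compares the time needed to reach a tolerance epsilon under two assumed decay laws: exponential decay for the loss and a power law for the structural energy gap. These functional forms, the rates, and the function names are illustrative assumptions only; the summary does not state the paper's actual rates or constants.

```python
import math

def time_to_eps_exponential(eps, rate=1.0, initial=1.0):
    """Hitting time of eps for L(t) = initial * exp(-rate * t)."""
    return math.log(initial / eps) / rate

def time_to_eps_power_law(eps, power=1.0, initial=1.0):
    """Hitting time of eps for E(t) = initial / (1 + t) ** power."""
    return (initial / eps) ** (1.0 / power) - 1.0

if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-4):
        fast = time_to_eps_exponential(eps)
        slow = time_to_eps_power_law(eps)
        print(f"eps={eps:g}  loss clock={fast:.2f}  structure clock={slow:.2f}")
```

Under these assumptions the loss clock grows like log(1/epsilon) while the structure clock grows like a power of 1/epsilon, so the gap between the two widens as the tolerance tightens.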

Method

The study uses deep linear network theory with conditional ReLU reduction to analyze grokking, separating loss decay from representation simplification via two training clocks.
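The conditional ReLU reduction rests on a simple fact: once the activation pattern is fixed, a ReLU network computes a linear map. The sketch below checks this for a bias-free two-layer MLP. The architecture, shapes, and function names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def relu_mlp(x, W1, W2):
    """Bias-free two-layer ReLU MLP."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def activation_pattern(x, W1):
    """Boolean mask of hidden units that are active at input x."""
    return W1 @ x > 0

def same_pattern(x, y, W1):
    """True when x and y lie in the same activation region."""
    return bool(np.array_equal(activation_pattern(x, W1), activation_pattern(y, W1)))

def effective_linear_map(x, W1, W2):
    """Linear map that the network equals on the activation region of x."""
    mask = activation_pattern(x, W1).astype(float)
    return W2 @ (mask[:, None] * W1)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W1 = rng.standard_normal((8, 4))
    W2 = rng.standard_normal((3, 8))
    x = rng.standard_normal(4)
    A = effective_linear_map(x, W1, W2)
    y = 2.5 * x  # positive scaling keeps the pattern when there are no biases
    print("same region:", same_pattern(x, y, W1))
    print("max deviation from linear map:", np.max(np.abs(relu_mlp(y, W1, W2) - A @ y)))
```

Inside one region the deep linear analysis can be applied to the effective map; the reduction is conditional because training may move inputs across region boundaries, where the effective map changes.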

Best for: AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.