Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

This research formalizes the "grokking" phenomenon in neural networks, where fitting the training data and learning the underlying rule occur on distinct time scales, by introducing "two training clocks." For deep linear networks, the study shows that the cross-entropy loss decays to level epsilon on a logarithmic time scale, driven by a post-margin gap-growth or one-step tail-contraction condition. In contrast, the structural energy, expressed as a Schatten-type penalty arising from layerwise weight decay, closes on a polynomial time scale under a sharp late-time Kurdyka-Łojasiewicz tail. This separation of time scales distinguishes fitting from representation simplification. The paper extends the mechanism to ReLU MLPs by showing that, within regions where the activation pattern is fixed, the network behaves as a linear model. Furthermore, in a two-layer ReLU embedding model, the classifier head receives larger effective gradients, supporting a two-stage process in which the classifier fits first and the representation simplifies afterward. Modular addition serves as the primary experimental setting.

Key takeaway

AI Scientists investigating model generalization and grokking should consider the "two training clocks" framework to distinguish rapid loss fitting from slower representation learning. Optimizing solely for loss reduction may therefore miss part of the picture of a model's generalization capabilities. Analyze how regularization such as layerwise weight decay affects representation simplification, since it operates on a polynomial time scale, distinct from the logarithmic time scale of loss decay.

Key insights

Grokking involves "two training clocks": fast loss decay and slower representation simplification.

Principles

Loss decay is logarithmic; representation simplification is polynomial.
Weight decay induces Schatten-type regularization on the end-to-end map.
Fixed ReLU activation patterns can reduce MLPs to linear models.
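
To make the Schatten-type principle concrete, the following is a standard variational identity for depth-L linear factorizations, offered as a sketch rather than as the paper's exact statement. It assumes the hidden widths are at least min(d_in, d_out), and the 1/L normalization and the symbols W, W_ℓ, σ_j are notation introduced here for illustration.

```latex
% Minimal layerwise squared-Frobenius cost of factoring a fixed end-to-end map W
% into L linear layers (assumption: hidden widths >= min(d_in, d_out)).
\min_{W_L \cdots W_1 = W} \; \frac{1}{L} \sum_{\ell=1}^{L} \lVert W_\ell \rVert_F^2
  \;=\; \lVert W \rVert_{S_{2/L}}^{2/L}
  \;=\; \sum_{j} \sigma_j(W)^{2/L}
```

For L = 2 the right-hand side is the nuclear norm of W; as L grows, the exponent 2/L shrinks and the penalty increasingly favors low-rank end-to-end maps.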

Method

The study uses deep linear network theory with conditional ReLU reduction to analyze grokking, separating loss decay from representation simplification via two training clocks.
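
One hypothetical way to read off the two clocks from training logs is sketched below. The helper names (`structural_energy`, `first_step_below`, `clock_readings`), the use of the last logged energy as a stand-in for the limiting value, and the synthetic curves are all assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

def structural_energy(weights):
    """Sum of squared Frobenius norms over layers (what layerwise weight decay penalizes)."""
    return float(sum(np.sum(W ** 2) for W in weights))

def first_step_below(values, threshold):
    """Index of the first logged value <= threshold, or None if never reached."""
    for step, v in enumerate(values):
        if v <= threshold:
            return step
    return None

def clock_readings(losses, energies, eps):
    """Return (t_fit, t_struct): first steps at which the loss and the energy gap reach eps."""
    losses = np.asarray(losses, dtype=float)
    energies = np.asarray(energies, dtype=float)
    # Assumption: the last logged energy is used as a proxy for the limiting value.
    gap = np.abs(energies - energies[-1])
    return first_step_below(losses, eps), first_step_below(gap, eps)

# Synthetic logs for illustration only (not results from the paper).
steps = np.arange(200)
losses = np.exp(-0.5 * steps)          # exponential-in-time decay: log(1/eps) steps to reach eps
energies = 1.0 + 1.0 / (1.0 + steps)   # power-law approach to a limit: poly(1/eps) steps
t_fit, t_struct = clock_readings(losses, energies, eps=1e-2)
print(t_fit, t_struct)
```

In a real run, `losses` would be the logged training cross-entropy and `energies` the value of `structural_energy` on the layer weights at each logged step; a large gap between the two readings is the signature the two-clock picture predicts.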

In practice

Apply the "two training clocks" framework to analyze model generalization.
Use layerwise weight decay to induce Schatten-type regularization.
Examine ReLU activation patterns for effective linear model behavior.
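
The activation-pattern check in the last item can be illustrated with a minimal NumPy sketch. It assumes a generic two-layer ReLU MLP with random weights; the function names and shapes are hypothetical and are not the paper's model. The point it shows is that, once the on/off pattern of the hidden units is frozen, the network coincides with an affine map on that region.

```python
import numpy as np

def relu_mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP: W2 @ relu(W1 @ x + b1) + b2."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def activation_pattern(x, W1, b1):
    """Boolean mask of the hidden units that are active at x."""
    return (W1 @ x + b1) > 0

def linearized(x0, W1, b1, W2, b2):
    """Affine map (A, c) that equals the MLP wherever the pattern matches that of x0."""
    D = np.diag(activation_pattern(x0, W1, b1).astype(float))
    A = W2 @ D @ W1
    c = W2 @ D @ b1 + b2
    return A, c

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

x0 = rng.normal(size=4)
A, c = linearized(x0, W1, b1, W2, b2)

# A nearby input is in the same linear region only if its activation pattern is unchanged.
x1 = x0 + 1e-6 * rng.normal(size=4)
same_region = np.array_equal(activation_pattern(x0, W1, b1),
                             activation_pattern(x1, W1, b1))
print(same_region, np.allclose(relu_mlp(x0, W1, b1, W2, b2), A @ x0 + c))
```

Tracking how often `activation_pattern` changes across training steps or across inputs gives a rough sense of when the conditional reduction to a linear model is a reasonable approximation.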

Topics

Grokking
Deep Linear Networks
ReLU MLPs
Training Dynamics
Weight Decay
Representation Learning
Modular Addition

Best for: AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.