Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper introduces "two training clocks" to formalize grokking, the phenomenon in which a model generalizes long after it reaches low training error. The classifier clock measures the fast decay of cross-entropy loss, which proceeds on a logarithmic time scale and is driven by post-margin gap growth. The representation clock, in contrast, tracks the slower, polynomial-time simplification of the learned representation, for example reaching low effective rank under a Schatten-type penalty induced by layerwise weight decay. The paper establishes this temporal mismatch rigorously for deep linear networks and extends it conditionally to ReLU MLPs, with modular addition experiments providing qualitative support. The work clarifies that continued training reshapes internal representations rather than merely reducing error, which explains the delayed generalization.

Key takeaway

Machine learning engineers who observe grokking or delayed generalization should consider training past the point where training loss appears to saturate. The model's internal representation may still be simplifying on a slower, polynomial time scale even after the classifier has fit the data on a logarithmic one. Monitor structural metrics such as stable rank alongside the usual loss curves to identify when the representation has settled into a simpler, more generalizable form.
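
As a minimal sketch of the kind of structural metric mentioned above, the snippet below computes the stable rank of a weight matrix using the common definition, squared Frobenius norm divided by squared spectral norm. The function name `stable_rank` and the example matrices are illustrative assumptions, not taken from the paper, and the paper may define or apply the metric differently.

```python
import numpy as np


def stable_rank(weight: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2 of a 2-D matrix (hypothetical helper)."""
    singular_values = np.linalg.svd(weight, compute_uv=False)
    top = singular_values[0]  # singular values are returned in descending order
    if top == 0.0:
        return 0.0  # convention chosen here for the all-zero matrix
    return float(np.sum(singular_values**2) / top**2)


# Illustrative inputs: a rank-1 matrix and an identity matrix.
rank_one = np.outer(np.arange(1.0, 5.0), np.arange(1.0, 4.0))
identity = np.eye(4)

print(stable_rank(rank_one))   # close to 1 for a rank-1 matrix
print(stable_rank(identity))   # equals the dimension for an identity matrix
```

In a training loop, one would typically call such a helper on each layer's weight matrix at regular checkpoints and plot the values next to the loss curve.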

Key insights

Grokking arises from two distinct training clocks: fast classifier fitting and slow representation simplification.

Principles

Classifier fitting converges on a logarithmic time scale; representation simplification proceeds on a polynomial one.
Layerwise weight decay biases deep linear networks toward low-rank maps.
Stable ReLU activation patterns enable local linear subsystem analysis.
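
The link between layerwise weight decay and a Schatten-type penalty can be illustrated numerically. The sketch below relies on a standard identity from the deep linear network literature: over all depth-L factorizations of a matrix W, the smallest achievable sum of squared Frobenius norms is L times the sum of the singular values of W raised to the power 2/L. Whether the paper uses exactly this form is an assumption. The code only checks that a balanced factorization attains that value and that an unbalanced rescaling costs more; it does not prove minimality. All function names are hypothetical.

```python
import numpy as np


def balanced_factors(target: np.ndarray, depth: int) -> list:
    """Factor `target` into `depth` matrices, input-side factor first."""
    if depth == 1:
        return [target.copy()]
    u, s, vt = np.linalg.svd(target, full_matrices=False)
    root = np.diag(s ** (1.0 / depth))  # each factor carries s^(1/depth)
    middle = [root.copy() for _ in range(depth - 2)]
    return [root @ vt, *middle, u @ root]


def product(factors: list) -> np.ndarray:
    """Multiply the factors as W_L ... W_1, where factors[0] is W_1."""
    out = factors[0]
    for factor in factors[1:]:
        out = factor @ out
    return out


def weight_decay_cost(factors: list) -> float:
    """Sum of squared Frobenius norms, i.e. the layerwise weight-decay term."""
    return float(sum(np.sum(f**2) for f in factors))


def schatten_cost(target: np.ndarray, depth: int) -> float:
    """depth * sum_i sigma_i^(2/depth), a Schatten-(2/depth) type quantity."""
    s = np.linalg.svd(target, compute_uv=False)
    return float(depth * np.sum(s ** (2.0 / depth)))


rng = np.random.default_rng(0)
target = rng.standard_normal((5, 3))
depth = 4

factors = balanced_factors(target, depth)
balanced = weight_decay_cost(factors)

# Rescale the first and last factors; the product is unchanged.
unbalanced_factors = [3.0 * factors[0], *factors[1:-1], factors[-1] / 3.0]
unbalanced = weight_decay_cost(unbalanced_factors)

print(np.allclose(product(factors), target))
print(np.isclose(balanced, schatten_cost(target, depth)))
print(unbalanced > balanced)
```

As the depth grows, the exponent 2/L shrinks, so the penalty weighs the count of nonzero singular values more heavily than their size, which is one way to read the low-rank bias stated above.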

In practice

Extend training beyond loss saturation; generalization may still improve.
Monitor representation metrics (e.g., stable rank) alongside loss.
Consider weight decay's role in shaping representation geometry.
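
One way to act on the practices above is to log a representation metric next to the loss and flag checkpoints where the loss has flattened but the metric is still changing. The sketch below is a simple heuristic under stated assumptions: the function names, the window, and both tolerances are hypothetical choices, and the example histories are made-up numbers for illustration, not results from the paper.

```python
from typing import Sequence


def absolute_change(history: Sequence[float], window: int) -> float:
    """Absolute difference between the latest value and the one `window` records back."""
    if len(history) <= window:
        raise ValueError("history must contain more than `window` records")
    return abs(history[-1] - history[-1 - window])


def relative_change(history: Sequence[float], window: int) -> float:
    """Absolute change divided by the magnitude of the older value."""
    scale = max(abs(history[-1 - window]), 1e-12)  # guard against division by zero
    return absolute_change(history, window) / scale


def representation_still_moving(
    loss_history: Sequence[float],
    metric_history: Sequence[float],
    window: int = 3,
    loss_tol: float = 0.01,    # assumed threshold for "loss looks saturated"
    metric_tol: float = 0.05,  # assumed threshold for "metric still changing"
) -> bool:
    """True when the loss looks flat but the representation metric does not."""
    loss_flat = absolute_change(loss_history, window) < loss_tol
    metric_moving = relative_change(metric_history, window) >= metric_tol
    return loss_flat and metric_moving


# Made-up histories: loss has flattened while stable rank keeps dropping.
losses = [2.0, 0.5, 0.01, 0.009, 0.0085, 0.0082]
ranks_dropping = [12.0, 11.5, 11.0, 9.0, 7.0, 5.5]
ranks_settled = [12.0, 6.0, 3.0, 3.0, 3.0, 3.0]

print(representation_still_moving(losses, ranks_dropping))  # True
print(representation_still_moving(losses, ranks_settled))   # False
```

A flag of this kind is only a prompt to keep training and keep looking; suitable thresholds depend on the model, the metric, and the logging interval.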

Topics

Grokking
Time-Scale Separation
Deep Linear Networks
Implicit Regularization
Representation Learning
Stable Rank

Best for: AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.