Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper introduces "two training clocks" to formalize grokking, the phenomenon in which a model generalizes long after it reaches low training error. The classifier clock measures the fast decay of cross-entropy loss, which happens on a logarithmic time scale and is driven by post-margin gap growth. The representation clock tracks the slower, polynomial-time simplification of the learned representation, for example reaching low effective rank under a Schatten-type penalty induced by layerwise weight decay. The paper demonstrates this temporal mismatch rigorously in deep linear networks and extends it conditionally to ReLU MLPs, with modular addition experiments providing qualitative support. The conclusion is that continued training reshapes internal representations rather than merely reducing error, which explains the delayed generalization.
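
For orientation, one standard way a Schatten-type penalty arises from layerwise weight decay in a deep linear network is the variational identity below. This is a sketch under assumed notation (depth $L$, layer factors $W_\ell$, end-to-end matrix $W$ with singular values $\sigma_i(W)$, hidden widths at least $\operatorname{rank}(W)$); the paper's exact statement and constants may differ.

```latex
\min_{W_L \cdots W_1 = W} \; \sum_{\ell=1}^{L} \lVert W_\ell \rVert_F^2
  \;=\; L \sum_i \sigma_i(W)^{2/L}
  \;=\; L \, \lVert W \rVert_{S_{2/L}}^{2/L}
```

For $L = 2$ the right-hand side is twice the nuclear norm of $W$. For $L \ge 3$ the exponent $2/L$ is below 1, so the penalty is a quasi-norm that favors low-rank end-to-end maps more strongly as depth grows.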

Key takeaway

Machine learning engineers who observe grokking or delayed generalization should keep training beyond the point where training loss appears to saturate. The model's internal representation may still be simplifying on a slower, polynomial time scale even after the classifier has fit the data on the fast, logarithmic one. Monitor structural metrics such as stable rank alongside the usual loss curves to identify when the representation has settled into a simpler, more generalizable form.
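
As a minimal sketch of the monitoring suggested above, the snippet below computes the stable rank of a weight matrix with NumPy, using the common definition of squared Frobenius norm divided by squared spectral norm. The function name `stable_rank` and the choice of which matrix to track are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def stable_rank(weight):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a 2-D weight matrix.

    Hypothetical helper for illustration; the paper may use a
    different structural metric or normalization.
    """
    w = np.asarray(weight, dtype=float)
    # Singular values are returned in descending order.
    singular_values = np.linalg.svd(w, compute_uv=False)
    top = singular_values[0]
    if top == 0.0:
        # An all-zero matrix has no meaningful rank; report 0.
        return 0.0
    return float(np.sum(singular_values ** 2) / top ** 2)


# Example inputs: a full-rank identity and a rank-one outer product.
full_rank = np.eye(4)
rank_one = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])
full_rank_value = stable_rank(full_rank)
rank_one_value = stable_rank(rank_one)
```

Logging this value for each layer, or for the end-to-end product of a deep linear network, at the same checkpoints as the loss makes it possible to see whether the representation is still simplifying after the loss curve has flattened.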

Key insights

Grokking arises from two distinct training clocks: fast classifier fitting and slow representation simplification.

Best for: AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.