Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers
Summary
A new study introduces the Frequency Synchronization Degree (FSD), a normalized, permutation-tested metric that tracks Fourier circuit synchronization in transformers exhibiting "grokking." Grokking, in which a model trained on a modular arithmetic task suddenly reaches high validation accuracy, is shown to be preceded by FSD synchronization. Across nine modular addition configurations, including primes p in {53, 71, 97, 113, 131} and three seeds, FSD consistently synchronizes 500-3,000 steps before grokking, with a mean lead of +1,722 steps, which the study reports as the earliest known predictor. The research also provides causal evidence that the inter-phase gap is a regularization effect: changing the weight decay lambda after the FSD-ceiling step shifts the grokking time, with the delay Delta_t proportional to 1/lambda, so a larger lambda gives earlier grokking. This relationship, Delta_t ~ C/lambda, replicated across primes p in {53, 97, 131} with high R^2 values. Ablations indicate that FSD reflects a multi-block circuit property: attention-only models grokked with an FSD precursor, whereas MLP-only models did not grok.
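The reported relationship Delta_t ~ C/lambda can be checked on your own runs with a one-parameter least-squares fit. The sketch below is a minimal illustration, not the study's analysis code: the function name `fit_inverse_law` and the numbers are hypothetical, and the data are synthetic values generated exactly from C = 1000.

```python
def fit_inverse_law(lambdas, delays):
    """Least-squares fit of delay = C / lambda with no intercept.

    Returns (C, r_squared). R^2 is computed against the mean of the delays.
    """
    xs = [1.0 / lam for lam in lambdas]  # regress on x = 1 / lambda
    c = sum(x * y for x, y in zip(xs, delays)) / sum(x * x for x in xs)
    mean_y = sum(delays) / len(delays)
    ss_res = sum((y - c * x) ** 2 for x, y in zip(xs, delays))
    ss_tot = sum((y - mean_y) ** 2 for y in delays)
    return c, 1.0 - ss_res / ss_tot


# Synthetic, illustrative values only (NOT results from the study).
lambdas = [0.25, 0.5, 1.0, 2.0]
delays = [4000.0, 2000.0, 1000.0, 500.0]  # exactly 1000 / lambda
c, r2 = fit_inverse_law(lambdas, delays)
print(f"C = {c:.1f}, R^2 = {r2:.3f}")
```

With real measurements the fit will not be exact; a low R^2 would suggest the 1/lambda scaling does not hold in that setting.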
Key takeaway
For machine learning engineers training transformers, knowing when grokking will begin is useful. Consider adding Frequency Synchronization Degree (FSD) monitoring to your training pipeline: in the reported modular addition experiments, FSD synchronized 500-3,000 steps before generalization, which leaves time to adjust regularization such as weight decay. Adjusting weight decay in response to the FSD signal may let you influence when a model moves from memorization to generalization, potentially saving compute and shortening development time.
Key insights
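Frequency Synchronization Degree (FSD) tracks Fourier circuit synchronization and predicts grokking onset in advance; weight decay interventions provide causal evidence that the gap between synchronization and grokking is a regularization effect.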
Principles
- Fourier circuit synchronization precedes generalization.
- Inter-phase gap is a regularization effect.
- Grokking is a multi-block circuit property.
Method
The Frequency Synchronization Degree (FSD) is a normalized, permutation-tested metric of Fourier circuit synchronization. It can be used to predict grokking without prior knowledge of the circuit.
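This summary does not give the FSD formula, so the sketch below does not implement FSD. It only illustrates what "permutation-tested" generally means, using a stand-in statistic chosen for this example: the cosine similarity between two hypothetical per-frequency power vectors, compared against a null distribution built by shuffling one vector's frequency indices. All names and values are assumptions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def permutation_p_value(power_a, power_b, n_perm=999, seed=0):
    """One-sided permutation test: is the observed similarity unusually high?"""
    rng = random.Random(seed)
    observed = cosine(power_a, power_b)
    shuffled = list(power_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the frequency alignment
        if cosine(power_a, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + hits) / (1 + n_perm)


# Hypothetical power spectra of two components sharing the same two frequencies.
power_a = [0, 5, 0, 0, 3, 0, 0, 0]
power_b = [0, 5, 0, 0, 3, 0, 0, 0]
observed, p_value = permutation_p_value(power_a, power_b)
print(observed, p_value)
```

A small p-value here means the alignment between the two spectra is unlikely under random frequency assignment, which is the kind of baseline a permutation test provides.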
In practice
- Employ FSD for early grokking prediction.
- Tune weight decay to modulate grokking timing.
- Focus analysis on multi-block transformer circuits.
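As a rough illustration of the first point, the lead time can be measured as the gap between the step where a logged FSD curve first crosses a threshold and the step where validation accuracy does. This is a sketch under stated assumptions: the thresholds (0.9 and 0.95), function names, and logged values are hypothetical, and the FSD values are assumed to come from your own implementation of the metric.

```python
def first_crossing(steps, values, threshold):
    """Return the first step whose value is >= threshold, or None."""
    for step, value in zip(steps, values):
        if value >= threshold:
            return step
    return None


def lead_time(steps, fsd, val_acc, fsd_threshold=0.9, acc_threshold=0.95):
    """Steps between FSD synchronization and grokking (None if either is missing)."""
    fsd_step = first_crossing(steps, fsd, fsd_threshold)
    grok_step = first_crossing(steps, val_acc, acc_threshold)
    if fsd_step is None or grok_step is None:
        return None
    return grok_step - fsd_step


# Synthetic training log for illustration only (NOT data from the study).
steps = [0, 500, 1000, 1500, 2000, 2500, 3000]
fsd = [0.10, 0.40, 0.92, 0.95, 0.97, 0.98, 0.98]
val_acc = [0.01, 0.02, 0.03, 0.05, 0.30, 0.97, 0.99]
print(lead_time(steps, fsd, val_acc))  # 1500
```

A positive result means the FSD threshold was reached before the accuracy threshold; `None` means at least one curve never crossed its threshold in the logged window.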
Topics
- Grokking
- Transformer Circuits
- Fourier Analysis
- Mechanistic Interpretability
- Weight Decay
- Modular Arithmetic
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.