Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

· Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

A new study introduces the Frequency Synchronization Degree (FSD), a normalized, permutation-tested metric that tracks Fourier circuit synchronization in transformers that exhibit "grokking": on modular arithmetic tasks, validation accuracy jumps abruptly, long after the model has fit the training set. The study shows that FSD synchronization precedes this jump. Across nine modular addition configurations, spanning primes p in {53, 71, 97, 113, 131} and three seeds, FSD synchronizes 500-3,000 steps before grokking, with a mean lead of +1,722 steps, which the study reports as the earliest known predictor. The research also provides causal evidence that the gap between the two phases is a regularization effect: when weight decay lambda is varied after the FSD-ceiling step, the delay to grokking, Delta_t, scales as 1/lambda, so larger lambda brings grokking earlier. This relationship, Delta_t ~ C/lambda, replicated across primes p in {53, 97, 131} with high R^2 values. Ablations indicate that FSD is a multi-block circuit property: attention-only models grokked with an FSD precursor, while MLP-only models did not grok.
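
The inverse relationship Delta_t ~ C/lambda can be checked with a one-parameter least-squares fit. The sketch below is illustrative only: the function name `fit_inverse_law` is hypothetical, the data are synthetic placeholders rather than values from the study, and computing R^2 against the mean of Delta_t is one common convention that the study may or may not use.

```python
import numpy as np

def fit_inverse_law(lambdas, delta_ts):
    """Fit delta_t = C / lambda by least squares (no intercept).

    Returns (C, R^2). R^2 is computed against the mean of delta_ts,
    which is an assumed convention, not one confirmed by the study.
    """
    x = 1.0 / np.asarray(lambdas, dtype=float)   # regress on 1/lambda
    y = np.asarray(delta_ts, dtype=float)
    C = float(x @ y / (x @ x))                   # closed-form slope through the origin
    residual = y - C * x
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return C, 1.0 - ss_res / ss_tot

# Synthetic placeholder data, NOT measurements from the study:
# delays generated exactly from a known constant so the fit can be verified.
true_C = 500.0
lambdas = np.array([0.1, 0.3, 1.0, 3.0])
delta_ts = true_C / lambdas

C_hat, r2 = fit_inverse_law(lambdas, delta_ts)
print(f"C = {C_hat:.1f}, R^2 = {r2:.4f}")
```

On real training logs, Delta_t would be the measured number of steps between the FSD-ceiling step and grokking for each lambda; an R^2 well below 1 would suggest the inverse law does not hold for that setup.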

Key takeaway

For machine learning engineers optimizing transformer training, knowing when grokking will begin is valuable. Consider adding Frequency Synchronization Degree (FSD) monitoring to your training pipelines. In the modular addition experiments, the metric gave a lead of 500-3,000 steps before generalization, which leaves time to adjust regularization such as weight decay. Because the delay to grokking scaled inversely with weight decay in those experiments, tuning weight decay once FSD has synchronized can shift when a model moves from memorization to generalization, potentially saving compute and shortening development.

Key insights

Frequency Synchronization Degree (FSD) predicts grokking onset by tracking Fourier circuit synchronization, and weight decay interventions causally link the delay between synchronization and grokking to regularization.

Principles

Method

The Frequency Synchronization Degree (FSD) is a normalized, permutation-tested metric for Fourier circuit synchronization, usable without prior circuit knowledge to predict grokking.
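
The summary does not give the FSD formula, so the sketch below is not the study's definition. It is a hypothetical stand-in showing what "normalized and permutation-tested" can mean for a Fourier-based score: the power spectra of two weight matrices along the token axis are correlated, and that correlation is compared with a null distribution obtained by shuffling token rows. All names (`power_spectrum`, `spectral_alignment`, `sync_score`) and the choice of Pearson correlation are assumptions made for illustration.

```python
import numpy as np

def power_spectrum(W):
    """Power per non-DC frequency along the token axis, summed over features."""
    F = np.fft.rfft(W, axis=0)                 # shape (p // 2 + 1, d)
    return (np.abs(F) ** 2).sum(axis=1)[1:]    # drop the DC component

def spectral_alignment(A, B):
    """Pearson correlation between the frequency power spectra of A and B."""
    return float(np.corrcoef(power_spectrum(A), power_spectrum(B))[0, 1])

def sync_score(A, B, n_perm=200, seed=0):
    """Hypothetical synchronization score (NOT the study's FSD).

    Returns (normalized score in [0, 1], permutation p-value).
    The null shuffles the token rows of B, which destroys any
    periodic structure over the token index.
    """
    rng = np.random.default_rng(seed)
    observed = spectral_alignment(A, B)
    null = np.array([
        spectral_alignment(A, B[rng.permutation(B.shape[0])])
        for _ in range(n_perm)
    ])
    normalized = (observed - null.mean()) / (1.0 - null.mean())
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return float(np.clip(normalized, 0.0, 1.0)), float(p_value)

# Synthetic demo: two matrices over residues 0..p-1 sharing frequencies 5 and 17.
p, d = 97, 32
rng = np.random.default_rng(1)
t = np.arange(p)[:, None]
freqs = np.array([5, 17])[np.arange(d) % 2]    # alternate the two frequencies

def make_periodic():
    phases = rng.uniform(0, 2 * np.pi, size=d)
    clean = np.cos(2 * np.pi * freqs * t / p + phases)
    return clean + 0.01 * rng.standard_normal((p, d))

A, B = make_periodic(), make_periodic()
score, p_value = sync_score(A, B)
print(f"score = {score:.3f}, p = {p_value:.4f}")
```

In a training loop, `A` and `B` could be snapshots of, for example, embedding and unembedding weights indexed by token residue; which matrices the study actually compares is not stated in this summary.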

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.