The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
Summary
Research on transformer models trained for one-step Collatz prediction finds that "grokking", the long delay between fitting the training set and generalizing, stems primarily from a decoder bottleneck rather than from a failure of the encoder to acquire structure. The encoder rapidly learns task-relevant arithmetic structure, such as parity and low-order residue information: parity probes reach 99.7% accuracy by step 2,000, while overall sequence accuracy is still only 38%. Causal interventions support this reading. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75x, and retraining only the decoder on top of a frozen, converged encoder eliminates the generalization plateau, reaching 97.6% accuracy compared to 86.1% for joint training. Numeral representation also significantly affects decoder learnability: bases aligned with the Collatz map's arithmetic (e.g., base 24) achieve 99.8% accuracy, while binary representations collapse.
Key takeaway
Machine learning engineers optimizing transformer performance on algorithmic tasks should focus on improving decoder readout efficiency. A model may already hold the necessary internal representations yet struggle to translate them into correct outputs. Choose the numeral representation carefully, since bases like 24 can significantly outperform binary on tasks like Collatz prediction, and make sure the training data exposes the decoder to complex computational patterns, such as deep carry chains, to achieve robust generalization.
Key insights
Grokking in arithmetic transformers is primarily a decoder readout bottleneck, not an encoder learning deficiency.
Principles
- Encoder learns useful structure early.
- Numeral base acts as an inductive bias.
- Deep carry examples are necessary for generalization.
Method
The study uses encoder-decoder transformer models on one-step Collatz prediction, employing causal interventions like encoder/decoder transplantation and rewinding, alongside linear probing and a 15-base numeral sweep to localize bottlenecks.
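To make the task and the numeral sweep concrete, here is a minimal sketch of how one input/target pair could be built in an arbitrary base. It assumes the standard Collatz map (n/2 for even n, 3n+1 for odd n) and a most-significant-digit-first encoding; the study may use a different step variant or tokenization, and the function names `collatz_step`, `to_digits`, and `make_example` are illustrative, not taken from the source.

```python
def collatz_step(n: int) -> int:
    """One application of the standard Collatz map (assumed variant)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n // 2 if n % 2 == 0 else 3 * n + 1


def to_digits(n: int, base: int) -> list[int]:
    """Digits of n in the given base, most significant digit first."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


def make_example(n: int, base: int) -> tuple[list[int], list[int]]:
    """Hypothetical (input, target) digit sequences for one training pair."""
    return to_digits(n, base), to_digits(collatz_step(n), base)


if __name__ == "__main__":
    for base in (2, 10, 24):
        source, target = make_example(27, base)
        print(base, source, target)
```

One plausible reading of why a base such as 24 helps: because 24 is divisible by both 2 and 3, the last digit alone determines the parity of n and its residue modulo 3, whereas in binary the same information about multiples of 3 is spread across all digits.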
In practice
- Prioritize decoder optimization for arithmetic tasks.
- Select numeral bases aligned with task arithmetic.
- Ensure training data includes complex carry chains.
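The last recommendation can be checked on a dataset with a simple carry-depth measure. The sketch below is one possible way to do that, under a loudly stated assumption: it defines the "carry chain" of an input as the longest run of consecutive digit positions that produce a nonzero carry when computing 3n+1 digit by digit in the chosen base. The source does not give its definition, so this metric, along with the names `longest_carry_chain` and `select_deep_carry_inputs`, is hypothetical.

```python
def longest_carry_chain(n: int, base: int) -> int:
    """Longest run of consecutive nonzero carries when computing 3n+1
    digit by digit in the given base (an assumed definition of depth)."""
    if n < 1 or base < 2:
        raise ValueError("need n >= 1 and base >= 2")
    carry = 1  # the "+1" enters as a carry into the least significant digit
    longest = current = 0
    while n > 0:
        n, digit = divmod(n, base)
        carry = (3 * digit + carry) // base
        if carry:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def select_deep_carry_inputs(inputs, base: int, min_depth: int) -> list[int]:
    """Keep only inputs whose carry chain is at least min_depth long."""
    return [n for n in inputs if longest_carry_chain(n, base) >= min_depth]


if __name__ == "__main__":
    # 3333 * 3 + 1 = 10000, so every decimal digit position carries.
    print(longest_carry_chain(3333, 10))
    print(select_deep_carry_inputs(range(1, 100), 10, 2))
```

A filter like this could be used to audit what fraction of a training set contains long chains, or to oversample such inputs, if the assumed definition matches what the task actually requires.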
Topics
- Grokking
- Encoder-Decoder Transformers
- Collatz Prediction
- Arithmetic Generalization
- Numeral Representation
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.