The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

Research on transformer models trained for one-step Collatz prediction indicates that "grokking", the long delay between fitting the training set and generalizing, stems primarily from a decoder bottleneck rather than from a failure of the encoder to acquire structure. The encoder learns task-relevant arithmetic structure early, such as parity and low-order residue information: parity probes reach 99.7% accuracy by step 2,000, while overall sequence accuracy is still 38%. Causal interventions support this reading. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75x, and retraining only the decoder on top of a frozen, converged encoder eliminates the generalization plateau, reaching 97.6% accuracy versus 86.1% for joint training. Numeral representation also strongly affects decoder learnability: bases aligned with the Collatz map's arithmetic (e.g., base 24) reach 99.8% accuracy, while binary representations collapse.
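To make the task concrete, here is a minimal sketch of how one input/target pair for one-step Collatz prediction could be built in a chosen numeral base. It assumes the standard map (n/2 for even n, 3n+1 for odd n) and most-significant-first digit tokens; the original paper may use a different variant or tokenization, and the function names are illustrative only.

```python
def collatz_step(n: int) -> int:
    """Standard one-step Collatz map (assumed variant): n/2 if even, 3n+1 if odd."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return n // 2 if n % 2 == 0 else 3 * n + 1


def to_digits(n: int, base: int) -> list[int]:
    """Digits of n in the given base, most significant first (assumed ordering)."""
    if n < 0 or base < 2:
        raise ValueError("need n >= 0 and base >= 2")
    if n == 0:
        return [0]
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(r)
    return digits[::-1]


def make_example(n: int, base: int) -> tuple[list[int], list[int]]:
    """Hypothetical helper: (input digit sequence, target digit sequence)."""
    return to_digits(n, base), to_digits(collatz_step(n), base)
```

For example, `make_example(27, 24)` yields `([1, 3], [3, 10])`, since 27 = 1·24 + 3 and its successor 82 = 3·24 + 10. The same pair in base 2 is five and seven tokens long, which shows how strongly the choice of base changes the sequences the decoder has to emit.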

Key takeaway

If you are a machine learning engineer optimizing transformer performance on algorithmic tasks, focus your efforts on decoder readout efficiency: your models may already hold the necessary internal representations yet struggle to translate them into correct outputs. Choose the numeral representation carefully, since bases like 24 can significantly outperform binary on tasks like Collatz prediction, and make sure your training data exposes the decoder to complex computational patterns, such as deep carry chains, so that generalization is robust.

Key insights

Grokking in arithmetic transformers is primarily a decoder readout bottleneck, not an encoder learning deficiency.

Principles

Encoder learns useful structure early.
Numeral base acts as an inductive bias.
Deep carry examples are necessary for generalization.
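One way to see how numeral base and carry depth interact is to compute 3n+1 digit by digit and track how long a carry keeps propagating. The sketch below is an illustration, not the paper's metric: "carry depth" is defined here, as an assumption, as the longest run of consecutive digit positions that emit a nonzero carry, and the function name is hypothetical.

```python
def triple_plus_one_carries(n: int, base: int) -> tuple[int, int]:
    """Compute 3n+1 digit-wise in `base`.

    Returns (value, depth), where depth is the longest run of consecutive
    digit positions producing a nonzero carry (an assumed definition).
    """
    if n < 0 or base < 2:
        raise ValueError("need n >= 0 and base >= 2")
    carry = 1            # the "+1" enters as an initial carry
    run = longest = 0
    value, place = 0, 1
    while n > 0:
        n, digit = divmod(n, base)                    # next least-significant digit
        carry, out = divmod(3 * digit + carry, base)  # new carry, output digit
        value += out * place
        place *= base
        run = run + 1 if carry else 0
        longest = max(longest, run)
    value += carry * place                            # final carry becomes the top digit
    return value, longest
```

For n = 27 this returns a depth of 5 in base 2 and 0 in base 24, with the value 82 in both cases. Whether carry depth in this sense is what drives the reported base effects is not stated in this summary; the sketch only shows that the same number can demand very different amounts of carry propagation depending on the base.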

Method

The study uses encoder-decoder transformer models on one-step Collatz prediction, employing causal interventions like encoder/decoder transplantation and rewinding, alongside linear probing and a 15-base numeral sweep to localize bottlenecks.
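The linear-probing step can be sketched as fitting a logistic-regression readout that predicts parity from a hidden-state vector. In the sketch below the "hidden states" are synthetic stand-ins, generated so that parity lies along one direction plus noise; a real probe would use encoder activations extracted from a checkpoint. All names, dimensions, and hyperparameters are assumptions, and the resulting accuracy says nothing about the paper's reported figures.

```python
import math
import random


def synthetic_states(numbers, dim=8, noise=0.2, seed=0):
    """Stand-in for encoder activations: parity encoded along one direction."""
    rng = random.Random(seed)
    direction = [rng.uniform(-1, 1) for _ in range(dim)]
    states = []
    for n in numbers:
        sign = 1.0 if n % 2 else -1.0
        states.append([sign * d + rng.gauss(0, noise) for d in direction])
    return states


def train_linear_probe(xs, ys, epochs=100, lr=0.5):
    """Logistic regression (weights + bias) fit by full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))          # keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted prob minus label
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples whose parity the linear readout gets right."""
    hits = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == bool(y))
    return hits / len(xs)


def run_probe_demo():
    """Train on 300 synthetic states, report accuracy on 100 held-out ones."""
    numbers = random.Random(1).sample(range(1, 10**6), 400)
    states = synthetic_states(numbers)
    labels = [n % 2 for n in numbers]
    w, b = train_linear_probe(states[:300], labels[:300])
    return probe_accuracy(w, b, states[300:], labels[300:])
```

The point of a probe like this is that the readout is deliberately weak: if a linear map can recover parity from frozen states, the information is already present in the representation, regardless of whether the decoder uses it yet.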

In practice

Prioritize decoder optimization for arithmetic tasks.
Select numeral bases aligned with task arithmetic.
Ensure training data includes complex carry chains.

Topics

Grokking
Encoder-Decoder Transformers
Collatz Prediction
Arithmetic Generalization
Numeral Representation

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.