Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization
Summary
A new geometric analysis reveals that the memorization-generalization delay in neural networks, particularly on algorithmic tasks, stems from the radial inflation of hidden representations during cross-entropy optimization. Researchers formalized a radial-angular decomposition of activation-space dynamics, proposing that penalizing this radial inflation induces anisotropic, data-dependent weight regularization, suppresses radial gradient energy, and biases convergence toward flatter minima. Empirically, a single-hyperparameter norm penalty, which softly constrains activations to a sqrt(d)-radius hypersphere, accelerated "grokking" up to 6x across MLPs and Transformers on modular arithmetic tasks. This method also halved the training steps required for a 10M-parameter nanoGPT model performing 3-digit addition.
Key takeaway
Machine learning engineers facing delayed generalization ("grokking") on algorithmic tasks should consider radial suppression. A single-hyperparameter norm penalty on hidden activations accelerated grokking by up to 6x across MLPs and Transformers on modular arithmetic, and halved the training steps a 10M-parameter nanoGPT needed for 3-digit addition. Try it first on tasks of that kind, where the reported gains were measured.
Key insights
Radial inflation of hidden representations drives memorization-generalization delays in neural networks.
Principles
- Penalizing radial inflation induces anisotropic weight regularization.
- Radial suppression forces predominantly angular gradient updates.
- Radial suppression biases convergence toward flatter minima.
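The radial-angular split behind these principles can be illustrated with a small sketch. It assumes the standard decomposition of a gradient g, taken with respect to an activation vector h, into the component along h (radial) and the remainder orthogonal to h (angular). The function names and the toy vectors are hypothetical and are not taken from the original article.

```python
import math


def radial_angular_split(h, g):
    """Split gradient g (taken w.r.t. activation h) into radial and angular parts.

    Radial part: projection of g onto the unit vector h / ||h||.
    Angular part: what remains, which is orthogonal to h.
    """
    norm = math.sqrt(sum(x * x for x in h))
    if norm == 0.0:
        raise ValueError("h must be nonzero to define a radial direction")
    unit = [x / norm for x in h]
    radial_coeff = sum(gi * ui for gi, ui in zip(g, unit))
    g_radial = [radial_coeff * ui for ui in unit]
    g_angular = [gi - ri for gi, ri in zip(g, g_radial)]
    return g_radial, g_angular


def energy(v):
    """Squared Euclidean norm, used here as 'gradient energy'."""
    return sum(x * x for x in v)


# Toy example (illustrative values only).
h = [3.0, 4.0]
g = [1.0, 2.0]
g_rad, g_ang = radial_angular_split(h, g)

# Share of the gradient energy that only changes the norm of h.
radial_fraction = energy(g_rad) / energy(g)
print(f"radial fraction of gradient energy: {radial_fraction:.3f}")
```

In this toy case most of the gradient energy points along h, so an update would mainly rescale the activation rather than rotate it. Suppressing that radial share leaves updates that are predominantly angular.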
Method
A single-hyperparameter norm penalty softly constrains activations to a sqrt(d)-radius hypersphere, thereby suppressing radial inflation during training.
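A minimal sketch of such a penalty follows, under stated assumptions: the summary does not give the exact functional form, so the squared deviation of each activation norm from sqrt(d), averaged over the batch, is an illustrative choice. Treating d as the hidden width, along with the names `radial_penalty` and `lam`, is likewise hypothetical.

```python
import math


def radial_penalty(activations, lam):
    """Assumed form: lam * mean over the batch of (||h|| - sqrt(d))^2.

    activations: list of equal-length activation vectors (lists of floats).
    lam: the single penalty-strength hyperparameter.
    """
    d = len(activations[0])          # assumed: d is the hidden width
    target = math.sqrt(d)            # radius of the target hypersphere
    total = 0.0
    for h in activations:
        norm = math.sqrt(sum(x * x for x in h))
        total += (norm - target) ** 2
    return lam * total / len(activations)


# Toy batch with d = 4, so the target radius is 2.
batch = [
    [1.0, 1.0, 1.0, 1.0],   # norm 2: already on the sphere, no penalty
    [3.0, 0.0, 4.0, 0.0],   # norm 5: radially inflated, penalized
]
penalty = radial_penalty(batch, lam=0.1)
print(f"penalty: {penalty:.2f}")
```

In a training loop, a term like this would be added to the cross-entropy loss, with `lam` as the only extra hyperparameter to tune. The constraint is soft: activations off the sphere are penalized in proportion to their squared distance from it rather than projected onto it.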
In practice
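- Apply the norm penalty to accelerate grokking in MLPs on modular arithmetic.
- Apply the same penalty to Transformers; the up-to-6x speedup was reported across both architectures.
- Cut training steps on arithmetic tasks: a 10M-parameter nanoGPT needed half as many for 3-digit addition.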
Topics
- Algorithmic Generalization
- Neural Networks
- Grokking
- Radial Suppression
- Transformers
- Machine Learning
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.