[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo
Summary
Independent researchers report that clipping the ℓ₂ norm of each row of the decoder weights after every optimizer step significantly accelerates "grokking". The technique requires only 5 lines of code, no additional memory, and no weight decay. On a standard modular arithmetic benchmark, Lion+Clip achieved an 18-66× speedup over an AdamW baseline. It showed zero failures across 300 seeds and reduced the interquartile range (IQR) by 61–72% for 8-layer models with edge initialization. Experiments so far cover only modular arithmetic; the researchers are testing the method on a 277M LLM and are seeking arXiv endorsement for the work. Code and a PDF are available in a public repository.
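To make the per-row clipping step concrete, here is a minimal NumPy sketch of the general idea. It is not the authors' code. The function name `clip_rows_l2`, the threshold `max_norm`, the stabilizer `eps`, and the convention that each row is one weight vector are all assumptions for illustration. The summary does not state the clipping threshold or which matrices count as decoder weights.

```python
import numpy as np

def clip_rows_l2(weight, max_norm, eps=1e-12):
    """Rescale each row of `weight` so its L2 norm is at most `max_norm`.

    Hypothetical helper: name, arguments, and row convention are assumptions.
    """
    # One norm per row, kept as a column so it broadcasts against `weight`.
    row_norms = np.linalg.norm(weight, axis=1, keepdims=True)
    # Rows already inside the norm ball keep scale 1; larger rows shrink.
    scale = np.minimum(1.0, max_norm / (row_norms + eps))
    return weight * scale

# Toy stand-in for a weight matrix right after an optimizer step.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)) * 3.0
W_clipped = clip_rows_l2(W, max_norm=1.0)
print(np.linalg.norm(W_clipped, axis=1))  # every row norm is at most max_norm
```

In a training loop, a step like this would run on the chosen weight matrix immediately after each optimizer update. Because it only rescales the existing weights, it keeps no optimizer state of its own, which is consistent with the "no additional memory" claim above.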
Key takeaway
Per-row ℓ₂ norm clipping of decoder weights accelerates grokking in transformers by 18-66× over an AdamW baseline on modular arithmetic tasks, with zero failures across 300 seeds. The 5-line modification, paired with the Lion optimizer (Lion+Clip), needs no extra memory and no weight decay, which makes it a robust, lightweight option for researchers studying generalization. Its applicability beyond modular arithmetic is still under investigation.
Topics
- Grokking
- Weight Norm Clipping
- Deep Learning Optimization
- Transformer Models
- Modular Arithmetic
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.