[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo

2026-03-17 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

Independent researchers propose clipping the ℓ₂ norm of each row of the decoder weights after every optimizer step, which sharply accelerates grokking (delayed generalization that appears long after a model has fit its training data). The change takes about 5 lines of code, uses no additional memory, and requires no weight decay. On a standard modular arithmetic benchmark, Lion+Clip reached an 18-66× speedup over an AdamW baseline, with zero failures across 300 seeds. For 8-layer models with edge initialization, it also reduced the interquartile range (IQR) by 61–72%. The experiments so far cover only modular arithmetic; the researchers are now testing the method on a 277M LLM and are seeking an arXiv endorsement. Code and a PDF of the write-up are available in a public repository.
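To make the idea of per-row ℓ₂ clipping concrete, here is a minimal NumPy sketch. It is not the researchers' code: the function name `clip_rows_l2`, the threshold `max_norm`, and the `eps` guard are all assumptions chosen for illustration, and the summary does not state which threshold the authors use.

```python
import numpy as np

def clip_rows_l2(weight, max_norm=1.0, eps=1e-12):
    """Rescale, in place, every row whose L2 norm exceeds max_norm.

    `max_norm` and `eps` are hypothetical values for illustration only;
    the summarized post does not give the authors' settings.
    """
    # One L2 norm per row, kept as a column so it broadcasts over the row.
    norms = np.linalg.norm(weight, axis=1, keepdims=True)
    # Rows already inside the ball get scale 1.0 and are left unchanged.
    scale = np.minimum(1.0, max_norm / (norms + eps))
    weight *= scale
    return weight

# Toy "decoder" matrix: the first row has norm 5, the second has norm 0.5.
W = np.array([[3.0, 4.0],
              [0.3, 0.4]])
clip_rows_l2(W, max_norm=1.0)
row_norms = np.linalg.norm(W, axis=1)
```

In a training loop, a call like this would run once after each optimizer update on the decoder weight matrix, so that no row can drift outside the norm ball between steps.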

Key takeaway

Per-row ℓ₂ clipping of decoder weights accelerates grokking in transformers by 18-66× and produced zero failures across 300 seeds on modular arithmetic tasks. The 5-line change, used with Lion (Lion+Clip), needs no extra memory and no weight decay, making it a robust, low-cost option for researchers studying generalization. Its applicability beyond modular arithmetic is still under investigation.

Topics

Grokking
Weight Norm Clipping
Deep Learning Optimization
Transformer Models
Modular Arithmetic

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.