Scaling Laws, Carefully
Summary
Scaling laws in deep learning describe a predictable power-law relationship in which loss (L) decreases as model size (N), dataset size (D), and compute (C) increase. Early work by Amari et al. (1992) and Hestness et al. (2017) established this predictability for generalization error. Kaplan et al. (2020) formalized the laws for Transformer language models and concluded that compute-optimal training should grow model size faster than dataset size (Nopt ∝ C^0.73). The Chinchilla paper (Hoffmann et al. 2022), based on over 400 models ranging from 70M to 16B parameters trained on 5B to 500B tokens, revised this: model size and training tokens should scale at equal rates (Nopt ∝ C^0.5). Pearce & Song (2024) later reconciled the discrepancy, attributing it to Kaplan et al.'s smaller experimental scale and parameter-counting method. The article also examines data-limited regimes, where repeated data affects loss, and highlights how sensitive scaling-law fits are to numerical precision, noise, and the choice of fit region.
Key takeaway
For machine learning engineers allocating compute for large language model training: scale model size and training tokens at roughly equal rates. The Chinchilla scaling laws (Nopt ∝ C^0.5) indicate that many large models were undertrained for their compute budget, so a smaller model trained on more data can be the better use of the same compute. When unique data runs out and tokens are repeated, expect an overfitting penalty; consider strong weight decay to mitigate it. Validate your scaling-law fits carefully, because numerical precision and the choice of fit region significantly affect extrapolation accuracy.
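As a rough illustration of the equal-rate rule, the sketch below splits a training FLOP budget between parameters and tokens. It assumes the common approximation C ≈ 6ND for training FLOPs and a fixed tokens-per-parameter ratio (20 here, a frequently quoted rule of thumb rather than a value from this summary). The function name `compute_optimal_split` is hypothetical.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a training FLOP budget into (params, tokens).

    Assumptions (not from the summary): C ~= 6 * N * D, and a fixed
    ratio D = tokens_per_param * N. Then N = sqrt(C / (6 * ratio)),
    so both N and D grow as C^0.5.
    """
    if flops <= 0 or tokens_per_param <= 0:
        raise ValueError("flops and tokens_per_param must be positive")
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for budget in (1e21, 1e23):
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.0e}: N~{n:.2e} params, D~{d:.2e} tokens")
```

Under these assumptions, a 100x larger budget yields a 10x larger model and 10x more tokens, which is what scaling both at C^0.5 means in practice.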
Key insights
Scaling laws predict deep learning loss from model size, data, and compute, but the estimated optimal allocation depends critically on the scale at which the law was fit and on how much unique data is available.
Principles
- Generalization error scales as a power law with data size.
- Compute-optimal model size and training tokens should scale at equal rates.
- Data repetition introduces an explicit overfitting penalty.
Method
Chinchilla fits its scaling law in three ways: fixing model sizes and varying the token budget, building IsoFLOP profiles to find the optimal N at each compute budget, and fitting a parametric loss function using Huber loss and L-BFGS.
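The parametric fit with Huber loss and L-BFGS can be sketched roughly as follows. This is a minimal illustration, not the paper's code: it assumes the functional form L(N, D) = E + A/N^α + B/D^β fitted in log space, and it uses synthetic noise-free data. The "true" parameter values, the Huber `delta`, and the grid of starting points are all illustrative assumptions, and the helper names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def log_loss_pred(params, log_n, log_d):
    """log of L(N, D) = E + A / N**alpha + B / D**beta, computed stably.

    params = (a, b, e, alpha, beta) with A = exp(a), B = exp(b), E = exp(e).
    """
    a, b, e, alpha, beta = params
    terms = np.stack([
        a - alpha * log_n,
        b - beta * log_d,
        np.full_like(log_n, e),
    ])
    return logsumexp(terms, axis=0)

def huber(residual, delta=1e-3):
    """Huber loss: quadratic near zero, linear in the tails."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * residual ** 2,
                    delta * (abs_r - 0.5 * delta))

def objective(params, log_n, log_d, log_loss_obs, delta=1e-3):
    """Sum of Huber losses between predicted and observed log-loss."""
    resid = log_loss_pred(params, log_n, log_d) - log_loss_obs
    return float(np.sum(huber(resid, delta)))

def fit(log_n, log_d, log_loss_obs, inits):
    """Run L-BFGS from several starting points and keep the best result."""
    best = None
    for x0 in inits:
        res = minimize(objective, x0, args=(log_n, log_d, log_loss_obs),
                       method="L-BFGS-B")  # gradients by finite differences
        if best is None or res.fun < best.fun:
            best = res
    return best

# Synthetic, noise-free data from made-up "true" parameters (illustration only).
true_params = np.array([np.log(400.0), np.log(400.0), np.log(1.7), 0.34, 0.28])
n_grid, d_grid = np.meshgrid(np.logspace(7.5, 9.5, 5), np.logspace(9.5, 11.5, 5))
log_n, log_d = np.log(n_grid.ravel()), np.log(d_grid.ravel())
log_loss_obs = log_loss_pred(true_params, log_n, log_d)

# Small grid of starting points; the optimum found can depend on the start.
inits = [np.array([a, b, e, al, be])
         for a in (0.0, 5.0) for b in (0.0, 5.0) for e in (0.0, 1.0)
         for al in (0.2, 0.5) for be in (0.2, 0.5)]
best = fit(log_n, log_d, log_loss_obs, inits)
a, b, e, alpha, beta = best.x
print(f"alpha={alpha:.3f} beta={beta:.3f} E={np.exp(e):.3f} loss={best.fun:.2e}")
```

Restarting from several initial points guards against poor local optima, and the Huber loss limits the influence of outlier runs. Whether the fit recovers the generating parameters depends on the data range and the starting grid, so treat the printed values as something to inspect rather than trust.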
In practice
- Fit scaling laws on small runs, then extrapolate to estimate what larger models need.
- Use strong weight decay to reduce overfitting from data repetition.
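The first practice above, fitting on small runs and extrapolating, might look like the following sketch. It assumes a pure power law with no irreducible-loss term, fitted by least squares in log-log space. The run sizes and losses are synthetic, and the function names are hypothetical.

```python
import numpy as np

def fit_power_law(x, loss):
    """Least-squares fit of loss ~= coef * x**(-exponent) in log-log space."""
    slope, intercept = np.polyfit(np.log(x), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict(x, coef, exponent):
    """Evaluate the fitted power law at x (scalar or array)."""
    return coef * np.asarray(x, dtype=float) ** (-exponent)

# Hypothetical small-run results: parameter counts and final losses.
sizes = np.array([1e7, 3e7, 1e8, 3e8])
losses = 50.0 * sizes ** -0.15  # synthetic data, exactly power-law

# Fit on the three smallest runs and hold out the largest as a sanity check.
coef, exponent = fit_power_law(sizes[:-1], losses[:-1])
held_out_error = abs(float(predict(sizes[-1], coef, exponent)) - losses[-1])

print(f"fitted exponent={exponent:.3f}, coef={coef:.1f}")
print(f"held-out abs error at {sizes[-1]:.0e} params: {held_out_error:.2e}")
print(f"extrapolated loss at 1e10 params: {float(predict(1e10, coef, exponent)):.3f}")
```

With real, noisy runs the held-out error will not be near zero. Comparing fits across different fit regions is one way to see how much the extrapolation can be trusted.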
Topics
- Scaling Laws
- Large Language Models
- Compute Optimization
- Data Repetition
- Generalization Error
- Model Training
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Lil'Log.