Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration
Summary
A new research paper introduces the Complete(d) Parameterisation (Complete(d)P), an extension of existing neural network parameterisations such as μP that unifies hyperparameter scaling across model width, depth, batch size, and training duration. It targets hyperparameter tuning, which strongly affects the stability and performance of large-scale models. The authors also study per-module hyperparameter optimization and transfer, and show that the parameterisation supports effective transfer even at this finer granularity. Their experiments cover learning rates, AdamW parameters, weight decay, initialization scales, and residual block multipliers. Hyperparameters optimized per module at a small 50M-parameter scale and then transferred to a ~14000× larger FLOP budget yield substantial training speed improvements in Large Language Models, outperforming globally optimized hyperparameters.
Key takeaway
For research scientists optimizing large language models, adopting the Complete(d) Parameterisation and running per-module hyperparameter optimization at small scale can speed up training and improve performance relative to globally tuned hyperparameters. Consider an evolutionary strategy for the initial hyperparameter search, then transfer the optimized settings directly to much larger models, relying on the unified scaling properties of Complete(d)P.
Key insights
Complete(d)P enables effective per-module hyperparameter transfer across width, depth, batch size, and training duration for large models.
Principles
- Per-module hyperparameter optimization (HPO) can outperform global HPO.
- The parameterisation unifies scaling across width, depth, batch size, and training duration.
Method
The Complete(d) Parameterisation unifies scaling in width, depth, batch size, and training duration, so hyperparameters optimized per module on a small model can be transferred directly to larger models.
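As a rough illustration of what transferring per-module hyperparameters along a single axis could look like, the Python sketch below applies the commonly cited μP width rules for hidden weight matrices trained with Adam: the learning rate is scaled by 1/m and the initialization standard deviation by 1/√m, where m is the ratio of target width to base width. The module names, the ModuleHP container, and all numeric values are hypothetical. The depth, batch-size, and duration rules of Complete(d)P are not reproduced here, because this summary does not state them.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModuleHP:
    """Hyperparameters tuned for one module (hypothetical container)."""
    lr: float        # Adam learning rate for the module's hidden weights
    init_std: float  # standard deviation used to initialize those weights


def transfer_width(
    base_hps: Dict[str, ModuleHP], base_width: int, target_width: int
) -> Dict[str, ModuleHP]:
    """Rescale per-module hyperparameters from base_width to target_width.

    Uses the muP-style width rule for hidden matrices under Adam:
    lr -> lr / m and init_std -> init_std / sqrt(m), with m = target / base.
    This is an illustrative stand-in, not the full Complete(d)P rule set.
    """
    if base_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    m = target_width / base_width
    return {
        name: ModuleHP(lr=hp.lr / m, init_std=hp.init_std / m ** 0.5)
        for name, hp in base_hps.items()
    }


# Hypothetical values found by per-module search on a small proxy model.
SMALL_SCALE_HPS = {
    "attention": ModuleHP(lr=1e-2, init_std=0.02),
    "mlp": ModuleHP(lr=4e-3, init_std=0.02),
}

large_scale_hps = transfer_width(SMALL_SCALE_HPS, base_width=256, target_width=1024)
for name, hp in large_scale_hps.items():
    print(name, hp)
```

Keeping one ModuleHP entry per module means each module's tuned ratio to the others survives the transfer, which is the property per-module transfer relies on.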
In practice
- Optimize hyperparameters at small scale (e.g., 50M parameters).
- Use Complete(d)P for direct transfer to large models.
- Consider per-module HPO for better performance.
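The small-scale search step above can be sketched as a simple elitist evolution strategy over per-module log10 learning rates. This is a minimal sketch under stated assumptions: the module names are hypothetical, proxy_loss is a toy quadratic standing in for "train a small model and return its final loss", and the search settings (population size, mutation scale) are arbitrary. It is not the search procedure used in the paper, which this summary does not describe.

```python
import random
from typing import Callable, Dict, Tuple

# Hypothetical module names; a real model would define its own grouping.
MODULES = ["embedding", "attention", "mlp", "unembedding"]

# Toy optimum for each module's log10 learning rate (made up for illustration).
_TOY_OPTIMA = {"embedding": -2.0, "attention": -3.0, "mlp": -3.5, "unembedding": -2.5}


def proxy_loss(log_lrs: Dict[str, float]) -> float:
    """Toy stand-in for the final loss of a small-scale training run."""
    return sum((log_lrs[m] - _TOY_OPTIMA[m]) ** 2 for m in MODULES)


def evolve(
    loss_fn: Callable[[Dict[str, float]], float],
    init: Dict[str, float],
    generations: int = 200,
    children: int = 8,
    sigma: float = 0.3,
    seed: int = 0,
) -> Tuple[Dict[str, float], float]:
    """Elitist (1 + lambda) evolution strategy with Gaussian mutations."""
    rng = random.Random(seed)
    parent, parent_loss = dict(init), loss_fn(init)
    for _ in range(generations):
        # Mutate every module's value independently for each child.
        offspring = [
            {m: v + rng.gauss(0.0, sigma) for m, v in parent.items()}
            for _ in range(children)
        ]
        candidate = min(offspring, key=loss_fn)
        candidate_loss = loss_fn(candidate)
        # Keep the parent unless a child is strictly better.
        if candidate_loss < parent_loss:
            parent, parent_loss = candidate, candidate_loss
    return parent, parent_loss


# Start from a single global learning rate shared by all modules.
global_start = {m: -3.0 for m in MODULES}
best, best_loss = evolve(proxy_loss, global_start)
print("global start loss:", proxy_loss(global_start))
print("per-module loss:  ", best_loss)
```

In a real workflow, each call to the loss function would be a short training run of the small model, and the resulting per-module settings would then be carried to the larger model through the parameterisation's scaling rules.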
Topics
- Hyperparameter Transfer
- Complete(d) Parameterisation
- Per-module Optimization
- Large Language Models
- Neural Network Scaling
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.