Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
Summary
A new framework quantifies hyperparameter transfer, the practice of extrapolating optimal optimization hyperparameters from small-scale to large-scale training runs, which is critical for training large language models (LLMs). The framework uses three metrics: scaling law fit quality, robustness to extrapolation errors, and the asymptotic loss penalty incurred by the choice of parameterization. The researchers investigated why Maximal Update Parameterization (μP) offers better learning rate transfer than standard parameterization (SP) when training with AdamW. They found that μP's primary advantage stems from maximizing the embedding layer learning rate. In SP, the embedding layer learning rate acts as a bottleneck and induces training instabilities; increasing it by a factor of width to match μP significantly smooths training and improves hyperparameter transfer. Additionally, weight decay improves scaling law fits but hurts extrapolation robustness in fixed token-per-parameter settings.
Key takeaway
For machine learning engineers training large language models with standard parameterization (SP) and AdamW: examine your embedding layer learning rate. Increasing it by a factor of width, to match μP, can significantly stabilize training and improve hyperparameter transfer, which may reduce the need for complex hyperparameter tuning strategies.
Key insights
Maximizing the embedding layer learning rate is crucial for effective hyperparameter transfer in large language model training.
Principles
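- Hyperparameter transfer is quantifiable via three metrics: scaling law fit quality, extrapolation robustness, and asymptotic loss penalty.
- In SP, the embedding layer learning rate is a critical bottleneck.
- Weight decay improves scaling law fits but reduces extrapolation robustness in fixed token-per-parameter settings.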
Method
A framework quantifies hyperparameter transfer using three metrics: scaling law fit quality, robustness to extrapolation errors, and asymptotic loss penalty due to parameterization choice.
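To make the first metric concrete, the sketch below fits a power law to optimal learning rate versus model width and reports how well the fit holds. It is an illustration only: this summary does not give the paper's exact definitions, so using R² in log-log space as "fit quality" is an assumption, the function names `fit_power_law` and `extrapolate_lr` are hypothetical, and the data points are synthetic values generated from an exact power law, not results from the paper.

```python
import math


def fit_power_law(widths, optimal_lrs):
    """Fit lr = a * width**b by least squares in log-log space.

    Returns (a, b, r_squared). Using R^2 as the fit-quality score is an
    assumption of this sketch, not the paper's stated metric.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in optimal_lrs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # power-law exponent
    log_a = mean_y - b * mean_x        # log of the prefactor
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return math.exp(log_a), b, r_squared


def extrapolate_lr(a, b, width):
    """Predict the optimal learning rate at a larger, unseen width."""
    return a * width ** b


# Synthetic sweep results (NOT from the paper): lr_opt = 0.5 / width.
widths = [256, 512, 1024, 2048]
optimal_lrs = [0.5 / w for w in widths]

a, b, r2 = fit_power_law(widths, optimal_lrs)
predicted = extrapolate_lr(a, b, 8192)
```

On real sweep data the fit would be noisy. A low R² would suggest the scaling law is a poor basis for extrapolation, and comparing `predicted` against a measured optimum at the larger width is one way to probe the second metric, robustness to extrapolation error.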
In practice
- In SP, increase the embedding layer learning rate by a factor of width to match μP.
- Evaluate weight decay's impact on scaling law fits and extrapolation.
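The embedding learning rate adjustment above can be sketched as a per-parameter-group learning rate assignment. This is a minimal illustration under stated assumptions: the helper `build_param_groups` is hypothetical, identifying embedding parameters by the substring `"embed"` in their names is an assumption about naming, and multiplying by the raw width (rather than a ratio to some base width) follows this summary's wording, not a confirmed recipe from the paper.

```python
def build_param_groups(named_params, base_lr, width, embedding_keyword="embed"):
    """Split parameters into embedding and non-embedding groups.

    The embedding group gets base_lr * width, following the "factor of
    width" wording in the summary. Matching on `embedding_keyword` is a
    hypothetical naming convention; adapt it to your model.
    """
    embedding_params, other_params = [], []
    for name, param in named_params:
        if embedding_keyword in name:
            embedding_params.append(param)
        else:
            other_params.append(param)

    groups = [
        {"params": other_params, "lr": base_lr},
        {"params": embedding_params, "lr": base_lr * width},
    ]
    # Drop groups that ended up with no parameters.
    return [g for g in groups if g["params"]]


# Stand-in (name, parameter) pairs; strings replace real tensors here.
toy_named_params = [
    ("tok_embed.weight", "E"),
    ("blocks.0.attn.weight", "W1"),
    ("lm_head.weight", "W2"),
]
groups = build_param_groups(toy_named_params, base_lr=1e-4, width=512)
```

The returned list uses the per-group dictionary format that PyTorch optimizers accept, so with a real model it could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters(), base_lr, width))`. Treat the resulting embedding learning rate as a starting point to validate against your own training curves.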
Topics
- Hyperparameter Transfer
- Large Language Models
- Embedding Layer Learning Rate
- Maximal Update Parameterization (μP)
- AdamW Optimizer
- Scaling Laws
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.