Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate

2026-05-20 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

A new framework quantifies hyperparameter transfer, a critical process for training large language models (LLMs) by extrapolating optimal optimization hyperparameters from small to large scales. This framework uses three metrics: scaling law fit quality, robustness to extrapolation errors, and asymptotic loss penalty from parameterization choice. Researchers investigated why Maximal Update (μP) offers superior learning rate transfer compared to standard parameterization (SP) when using AdamW. They found that μP's primary advantage stems from maximizing the embedding layer learning rate. In SP, the embedding layer learning rate acts as a bottleneck, inducing training instabilities; increasing it by a factor of width to match μP significantly smooths training and enhances hyperparameter transfer. Additionally, weight decay improves scaling law fits but negatively impacts extrapolation robustness in fixed token-per-parameter scenarios.

Key takeaway

For Machine Learning Engineers optimizing large language models, if you are using standard parameterization (SP) with AdamW, you should critically examine your embedding layer learning rate. Increasing your embedding layer's learning rate by a factor of width, to align with μP's approach, can significantly stabilize training and improve hyperparameter transfer. This adjustment can mitigate training instabilities and enhance the scalability of your models, potentially reducing the need for complex hyperparameter tuning strategies.

Key insights

Maximizing the embedding layer learning rate is crucial for effective hyperparameter transfer in large language model training.

Principles

Hyperparameter transfer is quantifiable via three metrics.
Embedding layer learning rate is a critical bottleneck.
Weight decay impacts scaling law fits and robustness.

Method

A framework quantifies hyperparameter transfer using three metrics: scaling law fit quality, robustness to extrapolation errors, and asymptotic loss penalty due to parameterization choice.

In practice

Increase embedding layer learning rate in SP to match μP.
Evaluate weight decay's impact on scaling law fits and extrapolation.

Topics

Hyperparameter Transfer
Large Language Models
Embedding Layer Learning Rate
Maximal Update (μP)
AdamW Optimizer
Scaling Laws

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.