Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

2026-02-13 · Source: Apple Machine Learning Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new research paper introduces the Complete(d) Parameterisation, an extension of existing neural network parameterisation techniques such as μP, designed to unify hyperparameter scaling across model width, depth, batch-size, and training duration. The method targets hyperparameter tuning, which significantly impacts the stability and final performance of large-scale models. The authors also investigate per-module hyperparameter optimization and transfer, and demonstrate that their parameterisation enables effective transfer even in this granular regime. Their experiments cover learning rates, AdamW parameters, weight decay, initialization scales, and residual block multipliers. They show that per-module hyperparameters optimized at a small 50M-parameter scale, when transferred to a ~14000× larger FLOP budget, yield substantial training speed improvements in Large Language Models and outperform globally optimized hyperparameters.

Key takeaway

For research scientists optimizing large language models, adopting the Complete(d) Parameterisation and exploring per-module hyperparameter optimization at smaller scales can significantly reduce training time and improve performance. Consider running the initial hyperparameter search with an evolutionary strategy, then transferring the optimized settings directly to much larger models, relying on the unified scaling properties of Complete(d)P.

Key insights

Complete(d)P enables effective per-module hyperparameter transfer across diverse scaling axes for large models.

Principles

Per-module HPO can outperform global HPO.
A single parameterisation unifies scaling across width, depth, batch-size, and training duration.

Method

The Complete(d) Parameterisation unifies scaling in width, depth, batch-size, and training duration, allowing for per-module hyperparameter optimization and direct transfer to larger models.
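To make the idea of per-module transfer concrete, here is a minimal sketch of how per-module hyperparameters tuned at a small width might be rescaled for a wider model. It uses the familiar μP-style width rule for hidden weight matrices under Adam-type optimizers (learning rate scaled by 1/m and initialization standard deviation by 1/sqrt(m), where m is the width multiplier). This is an assumption for illustration only: it is not the Complete(d)P rule set, which also covers depth, batch-size, and training duration and is not spelled out in this summary. The names ModuleHP and transfer_width, the module names, and the choice to leave weight decay untouched are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleHP:
    """Hyperparameters for one module (hypothetical container)."""
    lr: float
    weight_decay: float
    init_std: float


def transfer_width(base, base_width, target_width, hidden_modules):
    """Rescale per-module HPs from base_width to target_width.

    Assumption: muP-style width rule for hidden matrices under Adam-type
    optimizers (lr ~ 1/m, init_std ~ 1/sqrt(m)). This is NOT the full
    Complete(d)P rule set; depth, batch-size and duration are ignored.
    """
    m = target_width / base_width  # width multiplier
    scaled = {}
    for name, hp in base.items():
        if name in hidden_modules:
            scaled[name] = ModuleHP(
                lr=hp.lr / m,
                weight_decay=hp.weight_decay,  # left unchanged for simplicity
                init_std=hp.init_std / m ** 0.5,
            )
        else:
            # Non-hidden modules are passed through untouched in this sketch.
            scaled[name] = hp
    return scaled


# Hypothetical per-module settings found at a small proxy width.
small = {
    "attn.qkv": ModuleHP(lr=1e-2, weight_decay=0.1, init_std=0.02),
    "embed": ModuleHP(lr=1e-1, weight_decay=0.0, init_std=1.0),
}
large = transfer_width(small, base_width=256, target_width=1024,
                       hidden_modules={"attn.qkv"})
```

The point of the sketch is the shape of the workflow: each module carries its own hyperparameters, and a fixed rule maps them to the larger model without any further search.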

In practice

Optimize HPs at small scale (e.g., 50M params).
Use Complete(d)P for direct transfer to large models.
Consider per-module HPO for better performance.
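The small-scale search step can be sketched as a simple elitist (1+λ) evolution strategy over per-module learning rates, mutated in log space. This is a generic illustration under stated assumptions, not the search procedure used in the paper: the function names, the module names, and the toy objective (which stands in for a real training run of the small proxy model) are all hypothetical.

```python
import math
import random


def evolve(loss_fn, init, generations=50, population=8, sigma=0.3, seed=0):
    """Elitist (1+lambda) evolution strategy over positive hyperparameters.

    Each child multiplies every parent value by exp(N(0, sigma)), i.e. a
    Gaussian perturbation in log space. The parent is replaced only when
    the best child has a strictly lower loss.
    """
    rng = random.Random(seed)
    parent = dict(init)
    parent_loss = loss_fn(parent)
    for _ in range(generations):
        children = [
            {k: v * math.exp(rng.gauss(0.0, sigma)) for k, v in parent.items()}
            for _ in range(population)
        ]
        losses = [loss_fn(c) for c in children]
        best_idx = min(range(population), key=losses.__getitem__)
        if losses[best_idx] < parent_loss:
            parent, parent_loss = children[best_idx], losses[best_idx]
    return parent, parent_loss


def toy_loss(hp):
    """Stand-in objective: squared log-distance to made-up 'optimal' rates.

    In a real setting this would train the small model and return its loss.
    """
    target = {"embed": 1e-1, "attn": 3e-3, "mlp": 1e-3}
    return sum((math.log(hp[k]) - math.log(target[k])) ** 2 for k in target)


# Start from a single global learning rate and let each module drift apart.
start = {"embed": 1e-2, "attn": 1e-2, "mlp": 1e-2}
best, best_loss = evolve(toy_loss, start)
```

Searching in log space keeps every candidate positive and treats a 2× change the same way at any magnitude, which suits learning rates and similar scale-type hyperparameters.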

Topics

Hyperparameter Transfer
Complete(d) Parameterisation
Per-module Optimization
Large Language Models
Neural Network Scaling

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.