Fantastic Pretraining Optimizers and Where to Find Them II: Hyperball Optimization
Summary
Hyperball Optimization introduces an optimizer wrapper designed to address the diminishing gains of matrix-based optimizers such as Muon over AdamW in large language model pretraining, particularly under standard constant decoupled weight decay. The wrapper, named Hyperball, fixes the Frobenius norms of weight matrices and of their optimizer updates to constant values, and works with base optimizers such as Adam or Muon. Evaluations on Qwen3-style models of up to 1.2B parameters show that Muon Hyperball achieves a 20-30% token-equivalent speedup over weight decay baselines. Hyperball also improves learning rate transferability across model widths and depths. The method is grounded in theory suggesting that weight decay establishes an equilibrium weight norm, which in turn determines the angular learning rate.
Key takeaway
For machine learning engineers optimizing large language model pretraining: if matrix-based optimizers such as Muon lose their advantage over AdamW as your models grow, Hyperball is worth evaluating. In the reported experiments on Qwen3-style models of up to 1.2B parameters, Muon Hyperball delivered a 20-30% token-equivalent speedup over weight decay baselines and improved learning rate transfer across model widths and depths.
Key insights
Hyperball fixes weight matrix Frobenius norms to improve matrix optimizer performance in large model pretraining.
Principles
- Weight decay sets an equilibrium weight norm.
- Equilibrium weight norm dictates angular learning rate.
- Holding the Frobenius norms of weights and updates constant controls the angular learning rate directly.
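The link between weight decay, weight norm, and angular learning rate can be illustrated with a short derivation. It is a sketch under simplifying assumptions that the summary does not state: a decoupled weight decay step with learning rate $\eta$ and decay coefficient $\lambda$, an update $U_t$ of roughly constant Frobenius norm $\|U\|_F$, and an update that is on average orthogonal to the weights.

```latex
\begin{align}
W_{t+1} &= (1 - \eta\lambda)\, W_t - \eta\, U_t \\
\|W_{t+1}\|_F^2 &\approx (1 - \eta\lambda)^2 \|W_t\|_F^2 + \eta^2 \|U\|_F^2
  && \text{(assumes } \langle W_t, U_t \rangle \approx 0\text{)} \\
\|W\|_{\mathrm{eq}} &\approx \|U\|_F \sqrt{\frac{\eta}{2\lambda}}
  && \text{(set } \|W_{t+1}\|_F = \|W_t\|_F,\ \eta\lambda \ll 1\text{)} \\
\Delta\theta &\approx \frac{\eta\, \|U\|_F}{\|W\|_{\mathrm{eq}}} \approx \sqrt{2\eta\lambda}
\end{align}
```

Under these assumptions the per-step rotation of the weights depends on the product $\eta\lambda$ only through the equilibrium norm, which is the quantity Hyperball fixes explicitly.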
Method
Hyperball wraps a base optimizer (e.g., Adam, Muon) to enforce fixed Frobenius norms for weight matrices and their updates during training.
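The wrapping step described above might be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `frobenius_normalize` and `hyperball_step` are hypothetical, the choice of the initial weight norm as the fixed radius is an assumption, and random matrices stand in for the output of a base optimizer such as Adam or Muon.

```python
import numpy as np


def frobenius_normalize(matrix, eps=1e-12):
    """Scale a matrix to unit Frobenius norm (eps guards against division by zero)."""
    return matrix / (np.linalg.norm(matrix) + eps)


def hyperball_step(weight, base_update, radius, lr):
    """One hypothetical Hyperball-style step.

    The base optimizer's update is rescaled to a fixed Frobenius norm
    (lr * radius), applied, and the result is projected back onto the
    sphere of Frobenius norm `radius`.
    """
    step = lr * radius * frobenius_normalize(base_update)
    candidate = weight - step
    return radius * frobenius_normalize(candidate)


rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 3))
# Assumption: the radius is taken to be the weight norm at initialization.
radius = float(np.linalg.norm(weight))

for _ in range(5):
    # Stand-in for the direction produced by a base optimizer (Adam, Muon, ...).
    base_update = rng.standard_normal(weight.shape)
    weight = hyperball_step(weight, base_update, radius, lr=0.1)

final_norm = float(np.linalg.norm(weight))
```

In this sketch the weight norm stays at `radius` after every step, so the learning rate acts only on the direction of the weights, with no weight decay term involved.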
In practice
- Wrap Muon with Hyperball; the reported gain is a 20-30% token-equivalent speedup over weight decay baselines.
- Use Hyperball to improve learning rate transfer across model widths and depths.
- Note that the reported results cover Qwen3-style models of up to 1.2B parameters.
Topics
- Pretraining Optimizers
- Hyperball Optimization
- Muon Optimizer
- AdamW
- Weight Decay
- Language Models
- Qwen3 Models
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.