Fantastic Pretraining Optimizers and Where to Find Them II: Hyperball Optimization

2026-06-15 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Hyperball Optimization introduces an optimizer wrapper aimed at a specific problem: the gains of matrix-based optimizers such as Muon over AdamW diminish in large language model pretraining, particularly under standard constant decoupled weight decay. The wrapper, named Hyperball, fixes the Frobenius norms of the weight matrices and of their corresponding optimizer updates to constant values, and works with base optimizers such as Adam or Muon. In evaluations on Qwen3-style models of up to 1.2B parameters, Muon Hyperball achieves a 20-30% token-equivalent speedup over weight decay baselines. Hyperball also improves learning rate transfer across model widths and depths. The method is grounded in theory suggesting that weight decay establishes an equilibrium weight norm, which in turn determines the angular learning rate.

Key takeaway

For machine learning engineers pretraining large language models: if a matrix-based optimizer such as Muon is losing its advantage over AdamW under constant decoupled weight decay, consider wrapping it with Hyperball. In the reported experiments on Qwen3-style models of up to 1.2B parameters, Muon Hyperball delivered a 20-30% token-equivalent speedup over weight decay baselines and improved learning rate transfer across model widths and depths.

Key insights

Hyperball fixes the Frobenius norms of weight matrices and of their updates to improve matrix optimizer performance in large model pretraining.

Principles

Weight decay sets an equilibrium weight norm.
Equilibrium weight norm dictates angular learning rate.
Constant Frobenius norms stabilize optimizer updates.
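
The first two principles can be illustrated with a short back-of-the-envelope derivation. This is a sketch under assumptions that are not stated in the summary: a decoupled weight decay update with learning rate \eta and decay coefficient \lambda, an update U_t of constant Frobenius norm c, and updates that are on average orthogonal to the weights. The symbols are illustrative and not taken from the original article.

```latex
% Decoupled weight decay step (assumed form):
W_{t+1} = (1 - \eta\lambda)\, W_t - \eta\, U_t, \qquad \|U_t\|_F = c, \qquad \langle W_t, U_t \rangle \approx 0

% Squared norm recursion under the orthogonality assumption:
\|W_{t+1}\|_F^2 \approx (1 - \eta\lambda)^2 \|W_t\|_F^2 + \eta^2 c^2

% Fixed point (equilibrium norm), for small \eta\lambda:
\|W\|_F^2 = \frac{\eta^2 c^2}{1 - (1 - \eta\lambda)^2} = \frac{\eta c^2}{\lambda (2 - \eta\lambda)} \approx \frac{\eta c^2}{2\lambda}

% Angular step per update at equilibrium:
\frac{\eta\, \|U_t\|_F}{\|W\|_F} \approx \frac{\eta c}{c \sqrt{\eta / (2\lambda)}} = \sqrt{2 \eta \lambda}
```

Under these assumptions, the weight decay coefficient does not act as an independent regularizer so much as set, together with the learning rate, how far the weights rotate per step.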

Method

Hyperball wraps a base optimizer (e.g., Adam, Muon) to enforce fixed Frobenius norms for weight matrices and their updates during training.
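
A minimal NumPy sketch of one way such a wrapper step could look, based only on the description above. The function name `hyperball_step`, the `radius` parameter, and the choice to scale the update to the same norm as the weights are assumptions made for illustration; the original article may define the constants and the projection differently.

```python
import numpy as np

def hyperball_step(weight, base_update, lr, radius, eps=1e-12):
    """One illustrative step: fixed-norm update, then rescale weights to a fixed norm.

    `base_update` stands in for whatever the base optimizer (e.g. Adam or Muon)
    proposes; producing it is outside the scope of this sketch.
    """
    # Rescale the proposed update so its Frobenius norm equals `radius` (assumed constant).
    update_norm = np.linalg.norm(base_update)
    scaled_update = base_update * (radius / (update_norm + eps))
    # Take the step, then rescale the weights back to Frobenius norm `radius`.
    candidate = weight - lr * scaled_update
    return candidate * (radius / (np.linalg.norm(candidate) + eps))

# Usage: keep the weight norm at its initial value across a step.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
radius = np.linalg.norm(W)            # Frobenius norm for a 2-D array
U = rng.standard_normal((4, 3))       # placeholder for a base optimizer update
lr = 0.1
W_next = hyperball_step(W, U, lr=lr, radius=radius)
cos_angle = np.sum(W * W_next) / (np.linalg.norm(W) * np.linalg.norm(W_next))
```

In this sketch the step taken before rescaling has norm `lr * radius`, so `lr` directly bounds the rotation of the weight matrix per step, which is one way to read the claim that the learning rate becomes easier to transfer across model sizes.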

In practice

Wrap Muon with Hyperball for a reported 20-30% token-equivalent speedup over weight decay baselines.
Use Hyperball to improve learning rate transfer across model widths and depths.
Treat Qwen3-style models of up to 1.2B parameters as the evaluated range for these results.

Topics

Pretraining Optimizers
Hyperball Optimization
Muon Optimizer
AdamW
Weight Decay
Language Models
Qwen3 Models

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.