Muon Learns More Robust and Transferable Features than Adam

2026-06-08 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Muon, an emerging optimizer for pretraining Large Language Models (LLMs) and vision classifiers, shows feature-learning advantages over Adam and SGD. The source paper reports that features learned by Muon are consistently more robust when evaluated on corrupted images and texts, across both transformer and Convolutional Neural Network (CNN) architectures. This robustness is also reflected in larger logit margins across layers. Muon-learned features also transfer better to downstream tasks, which the authors link to more diverse hidden states, as measured by effective rank. The empirical findings are backed by a theoretical analysis of a classification problem in which Muon achieves larger margins and higher effective rank than Adam and SGD.
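The two diagnostics named in the summary, logit margin and effective rank, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's evaluation code: it assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and the usual multiclass margin (true-class logit minus the largest competing logit). The function names `effective_rank` and `logit_margins` are hypothetical.

```python
import numpy as np


def effective_rank(hidden_states, eps=1e-12):
    """Entropy-based effective rank of a (samples x features) matrix.

    Assumed definition: exp of the Shannon entropy of the singular values
    after normalizing them to sum to 1. The source does not say which
    variant of effective rank the paper uses.
    """
    s = np.linalg.svd(np.asarray(hidden_states, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically-zero directions so log() stays finite
    return float(np.exp(-(p * np.log(p)).sum()))


def logit_margins(logits, labels):
    """Per-example margin: true-class logit minus the largest other logit."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(logits.shape[0])
    true_logit = logits[rows, labels]
    others = logits.copy()
    others[rows, labels] = -np.inf  # mask the true class before taking the max
    return true_logit - others.max(axis=1)
```

A higher effective rank means the hidden states spread their variance over more directions; a positive margin means the example is classified correctly, and a larger one means the prediction survives a larger perturbation of the logits.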

Key takeaway

If you are a Machine Learning Engineer pretraining large models, consider Muon as an alternative to Adam or SGD. In the reported experiments, Muon consistently yields more robust and transferable features, which may improve your model's performance on corrupted data and downstream tasks. Features that transfer better could also reduce the amount of fine-tuning needed downstream.

Key insights

The Muon optimizer yields more robust and transferable features than Adam and SGD across transformer and CNN architectures, on both image and text tasks.

Principles

Muon improves feature robustness.
Muon enhances feature transferability.

In practice

Use Muon for LLM pretraining.
Apply Muon to vision classifier training.
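For readers unfamiliar with the optimizer, the core idea of a Muon update on a single 2D weight matrix can be sketched as follows. This is a simplified illustration under stated assumptions, not a reference implementation: public Muon implementations approximate the orthogonalization with a few Newton-Schulz iterations and add details such as Nesterov momentum and shape-dependent scaling, whereas this sketch swaps in an exact SVD for clarity. The function names and the default `lr` and `beta` values are hypothetical.

```python
import numpy as np


def orthogonalize(matrix):
    """Return the nearest semi-orthogonal matrix U @ V^T via an exact SVD.

    Assumption: real Muon code approximates this step with Newton-Schulz
    iterations instead of computing an SVD.
    """
    u, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ vt


def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon-style step for a 2D weight matrix.

    lr and beta are illustrative values, not recommendations.
    """
    momentum_buf = beta * momentum_buf + grad  # accumulate momentum
    update = orthogonalize(momentum_buf)       # equalize all singular values
    return weight - lr * update, momentum_buf
```

Because the orthogonalized update has all singular values equal to 1, every direction of the momentum matrix contributes equally to the step, in contrast to Adam's element-wise scaling.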

Topics

Muon Optimizer
Large Language Models
Vision Classifiers
Feature Robustness
Transfer Learning
Deep Learning Optimizers

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.