PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
Summary
A novel preconditioning (PC) layer is proposed to make Large Language Model (LLM) pre-training more stable. The module parameterizes weights using a polynomial preconditioner, which reshapes the singular-value spectrum of each weight matrix so that conditioning stays consistent throughout training. After training, the preconditioned weights can be merged back into the original architecture, so inference incurs no additional overhead. In Llama-1B pre-training, models using the PC layer showed a clear advantage over standard transformers under both the AdamW and Muon optimizers. On the theory side, the approach is supported by a proof that uniformly bounding each layer's singular values leads to geometric convergence of gradient descent in certain deep linear networks. The associated code is publicly available.
Key takeaway
For Machine Learning Engineers optimizing LLM pre-training, consider integrating the PC layer to improve training stability and convergence. The method lets you control weight conditioning during training and then merge the preconditioned weights back into the original architecture, so inference cost is unchanged. Evaluate its impact on your own LLM architectures, especially if you are seeing training instability or slow convergence with standard transformers.
Key insights
Polynomial preconditioning stabilizes LLM training by controlling the singular values of weight matrices, and it adds no inference overhead because the preconditioned weights are merged back after training.
Principles
- Stable weight conditioning improves LLM training.
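- Controlling the singular-value spectrum of each weight matrix is what keeps conditioning consistent during training.
- With each layer's singular values uniformly bounded, gradient descent provably converges geometrically in certain deep linear networks.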
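To make "conditioning" concrete, here is a minimal NumPy sketch that measures the condition number of a weight matrix, meaning the ratio of its largest to its smallest singular value. The helper name `condition_number` and the example matrix are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np

def condition_number(weight):
    """Ratio of the largest to the smallest singular value of a matrix."""
    # np.linalg.svd returns singular values sorted in descending order.
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return singular_values[0] / singular_values[-1]

# Illustrative matrix built to have singular values 10, 1 and 0.1.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal factor
v, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal factor
ill_conditioned = u @ np.diag([10.0, 1.0, 0.1]) @ v.T

kappa = condition_number(ill_conditioned)  # largest / smallest = 10 / 0.1
```

A large ratio means the layer stretches some input directions far more than others, which is the kind of imbalance that spectrum control aims to limit.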
Method
Parameterize weights via a polynomial preconditioner to reshape the singular-value spectrum, then merge the preconditioned weights back into the original architecture after training for zero inference overhead.
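The idea can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the paper's implementation: the summary does not specify the polynomial, so the odd cubic p(s) = 1.5s - 0.5s^3 and the names `precondition` and `pc_forward` are hypothetical choices made only for illustration. What the sketch relies on is a standard linear-algebra fact: if W = U diag(s) V^T, then aW + b(W W^T)W = U diag(as + bs^3) V^T, so the polynomial acts directly on the singular values.

```python
import numpy as np

# Hypothetical coefficients for the odd polynomial p(s) = A*s + B*s**3.
# Illustrative only; NOT the coefficients used in the paper.
A, B = 1.5, -0.5

def precondition(raw_weight):
    """Return A*W + B*(W W^T) W, which maps each singular value s to p(s)."""
    gram = raw_weight @ raw_weight.T
    return A * raw_weight + B * (gram @ raw_weight)

def pc_forward(raw_weight, x):
    """Training-time forward pass: precondition on the fly, then apply."""
    return precondition(raw_weight) @ x

# Build a raw weight with known singular values 1.0, 0.5 and 0.1.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((3, 3)))
v, _ = np.linalg.qr(rng.standard_normal((3, 3)))
raw_singular_values = np.array([1.0, 0.5, 0.1])
raw_weight = u @ np.diag(raw_singular_values) @ v.T

# After training: compute the preconditioned weight once and store it.
merged_weight = precondition(raw_weight)
new_singular_values = np.linalg.svd(merged_weight, compute_uv=False)

# Inference is then a plain matrix multiply with the merged weight.
x = rng.standard_normal(3)
train_output = pc_forward(raw_weight, x)
inference_output = merged_weight @ x
```

With this particular polynomial the small singular values are lifted more than the large ones, so the spread of the spectrum shrinks. Because the merged weight is an ordinary matrix, the inference path matches a standard linear layer.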
In practice
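- Apply the PC layer during pre-training, as was done for Llama-1B.
- Use it with the AdamW or Muon optimizer.
- Merge the preconditioned weights into the original architecture after training, so inference cost is unchanged.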
Topics
- Large Language Models
- Pre-training Optimization
- Weight Preconditioning
- Singular Value Decomposition
- AdamW Optimizer
- Muon Optimizer
- Llama-1B
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.