PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
Summary
The PC layer introduces a weight parameterization based on a polynomial preconditioner that keeps weight matrices well conditioned during Large Language Model (LLM) training. The module reshapes the singular-value spectrum of each weight matrix with a low-degree polynomial. After training, the preconditioned weights can be merged back into the original architecture, so the method adds no inference overhead. In Llama-1B pre-training it yields a 2x token-efficiency speedup with AdamW and 1.13x with Muon, alongside improved zero-shot downstream accuracy. It also reduces the Global Modified Condition Number (GMCN) by approximately 41% relative to baselines on Llama-1B.
Key takeaway
For AI scientists and machine learning engineers optimizing LLM pre-training, the PC layer offers better token efficiency (2x with AdamW and 1.13x with Muon on Llama-1B) and improved zero-shot downstream accuracy. Consider applying polynomial weight preconditioning to Llama-style models, particularly the FFN and attention output projections. It stabilizes training dynamics and, because the preconditioned weights merge back into the original architecture after training, it adds no inference overhead, which makes it practical for production systems.
Key insights
Polynomial preconditioning reshapes weight singular-value spectra to improve LLM training stability and efficiency without inference overhead.
Principles
- Bounded singular values ensure geometric convergence.
- Soft spectrum conditioning balances optimization and expressiveness.
- Weight geometry is a key design axis for stable LLM training.
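The link between singular values and convergence speed can be illustrated in a textbook setting that is not taken from the paper: gradient descent on a linear least-squares problem. There, with step size 1/s_max^2, the error shrinks by at least a factor of 1 - (s_min/s_max)^2 per step, so a tighter singular-value range gives a faster geometric rate. The sketch below assumes a diagonal matrix for simplicity; the function names and the example spectra are illustrative choices, and the paper's own analysis may use a different setting.

```python
import numpy as np

def gd_error_history(singular_values, steps=50):
    """Run gradient descent on 0.5 * ||A x - b||^2 with A = diag(singular_values).

    Returns the distance to the exact solution after each step.
    """
    s = np.asarray(singular_values, dtype=float)
    A = np.diag(s)
    x_star = np.ones_like(s)          # known solution, chosen for illustration
    b = A @ x_star
    x = np.zeros_like(s)
    lr = 1.0 / s.max() ** 2           # step size 1 / s_max^2
    errors = [np.linalg.norm(x - x_star)]
    for _ in range(steps):
        x = x - lr * A.T @ (A @ x - b)
        errors.append(np.linalg.norm(x - x_star))
    return np.array(errors)

def contraction_bound(singular_values):
    """Worst-case per-step error factor: 1 - (s_min / s_max)^2."""
    s = np.asarray(singular_values, dtype=float)
    return 1.0 - (s.min() / s.max()) ** 2

well_conditioned = [1.0, 0.8, 0.6]    # narrow spectrum
ill_conditioned = [1.0, 0.3, 0.05]    # wide spectrum

err_well = gd_error_history(well_conditioned)
err_ill = gd_error_history(ill_conditioned)

for name, s, err in [("well", well_conditioned, err_well),
                     ("ill", ill_conditioned, err_ill)]:
    print(name, "bound per step:", contraction_bound(s), "final error:", err[-1])
```

The narrow spectrum reaches a much smaller error in the same number of steps, which is the intuition behind treating weight conditioning as a design axis.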
Method
The PC layer normalizes weights, applies a low-degree polynomial to reshape singular values, then performs norm recovery and learnable scaling.
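The four steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pc_weight`, the use of the Frobenius norm for normalization and norm recovery, and the cubic p(s) = 1.5s - 0.5s^3 (a Newton-Schulz-style polynomial swapped in as a stand-in) are all hypothetical choices, and the learnable scale is represented by a plain float.

```python
import numpy as np

def pc_weight(W, coeffs=(1.5, -0.5), scale=1.0):
    """Hypothetical sketch of a polynomially preconditioned weight.

    coeffs are the coefficients c_k of the odd polynomial
    p(s) = sum_k c_k * s^(2k+1) applied to the singular values.
    The default cubic is an assumption, not the paper's polynomial.
    """
    norm = np.linalg.norm(W)             # Frobenius norm of the raw weight
    X = W / norm                         # 1) normalize: all singular values <= 1
    gram = X.T @ X
    term = X
    Y = coeffs[0] * X
    for c in coeffs[1:]:
        term = term @ gram               # X (X^T X)^k has singular values s^(2k+1)
        Y = Y + c * term                 # 2) accumulate the polynomial in X
    Y = Y * (norm / np.linalg.norm(Y))   # 3) norm recovery (assumed Frobenius)
    return scale * Y                     # 4) scaling (learnable in a real layer)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
P = pc_weight(W)

print("condition number before:", np.linalg.cond(W))
print("condition number after: ", np.linalg.cond(P))
```

With this particular cubic, small singular values are amplified relatively more than large ones, so the ratio of largest to smallest singular value shrinks while the singular vectors and the overall norm are unchanged.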
In practice
- Achieve 2x token-efficiency speedup for Llama-1B with AdamW.
- Improve zero-shot downstream accuracy by 0.0206 (AdamW) and 0.0125 (Muon).
- Integrate PC layer without inference-time computational cost.
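The zero-overhead claim follows from the fact that the preconditioned weight is a deterministic function of the trained parameters, so it can be computed once and stored as an ordinary matrix. The sketch below illustrates that idea only; the class `PCLinearSketch`, its method names, and the cubic 1.5s - 0.5s^3 used as the preconditioner are hypothetical stand-ins rather than the paper's code.

```python
import numpy as np

class PCLinearSketch:
    """Hypothetical linear layer whose weight is preconditioned on the fly."""

    def __init__(self, raw_weight, scale=1.0):
        self.raw_weight = raw_weight     # shape (out_features, in_features)
        self.scale = scale               # stands in for the learnable scale

    def effective_weight(self):
        W = self.raw_weight
        norm = np.linalg.norm(W)
        X = W / norm
        Y = 1.5 * X - 0.5 * X @ (X.T @ X)   # assumed cubic preconditioner
        return self.scale * Y * (norm / np.linalg.norm(Y))

    def forward(self, x):
        # Training-time path: the effective weight is recomputed on every call.
        return x @ self.effective_weight().T

    def merge(self):
        # Post-training: compute the effective weight once and keep it
        # as a plain matrix for a standard linear layer.
        return self.effective_weight().copy()

rng = np.random.default_rng(0)
layer = PCLinearSketch(rng.standard_normal((4, 3)), scale=0.7)
x = rng.standard_normal((5, 3))

W_merged = layer.merge()
y_training_path = layer.forward(x)
y_merged = x @ W_merged.T                # a single matmul, as in a plain layer

print("max difference:", np.abs(y_training_path - y_merged).max())
```

After merging, inference uses only `W_merged`, so the deployed model has the same shape and cost as one trained without the parameterization.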
Topics
- Large Language Models
- Weight Preconditioning
- Polynomial Preconditioners
- Optimization Stability
- Singular Value Spectrum
- Llama Architecture
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.