PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

The PC layer reparameterizes weights through a polynomial preconditioner to keep weight matrices well conditioned during Large Language Model (LLM) training. The module reshapes the singular-value spectrum of each weight matrix with a low-degree polynomial; after training, the preconditioned weights can be merged back into the original architecture, so inference incurs no overhead. In Llama-1B pre-training, it yields a 2x token-efficiency speedup with AdamW and 1.13x with Muon, along with improved zero-shot downstream accuracy. It also reduces the Global Modified Condition Number (GMCN) by approximately 41% relative to baselines on Llama-1B.

Key takeaway

For AI scientists and machine learning engineers optimizing LLM pre-training, the PC layer offers better token efficiency (a reported 2x with AdamW and 1.13x with Muon on Llama-1B) and improved downstream performance. Consider applying polynomial weight preconditioning to Llama-style models, particularly the FFN and attention output projections: it stabilizes training dynamics and, because it merges into the original weights after training, adds no inference overhead, which makes it practical for production systems.

Key insights

Polynomial preconditioning reshapes weight singular-value spectra to improve LLM training stability and efficiency without inference overhead.

Method

The PC layer normalizes weights, applies a low-degree polynomial to reshape singular values, then performs norm recovery and learnable scaling.
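The four steps named above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name `pc_layer_weight`, the Frobenius-norm normalization, the cubic coefficients `(1.5, -0.5)` (the classic Newton-Schulz polynomial), and the scalar `gamma` are all assumptions chosen for clarity, since the summary does not give the actual polynomial or scaling form.

```python
import numpy as np


def pc_layer_weight(W, coeffs=(1.5, -0.5), gamma=1.0, eps=1e-8):
    """Hypothetical sketch of a polynomially preconditioned weight.

    coeffs defines an odd polynomial p(s) = c0*s + c1*s^3 + ... that is
    applied to the singular values of the normalized weight. The default
    (1.5, -0.5) is an assumed choice, not taken from the paper.
    """
    # Step 1: normalize so every singular value lies in [0, 1].
    fro = np.linalg.norm(W)          # Frobenius norm
    X = W / (fro + eps)

    # Step 2: evaluate p(X) = sum_k c_k * X (X^T X)^k.
    # If X = U S V^T, then X (X^T X)^k = U S^(2k+1) V^T, so this changes
    # only the singular values and leaves the singular vectors intact.
    gram = X.T @ X
    term = X
    Y = np.zeros_like(X)
    for c in coeffs:
        Y = Y + c * term
        term = term @ gram

    # Step 3: norm recovery, rescaling back to the original Frobenius norm.
    Y = Y * (fro / (np.linalg.norm(Y) + eps))

    # Step 4: learnable scaling (a plain scalar stands in for a parameter).
    return gamma * Y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 8))
    W_pc = pc_layer_weight(W)
    print("condition number before:", np.linalg.cond(W))
    print("condition number after: ", np.linalg.cond(W_pc))
```

With the assumed cubic, the ratio p(s)/s = 1.5 - 0.5 s^2 is larger for small singular values than for large ones, so the spectrum is compressed and the condition number drops. Because the result is an ordinary matrix of the same shape as `W`, it can be computed once after training and stored in place of the original weight, which is consistent with the zero-inference-overhead claim.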

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.