Small Initialization Matters for Large Language Models
Summary
A study reveals that parameter initialization is a crucial determinant of large language model training and capacity, challenging the sole attribution of progress to scale, data, and architecture. Reducing the initialization scale consistently enhances pretraining, yielding the most significant improvements on reasoning-demanding tasks. The research identifies two common empirical settings that limit the benefits of small initialization, demonstrating that relaxing these settings restores favorable scaling. A critical initialization point is also uncovered, balancing reasoning capabilities with training efficiency. Mechanistically, small initialization guides a distinct developmental path where parameters initially form low-complexity structures before evolving into richer representations. Token-level analyses confirm gains are concentrated on non-trivial, context-constrained predictions. This work proposes a simple γ-initialization rule: expose initialization range as an explicit knob and use small initialization by default, offering a cost-free intervention to improve pretraining and strengthen reasoning across model scales.
Key takeaway
Machine learning engineers optimizing large language model pretraining should prioritize experimenting with smaller parameter initialization scales. The proposed γ-initialization rule is an almost cost-free intervention that, according to the study, enhances model capacity and strengthens reasoning, particularly on non-trivial, context-constrained predictions. Evaluate your current initialization settings and consider relaxing any empirical settings that might be limiting these benefits.
Key insights
Small parameter initialization significantly improves large language model pretraining and reasoning capacity by guiding distinct developmental trajectories.
Principles
- Reducing initialization scale consistently improves LLM pretraining.
- Small initialization drives parameters from low-complexity to rich representations.
- Critical initialization balances reasoning and training performance.
Method
The γ-initialization rule exposes the initialization range as an explicit knob and uses small initialization by default, with the aim of improving pretraining and reasoning.
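This summary does not give the exact formula behind γ-initialization, so the following is only a minimal sketch of what "exposing the initialization range as a knob" could look like. It assumes, hypothetically, that weights are drawn from a zero-mean Gaussian with standard deviation fan_in^(-γ), so that γ = 0.5 recovers the familiar 1/sqrt(fan_in) scale and larger γ gives smaller initialization. The function names and the parameterization are illustrative assumptions, not the paper's definition.

```python
import math
import random


def gamma_init_std(fan_in: int, gamma: float) -> float:
    """Standard deviation under the ASSUMED rule std = fan_in ** (-gamma).

    gamma = 0.5 gives 1/sqrt(fan_in); larger gamma means smaller initialization.
    This parameterization is an illustrative assumption, not taken from the source.
    """
    if fan_in <= 0:
        raise ValueError("fan_in must be positive")
    return fan_in ** (-gamma)


def init_weight(fan_in: int, fan_out: int, gamma: float, seed: int = 0):
    """Draw a fan_out x fan_in weight matrix from N(0, std^2) with the std above."""
    rng = random.Random(seed)  # local generator so the draw is reproducible
    std = gamma_init_std(fan_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def empirical_std(matrix) -> float:
    """Population standard deviation over all entries of the matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


if __name__ == "__main__":
    # Sweep the knob: each gamma is one candidate initialization scale.
    for gamma in (0.5, 0.8, 1.0):
        weights = init_weight(fan_in=256, fan_out=64, gamma=gamma)
        print(f"gamma={gamma}: target std={gamma_init_std(256, gamma):.6f}, "
              f"empirical std={empirical_std(weights):.6f}")
```

Treating γ as a single scalar makes the initialization scale easy to sweep alongside other hyperparameters. Which value counts as "small" or "critical" for a given model is something the original paper, not this sketch, would determine.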
In practice
- Implement γ-initialization for cost-free pretraining improvements.
- Relax empirical settings that restrain small initialization advantages.
- Evaluate on non-trivial, context-constrained predictions, where the reasoning gains are concentrated.
Topics
- Large Language Models
- Parameter Initialization
- LLM Pretraining
- Reasoning Tasks
- Model Capacity
- γ-initialization
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.