Small Initialization Matters for Large Language Models

2026-06-16 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

A study reveals that parameter initialization is a crucial determinant of large language model training and capacity, challenging the sole attribution of progress to scale, data, and architecture. Reducing the initialization scale consistently enhances pretraining, yielding the most significant improvements on reasoning-demanding tasks. The research identifies two common empirical settings that limit the benefits of small initialization, demonstrating that relaxing these settings restores favorable scaling. A critical initialization point is also uncovered, balancing reasoning capabilities with training efficiency. Mechanistically, small initialization guides a distinct developmental path where parameters initially form low-complexity structures before evolving into richer representations. Token-level analyses confirm gains are concentrated on non-trivial, context-constrained predictions. This work proposes a simple γ-initialization rule: expose initialization range as an explicit knob and use small initialization by default, offering a cost-free intervention to improve pretraining and strengthen reasoning across model scales.

Key takeaway

Machine learning engineers optimizing large language model pretraining should prioritize experimenting with smaller parameter initialization scales. The proposed γ-initialization rule is an almost cost-free intervention that can enhance model capacity and strengthen reasoning, particularly on complex, context-constrained tasks. Evaluate your current initialization settings and consider relaxing any empirical constraints that might be limiting these benefits.

Key insights

Small parameter initialization significantly improves large language model pretraining and reasoning capacity by guiding distinct developmental trajectories.

Principles

Reducing initialization scale consistently improves LLM pretraining.
Small initialization drives parameters from low-complexity to rich representations.
Critical initialization balances reasoning and training performance.

Method

The γ-initialization rule proposes exposing initialization range as an explicit knob and using small initialization by default for improved pretraining and reasoning.
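
A minimal sketch of what "initialization range as an explicit knob" could look like in code. This summary does not give the rule's exact formula, so the parametrization here is an assumption: weights are drawn from a zero-mean Gaussian with standard deviation fan_in ** (-gamma), where gamma = 0.5 matches the common 1/sqrt(fan_in) scaling and larger gamma means smaller initialization. The function names are hypothetical, and gamma is deliberately a required argument because the summary does not state a recommended value.

```python
import math
import random


def gamma_init_std(fan_in: int, gamma: float) -> float:
    """Return the init standard deviation fan_in ** (-gamma).

    ASSUMPTION: this functional form is illustrative; the summarized
    study may define its rule differently.
    """
    if fan_in <= 0:
        raise ValueError("fan_in must be positive")
    return fan_in ** (-gamma)


def init_weight(fan_in: int, fan_out: int, gamma: float, seed: int = 0):
    """Sample a fan_out x fan_in weight matrix from N(0, std^2)."""
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    std = gamma_init_std(fan_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


if __name__ == "__main__":
    # Compare the conventional scale with a smaller one at width 1024.
    for gamma in (0.5, 1.0):
        print(f"gamma={gamma}: std={gamma_init_std(1024, gamma):.6f}")
```

Making gamma a single named parameter lets it be swept like any other hyperparameter instead of being fixed inside a framework default.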

In practice

Implement γ-initialization for cost-free pretraining improvements.
Relax empirical settings that restrain small initialization advantages.
Measure reasoning gains on non-trivial, context-constrained predictions, where the improvements concentrate.
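
One way to check where gains land is to compare per-token losses from a baseline run and a small-initialization run on the same evaluation tokens. The sketch below is an assumption about how such an analysis might be set up, not the study's procedure: it uses baseline loss as a rough proxy for how non-trivial a token is, and the bucket edges and the function name are hypothetical.

```python
from statistics import mean


def gain_by_difficulty(baseline_losses, small_init_losses, edges=(0.5, 2.0)):
    """Mean per-token loss reduction, bucketed by baseline loss.

    ASSUMPTION: baseline loss stands in for token difficulty, and the
    default edges are arbitrary illustration values.
    Positive values mean the small-initialization run has lower loss.
    """
    if len(baseline_losses) != len(small_init_losses):
        raise ValueError("both runs must score the same tokens")
    buckets = {"easy": [], "medium": [], "hard": []}
    for base, small in zip(baseline_losses, small_init_losses):
        if base < edges[0]:
            name = "easy"
        elif base < edges[1]:
            name = "medium"
        else:
            name = "hard"
        buckets[name].append(base - small)
    # Empty buckets report None rather than a misleading zero.
    return {name: (mean(vals) if vals else None) for name, vals in buckets.items()}
```

If gains are concentrated on non-trivial predictions, the easy bucket should show a reduction near zero while the other buckets show larger ones.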

Topics

Large Language Models
Parameter Initialization
LLM Pretraining
Reasoning Tasks
Model Capacity
γ-initialization

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.