Pre-training Scaling Stopped Being the Whole Recipe
Summary
Public training reports for SmolLM3, Kimi K2, DeepSeek V3, and NVIDIA's Nemotron 3 Super indicate that open-model training has shifted from the "single-axis scaling" of 2020-2025 to a multi-faceted recipe in 2026. The shift involves four changes: overtraining models well past the Chinchilla-optimal token budget, adopting Warmup-Stable-Decay (WSD) learning rate schedules, using synthetic data across all training stages, and rebalancing compute to prioritize inference over reinforcement learning (RL) training. SmolLM3, a 3B dense model, was trained on 11.2 trillion tokens (about 185x the Chinchilla recommendation) to lower on-device inference cost at the price of a more expensive training run. RL remains crucial but scales less efficiently than inference, which is why labs are rebalancing their budgets. The article frames this as a return to "the age of research" in AI development, in which careful design decisions matter more than brute-force scaling.
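The overtraining figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the common rule of thumb of about 20 training tokens per parameter for a Chinchilla-style optimum; that constant and the function name are illustrative assumptions, not values taken from the training reports.

```python
# Rule-of-thumb Chinchilla ratio: ~20 training tokens per parameter.
# This constant is an approximation assumed for illustration.
CHINCHILLA_TOKENS_PER_PARAM = 20


def overtraining_factor(n_params: float, n_tokens: float,
                        tokens_per_param: float = CHINCHILLA_TOKENS_PER_PARAM) -> float:
    """Return how many times the token budget exceeds the rule-of-thumb optimum."""
    optimal_tokens = n_params * tokens_per_param
    return n_tokens / optimal_tokens


# SmolLM3 figures as quoted in the summary: 3B parameters, 11.2T tokens.
factor = overtraining_factor(3e9, 11.2e12)
print(f"{factor:.0f}x the rule-of-thumb token budget")
```

With these rounded inputs the factor comes out near 187, in line with the roughly 185x quoted above; the small gap depends on the exact parameter count and on the tokens-per-parameter constant used.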
Key takeaway
Research scientists and CTOs evaluating model development strategies should recognize that the era of simple scaling is over. The focus shifts from merely increasing model size or pre-training tokens to optimizing for inference cost, using synthetic data strategically, and designing learning rate schedules with care. The trade-off between training expense and deployment efficiency is now central, which makes detailed open training reports essential for understanding what a model can and cannot do in practice.
Key insights
AI model training is shifting from simple scaling to a multi-faceted approach optimizing for inference and data quality.
Principles
- Overtraining small models optimizes inference costs.
- Learning rate schedules impact fine-tuning and data curricula.
- Synthetic data is a standard, multi-stage training ingredient.
Method
Modern training involves overtraining past the Chinchilla-optimal token budget, using Warmup-Stable-Decay (WSD) learning rate schedules, integrating synthetic data throughout, and rebalancing compute towards inference.
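A minimal sketch of what a WSD schedule looks like, assuming a linear warmup, a constant plateau, and a linear decay. The phase fractions, the decay shape, and the function name `wsd_lr` are hypothetical choices for illustration; the cited training reports may use different shapes and proportions.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay.

    The fractions and the linear decay shape are illustrative assumptions.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    decay_steps = max(1, round(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


# Sample the schedule at a few points of a hypothetical 1000-step run.
samples = {s: wsd_lr(s, 1000, peak_lr=3e-4) for s in (0, 9, 500, 950, 1000)}
print(samples)
```

Because the plateau holds a constant learning rate, a run can be extended or a new data stage started from any checkpoint on the plateau, with the decay phase applied only at the end. That property is why the schedule pairs naturally with multi-stage data curricula.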
In practice
- Prioritize inference cost over training compute for deployment.
- Experiment with WSD schedules for multi-stage data curricula.
- Utilize synthetic data for instruction tuning and evaluation.
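The first point can be made concrete with a rough lifetime-compute comparison. The sketch below uses the widely cited approximations of about 6·N·D FLOPs to train a dense model with N parameters on D tokens and about 2·N FLOPs per token at inference. The 13B comparison model is hypothetical, and the sketch assumes both models meet the same quality bar, which it does not verify.

```python
# Standard approximations for dense transformers (assumptions, not from the reports):
#   training  ~ 6 * N * D FLOPs
#   inference ~ 2 * N FLOPs per generated token


def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(n_params: float) -> float:
    return 2.0 * n_params


def breakeven_tokens(small_params: float, small_tokens: float,
                     large_params: float, large_tokens: float) -> float:
    """Served tokens after which the smaller, overtrained model has used less total compute."""
    extra_training = (train_flops(small_params, small_tokens)
                      - train_flops(large_params, large_tokens))
    saving_per_token = (inference_flops_per_token(large_params)
                        - inference_flops_per_token(small_params))
    if saving_per_token <= 0:
        raise ValueError("large model must have more parameters than small model")
    return max(0.0, extra_training / saving_per_token)


# 3B / 11.2T tokens from the summary vs. a HYPOTHETICAL 13B model
# trained at ~20 tokens per parameter.
tokens = breakeven_tokens(3e9, 11.2e12, 13e9, 20 * 13e9)
print(f"break-even after ~{tokens:.2e} served tokens")
```

Under these assumptions the break-even lands near 9 trillion served tokens. Below that volume the larger model is cheaper overall; above it, the extra training spend on the small model is repaid by lower per-token inference cost.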
Topics
- Pre-training Scaling
- Chinchilla Scaling Laws
- Warmup-Stable-Decay
- Synthetic Data
- Reinforcement Learning
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Deep Learning on Medium.