Pre-training Scaling Stopped Being the Whole Recipe

2026-04-23 · Source: Deep Learning on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

Public training reports for SmolLM3, Kimi K2, DeepSeek V3, and NVIDIA's Nemotron 3 Super indicate that open model training has shifted from the "single-axis scaling" of 2020-2025 to a multi-faceted recipe in 2026. The new recipe involves four changes: overtraining models well past the Chinchilla-optimal token count, adopting Warmup-Stable-Decay (WSD) learning rate schedules, using synthetic data across all training stages, and rebalancing compute to prioritize inference over reinforcement learning (RL) training. SmolLM3, a 3B dense model, was trained on 11.2 trillion tokens (roughly 185x the Chinchilla recommendation) to reduce on-device inference cost, at the price of a more expensive training run. RL remains crucial but scales less efficiently than inference, which is why labs are rebalancing their budgets. The article frames this evolution as a return to "the age of research" in AI development, in which careful design decisions matter more than brute-force scaling.
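The overtraining multiple quoted above can be checked with a few lines of arithmetic. The sketch below assumes the common rule of thumb of about 20 training tokens per parameter for a Chinchilla-optimal run; that constant and the function name are illustrative assumptions, not figures taken from the article.

```python
# Rule-of-thumb assumption: ~20 training tokens per parameter is
# "Chinchilla-optimal". This constant is NOT taken from the article.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def overtraining_ratio(params: float, tokens: float,
                       tokens_per_param: float = CHINCHILLA_TOKENS_PER_PARAM) -> float:
    """Return how many times the token budget exceeds the Chinchilla-optimal one."""
    optimal_tokens = params * tokens_per_param
    return tokens / optimal_tokens


# SmolLM3 figures as quoted in the summary: 3B parameters, 11.2T tokens.
ratio = overtraining_ratio(params=3e9, tokens=11.2e12)
print(f"Overtraining multiple: about {ratio:.0f}x")
```

With a round 3B parameters and 20 tokens per parameter this gives about 187x, in line with the roughly 185x quoted; the exact multiple depends on the precise parameter count and the tokens-per-parameter constant assumed.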

Key takeaway

Research scientists and CTOs evaluating model development strategies should recognize that the era of simple scaling is over. The focus shifts from merely increasing model size or pre-training tokens to optimizing for inference cost, using synthetic data strategically, and designing learning rate schedules carefully. The trade-off between training expense and deployment efficiency is now central, which makes detailed open training reports essential for understanding what a model can and cannot do in practice.

Key insights

AI model training is shifting from simple scaling to a multi-faceted approach optimizing for inference and data quality.

Principles

Overtraining small models optimizes inference costs.
Learning rate schedules impact fine-tuning and data curricula.
Synthetic data is a standard, multi-stage training ingredient.

Method

Modern training involves four steps: training past the Chinchilla-optimal token count, using Warmup-Stable-Decay (WSD) learning rate schedules, integrating synthetic data at every stage, and rebalancing compute towards inference.
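A Warmup-Stable-Decay schedule has three phases: a short ramp up to the peak learning rate, a long plateau at that peak, and a final decay. The following is a minimal sketch, assuming linear warmup and linear decay; the function name, parameter names, and example step counts are hypothetical, and the reports the article discusses may use other decay shapes.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under a Warmup-Stable-Decay schedule.

    Assumptions (not from the article): linear warmup, linear decay,
    and a decay phase occupying the last `decay_steps` steps.
    """
    # Phase 1: linear warmup from near zero up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Phase 2: hold the learning rate constant at peak_lr.
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr

    # Phase 3: linear decay from peak_lr down to min_lr.
    progress = min((step - decay_start) / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


# Hypothetical run: 1,000 steps, 100 warmup steps, 200 decay steps.
schedule = [wsd_lr(s, 1000, 1e-3, 100, 200) for s in range(1000)]
```

Because the rate is flat during the stable phase, a run can be extended or branched from any plateau checkpoint and the decay applied afterwards, which is one reason this shape pairs naturally with multi-stage data curricula.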

In practice

Prioritize inference cost over training compute for deployment.
Experiment with WSD schedules for multi-stage data curricula.
Utilize synthetic data for instruction tuning and evaluation.
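The first point can be made concrete with standard back-of-the-envelope approximations: training a dense model costs about 6 FLOPs per parameter per token, and generating one token costs about 2 FLOPs per parameter. The sketch below uses those approximations, plus an assumed 20 tokens per parameter for a Chinchilla-optimal run, to ask how large a compute-optimal model would be on the same training budget as the 3B, 11.2T-token run. All function names are illustrative, and the comparison covers compute cost only, not model quality.

```python
import math


def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense model: ~6 * N * D FLOPs."""
    return 6.0 * params * tokens


def inference_flops_per_token(params: float) -> float:
    """Approximate forward-pass compute per generated token: ~2 * N FLOPs."""
    return 2.0 * params


def chinchilla_optimal_params(budget_flops: float,
                              tokens_per_param: float = 20.0) -> float:
    """Model size N that spends `budget_flops` at D = tokens_per_param * N.

    Solves 6 * N * (tokens_per_param * N) = budget_flops for N.
    The 20 tokens-per-parameter default is a rule-of-thumb assumption.
    """
    return math.sqrt(budget_flops / (6.0 * tokens_per_param))


# Training budget implied by the figures in the summary (3B params, 11.2T tokens).
budget = train_flops(3e9, 11.2e12)

# Size of a compute-optimal model trained on the same budget.
optimal_n = chinchilla_optimal_params(budget)

# Per-token inference cost of that model relative to the overtrained 3B model.
cost_ratio = inference_flops_per_token(optimal_n) / inference_flops_per_token(3e9)

print(f"Training budget: {budget:.3e} FLOPs")
print(f"Compute-optimal size on that budget: {optimal_n / 1e9:.0f}B parameters")
print(f"Per-token inference cost vs. the 3B model: {cost_ratio:.1f}x")
```

Under these assumptions the same training budget would buy a compute-optimal model of roughly 41B parameters, which costs about 13.7 times more per generated token to serve. That gap is the deployment saving the overtrained small model is trading training compute for.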

Topics

Pre-training Scaling
Chinchilla Scaling Laws
Warmup-Stable-Decay
Synthetic Data
Reinforcement Learning

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Deep Learning on Medium.