Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them
Summary
A new study identifies "repetition mismatch" as a primary reason why small-scale data mixture experiments fail to extrapolate to larger training budgets, particularly when high-quality data is limited and must be repeated. The mismatch arises because the number of times a small, high-quality dataset is repeated grows with the training budget, which shifts the optimal data mixture in ways that proxy experiments do not predict. The researchers propose a repetition-controlled subsampling procedure to mitigate this effect. In a two-source scenario combining limited high-quality data with web crawl, a single repetition-controlled experiment using only 1/16 of the target tokens found a mixture within 0.05 of the optimum for a 757M-parameter model, compared with an error of 0.75 for uncontrolled experiments. With three data sources, two repetition-controlled horizons recovered the optimal mixture at the 757M scale. The findings indicate that repetition dynamics are crucial for mixture optimization and that repetition should be treated as a first-class variable.
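To make the mismatch concrete, here is a small arithmetic sketch. It assumes that the number of passes over a source is simply the tokens drawn from it divided by its size; the function name and all numbers are illustrative and do not come from the study.

```python
def epochs_over_source(mix_weight: float, budget_tokens: float, source_tokens: float) -> float:
    """Passes over a source = tokens drawn from it / size of the source."""
    return mix_weight * budget_tokens / source_tokens


# Illustrative numbers only (assumptions, not values from the study).
target_budget = 16e9               # tokens in the full-scale run
proxy_budget = target_budget / 16  # tokens in the small proxy run
hq_tokens = 1e9                    # size of the limited high-quality source
hq_weight = 0.5                    # share of training tokens drawn from it

target_epochs = epochs_over_source(hq_weight, target_budget, hq_tokens)
proxy_epochs = epochs_over_source(hq_weight, proxy_budget, hq_tokens)

print(f"target run: {target_epochs} passes over the high-quality source")
print(f"proxy run:  {proxy_epochs} passes over the high-quality source")
```

With the same mixture weight, the proxy run never finishes one pass over the high-quality source while the target run repeats it several times, so the two runs are evaluating that source under different conditions.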
Key takeaway
For machine learning engineers tuning pre-training data mixtures, small-scale experiments that ignore data repetition can point to the wrong mixture at the target scale. Use a repetition-controlled subsampling procedure so that proxy runs predict the optimal mixture more reliably. In the reported two-source setting, this reached near-optimal mixtures with as little as 1/16 of the target token budget, which reduces the compute spent on mixture tuning and shortens model development.
Key insights
Repetition mismatch, where the repetition rate of limited data changes with scale, causes small-scale data mixture experiments to extrapolate poorly; controlling for repetition largely corrects this.
Principles
- Repetition dynamics dictate mixture experiment generalization.
- Treat data repetition as a first-class optimization variable.
Method
Implement a subsampling procedure that precisely matches the target repetition rate of high-quality datasets to control for repetition mismatch in data mixture experiments.
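One plausible reading of this method can be sketched as follows: shrink the limited source by the same factor as the token budget, so that the proxy run repeats it as often as the target run would. This is an assumption about the procedure, not the study's implementation; the function name, the document-level sampling, and the equal-length documents are all hypothetical simplifications.

```python
import math
import random


def repetition_controlled_subsample(docs, target_budget, proxy_budget, seed=0):
    """Keep a random fraction proxy_budget / target_budget of the documents.

    Assumes documents are of roughly equal length, so sampling documents
    approximates sampling tokens (a simplification for illustration).
    """
    fraction = proxy_budget / target_budget
    if not 0 < fraction <= 1:
        raise ValueError("proxy_budget must be positive and at most target_budget")
    keep = max(1, round(len(docs) * fraction))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(docs, keep)


# Illustrative setup (assumed numbers): 1600 documents of 1000 tokens each.
tokens_per_doc = 1000
hq_docs = list(range(1600))
target_budget = 16_000_000
proxy_budget = target_budget // 16
hq_weight = 0.5

hq_subset = repetition_controlled_subsample(hq_docs, target_budget, proxy_budget)

# Passes over the high-quality source in each run.
target_epochs = hq_weight * target_budget / (len(hq_docs) * tokens_per_doc)
proxy_epochs = hq_weight * proxy_budget / (len(hq_subset) * tokens_per_doc)

print(len(hq_subset), target_epochs, proxy_epochs)
assert math.isclose(target_epochs, proxy_epochs)
```

Because the source and the budget shrink by the same factor, any mixture weight tried in the proxy run implies the same repetition count it would imply at the target scale. A token-level sampler would be needed when document lengths vary widely.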
In practice
- Apply repetition control when tuning data mixtures.
- With repetition controlled, run proxy experiments at as little as 1/16 of the target tokens.
Topics
- Data Mixture Optimization
- Pre-training Data
- Repetition Mismatch
- Subsampling
- Large Language Models
- Training Efficiency
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.