Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study identifies "repetition mismatch" as a primary reason small-scale data mixture experiments fail to extrapolate to larger training budgets, particularly when high-quality data is limited and must be repeated. The mismatch arises because the repetition rate of a small, high-quality dataset changes as the training budget grows, which shifts the optimal data mixture in ways proxy experiments do not predict. The researchers propose a repetition-controlled subsampling procedure to mitigate this effect. In a two-source setting combining limited high-quality data with web crawl, a single repetition-controlled experiment using only 1/16 of the target tokens found a mixture within 0.05 of the optimum for a 757M-parameter model, whereas uncontrolled experiments were off by 0.75. With three data sources, two repetition-controlled horizons recovered the optimal mixture at the 757M scale. The findings suggest that repetition should be treated as a first-class variable in mixture optimization.
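To make the mismatch concrete, the sketch below computes how often a limited high-quality source is repeated at a proxy budget and at the target budget. It assumes a mixture weight is the fraction of training tokens drawn from a source, so that epochs = weight × budget / source size. The function name and all token counts are hypothetical and chosen only for illustration; none of them come from the study.

```python
def epochs_over_source(weight: float, budget_tokens: float, source_tokens: float) -> float:
    """Passes over a source when `weight` of the token budget is drawn from it."""
    return weight * budget_tokens / source_tokens


# Hypothetical numbers, not taken from the study.
HQ_TOKENS = 1e9                      # size of the limited high-quality source
TARGET_BUDGET = 16e9                 # target training budget, in tokens
PROXY_BUDGET = TARGET_BUDGET / 16    # proxy run at 1/16 of the target
WEIGHT = 0.5                         # same mixture weight at both scales

proxy_epochs = epochs_over_source(WEIGHT, PROXY_BUDGET, HQ_TOKENS)
target_epochs = epochs_over_source(WEIGHT, TARGET_BUDGET, HQ_TOKENS)
print(proxy_epochs, target_epochs)
```

With these numbers the proxy run sees only half of the high-quality source, while the target run passes over it eight times at the same weight, so the proxy is evaluated in a different repetition regime from the one it is meant to predict.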

Key takeaway

For machine learning engineers tuning pre-training data mixtures: small-scale experiments that ignore data repetition can mispredict the optimal mixture at the target scale. Use a repetition-controlled subsampling procedure in proxy runs instead. In the study, this approach found a near-optimal mixture with as little as 1/16 of the target token budget, which reduces the compute needed for mixture tuning.

Key insights

Repetition mismatch, where data repetition rates change with scale, causes small-scale data mixture experiments to fail to extrapolate; controlling repetition in the proxy experiment corrects this.

Principles

Method

Subsample the high-quality datasets in each proxy experiment so that their repetition rate matches the rate they will have at the target budget. This controls for repetition mismatch in data mixture experiments.
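One way such a procedure could look is sketched below. It is an assumption about the mechanics, not the study's implementation: the high-quality source is shrunk by the ratio of proxy budget to target budget, which keeps epochs = weight × budget / source size equal at both scales for any mixture weight. The function `repetition_controlled_subsample` and its parameters are hypothetical names.

```python
import random
from typing import List, Sequence


def repetition_controlled_subsample(
    doc_lengths: Sequence[int],
    proxy_budget: int,
    target_budget: int,
    seed: int = 0,
) -> List[int]:
    """Return indices of the documents to keep for the proxy run.

    Assumption: keeping a proxy_budget / target_budget share of the source's
    tokens makes its repetition rate in the proxy run match the target run.
    """
    if not 0 < proxy_budget <= target_budget:
        raise ValueError("expected 0 < proxy_budget <= target_budget")

    # Token quota for the subsample, scaled by the budget ratio.
    quota = sum(doc_lengths) * proxy_budget / target_budget

    # Shuffle document indices reproducibly, then keep documents until
    # the quota is reached (whole documents only, so this is approximate).
    order = list(range(len(doc_lengths)))
    random.Random(seed).shuffle(order)

    kept: List[int] = []
    kept_tokens = 0
    for i in order:
        if kept_tokens >= quota:
            break
        kept.append(i)
        kept_tokens += doc_lengths[i]
    return kept


# Toy usage: 1,600 documents of 100 tokens each, proxy at 1/16 of the target.
doc_lengths = [100] * 1600
kept = repetition_controlled_subsample(doc_lengths, proxy_budget=1_000, target_budget=16_000)
subsample_tokens = sum(doc_lengths[i] for i in kept)
print(len(kept), subsample_tokens)
```

Sampling whole documents keeps the subsample usable by an ordinary data loader, at the cost of overshooting the quota by at most one document. A real pipeline would likely apply this only to the limited sources and leave the web crawl untouched, since the crawl is not the source being repeated.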

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.