RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories
Summary
RegMix-D extends RegMix to select data mixtures dynamically during large language model (LLM) pretraining. Instead of using only endpoint losses, it uses the full loss trajectories of small-scale proxy runs to predict the best data mixture at each of several training stages. The method has two deployment modes: an offline variant that generates the complete mixture schedule before target training begins, and an online variant that adapts the mixture during training from the observed loss, at a reported overhead of 0.37%. In experiments on 25B tokens of the Pile with a 1B-parameter target model, RegMix-D outperforms both RegMix and DoReMi across 13 downstream tasks. It does so even with only 128 proxy models, which amounts to 25% of RegMix's proxy compute budget. Ablation studies indicate that the method is robust to its hyperparameter choices.
Key takeaway
For machine learning engineers tuning LLM pretraining, RegMix-D is an alternative to choosing one static data mixture up front. Adjusting data proportions as training progresses yields lower validation loss and better downstream performance than RegMix and DoReMi in the reported experiments, while using 25% of RegMix's proxy compute budget. The online variant is worth considering first, since it adapts to the target model's observed loss at a reported overhead of 0.37%.
Key insights
RegMix-D dynamically optimizes LLM data mixtures by using full proxy loss trajectories, outperforming static methods with less compute.
Principles
- Optimal data mixtures evolve during LLM pretraining.
- Proxy loss trajectories inform dynamic mixture prediction.
- Online adaptation improves accuracy via observed losses.
Method
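RegMix-D trains a regression model on proxy loss trajectories to predict the next-step loss for a given data mixture. The model can be used offline, to pre-compute the full mixture schedule, or online. In online mode, losses observed on the target model are first power-law corrected to account for the scale gap between proxy and target, and the corrected loss is then used to query the model.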
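A minimal sketch of one online adaptation step may make the order of operations concrete: correct the observed target loss, then query a next-loss predictor over candidate mixtures. Everything named here is an assumption for illustration. The summary does not give the form of the power-law correction, the proxy model size, or the regression model, so `power_law_correct` uses an assumed `params ** -beta` scaling, the 1M-parameter proxy size is hypothetical, and `toy_predictor` is a hand-written stand-in for the fitted regression model.

```python
from typing import Callable, Sequence, Tuple

Mixture = Tuple[float, ...]  # per-domain sampling weights that sum to 1


def power_law_correct(observed_loss: float, target_params: float,
                      proxy_params: float, beta: float = 0.05) -> float:
    """Map a target-scale loss onto the proxy scale.

    ASSUMPTION: loss scales as params ** -beta, so the larger target
    model's loss is inflated by (target_params / proxy_params) ** beta.
    The paper's actual correction may differ.
    """
    return observed_loss * (target_params / proxy_params) ** beta


def pick_next_mixture(candidates: Sequence[Mixture],
                      predict_next_loss: Callable[[Mixture, float], float],
                      proxy_scale_loss: float) -> Mixture:
    """Return the candidate mixture with the lowest predicted next loss."""
    return min(candidates,
               key=lambda m: predict_next_loss(m, proxy_scale_loss))


def toy_predictor(mixture: Mixture, current_loss: float) -> float:
    # HYPOTHETICAL stand-in for the regression model fitted on proxy
    # trajectories; the coefficients are made up for illustration.
    web, code = mixture
    return 0.9 * current_loss - 0.3 * web - 0.1 * code


# Three hypothetical two-domain candidate mixtures (web, code).
candidates = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2)]

# Observed loss on a 1B-parameter target, mapped to a hypothetical
# 1M-parameter proxy scale before querying the predictor.
corrected = power_law_correct(2.50, target_params=1e9, proxy_params=1e6)
best = pick_next_mixture(candidates, toy_predictor, corrected)
```

In the offline variant there is no observed target loss, so the same query would presumably be driven by the model's own predicted loss from the previous stage rather than by a corrected observation.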
In practice
- Use 128 proxy models for efficient dynamic mixing.
- Set N=5 switch points for optimal mixture changes.
- Apply β=0.05 for cross-scale loss correction.
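The three settings above could be collected in a small configuration object, sketched below. The class name, field names, and the `switch_token_counts` helper are hypothetical. The even spacing of switch points over the 25B-token budget is also an assumption, since the summary does not say where the switch points fall.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DynamicMixConfig:
    """Hypothetical container for the settings reported in the summary."""
    num_proxy_models: int = 128            # proxy runs used to fit the regressor
    num_switch_points: int = 5             # N: times the mixture may change
    beta: float = 0.05                     # cross-scale loss correction exponent
    total_tokens: int = 25_000_000_000     # target training budget (25B tokens)


def switch_token_counts(cfg: DynamicMixConfig) -> List[int]:
    """Token counts at which the mixture switches.

    ASSUMPTION: the N switch points split training into N + 1 equal
    segments. The paper may place them differently.
    """
    segments = cfg.num_switch_points + 1
    return [cfg.total_tokens * k // segments for k in range(1, segments)]


cfg = DynamicMixConfig()
boundaries = switch_token_counts(cfg)
```

Under this even-spacing assumption the third of the five boundaries falls at the halfway point, 12.5B tokens.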
Topics
- Dynamic Data Mixing
- LLM Pretraining
- RegMix-D
- Proxy Training
- Compute Efficiency
- Regression Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.