RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories
Summary
RegMix-D extends RegMix to dynamic data mixture selection during Large Language Model pretraining. Instead of using only endpoint losses, it uses the full loss trajectories from small-scale proxy runs to train a regression model that predicts optimal data mixtures across multiple training stages. It comes in two variants: an offline variant that generates a mixture schedule before training, and an online variant that adjusts the mixture adaptively during training. In experiments with 25B tokens of the Pile dataset and a 1B parameter target model, RegMix-D consistently improved over RegMix and DoReMi across 13 downstream tasks. It did so with only 128 proxy models, using 25% of RegMix's proxy compute budget.
Key takeaway
For Machine Learning Engineers pretraining Large Language Models, RegMix-D is a data mixture optimization method worth evaluating. Its dynamic mixing approach uses full loss trajectories rather than endpoint losses, and in the reported experiments it improved downstream task performance over RegMix and DoReMi while using only 25% of RegMix's proxy compute budget.
Key insights
RegMix-D dynamically optimizes LLM data mixtures by analyzing full loss trajectories from proxy training runs.
Principles
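- Full loss trajectories from small-scale proxy runs, not just endpoint losses, inform mixture selection.
- Changing the data mixture across training stages can improve LLM pretraining over a single fixed mixture.
- Using whole trajectories lets fewer proxy runs suffice (128 proxy models in the reported experiments).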
Method
RegMix-D trains a regression model on full loss trajectories from proxy runs to predict optimal data mixtures at various training stages, supporting both pre-scheduled offline and adaptive online deployment.
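To make the idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes one independent linear regressor per training stage, fitted on the proxy loss recorded at that stage, and it picks the candidate mixture with the lowest predicted loss. The synthetic losses, the three domains, the stage count, and every function name are hypothetical. Treating stages independently is a simplification made here, since it ignores how earlier mixtures affect later losses, and this summary does not specify the regression model RegMix-D actually uses.

```python
import random

def sample_mixture(rng, k):
    """Draw domain weights uniformly from the simplex (Dirichlet(1, ..., 1))."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_stage_model(mixtures, losses):
    """Least-squares fit of loss ~ sum_k coef[k] * weight[k] via normal equations.

    No intercept is used: the weights sum to 1, so a constant is already
    representable by the coefficients.
    """
    k = len(mixtures[0])
    xtx = [[sum(w[i] * w[j] for w in mixtures) for j in range(k)] for i in range(k)]
    xty = [sum(w[i] * y for w, y in zip(mixtures, losses)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coef, mixture):
    """Predicted loss of a mixture under a fitted linear stage model."""
    return sum(c * w for c, w in zip(coef, mixture))

def build_schedule(proxy_mixtures, proxy_trajectories, candidates):
    """Return one predicted-best candidate mixture per training stage."""
    n_stages = len(proxy_trajectories[0])
    schedule = []
    for s in range(n_stages):
        stage_losses = [traj[s] for traj in proxy_trajectories]
        coef = fit_stage_model(proxy_mixtures, stage_losses)
        schedule.append(min(candidates, key=lambda w: predict(coef, w)))
    return schedule

# Synthetic stand-in for proxy runs (hypothetical numbers): each row holds the
# per-domain loss coefficients at one stage, so a different domain helps most
# at each stage.
TRUE_COEF = [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0], [2.0, 3.0, 1.0]]
rng = random.Random(0)
proxy_mixtures = [sample_mixture(rng, 3) for _ in range(128)]
proxy_trajectories = [
    [predict(c, w) + rng.gauss(0.0, 0.01) for c in TRUE_COEF]
    for w in proxy_mixtures
]
candidates = [sample_mixture(rng, 3) for _ in range(2000)]
schedule = build_schedule(proxy_mixtures, proxy_trajectories, candidates)
```

In this toy setup the selected mixture shifts toward a different domain at each stage, which is the behavior a static, endpoint-only mixture cannot express.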
In practice
- Generate complete mixture schedules offline.
- Adapt data mixtures dynamically during training.
- Reduce the proxy compute budget by 75% relative to RegMix.
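As an illustration of how an offline schedule might be consumed by a data loader, here is a small sketch under stated assumptions. The schedule format, the stage boundaries, the weights, and the domain names (`web`, `code`, `books`) are all hypothetical and are not taken from the paper. The sketch only shows looking up the mixture for the current training step and sampling source domains for a batch.

```python
import bisect
import random

# Hypothetical offline schedule: (first training step of the stage, domain weights).
SCHEDULE = [
    (0, {"web": 0.6, "code": 0.1, "books": 0.3}),
    (10_000, {"web": 0.4, "code": 0.3, "books": 0.3}),
    (20_000, {"web": 0.3, "code": 0.5, "books": 0.2}),
]
STAGE_STARTS = [start for start, _ in SCHEDULE]

def mixture_at(step):
    """Return the domain weights of the stage that contains `step`."""
    if step < 0:
        raise ValueError("step must be non-negative")
    # bisect_right finds the first stage starting after `step`;
    # the stage just before it is the active one.
    idx = bisect.bisect_right(STAGE_STARTS, step) - 1
    return SCHEDULE[idx][1]

def sample_domains(step, batch_size, rng):
    """Sample the source domain of each example in a batch for this step."""
    weights = mixture_at(step)
    return rng.choices(list(weights), weights=list(weights.values()), k=batch_size)

batch_domains = sample_domains(step=12_345, batch_size=8, rng=random.Random(0))
```

An online variant would replace the fixed `SCHEDULE` with weights recomputed during training; how RegMix-D performs that update is not described in this summary, so it is left out of the sketch.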
Topics
- RegMix-D
- Large Language Models
- Data Mixing
- LLM Pretraining
- Loss Trajectories
- Proxy Training
- The Pile Dataset
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.