RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories
Summary
RegMix-D is a method for dynamically selecting data mixtures during Large Language Model (LLM) pretraining, and it extends the existing RegMix approach. RegMix derives a single static mixture from the endpoint losses of small-scale proxy runs. RegMix-D instead uses the full loss trajectories of those proxy runs to predict the best data mixture at each stage of training. It supports two deployment modes: an offline variant that pre-generates a complete mixture schedule, and an online variant that adapts the mixture in real time based on observed loss. In experiments on 25 billion tokens of the Pile dataset with a 1-billion-parameter target model, RegMix-D outperformed both RegMix and DoReMi across 13 downstream tasks. It is also more proxy-efficient: it surpassed RegMix's performance using only 128 proxy models, which is 25% of RegMix's original proxy compute budget.
Key takeaway
Machine Learning Engineers optimizing Large Language Model pretraining should consider dynamic data mixture strategies such as RegMix-D. By using full loss trajectories from proxy runs rather than endpoint losses alone, the approach is reported to improve downstream task performance while lowering the cost of selecting a data mixture. The mixture schedule can be pre-generated offline or adapted online, and the reported results surpass RegMix with 75% less proxy compute.
Key insights
Leveraging full loss trajectories from proxy runs enables dynamic data mixture optimization for LLM pretraining.
Principles
- Model learning preferences evolve during training.
- Loss trajectories provide richer optimization data.
- Dynamic data mixing boosts downstream performance.
Method
RegMix-D trains a regression model on proxy run loss trajectories to predict optimal data mixtures for multiple training stages. It supports offline schedule generation or online adaptation.
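The offline mode described above can be illustrated with a minimal sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the proxy trajectories are simulated from a hypothetical ground truth (`STAGE_OPTIMA`), the regressor is a small quadratic least-squares model fitted separately per stage, and the "optimal" mixture is simply the lowest-predicted-loss candidate among randomly sampled mixtures. It also treats each proxy run as using one fixed mixture throughout, which ignores how earlier stages influence later ones.

```python
import random

rng = random.Random(0)
NUM_DOMAINS, NUM_PROXIES, NUM_CANDIDATES = 3, 64, 2000

# Hypothetical ground truth, used only to simulate proxy runs: at each
# training stage the loss is lowest near a different mixture.
STAGE_OPTIMA = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.1, 0.3, 0.6]]

def random_mixture():
    # Uniform sample from the simplex (Dirichlet with all parameters 1).
    g = [rng.gammavariate(1.0, 1.0) for _ in range(NUM_DOMAINS)]
    total = sum(g)
    return [x / total for x in g]

def simulated_trajectory(w):
    # One loss per stage: base level + squared distance to that stage's
    # optimum + small noise. Stands in for a real proxy training run.
    return [2.0 - 0.3 * s
            + sum((a - b) ** 2 for a, b in zip(w, opt))
            + rng.gauss(0.0, 0.001)
            for s, opt in enumerate(STAGE_OPTIMA)]

def features(w):
    # Linear and squared terms; no intercept is needed because the
    # weights sum to 1, so a constant is already in the span.
    return list(w) + [x * x for x in w]

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_least_squares(X, y, ridge=1e-8):
    # Normal equations: (X^T X + ridge * I) beta = X^T y.
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve(A, b)

def predict(beta, w):
    return sum(c * f for c, f in zip(beta, features(w)))

# "Proxy runs": one mixture and one full loss trajectory each.
proxy_mixtures = [random_mixture() for _ in range(NUM_PROXIES)]
proxy_losses = [simulated_trajectory(w) for w in proxy_mixtures]
X = [features(w) for w in proxy_mixtures]
candidates = [random_mixture() for _ in range(NUM_CANDIDATES)]

# Offline schedule: one predicted-best mixture per training stage.
schedule = []
for s in range(len(STAGE_OPTIMA)):
    beta = fit_least_squares(X, [traj[s] for traj in proxy_losses])
    schedule.append(min(candidates, key=lambda w: predict(beta, w)))

for s, w in enumerate(schedule):
    print(s, [round(x, 2) for x in w])
```

An endpoint-only method would fit a regressor to the last column of `proxy_losses` alone and produce a single mixture. Using every column is what yields one mixture per stage in this sketch.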
In practice
- Implement dynamic data mixture schedules.
- Use full loss trajectories for mixture optimization.
- Reduce proxy compute for data mixture selection.
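As a rough illustration of the first point, a pre-generated schedule could be consumed by a data loader along these lines. The domain names, stage boundaries, weights, and the helpers `mixture_at` and `sample_domains` are hypothetical and do not come from the source; the sketch only shows the idea of looking up the active mixture by training step and sampling domains in proportion to it.

```python
import random
from bisect import bisect_right

# Hypothetical schedule: (first training step of the stage, domain weights).
SCHEDULE = [
    (0, {"web": 0.6, "code": 0.3, "books": 0.1}),
    (1000, {"web": 0.3, "code": 0.4, "books": 0.3}),
    (2000, {"web": 0.1, "code": 0.3, "books": 0.6}),
]

def mixture_at(step, schedule=SCHEDULE):
    # Find the last stage whose start step is <= the current step.
    starts = [start for start, _ in schedule]
    idx = bisect_right(starts, step) - 1
    return schedule[max(idx, 0)][1]

def sample_domains(step, batch_size, rng):
    # Draw the source domain of each example in proportion to the
    # mixture that is active at this training step.
    weights = mixture_at(step)
    domains = list(weights)
    return rng.choices(domains,
                       weights=[weights[d] for d in domains],
                       k=batch_size)

rng = random.Random(0)
early = sample_domains(step=10, batch_size=10000, rng=rng)
late = sample_domains(step=2500, batch_size=10000, rng=rng)
print({d: early.count(d) for d in ("web", "code", "books")})
print({d: late.count(d) for d in ("web", "code", "books")})
```

An online variant would replace the fixed `SCHEDULE` lookup with a mixture recomputed from the loss observed so far. How RegMix-D performs that update is not specified in this summary.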
Topics
- Large Language Models
- Data Mixture Optimization
- Pretraining Strategies
- Proxy Training
- Dynamic Data Mixing
- Loss Trajectories
- RegMix-D
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.