RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

RegMix-D is a method for dynamic data mixture selection during Large Language Model pretraining that extends the existing RegMix approach. Instead of using only endpoint losses, it uses the full loss trajectories of small-scale proxy runs to predict optimal data mixtures across multiple training stages. The method has two deployment modes: an offline variant that generates a complete mixture schedule before target training begins, and an online variant that adapts the mixture during training based on the observed loss, incurring only 0.37% overhead. In experiments on 25B tokens of the Pile dataset with a 1B parameter target model, RegMix-D consistently outperforms both RegMix and DoReMi across 13 downstream tasks. It does so even with only 128 proxy models, which amounts to 25% of RegMix's proxy compute budget. Ablation studies indicate that the method is robust to a range of hyperparameter choices.

Key takeaway

For Machine Learning Engineers optimizing Large Language Model pretraining, RegMix-D is an alternative to static data mixture selection. By adjusting data proportions throughout training, it is reported to achieve consistently lower validation loss and better downstream performance than RegMix and DoReMi. It is worth evaluating, especially the online variant, which adds only 0.37% overhead. The method also reaches these results with 128 proxy models, or 25% of RegMix's proxy compute budget.

Key insights

RegMix-D dynamically optimizes LLM data mixtures using full proxy loss trajectories, outperforming static methods such as RegMix while using less proxy compute.

Method

RegMix-D trains a regression model on proxy loss trajectories to predict next-step loss. The model can be used in two ways: offline, to fix a complete mixture schedule before target training, or online, where each observed target loss is power-law corrected before it is used to query the model.
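
The two modes described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the regressor is a plain least-squares model over the features [w, loss * w], the proxy trajectories are synthetic and noise-free, the power-law correction is taken to have the form a * loss ** b with a and b supplied by the caller, and all function names are hypothetical. The summary does not specify the paper's actual regression model, features, or correction.

```python
import numpy as np


def features(prev_loss, mixtures):
    """Build features [w, prev_loss * w] for each candidate mixture w (assumed form)."""
    mixtures = np.atleast_2d(mixtures)
    prev = np.broadcast_to(
        np.asarray(prev_loss, dtype=float).reshape(-1, 1), (mixtures.shape[0], 1)
    )
    return np.hstack([mixtures, prev * mixtures])


def fit_next_loss_model(prev_loss, mixtures, next_loss):
    """Least-squares fit of next-step loss on (previous loss, mixture) pairs."""
    coef, *_ = np.linalg.lstsq(features(prev_loss, mixtures), next_loss, rcond=None)
    return coef


def predict_next_loss(coef, prev_loss, mixtures):
    return features(prev_loss, mixtures) @ coef


def power_law_correct(target_loss, a, b):
    """Hypothetical correction mapping a target-model loss to the proxy scale."""
    return a * target_loss ** b


def choose_mixture(coef, loss, candidates):
    """Return (index, predicted loss) of the candidate with the lowest predicted loss."""
    preds = predict_next_loss(coef, loss, candidates)
    best = int(np.argmin(preds))
    return best, float(preds[best])


def offline_schedule(coef, initial_loss, candidates, n_stages):
    """Offline mode: roll the model forward on its own predictions."""
    schedule, loss = [], initial_loss
    for _ in range(n_stages):
        best, loss = choose_mixture(coef, loss, candidates)
        schedule.append(best)
    return schedule


def online_step(coef, observed_target_loss, candidates, a, b):
    """Online mode: correct the observed target loss, then query the model."""
    proxy_scale_loss = power_law_correct(observed_target_loss, a, b)
    best, _ = choose_mixture(coef, proxy_scale_loss, candidates)
    return best


# Synthetic stand-in for proxy trajectories over 3 domains (not real data).
rng = np.random.default_rng(0)
C = np.array([0.02, 0.08, 0.01])  # assumed per-domain additive term
R = np.array([0.05, 0.10, 0.02])  # assumed per-domain loss-reduction rate
mix = rng.dirichlet(np.ones(3), size=200)
prev = rng.uniform(1.0, 4.0, size=200)
nxt = mix @ C + prev * (1.0 - mix @ R)

coef = fit_next_loss_model(prev, mix, nxt)

# Candidates: the three single-domain mixtures plus the uniform mixture.
candidates = np.vstack([np.eye(3), np.full((1, 3), 1 / 3)])
schedule = offline_schedule(coef, 3.0, candidates, n_stages=25)
print("offline schedule (candidate index per stage):", schedule)
print("online choice at observed loss 1.0:", online_step(coef, 1.0, candidates, 1.0, 1.0))
```

In this toy setup the preferred mixture changes as the loss falls, which is the behavior a static mixture cannot capture. The interaction features are what allow the chosen mixture to depend on the current loss. With a purely additive model the ranking of candidates would be identical at every stage, and the online correction would have no effect.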

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.