RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

RegMix-D extends RegMix to select data mixtures dynamically during large language model (LLM) pretraining. Instead of using only endpoint losses, it uses the full loss trajectories from small-scale proxy runs to predict the best data mixture at each of several training stages. The method has two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training from the observed loss, at a reported overhead of only 0.37%. In experiments on 25B tokens of the Pile with a 1B-parameter target model, RegMix-D consistently outperforms both RegMix and DoReMi across 13 downstream tasks. It achieves these results even with only 128 proxy models, which is 25% of RegMix's proxy compute budget. Ablation studies indicate that it is robust to a range of hyperparameter choices.

Key takeaway

For machine learning engineers optimizing LLM pretraining, RegMix-D is an alternative to static data mixture selection. By adjusting data proportions throughout training, it is reported to reach consistently lower validation loss and better downstream performance than RegMix and DoReMi. It is worth evaluating, especially the online variant, because it achieves these gains with 25% of RegMix's proxy compute budget.

Key insights

RegMix-D dynamically optimizes LLM data mixtures by using full proxy loss trajectories, outperforming static methods with less compute.

Principles

Optimal data mixtures evolve during LLM pretraining.
Proxy loss trajectories inform dynamic mixture prediction.
Online adaptation improves accuracy via observed losses.

Method

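RegMix-D trains a regression model on proxy loss trajectories to predict the next-step loss. It runs in one of two modes: offline, where the full mixture schedule is fixed before target training, or online, where observed target losses are corrected with a power law for the proxy-to-target scale gap and then fed to the regression model.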
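
The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the regressor here is plain least squares over mixture weights and mixture-by-loss interaction terms, each proxy run is assumed to use one fixed mixture, candidate mixtures are random Dirichlet samples, and the proxy data is synthetic. All function names are hypothetical.

```python
import numpy as np

def features(mixture, current_loss):
    # Mixture weights plus mixture-by-loss interactions. Because the weights
    # sum to 1, these columns already cover an intercept and a loss term.
    mixture = np.asarray(mixture, dtype=float)
    return np.concatenate([mixture, mixture * current_loss])

def fit_next_loss_model(mixtures, trajectories):
    """Least-squares fit of (mixture, current loss) -> next-checkpoint loss.

    mixtures:     (num_proxies, num_domains), one fixed mixture per proxy run.
    trajectories: (num_proxies, num_checkpoints), loss at each checkpoint.
    """
    X, y = [], []
    for mix, traj in zip(mixtures, trajectories):
        for t in range(len(traj) - 1):
            X.append(features(mix, traj[t]))
            y.append(traj[t + 1])
    coef, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return coef

def predict_next_loss(coef, mixture, current_loss):
    return float(features(mixture, current_loss) @ coef)

def choose_mixture(coef, candidates, current_loss):
    # Pick the candidate mixture with the lowest predicted next loss.
    preds = [predict_next_loss(coef, c, current_loss) for c in candidates]
    best = int(np.argmin(preds))
    return candidates[best], preds[best]

def offline_schedule(coef, candidates, initial_loss, num_stages):
    # Offline mode: roll the model's own predictions forward, so the whole
    # schedule exists before target training starts.
    schedule, loss = [], initial_loss
    for _ in range(num_stages):
        mixture, loss = choose_mixture(coef, candidates, loss)
        schedule.append(mixture)
    return schedule

# Synthetic stand-in for proxy runs (made-up numbers, not data from the paper).
rng = np.random.default_rng(0)
proxy_mixtures = rng.dirichlet(np.ones(3), size=32)
a, b = np.array([0.2, 0.5, 0.3]), np.array([0.9, 0.7, 0.8])
proxy_trajectories = np.empty((32, 5))
proxy_trajectories[:, 0] = 5.0
for t in range(4):
    proxy_trajectories[:, t + 1] = (
        proxy_mixtures @ a + proxy_trajectories[:, t] * (proxy_mixtures @ b)
    )

coef = fit_next_loss_model(proxy_mixtures, proxy_trajectories)
candidates = rng.dirichlet(np.ones(3), size=256)
schedule = offline_schedule(coef, candidates, initial_loss=5.0, num_stages=5)
```

In an online setting, the same `choose_mixture` call would be made at each switch point with the observed (and scale-corrected) target loss in place of the rolled-forward prediction. The interaction terms matter in this sketch: with a purely additive loss term, the best candidate would be the same at every stage and the schedule would collapse to a static mixture.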

In practice

Use 128 proxy models for efficient dynamic mixing.
Set N=5 switch points, the points in training at which the mixture changes.
Apply β=0.05 for cross-scale loss correction.
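
These settings could be collected in a small configuration object, sketched below. The three hyperparameter values and the 25B-token budget come from the summary above; everything else is an assumption made for illustration. In particular, the summary does not say how switch points are spaced (even spacing is assumed here), and it does not give the functional form of the power-law correction (a multiplicative power law in the target-to-proxy parameter ratio is assumed here). The names `RegMixDConfig`, `switch_points`, and `to_proxy_scale` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RegMixDConfig:
    num_proxies: int = 128              # proxy models, per the summary
    num_switch_points: int = 5          # N, per the summary
    beta: float = 0.05                  # cross-scale correction exponent, per the summary
    total_tokens: int = 25_000_000_000  # 25B-token budget, per the summary

def switch_points(cfg):
    # ASSUMPTION: switch points are evenly spaced over the token budget.
    step = cfg.total_tokens // (cfg.num_switch_points + 1)
    return [step * k for k in range(1, cfg.num_switch_points + 1)]

def to_proxy_scale(target_loss, target_params, proxy_params, beta):
    # ASSUMPTION: the correction is a multiplicative power law in the
    # parameter ratio; the paper's actual form may differ.
    return target_loss * (target_params / proxy_params) ** beta

cfg = RegMixDConfig()
points = switch_points(cfg)
```

Under this assumed form, a larger target model's observed loss is scaled up before it is passed to a regressor trained on smaller proxy runs, and the correction vanishes when the two model sizes match.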

Topics

Dynamic Data Mixing
LLM Pretraining
RegMix-D
Proxy Training
Compute Efficiency
Regression Models

Code references

EleutherAI/lm-evaluation-harness

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.