RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

2026-06-17 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

RegMix-D is a method for dynamically selecting data mixtures during Large Language Model (LLM) pretraining, extending the existing RegMix approach. RegMix derives a single static mixture from the endpoint losses of small-scale proxy runs. RegMix-D instead uses the full loss trajectories of those proxy runs to predict the best mixture at each stage of training. It supports two deployment modes: an offline variant that pre-generates a complete mixture schedule, and an online variant that adapts the mixture in real time based on observed loss. In experiments on 25 billion tokens of the Pile dataset with a 1 billion parameter target model, RegMix-D consistently outperformed both RegMix and DoReMi across 13 downstream tasks. It is also more proxy-efficient: it surpasses RegMix's performance using only 128 proxy models, which is 25% of RegMix's original proxy compute budget.

Key takeaway

Machine learning engineers optimizing LLM pretraining should consider dynamic data mixture strategies like RegMix-D. By using full loss trajectories from proxy runs rather than endpoint losses alone, this approach improves downstream task performance and lowers the cost of data mixture selection. You can either pre-generate a complete mixture schedule or adapt it online, potentially achieving better results than RegMix with 75% less proxy compute.

Key insights

Leveraging full loss trajectories from proxy runs enables dynamic data mixture optimization for LLM pretraining.

Principles

Model learning preferences evolve during training.
Loss trajectories provide richer optimization data.
Dynamic data mixing boosts downstream performance.

Method

RegMix-D trains a regression model on proxy run loss trajectories to predict optimal data mixtures for multiple training stages. It supports offline schedule generation or online adaptation.
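The idea of regressing on trajectories rather than endpoints can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a plain linear least-squares model per training stage, synthetic proxy data, and a random search over candidate mixtures. The names `fit_stage_regressors`, `predict_schedule`, `proxy_mixtures`, and `proxy_losses` are hypothetical.

```python
import numpy as np

def fit_stage_regressors(mixtures, trajectories):
    """Fit one linear model per stage mapping mixture weights to proxy loss.

    mixtures:     (n_proxies, n_domains), each row sums to 1
    trajectories: (n_proxies, n_stages), proxy loss recorded at each stage
    returns:      (n_domains, n_stages) coefficient matrix
    """
    # No bias column: the weights sum to 1, so a constant offset is
    # already absorbed by the per-domain coefficients.
    coef, *_ = np.linalg.lstsq(mixtures, trajectories, rcond=None)
    return coef

def predict_schedule(coef, n_candidates=2000, seed=0):
    """Pick, for each stage, the candidate mixture with lowest predicted loss."""
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(coef.shape[0]), size=n_candidates)
    predicted = candidates @ coef        # (n_candidates, n_stages)
    best = predicted.argmin(axis=0)      # best candidate index per stage
    return candidates[best]              # (n_stages, n_domains)

# Synthetic stand-in for proxy runs (assumption): 64 proxies, 3 domains, 4 stages.
rng = np.random.default_rng(0)
proxy_mixtures = rng.dirichlet(np.ones(3), size=64)
# Hypothetical ground truth: domain 0 helps most early, domain 2 helps most late.
true_coef = np.array([[1.0, 1.5, 2.5, 3.0],
                      [2.0, 2.0, 2.0, 2.0],
                      [3.0, 2.5, 1.5, 1.0]])
proxy_losses = proxy_mixtures @ true_coef + rng.normal(0.0, 0.01, size=(64, 4))

coef = fit_stage_regressors(proxy_mixtures, proxy_losses)
schedule = predict_schedule(coef)
print(schedule.round(2))  # one mixture per stage
```

A static method would use only the last column of `proxy_losses`, while this sketch fits every column and so yields a different mixture per stage. A linear model pushes each stage's optimum toward a single domain, so a practical regressor would likely need to be more expressive.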

In practice

Implement dynamic data mixture schedules.
Use full loss trajectories for mixture optimization.
Reduce proxy compute for data mixture selection.
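As a rough illustration of the first point, an offline schedule could be consumed by a data loader along these lines. This is a sketch under stated assumptions, not the paper's pipeline: the stage boundaries, the three-domain mixtures, and the helper names `mixture_at` and `sample_domains` are all hypothetical. Only the 25 billion token total comes from the summary above.

```python
import bisect
import numpy as np

def mixture_at(tokens_seen, stage_ends, schedule):
    """Return the mixture for the stage that contains tokens_seen.

    stage_ends: sorted token counts at which each stage ends (exclusive)
    schedule:   (n_stages, n_domains) array, one mixture per stage
    """
    idx = bisect.bisect_right(stage_ends, tokens_seen)
    idx = min(idx, len(schedule) - 1)  # stay on the last stage past the end
    return schedule[idx]

def sample_domains(mixture, batch_size, rng):
    """Draw the source domain index for each sequence in a batch."""
    return rng.choice(len(mixture), size=batch_size, p=mixture)

# Hypothetical three-stage schedule over a 25B-token run.
stage_ends = [5_000_000_000, 15_000_000_000, 25_000_000_000]
schedule = np.array([[0.6, 0.3, 0.1],
                     [0.4, 0.3, 0.3],
                     [0.2, 0.3, 0.5]])

rng = np.random.default_rng(0)
current = mixture_at(7_000_000_000, stage_ends, schedule)
batch_domains = sample_domains(current, batch_size=8, rng=rng)
print(current, batch_domains)
```

An online variant would replace the fixed `schedule` lookup with a mixture recomputed from the target model's observed loss. The summary does not describe that update rule, so it is left out here.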

Topics

Large Language Models
Data Mixture Optimization
Pretraining Strategies
Proxy Training
Dynamic Data Mixing
Loss Trajectories
RegMix-D

Code references

LMMMEng/ParaX

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.