RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

RegMix-D extends RegMix to dynamic data mixture selection during large language model (LLM) pretraining. Instead of using only endpoint losses, it uses the full loss trajectories of small-scale proxy runs to train a regression model that predicts the optimal data mixture at multiple training stages. It comes in two variants: an offline variant that generates a mixture schedule ahead of time, and an online variant that adjusts the mixture adaptively during training. In experiments with 25B tokens of the Pile dataset and a 1B-parameter target model, RegMix-D consistently improved over RegMix and DoReMi across 13 downstream tasks. It did so with only 128 proxy models, which is 25% of RegMix's proxy compute budget.

Key takeaway

For machine learning engineers pretraining large language models, RegMix-D is a more compute-efficient way to optimize data mixtures. Its dynamic mixing approach, which uses full loss trajectories from proxy runs, improved downstream task performance over RegMix and DoReMi in the reported experiments while using only 25% of RegMix's proxy compute budget.

Key insights

RegMix-D dynamically optimizes LLM data mixtures by analyzing full loss trajectories from proxy training runs.

Principles

Method

RegMix-D trains a regression model on full loss trajectories from proxy runs to predict optimal data mixtures at various training stages, supporting both pre-scheduled offline and adaptive online deployment.
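The offline variant can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes synthetic proxy data, ordinary least squares as a stand-in regressor (the summary does not say which regressor RegMix-D uses), and random candidate search over the mixture simplex. All function and variable names here (`fit_stage_regressors`, `offline_schedule`, `true_cost`) are hypothetical.

```python
import numpy as np


def fit_stage_regressors(mixtures, trajectories):
    """Fit one linear map per training stage from mixture weights to loss.

    mixtures:     (n_proxies, n_domains), each row sums to 1
    trajectories: (n_proxies, n_stages), proxy loss at each checkpoint
    returns:      (n_domains, n_stages) coefficient matrix
    """
    # No intercept column: rows sum to 1, so a constant is already absorbed.
    coef, *_ = np.linalg.lstsq(mixtures, trajectories, rcond=None)
    return coef


def offline_schedule(coef, candidates):
    """Pick, for each stage, the candidate mixture with lowest predicted loss."""
    predicted = candidates @ coef        # (n_candidates, n_stages)
    best = predicted.argmin(axis=0)      # best candidate index per stage
    return candidates[best]              # (n_stages, n_domains)


rng = np.random.default_rng(0)
# 128 proxies echoes the summary; the domain and stage counts are illustrative.
n_proxies, n_domains, n_stages = 128, 4, 3

# Each proxy run trains on a random mixture drawn from the simplex.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxies)

# Synthetic stand-in for measured trajectories: a per-domain cost that
# changes from stage to stage, plus a little noise.
true_cost = rng.uniform(1.0, 3.0, size=(n_domains, n_stages))
noise = rng.normal(0.0, 0.01, size=(n_proxies, n_stages))
trajectories = mixtures @ true_cost + noise

coef = fit_stage_regressors(mixtures, trajectories)
candidates = rng.dirichlet(np.ones(n_domains), size=2000)
schedule = offline_schedule(coef, candidates)

for stage, weights in enumerate(schedule):
    print(f"stage {stage}: {np.round(weights, 3)}")
```

Because the regressor sees a loss value per stage rather than a single endpoint loss, the selected mixture can differ from stage to stage. An online variant would presumably refit or re-query the regressor as losses are observed during the target run, but the summary gives no details on that step.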

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.