OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

2026-04-01 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

OptiMer is a framework for tuning data mixture ratios in continual pre-training (CPT) of large language models (LLMs). Conventionally, these ratios are fixed before training, so finding a good mixture requires expensive, time-consuming hyperparameter tuning. OptiMer decouples ratio selection from training: it trains one CPT model per dataset independently and extracts from each a "distribution vector", the parameter shift that dataset induces relative to the base model. These vectors are then composed post-hoc, with merge weights chosen by Bayesian optimization using the Tree-structured Parzen Estimator (TPE). Experiments on Gemma 3 27B across Japanese, Chinese, Math, and Code datasets show that OptiMer consistently outperforms data mixture and model averaging baselines at 15–35× lower search cost. The same vector pool can also be re-optimized for different objectives without retraining, enabling on-demand, target-tailored models.

Key takeaway

For AI engineers and research scientists adapting LLMs to new languages or domains, OptiMer offers a more efficient and flexible approach to continual pre-training. Instead of tuning data mixture ratios up front, you train one model per dataset and optimize their composition post-hoc, potentially saving weeks of GPU compute time. The same set of trained components can then be re-optimized to produce objective-specific models, which supports rapid iteration and reduces wasted compute.

Key insights

OptiMer decouples CPT data ratio selection from training via post-hoc Bayesian optimization of distribution vectors.

Principles

Distribution vectors are approximately orthogonal.
CPT trajectories are approximately linear in parameter space.
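
As a rough illustration of the orthogonality principle, the sketch below computes the cosine similarity between two distribution vectors. The flattened parameter lists, their values, and the helper names `distribution_vector` and `cosine` are all hypothetical; real models have billions of parameters and the source does not specify how similarity was measured.

```python
import math

def distribution_vector(base, tuned):
    """Parameter shift from the base model to a CPT model (flat lists of floats)."""
    return [t - b for b, t in zip(base, tuned)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy flattened parameters (hypothetical values, for illustration only).
base = [0.0, 0.0, 0.0, 0.0]
cpt_japanese = [1.0, 0.1, 0.0, 0.0]
cpt_math = [0.05, 0.0, 0.9, 0.1]

tau_ja = distribution_vector(base, cpt_japanese)
tau_math = distribution_vector(base, cpt_math)

# A value near 0 means the two shifts point in nearly unrelated directions.
similarity = cosine(tau_ja, tau_math)
print(f"cosine(tau_ja, tau_math) = {similarity:.3f}")
```

A similarity close to zero suggests the two shifts can be added with little mutual interference, which is what makes a weighted sum of vectors a plausible substitute for training on mixed data.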

Method

Train individual CPT models, extract distribution vectors (parameter shifts), then use Bayesian optimization (TPE) to find optimal merge weights for these vectors post-hoc, maximizing an evaluation score on a development set.
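
The steps above can be sketched on toy data as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: parameters are short flat lists, `dev_score` is a hypothetical stand-in for a development-set evaluation, the weight range of -1 to 2 is an assumption, and plain random search is swapped in for TPE so the example has no dependencies. A TPE implementation such as Optuna's `TPESampler` could be substituted for the search loop.

```python
import random

def distribution_vector(base, tuned):
    """Parameter shift induced by CPT on one dataset."""
    return [t - b for b, t in zip(base, tuned)]

def merge(base, vectors, weights):
    """Merged parameters: base + sum_i w_i * tau_i."""
    merged = list(base)
    for w, tau in zip(weights, vectors):
        for j, delta in enumerate(tau):
            merged[j] += w * delta
    return merged

def dev_score(params, target):
    """Hypothetical dev-set score: negative squared distance to a target."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def search_weights(base, vectors, score_fn, n_trials=2000,
                   low=-1.0, high=2.0, seed=0):
    """Random search over merge weights (a stand-in for TPE, not TPE itself)."""
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(n_trials):
        w = [rng.uniform(low, high) for _ in vectors]
        s = score_fn(merge(base, vectors, w))
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# One independently trained CPT model per dataset (toy values).
base = [0.0, 0.0, 0.0]
cpt_models = {
    "japanese": [1.0, 0.0, 0.0],
    "math": [0.0, 1.0, 0.0],
    "code": [0.0, 0.0, 1.0],
}
vectors = [distribution_vector(base, m) for m in cpt_models.values()]

target = [0.8, 0.5, 0.2]  # hypothetical optimum for the dev objective
best_weights, best_score = search_weights(
    base, vectors, lambda p: dev_score(p, target))

# Uniform weights correspond to simple model averaging.
uniform_score = dev_score(merge(base, vectors, [1 / 3] * 3), target)
print(best_weights, best_score, uniform_score)
```

Each trial here costs only a merge and an evaluation, with no training, which is the property that lets the search run after the CPT models already exist.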

In practice

Optimize merge weights for target-tailored models on demand.
Allow negative weights to remove cross-distribution interference.
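
The toy example below illustrates both points: the same two vectors are re-optimized for two objectives, and for one of them the best weight turns out to be negative. Everything here is hypothetical, including the vector values, the targets, the grid range of -1.0 to 1.5, and the use of an exhaustive grid in place of TPE. The Japanese vector is constructed to drift along the code axis so that interference exists to be removed.

```python
from itertools import product

def merge(base, vectors, weights):
    """Merged parameters: base + sum_i w_i * tau_i."""
    merged = list(base)
    for w, tau in zip(weights, vectors):
        for j, delta in enumerate(tau):
            merged[j] += w * delta
    return merged

def score(params, target):
    """Hypothetical objective: negative squared distance to a target."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def best_on_grid(base, vectors, target, grid):
    """Exhaustive search over weight combinations (stand-in for TPE)."""
    return max(product(grid, repeat=len(vectors)),
               key=lambda w: score(merge(base, vectors, w), target))

base = [0.0, 0.0]
tau_ja = [1.0, 0.4]    # Japanese CPT also drifts along the "code" axis
tau_code = [0.0, 1.0]
vectors = [tau_ja, tau_code]

grid = [i / 10 for i in range(-10, 16)]        # -1.0 .. 1.5
nonneg_grid = [w for w in grid if w >= 0.0]    # negatives disallowed

ja_only = [1.0, 0.0]   # objective 1: Japanese ability, no code drift
ja_code = [1.0, 1.0]   # objective 2: both abilities

w_ja = best_on_grid(base, vectors, ja_only, grid)
w_ja_nonneg = best_on_grid(base, vectors, ja_only, nonneg_grid)
w_both = best_on_grid(base, vectors, ja_code, grid)

print("Japanese-only weights:", w_ja)
print("Japanese + code weights:", w_both)
```

In this construction, the Japanese-only objective is best served by a negative weight on the code vector, which cancels the drift that the Japanese vector carries. Restricting the search to non-negative weights yields a strictly worse score, and the second objective reuses the same vectors without any retraining.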

Topics

Continual Pre-training
LLM Adaptation
Distribution Vectors
Bayesian Optimization
Tree-structured Parzen Estimator

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.