OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training
Summary
OptiMer is a framework for tuning data mixture ratios in continual pre-training (CPT) of large language models (LLMs). Traditionally, these ratios are fixed before training, so tuning them is expensive and time-consuming. OptiMer decouples ratio selection from training: it trains one CPT model per dataset independently and extracts from each a "distribution vector" representing that model's parameter shift. These vectors are then composed post-hoc, with merge weights found by Bayesian optimization using the Tree-structured Parzen Estimator (TPE). Experiments on Gemma 3 27B across Japanese, Chinese, Math, and Code datasets show that OptiMer consistently outperforms data mixture and model averaging baselines at 15–35× lower search cost. The same vector pool can also be re-optimized for different objectives without retraining, enabling on-demand, target-tailored models.
Key takeaway
For AI Engineers and Research Scientists adapting LLMs to new languages or domains, OptiMer offers a significantly more efficient and flexible approach to continual pre-training. Instead of costly, upfront data mixture ratio tuning, you can train models independently and optimize their composition post-hoc, saving weeks of GPU compute time. This method allows for rapid iteration and the creation of objective-specific models from a single set of trained components, enhancing adaptability and reducing resource waste.
Key insights
OptiMer decouples CPT data ratio selection from training via post-hoc Bayesian optimization of distribution vectors.
Principles
- Distribution vectors are approximately orthogonal.
- CPT trajectories are approximately linear in parameter space.
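Both properties can be probed numerically on real checkpoints. The following is a minimal sketch, assuming each model's parameters are flattened into a plain Python list and that a distribution vector is the difference between a CPT checkpoint and its starting checkpoint. The helper names `distribution_vector` and `cosine` and all toy values are hypothetical and not from the paper. Cosine similarity between an intermediate and a final shift is only one simple proxy for linearity.

```python
import math

def distribution_vector(cpt_params, base_params):
    """Parameter shift induced by CPT: tau = theta_cpt - theta_base (assumed definition)."""
    return [c - b for c, b in zip(cpt_params, base_params)]

def cosine(u, v):
    """Cosine similarity; values near 0 indicate near-orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy flattened parameters (hypothetical values, for illustration only).
base = [0.0, 0.0, 0.0, 0.0]
cpt_japanese_mid = [0.5, 0.05, 0.0, 0.0]   # intermediate checkpoint
cpt_japanese = [1.0, 0.1, 0.0, 0.0]        # final checkpoint
cpt_math = [0.0, 0.1, 1.0, 0.0]

tau_ja_mid = distribution_vector(cpt_japanese_mid, base)
tau_ja = distribution_vector(cpt_japanese, base)
tau_math = distribution_vector(cpt_math, base)

# Orthogonality proxy: similarity between vectors from different datasets.
similarity = cosine(tau_ja, tau_math)
# Linearity proxy: an intermediate shift should point the same way as the final one.
linearity = cosine(tau_ja_mid, tau_ja)
```

In this toy setup `similarity` is close to 0 and `linearity` is close to 1, which is the pattern the two principles describe.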
Method
Train one CPT model per dataset, extract each model's distribution vector (the parameter shift induced by CPT), then use Bayesian optimization (TPE) to find the merge weights for these vectors post-hoc that maximize an evaluation score on a development set.
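The merge-and-search loop can be sketched on toy data as follows. This is an illustration under stated assumptions, not the paper's implementation: parameters are plain Python lists, `dev_score` is a hypothetical stand-in for a real development-set evaluation, the weight range of [-1, 2] is arbitrary, and plain random search is swapped in for TPE so that the example needs only the standard library. In practice a TPE sampler (for example, the one provided by Optuna) would propose the weights instead.

```python
import random

def merge(base, vectors, weights):
    """theta_merged = theta_base + sum_i w_i * tau_i."""
    merged = list(base)
    for name, tau in vectors.items():
        w = weights[name]
        merged = [m + w * t for m, t in zip(merged, tau)]
    return merged

def dev_score(params, target):
    """Hypothetical dev-set score: negative squared distance to a target (higher is better)."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def search_weights(base, vectors, target, n_trials=500, low=-1.0, high=2.0, seed=0):
    """Random search over merge weights (a stand-in for TPE, not TPE itself)."""
    rng = random.Random(seed)
    best_weights, best_score = None, float("-inf")
    for _ in range(n_trials):
        weights = {name: rng.uniform(low, high) for name in vectors}
        score = dev_score(merge(base, vectors, weights), target)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

# Toy base parameters and distribution vectors (hypothetical values).
base = [0.0, 0.0, 0.0]
vectors = {"japanese": [1.0, 0.0, 0.0], "math": [0.0, 1.0, 0.0]}
target = [0.8, 0.3, 0.0]

best_weights, best_score = search_weights(base, vectors, target)

# Uniform averaging of the two vectors, for comparison.
uniform = {"japanese": 0.5, "math": 0.5}
uniform_score = dev_score(merge(base, vectors, uniform), target)
```

Each trial costs only a merge and an evaluation, with no training, which is why the search is cheap compared with retraining for every candidate data mixture.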
In practice
- Optimize merge weights for target-tailored models on demand.
- Allow negative weights to remove cross-distribution interference.
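The two practices above can be illustrated with a toy pool in which the Japanese vector carries a small unwanted component along the code direction. This is a hedged sketch: the vectors, targets, `dev_score`, and the weight ranges are all hypothetical, and an exhaustive grid search stands in for TPE to keep the example deterministic and dependency-free.

```python
from itertools import product

def merge(base, vectors, weights):
    """theta_merged = theta_base + sum_i w_i * tau_i."""
    merged = list(base)
    for name, tau in vectors.items():
        merged = [m + weights[name] * t for m, t in zip(merged, tau)]
    return merged

def dev_score(params, target):
    """Hypothetical objective: negative squared distance to a target (higher is better)."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def grid_search(base, vectors, target, low, high, step=0.05):
    """Exhaustive search over a weight grid (a stand-in for TPE)."""
    names = list(vectors)
    n_steps = round((high - low) / step)
    grid = [low + i * step for i in range(n_steps + 1)]
    best_weights, best_score = None, float("-inf")
    for combo in product(grid, repeat=len(names)):
        weights = dict(zip(names, combo))
        score = dev_score(merge(base, vectors, weights), target)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score

base = [0.0, 0.0, 0.0]
# The Japanese vector leaks 0.3 into the third (code) coordinate.
pool = {"japanese": [1.0, 0.0, 0.3], "code": [0.0, 0.0, 1.0]}

# Two different objectives served by the same vector pool, with no retraining.
japanese_only = [1.0, 0.0, 0.0]
japanese_and_code = [0.5, 0.0, 1.0]

nonneg_w, nonneg_score = grid_search(base, pool, japanese_only, 0.0, 1.5)
signed_w, signed_score = grid_search(base, pool, japanese_only, -1.0, 1.5)
mixed_w, mixed_score = grid_search(base, pool, japanese_and_code, -1.0, 1.5)
```

For the Japanese-only objective, the search that permits negative weights assigns a negative weight to the code vector and reaches a higher score than the non-negative search, because the negative weight cancels the leaked component. For the mixed objective, the same pool yields a positive code weight.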
Topics
- Continual Pre-training
- LLM Adaptation
- Distribution Vectors
- Bayesian Optimization
- Tree-structured Parzen Estimator
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.