To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
Summary
This study, named M2RL, systematically compares two multi-domain Reinforcement Learning with Verifiable Rewards (RLVR) paradigms for Large Language Models (LLMs): mixed multi-task RLVR and separate RLVR followed by model merging. Using the Qwen3-4B-Base model and open-source Nemotron 3 Nano datasets, experiments were conducted across the math, coding, science, and instruction-following domains. Key findings indicate that mixed multi-task RLVR achieves performance comparable to model merging while requiring only 33.2% of the GPU hours. The research reveals minimal inter-task interference and significant synergistic effects, particularly among reasoning-intensive domains. Analysis of weight space geometry, model prediction behavior, and information constraints shows that RLVR induces self-discrimination and that multi-task training fosters cross-domain synergy, enhancing overall evaluation accuracy.
Key takeaway
Research scientists developing multi-domain LLMs should consider adopting mixed multi-task Reinforcement Learning with Verifiable Rewards (RLVR). This approach matches the performance of separate per-domain training followed by model merging while using only 33.2% of the GPU hours. It also yields synergistic gains, especially across reasoning-intensive tasks, and naturally strengthens the model's self-discrimination capabilities, making it a more efficient path toward general expert-level LLMs.
Key insights
Multi-task RLVR for LLMs offers comparable performance to model merging with significantly reduced computational cost.
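For readers unfamiliar with the merging baseline, the sketch below shows one common way to combine separately trained domain experts: averaging their task vectors (each expert's weights minus the base weights) and adding the result back to the base. This summary does not say which merging method the study used, so the averaging rule, the function name `merge_task_vectors`, the `scale` parameter, and the toy dictionary-of-lists weight format are all illustrative assumptions.

```python
from typing import Dict, List

# Toy weight format (assumption): parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def merge_task_vectors(base: Weights, experts: List[Weights], scale: float = 1.0) -> Weights:
    """Merge experts by averaging their task vectors (expert - base).

    Hypothetical helper for illustration; not the merging method reported
    in the study, which this summary does not specify.
    """
    merged: Weights = {}
    for name, base_param in base.items():
        merged_param = []
        for i, b in enumerate(base_param):
            # Average per-weight change introduced by each domain expert.
            delta = sum(e[name][i] - b for e in experts) / len(experts)
            merged_param.append(b + scale * delta)
        merged[name] = merged_param
    return merged


base = {"w": [0.0, 1.0]}
math_expert = {"w": [2.0, 1.0]}    # changed only the first weight
coding_expert = {"w": [0.0, 3.0]}  # changed only the second weight
merged = merge_task_vectors(base, [math_expert, coding_expert])
```

Note that this route needs one full RLVR run per domain before merging, whereas the mixed paradigm trains a single policy on all domains at once.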
Principles
- Reasoning-intensive domains exhibit synergistic effects in RLVR.
- RLVR naturally induces self-discrimination capabilities in LLMs.
- Policy neighborhoods can enhance multi-domain model performance.
Method
The study uses a simplified Supervised Fine-Tuning (SFT) then Reinforcement Learning (RL) pipeline, applying Group Relative Policy Optimization (GRPO) with a domain-routed reward function for multi-task RLVR.
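A minimal sketch of how a domain-routed reward can feed GRPO's group-relative advantages is given below. The verifier functions, the `VERIFIERS` table, and the reference formats are hypothetical placeholders rather than the study's actual reward code; the advantage formula follows the standard GRPO normalization (reward minus group mean, divided by group standard deviation), here using the population standard deviation and a small `eps` as assumptions.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def verify_math(response: str, reference: str) -> float:
    # Hypothetical check: the response must end with the reference answer.
    return 1.0 if response.strip().endswith(reference.strip()) else 0.0


def verify_instruction(response: str, reference: str) -> float:
    # Hypothetical constraint: reference holds a maximum word count.
    return 1.0 if len(response.split()) <= int(reference) else 0.0


# Routing table (assumption): each domain label maps to its own verifier.
VERIFIERS: Dict[str, Callable[[str, str], float]] = {
    "math": verify_math,
    "instruction_following": verify_instruction,
}


def routed_reward(domain: str, response: str, reference: str) -> float:
    """Dispatch a sampled response to the verifier for its prompt's domain."""
    return VERIFIERS[domain](response, reference)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One math prompt with a group of four sampled responses.
group = ["The answer is 42", "I think it is 41", "Probably 7", "So we get 42"]
rewards = [routed_reward("math", resp, "42") for resp in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are normalized per prompt group, prompts from different domains can share a batch without their reward scales being compared against each other directly.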
In practice
- Consider mixed multi-task RLVR for multi-domain LLM training to save GPU hours.
- Prioritize process-based verification for logic-intensive tasks like math and coding.
- Utilize outcome-based verification for constraint-intensive tasks such as instruction following.
Topics
- Reinforcement Learning with Verifiable Rewards
- Multi-domain Learning
- Model Merging
- Multi-task Training
- Policy Neighborhoods
Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.