To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics · Depth: Expert, extended

Summary

This study, named M2RL, systematically compares two paradigms for multi-domain Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models (LLMs): mixed multi-task RLVR, and separate per-domain RLVR followed by model merging. Experiments use the Qwen3-4B-Base model and the open-source Nemotron 3 Nano datasets, and cover four domains: math, coding, science, and instruction following. The key finding is that mixed multi-task RLVR achieves performance comparable to model merging while requiring only 33.2% of the GPU hours. The study also reports minimal inter-task interference and significant synergistic effects, particularly among reasoning-intensive domains. Analyses of weight-space geometry, model prediction behavior, and information constraints indicate that RLVR induces self-discrimination and that multi-task training fosters cross-domain synergy, which improves overall evaluation accuracy.

Key takeaway

Research scientists developing multi-domain LLMs should consider adopting mixed multi-task Reinforcement Learning with Verifiable Rewards (RLVR). This approach offers performance comparable to separate training followed by model merging, at substantially lower computational cost: it requires only 33.2% of the GPU hours, a reduction of roughly two thirds. It also yields synergistic gains, especially across reasoning-intensive tasks, and naturally enhances the model's self-discrimination capabilities. This makes it a more efficient path to developing general expert-level LLMs.

Key insights

Multi-task RLVR for LLMs offers comparable performance to model merging with significantly reduced computational cost.
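
For readers unfamiliar with the merging baseline, the following is a minimal sketch of one common merging scheme: averaging each expert's weight delta from a shared base checkpoint. This summary does not state which merging algorithm the study uses, so the scheme, the function name merge_task_vectors, the scale parameter, and the toy weights are all assumptions made for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def merge_task_vectors(base: Weights, experts: List[Weights], scale: float = 1.0) -> Weights:
    """Merge domain experts by averaging their deltas from a shared base model.

    Assumption: every expert was fine-tuned from `base` and has identical
    parameter names and shapes. This is an illustrative scheme, not
    necessarily the one used in the study.
    """
    if not experts:
        raise ValueError("at least one expert checkpoint is required")
    merged: Weights = {}
    for name, base_w in base.items():
        # Average of (expert - base) across experts, element by element.
        avg_delta = [
            sum(expert[name][i] - base_w[i] for expert in experts) / len(experts)
            for i in range(len(base_w))
        ]
        merged[name] = [b + scale * d for b, d in zip(base_w, avg_delta)]
    return merged


# Toy example: two hypothetical experts that each moved a different weight.
base = {"w": [0.0, 1.0]}
math_expert = {"w": [0.2, 1.0]}
code_expert = {"w": [0.0, 1.4]}
merged = merge_task_vectors(base, [math_expert, code_expert])
print(merged)
```

In the merging paradigm, one such expert must be trained per domain before this step can run, which is where the extra GPU hours come from; the mixed paradigm trains a single model on all domains at once.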

Method

The study uses a simplified two-stage pipeline: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). For multi-task RLVR, it applies Group Relative Policy Optimization (GRPO) with a domain-routed reward function, meaning each sampled response is scored by the verifier for its prompt's domain.
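
The two ideas in this method, routing each response to a domain-specific verifier and normalizing rewards within a group of samples for the same prompt, can be sketched as follows. The verifiers here (math_reward, instruction_reward) are toy placeholders, and the router, domain names, and epsilon value are assumptions; the study's actual reward functions are not described in this summary. The sketch uses the population standard deviation for the group normalization, which is one possible choice.

```python
import re
from statistics import mean, pstdev
from typing import Callable, Dict, List


# Hypothetical verifiers: toy stand-ins, not the study's reward functions.
def math_reward(response: str, reference: str) -> float:
    """Return 1.0 if the last number in the response equals the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == reference else 0.0


def instruction_reward(response: str, reference: str) -> float:
    """Return 1.0 if the response contains the required keyword."""
    return 1.0 if reference.lower() in response.lower() else 0.0


# Assumed router: maps a prompt's domain tag to its verifier.
REWARD_ROUTER: Dict[str, Callable[[str, str], float]] = {
    "math": math_reward,
    "instruction_following": instruction_reward,
}


def routed_reward(domain: str, response: str, reference: str) -> float:
    """Score a response with the verifier registered for its domain."""
    if domain not in REWARD_ROUTER:
        raise ValueError(f"no verifier registered for domain {domain!r}")
    return REWARD_ROUTER[domain](response, reference)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: each reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt from the math domain with a group of four sampled responses.
group = ["The answer is 42", "I think 41", "42", "no idea"]
rewards = [routed_reward("math", g, "42") for g in group]
advantages = group_relative_advantages(rewards)
print(rewards)      # [1.0, 0.0, 1.0, 0.0]
print(advantages)   # roughly [1, -1, 1, -1]
```

Because advantages are computed per group, a mixed batch can contain prompts from several domains without the reward scales of different verifiers being compared directly against each other.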

Best for: Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.