Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy
Summary
A new approach, the Discrepancy-Constrained Markov Decision Process (DCMDP), addresses the train-inference discrepancy that causes unpredictable performance and training collapses in Large Language Model (LLM) reinforcement learning. The researchers found that the training policy can self-correct this discrepancy, and they identified a "discrepancy tolerance region": inside the region, narrowing the discrepancy further can hinder exploration, while outside it, reducing the discrepancy improves optimization. DCMDP formulates training as a reward maximization problem coupled with a constraint that aligns training and inference behavior. It uses a Lagrangian relaxation mechanism to adjust the objective weights dynamically, balancing performance improvement against discrepancy control. This keeps the dual-objective optimization stable and lets the policy explore within safe boundaries. Empirically, DCMDP significantly improves the performance of the 8B dense model Qwen-3-8b and the 30B Mixture-of-Experts model Qwen-3-30bA3b, and it facilitates heterogeneous training for resource-constrained inference deployments.
Key takeaway
Machine learning engineers optimizing LLMs with reinforcement learning should consider the Discrepancy-Constrained Markov Decision Process (DCMDP) as a way to mitigate train-inference discrepancy. The approach is reported to stabilize training and improve performance on Qwen-3-8b and Qwen-3-30bA3b. Because it explicitly aligns training with low-cost inference deployment, it can make optimization more consistent and efficient when the training and inference stacks differ.
Key insights
LLM RL training becomes more stable when the train-inference discrepancy is explicitly constrained to a tolerance region.
Principles
- Train-inference discrepancy causes RL instability.
- A "discrepancy tolerance region" exists for optimal exploration.
- Policy self-correction is possible with proper signals.
Method
DCMDP formulates RL as a Discrepancy-Constrained Markov Decision Process. It uses Lagrangian relaxation to dynamically balance reward maximization and a constraint aligning training-inference behavior, ensuring stable dual-objective optimization.
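The Lagrangian mechanism can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the constraint takes the form "discrepancy <= tolerance" for a scalar discrepancy measure, and the names `relaxed_objective`, `update_multiplier`, `tolerance`, and `lr` are hypothetical.

```python
def relaxed_objective(reward, discrepancy, lam, tolerance):
    """Lagrangian-relaxed objective for the policy to maximize.

    Assumed constraint (not taken from the source): discrepancy <= tolerance.
    The multiplier lam weights how strongly a violation is penalized.
    """
    return reward - lam * (discrepancy - tolerance)


def update_multiplier(lam, discrepancy, tolerance, lr=0.1):
    """Projected dual-ascent step on the multiplier.

    lam grows while the constraint is violated and shrinks once the
    discrepancy is back inside the tolerance; it is clipped at zero so
    the penalty never turns into a bonus for large discrepancy.
    """
    return max(0.0, lam + lr * (discrepancy - tolerance))


if __name__ == "__main__":
    lam = 0.0
    tolerance = 0.1  # hypothetical tolerance value
    # Hypothetical per-step discrepancy measurements.
    for step, disc in enumerate([0.5, 0.4, 0.2, 0.05, 0.02]):
        lam = update_multiplier(lam, disc, tolerance)
        print(step, round(lam, 4))
```

In this sketch the multiplier rises while the measured discrepancy exceeds the tolerance and decays once it falls back inside, which is one way objective weights can shift between reward and discrepancy control without being hand-tuned.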
In practice
- Apply DCMDP to stabilize LLM RL training.
- Optimize LLMs for low-cost inference via heterogeneous training.
- Monitor train-inference discrepancy to guide policy exploration.
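Monitoring the discrepancy could look like the sketch below. It assumes that both the training engine and the inference engine can report per-token log-probabilities for the same sampled tokens, and it uses the mean absolute log-probability gap as the discrepancy measure. That measure, the function names, and the tolerance value are illustrative assumptions; the source does not specify how DCMDP measures discrepancy.

```python
def mean_logprob_gap(train_logprobs, infer_logprobs):
    """Mean absolute gap between per-token log-probs from the two engines.

    Both lists must refer to the same sampled tokens, in the same order.
    """
    if len(train_logprobs) != len(infer_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if not train_logprobs:
        return 0.0
    total = sum(abs(t - i) for t, i in zip(train_logprobs, infer_logprobs))
    return total / len(train_logprobs)


def within_tolerance(gap, tolerance):
    """True when the measured gap lies inside the assumed tolerance region."""
    return gap <= tolerance


if __name__ == "__main__":
    # Hypothetical log-probs for three sampled tokens.
    train = [-1.0, -2.0, -0.5]
    infer = [-1.0, -2.5, -0.5]
    gap = mean_logprob_gap(train, infer)
    print(gap, within_tolerance(gap, tolerance=0.2))
```

Logging this value per batch gives a simple signal for when the policy has drifted outside the chosen tolerance.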
Topics
- Large Language Models
- Reinforcement Learning
- Train-Inference Discrepancy
- DCMDP
- Efficient LLM Training
- Resource-Constrained Inference
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.