Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

2026-06-07 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new approach, the Discrepancy-Constrained Markov Decision Process (DCMDP), addresses the train-inference discrepancy that causes unpredictable performance and training collapse in Large Language Model (LLM) reinforcement learning. The researchers found that the training policy can self-correct this discrepancy, and they identify a "discrepancy tolerance region": inside it, narrowing the discrepancy further can hinder exploration; outside it, reducing the discrepancy improves optimization. DCMDP formulates training as reward maximization coupled with a constraint that aligns training and inference behavior. A Lagrangian relaxation mechanism dynamically adjusts the weights of the two objectives, balancing performance improvement against discrepancy control. This yields stable dual-objective optimization and lets the policy explore within safe boundaries. In experiments, DCMDP significantly improves the performance of the 8B dense model Qwen-3-8b and the 30B Mixture-of-Experts model Qwen-3-30bA3b, and it facilitates heterogeneous training for resource-constrained inference deployments.

Key takeaway

Machine learning engineers optimizing LLMs with reinforcement learning should consider the Discrepancy-Constrained Markov Decision Process (DCMDP) to mitigate train-inference discrepancy. The approach stabilizes training and improves performance; the reported gains are on Qwen-3-8b and Qwen-3-30bA3b. Explicitly aligning training with low-cost inference deployment makes optimization more consistent and efficient, and model behavior more robust in production.

Key insights

LLM RL training becomes more stable when the train-inference discrepancy is explicitly constrained to a tolerance region.

Principles

Train-inference discrepancy causes RL instability.
A "discrepancy tolerance region" exists for optimal exploration.
Policy self-correction is possible with proper signals.

Method

DCMDP (Discrepancy-Constrained Markov Decision Process) casts LLM RL as reward maximization subject to a constraint that keeps training and inference behavior aligned. Lagrangian relaxation dynamically adjusts the weight between the two objectives, which keeps the dual-objective optimization stable.
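
The Lagrangian mechanism described above can be illustrated with a generic dual-ascent update. This is a minimal sketch of the standard technique, not the authors' implementation: the summary does not give the update rule, so the class name `LagrangeMultiplier`, the step size `lr`, the `tolerance` value, and the scalar form of the loss are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LagrangeMultiplier:
    """Dual variable for a constraint of the form: discrepancy <= tolerance.

    All defaults are hypothetical; the source does not specify them.
    """
    value: float = 0.0       # current multiplier (lambda), kept non-negative
    lr: float = 0.05         # dual step size (assumed)
    tolerance: float = 0.01  # allowed discrepancy level (assumed)

    def update(self, discrepancy: float) -> float:
        # Dual ascent: raise lambda when the constraint is violated,
        # lower it (never below zero) when there is slack.
        step = self.lr * (discrepancy - self.tolerance)
        self.value = max(0.0, self.value + step)
        return self.value


def lagrangian_loss(reward_loss: float, discrepancy: float,
                    lam: LagrangeMultiplier) -> float:
    # Primal objective: reward term plus lambda-weighted constraint violation.
    return reward_loss + lam.value * (discrepancy - lam.tolerance)
```

In a training loop under these assumptions, the policy would take a gradient step on `lagrangian_loss` and then call `update` with the measured discrepancy, so the penalty weight grows only while the discrepancy sits above the tolerance and decays back toward zero once it falls inside.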

In practice

Apply DCMDP to stabilize LLM RL training.
Optimize LLMs for low-cost inference via heterogeneous training.
Monitor train-inference discrepancy to guide policy exploration.
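
Monitoring the discrepancy, as the last item suggests, requires some measurable quantity. The summary does not say which measure DCMDP uses, so the sketch below assumes one common choice: comparing the log-probabilities that the training engine and the inference engine assign to the same sampled tokens. The function name `discrepancy_stats` and the returned keys are hypothetical.

```python
import math
from typing import Dict, Sequence


def discrepancy_stats(train_logprobs: Sequence[float],
                      infer_logprobs: Sequence[float]) -> Dict[str, float]:
    """Summarize train-inference mismatch on tokens sampled by the inference engine.

    Assumption: both inputs hold per-token log-probabilities of the same
    sampled tokens, one from each engine.
    """
    if len(train_logprobs) != len(infer_logprobs) or not train_logprobs:
        raise ValueError("expected two non-empty sequences of equal length")

    # Per-token log ratio: log(pi_train / pi_infer).
    log_ratios = [t - i for t, i in zip(train_logprobs, infer_logprobs)]
    n = len(log_ratios)

    # Non-negative estimate of KL(pi_infer || pi_train) from inference samples:
    # mean of (r - 1) - log r, where r = pi_train / pi_infer.
    kl = sum(math.exp(d) - 1.0 - d for d in log_ratios) / n

    return {
        "kl_estimate": kl,
        "max_abs_log_ratio": max(abs(d) for d in log_ratios),
    }
```

Logging these two numbers per training step would show whether the discrepancy is drifting; what counts as "inside the tolerance region" would still have to be chosen for the model and setup at hand.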

Topics

Large Language Models
Reinforcement Learning
Train-Inference Discrepancy
DCMDP
Efficient LLM Training
Resource-Constrained Inference

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.