vLLM V0 to V1: Correctness Before Corrections in RL
Summary
ServiceNow AI's PipelineRL team migrated its rollout-generation backend from vLLM V0 (version 0.8.5) to vLLM V1 (version 0.18.1), establishing backend correctness before changing anything in the Reinforcement Learning (RL) objective. The goal was to eliminate train-inference mismatch, where discrepancies in logprob computation between the inference engine and the trainer can alter training dynamics. Initial V1 runs deviated significantly from the V0 reference in trainer-side metrics such as clip rate, KL, entropy, and reward. The team traced the gap to four areas and fixed each: a semantic mismatch in the returned logprobs, V1-specific runtime defaults (prefix caching and async scheduling), the inflight weight-update path, and the precision of the final projection, which now uses an fp32 `lm_head`. After these fixes, the V1 run closely matched the V0 trajectory across all of these training metrics, demonstrating backend parity.
Key takeaway
For MLOps engineers managing online RL systems, inference engine parity during upgrades is critical. Before attempting any objective-level correction, verify that the backend returns logprobs with the semantics the trainer expects and that its runtime behavior matches the reference setup. This keeps inference correctness bugs from being confounded with genuine off-policy or asynchronous mismatch, which makes training curves easier to interpret and the system more robust.
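One way to make this verification concrete is to compare, token by token, the logprobs the trainer recomputes with those the backend returned for the same rollout. The sketch below is illustrative only: the function name `logprob_parity_report`, the tolerance `atol`, and the reported fields are assumptions, not part of PipelineRL or vLLM.

```python
import math


def logprob_parity_report(trainer_logprobs, backend_logprobs, atol=1e-3):
    """Summarize per-token disagreement between trainer and backend logprobs.

    Both arguments are lists of floats for the same sampled tokens.
    `atol` is a hypothetical tolerance; pick one that suits your setup.
    """
    if not trainer_logprobs:
        raise ValueError("need at least one token to compare")
    if len(trainer_logprobs) != len(backend_logprobs):
        raise ValueError("both sequences must cover the same tokens")

    diffs = [abs(t - b) for t, b in zip(trainer_logprobs, backend_logprobs)]
    # Per-token probability ratio trainer/backend; 1.0 means exact agreement.
    ratios = [math.exp(t - b) for t, b in zip(trainer_logprobs, backend_logprobs)]
    return {
        "max_abs_diff": max(diffs),
        "mean_abs_diff": sum(diffs) / len(diffs),
        "min_ratio": min(ratios),
        "max_ratio": max(ratios),
        "within_tolerance": max(diffs) <= atol,
    }
```

Running such a report on a fixed set of prompts before and after a backend upgrade, with identical weights, gives a baseline that does not depend on any RL objective.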
Key insights
Prioritize backend inference correctness before modifying RL objectives to resolve train-inference mismatches.
Principles
- Separate backend behavior from RL objective changes.
- Verify logprob semantics and runtime defaults.
- Match numerical paths for critical computations.
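To illustrate why logprob semantics matter, the following minimal sketch contrasts logprobs taken from unmodified logits with logprobs taken after a sampling step has been applied. It assumes temperature scaling is the only processing step; a real backend may apply others as well. The function names are hypothetical and do not call vLLM.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def raw_logprobs(logits):
    # Logprobs of the model distribution before any sampling processing.
    return log_softmax(logits)


def processed_logprobs(logits, temperature):
    # Logprobs of the distribution tokens are sampled from,
    # assuming temperature scaling is the only processing step.
    return log_softmax([x / temperature for x in logits])


logits = [2.0, 1.0, 0.0]
raw = raw_logprobs(logits)
processed = processed_logprobs(logits, temperature=0.7)
gap = [abs(r - p) for r, p in zip(raw, processed)]
```

With a temperature other than 1.0 the two lists differ for every token, so a trainer that expects one convention while the backend returns the other sees a mismatch that has nothing to do with the policy being stale.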
Method
To achieve vLLM V0-V1 parity: set `logprobs-mode=processed_logprobs`, disable prefix caching and async scheduling, match the inflight weight-update behavior (e.g., `mode="keep", clear_cache=False`), and use an fp32 `lm_head` for the final projection.
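The runtime settings above could be collected in one place so they are applied consistently at server launch. The helper below is a sketch under assumptions: `parity_cli_flags` is a hypothetical name, and the flag spellings other than `--logprobs-mode` are assumed rather than taken from the article, so check them against the `--help` output of the installed vLLM version. The weight-update mode and the fp32 `lm_head` are not covered here because the summary does not say how they are configured.

```python
def parity_cli_flags(
    logprobs_mode="processed_logprobs",
    prefix_caching=False,
    async_scheduling=False,
):
    """Build a list of command-line flags for a parity-oriented launch.

    Flag names are assumptions to verify against your vLLM version.
    """
    flags = ["--logprobs-mode", logprobs_mode]
    flags.append(
        "--enable-prefix-caching" if prefix_caching else "--no-enable-prefix-caching"
    )
    flags.append(
        "--async-scheduling" if async_scheduling else "--no-async-scheduling"
    )
    return flags


# Example: flags for the parity configuration described in the text.
flags = parity_cli_flags()
```

Keeping the defaults in a single function makes it harder for a version upgrade to silently reintroduce a changed default.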
In practice
- Set `logprobs-mode=processed_logprobs` so the backend returns logprobs with the semantics the trainer expects.
- Disable prefix caching in online RL setups.
- Use fp32 for `lm_head` in RL inference.
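The effect of projection precision on logprobs can be seen in a toy example. The sketch below is an illustration, not the team's implementation: it uses IEEE half precision (via `struct`) as a stand-in for whatever reduced precision a backend uses, Python's 64-bit floats as a stand-in for fp32, and made-up values for the hidden state and `lm_head` weights.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def project(hidden, weight, cast=lambda v: v):
    """Compute logits = weight @ hidden, applying `cast` after every operation."""
    logits = []
    for row in weight:
        acc = 0.0
        for h, w in zip(hidden, row):
            acc = cast(acc + cast(cast(h) * cast(w)))
        logits.append(acc)
    return logits


# Made-up hidden state and a 3-token vocabulary projection.
hidden = [0.1, -0.3, 0.7, 1.3]
weight = [
    [0.21, -0.47, 0.33, 0.9],
    [-0.11, 0.52, 0.08, -0.6],
    [0.4, 0.05, -0.29, 0.17],
]

full = log_softmax(project(hidden, weight))
low = log_softmax(project(hidden, weight, cast=to_half))
max_gap = max(abs(a - b) for a, b in zip(full, low))
```

The gap is small per token, but it appears on every token of every rollout, which is why matching the numerical path of the final projection between trainer and backend is listed alongside the configuration changes.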
Topics
- vLLM Migration
- Reinforcement Learning
- Train-Inference Mismatch
- Logprob Computation
- Inflight Weight Updates
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.