vLLM V0 to V1: Correctness Before Corrections in RL

2026-05-06 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

ServiceNow AI's PipelineRL team migrated the vLLM inference engine it uses for rollout generation from V0 (version 0.8.5) to V1 (version 0.18.1), establishing backend correctness before making any change to the Reinforcement Learning (RL) objective. The goal was to eliminate train-inference mismatch, in which discrepancies in logprob computation between the inference engine and the trainer alter training dynamics. Initial V1 runs deviated significantly from the V0 reference on trainer-side metrics such as clip rate, KL, entropy, and reward. The team traced the deviation to four causes and addressed each: a mismatch in logprob semantics, V1-specific runtime defaults (prefix caching and async scheduling), the inflight weight-update path, and the precision of the final projection, which was resolved by using an fp32 `lm_head`. With these fixes in place, the V1 run closely matched the V0 trajectory across all critical training metrics, demonstrating backend parity.

Key takeaway

For MLOps engineers managing online RL systems, inference-engine parity is a prerequisite of any backend upgrade. Before attempting objective-level corrections, verify that the backend returns logprobs with the semantics the trainer expects and that its runtime behavior matches the reference. This keeps inference correctness bugs from being confused with genuine off-policy or asynchronous mismatch, and it makes training curves easier to interpret.

Key insights

Prioritize backend inference correctness before modifying RL objectives to resolve train-inference mismatches.

Principles

Separate backend behavior from RL objective changes.
Verify logprob semantics and runtime defaults.
Match numerical paths for critical computations.
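
To make the first two principles concrete, the following minimal sketch shows how a purely semantic difference in reported logprobs can look like policy drift to a trainer. It assumes temperature scaling is the only sampling processor; the logits, temperature, and function names are hypothetical and are not taken from the PipelineRL code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def raw_logprobs(logits):
    # Log-probabilities of the model's unmodified output distribution.
    return log_softmax(logits)

def processed_logprobs(logits, temperature):
    # Log-probabilities of the distribution tokens are actually sampled from,
    # modelled here as temperature scaling only (an assumption).
    return log_softmax([x / temperature for x in logits])

logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical logits for a 4-token vocabulary
temperature = 0.7               # hypothetical sampling temperature
sampled = 1                     # index of the sampled token

raw = raw_logprobs(logits)[sampled]
processed = processed_logprobs(logits, temperature)[sampled]

# Ratio a trainer would see if it recomputed processed logprobs while the
# backend reported raw ones, even though both sides hold identical weights.
spurious_ratio = math.exp(processed - raw)
print(f"raw={raw:.4f} processed={processed:.4f} ratio={spurious_ratio:.4f}")
```

A ratio different from 1 with identical weights on both sides is a backend artifact, not off-policy drift, which is why it should be ruled out before any objective-level correction is considered.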

Method

To reach vLLM V0-V1 parity: set `logprobs-mode=processed_logprobs`, disable prefix caching and async scheduling, match the V0 inflight weight-update behavior (e.g., `mode="keep", clear_cache=False`), and use an fp32 `lm_head` for the final projection.
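
The runtime settings above might be collected in one place so that a parity run is reproducible. The sketch below only builds an argument list and does not import or launch vLLM. The exact flag spellings (`--logprobs-mode`, `--no-enable-prefix-caching`, `--no-async-scheduling`) are assumptions that should be checked against `vllm serve --help` for the installed version; the model name and helper function are hypothetical. The fp32 `lm_head` is deliberately left out because the summary does not say how it was configured.

```python
def parity_serve_args(model: str) -> list[str]:
    """Build a `vllm serve` command line for a V0-parity run (flag names assumed)."""
    return [
        "vllm", "serve", model,
        # Return logprobs after sampling processors have been applied.
        "--logprobs-mode", "processed_logprobs",
        # Turn off V1 runtime defaults that differed from the V0 reference.
        "--no-enable-prefix-caching",
        "--no-async-scheduling",
    ]

# Keyword arguments quoted in the summary for the inflight weight update.
# The call they belong to is not named there, so they are only recorded here.
WEIGHT_UPDATE_KWARGS = {"mode": "keep", "clear_cache": False}

args = parity_serve_args("my-org/my-policy-model")  # hypothetical model name
print(" ".join(args))
```

Keeping the parity settings in a single function makes it easy to diff a V1 run against the V0 reference and to re-enable one default at a time once parity is established.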

In practice

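Set `logprobs-mode=processed_logprobs` so that returned logprobs reflect the distribution after sampling processors such as temperature are applied.
Disable prefix caching and async scheduling in online RL setups.
Use an fp32 `lm_head` for the final projection in RL inference.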
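
To illustrate why the precision of the final projection matters, the sketch below compares logprobs computed from full-precision logits with logprobs computed from the same logits rounded to bfloat16. It is a simplification under stated assumptions: only the output logits are rounded (a real low-precision projection also accumulates in reduced precision), the logits are random, and `to_bf16` is a hypothetical helper, not a vLLM or PipelineRL function.

```python
import math
import random
import struct

def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # Add the rounding bias, then drop the low 16 bits of the float32 pattern.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

random.seed(0)
# Hypothetical logits for a 1000-token vocabulary.
logits_full = [random.gauss(0.0, 4.0) for _ in range(1000)]
logits_bf16 = [to_bf16(x) for x in logits_full]

lp_full = log_softmax(logits_full)
lp_bf16 = log_softmax(logits_bf16)

# Largest per-token logprob difference caused by rounding alone.
max_gap = max(abs(a - b) for a, b in zip(lp_full, lp_bf16))
print(f"max |logprob gap| = {max_gap:.5f}")
```

Any nonzero gap here comes from numerics alone. If the trainer and the inference engine compute the final projection at different precisions, a gap of this kind enters every importance ratio, which is the motivation for matching the numerical path on both sides.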

Topics

vLLM Migration
Reinforcement Learning
Train-Inference Mismatch
Logprob Computation
Inflight Weight Updates

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.