NVIDIA Releases Polar, a Token-Faithful Rollout Framework for GRPO Training Across Codex, Claude Code, and Qwen Code

2026-05-27 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

NVIDIA has released Polar, a token-faithful rollout framework for GRPO training across coding agent harnesses, including Codex, Claude Code, and Qwen Code. Polar treats the agent harness as a black box and intercepts at the model API call boundary, so integrating a harness into a reinforcement learning system requires no harness code changes. It supports the Anthropic Messages, OpenAI Chat Completions, OpenAI Responses, and Google generateContent APIs. Its `prefix_merging` strategy is reported to deliver a 5.39x speedup over the `per_request` strategy and to reach 87.7% GPU utilization. On SWE-Bench Verified, Codex improves from 3.8% to 26.4% (+22.6 pts). Polar also functions as a distributed data generation service: it produced 504 accepted SFT trajectories from 1,638 SWE-Gym attempts in approximately 64 GPU-hours, using Qwen3.5-122B-A10B on 8xH100.
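The figures in the summary can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the derived quantities (acceptance rate, GPU-minutes per accepted trajectory) are not stated in the source and are computed here for illustration.

```python
# Figures quoted in the summary above.
attempts = 1638                  # SWE-Gym attempts
accepted = 504                   # accepted SFT trajectories
gpu_hours = 64.0                 # approximate, as reported
baseline, trained = 3.8, 26.4    # Codex on SWE-Bench Verified, in percent

# Derived quantities (not stated in the source).
acceptance_rate = accepted / attempts
gpu_minutes_per_accepted = gpu_hours * 60 / accepted
gain_pts = round(trained - baseline, 1)

print(f"acceptance rate: {acceptance_rate:.1%}")
print(f"GPU-minutes per accepted trajectory: {gpu_minutes_per_accepted:.1f}")
print(f"gain: +{gain_pts} pts")
```

Roughly 31% of attempts were accepted, at about 7.6 GPU-minutes per accepted trajectory. If all eight GPUs were busy for the whole job, 64 GPU-hours would correspond to about 8 wall-clock hours; the source does not state wall-clock time.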

Key takeaway

For machine learning engineers running reinforcement learning on agentic LLM workloads, Polar removes the need to modify harness code and is reported to raise training throughput. Evaluate its proxy design and `prefix_merging` strategy if you train with GRPO or generate SFT data across several provider APIs and GPU utilization is your bottleneck.

Key insights

Polar simplifies RL system integration by placing a proxy at the model API boundary, so the agent harness itself needs no changes.

Principles

Proxy design streamlines RL agent integration.
Token-faithful reconstruction boosts training efficiency.
Reward attachment to sampled tokens is critical.
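The third principle can be illustrated with a minimal GRPO-style sketch: each rollout's reward is normalized against its own group, and the resulting advantage is attached only to tokens the model sampled, not to prompt or tool-output tokens. The function names, the use of population standard deviation, and the `eps` value are assumptions made for this illustration, not Polar's API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_advantages(advantage, sampled_mask):
    """Broadcast one rollout's advantage to its sampled tokens only."""
    return [advantage if sampled else 0.0 for sampled in sampled_mask]


# Four rollouts of the same task, with binary pass/fail rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# True marks tokens the model generated; False marks prompt or tool output.
mask = [False, False, True, True, False, True]
token_advs = per_token_advantages(advs[0], mask)
```

If the mask is built from re-tokenized text rather than from the token IDs the server actually sampled, the positions can drift, which is why the reward has to be attached to the sampled tokens themselves.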

Method

Polar places a provider-compatible proxy between the agent harness and the inference server. The proxy intercepts model API calls and reconstructs token-faithful trajectories from them for GRPO training, independent of which agent harness issued the calls.
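One plausible reading of trajectory reconstruction with `prefix_merging` is sketched below: when a call's prompt tokens begin with everything recorded so far, it is appended to the same training sequence instead of being stored as a separate sample. The `Call` type and `merge_by_prefix` function are hypothetical names for this illustration, and Polar's actual implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Call:
    prompt_ids: list[int]      # token IDs sent to the model
    completion_ids: list[int]  # token IDs the model sampled


def merge_by_prefix(calls):
    """Merge calls whose prompt extends the previous sequence.

    Returns a list of (ids, mask) pairs, where mask is 1 for sampled tokens.
    """
    merged = []
    for call in calls:
        if merged:
            ids, mask = merged[-1]
            n = len(ids)
            if call.prompt_ids[:n] == ids:
                # New context (e.g. tool output) is unmasked; the completion is masked in.
                extra = call.prompt_ids[n:]
                ids.extend(extra + call.completion_ids)
                mask.extend([0] * len(extra) + [1] * len(call.completion_ids))
                continue
        # No shared prefix: start a new training sequence.
        ids = call.prompt_ids + call.completion_ids
        mask = [0] * len(call.prompt_ids) + [1] * len(call.completion_ids)
        merged.append((ids, mask))
    return merged


calls = [
    Call([1, 2, 3], [4, 5]),
    Call([1, 2, 3, 4, 5, 6, 7], [8]),  # extends the first call
    Call([9, 9], [10]),                # unrelated context
]
sequences = merge_by_prefix(calls)
```

Here three intercepted calls collapse into two training sequences, so the shared prefix is processed once rather than once per request. A `per_request` strategy, by contrast, would presumably keep all three as separate samples.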

In practice

Point your model base URL to the Polar gateway.
Use `prefix_merging` for the reported 5.39x speedup over `per_request`.
Generate SFT trajectories using Polar's data service.
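The first step above can be sketched as building an environment for the harness process in which the provider base URLs point at the gateway. `OPENAI_BASE_URL` and `ANTHROPIC_BASE_URL` are read by the official OpenAI and Anthropic SDKs; whether a particular harness honors them, the gateway address, and the `/v1` route are all assumptions to verify against your own deployment.

```python
import os

# Hypothetical gateway address; substitute the URL of your own deployment.
POLAR_GATEWAY_URL = "http://localhost:8000"


def harness_env(gateway_url, base_env=None):
    """Return a copy of the environment with provider base URLs redirected."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENAI_BASE_URL"] = f"{gateway_url}/v1"  # assumed route on the gateway
    env["ANTHROPIC_BASE_URL"] = gateway_url
    return env


env = harness_env(POLAR_GATEWAY_URL)
# Pass `env` to the harness process, e.g. subprocess.run([...], env=env).
```

Because only the environment changes, the harness binary and its code stay untouched, which matches the black-box approach described in the summary.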

Topics

NVIDIA Polar
GRPO Training
Large Language Models
Reinforcement Learning
SWE-Bench
Code Generation
SFT Trajectories

Code references

NVIDIA-NeMo/ProRL-Agent-Server

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.