VIMPO: Value-Implicit Policy Optimization for LLMs

· Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

VIMPO is a critic-free policy optimization method for improving the reasoning abilities of large language models (LLMs) through reinforcement learning with verifiable rewards (RLVR). It targets a trade-off in existing techniques: GRPO is simple but assigns only trajectory-level advantages, while actor-critic methods offer finer credit assignment but depend on a learned value function that can be unstable. VIMPO derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning, which yields a simple value loss that incorporates outcome-level verifiable rewards without a separate critic. The same derivation provides a critic-free actor advantage, which keeps reward incorporation separate from PPO-style policy improvement. On mathematical RLVR benchmarks (MATH-500, AIME 2024, AIME 2025, and OlympiadBench), VIMPO improves over GRPO, with larger gains on competition-style tasks and results that remain consistent under noisy rewards.
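To make the trajectory-level credit assignment attributed to GRPO concrete, here is a minimal sketch of group-relative advantages. It is not code from the VIMPO work: the function name `grpo_advantages`, the use of a population standard deviation, and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, token_counts, eps=1e-6):
    """Group-relative advantages for responses sampled from one prompt.

    rewards: one scalar verifiable reward per sampled response.
    token_counts: number of generated tokens in each response.
    Returns one list of per-token advantages per response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    per_response = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a response receives the same scalar advantage,
    # which is the coarse, trajectory-level credit assignment.
    return [[a] * n for a, n in zip(per_response, token_counts)]


# Hypothetical group of four sampled answers with 0/1 verifier rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 2, 4, 1]
advantages = grpo_advantages(rewards, token_counts)
```

Each inner list holds a single repeated value, so a correct final answer rewards every token equally, including tokens that contributed nothing to it.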

Key takeaway

For machine learning engineers fine-tuning LLMs on complex reasoning tasks, VIMPO is an alternative to GRPO and actor-critic methods that is worth evaluating. It is most relevant when a project needs finer credit assignment from verifiable rewards but runs into actor-critic training instability or GRPO's coarse trajectory-level advantages. Its critic-free, policy-implied value function offers a simpler and potentially more stable route to better performance on mathematical RLVR benchmarks, including under noisy reward conditions.

Key insights

VIMPO offers critic-free policy optimization for LLMs by deriving a policy-implied value function, improving credit assignment and performance on verifiable reward tasks.

Principles

Method

VIMPO derives a policy-implied value function from the optimality conditions of KL-regularized RL. A value recurrence built from policy-reference log-ratios, together with a terminal condition, defines the value function, and a PPO-style update then improves the actor.
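The idea of a value recurrence driven by policy-reference log-ratios can be illustrated with the standard soft-optimality identity for KL-regularized RL. Under deterministic token transitions and no discounting, it reads V(s_t) = r_t + V(s_{t+1}) - beta * log(pi(a_t|s_t) / pi_ref(a_t|s_t)). The sketch below is an assumption-laden illustration of that textbook identity, not VIMPO's actual recurrence or loss: the function name `implied_values`, the zero value after the last token, and the single outcome reward at the final token are all hypothetical choices.

```python
def implied_values(logp_policy, logp_ref, terminal_reward, beta):
    """Backward recurrence V_t = V_{t+1} + r_t - beta * log(pi / pi_ref).

    logp_policy, logp_ref: per-token log-probabilities of the sampled
    tokens under the policy and the reference model.
    Assumes deterministic transitions, no discounting, and one outcome
    reward paid at the final token (illustrative simplifications).
    Returns V_0..V_T, where V_T = 0 is the assumed terminal condition.
    """
    T = len(logp_policy)
    values = [0.0] * (T + 1)
    for t in range(T - 1, -1, -1):
        reward = terminal_reward if t == T - 1 else 0.0
        log_ratio = logp_policy[t] - logp_ref[t]
        values[t] = values[t + 1] + reward - beta * log_ratio
    return values


# Hypothetical three-token response with a verifier reward of 1.0.
logp_policy = [-0.5, -1.0, -0.2]
logp_ref = [-0.7, -1.0, -0.6]
values = implied_values(logp_policy, logp_ref, terminal_reward=1.0, beta=0.1)
```

In this sketch the value at each position comes entirely from the two models' log-probabilities and the outcome reward, so no separate critic network is trained. How VIMPO turns such a quantity into its value loss and actor advantage is specified in the original article, not here.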

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.