VIMPO: Value-Implicit Policy Optimization for LLMs
Summary
VIMPO is a critic-free policy optimization method for improving the reasoning of large language models (LLMs) through reinforcement learning with verifiable rewards (RLVR). It targets the trade-off between simplicity and effective credit assignment in existing techniques: GRPO is simple but assigns a single trajectory-level advantage, while actor-critic methods assign credit more finely but require a learned value function that can be unstable. VIMPO derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning, which yields a simple value loss that incorporates outcome-level verifiable rewards without a separate critic. It also provides a critic-free actor advantage, so reward incorporation and PPO-style policy improvement are handled as distinct steps. VIMPO outperforms GRPO across mathematical RLVR benchmarks, including MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with larger gains on competition-style tasks, and it remains consistent under noisy rewards.
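To make the "trajectory-level advantage" contrast concrete, here is a minimal sketch of the group-normalized advantage commonly associated with GRPO. It is an illustration, not code from the VIMPO paper: the function names, the `eps` constant, and the choice of population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantage: one scalar per sampled response.

    `rewards` holds the outcome reward of each response sampled for the
    same prompt. The population standard deviation and `eps` are
    illustrative choices, not taken from the source.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage: float, num_tokens: int) -> List[float]:
    """Every token in a response receives the same advantage."""
    return [advantage] * num_tokens


# Four responses to one prompt, scored by a verifier as correct (1) or not (0).
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(group_rewards)
token_advantages = broadcast_to_tokens(advantages[0], num_tokens=3)
```

Because the same scalar is copied to every token, a correct final answer rewards every step of the response equally, which is the coarse credit assignment that the summary attributes to GRPO.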
Key takeaway
For machine learning engineers fine-tuning LLMs on complex reasoning tasks, VIMPO is a compelling alternative to existing reinforcement learning approaches. If your projects need better credit assignment with verifiable rewards but run into actor-critic training instability or GRPO's coarse trajectory-level advantages, VIMPO is worth evaluating. Its critic-free, policy-implied value function offers a simpler, more stable path to better performance on mathematical RLVR benchmarks, even under noisy rewards.
Key insights
VIMPO offers critic-free policy optimization for LLMs by deriving a policy-implied value function, improving credit assignment and performance on verifiable reward tasks.
Principles
- Policy-implied values can replace learned critics.
- KL-regularized RL optimality yields value recurrence.
- Reward incorporation can be separated from policy updates.
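The second principle can be illustrated with the standard optimality conditions of KL-regularized RL. This is a textbook-style sketch and not necessarily the exact formulation used in the VIMPO paper. The symbols $\beta$ (KL coefficient), $\pi_{\mathrm{ref}}$ (reference policy), $r$, $Q^*$, and $V^*$, as well as the assumption of deterministic token-level transitions, are introduced here for illustration.

```latex
\begin{align}
\pi^*(a_t \mid s_t) &= \pi_{\mathrm{ref}}(a_t \mid s_t)\,
  \exp\!\left(\frac{Q^*(s_t, a_t) - V^*(s_t)}{\beta}\right), \\
Q^*(s_t, a_t) &= r(s_t, a_t) + V^*(s_{t+1}), \\
V^*(s_t) &= r(s_t, a_t) + V^*(s_{t+1})
  - \beta \log \frac{\pi^*(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}.
\end{align}
```

The third line follows by taking the logarithm of the first and substituting the second. It expresses the value at each step through the policy-reference log-ratio alone, which is why a separately learned critic is not needed under these assumptions.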
Method
VIMPO derives a policy-implied value function from KL-regularized RL optimality, using a value recurrence based on policy-reference log-ratios and a terminal condition, then applies a PPO-style actor update.
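The recurrence described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes zero intermediate rewards, a terminal condition that sets the final value to the outcome reward, and a KL coefficient `beta`. The function name and arguments are hypothetical.

```python
from typing import List


def policy_implied_values(
    logp_policy: List[float],
    logp_ref: List[float],
    beta: float,
    outcome_reward: float,
) -> List[float]:
    """Backward recurrence V_t = V_{t+1} - beta * log(pi / pi_ref)(a_t | s_t).

    Assumptions (not confirmed by the source): intermediate rewards are
    zero and the terminal value V_T equals the verifiable outcome reward.
    Returns T + 1 values, one per state including the terminal state.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-probability sequences must have equal length")

    num_tokens = len(logp_policy)
    values = [0.0] * (num_tokens + 1)
    values[num_tokens] = outcome_reward  # assumed terminal condition

    # Walk backward from the last token to the first.
    for t in range(num_tokens - 1, -1, -1):
        log_ratio = logp_policy[t] - logp_ref[t]
        values[t] = values[t + 1] - beta * log_ratio
    return values


# Two-token response where the policy deviates from the reference on token 0.
values = policy_implied_values(
    logp_policy=[-1.0, -0.5],
    logp_ref=[-1.5, -0.5],
    beta=0.1,
    outcome_reward=1.0,
)
```

In this sketch the values come entirely from per-token log-probabilities that the policy and reference models already produce, so no extra value network is trained. How VIMPO turns such values into its value loss and actor advantage is not specified in this summary.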
In practice
- Improve LLM reasoning on math benchmarks.
- Enhance credit assignment in RLVR tasks.
- Apply critic-free methods for stable training.
Topics
- Large Language Models
- Reinforcement Learning
- Policy Optimization
- Critic-Free RL
- Credit Assignment
- Mathematical Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.