ESPO: Early-Stopping Proximal Policy Optimization

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ESPO (Early-Stopping Proximal Policy Optimization) is an algorithm that improves reinforcement learning for large language models by detecting and terminating failed reasoning trajectories early. Standard RL methods waste compute and introduce noise by forcing LLMs to complete full trajectories even after an early error. ESPO addresses this by computing a surrogate regret from the logits already produced at each generation step, and terminating a rollout when its smoothed cumulative regret significantly exceeds estimates. Truncated trajectories are treated as absorbing failure states with a terminal reward, which concentrates negative temporal-difference errors. Tested on DeepSeek-R1-Distill-Qwen-7B for mathematical reasoning, ESPO outperformed PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving over 20% of rollout tokens.
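The effect of treating a truncated rollout as an absorbing failure state can be illustrated with one-step temporal-difference errors. The sketch below is not taken from the original article: the helper names, the failure reward of -1.0, the discount of 1.0, and the value estimates are all assumptions chosen for illustration.

```python
def td_errors(values, rewards, gamma=1.0):
    """One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    The state after the last kept step is absorbing, so its value is 0.
    """
    deltas = []
    for t in range(len(rewards)):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        deltas.append(rewards[t] + gamma * next_v - values[t])
    return deltas


def truncate_as_failure(values, stop_t, failure_reward=-1.0):
    """Keep the first stop_t steps and end them in an absorbing failure state.

    failure_reward is a hypothetical terminal reward; the source does not
    state the value ESPO uses.
    """
    kept_values = values[:stop_t]
    rewards = [0.0] * (stop_t - 1) + [failure_reward]
    return kept_values, rewards


# Hypothetical critic estimates for a 4-step rollout stopped after step 2.
values = [0.5, 0.5, 0.4, 0.3]
kept_values, rewards = truncate_as_failure(values, stop_t=2)
deltas = td_errors(kept_values, rewards)
```

Under these assumptions the only nonzero TD error falls on the step where the rollout was cut, rather than being spread over a long tail of tokens generated after the mistake.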

Key takeaway

For machine learning engineers training large language models with reinforcement learning, ESPO reports both efficiency and accuracy gains over standard PPO. On DeepSeek-R1-Distill-Qwen-7B, terminating failed reasoning trajectories early saved over 20% of rollout tokens and raised scores on AIME 2024, AMC 2023, and MATH-500 by roughly 1 to 3 percentage points. Consider integrating an early-stopping mechanism like ESPO into your RL pipelines to improve training efficiency and resource utilization.

Key insights

ESPO improves LLM reinforcement learning by early-stopping failed trajectories based on real-time regret estimation.

Principles

Method

ESPO computes a surrogate regret from the logits at each generation step, terminates a rollout when its smoothed cumulative regret exceeds estimates, and assigns a terminal reward to truncated trajectories.
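A minimal sketch of such a stopping rule follows. The summary does not give ESPO's exact regret definition, smoothing scheme, or threshold, so everything here is an assumption: the surrogate regret is taken to be the log-probability gap between the most likely token and the sampled token, the smoothing is an exponential moving average of the running sum, and `baseline_per_step`, `margin`, `beta`, and `min_steps` are hypothetical parameters.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def surrogate_regret(logits, sampled_id):
    """Assumed proxy: log-prob gap between the top token and the sampled one."""
    logp = log_softmax(logits)
    return max(logp) - logp[sampled_id]


def rollout_with_early_stop(step_logits, sampled_ids, baseline_per_step,
                            margin=2.0, beta=0.9, min_steps=2):
    """Return (tokens_kept, truncated) for one rollout.

    Stops once the smoothed cumulative regret exceeds margin times the
    expected cumulative regret (baseline_per_step * t). All thresholds
    are illustrative, not values from the paper.
    """
    cumulative = 0.0
    smoothed = 0.0
    for t, (logits, tok) in enumerate(zip(step_logits, sampled_ids), start=1):
        # Reuses logits that generation already produced; no extra forward pass.
        cumulative += surrogate_regret(logits, tok)
        smoothed = beta * smoothed + (1.0 - beta) * cumulative
        expected = baseline_per_step * t
        if t >= min_steps and smoothed > margin * expected:
            return t, True
    return len(sampled_ids), False


# Toy 3-token vocabulary with the same logits at every step.
logits = [2.0, 0.0, -1.0]
greedy = rollout_with_early_stop([logits] * 5, [0] * 5, baseline_per_step=0.1)
unlikely = rollout_with_early_stop([logits] * 5, [2] * 5, baseline_per_step=0.1)
```

In this toy setup a rollout that always samples the top token accumulates no regret and runs to completion, while one that keeps sampling the least likely token is cut short. In a real pipeline the check would sit inside the generation loop, and a truncated rollout would then be given the terminal reward described above.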

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.