From GRPO to SAMPO: Solving Training Collapse in Agentic RL

2026-03-02 · Source: Discover AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Advanced, long

Summary

Researchers from the University of California and University of Wisconsin have introduced SAMPO, a policy optimization method designed to address training instability and collapse in agentic reinforcement learning (RL) for large language models (LLMs) operating in multi-turn environments. The instability stems from invalid actions, sparse rewards, long-horizon credit assignment, and non-stationary dynamics. The team built a benchmark to analyze existing policy optimization algorithms along four dimensions: loss aggregation, importance sampling clipping, trajectory filtering/resampling, and advantage design. By tuning each dimension systematically and eliminating the failure modes that analysis exposed, SAMPO achieves higher performance and more stable training than prior methods such as GRPO. In one reported example, it raises the success rate of a locally run 4B model from 51% to 92% on certain tasks.

Key takeaway

For research scientists developing or deploying agentic LLMs in multi-turn environments, SAMPO offers a way to address common training instability issues. Consider adopting its principles, particularly its clipping and advantage design choices, to reach higher success rates and more stable learning, even with smaller, locally runnable models. The approach can also change how an agent decides and explores, lowering decision entropy and reducing exploration inefficiency.

Key insights

SAMPO stabilizes agentic RL training by tuning four dimensions of policy optimization, preventing the gradient explosion that leads to training collapse.

Principles

Multi-turn agent-environment interactions cause RL instability.
Unconstrained optimization leads to gradient explosion.
Systematic dimension-wise optimization improves RL stability.

Method

SAMPO tunes loss aggregation, importance sampling clipping, trajectory filtering/resampling, and advantage design, and combines the resulting choices into a unified, stable agentic RL framework. The choices are derived from a benchmark analysis of existing methods.
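
To make the loss aggregation dimension concrete, here is a minimal Python sketch contrasting two common aggregation choices. It is an illustration only: the function name `aggregate_loss`, the mode labels, and the toy numbers are assumptions, and the summary does not state which aggregation SAMPO selects.

```python
# Hypothetical helper, not taken from the SAMPO work.
def aggregate_loss(token_losses, mode="token_mean"):
    """token_losses: list of per-trajectory lists of per-token losses."""
    if mode == "token_mean":
        # Every token in the batch gets equal weight,
        # so long trajectories dominate the result.
        flat = [x for traj in token_losses for x in traj]
        return sum(flat) / len(flat)
    if mode == "seq_mean":
        # Average inside each trajectory first, then across trajectories,
        # so every trajectory gets equal weight regardless of length.
        per_traj = [sum(traj) / len(traj) for traj in token_losses]
        return sum(per_traj) / len(per_traj)
    raise ValueError(f"unknown mode: {mode}")


# Toy batch: one short trajectory and one long trajectory.
losses = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]]
token_mean = aggregate_loss(losses, "token_mean")  # (2*1 + 6*4) / 8
seq_mean = aggregate_loss(losses, "seq_mean")      # (1 + 4) / 2
print(token_mean, seq_mean)
```

The two modes give different values on the same batch, which is why the aggregation choice can change how strongly long multi-turn trajectories drive each update.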

In practice

Use SAMPO for stable agentic LLM training.
Apply sequence-level clipping to the W term (the importance sampling weight).
Filter out trajectory groups whose advantage vectors are all zero, since they contribute no learning signal.
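
The last two practices can be sketched in plain Python. This is a hedged illustration, not SAMPO's implementation: it assumes GRPO-style group-normalized advantages, uses the geometric mean of per-token ratios as one common sequence-level importance weight, and applies a PPO-style clip with an assumed `clip_eps` of 0.2. All function names are hypothetical.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def keep_group(rewards, tol=1e-8):
    """False when all rewards match, i.e. every advantage would be zero."""
    return max(rewards) - min(rewards) > tol


def sequence_ratio(new_logps, old_logps):
    """One weight per trajectory: exp of the mean per-token log-prob gap."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate applied to the sequence-level ratio."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Toy data: the first group has identical rewards and is filtered out.
groups = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
kept = [g for g in groups if keep_group(g)]
adv = group_advantages(kept[0])

# The new policy assigns higher log-probs than the old one, so ratio > 1.
ratio = sequence_ratio([-0.5, -0.5], [-1.0, -1.0])
objective = clipped_objective(ratio, adv[1])
print(kept, adv, ratio, objective)
```

In this toy run the uniform-reward group is dropped before any gradient would be computed, and the trajectory-level ratio of about 1.65 is capped at 1.2 for the positively rewarded trajectory, so a single large ratio cannot inflate the update.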

Topics

Agentic Reinforcement Learning
Policy Optimization Algorithms
LLM Training Stability
Multi-turn Interaction
Importance Sampling Clipping

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.