From GRPO to SAMPO: Solving Training Collapse in Agentic RL
Summary
Researchers from the University of California and University of Wisconsin have introduced SAMPO, a policy optimization method designed to address training instability and collapse in agentic reinforcement learning (RL) for large language models (LLMs) operating in multi-turn environments. The instability stems from invalid actions, sparse rewards, long-horizon credit assignment, and non-stationary dynamics. The team built a benchmark to analyze existing policy optimization algorithms along four dimensions: loss aggregation, importance sampling clipping, trajectory filtering/resampling, and advantage design. By tuning each dimension and eliminating the failure modes found in that analysis, SAMPO achieves higher performance and more stable training than prior methods such as GRPO. In one reported example, the success rate of a locally run 4B model rises from 51% to 92% on certain tasks.
Key takeaway
For research scientists developing or deploying agentic LLMs in multi-turn environments, SAMPO offers a robust solution to common training instability issues. You should consider integrating SAMPO's principles, particularly its optimized clipping and advantage functions, to achieve significantly higher success rates and more stable learning, even with smaller, locally runnable models. This approach can transform an agent's decision-making and exploration patterns, reducing decision entropy and solving exploration inefficiency.
Key insights
SAMPO stabilizes agentic RL training by optimizing four policy optimization dimensions to prevent gradient explosion and the training collapse that follows.
Principles
- Multi-turn agent-environment interactions cause RL instability.
- Unconstrained optimization leads to gradient explosion.
- Systematic dimension-wise optimization improves RL stability.
Method
SAMPO optimizes loss aggregation, importance sampling clipping, trajectory filtering/resampling, and advantage design to create a unified, stable agentic RL framework, derived from a benchmark analysis of existing methods.
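To make the importance sampling clipping dimension concrete, here is a minimal Python sketch of a PPO-style clipped objective computed once per sequence rather than once per token. It is an illustration under stated assumptions, not SAMPO's actual formulation: the function names, the length-normalized ratio, and the default `eps=0.2` are all hypothetical choices made for this example.

```python
import math


def sequence_ratio(new_logprobs, old_logprobs):
    """Sequence-level importance ratio between the new and old policy.

    Assumption: the ratio is length-normalized (the geometric mean of the
    per-token ratios), which keeps long trajectories from producing
    extreme values. The source does not specify this choice.
    """
    if not new_logprobs or len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-prob lists must be non-empty and equal length")
    log_ratios = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(log_ratios) / len(log_ratios))


def clipped_sequence_objective(new_logprobs, old_logprobs, advantage, eps=0.2):
    """PPO-style clipped surrogate applied to one whole sequence.

    The ratio is clipped to [1 - eps, 1 + eps], and the pessimistic
    (smaller) of the clipped and unclipped terms is returned.
    """
    ratio = sequence_ratio(new_logprobs, old_logprobs)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the clip caps how much one trajectory can push the policy in a single update; with a negative advantage, the unclipped term is kept, as in standard PPO.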
In practice
- Use SAMPO for stable agentic LLM training.
- Apply sequence-level clipping to the importance sampling weight (the W term).
- Filter trajectories to avoid zero advantage vectors.
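The trajectory filtering advice above can be sketched as follows, assuming GRPO-style group-relative advantages in which each trajectory's reward is normalized against the other rollouts for the same task. When every rollout in a group earns the same reward, every advantage is zero and the group contributes no learning signal. The function names, the `tol` threshold, and the dictionary layout are hypothetical, and SAMPO's actual filtering and resampling rule may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_groups(groups, tol=1e-8):
    """Drop groups whose rewards are all (nearly) identical.

    `groups` maps a task id to the list of rewards from its rollouts.
    A group with no reward spread would yield an all-zero advantage
    vector, so it is removed before the policy update.
    """
    return {
        task: rewards
        for task, rewards in groups.items()
        if max(rewards) - min(rewards) > tol
    }


# Hypothetical rollouts: only the mixed-outcome task carries a signal.
rollouts = {
    "all_fail": [0.0, 0.0, 0.0, 0.0],
    "mixed": [1.0, 0.0, 0.0, 1.0],
    "all_pass": [1.0, 1.0, 1.0, 1.0],
}
kept = filter_groups(rollouts)
```

Here `kept` retains only the `"mixed"` group. In a real training loop, the dropped groups could be replaced by resampling new rollouts, which is one possible reading of the resampling half of this dimension.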
Topics
- Agentic Reinforcement Learning
- Policy Optimization Algorithms
- LLM Training Stability
- Multi-turn Interaction
- Importance Sampling Clipping
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Discover AI.