ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information
Summary
ASymPO (Asymmetric-Scale Policy Optimization) is a method for asynchronous large language model (LLM) post-training that does not require behavior information. Asynchronous reinforcement learning improves throughput by decoupling response generation from policy optimization, but the responses it trains on can be stale, which causes distribution drift. Existing behavior-corrected methods mitigate this drift, but they require behavior log-probabilities that are token-aligned, versioned, and numerically consistent. ASymPO instead stabilizes asynchronous group-relative RL using only current-policy probabilities. It identifies a scale-imbalance failure mode: when stale responses are evaluated under the current policy, the positive and negative loss terms sit at different negative log-probability scales. To correct this, ASymPO normalizes each response's token loss by that response's current average token negative log-probability, which restores response-level zero-sum balance and preserves a nonzero learning signal without behavior-policy probabilities. It was evaluated alongside Scaled Policy Optimization (SPO) in asynchronous mathematical reasoning post-training.
Key takeaway
For machine learning engineers who use asynchronous reinforcement learning to raise LLM post-training throughput, ASymPO removes a piece of pipeline complexity. It stabilizes asynchronous group-relative RL and mitigates distribution drift using only current-policy probabilities, so there is no need to store and align token-level, versioned behavior log-probabilities. Consider ASymPO when you want a simpler asynchronous training pipeline, particularly for tasks like mathematical reasoning, that still preserves a nonzero learning signal without behavior information.
Key insights
ASymPO stabilizes asynchronous LLM post-training by normalizing each response's token loss by its average token negative log-probability under the current policy, eliminating the need for behavior information.
Principles
- Asynchronous RL can suffer from scale imbalance caused by stale responses.
- Current-policy probabilities alone can stabilize group-relative RL.
- Normalizing each response's token loss restores response-level zero-sum balance.
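The principles above can be illustrated with a small numeric sketch. It is not the paper's implementation: the function names, the reward values, the log-probabilities, and the `eps` guard are all hypothetical, and the advantage formula (center and scale rewards within a group) is an assumed form of group-relative RL. The sketch only shows that zero-sum advantages produce loss terms that no longer cancel when responses sit at different negative log-probability scales, and that dividing each term by the response's average token negative log-probability restores the cancellation.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Assumed group-relative advantages: center and scale rewards in one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def response_loss_terms(advantages, token_logprobs, normalize, eps=1e-8):
    """Per-response loss values under the current policy.

    token_logprobs[i] holds the current-policy log-probabilities of the
    tokens in response i (hypothetical numbers in the example below).
    """
    terms = []
    for adv, logps in zip(advantages, token_logprobs):
        avg_nll = -mean(logps)  # average token negative log-probability
        term = adv * avg_nll    # unnormalized response-level term
        if normalize:
            # Assumption: in an autograd framework this denominator would be
            # treated as a constant, so only the numerator carries gradient.
            term = term / max(avg_nll, eps)
        terms.append(term)
    return terms


# Hypothetical group of four responses: two rewarded, two not.
rewards = [1.0, 0.0, 1.0, 0.0]
# Hypothetical current-policy token log-probs; responses 2 and 4 are "stale"
# in the sense that the current policy now assigns them much lower probability.
token_logprobs = [[-0.2, -0.4], [-1.5, -2.5], [-0.3, -0.3], [-2.0, -2.0]]

advantages = group_advantages(rewards)
raw_terms = response_loss_terms(advantages, token_logprobs, normalize=False)
norm_terms = response_loss_terms(advantages, token_logprobs, normalize=True)

print("advantages sum:      ", sum(advantages))
print("unnormalized sum:    ", sum(raw_terms))
print("normalized terms sum:", sum(norm_terms))
```

In this toy group the advantages cancel, but the unnormalized terms are dominated by the low-probability responses; after normalization each term's value equals its advantage, so the group sums to zero again.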
Method
ASymPO normalizes each response's token loss by its current average token negative log-probability. This restores response-level zero-sum balance and preserves a nonzero learning signal in asynchronous LLM post-training, without requiring behavior-policy probabilities.
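One way to write this down, as a sketch rather than the paper's exact objective: the symbols below (group size $G$, advantages $A_i$, response length $T_i$, and the stop-gradient operator $\operatorname{sg}$) are assumptions introduced here for illustration, since the summary does not give the formula.

```latex
% Assumed notation: A_i are group-relative advantages with \sum_i A_i = 0,
% and \bar{\ell}_i is the current-policy average token negative log-probability.
\bar{\ell}_i(\theta) = -\frac{1}{T_i} \sum_{t=1}^{T_i}
    \log \pi_\theta\!\left(y_{i,t} \mid x, y_{i,<t}\right)

% Unnormalized group loss: the terms cancel only if all \bar{\ell}_i are equal.
\mathcal{L}_{\text{raw}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} A_i \, \bar{\ell}_i(\theta)

% Normalized loss (sg = stop-gradient, an assumption of this sketch):
\mathcal{L}_{\text{norm}}(\theta) = \frac{1}{G} \sum_{i=1}^{G}
    A_i \, \frac{\bar{\ell}_i(\theta)}{\operatorname{sg}\!\left[\bar{\ell}_i(\theta)\right]}

% Its value is zero-sum, while its gradient is generally nonzero:
\mathcal{L}_{\text{norm}}(\theta) = \frac{1}{G} \sum_{i=1}^{G} A_i = 0,
\qquad
\nabla_\theta \mathcal{L}_{\text{norm}}(\theta) = \frac{1}{G} \sum_{i=1}^{G}
    \frac{A_i}{\bar{\ell}_i(\theta)} \, \nabla_\theta \bar{\ell}_i(\theta)
```

Under these assumptions, every response contributes at the same scale regardless of how stale it is, and no behavior-policy probability appears anywhere in the expression.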
In practice
- Apply ASymPO for asynchronous LLM fine-tuning.
- Use current-policy probabilities to control drift.
- Evaluate on mathematical reasoning tasks.
Topics
- Asymmetric-Scale Policy Optimization
- Large Language Models
- Reinforcement Learning
- Asynchronous Training
- Policy Optimization
- Distribution Drift
- Mathematical Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.