ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information
Summary
ASymPO (Asymmetric-Scale Policy Optimization) addresses distribution drift in asynchronous reinforcement learning for large language model (LLM) post-training. When responses are stale, this drift creates a scale-imbalance failure mode: positive and negative loss terms sit at different negative log-probability scales, which destabilizes training. Standard behavior-corrected methods require infrastructure such as token-aligned behavior log-probabilities and policy versioning. ASymPO instead normalizes each response's token loss by its current average token negative log-probability, which restores response-level zero-sum balance and keeps a nonzero learning signal using only current-policy probabilities. The paper also introduces Scaled Policy Optimization (SPO) as a fixed negative-scaling baseline. In asynchronous mathematical-reasoning post-training on Qwen3-1.7B-Base, Qwen3-4B-Base, and LLaMA-3.2-3B-Instruct, ASymPO and SPO trained stably, whereas the naive loss and GPG collapsed. Because it needs neither behavior log-probability transport nor policy-version metadata, ASymPO simplifies the rollout–learner interface.
Key takeaway
Machine learning engineers designing asynchronous LLM post-training pipelines should consider ASymPO. It stabilizes training by adaptively balancing loss contributions without behavior-policy probabilities or policy-version bookkeeping. Adopting it simplifies the rollout–learner interface, reduces infrastructure overhead, and avoids the training collapses observed with naive current-policy objectives.
Key insights
Asynchronous LLM post-training can be stabilized by adaptively scaling the current-policy loss, which removes the dependency on behavior-policy information.
Principles
- Asynchronous RL without behavior correction risks scale imbalance between positive and negative loss terms.
- Zero-sum advantages alone do not guarantee balanced loss.
- Normalizing loss by current log-probability scale restores balance.
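
The three principles above can be written out as a short derivation. This is a sketch under assumed notation, none of which comes from the source: $A_i$ is the advantage of response $i$, $\bar{\ell}_i$ is its mean token negative log-probability under the current policy, and $\operatorname{sg}(\cdot)$ is a stop-gradient. The paper's exact objective may differ in detail.

```latex
% Assumed notation (not from the source): A_i = advantage of response i,
% \bar{\ell}_i = mean token negative log-probability under the current policy,
% sg(.) = stop-gradient operator.
\begin{align*}
\sum_i A_i &= 0
  && \text{(zero-sum advantages)} \\
\sum_i A_i \,\bar{\ell}_i &\neq 0 \ \text{in general}
  && \text{(naive loss; for } A = (+1, -1) \text{ it equals } \bar{\ell}_1 - \bar{\ell}_2\text{)} \\
\sum_i A_i \,\frac{\bar{\ell}_i}{\operatorname{sg}(\bar{\ell}_i)} &= \sum_i A_i = 0
  && \text{(normalized loss value is balanced)} \\
\nabla_\theta \sum_i A_i \,\frac{\bar{\ell}_i}{\operatorname{sg}(\bar{\ell}_i)}
  &= \sum_i \frac{A_i}{\bar{\ell}_i}\,\nabla_\theta \bar{\ell}_i
  && \text{(gradient is generally nonzero)}
\end{align*}
```

The second line is where drift matters: if stale negative responses have a much larger $\bar{\ell}_i$ than fresh positive ones, the naive sum is dominated by the negative terms even though the advantages cancel.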
Method
ASymPO normalizes each response's token loss by its current average token negative log-probability, using a stop-gradient operator to balance loss contributions.
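
The normalization described above can be illustrated with a minimal, framework-free sketch. It is one reading of the summary, not the paper's implementation: the function names, the `eps` guard, and the toy numbers are all assumptions. In an autograd framework the `scale` value would be detached (stop-gradient) so that it acts as a constant.

```python
def mean_nll(token_logprobs):
    """Average token negative log-probability under the current policy."""
    return -sum(token_logprobs) / len(token_logprobs)


def naive_loss_terms(advantages, logprobs_per_response):
    """Naive current-policy objective: advantage times mean token NLL."""
    return [a * mean_nll(lp) for a, lp in zip(advantages, logprobs_per_response)]


def normalized_loss_terms(advantages, logprobs_per_response, eps=1e-8):
    """Each response's loss divided by its own mean NLL (assumed form).

    `scale` stands in for a stop-gradient value: it is a plain number here,
    and would be detached from the graph in an autograd framework.
    """
    terms = []
    for a, lp in zip(advantages, logprobs_per_response):
        nll = mean_nll(lp)
        scale = nll + eps  # hypothetical eps guard against division by zero
        terms.append(a * nll / scale)
    return terms


def token_grad_coeff(advantage, token_logprobs, eps=1e-8):
    """d(normalized loss)/d(log-prob of one token), with scale held constant."""
    scale = mean_nll(token_logprobs) + eps
    return -advantage / (len(token_logprobs) * scale)


# Toy batch (made-up numbers): a confident positive response and a stale,
# low-probability negative response, with zero-sum advantages.
advantages = [1.0, -1.0]
logprobs = [[-0.1, -0.2, -0.3], [-2.0, -3.0]]

naive_sum = sum(naive_loss_terms(advantages, logprobs))       # about -2.3
balanced_sum = sum(normalized_loss_terms(advantages, logprobs))  # about 0
```

In this toy batch the naive terms sum to roughly -2.3 even though the advantages cancel, while the normalized terms sum to roughly zero. `token_grad_coeff` shows that each response still contributes a nonzero per-token gradient after normalization.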
In practice
- Implement ASymPO to simplify the rollout–learner interface.
- Eliminate behavior log-probability and policy-version transport.
- Apply current-policy-only objectives for LLM post-training.
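
To make the interface simplification concrete, here is a hypothetical sketch of what a rollout worker might send to the learner in each setup. The class and field names are illustrative assumptions, not taken from the paper or from any specific framework.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class BehaviorCorrectedRollout:
    """Hypothetical payload for a behavior-corrected objective."""
    prompt_ids: List[int]
    response_ids: List[int]
    reward: float
    behavior_logprobs: List[float]  # one per response token, from the sampling policy
    policy_version: int             # which policy snapshot produced the response


@dataclass
class CurrentPolicyOnlyRollout:
    """Hypothetical payload when the objective uses only current-policy probabilities."""
    prompt_ids: List[int]
    response_ids: List[int]
    reward: float


def extra_fields(rich, lean):
    """Names of fields present in `rich` but absent from `lean`."""
    lean_names = {f.name for f in fields(lean)}
    return [f.name for f in fields(rich) if f.name not in lean_names]


# Fields the learner no longer needs to receive in the current-policy-only setup.
dropped = extra_fields(BehaviorCorrectedRollout, CurrentPolicyOnlyRollout)
```

Under these assumed payloads, `dropped` contains `behavior_logprobs` and `policy_version`, which are the two pieces of metadata the summary says ASymPO avoids transporting.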
Topics
- ASymPO
- Asynchronous RL
- LLM Post-Training
- Policy Optimization
- Distribution Drift
- Mathematical Reasoning
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.