ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, long

Summary

ASymPO (Asymmetric-Scale Policy Optimization) addresses distribution drift in asynchronous reinforcement learning for large language model (LLM) post-training. The drift comes from stale responses, and it produces a scale-imbalance failure mode: positive and negative loss terms sit at different negative log-probability scales, which destabilizes training. Standard behavior-corrected methods require extra infrastructure, such as token-aligned behavior log-probabilities and policy versioning. ASymPO instead normalizes each response's token loss by that response's current average token negative log-probability. This restores response-level zero-sum balance and keeps a nonzero learning signal while using only current-policy probabilities. The paper also introduces Scaled Policy Optimization (SPO) as a fixed negative-scaling baseline. In asynchronous mathematical-reasoning post-training on Qwen3-1.7B-Base, Qwen3-4B-Base, and LLaMA-3.2-3B-Instruct, ASymPO and SPO trained stably, whereas the naive loss and GPG collapsed. Because it needs neither behavior log-probability transport nor policy-version metadata, ASymPO simplifies the rollout–learner interface.

Key takeaway

If you are a machine learning engineer designing an asynchronous LLM post-training pipeline, consider ASymPO. It stabilizes training by adaptively balancing loss contributions, without behavior-policy probabilities or policy-version bookkeeping. That simplifies the rollout–learner interface, reduces infrastructure overhead, and avoids the training collapse observed with naive current-policy objectives.

Key insights

Asynchronous LLM post-training can be stabilized by adaptively scaling the current-policy loss, which removes the dependency on behavior-policy information.

Method

ASymPO normalizes each response's token loss by that response's current average token negative log-probability. The normalization uses a stop-gradient operator, so it balances the loss contributions of positive and negative responses.
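
The normalization described above can be illustrated with a small sketch. This is one reading of the summary, not the paper's exact objective: it assumes each response contributes its advantage times its mean token negative log-probability, and that the normalizer is a stop-gradient copy of that same mean. The function names, the toy batch, and the advantages of +1 and -1 are hypothetical. Plain Python has no autograd, so the stop-gradient is modeled by treating the denominator as a constant and writing the resulting gradient coefficient out by hand.

```python
import math


def mean_token_nll(token_logprobs):
    """Average negative log-probability of one response under the current policy."""
    return -sum(token_logprobs) / len(token_logprobs)


def response_terms(responses, advantages):
    """Return (naive_value, normalized_value, grad_coeff) for each response.

    naive_value      = A_i * nll_i
    normalized_value = A_i * nll_i / sg(nll_i), which numerically equals A_i
    grad_coeff       = derivative of the normalized term w.r.t. nll_i,
                       with the denominator held constant: A_i / nll_i
    (Assumed form; the paper's exact loss may differ.)
    """
    terms = []
    for logps, adv in zip(responses, advantages):
        nll = mean_token_nll(logps)
        scale = nll  # stands in for stop_gradient(nll): a constant, not differentiated
        terms.append((adv * nll, adv * nll / scale, adv / scale))
    return terms


# Hypothetical toy batch: one high-probability response with positive advantage,
# one low-probability (e.g. stale) response with negative advantage.
responses = [
    [math.log(0.9), math.log(0.8)],
    [math.log(0.2), math.log(0.1), math.log(0.25)],
]
advantages = [1.0, -1.0]  # zero-sum across the group

terms = response_terms(responses, advantages)
naive_sum = sum(t[0] for t in terms)       # dominated by the low-probability response
normalized_sum = sum(t[1] for t in terms)  # balanced: reduces to sum of advantages
grad_coeffs = [t[2] for t in terms]        # nonzero, so a learning signal remains

print("naive sum:", naive_sum)
print("normalized sum:", normalized_sum)
print("gradient coefficients:", grad_coeffs)
```

In this toy batch the naive terms do not cancel even though the advantages do, because the two responses sit at very different negative log-probability scales. After normalization the loss values cancel at the response level, while each response still carries a nonzero gradient coefficient. A real implementation would use an autograd framework's stop-gradient (for example a detach operation) and would need to guard against a near-zero normalizer.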

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.