ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ASymPO (Asymmetric-Scale Policy Optimization) is a method for asynchronous large language model (LLM) post-training that requires no behavior-policy information. Asynchronous reinforcement learning improves throughput by decoupling response generation from policy optimization, but the responses it trains on can be stale, which causes distribution drift. Existing behavior-corrected methods mitigate this drift, but they require behavior log-probabilities that are token-aligned, versioned, and numerically consistent. ASymPO instead stabilizes asynchronous group-relative RL using only current-policy probabilities. It identifies a scale-imbalance failure mode: when stale responses are evaluated under the current policy, the positive and negative loss terms sit at different negative log-probability scales. To correct this, ASymPO normalizes each response's token loss by that response's current average token negative log-probability, which restores response-level zero-sum balance while preserving a nonzero learning signal. The method was evaluated alongside Scaled Policy Optimization (SPO) in asynchronous mathematical reasoning post-training.

Key takeaway

For machine learning engineers using asynchronous reinforcement learning to raise LLM post-training throughput, ASymPO removes a significant engineering requirement. It stabilizes asynchronous group-relative RL and mitigates distribution drift using only current-policy probabilities, so the pipeline no longer needs token-aligned, versioned behavior log-probabilities. Consider it when simplifying an asynchronous training pipeline, particularly for tasks such as mathematical reasoning, the setting in which it was evaluated.

Key insights

ASymPO stabilizes asynchronous LLM post-training by normalizing each response's token loss by its average token negative log-probability under the current policy, which removes the need for behavior-policy information.

Principles

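Stale responses in asynchronous RL can cause a scale imbalance between positive and negative loss terms.
Current-policy probabilities alone can stabilize group-relative RL.
Normalizing each response's token loss by its current average token negative log-probability restores response-level zero-sum balance.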
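
One way to read these principles in symbols is sketched below. The notation, the mean-centered advantages, and the stop-gradient operator sg[·] on the normalizer are all assumptions made for illustration; the summary above does not give the paper's exact formulation. The sign convention is a loss to be minimized, so a positive advantage pushes a response's negative log-probability down.

```latex
% Notation assumed for illustration; not taken from the paper.
% G responses per prompt x, group-relative advantages A_i with \sum_i A_i = 0,
% T_i tokens in response y_i, current policy \pi_\theta,
% sg[.] = stop-gradient (treated as a constant when differentiating).
\begin{align*}
\bar{\ell}_i(\theta) &= -\frac{1}{T_i}\sum_{t=1}^{T_i}
    \log \pi_\theta\!\left(y_{i,t} \mid x,\, y_{i,<t}\right)
  && \text{(current average token NLL)} \\
L_{\text{raw}}(\theta) &= \sum_{i=1}^{G} A_i\,\bar{\ell}_i(\theta)
  && \text{(generally nonzero when the } \bar{\ell}_i \text{ differ)} \\
L_{\text{norm}}(\theta) &= \sum_{i=1}^{G} A_i\,
    \frac{\bar{\ell}_i(\theta)}{\operatorname{sg}\!\left[\bar{\ell}_i(\theta)\right]}
  && \text{(value } \textstyle\sum_i A_i = 0\text{)} \\
\nabla_\theta L_{\text{norm}}(\theta) &= \sum_{i=1}^{G}
    \frac{A_i}{\bar{\ell}_i(\theta)}\,\nabla_\theta \bar{\ell}_i(\theta)
  && \text{(generally nonzero)}
\end{align*}
```

Under this reading, the normalized loss value is zero-sum across the group regardless of how far each stale response has drifted in negative log-probability, while each response still contributes a gradient scaled by its advantage.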

Method

ASymPO normalizes each response's token loss by its current average token negative log-probability. This restores response-level zero-sum balance and preserves a nonzero learning signal in asynchronous LLM post-training, without requiring behavior-policy probabilities.
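
The normalization described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the mean-centered advantages, and the choice to hold the normalizer constant when differentiating (a stop-gradient) are assumptions made here for illustration. It compares response-level loss terms with and without the normalization and returns the per-token gradient coefficients that the normalized loss would apply to the current-policy log-probabilities.

```python
def group_relative_advantages(rewards):
    """Mean-centered rewards within one prompt group; they sum to zero.
    (Assumed advantage definition, for illustration only.)"""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def avg_token_nll(token_logps):
    """Average token negative log-probability of one response
    under the current policy."""
    return -sum(token_logps) / len(token_logps)


def raw_loss_terms(token_logps_per_response, advantages):
    """Unnormalized response-level terms: advantage * average token NLL."""
    return [a * avg_token_nll(lp)
            for lp, a in zip(token_logps_per_response, advantages)]


def normalized_terms_and_token_weights(token_logps_per_response, advantages):
    """Normalize each response's loss by its own current average token NLL.

    Returns (terms, weights):
      terms[i]      value of the normalized response-level loss term
      weights[i][t] d(term_i) / d(logp_{i,t}), with the normalizer held
                    constant (assumed stop-gradient)
    """
    terms, weights = [], []
    for lp, a in zip(token_logps_per_response, advantages):
        nll = avg_token_nll(lp)
        if nll <= 0.0:
            raise ValueError("average token NLL must be positive")
        terms.append(a * nll / nll)  # equals the advantage itself
        weights.append([-a / (len(lp) * nll)] * len(lp))
    return terms, weights


if __name__ == "__main__":
    # Hypothetical group of two stale responses scored by the current policy.
    rewards = [1.0, 0.0]
    logps = [[-0.1, -0.3],          # low NLL under the current policy
             [-1.0, -2.0, -1.5]]    # high NLL under the current policy
    adv = group_relative_advantages(rewards)

    raw = raw_loss_terms(logps, adv)
    terms, weights = normalized_terms_and_token_weights(logps, adv)

    print("raw terms:", raw, "sum:", sum(raw))
    print("normalized terms:", terms, "sum:", sum(terms))
    print("per-token gradient coefficients:", weights)
```

In this toy group the raw terms do not cancel because the two responses sit at different negative log-probability scales, whereas the normalized terms sum to zero and the per-token coefficients remain nonzero. In an autograd framework the same effect would typically be obtained by detaching the normalizer from the computation graph.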

In practice

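Apply ASymPO to asynchronous LLM post-training when behavior log-probabilities are unavailable or too complex to record.
Use current-policy probabilities alone to control drift from stale responses.
Evaluate on mathematical reasoning tasks, the setting reported for the method.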

Topics

Asymmetric-Scale Policy Optimization
Large Language Models
Reinforcement Learning
Asynchronous Training
Policy Optimization
Distribution Drift
Mathematical Reasoning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.