Scalable Reinforcement Learning via Adaptive Batch Scaling
Summary
Adaptive Batch Scaling (ABS) is a novel framework that challenges the conventional belief that large-batch training is incompatible with Reinforcement Learning (RL). ABS dynamically adjusts the effective batch size based on the evolving non-stationarity of the learning policy, quantified by "Behavioral Divergence," a new metric measuring action-level shifts between consecutive updates. This approach allows for small batches during early, volatile training stages to maintain plasticity and large batches in later, quasi-stationary stages for precise convergence. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS significantly improves performance. It enables larger networks, such as PQN-L and PQN-XL, to achieve superior results with larger batch sizes, a scaling behavior previously considered unattainable in RL. The method also demonstrates generalizability to continuous control tasks with PPO and off-policy RL with BTR.
Key takeaway
Machine Learning Engineers developing large-scale Reinforcement Learning agents should consider implementing Adaptive Batch Scaling (ABS) to overcome performance bottlenecks. By adjusting batch size to policy stability, ABS lets high-capacity models use larger batches for stable convergence, an advantage previously limited to supervised learning. The approach can significantly improve sample efficiency and final performance across both on-policy and off-policy RL algorithms.
Key insights
Non-stationarity in RL changes over the course of training, so the batch size must adapt with it to get the best performance and to make larger models pay off.
Principles
- RL non-stationarity is dynamic, not fixed.
- Small batches aid early-stage plasticity.
- Large batches ensure late-stage convergence.
Method
ABS dynamically adjusts rollout length (batch size) using "Behavioral Divergence," a forward-pass metric quantifying action-level policy shifts, to balance plasticity and convergence.
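The idea can be illustrated with a minimal sketch. It is not the paper's implementation: the divergence definition used here (the fraction of probe states whose greedy action changes across one update), the function names `behavioral_divergence` and `next_rollout_length`, and the parameters `min_len`, `max_len`, and `high` are all assumptions chosen for illustration.

```python
import numpy as np


def behavioral_divergence(q_old: np.ndarray, q_new: np.ndarray) -> float:
    """Fraction of probe states whose greedy action changed across one update.

    q_old, q_new: arrays of shape (num_states, num_actions) with Q-values from
    the network before and after the update, evaluated on the same states.
    Assumption: the paper's exact definition may differ from this one.
    """
    if q_old.shape != q_new.shape:
        raise ValueError("q_old and q_new must have the same shape")
    changed = np.argmax(q_old, axis=1) != np.argmax(q_new, axis=1)
    return float(np.mean(changed))


def next_rollout_length(divergence: float,
                        min_len: int = 32,
                        max_len: int = 512,
                        high: float = 0.2) -> int:
    """Map divergence to a rollout length (hypothetical schedule).

    A volatile policy (divergence >= high) gets the shortest rollout;
    a stable policy (divergence == 0) gets the longest. Lengths in between
    are interpolated geometrically.
    """
    # Clip the normalised divergence to [0, 1], then invert it into stability.
    stability = 1.0 - min(max(divergence / high, 0.0), 1.0)
    length = min_len * (max_len / min_len) ** stability
    return int(round(length))


if __name__ == "__main__":
    q_before = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
    q_after = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
    d = behavioral_divergence(q_before, q_after)
    print(d, next_rollout_length(d))
```

Only forward passes on a fixed set of probe states are needed, which matches the "forward-pass metric" description. With a fixed number of parallel environments, the effective batch size is the number of environments times the rollout length, so a longer rollout means a larger batch.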
In practice
- Integrate ABS with PQN for Atari-57 performance gains.
- Apply ABS to PPO for continuous control tasks.
- Use ABS to scale large RL network architectures.
Topics
- Reinforcement Learning
- Adaptive Batch Scaling
- Behavioral Divergence
- On-Policy RL
- Large-Batch Training
- PQN Algorithm
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.