Retry Policy Gradients in Continuous Action Spaces
Summary
Soichiro Nishimori and Paavo Parmas of The University of Tokyo introduce Retry Policy Gradients for continuous action spaces, extending retry-based objectives such as ReMax beyond discrete settings. Their algorithm, ReMax Actor-Critic (ReMAC), is an off-policy actor-critic method that optimizes the ReMax objective with a pathwise derivative estimator. ReMAC promotes stochastic exploration without explicit entropy regularization: the ReMax objective reshapes the policy-gradient landscape, biasing updates toward higher policy entropy and damping gradients, which slows convergence. The authors show that Adam's adaptive normalization can mitigate this damping effect. In evaluations on six Brax continuous-control tasks, including Ant and HalfCheetah, ReMAC with retry budgets M>1 performs comparably to Soft Actor-Critic (SAC) and consistently yields higher policy entropy than M=1, particularly for M=4 and M=8 with Adam's default ε=10⁻⁸.
Key takeaway
Machine learning engineers building continuous-control agents can consider ReMax Actor-Critic (ReMAC) as an alternative to entropy-regularized methods such as SAC. With retry budgets M>1, ReMAC achieves comparable performance while encouraging exploration and higher policy entropy without an explicit entropy bonus. Adam's ε parameter is worth tuning to balance exploration against convergence speed: increasing it weakens Adam's normalization and can restore ReMax's damping effect.
Key insights
ReMax extends to continuous action spaces, promoting exploration by reshaping policy gradients without entropy bonuses.
Principles
- Retry objectives promote exploration by adapting to return uncertainty.
- ReMax alters gradients to increase policy entropy and slow convergence.
- Adam's adaptive normalization can counteract ReMax's gradient damping.
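The first principle can be illustrated with a toy Monte Carlo estimate. The sketch below assumes the retry objective is the expected best return over M independent attempts and models returns as Gaussian. The function name `expected_best_of_m` and all numbers are hypothetical and are not taken from the paper.

```python
import random
import statistics


def expected_best_of_m(mean, std, m, n_trials=20000, seed=0):
    """Monte Carlo estimate of E[max of m i.i.d. Gaussian returns]."""
    rng = random.Random(seed)
    best = [
        max(rng.gauss(mean, std) for _ in range(m))
        for _ in range(n_trials)
    ]
    return statistics.fmean(best)


# Two hypothetical policies: same mean return, different spread.
narrow_m1 = expected_best_of_m(1.0, 0.1, m=1)
wide_m1 = expected_best_of_m(1.0, 1.0, m=1)
narrow_m4 = expected_best_of_m(1.0, 0.1, m=4)
wide_m4 = expected_best_of_m(1.0, 1.0, m=4)

print(f"M=1: narrow={narrow_m1:.3f}, wide={wide_m1:.3f}")
print(f"M=4: narrow={narrow_m4:.3f}, wide={wide_m4:.3f}")
```

With M=1 both policies score about the same, because the objective reduces to the mean return. With M=4 the wider return distribution scores higher, which is the sense in which a retry objective rewards return uncertainty.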
Method
ReMAC is an off-policy actor-critic algorithm. It uses a pathwise derivative estimator to optimize the ReMax objective, replacing SAC's entropy bonus with the ReMax loss.
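As a rough illustration of a pathwise estimator for a retry objective, the sketch below uses a one-dimensional Gaussian policy and a toy quadratic critic. It assumes the objective is E[max over M retries of Q(a)] and is not the paper's exact loss. The function `remax_pathwise_grad`, the critic Q(a) = -(a - target)², and all parameter values are hypothetical.

```python
import random


def remax_pathwise_grad(mu, sigma, m, target=1.0, n_trials=20000, seed=0):
    """Estimate the gradient of E[max_i Q(a_i)] w.r.t. mu and sigma.

    Actions are reparameterized as a_i = mu + sigma * eps_i, and
    Q(a) = -(a - target)**2 is a toy stand-in for a learned critic.
    """
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_trials):
        eps = [rng.gauss(0.0, 1.0) for _ in range(m)]
        actions = [mu + sigma * e for e in eps]
        # The max picks one retry; the gradient flows only through it.
        best = max(range(m), key=lambda i: -(actions[i] - target) ** 2)
        dq_da = -2.0 * (actions[best] - target)
        g_mu += dq_da               # da/dmu = 1
        g_sigma += dq_da * eps[best]  # da/dsigma = eps
    return g_mu / n_trials, g_sigma / n_trials


g_mu_1, g_sigma_1 = remax_pathwise_grad(mu=0.0, sigma=1.0, m=1)
g_mu_8, g_sigma_8 = remax_pathwise_grad(mu=0.0, sigma=1.0, m=8)

print(f"M=1: d/dmu={g_mu_1:.3f}, d/dsigma={g_sigma_1:.3f}")
print(f"M=8: d/dmu={g_mu_8:.3f}, d/dsigma={g_sigma_8:.3f}")
```

For M=1 the estimates approximate the analytic gradients of the plain expected-Q objective (2 and -2 at these settings). In this toy case, M=8 gives a smaller mean gradient and a much weaker push to shrink sigma, which matches the damping and entropy-preserving behavior described above.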
In practice
- Implement ReMAC by replacing the entropy bonus in SAC's actor loss with the ReMax loss.
- Adjust Adam's ε to control ReMax's gradient damping.
- Consider M>1 for higher policy entropy in continuous control.
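The interaction between Adam's ε and gradient damping can be seen from the standard Adam update rule alone. The sketch below computes the first Adam step for a scalar gradient and for a copy of it damped 100 times. The helper `adam_first_step` and the gradient values are hypothetical and only illustrate the mechanism, not the paper's experimental setup.

```python
import math


def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for a scalar gradient (moments start at 0)."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)  # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


full, damped = 1.0, 0.01  # hypothetical gradient and a 100x damped copy

ratio_small_eps = (
    adam_first_step(damped, eps=1e-8) / adam_first_step(full, eps=1e-8)
)
ratio_large_eps = (
    adam_first_step(damped, eps=1e-1) / adam_first_step(full, eps=1e-1)
)

print(f"step ratio with eps=1e-8: {ratio_small_eps:.4f}")
print(f"step ratio with eps=1e-1: {ratio_large_eps:.4f}")
```

With the default ε the damped gradient produces almost the same step as the full one, so the normalization cancels the damping. With a larger ε the damped gradient produces a clearly smaller step, so more of the damping survives into the parameter update.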
Topics
- Continuous Control
- ReMax Objective
- Policy Gradients
- Actor-Critic Methods
- Exploration in RL
- Adam Optimizer
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.