Retry Policy Gradients in Continuous Action Spaces

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Soichiro Nishimori and Paavo Parmas of The University of Tokyo introduce Retry Policy Gradients for continuous action spaces, extending retry-based objectives such as ReMax beyond discrete settings. Their algorithm, ReMax Actor-Critic (ReMAC), is an off-policy actor-critic method that optimizes the ReMax objective with a pathwise derivative estimator. ReMAC promotes stochastic exploration without explicit entropy regularization: it reshapes the policy-gradient landscape, biasing updates towards higher policy entropy and damping gradient magnitudes, which slows convergence. The authors show that Adam's adaptive normalization can mitigate this damping. Across six Brax continuous-control tasks, including Ant and HalfCheetah, ReMAC with retry budgets M>1 performs comparably to Soft Actor-Critic (SAC) and consistently yields higher policy entropy than M=1, particularly for M=4 and M=8 with Adam's default ε=10⁻⁸.

Key takeaway

Machine learning engineers building continuous-control agents can consider ReMax Actor-Critic (ReMAC) as an alternative to entropy-regularized methods such as SAC. With retry budgets M>1, ReMAC reaches comparable performance while encouraging exploration and higher policy entropy without an entropy bonus. Adam's ε is the knob for trading exploration against convergence speed: a larger ε weakens Adam's normalization and so restores more of ReMax's damping effect.

Key insights

ReMax extends to continuous action spaces, promoting exploration by reshaping policy gradients without entropy bonuses.

Principles

Retry objectives promote exploration by adapting to return uncertainty.
ReMax alters gradients to increase policy entropy and slow convergence.
Adam's adaptive normalization can counteract ReMax's gradient damping.

Method

ReMAC is an off-policy actor-critic algorithm. It uses a pathwise derivative estimator to optimize the ReMax objective, replacing SAC's entropy bonus with the ReMax loss.
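As a rough illustration of what a pathwise estimator for a retry objective can look like, the sketch below assumes the objective is the expected best critic value over M sampled actions, which is one reading of "retry budget M" and not a formula quoted from the paper. The single-state setup, the quadratic `toy_critic`, and the function name `retry_pathwise_grad` are all hypothetical; this is not the authors' implementation.

```python
import random


def toy_critic(a):
    """Hypothetical critic Q(a) = -(a - 1)^2, maximized at a = 1."""
    return -(a - 1.0) ** 2


def toy_critic_grad(a):
    """Analytic dQ/da for the toy critic above."""
    return -2.0 * (a - 1.0)


def retry_pathwise_grad(mu, sigma, M, n_batches=20000, seed=0):
    """Monte Carlo pathwise gradient of E[max_i Q(a_i)] w.r.t. (mu, sigma).

    Actions are reparameterized as a_i = mu + sigma * eps_i, eps_i ~ N(0, 1).
    The max passes its gradient only through the best of the M samples.
    Assumption: the retry objective is the expected best-of-M critic value.
    """
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_batches):
        best_q, best_a, best_eps = None, None, None
        for _ in range(M):
            eps = rng.gauss(0.0, 1.0)
            a = mu + sigma * eps
            q = toy_critic(a)
            if best_q is None or q > best_q:
                best_q, best_a, best_eps = q, a, eps
        dq = toy_critic_grad(best_a)
        g_mu += dq                # da/dmu = 1
        g_sigma += dq * best_eps  # da/dsigma = eps
    return g_mu / n_batches, g_sigma / n_batches


# Compare a single attempt (M=1) with a retry budget of M=8.
g_mu_1, g_sigma_1 = retry_pathwise_grad(mu=0.0, sigma=1.0, M=1)
g_mu_8, g_sigma_8 = retry_pathwise_grad(mu=0.0, sigma=1.0, M=8)
print("M=1: d/dmu = %.3f, d/dsigma = %.3f" % (g_mu_1, g_sigma_1))
print("M=8: d/dmu = %.3f, d/dsigma = %.3f" % (g_mu_8, g_sigma_8))
```

In this toy setting the M=8 estimate should show a smaller gradient on the mean and a less negative gradient on the scale than M=1, which mirrors the damping and the bias towards higher entropy described in the summary. A real actor-critic would replace the analytic critic with a learned Q-network and automatic differentiation.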

In practice

Implement ReMAC by modifying SAC's actor loss.
Adjust Adam's ε to control ReMax's gradient damping.
Consider M>1 for higher policy entropy in continuous control.
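The interaction between Adam's ε and gradient damping can be seen in a single scalar Adam step. The sketch below uses the standard Adam update with bias correction; the gradient magnitudes and the larger ε value of 1e-3 are illustrative assumptions, not settings reported in the paper, and `adam_first_step` is a hypothetical helper name.

```python
import math


def adam_first_step(grad, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for a scalar gradient.

    Standard Adam with bias correction, starting from zero moments.
    """
    m = (1.0 - beta1) * grad
    v = (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2)  # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)


full_grad = 1e-2    # hypothetical undamped gradient
damped_grad = 1e-5  # same gradient after a hypothetical 1000x damping

# Ratio of damped step to undamped step for two values of eps.
ratio_default_eps = adam_first_step(damped_grad) / adam_first_step(full_grad)
ratio_large_eps = (adam_first_step(damped_grad, eps=1e-3)
                   / adam_first_step(full_grad, eps=1e-3))
print("eps=1e-8: damped/undamped step ratio = %.4f" % ratio_default_eps)
print("eps=1e-3: damped/undamped step ratio = %.4f" % ratio_large_eps)
```

With the default ε the two steps are nearly the same size, so the damping is normalized away; with the larger ε the damped gradient produces a much smaller step, so more of the damping survives the optimizer.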

Topics

Continuous Control
ReMax Objective
Policy Gradients
Actor-Critic Methods
Exploration in RL
Adam Optimizer

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.