RL without TD learning
Summary
Transitive RL (TRL) is a new reinforcement learning (RL) algorithm that brings a "divide and conquer" paradigm to off-policy RL, targeting the poor scaling of traditional temporal difference (TD) learning on long-horizon tasks. TD learning bootstraps from its own value estimates, so errors accumulate across the many Bellman recursions a long horizon requires. TRL instead recursively splits trajectories into smaller segments, which cuts the number of Bellman recursions to one that grows only logarithmically with the horizon. The approach is particularly effective in goal-conditioned RL, where it uses a "transitive" Bellman update rule based on the triangle inequality of shortest-path distances. TRL restricts the search for optimal subgoals to states within dataset trajectories and uses expectile regression to compute a "soft" argmax, which guards against value overestimation. On challenging OGBench tasks such as humanoidmaze and puzzle, with 1B-sized datasets, TRL outperformed strong baselines and matched the best individually tuned n-step TD learning without requiring manual tuning of n, indicating that it handles long horizons naturally.
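The transitive update mentioned above can be written out as a short derivation. This is a sketch under stated assumptions, not a quotation of the original article: the symbols d^*, V, \gamma and the subgoal w are notation chosen here, and the setting is assumed to be deterministic with a value defined as a discount raised to the shortest-path distance.

```latex
% Assumed notation: d^*(s, g) is the shortest-path distance (in steps)
% from state s to goal g, and 0 < \gamma < 1 is a discount factor.

% Triangle inequality through any intermediate state w:
d^*(s, g) \le d^*(s, w) + d^*(w, g)

% Defining V(s, g) = \gamma^{d^*(s, g)} turns the sum into a product
% and flips the inequality, because \gamma < 1:
V(s, g) \ge V(s, w) \, V(w, g)

% Equality is attained at the best subgoal, which gives the transitive update:
V(s, g) \leftarrow \max_{w} V(s, w) \, V(w, g)
```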
Key takeaway
For research scientists developing off-policy RL solutions for long-horizon, complex tasks, Transitive RL (TRL) is a compelling alternative to traditional TD learning. It is worth investigating, especially for goal-conditioned problems: its divide-and-conquer updates are designed to scale to long horizons, and it removes the need to tune n in n-step TD learning, which may simplify development and improve performance on challenging benchmarks like OGBench.
Key insights
Divide and conquer offers a scalable alternative to TD learning for long-horizon off-policy RL, with no n-step horizon to tune.
Principles
- Error accumulation limits TD learning scalability.
- Divide and conquer reduces the number of Bellman recursions to logarithmic in the horizon.
- Goal-conditioned RL naturally supports transitive value updates.
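The logarithmic claim in the principles above can be checked on a toy problem. The following is a minimal tabular sketch, not TRL itself: it assumes a deterministic chain of states, searches over all subgoals exhaustively, and works with shortest-path distances d rather than values (under the assumed relation V = gamma^d, a min over sums of distances corresponds to a max over products of values). All function names are hypothetical.

```python
import math


def chain_distances(n_states):
    """Initial estimates on a chain 0 -> 1 -> ... -> n-1.

    Only 0-step and 1-step distances are known; everything else is infinite.
    """
    d = [[math.inf] * n_states for _ in range(n_states)]
    for s in range(n_states):
        d[s][s] = 0
        if s + 1 < n_states:
            d[s][s + 1] = 1
    return d


def one_step_sweep(d):
    """TD-style backup: d(s, g) <- min(d(s, g), 1 + d(s + 1, g))."""
    n = len(d)
    new = [row[:] for row in d]
    for s in range(n - 1):
        for g in range(n):
            new[s][g] = min(d[s][g], 1 + d[s + 1][g])
    return new


def transitive_sweep(d):
    """Divide-and-conquer backup: d(s, g) <- min over subgoals w of d(s, w) + d(w, g)."""
    n = len(d)
    return [[min(d[s][w] + d[w][g] for w in range(n)) for g in range(n)]
            for s in range(n)]


def sweeps_to_converge(d, sweep):
    """Count the synchronous sweeps that still change at least one estimate."""
    count = 0
    while True:
        new = sweep(d)
        if new == d:
            return count
        d, count = new, count + 1


horizon = 32
d0 = chain_distances(horizon + 1)
print(sweeps_to_converge(d0, one_step_sweep))    # 31
print(sweeps_to_converge(d0, transitive_sweep))  # 5
```

Each transitive sweep doubles the longest distance that is known exactly, so a horizon of 32 needs 5 sweeps, while the one-step backup extends it by one step per sweep and needs 31. Fewer recursions means fewer opportunities for bootstrapping error to compound, which is the motivation stated in the first principle.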
Method
Transitive RL (TRL) restricts subgoal search to dataset trajectories and uses expectile regression for a "soft" argmax in a transitive Bellman update, enabling divide-and-conquer value learning.
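To illustrate how expectile regression can act as a soft maximum over subgoal candidates, here is a minimal sketch in plain Python. It is an assumption-laden illustration rather than the authors' implementation: the asymmetric squared loss follows the standard expectile form, the name `kappa` for the expectile level is chosen here, and the candidate targets are made-up numbers standing in for products V(s_i, s_j) * V(s_j, s_k) over in-trajectory subgoals s_j.

```python
def expectile_loss(target, pred, kappa):
    """Asymmetric squared loss: errors where target > pred are weighted by kappa."""
    u = target - pred
    weight = kappa if u > 0 else 1.0 - kappa
    return weight * u * u


def expectile(targets, kappa, iters=200):
    """Minimizer of the summed expectile loss, via iteratively reweighted averaging.

    Requires 0 < kappa < 1. kappa = 0.5 gives the mean; kappa near 1 approaches the max.
    """
    v = sum(targets) / len(targets)
    for _ in range(iters):
        weights = [kappa if t > v else 1.0 - kappa for t in targets]
        v = sum(w * t for w, t in zip(weights, targets)) / sum(weights)
    return v


def transitive_targets(v_first, v_second):
    """Candidate targets V(s_i, s_j) * V(s_j, s_k), one per candidate subgoal s_j."""
    return [a * b for a, b in zip(v_first, v_second)]


# Hypothetical value estimates for three candidate subgoals on one trajectory.
v_first = [0.5, 0.9, 1.0]    # V(s_i, s_j)
v_second = [0.4, 0.6, 0.9]   # V(s_j, s_k)
candidates = transitive_targets(v_first, v_second)

soft_max = expectile(candidates, kappa=0.99)
mean = expectile(candidates, kappa=0.5)
print(mean, soft_max, max(candidates))
```

With a high expectile level the regressed value sits close to the best candidate but never exceeds it, whereas a hard max over noisy estimates would pick up any upward error directly. That is one way to read the overestimation argument in the method description.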
In practice
- Apply TRL to goal-conditioned RL for complex, long-horizon tasks.
- Consider TRL for off-policy RL where data collection is expensive.
- Explore reformulating reward-based RL problems as goal-conditioned ones so that TRL can be applied.
Topics
- Off-policy Reinforcement Learning
- Divide and Conquer RL
- Transitive RL
- Goal-conditioned RL
- Temporal Difference Learning
Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Berkeley Artificial Intelligence Research Blog.