RL without TD learning

2025-11-01 · Source: The Berkeley Artificial Intelligence Research Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, medium

Summary

Transitive RL (TRL) is a new reinforcement learning (RL) algorithm that brings a "divide and conquer" paradigm to off-policy RL, targeting the poor scaling of traditional temporal difference (TD) learning on long-horizon tasks. TD learning bootstraps one step at a time, so errors accumulate over the whole horizon. TRL instead recursively splits trajectories into smaller segments, so the number of chained Bellman recursions grows logarithmically rather than linearly with the horizon. The approach fits goal-conditioned RL especially well, where it uses a "transitive" Bellman update rule based on the triangle inequality of shortest-path distances. TRL restricts the search for optimal subgoals to states within dataset trajectories and uses expectile regression to compute a "soft" argmax, which limits value overestimation. Evaluated on challenging OGBench tasks such as humanoidmaze and puzzle with 1B-sized datasets, TRL outperformed strong baselines and matched the best-tuned n-step TD learning without any manual tuning of the 'n' hyperparameter, which suggests it handles long horizons naturally.
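
The step from the triangle inequality to a value update can be written out as a short derivation. This is a sketch under assumptions not stated in the summary: deterministic dynamics, a discount factor \gamma in (0, 1), and a goal-conditioned value defined as V(s, g) = \gamma^{d^*(s, g)}, where d^*(s, g) is the shortest-path distance from s to g and w is any intermediate state.

```latex
% Assumed notation: d^*(s,g) = shortest-path distance, 0 < \gamma < 1,
% V(s,g) = \gamma^{d^*(s,g)}. These definitions are illustrative assumptions.
\begin{align*}
d^*(s,g) &\le d^*(s,w) + d^*(w,g)
  && \text{(triangle inequality)} \\
\gamma^{d^*(s,g)} &\ge \gamma^{d^*(s,w)}\,\gamma^{d^*(w,g)}
  && \text{(since } 0 < \gamma < 1\text{)} \\
V(s,g) &\ge V(s,w)\,V(w,g)
  && \text{(by the definition of } V\text{)} \\
V(s,g) &\leftarrow \max_{w}\, V(s,w)\,V(w,g)
  && \text{(transitive update)}
\end{align*}
```

Equality in the third line holds when w lies on a shortest path from s to g, which is why maximizing over subgoals recovers the true value in this idealized setting.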

Key takeaway

For research scientists developing off-policy RL solutions for long-horizon, complex tasks, Transitive RL (TRL) presents a compelling alternative to traditional TD learning. You should investigate TRL, especially for goal-conditioned problems: its recursion depth grows logarithmically with the horizon, and it removes the need to tune the 'n' hyperparameter of n-step TD learning, which may simplify development and improve performance on challenging benchmarks like OGBench.

Key insights

Divide and conquer offers a scalable alternative to TD learning for long-horizon off-policy RL, with no n-step horizon hyperparameter to tune.

Principles

Error accumulation from bootstrapping limits how well TD learning scales with horizon.
Divide and conquer cuts the number of chained Bellman recursions from linear to logarithmic in the horizon.
Goal-conditioned RL naturally supports transitive value updates.
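
The logarithmic claim can be checked on a toy problem. The sketch below is not TRL itself: it is an exact tabular illustration on a hypothetical directed chain of states 0 -> 1 -> ... -> n-1, working with shortest-path distances rather than discounted values. All function names are made up for this example. It counts how many synchronous sweeps each update rule needs before the distance from the first to the last state becomes known.

```python
import math

INF = math.inf


def init_table(n):
    """Distances known at the start: d(i, i) = 0 and d(i, i+1) = 1."""
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0
        if i + 1 < n:
            d[i][i + 1] = 1
    return d


def one_step_sweep(d):
    """TD-style backup: d(i, j) <- min(d(i, j), 1 + d(i+1, j))."""
    n = len(d)
    new = [row[:] for row in d]
    for i in range(n - 1):
        for j in range(i + 1, n):
            new[i][j] = min(d[i][j], 1 + d[i + 1][j])
    return new


def transitive_sweep(d):
    """Divide-and-conquer backup: d(i, j) <- min over w of d(i, w) + d(w, j)."""
    n = len(d)
    new = [row[:] for row in d]
    for i in range(n):
        for j in range(i, n):
            new[i][j] = min(d[i][w] + d[w][j] for w in range(i, j + 1))
    return new


def sweeps_until_solved(n, sweep):
    """Return (number of sweeps, final distance) for the pair (0, n-1)."""
    d = init_table(n)
    count = 0
    while d[0][n - 1] == INF:
        d = sweep(d)
        count += 1
    return count, d[0][n - 1]


if __name__ == "__main__":
    n = 17  # horizon of 16 steps between state 0 and state 16
    print("one-step:", sweeps_until_solved(n, one_step_sweep))
    print("transitive:", sweeps_until_solved(n, transitive_sweep))
```

On this chain the one-step rule extends the known horizon by one step per sweep, while the transitive rule doubles it, so the gap widens as the horizon grows. With learned, approximate values, fewer chained backups also means fewer opportunities for errors to compound.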

Method

Transitive RL (TRL) restricts subgoal search to dataset trajectories and uses expectile regression for a "soft" argmax in a transitive Bellman update, enabling divide-and-conquer value learning.
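
To make the "soft" argmax concrete, here is a minimal scalar sketch of an expectile. It assumes the candidate targets are the products V(s, w) * V(w, g) for a handful of subgoals w taken from one dataset trajectory. The numbers and the helper name `expectile` are hypothetical, and the actual method trains a value network with an expectile loss rather than solving for a scalar like this.

```python
def expectile(values, tau, iters=100):
    """Return the tau-expectile of `values`.

    The tau-expectile m minimizes sum_i |tau - 1[x_i < m]| * (x_i - m)^2.
    Its first-order condition gives a weighted mean, which is iterated here:
    points above m get weight tau, points at or below m get weight 1 - tau.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    m = sum(values) / len(values)  # tau = 0.5 leaves this mean unchanged
    for _ in range(iters):
        weights = [tau if x > m else 1.0 - tau for x in values]
        m = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return m


if __name__ == "__main__":
    # Hypothetical candidate targets V(s, w) * V(w, g), one per subgoal w.
    candidates = [0.2, 0.5, 0.9]
    for tau in (0.5, 0.9, 0.99):
        print(tau, expectile(candidates, tau))
```

As tau approaches 1 the expectile moves toward the largest candidate without ever exceeding it, so a single noisy, overly optimistic candidate pulls the target less than a hard max would.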

In practice

Apply TRL to goal-conditioned RL for complex, long-horizon tasks.
Consider TRL for off-policy RL where data collection is expensive.
Explore recasting reward-based RL problems as goal-conditioned ones so that TRL can be applied.

Topics

Off-policy Reinforcement Learning
Divide and Conquer RL
Transitive RL
Goal-conditioned RL
Temporal Difference Learning

Best for: Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Berkeley Artificial Intelligence Research Blog.