The Fundamental Choice in Reinforcement Learning: On‑Policy vs. Off‑Policy

Source: Towards Data Science · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Intermediate, long

Summary

Reinforcement learning algorithms often come down to one fundamental choice: whether an agent learns only from the policy it is currently executing (on-policy) or can also learn from data generated by other policies (off-policy). This distinction, embodied by SARSA (on-policy) and Q-learning (off-policy), shapes exploration, data efficiency, training stability, and safety. On-policy methods such as SARSA learn the value of the policy being executed, exploration included, which tends to produce safer behavior during training: in Cliff Walking with ε=0.1, SARSA learns the safe path. Off-policy methods such as Q-learning learn about the optimal (greedy) policy while potentially following a different behavior policy. This enables data reuse via replay buffers but risks maximization bias and instability, especially when off-policy learning is combined with function approximation and bootstrapping (the "deadly triad"). Expected SARSA offers a hybrid approach: it replaces the sampled next-action value with its expectation under the policy, which reduces the variance of the update.
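To make the Expected SARSA idea concrete, here is a minimal sketch of its bootstrap target under an ε-greedy policy. The function names, the toy action values, and the default discount of 1.0 are illustrative assumptions, not taken from the original article.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (hypothetical helper)."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)                    # uniform exploration mass
    probs[int(np.argmax(q_values))] += 1.0 - epsilon   # remaining mass on the greedy action
    return probs

def expected_sarsa_target(r, q_next, epsilon, gamma=1.0):
    """TD target that averages next-state action values under the policy."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    return r + gamma * float(np.dot(probs, q_next))

# Toy next-state action values (assumed for illustration only)
q_next = np.array([1.0, 5.0, -3.0, 1.0])
target = expected_sarsa_target(r=-1.0, q_next=q_next, epsilon=0.1)
greedy_target = -1.0 + float(np.max(q_next))  # what a Q-learning target would use
```

Because the target averages over all next actions instead of sampling one, it does not fluctuate with the particular exploratory action drawn. With ε set to 0 it coincides with the Q-learning target.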

Key takeaway

For machine learning engineers designing new reinforcement learning systems, the choice between on-policy and off-policy algorithms hinges on a few trade-offs. If safety during learning or stable online performance is paramount, opt for on-policy methods such as PPO. If sample efficiency, data reuse, and the best final performance matter most, off-policy approaches such as DQN or SAC are more suitable, especially in simulation or when data collection is expensive. Evaluate your system's specific constraints, including action-space size and tolerance for exploration risk, before deciding.
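The data reuse that off-policy methods rely on is usually implemented with a replay buffer. The sketch below is a minimal, hypothetical version (class name, capacity, and transition layout are assumptions for illustration); it is not the buffer used by any particular DQN or SAC implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)

# Toy usage: five transitions pushed into a buffer that holds three
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, action=0, reward=-1.0, next_state=step + 1, done=False)
batch = buffer.sample(batch_size=2)
```

Sampled transitions may have been collected by an older version of the policy. An off-policy update can still learn from them, whereas an on-policy update generally needs data from the policy currently being executed.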

Key insights

The core choice in reinforcement learning is between on-policy learning, from the actions of the current policy, and off-policy learning, from experience generated by other policies.

Principles

Method

Temporal-Difference (TD) learning updates value estimates by bootstrapping from estimates of what follows. SARSA bootstraps from the value of the next action actually taken, while Q-learning bootstraps from the maximum estimated value over all next actions.
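The difference between the two updates is a single term, which a small tabular sketch makes visible. The function names, the step size of 0.5, the discount of 1.0, and the toy Q-table are assumptions chosen for illustration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0, done=False):
    """On-policy TD update: bootstrap from the next action actually taken."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return target

def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=1.0, done=False):
    """Off-policy TD update: bootstrap from the highest-valued next action."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return target

# Toy table with 2 states and 2 actions (values assumed for illustration)
Q_sarsa = np.array([[0.0, 0.0], [1.0, 5.0]])
Q_qlearn = Q_sarsa.copy()

# Same transition for both; the agent's next action (a_next=0) is not the greedy one
sarsa_target = sarsa_update(Q_sarsa, s=0, a=0, r=-1.0, s_next=1, a_next=0)
q_target = q_learning_update(Q_qlearn, s=0, a=0, r=-1.0, s_next=1)
```

When the next action is exploratory rather than greedy, the SARSA target reflects the lower value of what the agent actually did, while the Q-learning target ignores it and uses the best available estimate.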

In practice

Topics

Best for: Machine Learning Engineer, AI Scientist, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.