The Fundamental Choice in Reinforcement Learning: On‑Policy vs. Off‑Policy
Summary
Reinforcement learning algorithms often boil down to a fundamental choice: whether an agent learns only from the behavior of its current policy (on-policy) or can also learn from data generated by other policies (off-policy). This distinction, central to SARSA (on-policy) and Q-learning (off-policy), affects exploration, data efficiency, training stability, and safety. On-policy methods such as SARSA learn the value of the policy actually being executed, exploration included, which leads to safer behavior during training: in Cliff Walking with ε=0.1, SARSA settles on the safe path away from the cliff edge. Off-policy methods such as Q-learning learn about the optimal (greedy) policy while the agent may act differently. That separation allows data reuse through replay buffers, but the max operator introduces maximization bias, and training can become unstable when off-policy learning is combined with function approximation and bootstrapping (the "deadly triad"). Expected SARSA sits between the two: it replaces the sampled next action with an expectation over the policy's next actions, which reduces the variance of the update.
Key takeaway
For machine learning engineers designing new reinforcement learning systems, the choice between on-policy and off-policy algorithms hinges on a few trade-offs. If safety during learning or stable online performance is paramount, opt for on-policy methods like PPO. If sample efficiency, data reuse, and the best final performance matter more, off-policy approaches such as DQN or SAC are the better fit, especially in simulation, where risky exploration is cheap, or when collecting data is expensive. Evaluate your system's specific constraints, including action space size and tolerance for exploration risk, before committing to either family.
Key insights
Reinforcement learning's core choice is between on-policy learning, which uses only experience generated by the current policy, and off-policy learning, which can also use experience generated by other policies.
Principles
- On-policy methods prioritize safety and stable online performance.
- Off-policy methods enhance sample efficiency via experience reuse.
- Combining function approximation, bootstrapping, and off-policy learning risks instability.
Method
Temporal-Difference (TD) learning updates value estimates by bootstrapping from estimates of future value. SARSA bootstraps from the value of the next action the agent actually takes, while Q-learning bootstraps from the maximum value over all actions available in the next state.
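The difference between the two TD targets can be sketched in a few lines of Python. This is a minimal tabular illustration, not code from the original article: the function names, the list-of-lists Q-table layout (Q[state][action]), and the default values for gamma and alpha are all assumptions made for the example.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma=0.99, done=False):
    """On-policy target: bootstrap from the action actually taken next."""
    if done:
        return reward  # no bootstrapping past a terminal state
    return reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma=0.99, done=False):
    """Off-policy target: bootstrap from the greedy (maximum-value) action."""
    if done:
        return reward
    return reward + gamma * max(Q[next_state])


def td_update(Q, state, action, target, alpha=0.1):
    """Move Q[state][action] a step of size alpha toward the TD target."""
    Q[state][action] += alpha * (target - Q[state][action])


# Hypothetical two-state, two-action table: Q[state][action].
Q = [[0.0, 0.0], [1.0, 3.0]]

# The agent moved from state 0 to state 1 with reward 0 and, because it
# explores, picked the lower-valued action 0 in state 1.
on_policy = sarsa_target(Q, reward=0.0, next_state=1, next_action=0, gamma=0.9)
off_policy = q_learning_target(Q, reward=0.0, next_state=1, gamma=0.9)
```

In this example the SARSA target is 0.9 (it accounts for the exploratory action the agent really took), while the Q-learning target is 2.7 (it assumes the greedy action regardless of what the agent did). Both targets feed the same td_update step.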
In practice
- Employ replay buffers with off-policy algorithms like DQN.
- Mitigate maximization bias using Double Q-learning.
- Choose Expected SARSA for lower variance in small action spaces.
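The three practices above can be sketched together for the tabular case. This is an illustrative sketch under stated assumptions rather than the article's implementation: the class and function names, the Q[state][action] list-of-lists layout, the ε-greedy policy used by Expected SARSA, and all default hyperparameters are hypothetical choices for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions for off-policy reuse."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement, capped at the buffer size.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def greedy_action(q_row):
    """Index of the highest-valued action (first one on ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def double_q_update(QA, QB, transition, alpha=0.1, gamma=0.99, rng=random):
    """Double Q-learning: one table selects the action, the other scores it."""
    state, action, reward, next_state, done = transition
    # Randomly pick which table learns on this step.
    learner, evaluator = (QA, QB) if rng.random() < 0.5 else (QB, QA)
    if done:
        target = reward
    else:
        best = greedy_action(learner[next_state])
        target = reward + gamma * evaluator[next_state][best]
    learner[state][action] += alpha * (target - learner[state][action])


def expected_sarsa_target(Q, reward, next_state, epsilon=0.1, gamma=0.99,
                          done=False):
    """Expected SARSA: average next-state values under an ε-greedy policy."""
    if done:
        return reward
    q_row = Q[next_state]
    n = len(q_row)
    probs = [epsilon / n] * n                 # exploration mass per action
    probs[greedy_action(q_row)] += 1.0 - epsilon  # remaining mass on greedy
    expected_value = sum(p * q for p, q in zip(probs, q_row))
    return reward + gamma * expected_value
```

Selecting the action with one table and evaluating it with the other is what counters maximization bias, since a single table's overestimate is less likely to be chosen and scored by the same noisy values. The Expected SARSA target sums over every action, which is why it is cheapest in small action spaces. With Q[1] = [1.0, 3.0], ε=0.1, reward 0, and gamma 0.9, the Expected SARSA target is about 2.61, between the SARSA and Q-learning targets for the same transition.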
Topics
- Reinforcement Learning
- On-policy Learning
- Off-policy Learning
- SARSA
- Q-learning
- Temporal-Difference Learning
- Sample Efficiency
Best for: Machine Learning Engineer, AI Scientist, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.