The Bitter Lesson: The history of reinforcement learning

2026-06-13 · Source: CoRecursive: Coding Stories · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Intermediate, extended

Summary

Richard Sutton's essay "The Bitter Lesson" (2019) and the "Reward is Enough" hypothesis, which he co-authored, argue that general methods built on massive computation and simple reward signals beat handcrafted rules in AI. The approach traces back to B.F. Skinner's behaviorism and to Sutton's 1988 work on temporal difference learning. That work led to Gerald Tesauro's 1992 TD-Gammon, a neural network that learned backgammon, a game with roughly 100 quintillion states, through self-play and came to rival the best human players. It stands in contrast to IBM's Deep Blue (1997), whose chess play depended on an evaluation function built from roughly 8,000 handcrafted features. After about two decades of dormancy, DeepMind revived reinforcement learning in 2015 with deep Q-networks for Atari games, culminating in AlphaGo's 4-1 victory over Go champion Lee Sedol in 2016. AlphaGo Zero (2017) then reached superhuman play from a "tabula rasa" state, training for just three days with no human game data, and MuZero went further by learning without being given the game rules. Sutton's essay argues that computation-driven general methods consistently outperform human-engineered cleverness, a "bitter pill" for AI developers.

Key takeaway

AI scientists and machine learning engineers designing intelligent systems should recognize that relying on human-defined rules and expert systems yields diminishing returns. Focus instead on defining clear reward signals and on scaling computation for self-play and general learning algorithms. Human cleverness is best spent on creating new "boxes," or problem domains; once a task is well defined, competing directly with reward-driven AI will likely end in being outperformed. Keep learning and adapting to stay ahead.

Key insights

Computation-driven general methods, guided by simple reward signals, consistently outperform human-engineered AI.

Principles

Computation-driven general methods win.
Reward signals are sufficient for intelligence.
Human-designed rules can constrain what a system learns.

Method

In the systems described here, reinforcement learning combines self-play, assigning values to positions based on eventual outcomes, and updating a neural network from reward signals, often alongside tree search.
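The value-assignment step can be illustrated with a minimal temporal difference sketch. This is not code from the original article: it assumes a toy five-state random walk, uses a plain lookup table in place of a neural network, and omits tree search. The function name td0_random_walk and all parameter values are hypothetical choices made for illustration.

```python
import random


def td0_random_walk(episodes=5000, alpha=0.02, gamma=1.0, n_states=5, seed=0):
    """Estimate state values for a toy random walk with tabular TD(0)."""
    rng = random.Random(seed)
    # Index 0 and index n_states + 1 are terminal states whose value stays 0.
    values = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(episodes):
        state = (n_states + 1) // 2  # every episode starts in the middle
        while 0 < state < n_states + 1:
            next_state = state + rng.choice((-1, 1))
            # The only reward is 1 for exiting on the right-hand side.
            reward = 1.0 if next_state == n_states + 1 else 0.0
            # TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(s').
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values[1:-1]


if __name__ == "__main__":
    for i, v in enumerate(td0_random_walk(), start=1):
        print(f"state {i}: estimated {v:.2f}, true {i / 6:.2f}")
```

Each estimate is corrected using the next state's estimate rather than the final outcome of the episode, which is the core idea of temporal difference learning. In this toy setting the true value of state i is i/6, so the estimates can be checked directly.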

In practice

Define clear reward signals for tasks.
Prioritize computational scale over complex rules.
Explore self-play for novel strategies.
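As a hedged illustration of these three points, the sketch below trains a player for a tiny game purely through self-play. The game (a single-pile version of Nim in which each player takes one to three stones and whoever takes the last stone wins) is an assumption chosen for brevity and does not appear in the original article. The learner is a tabular Q-learning variant, not the neural-network systems discussed above, and the names train_nim_self_play and best_move are hypothetical.

```python
import random


def train_nim_self_play(start=10, max_take=3, episodes=20000,
                        alpha=0.5, epsilon=0.3, seed=0):
    """Learn single-pile Nim by self-play; the only reward is +1 for winning."""
    rng = random.Random(seed)

    def legal(stones):
        return range(1, min(max_take, stones) + 1)

    # q[(stones, take)] estimates the outcome for the player about to move.
    q = {(s, a): 0.0 for s in range(1, start + 1) for a in legal(s)}

    for _ in range(episodes):
        stones = start
        while stones > 0:
            # Epsilon-greedy: mostly play the current best move, sometimes explore.
            if rng.random() < epsilon:
                action = rng.choice(list(legal(stones)))
            else:
                action = max(legal(stones), key=lambda a: q[(stones, a)])
            remaining = stones - action
            if remaining == 0:
                target = 1.0  # the mover took the last stone and wins
            else:
                # Zero-sum game: the opponent's best outcome is the mover's worst.
                target = -max(q[(remaining, a)] for a in legal(remaining))
            q[(stones, action)] += alpha * (target - q[(stones, action)])
            stones = remaining  # the same table now plays the other side
    return q


def best_move(q, stones, max_take=3):
    """Return the greedy move for the given pile size."""
    return max(range(1, min(max_take, stones) + 1), key=lambda a: q[(stones, a)])


if __name__ == "__main__":
    table = train_nim_self_play()
    for pile in (5, 6, 7, 9, 10):
        print(f"pile of {pile}: take {best_move(table, pile)}")
```

No strategy is written into the code; the table is told only when a game has been won. The known winning strategy for this game is to leave the opponent a multiple of four stones, and the learned greedy moves can be compared against it.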

Topics

Reinforcement Learning
The Bitter Lesson
DeepMind AlphaGo
Expert Systems
Temporal Difference Learning
Self-Play Algorithms

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by CoRecursive: Coding Stories.