A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning
Summary
Self-play reinforcement learning (RL) agents exhibit a critical structural threshold in decision capacity that determines their stability under asymmetric action-space perturbations. The researchers found that eliminating all positive-reach contingent decisions, which reduces the reach-weighted contingent action capacity (CAC_w) to zero, causes rapid convergence to a deterministic exploitation attractor (DEA), a fixed point of near-maximal loss. The phenomenon was observed across poker variants (Kuhn, Leduc, Leduc-4), a matrix game (Matching Pennies), and a dice game (Liar's Dice, up to 24,576 information sets), using six learning algorithms (Q-Learning, SARSA, REINFORCE, PPO, DQN, NFSP). Crucially, preserving even a single positive-reach contingent decision point prevents the collapse. The mechanism is attributed to co-adaptation under constraint rather than to the perturbation itself. The effect is timing-invariant, fully reversible, and intensifies with function approximation.
Key takeaway
Research scientists developing or deploying multi-agent reinforcement learning systems should recognize that structural changes to an agent's action space can induce a sharp, catastrophic collapse if all contingent decision points are eliminated. Ensure that agents retain at least one positive-reach contingent decision point to maintain strategic flexibility and prevent convergence to a deterministic exploitation attractor, especially in competitive, zero-sum environments where co-adaptation can amplify vulnerabilities.
Key insights
A structural threshold in decision capacity governs self-play RL agent collapse under asymmetric action-space perturbations.
Principles
- Zero contingent action capacity (CAC_w=0) leads to deterministic exploitation.
- Co-adaptation under constraint drives catastrophic collapse.
- A single positive-reach decision point prevents collapse.
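As a rough illustration of why a fully deterministic strategy is exploitable, the sketch below evaluates Matching Pennies, one of the games named in the summary. It shows only the static endpoint (the payoff against a best-responding opponent), not the self-play dynamics the study examines. The payoff convention and the function name `best_response_value` are assumptions made for this example.

```python
# Row player's payoffs in Matching Pennies (assumed convention):
# +1 when both coins match, -1 otherwise. Index 0 = Heads, 1 = Tails.
PAYOFF = [[1.0, -1.0],
          [-1.0, 1.0]]


def best_response_value(p_heads: float) -> float:
    """Row player's expected payoff against a best-responding opponent.

    p_heads is the probability that the row player plays Heads. The
    opponent picks the pure action that minimises the row player's payoff.
    """
    row_strategy = [p_heads, 1.0 - p_heads]
    value_per_opponent_action = [
        sum(row_strategy[i] * PAYOFF[i][j] for i in range(2))
        for j in range(2)
    ]
    return min(value_per_opponent_action)


if __name__ == "__main__":
    for p in (0.5, 0.75, 1.0):
        print(f"P(Heads)={p:.2f} -> value {best_response_value(p):+.2f}")
```

A mixed strategy at 0.5 cannot be exploited, while a strategy with no remaining randomisation (probability 0 or 1) concedes the maximal loss of 1 per round.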
Method
The study involved deterministically removing one player's ability to bet or raise at specified decision nodes in discrete, imperfect-information games to observe self-play dynamics.
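The perturbation described above might be sketched as an action filter applied to one player at chosen decision nodes. This is a minimal sketch, not the study's implementation: the function name `restricted_actions`, the node labels, the action strings, and the fallback that keeps the original actions when filtering would leave none are all assumptions made for illustration.

```python
from typing import Iterable, List, Set

# Actions removed by the perturbation (assumed labels for bet and raise).
REMOVED_ACTIONS = frozenset({"bet", "raise"})


def restricted_actions(
    player: int,
    node: str,
    legal_actions: Iterable[str],
    perturbed_player: int,
    perturbed_nodes: Set[str],
) -> List[str]:
    """Return the actions still available after the asymmetric perturbation.

    Only `perturbed_player` is affected, and only at `perturbed_nodes`.
    """
    legal = list(legal_actions)
    if player != perturbed_player or node not in perturbed_nodes:
        return legal
    kept = [a for a in legal if a not in REMOVED_ACTIONS]
    # Assumption: never leave a node without any action to play.
    return kept if kept else legal


if __name__ == "__main__":
    # Hypothetical Kuhn-style node where player 0 holds a Jack.
    print(restricted_actions(0, "J", ["check", "bet"], 0, {"J"}))
    print(restricted_actions(1, "J", ["check", "bet"], 0, {"J"}))
```

Because the filter is applied to one player only, the opponent keeps its full action set and can co-adapt against the constrained player.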
In practice
- Monitor CAC_w in multi-agent RL deployments.
- Design systems to retain minimal strategic flexibility.
- Avoid action-space restrictions that remove every contingent decision in competitive RL.
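One way to monitor the quantity mentioned above is sketched below. This summary does not give the exact formula for CAC_w, so the code uses an assumed proxy: the total reach probability of decision nodes where the agent still has more than one available action. The names `cac_w_proxy` and `has_contingent_capacity`, and the input format, are hypothetical.

```python
from typing import Iterable, Tuple

# Each entry is (reach_probability, number_of_available_actions) for one
# decision node of the monitored agent. This format is an assumption.
Node = Tuple[float, int]


def cac_w_proxy(nodes: Iterable[Node]) -> float:
    """Assumed proxy for CAC_w: reach mass on nodes with a real choice."""
    return sum(reach for reach, n_actions in nodes if reach > 0 and n_actions > 1)


def has_contingent_capacity(nodes: Iterable[Node], eps: float = 1e-12) -> bool:
    """True if at least one positive-reach contingent decision remains."""
    return cac_w_proxy(nodes) > eps


if __name__ == "__main__":
    healthy = [(0.5, 2), (0.5, 1)]
    collapsed = [(1.0, 1), (0.0, 2)]  # the only choice node is never reached
    print(cac_w_proxy(healthy), has_contingent_capacity(healthy))
    print(cac_w_proxy(collapsed), has_contingent_capacity(collapsed))
```

A deployment check could log this value after every change to the action space and raise an alert when it reaches zero.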
Topics
- Self-Play Reinforcement Learning
- Contingent Action Capacity
- Deterministic Exploitation Attractor
- Action Space Perturbations
- Multi-Agent Robustness
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.