A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

2026-05-19 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Gaming & Interactive Media · Depth: Expert, extended

Summary

Self-play reinforcement learning (RL) agents exhibit a sharp structural threshold in decision capacity that determines whether they remain stable under asymmetric action-space perturbations. The researchers found that eliminating every positive-reach contingent decision, which reduces the reach-weighted contingent action capacity (CAC_w) to zero, causes rapid convergence to a deterministic exploitation attractor (DEA), a fixed point of near-maximal loss. The effect was observed in poker variants (Kuhn, Leduc, Leduc-4), a matrix game (Matching Pennies), and a dice game (Liar's Dice, up to 24,576 info sets), across six learning algorithms (Q-Learning, SARSA, REINFORCE, PPO, DQN, NFSP). Crucially, preserving even a single positive-reach contingent decision point prevents the collapse. The collapse is attributed to co-adaptation under constraint rather than to the perturbation itself; it is timing-invariant and fully reversible, and it intensifies with function approximation.

Key takeaway

Research scientists developing or deploying multi-agent reinforcement learning systems should recognize that structural changes to an agent's action space can induce a sharp, catastrophic collapse when every contingent decision point is eliminated. Ensure that agents retain at least one positive-reach contingent decision point so they keep some strategic flexibility and do not converge to a deterministic exploitation attractor, especially in competitive, zero-sum environments where co-adaptation can amplify vulnerabilities.

Key insights

A structural threshold in decision capacity governs self-play RL agent collapse under asymmetric action-space perturbations.

Principles

Zero contingent action capacity (CAC_w=0) leads to deterministic exploitation.
Co-adaptation under constraint drives catastrophic collapse.
A single positive-reach decision point prevents collapse.
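
The first principle can be illustrated with a toy Matching Pennies example. This is a minimal sketch and not the paper's experimental setup: the restricted row player is simply frozen on its one remaining action, and the column player uses stateless Q-learning with hypothetical hyperparameters (`episodes`, `lr`, `eps`, `seed`). It shows only the endpoint of the collapse, where a player with no contingent decision left is exploited deterministically at maximal loss.

```python
import random


def row_payoff(row_action: int, col_action: int) -> int:
    """Matching Pennies payoff to the row player: +1 on a match, -1 otherwise."""
    return 1 if row_action == col_action else -1


def train_column_player(row_actions, episodes=2000, lr=0.1, eps=0.1, seed=0):
    """Stateless epsilon-greedy Q-learning for the column player.

    The row player picks uniformly from `row_actions`; passing a single
    action models a player with no contingent decision left (assumption).
    """
    rng = random.Random(seed)
    q = [0.0, 0.0]  # column player's value estimate for each of its actions
    for _ in range(episodes):
        row = rng.choice(row_actions)
        if rng.random() < eps:
            col = rng.randrange(2)           # explore
        else:
            col = 0 if q[0] >= q[1] else 1   # exploit current estimate
        reward = -row_payoff(row, col)       # zero-sum: column gets the negation
        q[col] += lr * (reward - q[col])
    return q


# Row player restricted to action 0 only.
q = train_column_player(row_actions=[0])
greedy_col = 0 if q[0] >= q[1] else 1
print(greedy_col, row_payoff(0, greedy_col))  # prints: 1 -1
```

Passing `row_actions=[0, 1]` instead gives the row player a contingent choice again, and the column player's two value estimates no longer separate cleanly.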

Method

The study deterministically removed one player's ability to bet or raise at specified decision nodes in discrete, imperfect-information games, then observed the resulting self-play dynamics.
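
A perturbation of this kind can be sketched as an action mask applied to one player only. The function name `restrict_actions`, the string action labels, and the node identifiers below are hypothetical and chosen for illustration; the summary does not describe the authors' actual implementation.

```python
from typing import Iterable, List, Set

# Actions removed by the perturbation (labels are an assumption).
AGGRESSIVE_ACTIONS = frozenset({"bet", "raise"})


def restrict_actions(
    legal_actions: Iterable[str],
    node_id: str,
    player: int,
    restricted_player: int,
    restricted_nodes: Set[str],
) -> List[str]:
    """Return the actions available after the asymmetric perturbation.

    Only `restricted_player` is affected, and only at nodes listed in
    `restricted_nodes`; everywhere else the legal actions pass through.
    """
    actions = list(legal_actions)
    if player != restricted_player or node_id not in restricted_nodes:
        return actions
    kept = [a for a in actions if a not in AGGRESSIVE_ACTIONS]
    if not kept:
        # A node with no remaining action would make the game ill-defined.
        raise ValueError(f"restriction leaves no legal action at node {node_id!r}")
    return kept


# Player 0 loses the option to bet at node "J"; player 1 is untouched.
print(restrict_actions(["check", "bet"], "J", 0, 0, {"J"}))  # ['check']
print(restrict_actions(["check", "bet"], "J", 1, 0, {"J"}))  # ['check', 'bet']
```

Because the mask is a pure function of the node and the player, it can be switched on or off at any point during training without changing the learner itself.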

In practice

Monitor CAC_w in multi-agent RL deployments.
Design systems so that agents retain at least minimal strategic flexibility.
Avoid action-space restrictions that remove every contingent decision in competitive RL.
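
One way to act on these recommendations is a guard that rejects a proposed restriction if it would drive capacity to zero. The summary does not give the paper's formula for CAC_w, so the proxy below is an assumption: each information set contributes its reach probability times the number of legal actions beyond the first. The proxy is zero exactly when no positive-reach information set offers a choice. The names `cac_w_proxy` and `check_restriction` are hypothetical.

```python
from typing import Iterable, Tuple

# Each entry: (reach probability of the info set, number of legal actions there).
InfoSet = Tuple[float, int]


def cac_w_proxy(info_sets: Iterable[InfoSet]) -> float:
    """Assumed proxy for reach-weighted contingent action capacity.

    Info sets with a single legal action, or with zero reach, contribute 0.
    """
    return sum(reach * max(n_actions - 1, 0) for reach, n_actions in info_sets)


def check_restriction(info_sets: Iterable[InfoSet], tol: float = 1e-12) -> float:
    """Raise if the restricted agent has no positive-reach contingent decision."""
    value = cac_w_proxy(info_sets)
    if value <= tol:
        raise ValueError("restriction removes every positive-reach contingent decision")
    return value


# One reachable info set still offers two actions, so the guard passes.
print(check_restriction([(0.5, 2), (0.5, 1)]))  # 0.5
```

In this sketch, a restriction that leaves only forced moves at reachable information sets, such as `[(0.0, 2), (1.0, 1)]`, raises an error instead of being applied.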

Topics

Self-Play Reinforcement Learning
Contingent Action Capacity
Deterministic Exploitation Attractor
Action Space Perturbations
Multi-Agent Robustness

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.