When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

2026-05-19 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Expert, extended

Summary

Adversarial action masking, in which an attacker selectively removes legal actions from a victim's action set, poses a significant and distinct threat to self-play reinforcement learning (RL) agents. The attack is far more damaging than random masking or learned observation/action perturbations: adversarial removal is up to 4.8 times more effective. The vulnerability persists across diverse RL algorithms, including Q-learning, PPO, NFSP, neural NFSP, and DQN. It also persists as game complexity grows from 6 to 5,531 information states in poker variants, and it carries over to non-poker domains such as competitive gridworld and resource collection. The attack concentrates on high-value decision points, which the paper quantifies with reach-weighted contingent action capacity (CACw) and its value-weighted refinement (CACv). Victims show no recovery even under extended masked training. The research identifies action availability as a critical robustness surface in self-play RL.

Key takeaway

Research scientists developing multi-agent reinforcement learning systems should prioritize robustness mechanisms that specifically address targeted action-space attacks. Defenses must preserve strategic flexibility at high-reach, high-CACv decision points, because generic robustness to stochastic unavailability or simple action dropout proved insufficient. Consider mechanisms that identify and protect critical action pathways to prevent catastrophic performance degradation.

Key insights

Adversarial action removal severely degrades self-play RL agents by targeting high-value decision points, and victims do not recover even under extended masked training.

Principles

Action availability is a distinct robustness surface.
Adversarial masking is more efficient than random removal.
Self-play amplifies action removal attacks.

Method

The attack is framed as a bi-level optimization. An inner loop trains the RL agent under masked actions, while an outer-loop adversary learns which actions to remove. The adversary is trained with REINFORCE, using the negative of the victim's value as its reward signal.
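
The bi-level loop can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the paper's implementation: the REACH and VALUES tables are invented numbers, the removal budget is fixed at one action, and the inner-loop RL training is replaced by an exact best response over the remaining actions to keep the example short. Only the outer loop (REINFORCE with negative victim value as reward) follows the description above.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): 3 decision points, 3 actions each.
REACH = np.array([0.6, 0.3, 0.1])        # assumed probability of reaching each point
VALUES = np.array([[1.0, 0.2, 0.1],      # assumed victim payoff per action
                   [0.5, 0.4, 0.3],
                   [2.0, 0.0, 0.0]])


def victim_value(mask):
    """Inner-loop stand-in: the victim best-responds over actions left legal by `mask`.

    A real inner loop would retrain an RL agent under the mask; exact
    maximization is an assumption made here for brevity.
    """
    masked = np.where(mask, VALUES, -np.inf)
    return float(REACH @ masked.max(axis=1))


def train_adversary(steps=3000, lr=0.3, seed=0):
    """Outer loop: REINFORCE over which single (state, action) pair to remove."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = VALUES.shape
    logits = np.zeros(n_states * n_actions)  # one logit per removable pair
    baseline = 0.0                           # moving-average baseline for variance reduction
    for _ in range(steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        choice = rng.choice(len(logits), p=probs)

        mask = np.ones(VALUES.shape, dtype=bool)
        mask[divmod(choice, n_actions)] = False  # remove the sampled action

        reward = -victim_value(mask)             # adversary reward = negative victim value

        grad_logp = -probs                       # gradient of log softmax w.r.t. logits
        grad_logp[choice] += 1.0
        logits += lr * (reward - baseline) * grad_logp
        baseline += 0.05 * (reward - baseline)
    return logits


if __name__ == "__main__":
    learned = train_adversary()
    state, action = divmod(int(np.argmax(learned)), VALUES.shape[1])
    print("most likely removal: state", state, "action", action)
```

With these made-up numbers, removing the best action at the most-reached decision point costs the victim 0.48 in expected value, against 0.20 and 0.03 for the best actions at the other two points, so the adversary's policy is expected to concentrate there.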

In practice

Prioritize strategic flexibility at high-CACv states.
Uniform robustness methods are insufficient defenses.
Identify and preserve high-reach decision points.
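
One way to act on these points is to rank decision points by how much value is at stake if the preferred action disappears. The sketch below is a hypothetical proxy written for illustration; it is not the paper's definition of CACw or CACv, which this summary does not spell out. The score (reach probability times the gap between the best and second-best action value) and the function names are assumptions.

```python
from typing import Dict, List, Tuple


def flexibility_loss(values: List[float]) -> float:
    """Value lost if the best action is removed and the agent falls back to the runner-up."""
    if len(values) < 2:
        # Removing the only legal action leaves no move at all.
        return float("inf")
    top, runner_up = sorted(values, reverse=True)[:2]
    return top - runner_up


def rank_decision_points(
    reach: Dict[str, float], values: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Rank states by reach-weighted loss (an assumed proxy, not CACw/CACv itself)."""
    scores = {s: reach[s] * flexibility_loss(values[s]) for s in reach}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical reach probabilities and action-value estimates.
    reach = {"s0": 0.6, "s1": 0.3, "s2": 0.1}
    values = {"s0": [1.0, 0.2, 0.1], "s1": [0.5, 0.4, 0.3], "s2": [2.0, 0.0, 0.0]}
    for state, score in rank_decision_points(reach, values):
        print(state, round(score, 3))
```

States at the top of such a ranking are candidates for extra protection, for example by monitoring whether their action sets shrink at deployment time. A rarely reached state with a large value gap can still outrank a frequently reached state with near-equal alternatives, which is why reach alone is not a sufficient criterion.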

Topics

Adversarial Action Masking
Self-Play Reinforcement Learning
Contingent Action Capacity
Multi-Agent Robustness
Imperfect-Information Games

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.