GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
Summary
GT-HarmBench is a new AI safety benchmark introduced to evaluate frontier AI systems in high-stakes multi-agent environments, addressing a gap in existing single-agent benchmarks. Released on February 12, 2026, it comprises 2,009 scenarios drawn from the MIT AI Risk Repository, covering game-theoretic structures like Prisoner's Dilemma, Stag Hunt, and Chicken. Across 15 frontier models, agents chose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. The benchmark measures sensitivity to prompt framing and ordering, and analyzes reasoning patterns. It also demonstrates that game-theoretic interventions can improve socially beneficial outcomes by up to 18%, highlighting significant reliability gaps and providing a standardized testbed for multi-agent AI alignment research. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
Key takeaway
For researchers and engineers developing or deploying multi-agent AI systems, this research indicates that current frontier models have significant reliability gaps in strategic interactions. Consider building mechanism design principles, such as trusted mediators or pre-play communication, into your system architectures. In the benchmark, interventions of this kind improved socially beneficial outcomes by up to 18%, reducing risks like coordination failure and conflict in high-stakes scenarios.
Key insights
AI models struggle with multi-agent coordination, but game-theoretic interventions can improve socially optimal outcomes.
Principles
- Game-theoretic framing increases self-interested behavior.
- Order effects bias AI coordination abilities.
- Social welfare reasoning correlates with optimal outcomes.
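The order-effect principle above can be checked with a simple probe: present the same scenario twice, once with the action list reversed, and count how often the choice changes. The sketch below is illustrative only. `build_prompt`, `order_flip_rate`, and the stub `first_option_model` are hypothetical names, the prompt wording is an assumption, and none of this is taken from the benchmark's code.

```python
def build_prompt(scenario, actions):
    """Render a scenario and its actions as a plain-text prompt (assumed format)."""
    options = "\n".join(f"- {action}" for action in actions)
    return f"{scenario}\nChoose exactly one action:\n{options}"


def order_flip_rate(choose, cases):
    """Share of cases where reversing the option order changes the choice.

    `choose` is any callable mapping a prompt string to an action string;
    `cases` is a list of (scenario, actions) pairs.
    """
    if not cases:
        raise ValueError("cases must not be empty")
    flips = 0
    for scenario, actions in cases:
        original = choose(build_prompt(scenario, actions))
        swapped = choose(build_prompt(scenario, list(reversed(actions))))
        flips += original != swapped
    return flips / len(cases)


def first_option_model(prompt):
    """Stub 'model' with maximal order bias: always picks the first listed option."""
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    raise ValueError("prompt contains no options")


cases = [("Two labs decide whether to share safety findings.", ["share", "withhold"])]
print(order_flip_rate(first_option_model, cases))
```

A model whose answer depends only on the scenario would score 0.0 here. Replacing the stub with a real model call is left to the reader.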
Method
GT-HarmBench maps AI safety risks to six canonical 2x2 symmetric games and generates 2,009 scenarios from them. It evaluates 15 frontier models in self-play, scores them by utilitarian accuracy, and then tests five mechanism design interventions applied as prompt modifications.
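To make the scoring idea concrete, here is a minimal sketch of a 2x2 symmetric game and a utilitarian-accuracy style metric. It assumes that "utilitarian accuracy" means the fraction of played action profiles that maximize the sum of both players' payoffs; the paper's exact definition may differ. The payoff numbers are textbook Prisoner's Dilemma values, not benchmark data, and all names are hypothetical.

```python
# Illustrative Prisoner's Dilemma payoffs (assumed, not from the benchmark).
# Keys are (row_action, column_action); values are (row_payoff, column_payoff).
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}


def welfare_optimal_profiles(game):
    """Return the action profiles that maximize the sum of both payoffs."""
    best = max(sum(payoffs) for payoffs in game.values())
    return {profile for profile, payoffs in game.items() if sum(payoffs) == best}


def utilitarian_accuracy(game, played_profiles):
    """Fraction of played profiles that are welfare-optimal (assumed definition)."""
    if not played_profiles:
        raise ValueError("played_profiles must not be empty")
    optimal = welfare_optimal_profiles(game)
    hits = sum(1 for profile in played_profiles if profile in optimal)
    return hits / len(played_profiles)


# Four hypothetical self-play outcomes; two of them are mutual cooperation.
played = [
    ("cooperate", "cooperate"),
    ("defect", "defect"),
    ("cooperate", "defect"),
    ("cooperate", "cooperate"),
]
print(utilitarian_accuracy(PRISONERS_DILEMMA, played))
```

In self-play, both entries of each profile would come from the same model answering from each player's perspective.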
In practice
- Implement mechanism design interventions to improve AI outcomes.
- Prioritize social welfare reasoning in AI design.
- Be aware of prompt framing and order effects on AI behavior.
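Because the interventions are described as prompt modifications, the first recommendation above could be prototyped as small prompt wrappers. The sketch below is only a rough illustration: the wording of the added text, the function names, and the choice to report the change in percentage points are all assumptions, not the benchmark's actual interventions or metric.

```python
def with_mediator(prompt, recommended_action):
    """Append a trusted-mediator recommendation to a prompt (assumed wording)."""
    return (
        f"{prompt}\n\n"
        "A trusted mediator has privately given both parties the same advice. "
        f"The mediator recommends: {recommended_action}."
    )


def with_pre_play_message(prompt, message):
    """Append a non-binding message from the other party (assumed wording)."""
    return (
        f"{prompt}\n\n"
        f'Before choosing, the other party sent you this message: "{message}"'
    )


def rate_change(baseline_hits, treated_hits, total):
    """Change in the socially beneficial choice rate, as a fraction of all cases."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (treated_hits - baseline_hits) / total


base_prompt = "Two firms decide whether to run a costly safety audit."
print(with_mediator(base_prompt, "run the audit"))
print(with_pre_play_message(base_prompt, "I will run the audit if you do."))
```

Comparing a model's choices on the base and wrapped prompts, for example with `rate_change`, gives a first read on whether an intervention helps for a given scenario set.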
Topics
- AI Safety Benchmarking
- Multi-Agent AI
- Game Theory
- Mechanism Design
- Large Language Models
Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, AI Ethicist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.