GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

2025-03-19 · Source: cs.MA updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

GT-HarmBench is an AI safety benchmark that evaluates frontier AI systems in high-stakes multi-agent environments, a setting that existing single-agent benchmarks do not cover. Released on February 12, 2026, it comprises 2,009 scenarios drawn from the MIT AI Risk Repository and covers game-theoretic structures such as Prisoner's Dilemma, Stag Hunt, and Chicken. Across 15 frontier models, agents chose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. The benchmark also measures sensitivity to prompt framing and ordering, and analyzes reasoning patterns. Game-theoretic interventions improved socially beneficial outcomes by up to 18%. Together, these results highlight significant reliability gaps, and the benchmark provides a standardized testbed for multi-agent AI alignment research. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.

Key takeaway

For AI scientists and research scientists developing or deploying multi-agent AI systems, this research indicates that current frontier models exhibit significant reliability gaps in strategic interactions. Consider building mechanism design principles, such as trusted mediators or pre-play communication, into your system architectures. In the benchmark, such interventions improved socially optimal outcomes by up to 18%, which suggests they can help mitigate risks like coordination failure and conflict in high-stakes scenarios.

Key insights

AI models struggle with multi-agent coordination, but game-theoretic interventions can improve socially optimal outcomes.

Principles

Game-theoretic framing increases self-interested behavior.
The order in which options are presented biases model choices in coordination settings.
Social welfare reasoning correlates with optimal outcomes.

Method

GT-HarmBench maps AI safety risks to six canonical 2x2 symmetric games and generates 2,009 scenarios from them. It evaluates 15 frontier models in self-play and scores them by utilitarian accuracy. It then tests five mechanism design interventions, each implemented as a prompt modification.
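
The scoring idea can be illustrated with a minimal Python sketch. It is not the benchmark's code: the payoff numbers are illustrative, only the three games named in the summary are shown (the other three of the six are not listed here), and "utilitarian accuracy" is assumed to mean the fraction of self-play outcomes whose joint action maximizes the sum of both players' payoffs. The names `SymmetricGame`, `utilitarian_profiles`, and `utilitarian_accuracy` are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SymmetricGame:
    """A 2x2 symmetric game in R/S/T/P notation (actions: 'C' or 'D')."""
    name: str
    R: float  # both players cooperate
    S: float  # I cooperate, the other defects
    T: float  # I defect, the other cooperates
    P: float  # both players defect

    def payoffs(self, a: str, b: str) -> tuple[float, float]:
        """Return (payoff to player 1, payoff to player 2) for actions a, b."""
        table = {("C", "C"): self.R, ("C", "D"): self.S,
                 ("D", "C"): self.T, ("D", "D"): self.P}
        # Symmetry: player 2's payoff is the same table with roles swapped.
        return table[(a, b)], table[(b, a)]

    def utilitarian_profiles(self) -> set[tuple[str, str]]:
        """Joint actions that maximize the sum of both payoffs."""
        welfare = {p: sum(self.payoffs(*p)) for p in product("CD", repeat=2)}
        best = max(welfare.values())
        return {p for p, w in welfare.items() if w == best}


def utilitarian_accuracy(outcomes) -> float:
    """Assumed definition: share of outcomes that are welfare-maximizing."""
    hits = sum(profile in game.utilitarian_profiles()
               for game, profile in outcomes)
    return hits / len(outcomes)


# Illustrative payoffs only; the benchmark's scenarios define their own.
PD = SymmetricGame("Prisoner's Dilemma", R=3, S=0, T=5, P=1)  # T > R > P > S
SH = SymmetricGame("Stag Hunt", R=4, S=0, T=3, P=2)           # R > T >= P > S
CH = SymmetricGame("Chicken", R=3, S=1, T=4, P=0)             # T > R > S > P

# Hypothetical self-play results: (game, (action of copy 1, action of copy 2)).
outcomes = [
    (PD, ("C", "C")),
    (PD, ("D", "D")),
    (SH, ("C", "C")),
    (CH, ("C", "D")),
]
score = utilitarian_accuracy(outcomes)
```

With these payoffs, mutual cooperation is the only welfare-maximizing profile in all three games, so two of the four hypothetical outcomes count as hits.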

In practice

Apply mechanism design interventions, such as trusted mediators or pre-play communication, to steer agents toward socially beneficial outcomes.
Prioritize social welfare reasoning in AI design.
Be aware of prompt framing and order effects on AI behavior.
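
One way to act on the points above is to check whether a model's choice survives a reversal of the option order, with and without an intervention added to the prompt. The sketch below is a hedged illustration, not the benchmark's harness: `build_prompt`, `order_consistent`, the `MEDIATOR_PREAMBLE` wording, and the two toy models are all hypothetical, and a real test would replace the toy models with a call to an actual LLM.

```python
from typing import Callable, Sequence


def build_prompt(scenario: str, options: Sequence[str], preamble: str = "") -> str:
    """Assemble a plain-text prompt; an optional preamble carries the intervention."""
    lines = [preamble] if preamble else []
    lines += [scenario, "Options:"]
    lines += [f"- {opt}" for opt in options]
    lines.append("Answer with exactly one option.")
    return "\n".join(lines)


def order_consistent(choose: Callable[[str], str], scenario: str,
                     options: Sequence[str], preamble: str = "") -> bool:
    """True if the model picks the same option when the option order is reversed."""
    forward = choose(build_prompt(scenario, options, preamble))
    backward = choose(build_prompt(scenario, list(reversed(options)), preamble))
    return forward == backward


# Hypothetical wording for a mediator-style intervention (not the paper's prompt).
MEDIATOR_PREAMBLE = ("A trusted mediator recommends that both parties choose "
                     "the option that maximizes their combined outcome.")


def first_option_model(prompt: str) -> str:
    """Toy stand-in for an LLM with a pure position bias: picks the first option."""
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    raise ValueError("prompt contains no options")


def always_cooperate_model(prompt: str) -> str:
    """Toy stand-in for an LLM whose choice ignores option order."""
    return "cooperate"


scenario = "Two labs must each decide whether to share safety findings."
options = ["cooperate", "defect"]

biased = order_consistent(first_option_model, scenario, options)
stable = order_consistent(always_cooperate_model, scenario, options)
with_mediator = build_prompt(scenario, options, preamble=MEDIATOR_PREAMBLE)
```

Here `biased` is `False` because the position-biased toy model flips its answer when the options are reversed, while `stable` is `True`. Running the same check over many scenarios, with and without the preamble, gives a rough estimate of order sensitivity under each condition.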

Topics

AI Safety Benchmarking
Multi-Agent AI
Game Theory
Mechanism Design
Large Language Models

Code references

causalNLP/gt-harmbench

Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, AI Ethicist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.