GT-HarmBench introduces 2,009 high-stakes multi-agent scenarios built on game-theoretic structures such as the Prisoner's Dilemma to benchmark AI safety risks. Frontier models select the socially beneficial action only 62% of the time, with the remaining choices frequently leading to harmful outcomes. The benchmark, code, and analysis are available on GitHub.
Key Points
- 2,009 scenarios from the MIT AI Risk Repository
- Tests 15 frontier models across game structures (see the scoring sketch below)
- Interventions boost beneficial outcomes by 18%
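To make the evaluation concrete, here is a minimal sketch of how a game-theoretic scenario might be scored: a model's chosen action is compared against the welfare-maximizing action for the game's payoff matrix. The payoff values and the `score_choice` helper are illustrative assumptions, not GT-HarmBench's actual data format or API.

```python
# Prisoner's Dilemma payoffs: (row player, column player).
# Standard textbook values, assumed here for illustration.
PD = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"):    (0, 5),
    ("defect",    "cooperate"): (5, 0),
    ("defect",    "defect"):    (1, 1),
}

def socially_beneficial_action(payoffs) -> str:
    """Return the action maximizing total welfare when both players play it."""
    return max(
        {a for a, _ in payoffs},
        key=lambda a: sum(payoffs[(a, a)]),
    )

def score_choice(payoffs, model_action: str) -> bool:
    """True if the model picked the welfare-maximizing action."""
    return model_action == socially_beneficial_action(payoffs)

print(socially_beneficial_action(PD))  # cooperate
print(score_choice(PD, "defect"))      # False
```

Under this scoring rule, the headline 62% figure would correspond to the fraction of scenarios in which `score_choice` returns True.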
Impact Analysis
Exposes multi-agent coordination failures in AI systems. Offers a standardized testbed for alignment research. Highlights the need for game-theoretic safety improvements.
Technical Details
Evaluates sensitivity to prompt framing and common reasoning failures. Covers game structures such as the Stag Hunt and Chicken (illustrated in the sketch below). Scenarios are drawn from realistic AI risk contexts.
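The sketch below illustrates two of the covered game structures with standard textbook payoffs (not GT-HarmBench's calibrated values) and shows one way a framing-sensitivity probe could wrap the same game in different prompts. The `FRAMINGS` templates and `prompts_for` helper are hypothetical.

```python
# Stag Hunt: coordinated cooperation pays best, but hunting hare is safer.
STAG_HUNT = {
    ("stag", "stag"): (4, 4),
    ("stag", "hare"): (0, 3),
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),
}

# Chicken: mutual escalation is the worst outcome for both players.
CHICKEN = {
    ("swerve",   "swerve"):   (3, 3),
    ("swerve",   "straight"): (1, 4),
    ("straight", "swerve"):   (4, 1),
    ("straight", "straight"): (0, 0),
}

# A framing-sensitivity probe might present identical payoffs under
# different narrative framings and compare the model's choices.
FRAMINGS = [
    "You and another AI assistant must each choose an action...",
    "Two hospital triage systems must independently commit to a policy...",
]

def prompts_for(game_name: str) -> list[str]:
    """Render each framing variant for a given game structure."""
    return [f"{frame}\n[Game: {game_name}]" for frame in FRAMINGS]

for prompt in prompts_for("Stag Hunt"):
    print(prompt, end="\n---\n")
```

A model whose action flips between framings of the same payoff matrix is exhibiting exactly the prompt-framing sensitivity the benchmark measures.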
