BotzoneBench introduces a scalable framework for evaluating LLMs' strategic reasoning in interactive games by anchoring them to fixed hierarchies of skill-calibrated game AIs. It assesses five flagship models across eight diverse games using 177,047 state-action pairs, revealing clear performance gaps, with the models playing roughly at the level of mid-tier game AIs. Anchoring to a fixed AI ladder yields absolute, stably interpretable measurements at linear evaluation cost, unlike volatile LLM-vs-LLM rankings.
Key Points
- Anchors LLM evaluation to fixed AI skill hierarchies (see the sketch after this list)
- Covers 8 games, from board to card types
- Analyzes 177k state-action pairs from 5 top LLMs
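A minimal sketch of what anchoring to a skill ladder can look like: the model plays a batch of games against each calibrated bot tier and is credited with the highest tier it beats at a chosen win-rate threshold. The names (`play_match`, `anchor_skill`), the 50-game batch, and the 0.5 threshold are illustrative assumptions, not BotzoneBench's actual protocol.

```python
import random
from typing import Callable

# A policy maps a serialized game state to a move.
Policy = Callable[[str], str]

def play_match(llm: Policy, bot: Policy) -> bool:
    """Placeholder: play one game and return True if the LLM wins.
    A real harness would drive an actual game engine here."""
    return random.random() < 0.5  # stand-in result

def anchor_skill(llm: Policy, ladder: dict[int, Policy],
                 games_per_level: int = 50, win_threshold: float = 0.5) -> int:
    """Return the highest ladder level the LLM beats at or above win_threshold."""
    anchored = 0
    for level in sorted(ladder):
        wins = sum(play_match(llm, ladder[level]) for _ in range(games_per_level))
        if wins / games_per_level >= win_threshold:
            anchored = level   # LLM performs at least at this tier
        else:
            break              # stop at the first tier it cannot beat
    return anchored
```

Because the ladder is fixed, the returned level is an absolute reading that stays comparable across models and across time.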
Impact Analysis
Provides a consistent benchmark for tracking LLM progress in strategic domains over time. Cuts evaluation cost from quadratic to linear: pairwise LLM-vs-LLM tournaments grow with the square of the number of models, while matches against a fixed bot ladder grow only with the number of models (see the sketch below). The approach generalizes to any domain with a skill hierarchy, not just games.
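A back-of-the-envelope sketch of the cost argument: round-robin LLM-vs-LLM comparison needs one matchup per model pair, while anchored evaluation needs only one pass per model over a fixed bot ladder. The model and bot counts below are illustrative assumptions.

```python
def round_robin_matchups(num_models: int) -> int:
    """Pairwise LLM-vs-LLM comparison: one matchup per unordered model pair."""
    return num_models * (num_models - 1) // 2  # grows quadratically in models

def anchored_matchups(num_models: int, num_anchor_bots: int) -> int:
    """Each model plays only the fixed bot ladder: grows linearly in models."""
    return num_models * num_anchor_bots

# Illustrative counts (not the paper's setup):
for m in (5, 20, 100):
    print(m, round_robin_matchups(m), anchored_matchups(m, num_anchor_bots=10))
```

Adding a new model under the anchored scheme costs one ladder's worth of games, and previously measured models never need to be re-run.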
Technical Details
Built on the Botzone platform's competitive infrastructure. The game suite spans deterministic, perfect-information board games through stochastic, imperfect-information card games. arXiv:2602.13214v1.
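One way to picture the coverage axes is to encode each game by whether it involves chance events and whether part of the state is hidden. The sketch below is illustrative only; the entries are placeholders, since this summary does not name the benchmark's eight games.

```python
from dataclasses import dataclass
from enum import Enum

class Information(Enum):
    PERFECT = "perfect"      # all players see the full state (typical board games)
    IMPERFECT = "imperfect"  # some state is hidden (typical card games)

@dataclass(frozen=True)
class GameProfile:
    name: str
    stochastic: bool          # chance events such as shuffles or dice
    information: Information

# Placeholder entries illustrating the two axes of coverage.
EXAMPLES = [
    GameProfile("board-game-like", stochastic=False, information=Information.PERFECT),
    GameProfile("card-game-like", stochastic=True, information=Information.IMPERFECT),
]
```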