Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War
Summary
"Age of LLM" is a turn-based 1v1 benchmark for evaluating the reasoning, diplomacy, and reliability of large language models under "fog of war" conditions. Two LLMs compete on a 13x7 grid, each trying to destroy the enemy base. The game stresses the models in three ways: hidden information, full diplomatic options (including secret uranium), and a strict JSON schema for actions, under which illegal moves are silently discarded. The engine is private and uses fresh map seeds and opponents to prevent data contamination. A benchmark of 15 reasoning models across 54 matches and 5,258 actions found that a "nuclear rush" strategy dominates (78% on the v0.11+ sub-corpus) and is often executed mechanically. Military conquest is rare but faster (12.3 vs 18.9 turns), while diplomacy is frequent but seldom successful. Approximately 58% of illegal actions stem from fog/state errors, which points to belief-tracking problems. An exploratory finding suggests a weak link between reliability and winning.
Key takeaway
If you are an AI scientist evaluating LLMs for complex multi-agent systems, be aware that traditional benchmarks may not fully expose critical reasoning and reliability flaws. This research shows that LLMs struggle with belief-tracking under "fog of war" and rarely bring diplomacy to a successful conclusion, even when they negotiate often. Consider adding game-theoretic or adversarial benchmarks like "Age of LLM" to your evaluation pipeline to assess how well a model adheres to strict protocols and manages uncertainty, rather than relying on task-completion metrics alone.
Key insights
Benchmarking LLMs in a strategic game reveals their reasoning, diplomatic, and reliability limitations under adversarial conditions.
Principles
- Fog of war, diplomacy, and strict action schemas effectively stress LLM capabilities.
- Private engines and fresh map seeds mitigate data contamination in benchmarks.
- Illegal action rates can quantify LLM belief-tracking under uncertainty.
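As a rough illustration of the last principle, the sketch below computes an overall illegal-action rate and the share of illegal actions attributable to each cause, which is the kind of breakdown behind the "approximately 58% from fog/state errors" figure. The log format (a list of dicts with `legal` and `cause` keys) and the cause labels are assumptions made for this example; they are not the benchmark's actual replay format.

```python
from collections import Counter

def illegal_action_breakdown(log):
    """Return (illegal_rate, share_by_cause) for a list of action records.

    Each record is a hypothetical dict: {"legal": bool, "cause": str or None}.
    """
    total = len(log)
    illegal = [entry for entry in log if not entry["legal"]]
    # Fraction of all submitted actions that were rejected.
    rate = len(illegal) / total if total else 0.0
    # Among rejected actions only, the fraction due to each cause.
    causes = Counter(entry["cause"] for entry in illegal)
    shares = {cause: n / len(illegal) for cause, n in causes.items()}
    return rate, shares

# Hypothetical records, for illustration only.
example_log = [
    {"legal": True, "cause": None},
    {"legal": False, "cause": "fog_state"},  # acted on a stale or hidden state
    {"legal": False, "cause": "schema"},     # malformed or off-schema JSON
    {"legal": True, "cause": None},
]
rate, shares = illegal_action_breakdown(example_log)
```

A high share for the fog/state category, relative to schema errors, would suggest that a model's mistakes come from tracking the game state rather than from formatting its output.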
Method
Two LLMs compete in a turn-based 1v1 game on a 13x7 grid. Each tries to destroy the enemy base under fog of war, with full diplomacy available, and must submit every action in a strict JSON schema.
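The strict-schema rule can be pictured with a minimal sketch like the one below, in which any action that fails validation is dropped without feedback. Only the 13x7 grid size and the silent-discard behavior come from the summary; the action fields (`type`, `target`), the set of action types, and the visibility check are hypothetical stand-ins, since the real schema and engine are private.

```python
import json

GRID_W, GRID_H = 13, 7                       # grid size stated in the summary
ALLOWED_TYPES = {"move", "attack", "build"}  # hypothetical action types

def parse_action(raw, visible):
    """Return a validated action dict, or None if the action is illegal."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON
    # Require exactly the expected keys (assumed schema).
    if not isinstance(action, dict) or set(action) != {"type", "target"}:
        return None
    if not isinstance(action["type"], str) or action["type"] not in ALLOWED_TYPES:
        return None
    target = action["target"]
    if not isinstance(target, list) or len(target) != 2:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in target):
        return None
    x, y = target
    if not (0 <= x < GRID_W and 0 <= y < GRID_H):
        return None  # off the board
    if action["type"] == "attack" and (x, y) not in visible:
        return None  # fog/state error: target tile is not currently visible
    return action

def apply_turn(raw_actions, visible):
    """Keep legal actions; illegal ones are discarded with no error message."""
    parsed = (parse_action(raw, visible) for raw in raw_actions)
    return [action for action in parsed if action is not None]
```

Because rejected actions produce no error message, a model in this setup only learns that a move failed by noticing that the game state did not change, which is part of what makes the format a reliability test.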
In practice
- Analyze LLM turn-by-turn traces for belief-tracking and cognitive "personas".
- Utilize the released replay format and isometric viewer for detailed match analysis.
Topics
- LLM Benchmarking
- Strategic Games
- Fog of War
- Diplomacy
- Model Reliability
- Belief Tracking
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.