Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Summary
Researchers from KRAFTON, Seoul National University, NVIDIA, and the University of Wisconsin-Madison introduce Orak, a foundational benchmark for training and evaluating Large Language Model (LLM) agents across 12 diverse, real-world video games. It addresses three limitations of existing evaluations: they often rely on text-only games or 2D-grid simulators, they do not comprehensively assess agentic modules, and they provide no fine-tuning datasets. Orak includes popular titles such as "Street Fighter III", "Super Mario", "Minecraft", and "StarCraft II", and spans six major genres: action, adventure, role-playing, simulation, strategy, and puzzle. It features a plug-and-play interface based on the Model Context Protocol (MCP) for LLM-game interaction and offers a fine-tuning dataset of expert LLM gameplay trajectories. In evaluations of 12 LLMs, proprietary models generally outperform open-source ones, with Gemini-2.5-pro ranking highest, and fine-tuning significantly improves smaller LLMs' generalization.
Key takeaway
For research scientists developing LLM agents for complex, dynamic environments, Orak offers a shared testbed for evaluation and development. Use its diverse game set and plug-and-play interface to test LLM capabilities and agentic strategies systematically. The fine-tuning dataset offers a way to improve smaller models' generalization. Two caveats apply: visual input can sometimes hinder performance, and the best agentic strategy depends on the LLM's inherent strength.
Key insights
Orak is a benchmark for LLM game agents, featuring diverse real games, a plug-and-play interface, and a fine-tuning dataset.
Principles
- Diverse game genres enable comprehensive LLM capability assessment.
- The impact of agentic modules varies with the LLM's inherent capability.
- Fine-tuning on expert trajectories enhances LLM generalization.
Method
Orak uses a Model Context Protocol (MCP) interface to connect LLMs with 12 real video games. It provides a fine-tuning dataset of expert LLM gameplay trajectories, including reflection, planning, and action sequences.
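To make the shape of such a trajectory concrete, here is a minimal Python sketch of how one step might be turned into a supervised fine-tuning example. The `TrajectoryStep` class, its field names, the prompt layout, and the sample values are all assumptions made for illustration; they are not Orak's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryStep:
    """One hypothetical step of an expert gameplay trajectory."""
    game: str
    observation: str  # textual game state shown to the LLM
    reflection: str   # the expert's assessment of the previous step
    plan: str         # the expert's short-term plan
    action: str       # the action finally sent to the game


def to_sft_example(step: TrajectoryStep) -> dict:
    """Turn a step into a prompt/completion pair for supervised fine-tuning."""
    prompt = f"Game: {step.game}\nState: {step.observation}\n"
    completion = (
        f"Reflection: {step.reflection}\n"
        f"Plan: {step.plan}\n"
        f"Action: {step.action}"
    )
    return {"prompt": prompt, "completion": completion}


# Made-up sample values, used only to exercise the function.
step = TrajectoryStep(
    game="Super Mario",
    observation="Gap ahead, enemy on the far side.",
    reflection="The last jump was too short.",
    plan="Take a running start, then jump.",
    action="run_right_jump",
)
example = to_sft_example(step)
print(example["completion"])
```

Keeping reflection, plan, and action in a single completion is one possible design; training each module on its own prompt would be an equally plausible layout.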
In practice
- Use Orak to benchmark LLM agents across 12 real-world games.
- Consider fine-tuning smaller LLMs on Orak's expert trajectories.
- Evaluate how effective each agentic module is for the given LLM size and task.
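The last point can be approached as a small ablation grid: run the same LLM on each game with every on/off combination of agentic modules and compare mean scores. The sketch below assumes two module toggles and a hypothetical `play_episode(game, config)` callable that returns a score; neither is part of Orak's published interface, and `dummy_episode` is only a stand-in so the code runs.

```python
from itertools import product
from statistics import mean
from typing import Callable

# Hypothetical module toggles; the names are for illustration only.
MODULES = ("reflection", "planning")


def module_configs() -> list[dict[str, bool]]:
    """Enumerate every on/off combination of the agentic modules."""
    return [
        dict(zip(MODULES, flags))
        for flags in product((False, True), repeat=len(MODULES))
    ]


def run_ablation(
    play_episode: Callable[[str, dict[str, bool]], float],
    games: list[str],
    episodes: int = 3,
) -> dict[tuple[str, str], float]:
    """Return the mean score for each (game, module combination) pair."""
    results = {}
    for config in module_configs():
        label = "+".join(m for m, on in config.items() if on) or "none"
        for game in games:
            scores = [play_episode(game, config) for _ in range(episodes)]
            results[(game, label)] = mean(scores)
    return results


def dummy_episode(game: str, config: dict[str, bool]) -> float:
    """Stand-in scorer so the sketch runs; replace with a real game rollout."""
    return float(len(game) + sum(config.values()))


results = run_ablation(dummy_episode, ["Minecraft", "StarCraft II"], episodes=2)
for key, score in sorted(results.items()):
    print(key, score)
```

Repeating the grid for LLMs of different sizes would show whether a module that helps one model also helps another.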
Topics
- LLM Agents
- Game Benchmarking
- Video Games
- Model Context Protocol
- Agentic Workflows
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.