Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

2025-05-13 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Gaming & Interactive Media · Depth: Expert, extended

Summary

Researchers from KRAFTON, Seoul National University, NVIDIA, and the University of Wisconsin-Madison introduce Orak, a foundational benchmark for training and evaluating Large Language Model (LLM) agents across 12 popular, real video games. The benchmark targets three gaps in existing evaluations: reliance on text-only games or 2D-grid simulators, no comprehensive assessment of agentic modules, and no fine-tuning datasets. Orak includes titles such as "Street Fighter III", "Super Mario", "Minecraft", and "StarCraft II", spanning six major genres: action, adventure, role-playing, simulation, strategy, and puzzle. It provides a plug-and-play interface based on the Model Context Protocol (MCP) for connecting LLMs to games, and a fine-tuning dataset of expert LLM gameplay trajectories. Evaluations of 12 LLMs show that proprietary models generally outperform open-source ones, with Gemini-2.5-pro ranking highest, and that fine-tuning significantly improves smaller LLMs' generalization.

Key takeaway

For research scientists developing LLM agents for complex, dynamic environments, Orak offers a testbed for both evaluation and training. Use its diverse game set and plug-and-play interface to test LLM capabilities and agentic strategies systematically. The fine-tuning dataset offers a path to better generalization in smaller models. Two caveats: visual input can sometimes hinder performance, and the best agentic strategy depends on the underlying LLM's inherent strength.

Key insights

Orak is a benchmark for LLM game agents, featuring diverse real games, a plug-and-play interface, and a fine-tuning dataset.

Principles

Diverse game genres enable comprehensive LLM capability assessment.
The impact of agentic modules varies with the LLM's inherent capability.
Fine-tuning on expert trajectories enhances LLM generalization.

Method

Orak uses a Model Context Protocol (MCP) interface to connect LLMs with 12 real video games. It provides a fine-tuning dataset of expert LLM gameplay trajectories, including reflection, planning, and action sequences.
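
The reflection, planning, and action sequence described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Orak's actual API: `GameEnv`, `CountdownEnv`, `Step`, `run_episode`, `scripted_llm`, and `to_finetune_samples` are hypothetical names, the LLM is stubbed as a plain callable, the game is a toy counter, and the MCP transport is left out entirely. The prompt/completion layout is also an assumption, since the summary does not specify the dataset schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol, Tuple


class GameEnv(Protocol):
    """Hypothetical game interface; a stand-in for whatever the MCP layer exposes."""

    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, float, bool]: ...


@dataclass
class Step:
    """One recorded step of a gameplay trajectory."""
    observation: str
    reflection: str
    plan: str
    action: str
    reward: float


class CountdownEnv:
    """Toy environment: the episode ends when the counter reaches zero."""

    def __init__(self, start: int = 3) -> None:
        self.start = start
        self.counter = start

    def reset(self) -> str:
        self.counter = self.start
        return f"counter={self.counter}"

    def step(self, action: str) -> Tuple[str, float, bool]:
        if action == "decrement":
            self.counter -= 1
        done = self.counter <= 0
        return f"counter={self.counter}", (1.0 if done else 0.0), done


LLM = Callable[[str], str]  # assumption: the model is called with a text prompt


def scripted_llm(prompt: str) -> str:
    """Deterministic stub standing in for a real model call."""
    if prompt.startswith("Choose"):
        return "decrement"
    if prompt.startswith("Plan"):
        return "reduce the counter"
    return "no issues so far"


def run_episode(env: GameEnv, llm: LLM, max_steps: int = 10) -> List[Step]:
    """Run reflect -> plan -> act each turn and record every step."""
    trajectory: List[Step] = []
    obs = env.reset()
    last_outcome = "none"
    for _ in range(max_steps):
        reflection = llm(f"Reflect on the last outcome: {last_outcome}")
        plan = llm(f"Plan given state {obs} and reflection {reflection}")
        action = llm(f"Choose an action given plan {plan}")
        next_obs, reward, done = env.step(action)
        trajectory.append(Step(obs, reflection, plan, action, reward))
        last_outcome = f"{action} -> {next_obs}"
        obs = next_obs
        if done:
            break
    return trajectory


def to_finetune_samples(trajectory: List[Step]) -> List[Dict[str, str]]:
    """Turn each step into a prompt/completion pair (assumed layout)."""
    return [
        {
            "prompt": s.observation,
            "completion": (
                f"Reflection: {s.reflection}\nPlan: {s.plan}\nAction: {s.action}"
            ),
        }
        for s in trajectory
    ]


trajectory = run_episode(CountdownEnv(start=3), scripted_llm)
samples = to_finetune_samples(trajectory)
```

Recording the reflection and plan alongside the action is what lets a smaller model be trained on the full reasoning sequence rather than on actions alone. A real setup would replace `scripted_llm` with a model call and `CountdownEnv` with a game exposed through the benchmark's interface.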

In practice

Use Orak to benchmark LLM agents across 12 real-world games.
Consider fine-tuning smaller LLMs on Orak's expert trajectories.
Evaluate agentic module effectiveness based on LLM size and task.
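
One way to carry out the last recommendation is to log a score for every combination of model, agentic-module configuration, and game, then compare configurations per model. The sketch below assumes a hypothetical record layout and helper names (`mean_score_by_config`, `best_config_per_model`). The numbers are synthetic placeholders chosen only to exercise the code. They are not results from the paper and say nothing about which configuration actually works better.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (model, module_config, game, score)
Record = Tuple[str, str, str, float]


def mean_score_by_config(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Average score over games for each (model, module_config) pair."""
    buckets: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for model, config, _game, score in records:
        buckets[(model, config)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


def best_config_per_model(records: Iterable[Record]) -> Dict[str, str]:
    """Pick the module configuration with the highest mean score per model."""
    means = mean_score_by_config(records)
    best: Dict[str, str] = {}
    for (model, config), avg in means.items():
        if model not in best or avg > means[(model, best[model])]:
            best[model] = config
    return best


# Synthetic placeholder scores, NOT measurements from the benchmark.
records: List[Record] = [
    ("small-model", "action-only", "game-a", 10.0),
    ("small-model", "action-only", "game-b", 20.0),
    ("small-model", "reflect+plan", "game-a", 30.0),
    ("small-model", "reflect+plan", "game-b", 40.0),
    ("large-model", "action-only", "game-a", 80.0),
    ("large-model", "action-only", "game-b", 70.0),
    ("large-model", "reflect+plan", "game-a", 60.0),
    ("large-model", "reflect+plan", "game-b", 50.0),
]

means = mean_score_by_config(records)
best = best_config_per_model(records)
```

Averaging raw scores across games assumes they share a comparable scale. If they do not, normalize per game before averaging.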

Topics

LLM Agents
Game Benchmarking
Video Games
Model Context Protocol
Agentic Workflows

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.