Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Gaming & Interactive Media · Depth: Expert, extended

Summary

Researchers from KRAFTON, Seoul National University, NVIDIA, and the University of Wisconsin-Madison introduce Orak, a foundational benchmark for training and evaluating Large Language Model (LLM) agents across 12 diverse, real-world video games. The benchmark addresses three limitations of existing evaluations: they often rely on text-only games or 2D-grid simulators, they do not comprehensively assess agentic modules, and they do not provide fine-tuning datasets. Orak includes popular titles such as "Street Fighter III", "Super Mario", "Minecraft", and "StarCraft II", spanning six major genres: action, adventure, role-playing, simulation, strategy, and puzzle. It features a plug-and-play interface based on the Model Context Protocol (MCP) for LLM-game interaction and offers a fine-tuning dataset of expert LLM gameplay trajectories. Evaluations of 12 LLMs show that proprietary models generally outperform open-source ones, with Gemini-2.5-pro ranking highest, and that fine-tuning significantly improves smaller LLMs' generalization.

Key takeaway

For research scientists developing LLM agents for complex, dynamic environments, Orak offers a common testbed for evaluation and development. Use its diverse game set and plug-and-play interface to test LLM capabilities and agentic strategies systematically. The provided fine-tuning dataset offers a way to improve smaller models' generalization. Two caveats apply: visual input can sometimes hinder performance, and the best agentic strategy depends on the underlying LLM's strength.

Key insights

Orak is a benchmark for LLM game agents, featuring diverse real games, a plug-and-play interface, and a fine-tuning dataset.

Method

Orak uses a Model Context Protocol (MCP) interface to connect LLMs with 12 real video games. It provides a fine-tuning dataset of expert LLM gameplay trajectories, including reflection, planning, and action sequences.
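
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Orak's actual interface: `GameEnv`, `CountdownGame`, `run_episode`, `to_jsonl`, and `echo_llm` are hypothetical names, the game is a toy stand-in, and the real benchmark exposes games through MCP rather than a local Python object. The split into separate reflection, planning, and action prompts is also an assumption made for clarity.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Protocol


class GameEnv(Protocol):
    """Hypothetical game-side interface; Orak's real MCP tools may differ."""

    def observe(self) -> str: ...

    def step(self, action: str) -> bool: ...  # True when the episode ends


@dataclass
class Step:
    """One trajectory record: what the agent saw, thought, and did."""

    observation: str
    reflection: str
    plan: str
    action: str


class CountdownGame:
    """Toy stand-in for a real game: the episode ends after `turns` actions."""

    def __init__(self, turns: int) -> None:
        self.remaining = turns

    def observe(self) -> str:
        return f"{self.remaining} turns remaining"

    def step(self, action: str) -> bool:
        self.remaining -= 1
        return self.remaining <= 0


def run_episode(
    env: GameEnv, llm: Callable[[str], str], max_steps: int = 100
) -> list[Step]:
    """Run reflect -> plan -> act until the game ends, recording every step."""
    trajectory: list[Step] = []
    for _ in range(max_steps):
        obs = env.observe()
        reflection = llm(f"Reflect on the last outcome. State: {obs}")
        plan = llm(f"Plan the next move. State: {obs}. Reflection: {reflection}")
        action = llm(f"Choose one action. Plan: {plan}")
        trajectory.append(Step(obs, reflection, plan, action))
        if env.step(action):
            break
    return trajectory


def to_jsonl(trajectory: list[Step]) -> str:
    """Serialize a trajectory as JSON Lines, one record per step."""
    return "\n".join(json.dumps(asdict(s)) for s in trajectory)


def echo_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns a deterministic string."""
    return "noop:" + prompt.split(".")[0].lower()
```

Calling `to_jsonl(run_episode(CountdownGame(3), echo_llm))` yields three JSON records, each with `observation`, `reflection`, `plan`, and `action` fields. Records of this shape are one plausible way to store gameplay trajectories for supervised fine-tuning; the dataset's real schema may differ.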

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Robotics Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.