JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Source: Computation and Language · Field: Technology & Digital — Software Development & Engineering, Artificial Intelligence & Machine Learning, Gaming & Interactive Media · Depth: Expert, quick

Summary

JAMER introduces JamSet and JamBench, the first project-level game code framework dataset and benchmark specifically designed for professional game engines. Built upon the Godot engine, this initiative addresses the gap in large-scale datasets for AI-driven project-level code engineering. The dataset was curated from thousands of open-source projects originating from Game Jam competitions, distilling 8,133 verified projects from over 240,000 repositories. JamBench, comprising 300 manually verified projects, facilitates theme-driven generation and code completion tasks, evaluated using compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Initial evaluations of 9 frontier models revealed a significant capability cliff, with runtime pass rates plummeting from 80.4% on small projects to 5.7% on larger ones (Task2a). This suggests architectural design, rather than syntactic correctness, is the primary bottleneck for AI in complex game code generation. All data and code are publicly available.
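To put the quoted figures in proportion, here is a small arithmetic sketch. It uses only the numbers stated in the summary; the variable and function names are illustrative, and because the source says "over 240,000" repositories, the computed retention rate is an upper bound rather than an exact value.

```python
# Figures quoted in the summary; nothing here comes from the dataset itself.
REPOSITORIES_SCANNED = 240_000  # "over 240,000", so treated as a lower bound
PROJECTS_KEPT = 8_133           # verified projects
SMALL_RATE = 80.4               # percent, small projects (Task2a)
LARGE_RATE = 5.7                # percent, larger projects (Task2a)


def retention_upper_bound(kept: int, scanned_lower_bound: int) -> float:
    """Share of scanned repositories that survived curation, in percent."""
    return 100.0 * kept / scanned_lower_bound


def cliff(small: float, large: float) -> tuple[float, float]:
    """Return the drop in percentage points and as a share of the small-project rate."""
    absolute = small - large
    relative = 100.0 * absolute / small
    return absolute, relative


retention = retention_upper_bound(PROJECTS_KEPT, REPOSITORIES_SCANNED)
absolute_drop, relative_drop = cliff(SMALL_RATE, LARGE_RATE)

print(f"retention: at most {retention:.2f}% of repositories")
print(f"drop: {absolute_drop:.1f} points ({relative_drop:.1f}% relative)")
```

Read this way, fewer than 4 in 100 scanned repositories survive curation, and the reported rate on larger projects is less than a tenth of the rate on small ones.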

Key takeaway

For AI engineers building code generation models for professional game engines, this research points to a needed shift in focus. Even as compilation rates improve, the evaluated models show a severe capability cliff in runtime behavioral quality, which falls from 80.4% on small projects to 5.7% on larger ones (Task2a). This indicates that architectural design, rather than syntactic correctness alone, is the primary bottleneck. Prioritize research into AI agents that can generate coherent project-level architecture and complex behavioral logic.

Key insights

JAMER provides the first project-level game code dataset and benchmark, exposing AI's architectural design limitations in complex game development.

Principles

Method

A deterministic verification pipeline evaluates each game project in stages, from file integrity through compilation pass rate to SCS and BAS, collecting runtime behavior along the way.
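A staged pipeline of this kind could be organized roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Report` type, the `verify` function, and the pluggable `compile_check`, `score_structure`, and `score_behavior` callables are all hypothetical, and the summary does not define how SCS or BAS are computed. The only engine-specific fact used is that a Godot project keeps a `project.godot` file at its root. Stopping at the first failed stage is also an assumption.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence


@dataclass
class Report:
    """Outcome of one project; later fields stay unset if an earlier stage fails."""
    files_ok: bool = False
    compiles: bool = False
    scs: Optional[float] = None  # Structural Completeness Score (scoring is a stand-in)
    bas: Optional[float] = None  # Behavioral Alignment Score (scoring is a stand-in)


def has_project_file(project_dir: Path) -> bool:
    """Assumed file-integrity check: a Godot project root contains project.godot."""
    return (project_dir / "project.godot").is_file()


def verify(
    project_dir: Path,
    compile_check: Callable[[Path], bool],
    score_structure: Callable[[Path], float],
    score_behavior: Callable[[Path], float],
) -> Report:
    """Run the stages in a fixed order and stop at the first failure."""
    report = Report()
    report.files_ok = has_project_file(project_dir)
    if not report.files_ok:
        return report
    report.compiles = compile_check(project_dir)
    if not report.compiles:
        return report
    report.scs = score_structure(project_dir)
    report.bas = score_behavior(project_dir)
    return report


def compilation_pass_rate(reports: Sequence[Report]) -> float:
    """Percentage of projects that passed the compile stage."""
    if not reports:
        return 0.0
    return 100.0 * sum(r.compiles for r in reports) / len(reports)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good"
        good.mkdir()
        (good / "project.godot").write_text("; placeholder project file\n")
        empty = Path(tmp) / "empty"
        empty.mkdir()
        # Placeholder checkers; a real run would call the engine and scorers here.
        reports = [
            verify(p, lambda _: True, lambda _: 1.0, lambda _: 0.5)
            for p in (good, empty)
        ]
        print(reports, compilation_pass_rate(reports))
```

Passing the checkers in as callables keeps the stage order deterministic while leaving the engine invocation and the scoring logic, which the summary does not describe, outside the sketch.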

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.