JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines
Summary
JAMER introduces JamSet and JamBench, described as the first project-level game code framework dataset and benchmark designed specifically for professional game engines. Built on the Godot engine, the work addresses the lack of large-scale datasets for AI-driven project-level code engineering. The dataset was curated from open-source projects originating in Game Jam competitions: 8,133 verified projects distilled from over 240,000 repositories. JamBench comprises 300 manually verified projects and supports theme-driven generation and code completion tasks, evaluated with compilation pass rates, Structural Completeness Score (SCS), and Behavioral Alignment Score (BAS). Initial evaluations of 9 frontier models revealed a sharp capability cliff: on Task2a, runtime pass rates fell from 80.4% on small projects to 5.7% on larger ones. This suggests that architectural design, rather than syntactic correctness, is the primary bottleneck for AI in complex game code generation. All data and code are publicly available.
Key takeaway
For AI engineers developing code generation models for professional game engines, this research signals a need to shift focus. Current models are improving on compilation rates, yet their runtime pass rates fall off a cliff on larger projects, dropping to 5.7%. This indicates that architectural design, not just syntactic correctness, is the primary bottleneck. Prioritize research into AI agents that can generate coherent project-level architecture and complex behavioral logic.
Key insights
JAMER provides the first project-level game code dataset and benchmark, exposing AI's architectural design limitations in complex game development.
Principles
- AI models face a capability cliff in project-level game code.
- Architectural design, not syntax, bottlenecks AI game code generation.
- Game Jam projects offer rich, open-source data for code datasets.
Method
A deterministic verification pipeline evaluates each game project in stages, from file integrity through compilation to collected runtime behavior, and reports compilation pass rates, SCS, and BAS.
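A staged pipeline of this kind can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the stage names, the `run_stages` and `pass_rate` helpers, and the early-stop behavior are hypothetical and are not taken from JAMER's actual implementation. Only a file-integrity check is implemented (a Godot project has a `project.godot` file at its root); compilation and runtime stages would be plugged in as further check functions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical types for illustration; not JAMER's real pipeline.
Check = Callable[[Path], bool]


@dataclass
class StageResult:
    stage: str
    passed: bool


def has_project_file(project_dir: Path) -> bool:
    """File-integrity check: Godot projects keep project.godot at their root."""
    return (project_dir / "project.godot").is_file()


def run_stages(project_dir: Path, stages: list[tuple[str, Check]]) -> list[StageResult]:
    """Run checks in a fixed order and stop at the first failure.

    The fixed order and early stop make the outcome deterministic
    for a given project and set of checks.
    """
    results: list[StageResult] = []
    for name, check in stages:
        passed = check(project_dir)
        results.append(StageResult(name, passed))
        if not passed:
            break
    return results


def pass_rate(all_results: list[list[StageResult]], stage: str) -> float:
    """Fraction of projects that passed `stage`.

    Projects that failed earlier and never reached the stage count as failures.
    """
    if not all_results:
        return 0.0
    passed = sum(
        any(r.stage == stage and r.passed for r in results)
        for results in all_results
    )
    return passed / len(all_results)
```

Counting projects that never reach a stage as failures keeps the denominators identical across stages, so the pass rates of successive stages can be compared directly.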
In practice
- Use Godot's text-based file format for automated code analysis.
- Use SCS and BAS to evaluate the quality of AI-generated game projects.
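To illustrate why a text-based format helps automated analysis, here is a minimal sketch that lists the nodes declared in a Godot `.tscn` scene file. Scene files use bracketed section headers such as `[node name="Player" type="CharacterBody2D"]`. The `parse_scene_nodes` helper and the sample scene are hypothetical, and the sketch assumes simple double-quoted attribute values; it ignores property lines and is not a full parser, nor the tooling used by JAMER.

```python
import re

# A section header looks like: [node name="Player" type="CharacterBody2D"]
HEADER_RE = re.compile(r"^\[(\w+)(.*)\]\s*$")
# Only simple key="value" attributes are captured (assumption for this sketch).
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_scene_nodes(tscn_text: str) -> list[dict[str, str]]:
    """Return the quoted attributes of every [node ...] header, in file order."""
    nodes: list[dict[str, str]] = []
    for line in tscn_text.splitlines():
        match = HEADER_RE.match(line.strip())
        if match and match.group(1) == "node":
            nodes.append(dict(ATTR_RE.findall(match.group(2))))
    return nodes


# Hypothetical sample scene, written for this example.
SAMPLE = '''[gd_scene load_steps=2 format=3]

[ext_resource type="Script" path="res://player.gd" id="1"]

[node name="Player" type="CharacterBody2D"]
script = ExtResource("1")

[node name="Sprite" type="Sprite2D" parent="."]
'''

if __name__ == "__main__":
    for node in parse_scene_nodes(SAMPLE):
        print(node)
```

Because the scene tree is plain text, structural properties such as node counts or parent links can be extracted without launching the engine.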
Topics
- AI Game Development
- Code Generation
- Godot Engine
- Project-Level Code
- Software Engineering
- Code Quality Benchmarking
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.