JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines
Summary
JAMER introduces JamSet and JamBench, the first project-level game code framework dataset and benchmark for professional game engines. Both are built on the Godot engine and draw on thousands of open-source projects from Game Jam competitions. A deterministic verification pipeline, running from file integrity checks to runtime behavior collection, distilled 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench for benchmarking, and the remaining 7,833 form JamSet for training. The benchmark defines theme-driven generation and multi-granularity code completion tasks. Evaluation uses compilation pass rates, the Structural Completeness Score (SCS), and the Behavioral Alignment Score (BAS). Initial evaluations of 9 frontier models revealed a significant "capability cliff": on function-level completion, runtime pass rates fell from 80.4% on small projects to 5.7% on large ones. Code Agents improved compilation but not behavioral quality, which points to architectural design as a key bottleneck.
Key takeaway
AI engineers developing game code generation models should recognize that current frontier models face a significant "capability cliff" on project-level tasks, especially as project scale grows. Prioritize fine-tuning on domain-specific datasets such as JamSet to instill human-like engineering practices. Evaluation must extend beyond compilation: include structural completeness (SCS) and runtime behavioral alignment (BAS) to assess whether the generated game actually works.
Key insights
Project-level game code generation on professional engines requires specialized datasets and deterministic, multi-dimensional evaluation.
Principles
- Game Jam projects provide scalable, real-world code artifacts.
- Deterministic verification is crucial for game code evaluation.
- Model capability degrades sharply as project scale increases.
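The scale-dependent drop in the last principle can be made visible by grouping pass/fail outcomes by project size. The following is a minimal sketch, not the paper's evaluation code: the function name `pass_rate_by_scale`, the bucket labels, and the sample outcomes are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_scale(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of passing runs for each project-scale bucket.

    `results` is an iterable of (bucket_label, passed) pairs. The bucket
    labels are an assumption of this sketch, not taken from the paper.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for bucket, passed in results:
        totals[bucket] += 1
        passes[bucket] += int(passed)
    return {bucket: passes[bucket] / totals[bucket] for bucket in totals}


if __name__ == "__main__":
    # Made-up outcomes, for illustration only (not results from the paper).
    sample = [("small", True), ("small", True), ("small", False), ("large", False)]
    for bucket, rate in pass_rate_by_scale(sample).items():
        print(f"{bucket}: {rate:.1%}")
```

Reporting one rate per bucket, rather than a single aggregate, keeps a sharp drop on large projects from being averaged away by many small ones.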
Method
A four-level deterministic pipeline checks each Godot project for file integrity, compilation, and runtime stability, then collects its runtime behavior. This enables large-scale dataset curation and objective evaluation.
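A staged pipeline of this kind can be sketched as an ordered list of checks that stops at the first failure. This is a sketch under stated assumptions, not the authors' implementation: the stage names mirror the summary, but `verify`, `VerificationResult`, and the way checks are passed in as callables are hypothetical. The only concrete check shown relies on the fact that a Godot project root contains a `project.godot` file; the later stages are left as placeholders to be supplied by the caller.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# A check takes a project directory and returns True when the stage passes.
Check = Callable[[Path], bool]


@dataclass
class VerificationResult:
    project: Path
    passed: List[str]          # names of stages that passed, in order
    failed_at: Optional[str]   # first failing stage, or None if all passed

    @property
    def verified(self) -> bool:
        return self.failed_at is None


def has_project_file(project: Path) -> bool:
    """Stand-in file-integrity check: the project root holds project.godot."""
    return (project / "project.godot").is_file()


def verify(project: Path, stages: List[Tuple[str, Check]]) -> VerificationResult:
    """Run the stages in order and stop at the first one that fails."""
    passed: List[str] = []
    for name, check in stages:
        if not check(project):
            return VerificationResult(project, passed, failed_at=name)
        passed.append(name)
    return VerificationResult(project, passed, failed_at=None)


def placeholder(_: Path) -> bool:
    """Hypothetical stub for stages this sketch does not implement."""
    return True


STAGES: List[Tuple[str, Check]] = [
    ("file_integrity", has_project_file),
    ("compilation", placeholder),
    ("runtime_stability", placeholder),
    ("behavior_collection", placeholder),
]
```

Calling `verify(Path("some_project"), STAGES)` returns a result whose `failed_at` field names the first level a project did not clear, which makes it straightforward to count how many repositories drop out at each level.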
In practice
- Fine-tune models on domain-specific game code datasets.
- Evaluate game code beyond compilation, using SCS and BAS.
- Integrate engineering conventions into Code Agent feedback.
Topics
- Godot Engine
- Game Code Generation
- Project-Level Benchmarks
- AI Model Evaluation
- Game Jam Datasets
- Code Agents
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.