JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

2026-06-19 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Gaming & Interactive Media · Depth: Expert, extended

Summary

JAMER introduces JamSet and JamBench, the first project-level game code framework dataset and benchmark for professional game engines. Both are built on the Godot engine and draw on thousands of open-source projects from Game Jam competitions. A deterministic verification pipeline, running from file integrity checks through runtime behavior collection, distilled 8,133 verified projects from over 240,000 repositories. Of these, 300 manually verified projects form JamBench for benchmarking, and the remaining 7,833 form JamSet for training. The benchmark defines two task families: theme-driven generation and multi-granularity code completion. Evaluation uses compilation pass rates, the Structural Completeness Score (SCS), and the Behavioral Alignment Score (BAS). Initial evaluations of 9 frontier models revealed a significant "capability cliff": on function-level completion, runtime pass rates fell from 80.4% on small projects to 5.7% on large ones. Code Agents improved compilation but not behavioral quality, which points to architectural design as a key bottleneck.

Key takeaway

AI engineers building game code generation models should expect current frontier models to hit a significant "capability cliff" on project-level tasks, one that grows more severe as project scale increases. Prioritize fine-tuning on domain-specific datasets like JamSet to instill human-like engineering practices. Extend evaluation beyond mere compilation to include structural completeness (SCS) and runtime behavioral alignment (BAS), so that it reflects functional game quality.

Key insights

Project-level game code generation on professional engines requires specialized datasets and deterministic, multi-dimensional evaluation.

Principles

Game Jam projects provide scalable, real-world code artifacts.
Deterministic verification is crucial for game code evaluation.
Model capability degrades sharply as project scale increases.
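
To see how a scale-dependent drop like the one described above might be surfaced in one's own evaluation, pass rates can be grouped by project size. The sketch below is illustrative only: the summary does not say how JAMER measures project size or where it draws the small/large boundary, so `size_bucket`, its line-count thresholds, and the toy results are all assumptions.

```python
from collections import defaultdict


def size_bucket(n_lines):
    """Hypothetical size buckets; the thresholds are assumptions, not JAMER's."""
    if n_lines < 500:
        return "small"
    if n_lines < 5000:
        return "medium"
    return "large"


def pass_rate_by_bucket(results, bucket_of):
    """Group (size, passed) pairs by bucket and return the pass rate per bucket."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for size, passed in results:
        bucket = bucket_of(size)
        totals[bucket] += 1
        passes[bucket] += int(passed)
    return {bucket: passes[bucket] / totals[bucket] for bucket in totals}


# Toy data: (project size in lines, did the completed project pass at runtime?)
toy_results = [
    (120, True), (300, True), (450, False),
    (2000, True), (3000, False),
    (9000, False), (12000, False),
]
rates = pass_rate_by_bucket(toy_results, size_bucket)
```

Reporting one rate per bucket, rather than a single aggregate, is what makes a cliff visible: an overall average can look acceptable while the largest projects fail almost entirely.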

Method

A four-level deterministic pipeline checks each Godot project for file integrity, compilation, and runtime stability, then collects its runtime behavior, enabling large-scale dataset curation and objective evaluation.
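
The staged structure of such a pipeline can be sketched as a sequence of gates, where a project advances only after passing the previous level. This is a minimal sketch under stated assumptions, not the authors' implementation: the four level names follow the summary, but `Level`, `VerificationResult`, `verify`, and the injected check functions are hypothetical, and no real Godot invocation is shown.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, Optional, Tuple


class Level(IntEnum):
    """The four verification levels named in the summary, in order."""
    FILE_INTEGRITY = 1
    COMPILATION = 2
    RUNTIME_STABILITY = 3
    BEHAVIOR_COLLECTION = 4


@dataclass(frozen=True)
class VerificationResult:
    project: str
    passed: Tuple[Level, ...]   # levels passed, in order
    failed_at: Optional[Level]  # None when every level passed


# A check takes a project identifier and returns True on success (hypothetical).
Check = Callable[[str], bool]


def verify(project: str, checks: Dict[Level, Check]) -> VerificationResult:
    """Run the checks in level order and stop at the first failure."""
    passed = []
    for level in Level:  # Enum iteration follows definition order
        if not checks[level](project):
            return VerificationResult(project, tuple(passed), level)
        passed.append(level)
    return VerificationResult(project, tuple(passed), None)


# Stand-in checks for illustration; real ones would inspect files or run the engine.
checks = {
    Level.FILE_INTEGRITY: lambda p: True,
    Level.COMPILATION: lambda p: True,
    Level.RUNTIME_STABILITY: lambda p: p != "crashy_game",
    Level.BEHAVIOR_COLLECTION: lambda p: True,
}
result = verify("crashy_game", checks)
```

Stopping at the first failed level keeps the cheap checks in front of the expensive ones, which matters when filtering hundreds of thousands of repositories, and recording `failed_at` shows where projects drop out.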

In practice

Fine-tune models on domain-specific game code datasets.
Evaluate game code beyond compilation, using SCS and BAS.
Integrate engineering conventions into Code Agent feedback.
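
One way to keep evaluation from collapsing into a compile check is to report the three metrics side by side. The sketch below assumes SCS and BAS arrive as per-project scores from an external scorer; the summary does not define how they are computed or scaled, so `EvalRecord`, `summarize`, the choice to average only over compiled projects, and the toy values are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalRecord:
    project: str
    compiled: bool
    scs: float  # Structural Completeness Score, supplied by an external scorer
    bas: float  # Behavioral Alignment Score, supplied by an external scorer


def summarize(records):
    """Report compilation pass rate alongside mean SCS and BAS.

    Assumption: SCS and BAS are averaged over compiled projects only.
    """
    if not records:
        raise ValueError("records must be non-empty")
    compiled = [r for r in records if r.compiled]
    return {
        "compile_pass_rate": len(compiled) / len(records),
        "mean_scs": mean(r.scs for r in compiled) if compiled else None,
        "mean_bas": mean(r.bas for r in compiled) if compiled else None,
    }


# Toy values: two projects compile, but their behavioral scores stay low.
records = [
    EvalRecord("a", True, 0.9, 0.2),
    EvalRecord("b", True, 0.7, 0.4),
    EvalRecord("c", False, 0.0, 0.0),
]
report = summarize(records)
```

Keeping the three numbers separate mirrors the finding that Code Agents raised compilation rates without improving behavioral quality; a single blended score would hide that gap.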

Topics

Godot Engine
Game Code Generation
Project-Level Benchmarks
AI Model Evaluation
Game Jam Datasets
Code Agents

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.