ScratchWorld: Evaluating If World Models Compute Executable Consequences

2026-07-01 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

ScratchWorld is a diagnostic benchmark that tests whether world models compute executable consequences rather than merely predict plausible future states. It uses Scratch projects as executable environments and a pinned Scratch VM to generate replay-verified transitions, hidden variables, causal traces, and counterfactual outcomes. Its primary metric, value-aware changed-field F₁ (F₁ᴵᴴ), credits a model for correctly identifying changed fields and their executed values, which addresses the confound in which full-state overlap rewards copying persistent state. In a 659-example release, the best of seven prompted language/reasoning models reached 13.8% F₁ᴵᴴ in a state-only partial-observation stress test. A copy diagnostic showed that echoing the input state yielded 98.0% implied full-state field accuracy but 0.0% changed-field F₁, particularly on real projects. Auxiliary tests also revealed hidden-state rollout drift, and showed that models can be sensitive to interventions without predicting their consequences precisely.

Key takeaway

AI scientists and machine learning engineers evaluating world models should move beyond full-state overlap metrics. These metrics reward state persistence over change prediction, so they can mask a model's inability to compute executable consequences. Prioritize execution-grounded metrics such as value-aware changed-field F₁ to assess whether a model actually computes dynamic transitions, and use the results to guide the development of models that simulate environment dynamics.

Key insights

World model evaluations require execution-grounded metrics to assess true consequence computation, not just state persistence.

Principles

Full-state overlap metrics reward state copying, not change prediction.
Value-aware changed-field scoring isolates transition computation.
Reactivity to actions does not imply computing exact consequences.
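
The first two principles can be illustrated with a small sketch. The summary does not give the exact formula for value-aware changed-field F₁, so the definition below is an assumption: a predicted change counts as correct only when both the field and its new value match the executed transition. The function names, the flat dictionary state format, and the convention of returning 1.0 when nothing changed and nothing was predicted to change are all hypothetical choices made for illustration.

```python
from typing import Any, Dict, Set, Tuple

State = Dict[str, Any]
_MISSING = object()


def changed_pairs(before: State, after: State) -> Set[Tuple[str, str]]:
    """Return (field, repr(value)) pairs in `after` that differ from `before`.

    repr() is used so that unhashable values such as lists can be compared.
    Fields deleted in `after` are ignored in this simplified sketch.
    """
    return {
        (field, repr(value))
        for field, value in after.items()
        if before.get(field, _MISSING) != value
    }


def changed_field_f1(before: State, gold_after: State, pred_after: State) -> float:
    """Assumed value-aware F1: a hit needs the right field AND the right value."""
    gold = changed_pairs(before, gold_after)
    pred = changed_pairs(before, pred_after)
    if not gold and not pred:
        return 1.0  # assumed convention for a no-change transition
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def full_state_accuracy(gold_after: State, pred_after: State) -> float:
    """Fraction of gold fields whose predicted value matches exactly."""
    if not gold_after:
        return 1.0
    hits = sum(
        1 for field, value in gold_after.items()
        if pred_after.get(field, _MISSING) == value
    )
    return hits / len(gold_after)


# Toy transition: two of five fields change.
before = {"x": 0, "y": 10, "score": 3, "visible": True, "costume": "cat-a"}
gold_after = {**before, "x": 10, "score": 4}
copied = dict(before)  # a "model" that simply echoes its input

print(full_state_accuracy(gold_after, copied))       # persistent fields are rewarded
print(changed_field_f1(before, gold_after, copied))  # no change is credited
```

In this toy case the copied state still matches three of five fields under full-state accuracy, while the changed-field score is zero. The more fields persist across a transition, the larger that gap becomes.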

Method

ScratchWorld constructs execution-verified instances from Scratch projects by using a pinned VM to log structured traces, extract transitions, and generate counterfactuals for evaluation.
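
A minimal sketch of this pipeline shape, under loud assumptions: `toy_step` is a hypothetical deterministic stand-in for one tick of an executable environment and is not the Scratch VM or its API. The `Transition` record, the helper names, and the state fields are likewise invented for illustration. The point is only to show how logged transitions can be checked by replay and how a counterfactual can be produced by re-executing from the same state with a different action.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]
StepFn = Callable[[State, str], State]


def toy_step(state: State, action: str) -> State:
    """Hypothetical deterministic environment tick (NOT the Scratch VM)."""
    nxt = dict(state)
    if action == "key_right":
        nxt["x"] += 10
    elif action == "key_left":
        nxt["x"] -= 10
    nxt["ticks"] += 1          # hidden variable: advances on every tick
    if nxt["x"] >= 20:
        nxt["score"] += 1      # consequence that depends on the updated state
    return nxt


@dataclass
class Transition:
    before: State
    action: str
    after: State


def record_trace(step: StepFn, initial: State, actions: List[str]) -> List[Transition]:
    """Execute the actions and log one (before, action, after) record per tick."""
    trace: List[Transition] = []
    state = dict(initial)
    for action in actions:
        nxt = step(state, action)
        trace.append(Transition(dict(state), action, dict(nxt)))
        state = nxt
    return trace


def replay_verified(step: StepFn, trace: List[Transition]) -> List[Transition]:
    """Keep only transitions that re-execution reproduces exactly."""
    return [t for t in trace if step(dict(t.before), t.action) == t.after]


def counterfactual(step: StepFn, t: Transition, alt_action: str) -> Transition:
    """Re-execute from the same state with a different action."""
    return Transition(dict(t.before), alt_action, step(dict(t.before), alt_action))


initial = {"x": 0, "score": 0, "ticks": 0}
trace = record_trace(toy_step, initial, ["key_right", "key_right", "noop"])
verified = replay_verified(toy_step, trace)
alt = counterfactual(toy_step, verified[1], "key_left")
print(len(verified), verified[1].after, alt.after)
```

Pinning the executor matters in a setup like this because replay verification is only meaningful if the same state and action always produce the same result.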

In practice

Adopt value-aware changed-field F₁ for world model evaluation.
Use copy diagnostics to identify models that echo the input state.
Disaggregate performance by input modality to diagnose errors.
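
The last two practices can be combined in one report. The sketch below assumes each evaluation record already carries a per-example score plus the input and predicted states. The record keys and the modality labels `state_only` and `state_plus_code` are hypothetical and not taken from the benchmark.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List

Record = Dict[str, Any]


def summarize_by_modality(records: List[Record]) -> Dict[str, Dict[str, float]]:
    """Group records by input modality and report mean score and echo rate.

    Echo rate is the fraction of predictions identical to the input state,
    a simple flag for models that copy rather than compute the transition.
    """
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record["modality"]].append(record)

    summary: Dict[str, Dict[str, float]] = {}
    for modality, rows in groups.items():
        summary[modality] = {
            "n": len(rows),
            "mean_score": mean(r["score"] for r in rows),
            "echo_rate": mean(1.0 if r["pred"] == r["before"] else 0.0 for r in rows),
        }
    return summary


# Hypothetical records; scores would come from a changed-field metric.
records = [
    {"modality": "state_only", "before": {"x": 0}, "pred": {"x": 0}, "score": 0.0},
    {"modality": "state_only", "before": {"x": 0}, "pred": {"x": 5}, "score": 0.5},
    {"modality": "state_plus_code", "before": {"x": 0}, "pred": {"x": 10}, "score": 1.0},
]

for modality, stats in summarize_by_modality(records).items():
    print(modality, stats)
```

A high echo rate alongside a low mean score in one modality suggests the model is falling back on copying when that input gives it too little to compute the transition.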

Topics

ScratchWorld Benchmark
World Models
Executable Dynamics
Evaluation Metrics
Language Models
Causal Reasoning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.