ScratchWorld: Evaluating If World Models Compute Executable Consequences
Summary
ScratchWorld is a diagnostic benchmark that tests whether world models compute executable consequences rather than merely predict plausible future states. It treats Scratch projects as executable environments and uses a pinned Scratch VM to generate replay-verified transitions, hidden variables, causal traces, and counterfactual outcomes. Its primary metric is value-aware changed-field F₁ (F₁ᴵᴴ), which credits a model only for identifying the fields that changed together with their executed values. This addresses a confound in full-state overlap metrics, which can reward a model for copying persistent state. In a 659-example release, the best of seven prompted language/reasoning models reached 13.8% F₁ᴵᴴ in a state-only partial-observation stress test. A copy diagnostic showed that echoing the input state yielded 98.0% implied full-state field accuracy but 0.0% changed-field F₁, particularly on real projects. Auxiliary tests also revealed drift in hidden-state rollouts, and showed that models can be sensitive to interventions without predicting their precise consequences.
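The gap between full-state overlap and changed-field scoring can be illustrated with a minimal sketch. It assumes states are flat dictionaries of field names to values and that the metric is an F₁ over (changed field, new value) pairs; the benchmark's exact definition may differ. The function names and the toy state are hypothetical.

```python
from typing import Any, Dict

State = Dict[str, Any]


def changed_fields(before: State, after: State) -> Dict[str, Any]:
    """Map each field whose value differs between the two states to its new value."""
    keys = set(before) | set(after)
    return {k: after.get(k) for k in keys if before.get(k) != after.get(k)}


def changed_field_f1(before: State, predicted: State, executed: State) -> float:
    """F1 over changed fields; a field counts only if its predicted value matches."""
    pred = changed_fields(before, predicted)
    gold = changed_fields(before, executed)
    if not pred and not gold:
        return 1.0  # assumption: nothing changed and nothing was predicted to change
    tp = sum(1 for k, v in pred.items() if k in gold and gold[k] == v)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def full_state_accuracy(predicted: State, executed: State) -> float:
    """Fraction of executed-state fields that the prediction reproduces."""
    hits = sum(predicted.get(k) == v for k, v in executed.items())
    return hits / len(executed)


# Toy transition (not benchmark data): ten fields, one of which changes.
before = {f"var{i}": 0 for i in range(9)}
before["sprite.x"] = 0
executed = {**before, "sprite.x": 10}

copy_prediction = dict(before)               # echoes the input state
wrong_value = {**before, "sprite.x": 5}      # right field, wrong value

print(full_state_accuracy(copy_prediction, executed))       # 0.9
print(changed_field_f1(before, copy_prediction, executed))  # 0.0
print(changed_field_f1(before, wrong_value, executed))      # 0.0
print(changed_field_f1(before, executed, executed))         # 1.0
```

In this toy case the copy baseline scores high on full-state accuracy simply because most fields persist, while the changed-field score gives it no credit.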
Key takeaway
AI scientists and machine learning engineers evaluating world models should move beyond full-state overlap metrics. These metrics can mask a model's inability to compute executable consequences, because they reward state persistence rather than change prediction. Prioritize execution-grounded metrics such as value-aware changed-field F₁ to assess whether a model actually computes dynamic transitions, and use the results to guide development of models that simulate environment dynamics.
Key insights
World model evaluations require execution-grounded metrics to assess true consequence computation, not just state persistence.
Principles
- Full-state overlap metrics reward state copying, not change prediction.
- Value-aware changed-field scoring isolates transition computation.
- Reactivity to actions does not imply computing exact consequences.
Method
ScratchWorld constructs execution-verified instances from Scratch projects by using a pinned VM to log structured traces, extract transitions, and generate counterfactuals for evaluation.
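The transition-extraction step can be sketched as follows. This is a minimal illustration that assumes a logged trace is a list of steps, each holding an `action` and a flat `state` dictionary; the actual log format of the pinned VM is not described here, and the field and action names are hypothetical. Counterfactual generation would require re-executing the project in the VM and is not shown.

```python
from typing import Any, Dict, List

Step = Dict[str, Any]


def extract_transitions(trace: List[Step]) -> List[Dict[str, Any]]:
    """Pair consecutive logged states into (before, action, after) transitions."""
    transitions = []
    for prev, step in zip(trace, trace[1:]):
        before, after = prev["state"], step["state"]
        keys = set(before) | set(after)
        # Record only the fields whose value differs after the action executed.
        changed = {k: after.get(k) for k in keys if before.get(k) != after.get(k)}
        transitions.append(
            {"before": before, "action": step["action"], "after": after, "changed": changed}
        )
    return transitions


# Toy trace (hypothetical format, not real VM output).
trace = [
    {"action": None, "state": {"sprite.x": 0, "score": 0}},
    {"action": "key:right", "state": {"sprite.x": 10, "score": 0}},
    {"action": "broadcast:hit", "state": {"sprite.x": 10, "score": 1}},
]

transitions = extract_transitions(trace)
for t in transitions:
    print(t["action"], t["changed"])
```

Storing the changed fields alongside each transition keeps the ground truth for changed-field scoring separate from the persistent state.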
In practice
- Adopt value-aware changed-field F₁ for world model evaluation.
- Utilize copy diagnostics to identify models that echo input state.
- Disaggregate performance by input modality to diagnose errors.
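Disaggregating by input modality can be as simple as grouping per-example scores before averaging. The sketch below assumes each evaluation record carries a `modality` label and a per-example `f1`; the record layout, the `state_plus_code` label, and the scores are illustrative placeholders, not results from the benchmark.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_f1_by_modality(records: List[dict]) -> Dict[str, float]:
    """Average per-example changed-field F1 within each input modality."""
    groups = defaultdict(list)
    for record in records:
        groups[record["modality"]].append(record["f1"])
    return {modality: mean(scores) for modality, scores in groups.items()}


# Placeholder records (hypothetical values for illustration only).
records = [
    {"modality": "state_only", "f1": 0.0},
    {"modality": "state_only", "f1": 0.5},
    {"modality": "state_plus_code", "f1": 1.0},
    {"modality": "state_plus_code", "f1": 0.5},
]

by_modality = mean_f1_by_modality(records)
print(by_modality)
```

A single pooled average would hide whether errors concentrate in the partially observed setting, which is why the per-modality breakdown is reported separately.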
Topics
- ScratchWorld Benchmark
- World Models
- Executable Dynamics
- Evaluation Metrics
- Language Models
- Causal Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.