ScratchWorld: Evaluating If World Models Compute Executable Consequences

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

ScratchWorld is a diagnostic benchmark that tests whether world models compute executable consequences rather than merely predict plausible future states. It uses Scratch projects as executable environments and a pinned Scratch VM to generate replay-verified transitions, hidden variables, causal traces, and counterfactual outcomes. Its primary metric, value-aware changed-field F₁ (F₁ᴵᴴ), credits a model for identifying which fields changed and the values they take after execution. This addresses a confound in full-state overlap, which can reward a model for copying persistent state. In a 659-example release, seven prompted language/reasoning models reached at most 13.8% F₁ᴵᴴ in a state-only partial-observation stress test. On real projects, a diagnostic showed that copying the input state yields 98.0% implied full-state field accuracy but 0.0% changed-field F₁. Auxiliary tests also revealed hidden-state rollout drift, and sensitivity to interventions without precise prediction of their consequences.
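The copy-baseline confound can be illustrated with a toy calculation. The sketch below is an assumption about how a value-aware changed-field F₁ might be computed, not the benchmark's actual scoring code: states are flattened to field-to-value dictionaries, a predicted change counts as correct only when both the field and its new value match the executed result, and the names `changed_fields`, `changed_field_f1`, and `full_state_accuracy` are hypothetical.

```python
from typing import Any, Dict

State = Dict[str, Any]
_MISSING = object()  # sentinel for a field that is absent from a state


def changed_fields(before: State, after: State) -> Dict[str, Any]:
    """Fields whose value differs between two states, mapped to the new value."""
    keys = set(before) | set(after)
    return {
        k: after.get(k, _MISSING)
        for k in keys
        if before.get(k, _MISSING) != after.get(k, _MISSING)
    }


def changed_field_f1(before: State, predicted: State, gold: State) -> float:
    """Illustrative value-aware F1 over changed fields (assumed definition)."""
    gold_delta = changed_fields(before, gold)
    pred_delta = changed_fields(before, predicted)
    if not gold_delta and not pred_delta:
        return 1.0  # nothing changed and nothing was predicted to change
    # A true positive needs the right field AND the right executed value.
    tp = sum(1 for k, v in pred_delta.items()
             if k in gold_delta and gold_delta[k] == v)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_delta)
    recall = tp / len(gold_delta)
    return 2 * precision * recall / (precision + recall)


def full_state_accuracy(predicted: State, gold: State) -> float:
    """Fraction of all fields whose predicted value matches the executed state."""
    keys = set(predicted) | set(gold)
    if not keys:
        return 1.0
    hits = sum(predicted.get(k, _MISSING) == gold.get(k, _MISSING) for k in keys)
    return hits / len(keys)


# Toy transition (made-up values): only sprite.x changes.
before = {"sprite.x": 0, "sprite.y": 0, "score": 3, "lives": 5}
gold = {**before, "sprite.x": 10}
copy_prediction = dict(before)  # the "copy the input state" baseline

print(full_state_accuracy(copy_prediction, gold))       # 0.75
print(changed_field_f1(before, copy_prediction, gold))  # 0.0
```

In this toy case the copy baseline matches three of four fields yet scores zero on changed fields, which is the same pattern as the 98.0% versus 0.0% result reported above, on a much smaller state.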

Key takeaway

AI scientists and machine learning engineers who evaluate world models should move beyond full-state overlap metrics. Because these metrics reward state persistence over change prediction, they can mask a model's inability to compute executable consequences. Prioritize execution-grounded metrics such as value-aware changed-field F₁ to assess whether a model actually computes dynamic transitions. Doing so steers development toward models that genuinely simulate environment dynamics.

Key insights

World model evaluations require execution-grounded metrics to assess true consequence computation, not just state persistence.

Method

ScratchWorld constructs execution-verified instances from Scratch projects by using a pinned VM to log structured traces, extract transitions, and generate counterfactuals for evaluation.
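The pipeline described above (log a trace, extract transitions, verify by replay, generate a counterfactual) can be sketched with a toy deterministic step function. Everything here is hypothetical: `toy_step` stands in for one step of an executable environment and is not the Scratch VM API, and the state fields and event names are invented for illustration.

```python
import copy
from typing import Any, Dict, List, Tuple

State = Dict[str, Any]
Transition = Tuple[State, str, State]


def toy_step(state: State, event: str) -> State:
    """Hypothetical stand-in for one VM step; NOT the Scratch VM API."""
    nxt = copy.deepcopy(state)
    if event == "key_right":
        nxt["x"] += nxt["speed"]
    elif event == "tick":
        nxt["timer"] += 1
    return nxt


def run_trace(initial: State, events: List[str]) -> List[State]:
    """Execute the events in order and log every intermediate state."""
    states = [copy.deepcopy(initial)]
    for event in events:
        states.append(toy_step(states[-1], event))
    return states


def extract_transitions(states: List[State], events: List[str]) -> List[Transition]:
    """Turn a logged trace into (before, event, after) instances."""
    return [(states[i], events[i], states[i + 1]) for i in range(len(events))]


def replay_verified(transitions: List[Transition]) -> bool:
    """Re-execute each transition and check that it reproduces the logged result."""
    return all(toy_step(b, e) == a for b, e, a in transitions)


def counterfactual(initial: State, events: List[str], field: str, value: Any) -> State:
    """Intervene on one field of the initial state and re-run the same events."""
    altered = copy.deepcopy(initial)
    altered[field] = value
    return run_trace(altered, events)[-1]


initial = {"x": 0, "speed": 2, "timer": 0}
events = ["key_right", "tick", "key_right"]

states = run_trace(initial, events)
transitions = extract_transitions(states, events)
factual_final = states[-1]
cf_final = counterfactual(initial, events, "speed", 5)
```

The design point this sketch tries to capture is that every label comes from execution: the factual and counterfactual outcomes are produced by running the same deterministic step function, so a model's prediction can be checked against what actually executes rather than against a plausible-looking state.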

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.