Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

Source: Machine Learning · Field: Technology & Digital / Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Chess-World-Model is a large-scale state-tracking benchmark built from 10 million real chess games. It evaluates whether a model can predict the exact board state from a move sequence. The benchmark includes a held-out real-game split and an out-of-distribution (OOD) split generated by uniformly random legal play, which tests whether a model has learned the underlying transition rules rather than shortcuts tied to common human positions. A causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet were benchmarked under a matched protocol; the recurrent models significantly outperform the Transformer at 3 million and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing model failures that scale alone would otherwise conceal. In ablations, less expressive state-transition mechanisms reduce performance on the OOD split across all three recurrent models.
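
The idea of an OOD split drawn from uniformly random legal play can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual generator: the function `random_uniform_rollout` and the toy "king on a 3x3 board" game are hypothetical stand-ins chosen so the example runs with the standard library only. A real chess version would swap in a full legal-move generator.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # game state
M = TypeVar("M")  # move


def random_uniform_rollout(
    state: S,
    legal_moves: Callable[[S], Sequence[M]],
    apply_move: Callable[[S, M], S],
    max_plies: int,
    rng: random.Random,
) -> Tuple[List[M], List[S]]:
    """Play up to max_plies moves, choosing uniformly among the legal ones.

    Returns the move sequence and the state after every ply (states[0] is
    the initial state), i.e. the inputs and targets of a state-tracking task.
    """
    moves: List[M] = []
    states: List[S] = [state]
    for _ in range(max_plies):
        options = list(legal_moves(state))
        if not options:  # no legal move: the game is over
            break
        move = rng.choice(options)  # uniform over legal moves
        state = apply_move(state, move)
        moves.append(move)
        states.append(state)
    return moves, states


# Toy game (an assumption for illustration): a lone king on a 3x3 board.
SIZE = 3
KING_STEPS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def king_legal_moves(pos: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Steps that keep the king on the board."""
    r, c = pos
    return [(dr, dc) for dr, dc in KING_STEPS if 0 <= r + dr < SIZE and 0 <= c + dc < SIZE]


def king_apply(pos: Tuple[int, int], step: Tuple[int, int]) -> Tuple[int, int]:
    """Transition rule: add the step to the current position."""
    return (pos[0] + step[0], pos[1] + step[1])


moves, states = random_uniform_rollout(
    (0, 0), king_legal_moves, king_apply, max_plies=20, rng=random.Random(0)
)
```

Because every move is sampled without regard to playing strength, sequences produced this way quickly leave the positions typical of human play, which is what makes such a split a test of the transition rules themselves.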

Key takeaway

If you are a machine learning engineer building world models or sequence models, add an out-of-distribution test such as Chess-World-Model's random-uniform split to your robustness evaluation. In-distribution performance alone, even at large parameter counts, can conceal fundamental failures to learn transition rules. Out-of-distribution tests show whether a model has learned the underlying system dynamics rather than memorized common patterns.

Key insights

Large-scale out-of-distribution benchmarks reveal state-tracking failures that scaling up model size alone can conceal.

Method

Chess-World-Model draws on 10M real chess games and tests exact board-state prediction on two splits: a held-out real-game split and a random-uniform OOD split.
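
To make "exact board-state prediction" concrete, here is a minimal scoring sketch. The summary does not say how the benchmark scores predictions, so the 64-character board encoding, the function names, and the two metrics below are assumptions for illustration only.

```python
from typing import Dict, Sequence

# Assumed encoding: 64 characters, rank 8 first, "." for an empty square.
START = "rnbqkbnr" + "pppppppp" + "." * 32 + "PPPPPPPP" + "RNBQKBNR"


def move_piece(board: str, src: int, dst: int) -> str:
    """Move whatever stands on index src to index dst (no legality check)."""
    squares = list(board)
    squares[dst], squares[src] = squares[src], "."
    return "".join(squares)


def square_accuracy(pred: str, target: str) -> float:
    """Fraction of the 64 squares predicted correctly."""
    if len(pred) != 64 or len(target) != 64:
        raise ValueError("boards must have exactly 64 squares")
    return sum(p == t for p, t in zip(pred, target)) / 64


def evaluate(preds: Sequence[str], targets: Sequence[str]) -> Dict[str, float]:
    """Mean per-square accuracy and the rate of fully correct boards."""
    if not preds or len(preds) != len(targets):
        raise ValueError("need equally many, non-empty predictions and targets")
    n = len(preds)
    return {
        "square_accuracy": sum(square_accuracy(p, t) for p, t in zip(preds, targets)) / n,
        "exact_match": sum(p == t for p, t in zip(preds, targets)) / n,
    }


# After 1. e4 the white pawn moves from e2 (index 52) to e4 (index 36).
AFTER_E4 = move_piece(START, 52, 36)

# A predictor that never updates the board gets 62 of 64 squares right on
# the second position but still fails the exact-match criterion there.
report = evaluate(preds=[START, START], targets=[START, AFTER_E4])
```

The gap between the two numbers is the reason to report exact match: a board that is wrong on only two squares still scores high per-square accuracy while being an incorrect state.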

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.