Benchmarking LLMs at the Game of Science (Eleusis)

· Source: HuggingFace · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, long

Summary

A new benchmark, "Game of Science (Eleusis)," evaluates the scientific reasoning and metacognition of Large Language Models (LLMs) by adapting the Eleusis card game. In this single-player version, the model must infer a secret rule by playing cards and receiving "accepted" or "rejected" feedback, which mirrors the cycle of hypothesis generation, experimentation, and revision. On each turn the model chooses a card, explains its reasoning, states a tentative rule, assigns a confidence level (0-10), and decides whether to make a formal guess; an incorrect guess incurs a penalty. The benchmark uses 26 handcrafted rules, each played three times, for a total of 78 rounds per model. Sixteen frontier models were tested, including open-weight models (Kimi K2, GLM 4.7) and proprietary ones (Claude Opus, GPT-5 Mini, Gemini 3). Performance varies significantly: Gemini 3 Pro and Claude Opus 4.5 lead, and the open-weight models are competitive. The study separates raw reasoning from metacognition and finds distinct "scientist personalities" among LLMs.

Key takeaway

For AI engineers developing or deploying LLMs in iterative reasoning tasks such as code debugging or medical diagnosis, understanding a model's "scientist personality" is crucial. Overly cautious models waste resources by continuing to experiment after they have enough evidence, while reckless ones commit to conclusions too early and pose risks. Evaluate models not only on raw reasoning but also on their metacognitive traits: the two are distinct, and metacognition may be tunable through post-training or prompting strategies to improve performance and resource efficiency.
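One way to look at the cautious versus reckless distinction is to compare the confidence a model states with how often its tentative rule was actually right. The sketch below is illustrative only and is not the benchmark's scoring method: the function names, the `(confidence, was_correct)` record format, and the choice to read a 0-10 confidence as a probability by dividing by 10 are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical record: (stated confidence 0-10, whether the tentative rule was correct)
Record = Tuple[int, bool]


def calibration_table(records: List[Record]) -> Dict[int, float]:
    """Empirical accuracy of the tentative rule at each stated confidence level."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for confidence, was_correct in records:
        totals[confidence] += 1
        hits[confidence] += int(was_correct)
    return {level: hits[level] / totals[level] for level in sorted(totals)}


def overconfidence_gap(records: List[Record]) -> float:
    """Mean stated confidence minus mean accuracy.

    Assumption: confidence / 10 is treated as a probability.
    Positive values suggest a reckless model, negative values a cautious one.
    """
    if not records:
        raise ValueError("need at least one record")
    mean_stated = sum(confidence / 10 for confidence, _ in records) / len(records)
    mean_actual = sum(int(ok) for _, ok in records) / len(records)
    return mean_stated - mean_actual


# Toy data, invented for illustration.
sample = [(8, True), (8, False), (2, False), (2, False)]
table = calibration_table(sample)
gap = overconfidence_gap(sample)
```

On real logs, a table like this would be built per model from its turn-by-turn outputs, and the gap would give a single rough number for comparing models.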

Key insights

LLM scientific reasoning involves distinct inductive ability and metacognition, measurable through iterative hypothesis testing.

Principles

Method

The Eleusis card game is adapted into a single-player LLM benchmark in which the model iteratively hypothesizes, experiments, and manages its confidence under a penalty for incorrect guesses, returning a structured output at each turn.
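The per-turn structure described above can be sketched as a small data type plus a game loop. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the `Turn` fields mirror the five outputs listed in the summary, while `play_round`, `toy_player`, the turn limit, the penalty size, the order of feedback and guess checking, and the string comparison used to judge a guess are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Card = Tuple[str, str]               # (rank, suit), e.g. ("7", "hearts")
History = List[Tuple[Card, bool]]    # cards played so far with accepted/rejected feedback


@dataclass
class Turn:
    """Structured output for one turn (field names are illustrative)."""
    card: Card
    reasoning: str
    tentative_rule: str
    confidence: int      # 0-10
    make_guess: bool     # whether to formally guess the rule this turn


def play_round(secret_rule: Callable[[Card], bool],
               rule_is_correct: Callable[[str], bool],
               player: Callable[[History], Turn],
               max_turns: int = 30,
               wrong_guess_penalty: int = 1) -> dict:
    """Run one round: play a card, record feedback, then check any formal guess."""
    history: History = []
    penalty = 0
    for turn_number in range(1, max_turns + 1):
        turn = player(history)
        if not 0 <= turn.confidence <= 10:
            raise ValueError("confidence must be between 0 and 10")
        accepted = secret_rule(turn.card)
        history.append((turn.card, accepted))
        if turn.make_guess:
            if rule_is_correct(turn.tentative_rule):
                return {"solved": True, "turns": turn_number, "penalty": penalty}
            penalty += wrong_guess_penalty   # incorrect guesses cost points
    return {"solved": False, "turns": max_turns, "penalty": penalty}


def toy_player(history: History) -> Turn:
    """Stand-in for an LLM: probes three cards, then guesses on the third turn."""
    deck = [("7", "hearts"), ("2", "spades"), ("K", "diamonds")]
    card = deck[len(history) % len(deck)]
    ready = len(history) >= 2
    return Turn(card, "probe one card per colour", "only red cards",
                confidence=8 if ready else 3, make_guess=ready)


result = play_round(
    secret_rule=lambda card: card[1] in {"hearts", "diamonds"},
    rule_is_correct=lambda text: text == "only red cards",
    player=toy_player,
)
```

In a real harness the `player` callable would wrap a model call and parse its response into a `Turn`, and judging whether a guessed rule matches the secret rule would need something more robust than string equality.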

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.