Benchmarking LLMs at the Game of Science (Eleusis)
Summary
A new benchmark, "Game of Science (Eleusis)," evaluates the scientific reasoning and metacognition of Large Language Models (LLMs) by adapting the Eleusis card game. In this single-player version, the model must infer a secret rule by playing cards and receiving "accepted" or "rejected" feedback, which simulates hypothesis generation, experimentation, and revision. At each turn the model chooses a card, explains its reasoning, states a tentative rule, assigns a confidence level (0-10), and decides whether to make a formal guess; an incorrect formal guess incurs a penalty. The benchmark uses 26 handcrafted rules, each played three times, for a total of 78 rounds per model. Sixteen frontier models were tested, including open-weight (Kimi K2, GLM 4.7) and proprietary (Claude Opus, GPT-5 Mini, Gemini 3) models. Performance varies significantly: Gemini 3 Pro and Claude Opus 4.5 lead, and open-weight models are competitive. The study disentangles raw reasoning from metacognition, revealing distinct "scientist personalities" among LLMs.
Key takeaway
For AI engineers developing or deploying LLMs in iterative reasoning tasks such as code debugging or medical diagnosis, understanding a model's "scientist personality" is crucial. Overly cautious models waste resources on unnecessary experiments, while reckless ones commit to wrong conclusions too early. Evaluate models not just on raw reasoning but also on their metacognitive traits: the two are distinct, and metacognition is potentially tunable through post-training or prompting strategies to improve performance and resource efficiency.
Key insights
LLM scientific reasoning involves distinct inductive ability and metacognition, measurable through iterative hypothesis testing.
Principles
- Metacognition is distinct from raw reasoning ability.
- Overconfidence is a widespread LLM trait.
- Simplicity (Ockham's Razor) is often violated by LLM hypotheses.
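One way to make the overconfidence principle concrete is to compare a model's stated confidence with how often its tentative rule was actually right. The sketch below is illustrative only and is not the benchmark's own metric: the function name `overconfidence_gap`, the record format, and the choice to read a 0-10 confidence as a probability (confidence / 10) are all assumptions.

```python
from typing import List, Tuple


def overconfidence_gap(records: List[Tuple[int, bool]]) -> float:
    """Mean stated confidence (scaled to 0..1) minus observed accuracy.

    Each record is (confidence on the 0-10 scale, whether the tentative
    rule was correct). A positive result suggests overconfidence, a
    negative one underconfidence. Hypothetical metric for illustration.
    """
    if not records:
        raise ValueError("need at least one record")
    for confidence, _ in records:
        if not 0 <= confidence <= 10:
            raise ValueError("confidence must be in 0..10")
    # Assumption: a confidence of k on the 0-10 scale means probability k / 10.
    mean_confidence = sum(c for c, _ in records) / (10 * len(records))
    accuracy = sum(1 for _, correct in records if correct) / len(records)
    return mean_confidence - accuracy
```

A model that reports confidences of 9, 8, 7, and 6 but is right only twice has a mean confidence of 0.75 against an accuracy of 0.5, so the gap is 0.25.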
Method
The Eleusis card game is adapted into a single-player LLM benchmark. At each turn the model returns a structured output, and over the course of a round it must iteratively hypothesize, experiment, and manage its confidence, with a penalty for incorrect formal guesses.
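The turn structure described above can be sketched as a small data class plus a round loop. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the names `Turn`, `play_round`, `red_only`, and `rule_matches` are hypothetical, the example secret rule is invented, and the scoring is simplified to a count of wrong formal guesses because the summary does not specify the penalty.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Card = Tuple[str, str]  # (rank, suit), e.g. ("7", "hearts")


@dataclass
class Turn:
    """Structured output a model returns each turn (field names are assumed)."""
    card: Card
    reasoning: str
    tentative_rule: str
    confidence: int      # 0-10 scale, as in the summary
    formal_guess: bool   # True if the model commits to its tentative rule

    def __post_init__(self) -> None:
        if not 0 <= self.confidence <= 10:
            raise ValueError("confidence must be in 0..10")


def red_only(card: Card) -> bool:
    """Invented example of a secret rule: only red cards are accepted."""
    return card[1] in {"hearts", "diamonds"}


def play_round(
    secret_rule: Callable[[Card], bool],
    turns: List[Turn],
    rule_matches: Callable[[str], bool],
) -> Tuple[List[Tuple[Card, str]], bool, int]:
    """Replay a list of turns; return (feedback history, solved, wrong guesses).

    rule_matches is a hypothetical judge that decides whether a stated
    rule is equivalent to the secret one.
    """
    history: List[Tuple[Card, str]] = []
    wrong_guesses = 0
    for turn in turns:
        # Assumption: the played card is scored before any formal guess.
        feedback = "accepted" if secret_rule(turn.card) else "rejected"
        history.append((turn.card, feedback))
        if turn.formal_guess:
            if rule_matches(turn.tentative_rule):
                return history, True, wrong_guesses
            wrong_guesses += 1  # simplified stand-in for the penalty
    return history, False, wrong_guesses
```

In a real harness the list of turns would come from model calls that see the feedback history so far, and the judge would need to handle differently worded but equivalent rules; here both are left as plain inputs so the loop itself stays easy to follow.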
In practice
- Use the Eleusis benchmark to evaluate LLM iterative reasoning.
- Tune LLM metacognition via post-training or prompting.
- Assess LLM "scientist personality" for task suitability.
Topics
- LLM Benchmarking
- Scientific Reasoning
- Metacognition
- Model Calibration
- Iterative Reasoning
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by HuggingFace.