MathDuels: Evaluating LLMs as Problem Posers and Solvers

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

MathDuels is a self-play benchmark that evaluates large language models (LLMs) as both posers and solvers of mathematics problems. It addresses the saturation of traditional evaluations, in which models act only as solvers on fixed problem sets. Problems are authored through a three-stage pipeline (meta-prompting, problem generation, and difficulty amplification), and an independent verifier checks each problem's validity. A Rasch model jointly estimates solver abilities and problem difficulties, and author quality is derived from those estimates. Experiments with 19 frontier models indicate that authoring and solving capabilities are partially decoupled, so dual-role evaluation reveals distinctions that single-role benchmarks do not show. Because the models themselves write the problems, benchmark difficulty co-evolves with model strength, which is intended to prevent saturation as new models emerge.

Key takeaway

For research scientists and AI engineers who develop or evaluate advanced LLMs, MathDuels makes the case for dual-role benchmarks. Evaluation strategies should move beyond static problem sets to include dynamic, self-play environments in which models both generate and solve problems. This approach can give a more nuanced picture of model capabilities and resist benchmark saturation as frontier models continue to advance.

Key insights

Dual-role evaluation in MathDuels reveals LLM capabilities in math problem authoring and solving that single-role benchmarks miss.

Principles

Method

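Problems are authored in three stages (meta-prompting, problem generation, and difficulty amplification) and then checked by an independent verifier. A Rasch model jointly estimates solver ability and problem difficulty, and author quality is derived from these estimates.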
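
To make the Rasch step concrete, here is a minimal sketch, not the benchmark's actual implementation. In a Rasch model, the probability that solver i solves problem j is sigmoid(theta_i - b_j), where theta_i is the solver's ability and b_j is the problem's difficulty. The fitting procedure (plain gradient ascent with a small L2 penalty), the toy outcome matrix, and the helper names fit_rasch and author_quality are all assumptions made for illustration. In particular, averaging the fitted difficulty of each author's problems is only one plausible way to derive author quality; the summary does not say how MathDuels computes it.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(outcomes, n_solvers, n_problems, steps=2000, lr=0.05, l2=0.01):
    """Fit abilities (theta) and difficulties (b) by gradient ascent.

    outcomes: list of (solver_idx, problem_idx, solved) with solved in {0, 1}.
    The L2 penalty (an assumption of this sketch) keeps estimates finite when
    a problem is solved by everyone or by no one.
    """
    theta = [0.0] * n_solvers
    b = [0.0] * n_problems
    for _ in range(steps):
        # Start each gradient from the penalty term.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for i, j, y in outcomes:
            resid = y - sigmoid(theta[i] - b[j])
            g_theta[i] += resid  # d(log-likelihood) / d(theta_i)
            g_b[j] -= resid      # d(log-likelihood) / d(b_j)
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
        # Only theta - b is identified, so pin the mean difficulty at zero.
        shift = sum(b) / n_problems
        b = [d - shift for d in b]
        theta = [t - shift for t in theta]
    return theta, b


def author_quality(b, authors):
    """Hypothetical summary: mean fitted difficulty of each author's problems."""
    totals, counts = {}, {}
    for j, a in enumerate(authors):
        totals[a] = totals.get(a, 0.0) + b[j]
        counts[a] = counts.get(a, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}


# Toy data: rows are solvers, columns are problems, 1 means solved.
matrix = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
outcomes = [(i, j, y) for i, row in enumerate(matrix) for j, y in enumerate(row)]
theta, b = fit_rasch(outcomes, n_solvers=3, n_problems=4)
print("abilities:", [round(t, 2) for t in theta])
print("difficulties:", [round(d, 2) for d in b])
print("author quality:", author_quality(b, ["A", "A", "B", "B"]))
```

In this toy run, solvers with more correct answers receive higher abilities and problems solved less often receive higher difficulties, which is the property that lets the fitted difficulties be attributed back to whoever authored each problem.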

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.