MathDuels: Evaluating LLMs as Problem Posers and Solvers

2026-04-23 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

MathDuels is a new self-play benchmark that evaluates large language models (LLMs) as both posers and solvers of mathematics problems. It targets the saturation of traditional evaluations, in which models act only as solvers on fixed problem sets. Problems are authored through a three-stage pipeline (meta-prompting, problem generation, and difficulty amplification), and an independent verifier checks each problem's validity. A Rasch model jointly estimates solver abilities and problem difficulties, and author quality is derived from those estimates. Experiments with 19 frontier models indicate that authoring and solving capabilities are partially decoupled, so the dual-role evaluation reveals distinctions that single-role benchmarks do not show. Because the models themselves author the problems, the benchmark's difficulty co-evolves with model strength, which prevents saturation as new models emerge.

Key takeaway

For research scientists and AI engineers developing or evaluating advanced LLMs, MathDuels makes the case for dual-role benchmarks. Evaluation strategies should move beyond static problem sets to include dynamic, self-play environments in which models both generate and solve problems. This separates authoring skill from solving skill and guards against benchmark saturation as frontier models continue to advance.

Key insights

Dual-role evaluation in MathDuels reveals LLM capabilities in math problem authoring and solving that single-role benchmarks miss.

Principles

Authoring and solving capabilities are partially decoupled.
Benchmark difficulty should co-evolve with model strength.

Method

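Problems are authored through a three-stage pipeline (meta-prompting, problem generation, and difficulty amplification), then checked by an independent verifier. A Rasch model jointly estimates solver ability and problem difficulty, and author quality is derived from these estimates.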
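
The joint estimation can be illustrated with a minimal sketch. In the standard Rasch model, the probability that solver s solves problem p is 1 / (1 + exp(-(theta_s - b_p))), where theta_s is the solver's ability and b_p is the problem's difficulty. The code below fits both sets of parameters by gradient ascent on the log-likelihood. The function name fit_rasch, the toy response data, the small L2 penalty, and the optimizer settings are all assumptions made for illustration; the summary does not describe how MathDuels performs the fit.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def fit_rasch(responses, n_solvers, n_problems, lr=0.1, steps=2000, l2=0.1):
    """Jointly estimate solver abilities (theta) and problem difficulties (b).

    responses: iterable of (solver_index, problem_index, solved), solved in {0, 1}.
    The L2 penalty is an assumption of this sketch: it keeps estimates finite
    when a solver (or problem) has all-correct or all-wrong outcomes.
    """
    theta = [0.0] * n_solvers
    b = [0.0] * n_problems
    for _ in range(steps):
        # Gradient of the penalized log-likelihood.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for s, p, y in responses:
            resid = y - sigmoid(theta[s] - b[p])  # observed minus predicted
            g_theta[s] += resid
            g_b[p] -= resid
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    # Only theta - b is identified, so pin the origin at mean difficulty 0.
    shift = sum(b) / n_problems
    return [t - shift for t in theta], [d - shift for d in b]


# Toy data (hypothetical): solver 0 solves everything, solver 2 only problem 0.
responses = [
    (0, 0, 1), (0, 1, 1), (0, 2, 1),
    (1, 0, 1), (1, 1, 1), (1, 2, 0),
    (2, 0, 1), (2, 1, 0), (2, 2, 0),
]
theta, b = fit_rasch(responses, n_solvers=3, n_problems=3)
print("abilities:", [round(t, 2) for t in theta])
print("difficulties:", [round(d, 2) for d in b])
```

On this toy data the fitted abilities rank solver 0 above solver 1 above solver 2, and problem 2 comes out hardest. Because the likelihood depends only on the difference theta_s - b_p, some convention is needed to fix the origin of the scale; centering the difficulties at zero is one common choice.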

In practice

Use adversarial prompting for problem generation.
Implement independent verifiers for content quality.
Apply Rasch models for joint capability/difficulty estimation.
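
As a rough sketch of how the verifier gate and the difficulty estimates might combine into a per-author score: the summary says author quality is derived from the Rasch estimates but does not say how, so taking the mean fitted difficulty of an author's verified problems is purely an assumption here. The function name author_quality, the record fields, and the model names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def author_quality(problems):
    """Mean fitted difficulty of each author's verifier-approved problems.

    problems: list of dicts with keys 'author', 'difficulty', 'verified'.
    Using the mean as the quality score is an assumption of this sketch.
    """
    by_author = defaultdict(list)
    for prob in problems:
        if prob["verified"]:  # drop problems the independent verifier rejected
            by_author[prob["author"]].append(prob["difficulty"])
    return {author: mean(ds) for author, ds in by_author.items()}


# Hypothetical records: difficulties would come from a fitted Rasch model.
problems = [
    {"author": "model_a", "difficulty": 1.0, "verified": True},
    {"author": "model_a", "difficulty": 0.5, "verified": True},
    {"author": "model_b", "difficulty": -0.5, "verified": True},
    {"author": "model_b", "difficulty": 2.0, "verified": False},
]
scores = author_quality(problems)
print(scores)
```

Filtering before aggregation matters: without the verifier gate, an author could raise its score by posing invalid problems that no solver can answer.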

Topics

LLM Evaluation
Self-Play Benchmarking
Math Problem Generation
Rasch Model
Model Capability Decoupling

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.