MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

MaxProof is a population-level test-time scaling framework for competition-level mathematical proof, built around the MiniMax-M3 series. The M3 model is first trained for three proof-oriented capabilities: generation via verifier-guided RL, verification through aligned error finding, and critique-conditioned repair. At test time, MaxProof uses the same M3 model as generator, verifier, refiner, and ranker to search a population of candidate proofs and pick a final one via tournament selection. With this framework, M3 achieved 35/42 on IMO 2025 and 36/42 on USAMO 2026, meeting or surpassing human gold-medal thresholds. Standalone, M3 scored 67.40 on IMOProofBench and 81.56 on IMOAnswerBench. MaxProof contributed 8 points on IMO 2025 and 10 points on USAMO 2026 over one-shot M3 performance, which implies one-shot scores of 27/42 and 26/42 respectively.

Key takeaway

For ML engineers building models for high-stakes reasoning such as mathematical proof, MaxProof shows that pairing robust generative verifiers with population-level test-time scaling lifts final results well beyond one-shot generation: 8 points on IMO 2025 and 10 points on USAMO 2026. Prioritize defense-in-depth verifier design to prevent reward hacking, and consider iterative refinement to convert best@K capability into stable, gold-medal-level pass@1 results.

Key insights

MaxProof scales mathematical proof capabilities by combining generative-verifier RL with population-level test-time search and refinement.

Principles

Generative verifiers require defense-in-depth to prevent reward hacking.
Long-horizon RL with noisy rewards risks reward-hacking plateaus.
Population-level search with refinement improves pass@1 from best@K.

Method

MaxProof initializes a candidate proof population, iteratively refines promising candidates using dual PATCH/REWRITE, and selects a final proof via tournament, guided by a conservative verifier.
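The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`population_search`, `tournament`), the parameters (`population_size`, `rounds`, `top_k`), and the single-elimination bracket are all hypothetical, and the reading of PATCH as a local fix and REWRITE as a fresh re-derivation is an interpretation of the summary. In a real system the callables would wrap model calls; here they are plain functions so the control flow is easy to follow.

```python
from typing import Callable, List, TypeVar

P = TypeVar("P")  # a candidate proof; any type works for this sketch


def tournament(candidates: List[P], beats: Callable[[P, P], bool]) -> P:
    """Single-elimination bracket (an assumed format): winners advance in pairs."""
    if not candidates:
        raise ValueError("tournament needs at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_pool = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_pool.append(a if beats(a, b) else b)
        if len(pool) % 2 == 1:
            next_pool.append(pool[-1])  # odd one out gets a bye
        pool = next_pool
    return pool[0]


def population_search(
    generate: Callable[[], P],       # hypothetical: samples one candidate proof
    score: Callable[[P], float],     # hypothetical: conservative verifier score
    patch: Callable[[P], P],         # hypothetical: local repair of a candidate
    rewrite: Callable[[P], P],       # hypothetical: fresh attempt informed by it
    beats: Callable[[P, P], bool],   # hypothetical: pairwise ranker
    population_size: int = 8,
    rounds: int = 3,
    top_k: int = 4,
) -> P:
    # 1. Initialize the candidate population.
    population = [generate() for _ in range(population_size)]
    for _ in range(rounds):
        # 2. Keep the candidates the verifier rates highest.
        population.sort(key=score, reverse=True)
        parents = population[:top_k]
        # 3. Dual refinement: every parent yields one PATCH and one REWRITE child.
        children = [patch(p) for p in parents] + [rewrite(p) for p in parents]
        population = parents + children
    # 4. Select the final proof by tournament.
    return tournament(population, beats)
```

Keeping the parents alongside their children means a refinement step can never lose the best candidate found so far, which is one simple way to balance exploitation (PATCH) against exploration (REWRITE).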

In practice

Implement multi-judge parallel scoring and pessimistic min aggregation for verifiers.
Use dual PATCH/REWRITE refinement for balanced exploration/exploitation in search.
Employ population-level early stopping with redundancy checks to mitigate verifier false positives.
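Two of the practices above, pessimistic min aggregation and early stopping with a redundancy check, can be illustrated with a short sketch. Everything named here is an assumption made for illustration: the `Judge` type, the `threshold` and `min_passing` parameters, and in particular the reading of "redundancy check" as "stop only when several distinct candidates pass". The summary does not specify how MaxProof implements either mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Judge = Callable[[str], float]  # hypothetical: maps a proof to a score in [0, 1]


def pessimistic_score(proof: str, judges: Sequence[Judge]) -> float:
    """Score a proof with every judge in parallel and keep the lowest score."""
    if not judges:
        raise ValueError("at least one judge is required")
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        scores = list(pool.map(lambda judge: judge(proof), judges))
    # Min aggregation: a single judge that finds a flaw drags the score down,
    # so a proof only scores well when no judge objects.
    return min(scores)


def should_stop(
    population: Sequence[str],
    judges: Sequence[Judge],
    threshold: float = 1.0,
    min_passing: int = 2,
) -> bool:
    """Stop the search only when enough distinct candidates pass every judge.

    Requiring more than one passing candidate is an assumed form of redundancy:
    a single verifier false positive is then not enough to end the search.
    """
    passing = {p for p in population if pessimistic_score(p, judges) >= threshold}
    return len(passing) >= min_passing
```

A toy usage with two stand-in judges, one lenient and one that penalizes a proof containing the word "gap":

```python
judges = [lambda p: 1.0, lambda p: 0.5 if "gap" in p else 1.0]
pessimistic_score("proof with gap", judges)          # the stricter judge decides
should_stop(["clean A", "proof with gap"], judges)   # only one candidate passes
should_stop(["clean A", "clean B"], judges)          # two distinct candidates pass
```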

Topics

Mathematical Proof
Generative Verifiers
Reinforcement Learning
Test-Time Scaling
Large Language Models
Reward Hacking
Olympiad Mathematics

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.