MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Summary
MaxProof is a population-level test-time scaling framework for competition-level mathematical proof, built around the MiniMax-M3 series. The M3 model first integrates three proof-oriented capabilities: generation via verifier-guided RL, verification through aligned error finding, and critique-conditioned repair. At test time, MaxProof uses the same M3 model as generator, verifier, refiner, and ranker to search a population of candidate proofs and pick a final one via tournament selection. With this framework, M3 achieved 35/42 on IMO 2025 and 36/42 on USAMO 2026, both above the human gold-medal thresholds. Relative to one-shot M3, MaxProof added 8 points on IMO 2025 and 10 points on USAMO 2026. Standalone, M3 scored 67.40 on IMOProofBench and 81.56 on IMOAnswerBench.
Key takeaway
For ML engineers building models for high-stakes reasoning such as mathematical proof, MaxProof indicates that pairing robust generative verifiers with population-level test-time scaling raises pass@1 performance well above one-shot generation. Prioritize defense-in-depth verifier design to prevent reward hacking, and consider iterative refinement to convert best@K capability into more stable, gold-medal-level results.
Key insights
MaxProof scales mathematical proof capabilities by combining generative-verifier RL with population-level test-time search and refinement.
Principles
- Generative verifiers require defense-in-depth to prevent reward hacking.
- Long-horizon RL with noisy rewards risks reward-hacking plateaus.
- Population-level search with refinement improves pass@1 from best@K.
Method
MaxProof initializes a population of candidate proofs, iteratively refines the promising candidates with dual PATCH/REWRITE refinement, and selects a final proof via tournament selection. A conservative verifier guides every stage.
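The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`population_search`, `tournament`), the callables passed in (`generate`, `verify`, `patch`, `rewrite`, `compare`), the parameters (`population_size`, `rounds`, `top_k`), and the single-elimination bracket are all hypothetical. In a real system each callable would presumably wrap a call to the same model in a different role.

```python
from typing import Callable, List, TypeVar

P = TypeVar("P")  # a candidate proof (e.g. a string); kept generic for the sketch


def tournament(candidates: List[P], compare: Callable[[P, P], P]) -> P:
    """Single-elimination bracket (an assumption): compare(a, b) returns the winner."""
    if not candidates:
        raise ValueError("tournament needs at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        # Pair neighbours; an unpaired last candidate gets a bye.
        winners = [compare(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    return pool[0]


def population_search(
    generate: Callable[[], P],        # hypothetical: sample one fresh proof
    verify: Callable[[P], float],     # hypothetical: conservative verifier score
    patch: Callable[[P], P],          # hypothetical: PATCH, a local repair
    rewrite: Callable[[P], P],        # hypothetical: REWRITE, a fresh attempt
    compare: Callable[[P, P], P],     # hypothetical: pairwise ranker
    population_size: int = 8,
    rounds: int = 3,
    top_k: int = 4,
) -> P:
    # 1. Initialize the candidate population.
    population = [generate() for _ in range(population_size)]
    for _ in range(rounds):
        # 2. Keep the candidates the verifier considers most promising.
        #    (A real system would cache verifier scores instead of recomputing.)
        promising = sorted(population, key=verify, reverse=True)[:top_k]
        # 3. Refine each one two ways: PATCH exploits, REWRITE explores.
        children: List[P] = []
        for proof in promising:
            children.append(patch(proof))
            children.append(rewrite(proof))
        # 4. Parents and children compete for a slot in the next population.
        population = sorted(promising + children, key=verify, reverse=True)[:population_size]
    # 5. Select the final proof by tournament.
    return tournament(population, compare)
```

Keeping both refinement modes in every round is the point of the dual design as summarized here: PATCH preserves what already works in a candidate, while REWRITE gives the search a way out of a flawed approach.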
In practice
- Implement multi-judge parallel scoring and pessimistic min aggregation for verifiers.
- Use dual PATCH/REWRITE refinement for balanced exploration/exploitation in search.
- Employ population-level early stopping with redundancy checks to mitigate verifier false positives.
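The first and third practices can be illustrated with a short Python sketch. It is hedged throughout: `pessimistic_score`, `should_stop_early`, the score range, `accept_threshold`, and `min_accepted` are hypothetical names and values, and reading "redundancy checks" as "require more than one accepted candidate before stopping" is an assumption about the summary, not a detail confirmed by the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Judge = Callable[[str], float]  # hypothetical: maps a proof to a score, higher is better


def pessimistic_score(proof: str, judges: Sequence[Judge]) -> float:
    """Score one proof with all judges in parallel and keep the lowest score."""
    if not judges:
        raise ValueError("at least one judge is required")
    with ThreadPoolExecutor(max_workers=len(judges)) as pool:
        scores = list(pool.map(lambda judge: judge(proof), judges))
    # Min aggregation: a single judge that finds an error is enough to reject,
    # which makes the reward harder to hack than a mean or a majority vote.
    return min(scores)


def should_stop_early(
    scores: Sequence[float],
    accept_threshold: float = 1.0,  # assumed "fully correct" score
    min_accepted: int = 2,          # assumed redundancy requirement
) -> bool:
    """Stop searching only when several candidates clear the threshold.

    One accepted proof may be a verifier false positive; requiring
    redundancy across the population lowers that risk.
    """
    accepted = sum(1 for score in scores if score >= accept_threshold)
    return accepted >= min_accepted
```

For example, with three judges that return 1.0, 0.5, and 0.9 for the same proof, `pessimistic_score` yields 0.5, so the proof is treated as flawed even though two judges rated it highly.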
Topics
- Mathematical Proof
- Generative Verifiers
- Reinforcement Learning
- Test-Time Scaling
- Large Language Models
- Reward Hacking
- Olympiad Mathematics
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.