Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Summary
The SU-01 model, developed by Shanghai AI Laboratory and collaborators, achieves gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. This 30B-A3B reasoning backbone is trained with a "simple and unified recipe" in three stages: rigorous Supervised Fine-Tuning (SFT) with a reverse-perplexity curriculum on 340K trajectories of under 8K tokens each, a two-stage Reinforcement Learning (RL) pipeline that progresses from verifiable rewards to proof-level RL, and test-time scaling. SU-01 reasons stably on trajectories exceeding 100K tokens and generalizes to scientific reasoning beyond mathematics and physics. It outperforms similar-sized models and approaches commercial systems such as Gemini 3.1 Pro Thinking on benchmarks including IMO-ProofBench and FrontierScience-Research.
Key takeaway
Researchers developing advanced reasoning models should consider a modular, multi-stage training pipeline like SU-01's. Combining rigorous SFT under a reverse-perplexity curriculum, two-stage RL for behavior scaling, and test-time self-verification can push a broadly capable backbone to expert-level performance in demanding domains such as mathematical and scientific Olympiads, while preserving generalization across scientific tasks.
Key insights
A compact 30B-A3B model achieves Olympiad-level reasoning through a multi-stage training and inference pipeline.
Principles
- Specializable-generalist models excel with targeted training.
- Reverse-perplexity curricula stabilize SFT on post-trained models.
- Multi-stage RL refines reasoning from correctness to proof quality.
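As a rough illustration of the curriculum principle above, the sketch below orders SFT trajectories by their perplexity under the starting model. It rests on a loud assumption: "reverse-perplexity" is read here as highest-perplexity-first, which this summary does not spell out. The function names, sample ids, and toy log-probabilities are hypothetical and are not taken from SU-01.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(
    samples: Sequence[Tuple[str, Sequence[float]]],
) -> List[str]:
    """Return sample ids sorted from highest to lowest perplexity.

    ASSUMPTION: "reverse" is interpreted as descending perplexity under
    the model being fine-tuned; the source does not define the direction.
    """
    scored = [(perplexity(logprobs), sample_id) for sample_id, logprobs in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sample_id for _, sample_id in scored]


# Hypothetical per-token log-probabilities for three trajectories.
toy_samples = [
    ("traj_a", [-0.1, -0.2, -0.1]),  # model finds this easy
    ("traj_b", [-2.0, -1.5, -2.5]),  # model finds this hard
    ("traj_c", [-0.7, -0.9, -0.8]),  # in between
]
order = reverse_perplexity_order(toy_samples)
print(order)
```

If the intended direction is the opposite one, dropping `reverse=True` flips the schedule without changing anything else.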
Method
The SU-01 recipe starts with SFT under a reverse-perplexity curriculum, continues with two-stage RL (a coarse stage with verifiable rewards, then a refined stage with proof-level generative rewards), and ends with test-time self-verification and refinement.
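The difference between the two RL stages can be pictured as a difference in reward signal. The sketch below is a minimal illustration under stated assumptions, not SU-01's reward code: stage one is modeled as exact match against a reference answer, and stage two as a score in [0, 1] from a grader that reads the whole proof. `toy_grader` is a hypothetical stand-in for a generative reward model.

```python
from typing import Callable


def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    """Stage 1 (assumed form): binary reward from a checkable final answer."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def proof_level_reward(proof: str, grade_proof: Callable[[str], float]) -> float:
    """Stage 2 (assumed form): a grader scores the full proof; clamp to [0, 1]."""
    return min(1.0, max(0.0, grade_proof(proof)))


def toy_grader(proof: str) -> float:
    """Hypothetical grader: fraction of non-empty lines that state a reason."""
    steps = [line for line in proof.splitlines() if line.strip()]
    if not steps:
        return 0.0
    justified = sum(1 for line in steps if "because" in line or "since" in line)
    return justified / len(steps)


toy_proof = (
    "x is even since x = 2k\n"
    "so x^2 = 4k^2\n"
    "hence x^2 is even because 4k^2 = 2(2k^2)"
)
stage1 = verifiable_reward(" 42 ", "42")
stage2 = proof_level_reward(toy_proof, toy_grader)
print(stage1, round(stage2, 3))
```

A correct final answer earns full reward in the first stage regardless of how it was reached, while the second stage can penalize unjustified steps, which is the sense in which the summary describes RL moving from correctness to proof quality.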
In practice
- Use long-form SFT to instill disciplined proof-search patterns.
- Employ RL with verifiable rewards for efficient reasoning improvement.
- Apply test-time scaling to elevate trained models to expert level.
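The test-time step can be sketched as a generate, verify, refine loop with a fixed budget. This is an illustrative sketch only: the loop structure, the `max_rounds` parameter, and the three callables are assumptions introduced here, and in a real system each callable would wrap a model call rather than the toy functions shown.

```python
from typing import Callable, Tuple


def solve_with_self_verification(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 4,
) -> Tuple[str, bool, int]:
    """Draft a solution, then alternate verify/refine until accepted.

    Returns (solution, accepted, refinement_rounds_used).
    """
    solution = generate(problem)
    for rounds_used in range(max_rounds):
        accepted, critique = verify(problem, solution)
        if accepted:
            return solution, True, rounds_used
        solution = refine(problem, solution, critique)
    # Budget spent: report whether the last refinement passes verification.
    accepted, _ = verify(problem, solution)
    return solution, accepted, max_rounds


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_generate(problem: str) -> str:
    return "5"  # deliberately wrong first draft


def toy_verify(problem: str, solution: str) -> Tuple[bool, str]:
    if solution == "4":
        return True, ""
    return False, "answer is too large"


def toy_refine(problem: str, solution: str, critique: str) -> str:
    return str(int(solution) - 1)


result = solve_with_self_verification("2 + 2", toy_generate, toy_verify, toy_refine)
print(result)
```

Capping the number of rounds keeps inference cost bounded; the returned flag lets a caller distinguish a verified solution from a best-effort one.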
Topics
- Olympiad Reasoning
- Supervised Fine-tuning
- Reinforcement Learning
- Test-Time Scaling
- SU-01 Model
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.