Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
Summary
A new reasoning model, SU-01, achieves gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. Built on a 30B-A3B backbone, it is trained with a unified recipe that converts a post-trained reasoning backbone into a rigorous olympiad-level solver. Training starts with Supervised Fine-Tuning (SFT) under a reverse-perplexity curriculum on approximately 340K sub-8K-token trajectories, followed by a two-stage Reinforcement Learning (RL) pipeline with 200 RL steps. SFT instills rigorous proof-search and self-checking behaviors, and RL then scales them, moving from verifiable rewards to proof-level RL. At inference, the model uses test-time scaling to boost solving performance. It reasons stably on complex problems with trajectories of more than 100K tokens and generalizes well to scientific domains beyond its core training.
Key takeaway
For AI engineers building advanced reasoning systems, this unified recipe offers a concrete path toward expert-level performance in complex problem-solving domains. Consider combining a reverse-perplexity SFT curriculum with a two-stage RL pipeline to instill proof-search and self-checking capabilities, especially in models that tackle long-horizon scientific or mathematical problems. The reported results suggest this approach improves model stability and generalization, both of which matter on competitive benchmarks such as olympiads.
Key insights
A unified recipe scales reasoning models to gold-medal Olympiad performance via SFT and two-stage RL.
Principles
- Reverse-perplexity curriculum enhances rigorous proof-search.
- Two-stage RL moves from verifiable rewards to proof-level RL.
- Test-time scaling boosts final problem-solving performance.
Method
The recipe involves SFT with a reverse-perplexity curriculum, followed by a two-stage RL pipeline progressing from verifiable rewards to proof-level RL, and finally applying test-time scaling.
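The curriculum step in the recipe above can be sketched as follows. The summary does not define "reverse-perplexity", so this sketch assumes it means ordering SFT trajectories from highest to lowest perplexity under a reference model. The `Trajectory` type, the `reverse_perplexity_order` helper, and the 8192-token cutoff standing in for "sub-8K" are hypothetical names and values chosen for illustration, not details taken from the original article.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """Hypothetical container for one SFT example."""
    text: str
    # Natural-log probability of each token under some reference model.
    token_logprobs: List[float]


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("cannot compute perplexity of an empty sequence")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(
    trajectories: List[Trajectory], max_tokens: int = 8192
) -> List[Trajectory]:
    """Keep trajectories shorter than max_tokens, highest perplexity first.

    ASSUMPTION: "reverse-perplexity" is read here as descending perplexity;
    the source summary does not specify the direction or the reference model.
    """
    kept = [t for t in trajectories if len(t.token_logprobs) < max_tokens]
    return sorted(kept, key=lambda t: perplexity(t.token_logprobs), reverse=True)
```

The ordered list would then be fed to the SFT data loader without shuffling, so that the training order follows the curriculum.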
In practice
- Fine-tune a 30B-A3B backbone with SFT on approximately 340K sub-8K-token trajectories.
- Run the two-stage RL pipeline for 200 RL steps to scale proof-search and self-checking behaviors.
- Apply test-time scaling to boost solving performance.
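The summary does not say which form of test-time scaling SU-01 uses. As one common illustration, and not the article's method, the sketch below applies self-consistency voting: sample several answers and return the most frequent one. The `solve_with_voting` function and the `sample_answer` callable are hypothetical, and voting only fits problems with a short final answer, not full proofs.

```python
from collections import Counter
from typing import Callable


def solve_with_voting(
    problem: str,
    sample_answer: Callable[[str], str],
    n_samples: int = 8,
) -> str:
    """Sample n_samples final answers and return the most common one.

    ASSUMPTION: sample_answer stands in for one stochastic model call that
    returns a normalized final answer string; it is not a real library API.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(problem) for _ in range(n_samples)]
    # most_common(1) returns a list with the single (answer, count) pair.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

Raising `n_samples` spends more inference compute per problem in exchange for a more stable final answer, which is the basic trade-off behind test-time scaling.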
Topics
- Olympiad Reasoning
- Supervised Fine-Tuning
- Reinforcement Learning
- Test-Time Scaling
- SU-01 Model
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.