Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

2026-05-15 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

The SU-01 model, developed by Shanghai AI Laboratory and collaborators, achieves gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. The 30B-A3B reasoning backbone is trained with a "simple and unified recipe" in three stages: rigorous Supervised Fine-Tuning (SFT) with a reverse-perplexity curriculum on 340K sub-8K-token trajectories; a two-stage Reinforcement Learning (RL) pipeline that progresses from verifiable rewards to proof-level RL; and test-time scaling. SU-01 reasons stably over trajectories exceeding 100K tokens and generalizes to scientific reasoning beyond mathematics and physics, outperforming similar-sized models and approaching commercial systems such as Gemini 3.1 Pro Thinking on benchmarks including IMO-ProofBench and FrontierScience-Research.

Key takeaway

Research scientists developing advanced reasoning models should consider a modular, multi-stage training pipeline like SU-01's. Combining rigorous SFT under a reverse-perplexity curriculum, two-stage RL for behavior scaling, and test-time self-verification can push broadly capable backbones to expert-level performance in complex domains such as mathematical and scientific Olympiads while preserving generalization across scientific tasks.

Key insights

A compact 30B-A3B model achieves Olympiad-level reasoning through a multi-stage training and inference pipeline.

Principles

Specializable-generalist models excel with targeted training.
Reverse-perplexity curricula stabilize SFT on post-trained models.
Multi-stage RL refines reasoning from correctness to proof quality.
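
To make the curriculum principle concrete, here is a minimal sketch of ordering SFT trajectories by perplexity. This summary does not define "reverse-perplexity", so the direction used here (highest perplexity first) is an assumption, and the function names and the per-token log-probability input are hypothetical, not taken from the SU-01 code.

```python
import math
from typing import List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of one trajectory: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("trajectory has no tokens")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(logprobs_per_sample: List[Sequence[float]]) -> List[int]:
    """Return sample indices sorted from highest to lowest perplexity.

    ASSUMPTION: "reverse" is read as descending perplexity under the model
    being fine-tuned; flip `reverse` if the intended direction is the opposite.
    """
    ppl = [perplexity(lp) for lp in logprobs_per_sample]
    return sorted(range(len(ppl)), key=lambda i: ppl[i], reverse=True)


# Toy per-token log-probabilities for three trajectories (illustrative values).
toy_logprobs = [[-0.1, -0.1], [-2.0, -2.0], [-1.0, -1.0]]
curriculum = reverse_perplexity_order(toy_logprobs)
```

In a real pipeline the log-probabilities would come from scoring each trajectory with the backbone before fine-tuning; that scoring step is omitted here.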

Method

The SU-01 recipe involves SFT with a reverse-perplexity curriculum, followed by two-stage RL (coarse with verifiable rewards, refined with proof-level generative rewards), and concludes with test-time self-verification and refinement for enhanced performance.
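
The two reward signals in the RL stages might look roughly like the sketch below. It is illustrative only: the answer normalization, the `grader` callable standing in for a generative reward model, and the 7-point scale (borrowed from IMO per-problem scoring) are all assumptions rather than details confirmed by the source.

```python
from typing import Callable


def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    """Coarse stage: 1.0 if the final answers match after simple normalization, else 0.0."""
    def normalize(s: str) -> str:
        # Hypothetical normalization: drop all whitespace and ignore case.
        return "".join(s.split()).lower()

    return 1.0 if normalize(predicted_answer) == normalize(reference_answer) else 0.0


def proof_level_reward(
    proof: str,
    grader: Callable[[str], float],
    max_score: float = 7.0,
) -> float:
    """Refined stage: map a grader's rubric score onto [0, 1].

    `grader` is a hypothetical stand-in for a generative reward model that
    reads the whole proof and returns a score between 0 and `max_score`.
    """
    score = grader(proof)
    return min(max(score / max_score, 0.0), 1.0)
```

The split mirrors the described progression: the first reward checks only whether the final answer is right, while the second judges the full written argument.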

In practice

Use long-form SFT to instill disciplined proof-search patterns.
Employ RL with verifiable rewards for efficient reasoning improvement.
Apply test-time scaling to lift trained models to expert-level performance.
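
The test-time step can be pictured as a generate, verify, refine loop. The sketch below assumes three hypothetical callables (`generate`, `verify`, `refine`) that wrap model calls and a fixed round budget; the actual SU-01 inference procedure may differ.

```python
from typing import Callable, Tuple


def solve_with_self_verification(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_rounds: int = 4,
) -> Tuple[str, bool]:
    """Draft a solution, then alternate verification and refinement.

    Returns the last solution and whether it passed verification.
    All three callables are hypothetical wrappers around model calls.
    """
    solution = generate(problem)
    for _ in range(max_rounds):
        accepted, critique = verify(problem, solution)
        if accepted:
            return solution, True
        # Feed the verifier's critique back into the next attempt.
        solution = refine(problem, solution, critique)
    # Budget exhausted: report the verdict on the final refinement.
    accepted, _ = verify(problem, solution)
    return solution, accepted
```

Raising `max_rounds` spends more inference compute per problem, which is the sense in which this kind of loop scales at test time.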

Topics

Olympiad Reasoning
Supervised Fine-tuning
Reinforcement Learning
Test-Time Scaling
SU-01 Model

Code references

Simplified-Reasoning/SU-01

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.