Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

2026-05-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new reasoning model, SU-01, achieves gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. Built on a 30B-A3B backbone, it is trained with a unified recipe that converts a post-trained reasoning backbone into a rigorous olympiad-level solver. Training starts with Supervised Fine-Tuning (SFT) under a reverse-perplexity curriculum on approximately 340K trajectories, each under 8K tokens, followed by a two-stage Reinforcement Learning (RL) pipeline with 200 RL steps. This process instills rigorous proof-search and self-checking behaviors, which the RL stages then scale from verifiable rewards to proof-level RL. At inference, the model uses test-time scaling to boost solving performance, which enables stable reasoning on complex problems with trajectories over 100K tokens. It also generalizes strongly to scientific domains beyond its core training.

Key takeaway

For AI engineers building advanced reasoning systems, this unified recipe offers a concrete path toward expert-level performance on complex problem-solving tasks. Consider combining a reverse-perplexity SFT curriculum with a two-stage RL pipeline to instill robust proof-search and self-checking, especially for models that tackle long-horizon scientific or mathematical problems. The reported benefits are stable reasoning over very long trajectories and generalization beyond the training domains, both of which matter on competitive benchmarks such as the Olympiads.

Key insights

A unified recipe scales reasoning models to gold-medal Olympiad performance via SFT and two-stage RL.

Principles

Reverse-perplexity curriculum enhances rigorous proof-search.
Two-stage RL moves training from verifiable rewards to proof-level rewards.
Test-time scaling boosts final problem-solving performance.
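
As a rough illustration of the first principle, the sketch below orders SFT trajectories by perplexity. The summary does not define "reverse-perplexity", so the ordering here (highest perplexity first) is an assumption, as are the Trajectory type, the function names, and the idea that per-token log-probabilities from some reference model are already available. The 8192-token cutoff stands in for the "sub-8K-token" filter mentioned in the summary.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    # Hypothetical container: text plus per-token log-probabilities
    # scored by a reference model (assumed to be precomputed).
    text: str
    token_logprobs: List[float]


def perplexity(traj: Trajectory) -> float:
    """Perplexity = exp(-mean token log-probability)."""
    n = len(traj.token_logprobs)
    if n == 0:
        raise ValueError("trajectory has no tokens")
    return math.exp(-sum(traj.token_logprobs) / n)


def reverse_perplexity_order(
    trajs: List[Trajectory], max_tokens: int = 8192
) -> List[Trajectory]:
    """Filter to short trajectories, then sort for curriculum SFT."""
    # Keep only non-empty trajectories under the token budget.
    kept = [t for t in trajs if 0 < len(t.token_logprobs) < max_tokens]
    # ASSUMPTION: "reverse" means highest perplexity first, lowest last.
    # Set reverse=False if the intended order is the opposite.
    return sorted(kept, key=perplexity, reverse=True)
```

The sorted list would then be fed to the SFT data loader in order rather than shuffled. Whether the original recipe sorts strictly or in buckets is not stated in the summary.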

Method

The recipe involves SFT with a reverse-perplexity curriculum, followed by a two-stage RL pipeline progressing from verifiable rewards to proof-level RL, and finally applying test-time scaling.
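
The progression from verifiable rewards to proof-level RL can be pictured with the minimal sketch below. It is not the authors' implementation: the function names, the stage boundary (stage1_steps), the exact-match answer check, and the grader callable with a 0 to 7 score scale are all assumptions made for illustration.

```python
from typing import Callable


def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    """Stage 1 (assumed form): binary reward from a final-answer check."""
    return 1.0 if predicted_answer.strip() == reference_answer.strip() else 0.0


def proof_level_reward(
    proof: str, grader: Callable[[str], float], max_score: float = 7.0
) -> float:
    """Stage 2 (assumed form): grader score clipped and scaled to [0, 1].

    `grader` is a hypothetical callable, e.g. a rubric-based judge model.
    """
    score = grader(proof)
    return min(max(score, 0.0), max_score) / max_score


def select_stage(step: int, stage1_steps: int) -> str:
    """Pick the reward type for a given RL step.

    ASSUMPTION: the stages run back to back; the summary gives 200 RL
    steps but does not say how they are split between the stages.
    """
    return "verifiable" if step < stage1_steps else "proof"
```

A training loop would call select_stage at each step and route rollouts to the matching reward function, so the policy first learns to reach correct answers and is then rewarded for the quality of the full argument.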

In practice

Fine-tune a 30B-A3B backbone with SFT on approximately 340K sub-8K-token trajectories.
Run the two-stage RL pipeline for 200 RL steps to scale proof-search and self-checking behaviors.
Apply test-time scaling for performance boost.
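
The summary does not say which test-time scaling method SU-01 uses. One common form is best-of-N sampling with a scoring function, sketched below under that assumption. The generate and score callables are hypothetical placeholders for a model sampling call and a self-check or verifier.

```python
from typing import Callable, Tuple


def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the highest-scoring one.

    ASSUMPTION: best-of-N stands in for the unspecified test-time
    scaling method; `generate` and `score` are hypothetical callables.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(problem) for _ in range(n)]
    scored = [(c, score(problem, c)) for c in candidates]
    # max() returns the first candidate among ties.
    return max(scored, key=lambda pair: pair[1])
```

Raising n trades more inference compute for a better chance that at least one sampled solution passes the check, which is the basic idea behind spending extra compute at test time.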

Topics

Olympiad Reasoning
Supervised Fine-Tuning
Reinforcement Learning
Test-Time Scaling
SU-01 Model

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.