Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

2025-10-15 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

NVIDIA researchers introduce GenCluster, a scalable test-time compute framework that enables open-weight large language models (LLMs) to achieve gold medal-level performance at the International Olympiad in Informatics (IOI) 2025. This framework addresses the challenge of matching proprietary models' performance with transparent, reproducible methods. GenCluster integrates large-scale solution generation, behavioral clustering, LLM-based ranking via a tournament, and a round-robin submission strategy to navigate IOI's strict validation budgets and submission limits. Experiments demonstrate that the gpt-oss-120b model, when combined with GenCluster and 5000 generations per subtask, achieved a gold medal score of 446.75, marking the first time an open-weight model has reached this level. The approach shows consistent performance scaling with increased compute, narrowing the gap between open and closed AI systems in competitive programming.

Key takeaway

For research scientists developing competitive programming LLMs, GenCluster offers a transparent and reproducible way to reach gold-medal-level IOI performance with open-weight models. Consider implementing its four-stage pipeline (parallel generation, behavioral clustering, tournament-based ranking, and round-robin submission) to maximize scores under strict validation budgets and submission limits. The results indicate that scaling test-time compute is a key lever for narrowing the performance gap between open and proprietary systems.

Key insights

GenCluster enables open-weight LLMs to achieve IOI gold medal performance through scalable test-time compute and strategic solution selection.

Principles

LLM performance scales with test-time compute.
Behavioral clustering improves solution selection.
LLM-as-a-judge can rank code solutions.
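
To make the LLM-as-a-judge principle concrete, here is a minimal sketch of ranking candidates by pairwise comparison. The summary does not specify the tournament format, so the all-pairs schedule, the `tournament_rank` name, and the `judge` callable (a stand-in for an actual LLM call) are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, Sequence


def tournament_rank(
    candidates: Sequence[str],
    judge: Callable[[str, str], int],
) -> list[int]:
    """Return candidate indices ordered by pairwise wins, most wins first.

    `judge(a, b)` is a hypothetical stand-in for an LLM comparison:
    it returns 0 if it prefers `a` and 1 if it prefers `b`.
    """
    wins = [0] * len(candidates)
    # Assumed schedule: every candidate meets every other candidate once.
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if judge(candidates[i], candidates[j]) == 0 else j
        wins[winner] += 1
    # sorted() is stable, so ties keep their original order.
    return sorted(range(len(candidates)), key=lambda k: -wins[k])


def prefer_longer(a: str, b: str) -> int:
    """Toy judge used only for demonstration: prefers the longer string."""
    return 0 if len(a) >= len(b) else 1


ranking = tournament_rank(["a", "ccc", "bb"], prefer_longer)
```

An all-pairs schedule needs a number of judge calls that grows quadratically with the number of candidates, which is one reason to rank a small set of cluster representatives rather than every generated solution.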

Method

GenCluster generates many candidate solutions, clusters them by behavioral similarity, ranks clusters using an LLM-based tournament, and employs a round-robin submission strategy under IOI constraints.
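
The clustering and submission stages can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: candidate solutions are modeled as Python callables (real candidates would be compiled programs run on generated test inputs), two solutions share a cluster only if their outputs match on every test input, and `cluster_by_behavior` and `round_robin` are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Callable, Sequence


def cluster_by_behavior(
    solutions: Sequence[Callable[[Any], Any]],
    test_inputs: Sequence[Any],
) -> list[list[int]]:
    """Group solution indices whose outputs agree on every test input."""
    clusters: dict[tuple, list[int]] = defaultdict(list)
    for idx, solve in enumerate(solutions):
        # The tuple of outputs acts as the solution's behavioral signature.
        signature = tuple(solve(x) for x in test_inputs)
        clusters[signature].append(idx)
    return list(clusters.values())


def round_robin(ranked_clusters: Sequence[Sequence[int]], budget: int) -> list[int]:
    """Take one solution per cluster in rank order, cycling until the budget is spent."""
    order: list[int] = []
    depth = 0
    while len(order) < budget and any(depth < len(c) for c in ranked_clusters):
        for cluster in ranked_clusters:
            if depth < len(cluster) and len(order) < budget:
                order.append(cluster[depth])
        depth += 1
    return order


# Three toy "solutions": the first two behave identically on all inputs.
solutions = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
clusters = cluster_by_behavior(solutions, test_inputs=[1, 2, 3])
submission_order = round_robin(clusters, budget=3)
```

Cycling across clusters before going deeper into any one of them spends the limited submission budget on behaviorally distinct candidates first, so a single wrong cluster cannot consume every attempt.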

In practice

Generate up to 5000 solutions per subtask; the reported gold-medal score used 5000 generations, and scores rose consistently with compute.
Use C++ for competitive programming solutions.
Use the longest reasoning trace as a proxy for correctness.
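
The reasoning-trace heuristic in the last item can be sketched in a few lines. The summary only states that trace length serves as a correctness proxy; applying it to choose one candidate from a group, measuring length in characters, and the `Candidate` and `pick_by_longest_trace` names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    code: str       # generated solution source (e.g. C++)
    reasoning: str  # the model's reasoning trace for this solution


def pick_by_longest_trace(group: Sequence[Candidate]) -> Candidate:
    """Return the candidate with the longest reasoning trace.

    Length is measured in characters here; token counts would be an
    equally plausible choice. On ties, max() keeps the first candidate.
    """
    return max(group, key=lambda c: len(c.reasoning))


group = [
    Candidate(code="// solution A", reasoning="short"),
    Candidate(code="// solution B", reasoning="a much longer chain of reasoning"),
]
chosen = pick_by_longest_trace(group)
```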

Topics

Competitive Programming
Large Language Models
Test-Time Compute
GenCluster
IOI Gold Medal

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.