Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

2026-06-11 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

UOJ-Bench is a new benchmark for evaluating Large Language Models (LLMs) in competitive programming. It goes beyond problem solving to cover two further tasks: finding errors in human-written code (hacking) and repairing code. The benchmark is built from real-world submissions on the Universal Online Judge (UOJ) and uses UOJ's native judging infrastructure for evaluation. Initial results indicate that under one-shot evaluation, even strong models fail to identify errors in over 50% of incorrect submissions. Test-time scaling boosts success rates above 90%, but its substantial computational cost limits large-scale deployment. Even so, the best-performing models with test-time scaling uncover errors in over 5% of full-score submissions across approximately 30 problems, which suggests LLMs can offer signals complementary to standard judging systems.

Key takeaway

Machine learning engineers evaluating LLMs for code analysis or educational tools should note that LLMs show promise in identifying subtle errors in human code, but high accuracy often requires computationally expensive methods such as test-time scaling. Weigh inference cost against error-detection performance, and consider hybrid systems that combine LLM insights with traditional online judge feedback to use resources more efficiently while supporting human learning.
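
The cost side of this trade-off can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that test-time scaling means drawing k independent hack attempts per submission, each succeeding with the same probability p. The summary does not say how the benchmark's authors scale test-time compute, and real attempts are unlikely to be fully independent, so treat the numbers as an optimistic bound. The function names and the cost model are hypothetical.

```python
import math


def attempts_needed(p_single: float, target: float) -> int:
    """Smallest k with 1 - (1 - p_single)**k >= target.

    Assumption: attempts are independent and equally likely to succeed.
    """
    if not 0.0 < p_single < 1.0:
        raise ValueError("p_single must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


def total_attempts(n_submissions: int, p_single: float, target: float) -> int:
    """Total model calls to reach the target rate on every submission
    (hypothetical cost model: one model call per attempt)."""
    return n_submissions * attempts_needed(p_single, target)
```

With p = 0.5 and a 90% target, four attempts are needed, since 1 - 0.5^4 = 0.9375. Because the reported one-shot success rate is below 50%, the real multiplier would be at least that large under these assumptions, which is one way to see why the cost grows quickly at scale.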

Key insights

LLMs can complement traditional online judges by identifying subtle code errors, but effective error detection remains computationally intensive.

Principles

LLMs struggle with one-shot error identification in human code.
Test-time scaling significantly improves LLM code error detection.
LLMs can provide signals beyond standard judging systems.

Method

UOJ-Bench evaluates LLMs on code generation, hacking, and repair tasks using real UOJ submissions and native judging infrastructure to assess error identification capabilities.
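
To illustrate what a successful hack means, here is a minimal sketch that checks a candidate test input against an incorrect submission and a trusted reference solution. It is not UOJ's judging infrastructure, which the summary does not describe. Programs are modeled as plain Python callables, and every name here (is_successful_hack, reference_max, buggy_max, is_valid_input) is hypothetical.

```python
from typing import Callable

# Assumption: a program maps an input string to an output string.
Program = Callable[[str], str]


def is_successful_hack(
    test_input: str,
    submission: Program,
    reference: Program,
    is_valid_input: Callable[[str], bool],
) -> bool:
    """Return True if a valid input makes the submission disagree with the reference."""
    if not is_valid_input(test_input):
        return False  # inputs outside the problem constraints do not count
    try:
        got = submission(test_input)
    except Exception:
        return True  # in this sketch, a crash on valid input counts as a failure
    # Compare token by token so whitespace differences are ignored.
    return got.split() != reference(test_input).split()


def is_valid_input(s: str) -> bool:
    """Toy constraint: a non-empty list of integers."""
    tokens = s.split()
    if not tokens:
        return False
    try:
        for t in tokens:
            int(t)
    except ValueError:
        return False
    return True


def reference_max(s: str) -> str:
    return str(max(int(x) for x in s.split()))


def buggy_max(s: str) -> str:
    best = 0  # bug: wrong when every number is negative
    for x in (int(t) for t in s.split()):
        if x > best:
            best = x
    return str(best)
```

In this toy case the input "3 1 2" does not expose the bug, while "-5 -2" does: the buggy program prints 0 and the reference prints -2.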

In practice

Explore LLM integration for code review in competitive programming.
Consider test-time scaling for critical code analysis tasks.
Develop hybrid systems combining LLMs with traditional judges.
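
One possible shape for such a hybrid system is sketched below: the traditional judge runs first, and the costlier LLM search is spent only on submissions that already pass, within a fixed attempt budget. This is an illustrative design, not something described in the original article. The callables judge_passes, propose_hack, and hack_breaks are hypothetical stand-ins for a judge, an LLM call, and a hack validator.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    verdict: str  # "rejected_by_judge", "hacked_by_llm", or "accepted"
    llm_attempts: int
    counterexample: Optional[str] = None


def hybrid_review(
    code: str,
    judge_passes: Callable[[str], bool],
    propose_hack: Callable[[str, int], Optional[str]],
    hack_breaks: Callable[[str, str], bool],
    max_attempts: int = 8,
) -> Review:
    """Run the cheap judge first, then spend a bounded LLM budget on passing code."""
    if not judge_passes(code):
        # The standard test data already caught the error; no LLM call needed.
        return Review("rejected_by_judge", 0)
    for attempt in range(1, max_attempts + 1):
        candidate = propose_hack(code, attempt)  # hypothetical LLM call
        if candidate is not None and hack_breaks(code, candidate):
            return Review("hacked_by_llm", attempt, candidate)
    return Review("accepted", max_attempts)
```

The max_attempts parameter is where the inference budget is set: raising it increases the chance of catching an error in a full-score submission at a proportional increase in model calls.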

Topics

UOJ-Bench
Large Language Models
Competitive Programming
Code Generation
Code Hacking
Code Repair
Online Judges

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.