Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

2026-06-12 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

UOJ-Bench is a new benchmark that evaluates Large Language Models (LLMs) in competitive programming beyond problem solving alone, focusing on their ability to identify and fix errors in human-written code. It is built from real-world submissions to the Universal Online Judge (UOJ), scored with UOJ's native judging, and comprises three tasks: code generation, code hacking (creating test cases that break a submission), and code repair (generating minimal patches). In initial one-shot evaluations, even strong models fail to detect errors in over 50% of incorrect submissions. Test-time scaling raises success rates above 90%, but at a computational cost that makes large-scale deployment impractical. Even so, top models under scaling uncover errors in over 5% of full-score submissions across approximately 30 problems, a signal that traditional judging systems do not provide.

Key takeaway

For AI Scientists and Machine Learning Engineers evaluating LLMs for code verification or educational feedback: models can uncover subtle "zero-day" bugs and repair complex code, but reaching high accuracy through test-time scaling is currently prohibitively expensive. Prioritize cost-effective models and explore agentic frameworks for repair, while recognizing that large-scale deployment remains economically unsustainable compared to traditional methods.

Key insights

LLMs can actively verify and debug competitive programming code, but current inference costs limit practical deployment.

Principles

LLMs detect covert errors less effectively than overt errors.
Debugging capabilities are partially independent of code generation.
Test-time scaling improves LLM performance but raises inference costs.
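
The trade-off in the last principle can be made concrete with a little arithmetic. The sketch below is an illustration only. It assumes each attempt succeeds independently with the same probability, which the source does not claim, and the 0.4 single-attempt rate is a made-up figure (the summary says only that one-shot detection fails on over 50% of incorrect submissions). The function names are hypothetical.

```python
def success_after_k(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** k


def attempts_needed(p_single: float, target: float) -> int:
    """Smallest k whose success_after_k reaches the target success rate."""
    if not 0.0 < p_single < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_single and target must lie strictly between 0 and 1")
    k = 1
    while success_after_k(p_single, k) < target:
        k += 1
    return k


# Illustrative numbers only: a 40% one-shot rate needs 5 attempts to pass 90%.
k = attempts_needed(0.4, 0.9)
print(k, round(success_after_k(0.4, k), 5))
```

Under these assumptions, inference cost grows roughly linearly with the number of attempts while the gain per extra attempt shrinks, which is one simple way to see why high accuracy gets expensive.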

Method

UOJ-Bench evaluates LLMs on code generation, hacking, and repair using real-world UOJ submissions and native judging, distinguishing overt from covert errors.
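
To make the hacking task concrete, here is a minimal sketch of how a candidate hack could be checked by comparing a suspect solution against a trusted reference. This is not UOJ's judging pipeline or the benchmark's harness: the toy problem (sum of 1..n with 1 ≤ n ≤ 10^9), the helper names, and the simulated 32-bit overflow bug are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional


def wrap32(x: int) -> int:
    """Simulate signed 32-bit wraparound, as in a C++ int overflow."""
    return ((x + 2**31) % 2**32) - 2**31


def reference(n: int) -> int:
    """Trusted solution: sum of 1..n with arbitrary-precision integers."""
    return n * (n + 1) // 2


def suspect(n: int) -> int:
    """Buggy submission: the product overflows a 32-bit int for large n."""
    return wrap32(n * (n + 1)) // 2


def is_valid(n: int) -> bool:
    """Input constraint of the toy problem; a hack must respect it."""
    return 1 <= n <= 10**9


def find_hack(suspect_fn: Callable[[int], int],
              reference_fn: Callable[[int], int],
              candidates: Iterable[int],
              valid: Callable[[int], bool]) -> Optional[int]:
    """Return the first valid input on which the two solutions disagree."""
    for inp in candidates:
        if not valid(inp):
            continue  # an out-of-constraint input is not a legitimate hack
        if suspect_fn(inp) != reference_fn(inp):
            return inp
    return None


# Candidate inputs would come from an LLM; here they are hard-coded.
hack = find_hack(suspect, reference, [10**10, 1, 1000, 100000], is_valid)
print(hack)
```

The small inputs pass because the product stays within 32 bits, so only a deliberately large, still valid input exposes the bug. The same weakness explains how a submission can pass a weak test suite with full score and still be hackable.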

In practice

Use LLMs to generate adversarial test cases for "zero-day" bug discovery.
Employ agentic frameworks for iterative code repair, leveraging feedback.
Integrate LLM-based hacking to strengthen test suites in online judges.
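
The iterative repair idea above can be sketched as a loop that alternates between judging and patching. Everything here is a stand-in: `toy_judge` plays the role of an online judge, `toy_patcher` plays the role of an LLM call, and the one-expression "program" is a deliberately trivial assumption. It is a sketch of the control flow, not the framework used in the paper.

```python
from typing import Callable, Optional, Tuple

Judge = Callable[[str], Optional[str]]  # None means accepted, else feedback text
Patcher = Callable[[str, str], str]     # (code, feedback) -> revised code


def repair_loop(code: str, propose_patch: Patcher, judge: Judge,
                max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Return (accepted code or None, number of patches applied)."""
    feedback = judge(code)
    rounds = 0
    while feedback is not None and rounds < max_rounds:
        code = propose_patch(code, feedback)  # feedback guides the next patch
        rounds += 1
        feedback = judge(code)
    return (code if feedback is None else None), rounds


# Toy task: the expression should compute a + b.
TESTS = [((2, 3), 5), ((10, -4), 6)]


def toy_judge(code: str) -> Optional[str]:
    """Evaluate the expression on each test; report the first failure."""
    for (a, b), expected in TESTS:
        got = eval(code, {"__builtins__": {}}, {"a": a, "b": b})
        if got != expected:
            return f"wrong answer on a={a}, b={b}: expected {expected}, got {got}"
    return None


def toy_patcher(code: str, feedback: str) -> str:
    """Stand-in for a model call: flips the first '-' to '+'."""
    return code.replace("-", "+", 1)


fixed, rounds = repair_loop("a - b", toy_patcher, toy_judge)
print(fixed, rounds)
```

Capping the loop with `max_rounds` keeps the number of model calls bounded, which matters given the inference-cost concern raised in the summary.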

Topics

Large Language Models
Competitive Programming
Code Hacking
Code Repair
Benchmark Evaluation
Computational Cost

Code references

hehezhou/UOJ-Bench

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.