Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

UOJ-Bench is a benchmark for evaluating Large Language Models (LLMs) in competitive programming beyond problem solving alone, with a focus on their ability to find and fix errors in human-written code. It is built from real submissions to the Universal Online Judge (UOJ), uses UOJ's native judging, and comprises three tasks: code generation, code hacking (constructing test cases that break a given submission), and code repair (generating minimal patches). In initial one-shot evaluations, even strong models fail to detect the error in more than 50% of incorrect submissions. Test-time scaling raises success rates above 90%, but at a computational cost that makes large-scale deployment impractical. Even so, top models under scaling uncover errors in more than 5% of full-score submissions across approximately 30 problems, a signal that goes beyond what traditional judging systems provide.

Key takeaway

If you are an AI scientist or machine learning engineer evaluating LLMs for code verification or educational feedback: models can uncover subtle "zero-day" bugs and repair complex code, but reaching high accuracy through test-time scaling is currently prohibitively expensive. Prioritize cost-effective models and explore agentic frameworks for repair, while recognizing that large-scale deployment remains economically unsustainable compared with traditional methods.

Key insights

LLMs can actively verify and debug competitive programming code, but current inference costs limit practical deployment.
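To make the cost concern concrete, here is a back-of-the-envelope sketch of how many attempts repeated sampling might need. It assumes every attempt succeeds independently with the same probability, which is a simplification and not something the paper claims. The function name `attempts_needed` and the probabilities used below are hypothetical, chosen only for illustration.

```python
import math


def attempts_needed(p_single: float, target: float) -> int:
    """Smallest k with 1 - (1 - p_single)**k >= target.

    Assumption (not from the paper): attempts are independent and each
    succeeds with the same probability p_single.
    """
    if not (0.0 < p_single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p_single and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


if __name__ == "__main__":
    # Hypothetical per-attempt success rates, not measured values.
    for p in (0.5, 0.3, 0.1):
        print(f"p={p}: {attempts_needed(p, 0.9)} attempts for a 90% target")
```

Under this simplified model, the number of attempts, and with it the inference cost, grows quickly as the per-attempt success rate falls. Real attempts on the same submission are likely correlated, so actual costs may be higher.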

Method

UOJ-Bench evaluates LLMs on code generation, hacking, and repair using real-world UOJ submissions and native judging, distinguishing overt from covert errors.
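The hacking task can be pictured with a minimal sketch: a candidate input counts as a successful hack when the submission's output differs from a trusted answer. This is not UOJ's judging pipeline or the paper's harness. Everything here is a hypothetical stand-in: the toy problem, the names `reference_sum`, `buggy_sum`, and `find_hack`, and the use of a reference solution as the oracle.

```python
from typing import Callable, Iterable, Optional

# Hypothetical model of a submission: a function from one integer input
# to one integer output. Real submissions are programs run by a judge.
Solution = Callable[[int], int]


def reference_sum(n: int) -> int:
    """Trusted solution: 1 + 2 + ... + n in closed form."""
    return n * (n + 1) // 2


def buggy_sum(n: int) -> int:
    """Submission under attack: emulates a 32-bit signed overflow of n * (n + 1)."""
    product = (n * (n + 1)) & 0xFFFFFFFF
    if product >= 0x80000000:
        product -= 0x100000000  # reinterpret as a signed 32-bit value
    return product // 2


def find_hack(reference: Solution, submission: Solution,
              candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate input on which the outputs differ, else None."""
    for x in candidates:
        if submission(x) != reference(x):
            return x
    return None


if __name__ == "__main__":
    # Small inputs pass; only a large input exposes the overflow.
    print(find_hack(reference_sum, buggy_sum, [1, 10, 1000, 50000]))
```

In this framing, the model's job in the hacking task is to propose the candidate inputs. The sketch only shows how a proposed input would be checked.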

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.