Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming
Summary
UOJ-Bench is a new benchmark designed to evaluate Large Language Models (LLMs) in competitive programming beyond just problem-solving, focusing on their ability to identify and fix errors in human-written code. Constructed from real-world submissions on the Universal Online Judge (UOJ) and using its native judging, the benchmark features three tasks: code generation, code hacking (creating test cases to break code), and code repair (generating minimal patches). Initial one-shot evaluations show even strong models fail to detect errors in over 50% of incorrect submissions. While test-time scaling boosts success rates above 90%, it incurs significant computational costs, making large-scale deployment impractical. However, top models under scaling can uncover errors in over 5% of full-score submissions across approximately 30 problems, offering valuable signals beyond traditional judging systems.
Key takeaway
For AI scientists and machine learning engineers evaluating LLMs for code verification or educational feedback: models can uncover subtle "zero-day" bugs and repair complex code, but reaching high accuracy through test-time scaling is currently prohibitively expensive. Prioritize cost-effective models and explore agentic frameworks for repair, keeping in mind that large-scale deployment remains economically unsustainable compared to traditional methods.
Key insights
LLMs can actively verify and debug competitive programming code, but current inference costs limit practical deployment.
Principles
- LLMs detect covert errors less effectively than overt errors.
- Debugging capabilities are partially independent of code generation.
- Test-time scaling improves LLM performance but raises inference costs.
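The trade-off in the last principle can be made concrete with a back-of-the-envelope model. The sketch below assumes that each attempt succeeds independently with the same probability, which is a simplification introduced here and not a claim from the paper; the function names and the example probabilities are illustrative only. Under that assumption, cost grows linearly with the number of attempts while the success rate approaches 100% with diminishing returns.

```python
import math


def detection_rate(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds.

    Assumes every try succeeds with the same probability `p_single`
    (an idealisation; real attempts on the same bug are correlated).
    """
    return 1.0 - (1.0 - p_single) ** attempts


def attempts_needed(p_single: float, target: float) -> int:
    """Smallest number of independent attempts whose combined rate reaches `target`."""
    if not (0.0 < p_single < 1.0) or not (0.0 < target < 1.0):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))
```

For example, a hypothetical model that succeeds half the time in one shot would need `attempts_needed(0.5, 0.9)` attempts to pass 90% under this model, so the inference bill is multiplied by that same factor for every submission checked.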
Method
UOJ-Bench evaluates LLMs on code generation, hacking, and repair using real-world UOJ submissions and native judging, distinguishing overt from covert errors.
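To illustrate what the hacking task asks for, here is a minimal sketch of checking whether a candidate input breaks a submission. It is not UOJ's judging pipeline: solutions are modelled as plain Python functions instead of sandboxed programs, and the toy problem ("print the maximum of the integers on one line"), the validator bounds, and all names are assumptions made for this example. A candidate counts as a hack only if it is a valid input and the submission's output differs from a trusted reference.

```python
from typing import Callable, Iterable, Optional

Solution = Callable[[str], str]  # maps raw input text to output text


def reference_max(data: str) -> str:
    """Trusted solution for the toy problem."""
    return str(max(int(tok) for tok in data.split()))


def buggy_max(data: str) -> str:
    """Submission with a classic bug: wrong when every value is negative."""
    best = 0  # should start from the first element, not 0
    for tok in data.split():
        best = max(best, int(tok))
    return str(best)


def is_valid_input(data: str) -> bool:
    """Hypothetical constraints: at least one integer, each in [-1000, 1000]."""
    tokens = data.split()
    if not tokens:
        return False
    try:
        values = [int(tok) for tok in tokens]
    except ValueError:
        return False
    return all(-1000 <= v <= 1000 for v in values)


def find_hack(
    candidates: Iterable[str],
    submission: Solution,
    reference: Solution,
    validator: Callable[[str], bool],
) -> Optional[str]:
    """Return the first valid candidate on which the submission disagrees
    with the reference, or None if no candidate breaks it."""
    for data in candidates:
        if not validator(data):
            continue  # an invalid input cannot count as a hack
        if submission(data).strip() != reference(data).strip():
            return data
    return None
```

In a real setting the candidates would come from the model and the comparison would be done by the judge; an input returned by `find_hack` is also the kind of test case that could be added to a problem's test suite.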
In practice
- Use LLMs to generate adversarial test cases that expose "zero-day" bugs, such as errors in submissions that currently receive a full score.
- Employ agentic frameworks for iterative code repair, leveraging feedback.
- Integrate LLM-based hacking to strengthen test suites in online judges.
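The iterative repair idea in the list above can be sketched as a loop that judges a program, hands the failure feedback to a patch proposer, and retries within a fixed budget. This is a simplified illustration, not the framework evaluated in the paper: programs are plain Python callables, `propose_patch` is a hypothetical stand-in for an LLM call, and the scripted proposer exists only so the example runs without a model.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Program = Callable[[int], int]
TestCase = Tuple[int, int]  # (input, expected output)
Proposer = Callable[[Program, List[str]], Program]


def judge(program: Program, tests: Sequence[TestCase]) -> List[str]:
    """Run every test and return human-readable feedback for each failure."""
    feedback = []
    for value, expected in tests:
        try:
            got = program(value)
        except Exception as exc:  # a crash is reported as feedback too
            feedback.append(f"input {value}: raised {type(exc).__name__}")
            continue
        if got != expected:
            feedback.append(f"input {value}: expected {expected}, got {got}")
    return feedback


def repair_loop(
    program: Program,
    tests: Sequence[TestCase],
    propose_patch: Proposer,
    max_rounds: int = 3,
) -> Tuple[Optional[Program], int]:
    """Alternate judging and patching until all tests pass or the budget ends.

    Returns (passing program or None, number of patch rounds used).
    """
    rounds = 0
    while True:
        feedback = judge(program, tests)
        if not feedback:
            return program, rounds
        if rounds == max_rounds:
            return None, rounds
        program = propose_patch(program, feedback)
        rounds += 1


def make_scripted_proposer(candidates: Sequence[Program]) -> Proposer:
    """Stand-in for a model: returns pre-written patches one by one."""
    queue = list(candidates)

    def propose(program: Program, feedback: List[str]) -> Program:
        return queue.pop(0) if queue else program

    return propose
```

Capping `max_rounds` is the design choice that matters most here, since each extra round is another model call and the cost concerns raised in the summary apply directly to this loop.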
Topics
- Large Language Models
- Competitive Programming
- Code Hacking
- Code Repair
- Benchmark Evaluation
- Computational Cost
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.