Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming
Summary
UOJ-Bench is a new benchmark designed to evaluate Large Language Models (LLMs) in competitive programming, extending beyond problem solving to include identifying errors in human-written code (hacking) and code repair. Constructed from real-world submissions on the Universal Online Judge (UOJ), the benchmark uses UOJ's native judging infrastructure for evaluation. Initial results indicate that under one-shot evaluation, even strong models fail to identify errors in over 50% of incorrect submissions. Test-time scaling boosts success rates above 90%, but its substantial computational cost limits large-scale deployment. Even so, the best-performing models with test-time scaling can uncover errors in over 5% of full-score submissions across approximately 30 problems, suggesting LLMs can offer signals complementary to standard judging systems.
Key takeaway
Machine learning engineers evaluating LLMs for code analysis or educational tools should recognize that, while LLMs show promise in identifying subtle errors in human code, achieving high accuracy often requires computationally expensive methods such as test-time scaling. Weigh inference cost against error detection performance, and consider hybrid systems that combine LLM insights with traditional online judge feedback to make better use of resources when supporting human learning.
Key insights
LLMs can complement traditional online judges by identifying subtle code errors, but effective error detection remains computationally intensive.
Principles
- LLMs struggle with one-shot error identification in human code.
- Test-time scaling significantly improves LLM code error detection.
- LLMs can provide signals beyond standard judging systems.
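The cost side of the first two principles can be illustrated with a simple back-of-the-envelope model. The sketch below assumes test-time scaling amounts to repeated, independent hacking attempts, each succeeding with the same probability; this is a simplifying assumption for illustration, not a description of the procedure used in UOJ-Bench, and the function names and the 0.45 figure are hypothetical.

```python
def success_after_k_attempts(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds.

    Assumption (not from the paper): attempts are independent and each
    succeeds with the same probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


def attempts_needed(p_single: float, target: float) -> int:
    """Smallest k whose overall success probability reaches target."""
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    k = 1
    while success_after_k_attempts(p_single, k) < target:
        k += 1
    return k


# Illustrative number only: a one-shot success rate just under 50%.
k = attempts_needed(p_single=0.45, target=0.90)
print(k, round(success_after_k_attempts(0.45, k), 4))
```

Under this toy model, moving from a one-shot success rate below 50% to above 90% multiplies the number of model calls per submission several times over, which is one way to read the deployment cost concern.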
Method
UOJ-Bench evaluates LLMs on code generation, hacking, and repair tasks using real UOJ submissions and native judging infrastructure to assess error identification capabilities.
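To make the hacking task concrete, here is a minimal sketch of what a successful hack means: an input on which a submission's output differs from that of a trusted reference solution. The names `find_hack`, `reference_sum`, and `buggy_sum` are hypothetical, programs are modeled as plain Python functions, and output comparison is a simple whitespace-insensitive match; the real benchmark relies on UOJ's native judging infrastructure rather than anything like this.

```python
from typing import Callable, Iterable, Optional

# A program is modeled as a function from input text to output text.
Program = Callable[[str], str]


def find_hack(
    submission: Program,
    reference: Program,
    candidate_inputs: Iterable[str],
) -> Optional[str]:
    """Return the first candidate input that exposes the submission, or None.

    Assumption: the reference is correct and each problem has a single
    valid output per input, so a mismatch means the submission is wrong.
    """
    for test_input in candidate_inputs:
        if submission(test_input).strip() != reference(test_input).strip():
            return test_input
    return None


# Toy problem: print the sum of two integers.
def reference_sum(data: str) -> str:
    a, b = map(int, data.split())
    return str(a + b)


# Toy faulty submission: silently mishandles negative numbers.
def buggy_sum(data: str) -> str:
    a, b = map(int, data.split())
    return str(abs(a) + abs(b))


# In the benchmark setting, candidate inputs would come from the model.
hack = find_hack(buggy_sum, reference_sum, ["1 2", "3 4", "-1 5"])
print(repr(hack))
```

The first two inputs pass, so a weak test set would give this submission full score; only the input containing a negative number exposes it, which mirrors how full-score submissions can still hide errors.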
In practice
- Explore LLM integration for code review in competitive programming.
- Consider test-time scaling for critical code analysis tasks.
- Develop hybrid systems combining LLMs with traditional judges.
Topics
- UOJ-Bench
- Large Language Models
- Competitive Programming
- Code Generation
- Code Hacking
- Code Repair
- Online Judges
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.