An Iterative Test-and-Repair Framework for Competitive Code Generation

2026-07-01 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

FixAudit is an iterative test-and-repair framework for competitive code generation that addresses limitations of earlier methods such as CURE. It trains a single shared model in two roles: a Fixer, which repairs code given failing tests, and an Auditor, which reads the candidate code and generates new tests that expose its bugs. Training follows a four-stage pipeline: execution-aligned supervised fine-tuning (SFT), then reinforcement learning (RL) stages for initial repair, targeted test generation, and closed-loop refinement. On APPS, CodeContests, and xCodeEval, FixAudit built on Qwen2.5-Coder-7B-Instruct outperforms the larger zero-shot Qwen2.5-Coder-32B-Instruct baseline by 24.9% in average Pass@1 and 40.5% in average AvgPassRatio. It also improves average Pass@1 by 35.1% to 36.8% over strong 7B baselines such as Specine and CURE.
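The two metrics quoted above can be made concrete with a small sketch. It assumes the common definitions (Pass@1 as the share of problems whose single submission passes every test, AvgPassRatio as the mean per-problem fraction of tests passed); the summary does not spell out the paper's exact formulas, and the function names here are illustrative.

```python
from typing import List


def pass_at_1(results: List[List[bool]]) -> float:
    """Share of problems whose one submitted program passes all of its tests."""
    if not results:
        return 0.0
    solved = sum(1 for tests in results if tests and all(tests))
    return solved / len(results)


def avg_pass_ratio(results: List[List[bool]]) -> float:
    """Mean over problems of the fraction of tests passed (partial credit)."""
    if not results:
        return 0.0
    # A problem with no recorded tests contributes 0 to the mean.
    ratios = [sum(tests) / len(tests) if tests else 0.0 for tests in results]
    return sum(ratios) / len(results)


# Hypothetical per-test outcomes for three problems (not data from the paper).
results = [
    [True, True],
    [True, False],
    [False, False, False, False],
]
```

AvgPassRatio rewards partially correct programs, which is why the two numbers can move by different amounts for the same system.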

Key takeaway

AI Scientists and Machine Learning Engineers developing code generation models should consider integrating iterative test-and-repair mechanisms. FixAudit demonstrates that a code-aware Auditor for targeted bug exposure, coupled with a Fixer for incremental repair, outperforms larger zero-shot models and existing frameworks. A multi-stage RL pipeline that starts with execution-aligned SFT can build robust debugging capabilities and reach higher Pass@1 scores with fewer iterations.

Key insights

Iterative, code-aware test-and-repair cycles significantly enhance competitive code generation performance.

Principles

Execution reasoning is foundational for debugging agents.
Targeted test generation requires candidate code analysis.
Program repair should be incremental, preserving correct logic.

Method

FixAudit uses a four-stage training pipeline: SFT for execution reasoning, then RL stages that train the Fixer (repair from failing tests) and the Auditor (code-aware, bug-revealing test generation) in iterative cycles, refined by DAPO.
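The inference-time shape of such a Fixer/Auditor cycle can be sketched as follows. This is a toy illustration, not the paper's implementation: programs are plain Python callables, the `auditor` and `fixer` arguments stand in for model calls, and every name (`repair_loop`, `failing_pairs`, `toy_auditor`, `toy_fixer`) is hypothetical.

```python
from typing import Callable, List, Tuple

Program = Callable[[int], int]
IOPair = Tuple[int, int]  # (input, expected output)


def failing_pairs(program: Program, suite: List[IOPair]) -> List[IOPair]:
    """Run the candidate on every test and collect the ones it gets wrong."""
    failures = []
    for x, expected in suite:
        try:
            ok = program(x) == expected
        except Exception:
            ok = False  # a crash counts as a failed test
        if not ok:
            failures.append((x, expected))
    return failures


def repair_loop(
    candidate: Program,
    auditor: Callable[[Program, List[IOPair]], List[IOPair]],
    fixer: Callable[[Program, List[IOPair]], Program],
    tests: List[IOPair],
    max_rounds: int = 4,
) -> Tuple[Program, List[IOPair]]:
    """Alternate Auditor (new tests from reading the code) and Fixer (repair)."""
    suite = list(tests)
    for _ in range(max_rounds):
        # Auditor step: propose tests aimed at the current candidate.
        suite.extend(t for t in auditor(candidate, suite) if t not in suite)
        failures = failing_pairs(candidate, suite)
        if not failures:
            break  # nothing left to repair
        # Fixer step: repair using only the failing tests as feedback.
        candidate = fixer(candidate, failures)
    return candidate, suite


# Toy task: absolute value. The stubs below are hard-coded stand-ins for a model.
def toy_auditor(candidate: Program, suite: List[IOPair]) -> List[IOPair]:
    return [(-3, 3)] if candidate(-3) != 3 else []


def toy_fixer(candidate: Program, failures: List[IOPair]) -> Program:
    return lambda x: x if x >= 0 else -x


buggy: Program = lambda x: x  # wrong for negative inputs
repaired, suite = repair_loop(buggy, toy_auditor, toy_fixer, tests=[(2, 2)])
```

In the toy run the initial suite passes, so only the Auditor's targeted test exposes the bug; that is the motivation for letting the test generator read the candidate code.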

In practice

Implement a dedicated test generator that reads candidate code.
Design rewards to prevent regressions during code repair.
Use SFT to build execution reasoning before RL.
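One way to read the regression point above is as a reward that credits newly fixed tests and penalizes newly broken ones. The sketch below is an assumption for illustration only: the summary does not give FixAudit's actual reward, and `repair_reward` and `regression_penalty` are hypothetical names.

```python
from typing import List


def repair_reward(
    before: List[bool],
    after: List[bool],
    regression_penalty: float = 1.0,
) -> float:
    """Score a repair by tests fixed minus (weighted) tests broken, per test.

    before[i] / after[i] say whether test i passed before / after the repair.
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same tests")
    if not before:
        return 0.0
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    broken = sum(1 for b, a in zip(before, after) if b and not a)
    return (fixed - regression_penalty * broken) / len(before)


# A repair that fixes one test but breaks another earns nothing here.
neutral = repair_reward([True, False, False, True], [True, True, False, False])
# A repair that fixes both failing tests without regressions is rewarded.
clean = repair_reward([True, False, False, True], [True, True, True, True])
```

Raising `regression_penalty` above 1.0 would make a rewrite that trades one passing test for one fixed test strictly worse than leaving the code alone, which pushes toward incremental edits.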

Topics

Competitive Programming
Code Generation
Large Language Models
Reinforcement Learning
Program Repair
Test Generation

Code references

volcengine/verl

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.