ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
Summary
ZeroCoder is a label-free co-evolutionary framework that improves code and test generation by large language models without ground-truth supervision. It targets a bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR), the cost of human-curated unit tests, by jointly training a Coder and a Tester. The framework executes self-generated code against self-generated tests, records the outcomes in a passing matrix, and uses that matrix to identify the consensus solutions and tests from which rewards are derived. ZeroCoder adds rank-based pre-filtering to remove low-information problems and a curriculum-based tester objective that balances validity against mutation-driven discriminativeness. It also introduces DyB4, a Bayesian selector that dynamically recalibrates its priors using as few as 10 labeled instances to counter "selector drift." On Qwen2.5-Coder-7B-Instruct, ZeroCoder improves code generation by up to 14.5% in the label-free setting and by 21.6% with DyB4, and improves test generation by 24.3%, approaching oracle-supervised performance.
Key takeaway
Machine learning engineers who need to improve LLM code generation in label-scarce environments should consider co-evolutionary frameworks like ZeroCoder. Jointly training code and test generators on self-generated execution feedback boosts performance without extensive ground-truth supervision. Adding dynamic selector calibration, such as DyB4 with as few as 10 labeled instances, further improves robustness and yields results competitive with oracle-supervised training.
Key insights
Co-evolving code and test generation with self-generated execution feedback significantly improves LLM performance without ground-truth supervision.
Principles
- Jointly training code and test generators enables mutual improvement through interaction.
- Filtering training data by passing matrix rank enhances reward informativeness.
- Dynamically recalibrating selectors prevents performance degradation from "selector drift."
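To make the recalibration principle concrete, here is a minimal sketch of updating a prior from a handful of labeled instances. This summary does not describe DyB4's internals, so the code below is a generic Beta-Bernoulli conjugate update and not DyB4 itself. The names `BetaPrior` and `recalibrate`, and the choice to model "probability that a sampled solution is correct", are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BetaPrior:
    """Beta(alpha, beta) belief over the chance a sampled solution is correct.

    Hypothetical helper for illustration; not part of ZeroCoder or DyB4.
    """
    alpha: float = 1.0  # pseudo-count of correct outcomes
    beta: float = 1.0   # pseudo-count of incorrect outcomes

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def recalibrate(prior: BetaPrior, labeled_outcomes: Iterable[bool]) -> BetaPrior:
    """Fold a small batch of labeled outcomes into the prior (conjugate update)."""
    outcomes = list(labeled_outcomes)
    correct = sum(1 for ok in outcomes if ok)
    incorrect = len(outcomes) - correct
    return BetaPrior(prior.alpha + correct, prior.beta + incorrect)


# Ten labeled instances, three of which the current model solved correctly.
labeled = [True, False, False, True, False, False, False, True, False, False]
updated = recalibrate(BetaPrior(), labeled)
print(updated, round(updated.mean, 3))
```

Repeating such an update as training proceeds is one simple way a selector's prior could track a model whose accuracy changes over time, which is the failure mode the "selector drift" principle describes.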
Method
ZeroCoder samples candidate solutions and tests, executes the solutions against the tests to form a passing matrix, applies a selector to identify consensus subsets of solutions and tests, and derives role-specific rewards for the Coder and the Tester. Rank-based pre-filtering and a curriculum for tester training are layered on top of this loop.
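The loop described above can be sketched in a few lines. The selector here is an assumption: it groups solutions with identical pass patterns and picks the group maximizing (number of solutions × number of tests passed), a simple agreement heuristic standing in for whatever selector ZeroCoder actually uses. The binary rewards and all function names are likewise hypothetical.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

Solution = Callable[[int], int]
Test = Callable[[Solution], bool]


def run_test(test: Test, solution: Solution) -> int:
    """Return 1 if the solution passes the test, 0 on failure or exception."""
    try:
        return 1 if test(solution) else 0
    except Exception:
        return 0


def build_passing_matrix(solutions: Sequence[Solution],
                         tests: Sequence[Test]) -> List[List[int]]:
    """Entry [i][j] is 1 when solution i passes test j."""
    return [[run_test(t, s) for t in tests] for s in solutions]


def select_consensus(matrix: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Assumed selector: largest agreeing group, scored by |solutions| * |tests passed|."""
    groups = defaultdict(list)
    for i, row in enumerate(matrix):
        groups[tuple(row)].append(i)
    best_sols: List[int] = []
    best_tests: List[int] = []
    best_score = -1
    for pattern, sols in groups.items():
        passed = [j for j, v in enumerate(pattern) if v]
        score = len(sols) * len(passed)
        if score > best_score:  # ties keep the first group encountered
            best_sols, best_tests, best_score = sols, passed, score
    return best_sols, best_tests


def derive_rewards(matrix: List[List[int]]) -> Tuple[List[float], List[float]]:
    """Binary role-specific rewards: 1.0 for members of the consensus subsets."""
    sols, tests = select_consensus(matrix)
    n_tests = len(matrix[0]) if matrix else 0
    coder = [1.0 if i in set(sols) else 0.0 for i in range(len(matrix))]
    tester = [1.0 if j in set(tests) else 0.0 for j in range(n_tests)]
    return coder, tester


# Toy problem: absolute value. The third solution and third test are wrong.
solutions = [lambda x: abs(x), lambda x: x if x >= 0 else -x, lambda x: x]
tests = [lambda f: f(3) == 3, lambda f: f(-2) == 2, lambda f: f(-1) == -1]

matrix = build_passing_matrix(solutions, tests)
coder_rewards, tester_rewards = derive_rewards(matrix)
print(matrix, coder_rewards, tester_rewards)
```

In this toy case the two correct solutions agree on the two valid tests, so they form the consensus and the buggy solution and the incorrect test receive zero reward, all without any ground-truth label.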
In practice
- Filter training problems using passing matrix rank to ensure diverse interactions.
- Implement mutation-based rewards to foster discriminative test generation.
- Utilize a minimal labeled set (e.g., 10 instances) for dynamic selector recalibration.
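The first two practices can be illustrated with a small sketch. The rank threshold `min_rank`, the use of a consensus solution as the stand-in reference, and the scoring rule (fraction of mutants killed, zero for tests that fail the reference) are all assumptions made for illustration, not the paper's exact definitions.

```python
from fractions import Fraction
from typing import Callable, List, Sequence

Solution = Callable[[int], int]
Test = Callable[[Solution], bool]


def matrix_rank(matrix: List[List[int]]) -> int:
    """Rank over the rationals via Gauss-Jordan elimination (exact arithmetic)."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def keep_problem(matrix: List[List[int]], min_rank: int = 2) -> bool:
    """Assumed filter: drop problems whose passing matrix has too little variety."""
    return matrix_rank(matrix) >= min_rank


def passes(test: Test, solution: Solution) -> bool:
    try:
        return bool(test(solution))
    except Exception:
        return False


def mutation_score(test: Test, reference: Solution,
                   mutants: Sequence[Solution]) -> float:
    """Fraction of mutants the test rejects; 0.0 if the test fails the reference."""
    if not passes(test, reference) or not mutants:
        return 0.0
    killed = sum(1 for m in mutants if not passes(test, m))
    return killed / len(mutants)


# All solutions passing all tests carries little signal (rank 1); mixed outcomes carry more.
uniform = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
mixed = [[1, 1, 0], [1, 1, 0], [1, 0, 1]]
print(keep_problem(uniform), keep_problem(mixed))

reference = abs  # stand-in for a consensus solution
mutants = [lambda x: x, lambda x: -x, lambda x: abs(x) + 1]
strong = lambda f: f(-2) == 2
weak = lambda f: f(0) == 0
invalid = lambda f: f(-1) == -1
print(mutation_score(strong, reference, mutants),
      mutation_score(weak, reference, mutants),
      mutation_score(invalid, reference, mutants))
```

A tester reward built this way favors tests that are both valid (they accept the reference) and discriminative (they reject many mutants), which matches the validity-versus-discriminativeness balance the summary describes.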
Topics
- ZeroCoder
- Code Generation
- Test Generation
- LLM Reinforcement Learning
- Label-Free Training
- Dynamic Bayesian Selector
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.