ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?

2026-06-30 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

ZeroCoder is a label-free co-evolutionary framework that improves both code generation and test generation by large language models without ground-truth supervision. It targets a bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR), the cost of human-curated unit tests, by jointly training a Coder and a Tester. The two models' self-generated solutions and tests are executed against each other to form a passing matrix, from which a selector identifies consensus solutions and tests that are then used to derive rewards. ZeroCoder adds rank-based pre-filtering to remove low-information problems and a curriculum-based tester objective that balances validity against mutation-driven discriminativeness. It also introduces DyB4, a Bayesian selector that dynamically recalibrates its priors using as few as 10 labeled instances to counter "selector drift." On Qwen2.5-Coder-7B-Instruct, ZeroCoder improves code generation by up to 14.5% in the label-free setting and by up to 21.6% with DyB4, and improves test generation by 24.3%, approaching oracle-supervised performance.

Key takeaway

Machine learning engineers who want to improve LLM code generation in label-scarce settings should consider co-evolutionary frameworks like ZeroCoder. Jointly training code and test generators on self-generated execution feedback significantly boosts performance without extensive ground-truth supervision. Adding dynamic selector calibration, such as DyB4 with as few as 10 labeled instances, can further improve robustness and bring results close to oracle-supervised training.

Key insights

Co-evolving code and test generation with self-generated execution feedback significantly improves LLM performance without ground-truth supervision.

Principles

Jointly training code and test generators enables mutual improvement through interaction.
Filtering training data by passing matrix rank enhances reward informativeness.
Dynamically recalibrating selectors prevents performance degradation from "selector drift."
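
As a rough illustration of the rank-based filtering principle, the sketch below computes the rank of a 0/1 passing matrix (rows are sampled solutions, columns are sampled tests) and drops problems whose matrix has low rank. The function names, the `min_rank` threshold, and the choice to compute rank over the rationals are assumptions made for this example; the summary does not specify how ZeroCoder computes or thresholds the rank.

```python
from fractions import Fraction


def passing_matrix_rank(matrix):
    """Rank of a 0/1 passing matrix via Gaussian elimination over the rationals.

    Assumption: rank over the rationals; the source does not state the field.
    """
    rows = [[Fraction(int(v)) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry here.
        pivot = next(
            (r for r in range(rank, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(rows)):
            if rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


def keep_problem(matrix, min_rank=2):
    """Hypothetical filter: keep problems whose interactions are diverse enough."""
    return passing_matrix_rank(matrix) >= min_rank


# Every solution passes every test: all rows identical, so little signal.
all_pass = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
# Solutions disagree on which tests they pass: more informative.
mixed = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]

print(passing_matrix_rank(all_pass), keep_problem(all_pass))
print(passing_matrix_rank(mixed), keep_problem(mixed))
```

A matrix in which every row is identical has rank 1 (or 0 if nothing passes), so no test distinguishes one solution from another; that is one way to read "low-information" here.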

Method

ZeroCoder samples solutions and tests, executes them to form a passing matrix, applies a selector to identify consensus subsets, and derives role-specific rewards, incorporating rank-based pre-filtering and a curriculum for tester training.
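
The pipeline above can be sketched in a few lines, under stated assumptions. Here solutions and tests are plain Python source strings, the entry point `solve` is hypothetical, the selector is a simple agreement heuristic (group solutions by identical pass pattern, score each group by number of solutions times number of tests passed), and the rewards are binary membership in the consensus sets. None of these choices is confirmed by the summary as ZeroCoder's actual selector or reward; they only illustrate the shape of the loop.

```python
def passes(solution_src, test_src):
    """Run one test against one solution; True if no exception is raised.

    Illustration only: real pipelines should sandbox untrusted code.
    """
    env = {}
    try:
        exec(solution_src, env)
        exec(test_src, env)
        return True
    except Exception:
        return False


def build_passing_matrix(solutions, tests):
    """matrix[i][j] is True when solution i passes test j."""
    return [[passes(s, t) for t in tests] for s in solutions]


def select_consensus(matrix):
    """Assumed selector: pick the pass pattern shared by the 'largest' group.

    Score = (#solutions with this pattern) * (#tests the pattern passes).
    """
    if not matrix:
        return [], []
    groups = {}
    for i, row in enumerate(matrix):
        groups.setdefault(tuple(row), []).append(i)
    pattern, members = max(
        groups.items(), key=lambda kv: len(kv[1]) * sum(kv[0])
    )
    consensus_tests = [j for j, ok in enumerate(pattern) if ok]
    return members, consensus_tests


def role_rewards(matrix, n_tests):
    """Assumed binary rewards: 1.0 for members of the consensus sets."""
    sol_ids, test_ids = select_consensus(matrix)
    coder = [1.0 if i in sol_ids else 0.0 for i in range(len(matrix))]
    tester = [1.0 if j in test_ids else 0.0 for j in range(n_tests)]
    return coder, tester


solutions = [
    "def solve(x):\n    return x * 2",
    "def solve(x):\n    return x + x",
    "def solve(x):\n    return x ** 2",  # wrong, but agrees on solve(2)
]
tests = [
    "assert solve(2) == 4",
    "assert solve(3) == 6",
    "assert solve(3) == 9",  # only consistent with the wrong solution
]

matrix = build_passing_matrix(solutions, tests)
coder_rewards, tester_rewards = role_rewards(matrix, len(tests))
print(matrix)
print(coder_rewards, tester_rewards)
```

In this toy case the two doubling solutions agree on the first two tests and outvote the squaring solution, so they and the tests they pass receive the reward. The same example also shows why such a selector can drift: if most samples shared the same bug, the consensus would reward it.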

In practice

Filter training problems using passing matrix rank to ensure diverse interactions.
Implement mutation-based rewards to foster discriminative test generation.
Utilize a minimal labeled set (e.g., 10 instances) for dynamic selector recalibration.

Topics

ZeroCoder
Code Generation
Test Generation
LLM Reinforcement Learning
Label-Free Training
Dynamic Bayesian Selector

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.