When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs
Summary
Recursive self-training can degrade neural generative models when AI-generated data is reused without fresh human input or external quality control. This study investigates this risk in code LLMs, where AI-generated code can enter repositories and become future training data. The research compares three recursive fine-tuning regimes: no review, Human-gate review (using model-independent filters like compilation and static checks), and AI-self-gate review (using the LLM's own signals like perplexity). Experiments across SantaCoder (1.1B parameters), StarCoder2, Qwen2.5-Coder, and Code Llama on benchmarks like HumanEval and MBPP show that "no review" collapses fastest. "Human-gate" filters slow collapse but do not stop it, while "AI-self-gate" filters initially appear strong but lose their effectiveness, leading to a "rubber-stamp regime" where acceptance scores rise as benchmark correctness falls. The findings suggest stable recursive code LLM training requires exogenous verification.
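The "rubber-stamp regime" described above can be watched for by tracking two series per generation. The sketch below is a minimal illustration, not the study's method: the function name `rubber_stamp_generations` is hypothetical, and the numbers are made up for the example rather than taken from the paper.

```python
from typing import List


def rubber_stamp_generations(acceptance: List[float], pass_rate: List[float]) -> List[int]:
    """Return generation indices where the gate's acceptance rate rose
    while benchmark correctness fell, relative to the previous generation."""
    if len(acceptance) != len(pass_rate):
        raise ValueError("both series must cover the same generations")
    flagged = []
    for g in range(1, len(acceptance)):
        accepted_more = acceptance[g] > acceptance[g - 1]
        less_correct = pass_rate[g] < pass_rate[g - 1]
        if accepted_more and less_correct:
            flagged.append(g)
    return flagged


# Illustrative values only (assumption): NOT results from the study.
acceptance = [0.60, 0.58, 0.66, 0.75, 0.83]  # share of samples the gate accepted
pass_rate = [0.30, 0.31, 0.27, 0.22, 0.15]   # benchmark correctness per generation

flagged = rubber_stamp_generations(acceptance, pass_rate)
print(flagged)  # [2, 3, 4]
```

A run of consecutive flagged generations is the pattern to worry about. A single flagged generation may be noise.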
Key takeaway
MLOps engineers deploying code LLMs in production or designing self-improving systems should integrate robust, model-independent verification. In the study, recursive training on unreviewed or self-reviewed code degraded model performance, and self-review drifted into a "rubber-stamp" acceptance of poor-quality code. Prioritize external quality gates such as compilation and static analysis to maintain code correctness, bearing in mind that in the experiments these gates slowed collapse but did not stop it.
Key insights
AI self-review fails to prevent recursive self-training collapse in code LLMs; the findings suggest that stable recursive training requires exogenous verification.
Principles
- Self-training without external quality control degrades models.
- Exogenous gates slow, but may not halt, collapse.
- Endogenous AI self-gates can become self-confirming.
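To make the endogenous case concrete, here is a minimal sketch of a perplexity-based self-gate. It assumes the per-token log-probabilities come from the same model that generated the sample. The function names and the threshold of 3.0 are hypothetical choices for illustration, not values from the study.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_gate(token_logprobs: Sequence[float], max_perplexity: float = 3.0) -> bool:
    """Accept a sample when the model itself finds it unsurprising.

    The threshold is an arbitrary illustrative value (assumption).
    """
    return perplexity(token_logprobs) <= max_perplexity


# Hypothetical log-probabilities: one sample the model is confident about,
# one it is unsure about.
confident = [math.log(0.9)] * 5
unsure = [math.log(0.1)] * 5

print(self_gate(confident), self_gate(unsure))
```

The gate measures only how expected a sample is to the model, not whether the code is correct. A model tuned on its own accepted output can come to find that output less surprising, which is one way such a gate can become self-confirming.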
Method
The study compares three recursive fine-tuning regimes: ungated, Human-gate (compilation, static checks), and AI-self-gate (perplexity, binary self-scoring) across multiple code LLMs and benchmarks.
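The regimes differ only in which gate sits between generation and fine-tuning, so the loop can be sketched with a pluggable gate. This is a toy illustration under stated assumptions: `Model`, `toy_generate`, and `toy_fine_tune` are hypothetical stand-ins (no real LLM is involved), and Python's built-in `compile()` stands in for the compilation check. An AI-self-gate would be a gate that consults `model`, which is why the gate receives it.

```python
from typing import Callable, List, Tuple

Model = int  # toy stand-in (assumption): count of samples the "model" was tuned on
Gate = Callable[[Model, str], bool]


def ungated(model: Model, sample: str) -> bool:
    """No review: every generated sample becomes training data."""
    return True


def compile_gate(model: Model, sample: str) -> bool:
    """Model-independent gate: keep only samples that compile as Python."""
    try:
        compile(sample, "<sample>", "exec")  # compiles without executing
    except (SyntaxError, ValueError):
        return False
    return True


def run_regime(
    generate: Callable[[Model], List[str]],
    fine_tune: Callable[[Model, List[str]], Model],
    gate: Gate,
    model: Model,
    generations: int,
) -> Tuple[Model, List[float]]:
    """Generate, filter with the gate, tune on the survivors, repeat."""
    acceptance = []
    for _ in range(generations):
        samples = generate(model)
        kept = [s for s in samples if gate(model, s)]
        acceptance.append(len(kept) / len(samples) if samples else 0.0)
        model = fine_tune(model, kept)
    return model, acceptance


def toy_generate(model: Model) -> List[str]:
    # Hypothetical fixed outputs: two valid snippets, two syntax errors.
    return ["x = 1", "def f(:", "y = 2 +", "print(x)"]


def toy_fine_tune(model: Model, kept: List[str]) -> Model:
    return model + len(kept)


print(run_regime(toy_generate, toy_fine_tune, ungated, 0, 2))
print(run_regime(toy_generate, toy_fine_tune, compile_gate, 0, 2))
```

Recording the acceptance rate per generation alongside an external benchmark score is what makes drift in a gate visible over time.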
In practice
- Implement compilation checks for code LLM outputs.
- Avoid using the same LLM for code generation and review.
- Prioritize external, model-independent quality signals.
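As one possible reading of the points above, the sketch below combines a syntax check with a simple static check on generated Python, using only the standard-library `ast` module. The function name `passes_external_gate` and the specific static rule (the sample must define a function with an expected name) are assumptions chosen for illustration. A production gate would likely add linters, type checkers, or tests.

```python
import ast


def passes_external_gate(source: str, required_function: str) -> bool:
    """Model-independent check: the source must parse, and it must define
    a function named `required_function` (an illustrative static rule)."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name == required_function
        for node in ast.walk(tree)
    )


good = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b)\n    return a + b\n"       # missing colon
misnamed = "def plus(a, b):\n    return a + b\n"  # parses, wrong function name

print(passes_external_gate(good, "add"))
print(passes_external_gate(broken, "add"))
print(passes_external_gate(misnamed, "add"))
```

Neither check consults the generating model, so the gate's verdict cannot drift along with the model's own preferences.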
Topics
- Code LLMs
- Recursive Self-Training
- Model Collapse
- AI Code Review
- Exogenous Verification
- HumanEval Benchmark
Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.