When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs

2026-05-07 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Recursive self-training can degrade neural generative models when AI-generated data is reused without fresh human input or external quality control. This study investigates this risk in code LLMs, where AI-generated code can enter repositories and become future training data. The research compares three recursive fine-tuning regimes: no review, Human-gate review (using model-independent filters like compilation and static checks), and AI-self-gate review (using the LLM's own signals like perplexity). Experiments across SantaCoder (1.1B parameters), StarCoder2, Qwen2.5-Coder, and Code Llama on benchmarks like HumanEval and MBPP show that "no review" collapses fastest. "Human-gate" filters slow collapse but do not stop it, while "AI-self-gate" filters initially appear strong but lose their effectiveness, leading to a "rubber-stamp regime" where acceptance scores rise as benchmark correctness falls. The findings suggest stable recursive code LLM training requires exogenous verification.

Key takeaway

MLOps engineers deploying code LLMs in production or designing self-improving systems should build in robust, model-independent verification. In the study, recursive training on unreviewed or self-reviewed code degraded model performance, and self-review drifted into a "rubber-stamp" regime that accepts poor-quality code. External quality gates such as compilation and static analysis slowed the collapse but did not stop it, so treat them as a baseline for maintaining code correctness rather than a complete safeguard.

Key insights

Self-review by the same code LLM does not prevent recursive self-training collapse; stable recursive training requires exogenous verification.

Principles

Self-training without external quality control degrades models.
Exogenous gates slow collapse but do not stop it.
Endogenous AI self-gates can become self-confirming.
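
One way to watch for the "rubber-stamp regime" described in the summary is to compare, generation by generation, the gate's acceptance rate with an external benchmark pass rate. The sketch below is illustrative only: the function name `rubber_stamp_generations` is hypothetical, and the numbers are made up for the example, not results from the study.

```python
from typing import List


def rubber_stamp_generations(acceptance: List[float], pass_rate: List[float]) -> List[int]:
    """Return generation indices where the gate's acceptance rate rose
    while the benchmark pass rate fell, relative to the previous generation."""
    if len(acceptance) != len(pass_rate):
        raise ValueError("both series must cover the same generations")
    flagged = []
    for g in range(1, len(acceptance)):
        gate_more_lenient = acceptance[g] > acceptance[g - 1]
        correctness_dropped = pass_rate[g] < pass_rate[g - 1]
        if gate_more_lenient and correctness_dropped:
            flagged.append(g)
    return flagged


# Made-up illustrative series (NOT data from the paper).
acceptance = [0.60, 0.58, 0.66, 0.74]  # fraction of samples the gate accepts
pass_rate = [0.30, 0.31, 0.27, 0.22]   # fraction of benchmark tasks solved
print(rubber_stamp_generations(acceptance, pass_rate))  # [2, 3]
```

The check only works if the pass rate comes from a signal the model cannot influence, which is the point of the exogenous-verification principle.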

Method

The study compares three recursive fine-tuning regimes: ungated, Human-gate (compilation, static checks), and AI-self-gate (perplexity, binary self-scoring) across multiple code LLMs and benchmarks.
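
To make the endogenous gate concrete, here is a minimal sketch of a perplexity-based filter. It assumes per-token natural-log probabilities are available from the generating model; the threshold value, the accept-if-low direction, and the function names are assumptions for illustration, not the study's actual configuration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean(log p(token))) over the tokens of one sample."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_gate_accepts(token_logprobs: Sequence[float], max_perplexity: float = 2.0) -> bool:
    """Accept a sample when the model finds its own output unsurprising.

    The log-probabilities come from the same model that is being retrained,
    so this signal is endogenous: it measures confidence, not correctness.
    max_perplexity is a hypothetical threshold.
    """
    return perplexity(token_logprobs) <= max_perplexity


confident = [-0.1, -0.2, -0.3]   # mean log-prob -0.2, perplexity about 1.22
uncertain = [-1.5, -2.0, -1.0]   # mean log-prob -1.5, perplexity about 4.48
print(self_gate_accepts(confident), self_gate_accepts(uncertain))  # True False
```

Because the score depends only on the model's own distribution, a model that grows more confident in flawed code will pass more of it, which is consistent with the self-confirming behavior the study reports.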

In practice

Implement compilation checks for code LLM outputs.
Avoid using the same LLM for code generation and review.
Prioritize external, model-independent quality signals.
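
A compilation check can be as small as the sketch below, which assumes the generated samples are Python source and uses the built-in `compile()` to test syntax without executing anything. The helper names `compiles` and `compile_gate` are hypothetical; a production gate would likely add static analysis and tests, since a syntax check alone is a weak filter.

```python
from typing import Iterable, List


def compiles(source: str) -> bool:
    """Return True if the source parses and compiles as a Python module.

    compile() only builds a code object; it does not run the candidate code.
    """
    try:
        compile(source, "<candidate>", "exec")
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on older Python versions.
        return False
    return True


def compile_gate(samples: Iterable[str]) -> List[str]:
    """Keep only samples that pass the model-independent compile check."""
    return [s for s in samples if compiles(s)]


candidates = [
    "def add(a, b):\n    return a + b\n",  # valid
    "def add(a, b) return a + b\n",        # missing colon
]
print(len(compile_gate(candidates)))  # 1
```

The decision here never consults the model that produced the code, so the filter stays fixed even as the model drifts across training rounds.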

Topics

Code LLMs
Recursive Self-Training
Model Collapse
AI Code Review
Exogenous Verification
HumanEval Benchmark

Code references

Hik289/code-retraining

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.