When AI Reviews Its Own Code: Recursive Self-Training Collapse in Code LLMs

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

Recursive self-training can degrade neural generative models when AI-generated data is reused without fresh human input or external quality control. This study investigates this risk in code LLMs, where AI-generated code can enter repositories and become future training data. The research compares three recursive fine-tuning regimes: no review, Human-gate review (using model-independent filters like compilation and static checks), and AI-self-gate review (using the LLM's own signals like perplexity). Experiments across SantaCoder (1.1B parameters), StarCoder2, Qwen2.5-Coder, and Code Llama on benchmarks like HumanEval and MBPP show that "no review" collapses fastest. "Human-gate" filters slow collapse but do not stop it, while "AI-self-gate" filters initially appear strong but lose their effectiveness, leading to a "rubber-stamp regime" where acceptance scores rise as benchmark correctness falls. The findings suggest stable recursive code LLM training requires exogenous verification.

Key takeaway

MLOps engineers who deploy code LLMs in production or design self-improving systems should build model-independent verification into any recursive training loop. In the study, training on unreviewed output collapsed fastest, and AI self-review drifted into a "rubber-stamp" regime in which acceptance scores rose while benchmark correctness fell. External gates such as compilation and static checks slowed collapse but did not stop it, so treat them as a necessary baseline rather than a complete safeguard.
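
One way to watch for the rubber-stamp pattern is to log, for each training generation, the gate's acceptance rate alongside an external benchmark pass rate, and flag generations where the two move in opposite directions. The sketch below is a minimal illustration under that assumption. The function name and the simple "acceptance up, pass rate down" rule are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def rubber_stamp_generations(
    acceptance: Sequence[float],
    pass_rate: Sequence[float],
) -> List[int]:
    """Return indices of generations that look like rubber-stamping.

    acceptance[g] is the share of generated samples the gate accepted in
    generation g; pass_rate[g] is an external benchmark score for the model
    trained in generation g. A generation is flagged when acceptance rose
    and the benchmark score fell relative to the previous generation.
    (Hypothetical rule, chosen only for illustration.)
    """
    if len(acceptance) != len(pass_rate):
        raise ValueError("both series need one value per generation")

    flagged = []
    for g in range(1, len(acceptance)):
        acceptance_rose = acceptance[g] > acceptance[g - 1]
        correctness_fell = pass_rate[g] < pass_rate[g - 1]
        if acceptance_rose and correctness_fell:
            flagged.append(g)
    return flagged
```

The benchmark score has to come from a check the model cannot influence, such as executing unit tests. Comparing the gate against another model-derived signal would likely miss the very divergence this is meant to catch.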

Key insights

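AI self-review does not prevent recursive self-training collapse in code LLMs: its filters look strong at first but lose effectiveness over generations. The findings suggest that stable recursive training requires exogenous verification.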

Principles

Method

The study compares three recursive fine-tuning regimes: ungated, Human-gate (compilation, static checks), and AI-self-gate (perplexity, binary self-scoring) across multiple code LLMs and benchmarks.
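
The three regimes differ only in the filter applied to generated samples before they are fed back into fine-tuning. The Python sketch below is a rough illustration of that difference, not the study's implementation. It assumes the generated code is Python, uses `ast.parse` as a stand-in for the compilation and static checks, and takes a hypothetical `token_logprobs` callable that returns the model's per-token log-probabilities for a sample.

```python
import ast
import math
from typing import Callable, List, Sequence

Gate = Callable[[str], bool]


def ungated(sample: str) -> bool:
    """No review: every generated sample is kept."""
    return True


def human_gate(sample: str) -> bool:
    """Model-independent check: keep the sample only if it parses as Python.

    Stand-in for compilation and static checks (assumption of this sketch).
    """
    try:
        ast.parse(sample)
    except (SyntaxError, ValueError):
        return False
    return True


def make_self_gate(
    token_logprobs: Callable[[str], Sequence[float]],  # hypothetical scorer
    max_perplexity: float,
) -> Gate:
    """Build an AI-self-gate that keeps samples the model finds unsurprising."""

    def gate(sample: str) -> bool:
        logprobs = token_logprobs(sample)
        if not logprobs:
            return False
        # Perplexity is exp of the mean negative log-probability per token.
        perplexity = math.exp(-sum(logprobs) / len(logprobs))
        return perplexity <= max_perplexity

    return gate


def filter_generation(samples: Sequence[str], gate: Gate) -> List[str]:
    """Return the samples that would be kept for the next fine-tuning round."""
    return [s for s in samples if gate(s)]
```

The self-gate never inspects the code itself, only the model's confidence in it. A model that has drifted will score its own drifted output as low-perplexity and accept it, which is consistent with the rubber-stamp behaviour described in the summary.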

Best for: Research Scientist, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.