Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Expert, extended

Summary

The study examines self-repair feedback in small frozen code models (0.5B–1.5B qwen2.5-coder and deepseek-coder) on 290 "dead" HumanEval+ and MBPP+ units. It compares five regeneration arms under a matched output-generation budget. Blind resampling beat retrying with the bare failing code by +18 net unlocks (discordant 25/7, Holm-adjusted p=0.0021). The code-plus-facts packet also recovered +18 net over bare code (discordant 21/3, Holm-adjusted p=0.00042) and held a +15 advantage over a shape-matched placebo (Holm-adjusted p=0.0041), which indicates that the executed fact content itself carries value. At the pooled level, however, code-plus-facts and blind resampling tied at 26 unlocks each, with a symmetric 20/20 discordant split. This suggests that feedback mainly offset the deficit introduced by bare code rather than surpassing fresh sampling. The main run produced 7,000 fresh generations and a follow-up produced 1,400, in line with the study's falsification-centered measurement approach.
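The discordant counts quoted above are the inputs to a paired test. The following is a minimal sketch of one common way to analyze such counts: an exact McNemar (sign) test followed by a Holm step-down adjustment. It is an illustration, not the study's procedure. The source does not state the test's sidedness or the size of the comparison family, so the function names, the two-sided choice, and the family passed to `holm_adjust` are all assumptions, and the results are not expected to reproduce the reported adjusted p-values.

```python
from math import comb


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant counts b and c.

    Under the null hypothesis each discordant pair is equally likely to
    favor either arm, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm_adjust(pvals: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Discordant counts quoted in the summary; the family below is an
    # illustrative assumption, not the study's registered family.
    comparisons = {
        "blind resampling vs bare code": (25, 7),
        "code-plus-facts vs bare code": (21, 3),
        "code-plus-facts vs blind resampling": (20, 20),
    }
    raw = [exact_mcnemar_p(b, c) for b, c in comparisons.values()]
    for name, p_raw, p_adj in zip(comparisons, raw, holm_adjust(raw)):
        print(f"{name}: raw p = {p_raw:.5f}, Holm-adjusted p = {p_adj:.5f}")
```

Only the discordant pairs enter the test. Units that both arms unlock, or that neither unlocks, carry no information about which arm is better, which is why a 20/20 split yields no evidence of a difference even though each arm reaches 26 unlocks.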

Key takeaway

Machine learning engineers designing self-repair mechanisms for small frozen code models should prioritize execution-grounded counterexamples over simply re-exposing the model to its bare failing code. Benchmark any feedback design against blind resampling under an equal output-generation budget, because observed gains often only recover a deficit rather than surpass fresh sampling. Do not assume that bare failing code is a reliable feedback input.

Key insights

The value of self-repair feedback for small frozen code models lies in external, execution-grounded criticism, not mere re-exposure to failing code.

Method

The study used a five-arm, placebo-controlled decomposition comparing bare code, blind resampling, executed facts, code-plus-facts, and shape-matched placebo on dead units with matched output-generation budgets and fresh-generation confirmation.
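As a rough illustration of how two arms can be compared on the same units, the sketch below tallies paired outcomes into discordant counts and a net-unlock figure. The data layout (a mapping from unit ID to whether the arm unlocked that unit), the function name `paired_counts`, and the toy unit IDs are hypothetical and are not taken from the study.

```python
def paired_counts(arm_a: dict[str, bool], arm_b: dict[str, bool]) -> dict[str, int]:
    """Tally paired outcomes for two arms evaluated on the same units.

    Each argument maps a unit ID to True if that arm unlocked the unit.
    Only units present in both mappings are compared.
    """
    shared = arm_a.keys() & arm_b.keys()
    a_only = sum(1 for u in shared if arm_a[u] and not arm_b[u])
    b_only = sum(1 for u in shared if arm_b[u] and not arm_a[u])
    both = sum(1 for u in shared if arm_a[u] and arm_b[u])
    return {
        "a_only": a_only,        # discordant pairs favoring arm A
        "b_only": b_only,        # discordant pairs favoring arm B
        "both": both,            # concordant unlocks
        "net": a_only - b_only,  # net unlocks of A over B
    }


if __name__ == "__main__":
    # Toy outcomes on five hypothetical units, for illustration only.
    code_plus_facts = {"u1": True, "u2": True, "u3": False, "u4": False, "u5": True}
    blind_resample = {"u1": True, "u2": False, "u3": True, "u4": False, "u5": False}
    print(paired_counts(code_plus_facts, blind_resample))
```

Applied to the pooled figures in the summary, this bookkeeping shows why a tie in totals can hide disagreement on individual units: 26 unlocks per arm with 20 discordant pairs in each direction implies that only 6 units were unlocked by both arms.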

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.