Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill
Summary
A pre-registered, two-tier ablation study investigated a "Popperian falsificationist" prompt skill designed to improve LLM code generation. The study, using Claude Sonnet 4.6 (N=163) and Qwen2.5-Coder-0.5B (N=164) on HumanEval+ with an execution oracle, found that the skill's specific Popperian procedural content offered no separable correctness benefit beyond a labels-only structural scaffold. On the frontier model, all conditions (vanilla, full skill, labels-only, placebo) performed near the 95.1% benchmark ceiling, showing no significant difference. For the small model, structured prompts lifted best-of-eight correctness by 20–22 points, but the full Popperian skill matched the labels-only scaffold. Furthermore, a 0.5B self-judge applying the Popperian rubric did not outperform random selection, exhibiting patterns consistent with position bias. The findings attribute measured gains to prompt structure, not the Popperian content, and highlight LLM-as-a-judge unreliability.
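The two selection metrics mentioned above can be made concrete with a small sketch. It assumes each task has a list of per-sample oracle outcomes (True when a sample passes all tests); the function names and the toy data are illustrative and are not taken from the study's code.

```python
from statistics import mean

def best_of_k(results):
    """Fraction of tasks where at least one of the k samples passes the oracle."""
    return mean(1.0 if any(task) else 0.0 for task in results)

def random_pick_accuracy(results):
    """Expected accuracy when one sample per task is chosen uniformly at random."""
    return mean(sum(task) / len(task) for task in results)

# Invented oracle outcomes for three tasks with four samples each (not study data).
toy_results = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, True],
]

if __name__ == "__main__":
    print("best-of-k:", best_of_k(toy_results))
    print("random pick:", random_pick_accuracy(toy_results))
```

A judge that selects one sample per task only adds value if its accuracy exceeds `random_pick_accuracy`; `best_of_k` is the upper bound any selector could reach on the same samples.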
Key takeaway
Machine learning engineers evaluating prompt skills for code generation should prioritize rigorous, oracle-based evaluation over LLM-as-a-judge metrics. This study found that complex "Popperian" procedural content in prompts offers no separable correctness benefit beyond basic structural scaffolding. Focus prompt engineering effort on clear structural elements, and always use execution-based correctness checks and length-matched placebos to avoid misattributing gains to specific vocabulary or to biased LLM evaluations.
Key insights
Prompt scaffold structure, not specific Popperian content, accounts for the measured correctness gains; the small-model self-judge did no better than random selection and showed patterns consistent with position bias.
Principles
- Prompt structure, not wording, primarily improves LLM code generation.
- LLM-as-a-judge is unreliable due to positional, self-preference, and stylistic biases.
- Low-capability models provide sensitive testbeds for detecting genuine prompt effects.
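One simple way to look for the positional bias named in these principles is to tally which candidate slot a judge picks and compare the tallies with a uniform distribution. The sketch below computes a Pearson chi-square statistic in plain Python; the function name and the input format (a list of zero-based picked positions) are assumptions for illustration, not the study's analysis code.

```python
from collections import Counter

def position_chi_square(picks, k):
    """Pearson chi-square statistic of picked positions against a uniform
    distribution over k slots. Larger values mean the judge favors some slots."""
    if not picks or k < 2:
        raise ValueError("need at least one pick and at least two slots")
    counts = Counter(picks)
    expected = len(picks) / k
    return sum((counts.get(i, 0) - expected) ** 2 / expected for i in range(k))

if __name__ == "__main__":
    # Invented picks: a judge that always chooses the first of four candidates.
    print(position_chi_square([0] * 8, k=4))
    # Invented picks: a judge whose choices are spread evenly.
    print(position_chi_square([0, 1, 2, 3, 0, 1, 2, 3], k=4))
```

The statistic would be compared against a chi-square distribution with k - 1 degrees of freedom. Shuffling candidate order before each judging call is what makes slot position independent of candidate quality, so that a skewed tally can be read as bias.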
Method
A disambiguation protocol combines a labels-only scaffold, length-matched placebo, execution oracle, and vocabulary-halo sentinel within a pre-registered, two-tier ablation design to isolate prompt skill effects.
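Two of the controls in this protocol can be sketched in a few lines. The example assumes a skill prompt whose section labels are lines ending in a colon and measures length in whitespace-separated words; both are simplifying assumptions, and the sample skill text is invented rather than quoted from the actual skill.

```python
from itertools import cycle, islice

def labels_only(skill_text):
    """Keep only the section labels (lines ending in ':'), dropping the procedural body."""
    return "\n".join(
        line for line in skill_text.splitlines() if line.rstrip().endswith(":")
    )

def length_matched_placebo(skill_text, filler):
    """Build neutral text with the same word count as the skill prompt."""
    n_words = len(skill_text.split())
    return " ".join(islice(cycle(filler.split()), n_words))

# Invented stand-in for a skill prompt (hypothetical content).
SKILL = (
    "Conjecture:\n"
    "State the boldest hypothesis about what the function must do.\n"
    "Refutation:\n"
    "List edge cases that could prove the hypothesis wrong.\n"
)
FILLER = "Please read the task carefully and write clean code."

if __name__ == "__main__":
    print(labels_only(SKILL))
    print(length_matched_placebo(SKILL, FILLER))
```

Comparing the full skill against `labels_only` tests whether the procedural content matters beyond the structure, while the placebo checks whether extra prompt length alone explains a gain. A token-level length match using the target model's tokenizer would be a tighter control than word counts.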
In practice
- Prioritize execution oracles over LLM-as-a-judge for evaluating code correctness.
- Implement length-matched placebos as default controls in prompt engineering studies.
- Screen candidate prompt components on small, low-baseline models first.
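The first recommendation above can be illustrated with a minimal execution oracle: run the generated code together with its tests in a fresh interpreter and treat a clean exit as a pass. This is a sketch under stated assumptions, not the EvalPlus harness; the function name, timeout, and sample snippets are hypothetical, and the subprocess is not sandboxed, so untrusted model output should be run in an isolated environment.

```python
import subprocess
import sys

def passes_oracle(candidate_src, test_src, timeout_s=10.0):
    """Return True if the candidate plus its assert-based tests exit with status 0."""
    program = candidate_src + "\n\n" + test_src
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Treat hangs and infinite loops as failures.
        return False
    return proc.returncode == 0

# Invented candidate solutions and tests for illustration.
GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

if __name__ == "__main__":
    print(passes_oracle(GOOD, TESTS))
    print(passes_oracle(BAD, TESTS))
```

Because the verdict depends only on test outcomes, it is unaffected by the vocabulary or style of the generated code, which is the property that makes it a safer default than a judge model.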
Topics
- Large Language Models
- Code Generation
- Prompt Engineering
- LLM-as-a-Judge
- Ablation Study
- Reproducibility
- HumanEval+
Code references
- openai/human-eval
- evalplus/evalplus
- PhiniteLab/popperian-coding-skill
- annaneuUDE/PositionIsPower
- msclar/formatspread
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.