Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

2026-06-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Expert, extended

Summary

A pre-registered, two-tier ablation study investigated a "Popperian falsificationist" prompt skill designed to improve LLM code generation. The study, using Claude Sonnet 4.6 (N=163) and Qwen2.5-Coder-0.5B (N=164) on HumanEval+ with an execution oracle, found that the skill's specific Popperian procedural content offered no separable correctness benefit beyond a labels-only structural scaffold. On the frontier model, all conditions (vanilla, full skill, labels-only, placebo) performed near the 95.1% benchmark ceiling, showing no significant difference. For the small model, structured prompts lifted best-of-eight correctness by 20–22 points, but the full Popperian skill matched the labels-only scaffold. Furthermore, a 0.5B self-judge applying the Popperian rubric did not outperform random selection, exhibiting patterns consistent with position bias. The findings attribute measured gains to prompt structure, not the Popperian content, and highlight LLM-as-a-judge unreliability.
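The comparison between best-of-eight correctness, random selection, and judge selection can be made concrete with a small sketch. It assumes each task has a list of per-candidate pass/fail results from the execution oracle plus the index the judge picked; the function name `selection_rates` and the data layout are illustrative assumptions, not the study's code.

```python
def selection_rates(oracle_results, judge_picks):
    """Compare three ways of scoring k sampled candidates per task.

    oracle_results: list of lists of bool, one inner list per task,
                    True where the execution oracle accepted that candidate.
    judge_picks:    list of int, the candidate index the judge chose per task.
    (Both layouts are assumptions made for this illustration.)
    """
    n = len(oracle_results)
    # Best-of-k: the task counts as solved if any candidate passes.
    best_of_k = sum(1 for r in oracle_results if any(r)) / n
    # Random selection: expected pass rate of a uniformly chosen candidate.
    random_pick = sum(sum(r) / len(r) for r in oracle_results) / n
    # Judge selection: pass rate of the candidate the judge actually chose.
    judge_pick = sum(1 for r, i in zip(oracle_results, judge_picks) if r[i]) / n
    return best_of_k, random_pick, judge_pick
```

A judge adds value only when its rate sits above the random-selection rate; best-of-k is the upper bound a perfect selector could reach.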

Key takeaway

If you are a Machine Learning Engineer evaluating prompt skills for code generation, prioritize rigorous, oracle-based evaluation over LLM-as-a-judge metrics. In this study, the "Popperian" procedural content offered no separable correctness benefit beyond a labels-only structural scaffold. Focus prompt engineering effort on clear structural elements, and use execution-based correctness checks and length-matched placebos so that gains are not misattributed to specific vocabulary or to biased LLM evaluations.

Key insights

Prompt scaffold structure, not the specific Popperian content, accounts for the measured correctness gains. The 0.5B self-judge selected no better than random, with patterns consistent with position bias.

Principles

Prompt structure, not wording, primarily improves LLM code generation.
LLM-as-a-judge is unreliable due to positional, self-preference, and stylistic biases.
Low-capability models provide sensitive testbeds for detecting genuine prompt effects.
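One simple way to look for positional bias in a judge is to tally which candidate slot it picks and compare the tally with a uniform distribution. The sketch below is an illustration only: the summary does not say how the study assessed position bias, and the helper names are hypothetical. It is meaningful only if candidate order was shuffled independently of correctness.

```python
from collections import Counter


def position_histogram(judge_picks, k):
    """Count how often the judge chose each of the k candidate positions."""
    counts = Counter(judge_picks)
    return [counts.get(i, 0) for i in range(k)]


def chi_square_uniform(hist):
    """Pearson chi-square statistic against a uniform choice over positions.

    Larger values mean the picks are less evenly spread; compare against a
    chi-square distribution with len(hist) - 1 degrees of freedom.
    """
    total = sum(hist)
    expected = total / len(hist)
    return sum((c - expected) ** 2 / expected for c in hist)
```

A judge that favors the first slot regardless of content will show a histogram skewed toward index 0 and a large statistic.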

Method

A disambiguation protocol combines a labels-only scaffold, length-matched placebo, execution oracle, and vocabulary-halo sentinel within a pre-registered, two-tier ablation design to isolate prompt skill effects.
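The two control conditions can be sketched as simple text transformations of the full skill prompt. The summary does not describe how the study built them, so everything here is an assumption for illustration: section labels are taken to be lines ending in a colon, length is matched by whitespace-separated words rather than model tokens, and both function names are hypothetical.

```python
import itertools


def labels_only(skill_text):
    """Keep only the lines that look like section labels.

    Assumption: a label is any line ending with ':'; the real skill may
    mark its structure differently.
    """
    return "\n".join(
        line for line in skill_text.splitlines() if line.rstrip().endswith(":")
    )


def length_matched_placebo(skill_text, filler_words):
    """Build neutral filler text with the same word count as the skill.

    Assumption: word count stands in for token count; a real study would
    more likely match length with the model's own tokenizer.
    """
    n_words = len(skill_text.split())
    return " ".join(itertools.islice(itertools.cycle(filler_words), n_words))
```

Comparing the full skill against `labels_only` isolates the procedural content, while the placebo controls for the effect of simply adding more prompt text.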

In practice

Prioritize execution oracles over LLM-as-a-judge for evaluating code correctness.
Implement length-matched placebos as default controls in prompt engineering studies.
Screen candidate prompt components on small, low-baseline models first.
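An execution oracle, in its simplest form, runs each candidate together with its test assertions and treats a clean exit as a pass. The following is a minimal sketch under stated assumptions: the function name `passes_oracle` and the timeout value are hypothetical, and the code provides no sandboxing, which any real harness running model-generated code should add.

```python
import os
import subprocess
import sys
import tempfile


def passes_oracle(candidate_src, test_src, timeout_s=10.0):
    """Return True if candidate code plus its tests exits with status 0.

    WARNING: this executes untrusted code directly; it is an illustration,
    not an isolated or production-safe harness.
    """
    program = candidate_src + "\n\n" + test_src + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(program)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Non-terminating candidates count as failures.
            return False
        # A failed assertion or any uncaught exception gives a non-zero exit.
        return proc.returncode == 0
```

Because the verdict comes from running the code, it does not depend on candidate order, wording, or style, which is what makes it a useful reference point for checking a judge.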

Topics

Large Language Models
Code Generation
Prompt Engineering
LLM-as-a-Judge
Ablation Study
Reproducibility
HumanEval+

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.