Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Expert, extended

Summary

A pre-registered, two-tier ablation study investigated a "Popperian falsificationist" prompt skill designed to improve LLM code generation. The study, using Claude Sonnet 4.6 (N=163) and Qwen2.5-Coder-0.5B (N=164) on HumanEval+ with an execution oracle, found that the skill's specific Popperian procedural content offered no separable correctness benefit beyond a labels-only structural scaffold. On the frontier model, all conditions (vanilla, full skill, labels-only, placebo) performed near the 95.1% benchmark ceiling, showing no significant difference. For the small model, structured prompts lifted best-of-eight correctness by 20–22 points, but the full Popperian skill matched the labels-only scaffold. Furthermore, a 0.5B self-judge applying the Popperian rubric did not outperform random selection, exhibiting patterns consistent with position bias. The findings attribute measured gains to prompt structure, not the Popperian content, and highlight LLM-as-a-judge unreliability.
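The three quantities compared above (best-of-eight correctness, random selection, and judge selection) can be made concrete with a minimal sketch. The data layout is an assumption for illustration: each task is a pair of per-sample oracle verdicts and the index the judge picked. None of the function names come from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (oracle pass/fail per sample, index chosen by the judge)
Task = Tuple[Sequence[bool], int]


def best_of_k(passes: Sequence[bool]) -> bool:
    """Oracle best-of-k: the task counts as solved if any sample passes."""
    return any(passes)


def random_pick_accuracy(passes: Sequence[bool]) -> float:
    """Expected accuracy when one sample is picked uniformly at random."""
    return sum(passes) / len(passes)


def judge_pick_accuracy(passes: Sequence[bool], judge_choice: int) -> float:
    """1.0 if the sample chosen by the judge passes the oracle, else 0.0."""
    return 1.0 if passes[judge_choice] else 0.0


def summarize(tasks: List[Task]) -> dict:
    """Average the three metrics over all tasks."""
    n = len(tasks)
    return {
        "best_of_k": sum(best_of_k(p) for p, _ in tasks) / n,
        "random": sum(random_pick_accuracy(p) for p, _ in tasks) / n,
        "judge": sum(judge_pick_accuracy(p, j) for p, j in tasks) / n,
    }
```

A judge adds value only if its score is reliably above the random baseline; the best-of-k figure is the upper bound that a perfect selector could reach.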

Key takeaway

If you are a machine learning engineer evaluating prompt skills for code generation, prioritize rigorous, oracle-based evaluation over LLM-as-a-judge metrics. In this study, the elaborate "Popperian" procedural content offered no separable benefit beyond a basic structural scaffold. Focus prompt engineering effort on clear structural elements, and always use execution-based correctness checks and length-matched placebos so that gains are not misattributed to specific vocabulary or to a biased LLM evaluator.
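An execution-based correctness check can be as small as the sketch below. It is an illustration rather than the study's harness: `passes_oracle` is a hypothetical name, the candidate and its tests are assumed to be plain Python source strings, and a real setup would add sandboxing beyond a subprocess and a timeout.

```python
import subprocess
import sys


def passes_oracle(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Run candidate plus tests in a fresh interpreter; pass means exit code 0."""
    program = candidate_src + "\n\n" + test_src
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A hang or infinite loop counts as a failure.
        return False
    return result.returncode == 0
```

For example, `passes_oracle("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5\n")` runs the assertion against the candidate and reports whether it held, independent of how the code reads to a judge model.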

Key insights

Prompt scaffold structure, not the specific Popperian content, accounts for the measured correctness gains. The 0.5B self-judge did no better than random selection, and its choices showed patterns consistent with position bias.
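One simple way to look for position bias is to tabulate which slot the judge picks and compare the counts with a uniform spread. The sketch below is an assumption about how such a check could be done, not the paper's analysis; it presumes candidates were shown in an order unrelated to their correctness, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def position_histogram(choices: Sequence[int], k: int) -> List[int]:
    """Count how often the judge selected each of the k candidate positions."""
    counts = Counter(choices)
    return [counts.get(i, 0) for i in range(k)]


def chi_square_vs_uniform(hist: Sequence[int]) -> float:
    """Pearson chi-square statistic against equal expected counts per position."""
    total = sum(hist)
    expected = total / len(hist)
    return sum((obs - expected) ** 2 / expected for obs in hist)
```

A large statistic relative to a chi-square distribution with k - 1 degrees of freedom suggests the judge favors certain positions regardless of content.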

Method

A disambiguation protocol combines a labels-only scaffold, length-matched placebo, execution oracle, and vocabulary-halo sentinel within a pre-registered, two-tier ablation design to isolate prompt skill effects.
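The length-matched placebo in this protocol controls for prompt length, so that a gain cannot be credited to extra text alone. A minimal sketch of the idea follows, with two loud assumptions: whitespace word count stands in for token count, and the filler text is invented for illustration rather than taken from the study.

```python
from itertools import cycle, islice


def length_match(placebo_seed: str, target_prompt: str) -> str:
    """Repeat or truncate placebo_seed so its word count equals target_prompt's.

    Assumption: whitespace-separated words approximate prompt length; a real
    setup would match on the model's own tokenizer counts.
    """
    seed_words = placebo_seed.split()
    if not seed_words:
        raise ValueError("placebo_seed must contain at least one word")
    target_len = len(target_prompt.split())
    return " ".join(islice(cycle(seed_words), target_len))
```

With the placebo matched in length, and the labels-only scaffold carrying the section labels without the procedure, any remaining difference between the full skill and these controls would point to the Popperian content itself.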

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.