Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models
Summary
This measurement study evaluated 26 semantic post-hoc falsification operators for frozen small code models (<=1.5B parameters). Such models are typically run locally without fine-tuning and often produce incorrect programs. The operators, which select, verify, or repair model samples without retraining, were compared against Best-of-N (BoN) under a deterministic execution oracle. None of them improved held-out accuracy over BoN. The study attributes this failure to three effects: a "coverage wall" of systematic hard-task failures, a "capability scissors" in which competent generators leave few discriminable errors, and a "near-empty consensus trap." Two non-semantic operators did help: an expression-layer recovery (M1) gained +12 tasks for DeepSeek-Coder-1.3B on HumanEval+ (p=2.4e-4), and an adaptive consensus early-stop (ACE) saved about 19% of compute with zero harm. These findings replicated across the HumanEval+ and MBPP+ benchmarks.
Key takeaway
If you are a machine learning engineer deploying frozen small code models and you are struggling with plausible-but-wrong outputs, avoid complex semantic post-hoc filtering. Fix your test harness and measure coverage before adding any post-hoc reasoning. Investigate robust extraction techniques such as the M1 expression-layer recovery, which significantly improved DeepSeek-Coder-1.3B's results on HumanEval+, and consider adaptive consensus early-stop for roughly 19% compute savings without accuracy loss.
Key insights
None of the 26 semantic post-hoc operators studied improved small code model accuracy over Best-of-N; focus on extraction and harness issues instead.
Principles
- Post-hoc semantic filtering often fails.
- Coverage limits what any post-hoc operator can gain.
- Robust extraction can recover correct outputs.
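To make the last principle concrete, here is a minimal sketch of tolerant code extraction from a raw model completion. It is not the study's M1 expression-layer recovery, whose details are not given in this summary. The helper names (`extract_code`, `parses`) and the fallback order are assumptions chosen for illustration.

```python
import ast
import re
from typing import Optional

# Built at runtime so the fence marker never appears literally in this example.
FENCE_MARK = "`" * 3
FENCED = re.compile(
    re.escape(FENCE_MARK) + r"(?:python|py)?[ \t]*\n(.*?)" + re.escape(FENCE_MARK),
    re.DOTALL,
)


def parses(src: str) -> bool:
    """True if src is syntactically valid Python."""
    try:
        ast.parse(src)
        return True
    except (SyntaxError, ValueError):
        return False


def extract_code(raw: str) -> Optional[str]:
    """Hypothetical helper: return the first candidate that parses, else None.

    Order tried (an assumption, not the paper's method):
    1. each fenced block, 2. the whole completion,
    3. the longest line prefix that still parses (drops trailing chatter).
    """
    candidates = [m.group(1) for m in FENCED.finditer(raw)]
    candidates.append(raw)
    for cand in candidates:
        if cand.strip() and parses(cand):
            return cand
    lines = raw.splitlines()
    for end in range(len(lines) - 1, 0, -1):
        prefix = "\n".join(lines[:end])
        if prefix.strip() and parses(prefix):
            return prefix
    return None
```

A syntax check is only a weak filter: a one-word refusal such as `Sorry` is valid Python, so a real harness would still run the extracted program against its tests.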
Method
The study evaluated 26 semantic post-hoc operators against Best-of-N using a deterministic execution oracle and a leakage-free, matched-compute protocol on code generation benchmarks.
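The relationship between coverage and selection accuracy can be sketched in a few lines. The snippet assumes the oracle's verdicts are already available as a boolean matrix (one row per task, one column per sample). The function names and the toy data are hypothetical and do not come from the study.

```python
from typing import Sequence


def coverage(passes: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks where at least one of the N samples passes the oracle."""
    return sum(any(row) for row in passes) / len(passes)


def selection_accuracy(passes: Sequence[Sequence[bool]], chosen: Sequence[int]) -> float:
    """Fraction of tasks where the sample picked by a selector passes the oracle."""
    return sum(row[i] for row, i in zip(passes, chosen)) / len(passes)


# Toy oracle verdicts: 3 tasks x 3 samples (illustrative values only).
toy_passes = [
    [True, False, True],
    [False, False, False],  # no sample passes: no selector can rescue this task
    [False, True, False],
]
baseline_choice = [0, 0, 0]  # e.g. always take the first sample

cov = coverage(toy_passes)
acc = selection_accuracy(toy_passes, baseline_choice)
headroom = cov - acc  # the most any selection operator could add on this data
```

Because a selector can only pick among existing samples, its accuracy is bounded above by coverage. Measuring that bound first shows how much room a selection operator has before any effort goes into building one.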
In practice
- Fix the test harness and measure coverage first.
- Investigate robust extraction methods.
- Consider adaptive consensus early-stop for compute savings.
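The early-stop idea in the last item can be illustrated with a small sampling loop. This is a generic consensus-based stopping rule, not the study's ACE operator, whose exact criterion is not described in this summary. The `draw` callback, the `min_votes` threshold, and the notion of a hashable behavior signature per sample are all assumptions.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def sample_until_consensus(
    draw: Callable[[], Hashable],
    max_samples: int = 10,
    min_votes: int = 3,
) -> Tuple[Hashable, int]:
    """Draw samples one at a time; stop once one signature has min_votes votes.

    draw() is a hypothetical callback returning a hashable signature for a
    fresh sample (for example, its outputs on a few probe inputs).
    Returns (leading signature, number of samples actually drawn).
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    counts: Counter = Counter()
    for used in range(1, max_samples + 1):
        counts[draw()] += 1
        leader, votes = counts.most_common(1)[0]
        if votes >= min_votes:
            return leader, used  # early stop: remaining budget is saved
    # Budget exhausted: fall back to the plurality signature.
    return counts.most_common(1)[0][0], max_samples
```

The compute saved is the gap between `max_samples` and the returned count, averaged over tasks. Whether stopping early costs accuracy depends on the threshold and the model, so it should be checked against the full-budget baseline on held-out tasks.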
Topics
- Small Code Models
- Post-Hoc Operators
- Code Generation
- HumanEval+
- DeepSeek-Coder-1.3B
- Expression-Layer Recovery
Best for: AI Engineer, AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.