Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models
Summary
A measurement study evaluated twenty-six semantic post-hoc operators for frozen small code models: models typically under 1.5 billion parameters, run locally without fine-tuning. The operators span selection, verification, repair, and generation conditioning, and all try to improve code accuracy by re-processing model samples. None of them improved held-out accuracy over a Best-of-N (BoN) baseline at matched compute. The study attributes this negative result to three mechanistic forces: a "coverage wall" (hard-task failures are systematic), a "capability scissors" (negligible discriminable error among plausible candidates), and a "near-empty consensus trap" (a hidden-wrong majority rarely co-occurs with a correct alternative). Two non-semantic operators did help. Expression-layer recovery (M1) improved DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4) and +33 tasks on MBPP+ (p=1.2e-10) by fixing extraction issues, and an adaptive consensus early-stop (ACE) saved approximately 19% of compute at a zero-harm operating point.
Key takeaway
For AI engineers deploying or optimizing frozen small code models, the priority is the "harness" around the model (code extraction and the serving pipeline) rather than complex post-hoc semantic operators. Investigate expression-layer recovery (M1) first: it targets code that is correct but mis-expressed, and it produced the study's only significant accuracy gains. Also consider adaptive consensus early-stopping (ACE) for modest, bounded compute savings. These practical fixes delivered measurable improvements where semantic re-processing did not, because of the model limitations the study identifies.
Key insights
Semantic post-hoc operators fail for frozen small code models due to systematic limitations, while harness fixes yield accuracy gains.
Principles
- Weak model failures are systematic, not stochastic (coverage wall).
- Competent generators leave little discriminable error among plausible candidates (capability scissors).
- Leakage-free selectors rarely find a hidden-wrong majority with a correct alternative (consensus trap).
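A coverage wall can be checked directly before any selector is built. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), and also counts tasks with zero correct samples, since no selector can rescue those. The summary does not say which estimator the study used, so this choice is an assumption, and the names `pass_at_k` and `coverage_report` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage_report(correct_counts, n: int, k: int):
    """Return (mean pass@k, number of tasks with no correct sample at all).

    correct_counts holds, per task, how many of the n samples were correct.
    Assumption: correctness is judged on the full (hidden) test suite.
    """
    per_task = [pass_at_k(n, c, k) for c in correct_counts]
    never_solved = sum(1 for c in correct_counts if c == 0)
    return sum(per_task) / len(per_task), never_solved
```

If `never_solved` is a large share of the failing tasks, most of the remaining error is out of reach for any operator that only re-ranks existing samples.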
Method
- M1: Robust multi-strategy extraction and signature alignment for mis-expressed code, applied when the standard pipeline finds no visible-passer.
- ACE: Stop sampling once a commit threshold on agreeing visible-passers is met.
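To make the M1 idea concrete, here is a minimal sketch of multi-strategy extraction with a simple form of signature alignment. It is not the study's implementation: the three strategies, the aliasing rule, and the names `candidate_sources`, `top_level_functions`, and `recover` are all assumptions chosen for illustration.

```python
import ast
import re

# Built indirectly so this example can itself live inside a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def candidate_sources(raw: str):
    """Yield possible code strings, from most to least specific strategy."""
    for match in FENCE_RE.finditer(raw):  # strategy 1: fenced blocks
        yield match.group(1)
    yield raw  # strategy 2: the whole completion as-is
    lines = raw.splitlines()
    for i, line in enumerate(lines):  # strategy 3: from the first code-looking line
        if line.startswith(("def ", "import ", "from ", "class ")):
            yield "\n".join(lines[i:])
            break


def top_level_functions(source: str):
    """Return names of top-level functions, or None if the source does not parse."""
    try:
        tree = ast.parse(source)
    except (SyntaxError, ValueError):
        return None
    return [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]


def recover(raw: str, entry_point: str):
    """Return source that defines entry_point, or None if no strategy works."""
    for source in candidate_sources(raw):
        names = top_level_functions(source)
        if not names:
            continue
        if entry_point in names:
            return source
        if len(names) == 1:
            # Signature alignment (assumed rule): a lone function with the
            # wrong name is aliased to the expected entry point.
            return f"{source.rstrip()}\n\n{entry_point} = {names[0]}\n"
    return None
```

Following the description above, a fallback like this would run only when the standard pipeline yields no visible-passer, and the recovered source would still need to pass the visible tests before being accepted.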
In practice
- Prioritize fixing code extraction and serving pipelines.
- Measure pass@k coverage before implementing selectors.
- Use adaptive early-stopping for compute savings with bounds.
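The early-stopping advice can be sketched as a sequential sampling loop that commits once enough visible-passers agree. This is an illustration under stated assumptions, not the study's ACE code: agreement is defined by a caller-supplied `signature` function (for example, outputs on a few probe inputs), the fallback to the largest agreeing group is an added choice, and every name here is hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Callable, Hashable, Iterable, Optional, Tuple


def adaptive_consensus_stop(
    samples: Iterable[str],
    passes_visible: Callable[[str], bool],
    signature: Callable[[str], Hashable],
    commit_threshold: int,
    max_samples: int,
) -> Tuple[Optional[str], int]:
    """Return (chosen candidate or None, number of samples drawn)."""
    votes: Counter = Counter()
    representative = {}
    used = 0
    # islice caps the budget without drawing an extra sample from a generator.
    for candidate in islice(samples, max_samples):
        used += 1
        if not passes_visible(candidate):
            continue  # only visible-passers may vote
        key = signature(candidate)
        votes[key] += 1
        representative.setdefault(key, candidate)
        if votes[key] >= commit_threshold:
            return representative[key], used  # early commit saves the rest
    if not votes:
        return None, used  # budget spent, no visible-passer found
    # Assumed fallback: no group reached the threshold, take the largest one.
    best_key, _ = votes.most_common(1)[0]
    return representative[best_key], used
```

Passing a lazy generator as `samples` is what turns the early return into saved compute, and `max_samples` keeps the worst case equal to the fixed Best-of-N budget. The threshold is the knob to tune against a held-out set so that stopping early does not cost accuracy.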
Topics
- Frozen Small Code Models
- Post-Hoc Operators
- Code Generation Evaluation
- Expression-Layer Recovery
- Adaptive Compute Allocation
- Popperian Falsification
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.