Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

A measurement study evaluated twenty-six semantic post-hoc operators for frozen small code models: models typically under 1.5 billion parameters that are run locally without fine-tuning. The operators span selection, verification, repair, and generation conditioning, and each tries to improve code accuracy by re-processing model samples. None of them improved held-out accuracy over a Best-of-N (BoN) baseline at matched compute. The study attributes this negative result to three mechanistic forces: a "coverage wall" (systematic failures on hard tasks), a "capability scissors" (negligible discriminable error among plausible candidates), and a "near-empty consensus trap" (a hidden-wrong majority rarely co-occurs with a correct alternative).

Two non-semantic operators did show gains. Expression-layer recovery (M1), which fixes extraction issues, improved DeepSeek-Coder-1.3B by +12 tasks on HumanEval+ (p=2.4e-4) and +33 on MBPP+ (p=1.2e-10). Adaptive consensus early-stop (ACE) saved approximately 19% of compute at a zero-harm operating point.

Key takeaway

For AI engineers deploying or optimizing frozen small code models, the practical advice is to improve the harness around the model rather than add complex semantic post-hoc operators. Expression-layer recovery (M1), which rescues code that is correct but mis-expressed, yielded significant accuracy gains. Adaptive consensus early-stopping (ACE) offers modest, bounded compute savings. These harness fixes delivered measurable improvements where semantic re-processing did not, a result the study attributes to inherent model limitations.
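
To make the idea of expression-layer recovery concrete, here is a minimal Python sketch of what multi-strategy extraction plus signature alignment could look like. It is not the study's implementation: the function names (`recover`, `_candidates`), the three extraction strategies, and the aliasing rule are all assumptions chosen for illustration. The check for whether the standard pipeline found a visible-passer is assumed to happen outside this function.

```python
import ast
import re
from typing import Iterator, Optional

# Hypothetical pattern: a fenced block with an optional language tag.
FENCE_RE = re.compile(r"```[a-zA-Z0-9_+-]*\n(.*?)```", re.DOTALL)


def _candidates(raw: str) -> Iterator[str]:
    """Yield candidate code strings, most specific strategy first."""
    # Strategy 1: the contents of each fenced block.
    for match in FENCE_RE.finditer(raw):
        yield match.group(1)
    # Strategy 2: everything from the first top-level def/import line onward.
    lines = raw.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(("def ", "import ", "from ")):
            yield "\n".join(lines[i:])
            break
    # Strategy 3: the raw response as-is.
    yield raw


def recover(raw: str, entry_point: str) -> Optional[str]:
    """Return code that parses and defines `entry_point`, or None."""
    for code in _candidates(raw):
        try:
            tree = ast.parse(code)
        except SyntaxError:
            continue  # this strategy produced non-code; try the next one
        funcs = [n.name for n in tree.body if isinstance(n, ast.FunctionDef)]
        if entry_point in funcs:
            return code
        if len(funcs) == 1:
            # Signature alignment (assumed rule): the model wrote exactly one
            # function under a different name, so alias it to the expected one.
            return code.rstrip() + f"\n\n{entry_point} = {funcs[0]}\n"
    return None
```

For example, a response that wraps `def add_numbers(a, b)` in a fenced block and surrounds it with chat text would be recovered and aliased to an expected entry point `add`, whereas a response containing no parseable code returns `None`. The recovered code still has to be run against the tests; the sketch changes only how the answer is read out of the response, not its semantics.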

Key insights

Semantic post-hoc operators fail for frozen small code models because of systematic limitations, while harness fixes yield accuracy gains (M1) and compute savings (ACE).

Method

M1: Robust multi-strategy extraction and signature alignment for mis-expressed code, applied when the standard pipeline finds no visible-passer.

ACE: Stop sampling once a commit threshold on agreeing visible-passers is met.
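
The early-stop rule can be sketched as a short sampling loop. The Python below is an illustrative reading of "commit threshold on agreeing visible-passers", not the study's code. The function name `ace_select`, the `passes_visible` and `signature` callbacks, and the default threshold and budget are assumptions. Here two candidates "agree" when `signature` returns the same hashable value for both, for instance their outputs on a fixed set of probe inputs.

```python
from collections import Counter
from itertools import islice
from typing import Callable, Dict, Hashable, Iterable, Optional, Tuple


def ace_select(
    samples: Iterable[str],
    passes_visible: Callable[[str], bool],
    signature: Callable[[str], Hashable],
    commit_threshold: int = 3,  # assumed default, not from the study
    budget: int = 10,           # assumed maximum number of samples
) -> Tuple[Optional[str], int]:
    """Return (chosen candidate or None, number of samples consumed)."""
    counts: Counter = Counter()
    representative: Dict[Hashable, str] = {}
    used = 0
    # islice keeps the loop lazy: no sample beyond the budget is requested.
    for cand in islice(samples, budget):
        used += 1
        if not passes_visible(cand):
            continue  # only visible-passers may vote
        sig = signature(cand)
        counts[sig] += 1
        representative.setdefault(sig, cand)
        if counts[sig] >= commit_threshold:
            # Enough agreeing visible-passers: commit and stop sampling.
            return representative[sig], used
    if counts:
        # Budget exhausted without a commit: fall back to the largest group.
        best_sig, _ = counts.most_common(1)[0]
        return representative[best_sig], used
    return None, used
```

If `samples` is a lazy generator that calls the model, every early commit saves the remaining generations for that task. The size of that saving depends on the threshold and on how often candidates agree, so the figure reported in the study should not be expected from this sketch without tuning.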

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.