Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@$K$ Crossover on a Free-Verifier Domain

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A study investigated whether language models that train on their own verified outputs acquire new capabilities or merely refine existing ones. The researchers used a teacher-free "constellation" of a generator, a learned critic, and a free exact verifier on a FlashFill-style "trapdoor" DSL, where problem-solution pairs are cheap to synthesize and free to check. The whole setup ran on a single 4-bit Qwen3-4B model on a 24 GB GPU. Critic-guided selection outperformed verifier-filtered best-of-$k$ by +9.1 percentage points across all 6 seeds, with gains concentrated on tasks where candidates disagreed. Per-round STaR self-training raised the performance ceiling but did not accelerate learning: its gains decelerated across $K=4$ independent training trajectories (here $K$ counts trajectories, not the Pass@$K$ sampling budget). A measured Pass@$K$ crossover settles the diagnosis: the trained model wins at the operating budget (Pass@8), but the base model overtakes it at a large budget (Pass@64) on every trajectory. This suggests that self-training amplifies existing capability by concentrating probability mass rather than compounding it by expanding reach. (With only $K=4$ trajectories, the result is indicative, not yet a robust across-trajectory CI.)

Key takeaway

For AI scientists designing self-training pipelines: in this study, self-training primarily amplified capabilities the model already had rather than creating new ones. Critic-guided selection is worth prioritizing over simple verifier-filtered best-of-$k$, since it gained +9.1 pp here. When evaluating, run a Pass@$K$ crossover analysis to distinguish true capability expansion from mere probability mass concentration; a base model that overtakes the self-trained version at higher budgets indicates concentration.
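A Pass@$K$ crossover check can be sketched in a few lines. The summary does not say which estimator the study used; the sketch below assumes the widely used unbiased estimator, $1 - \binom{n-c}{k}/\binom{n}{k}$ for $n$ samples of which $c$ are correct. The function names and the per-task counts are hypothetical, chosen only to show a model that concentrates mass on fewer tasks.

```python
from math import comb
from typing import Optional, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Too few incorrect samples to fill a draw of size k: always a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over tasks, given the correct-sample count per task."""
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)


def crossover_budget(
    base_counts: Sequence[int],
    trained_counts: Sequence[int],
    n: int,
    budgets: Sequence[int],
) -> Optional[int]:
    """Smallest budget at which the base model strictly beats the trained one."""
    for k in sorted(budgets):
        if mean_pass_at_k(base_counts, n, k) > mean_pass_at_k(trained_counts, n, k):
            return k
    return None


# Hypothetical data (not from the study): 4 tasks, 64 samples per task.
# The base model solves every task rarely; the trained model solves two
# tasks often and has lost the other two.
base_counts = [4, 4, 4, 4]
trained_counts = [32, 32, 0, 0]

for k in (8, 64):
    print(k, mean_pass_at_k(base_counts, 64, k), mean_pass_at_k(trained_counts, 64, k))
print(crossover_budget(base_counts, trained_counts, 64, [1, 2, 4, 8, 16, 32, 64]))
```

On this toy data the trained model leads at a budget of 8 and the base model leads at 64, which is the pattern the takeaway describes as concentration rather than expanded reach.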

Key insights

Self-training amplifies existing language model capabilities by concentrating probability mass, not by expanding reach.
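One way to see why concentration produces a crossover, as a sketch under an assumption the summary does not state: suppose each of the $K$ samples drawn for task $i$ succeeds independently with probability $p_i$, over $N$ tasks. Here $K$ is the sampling budget, and $p_i$ and $N$ are illustrative symbols.

```latex
\mathrm{Pass@}K
  \;=\; \frac{1}{N}\sum_{i=1}^{N}\Bigl(1-(1-p_i)^{K}\Bigr)
  \;\xrightarrow{\;K\to\infty\;}\;
  \frac{1}{N}\,\bigl|\{\, i : p_i > 0 \,\}\bigr|
```

Raising $p_i$ on tasks the model can already solve lifts Pass@$K$ at small $K$, while the large-$K$ limit depends only on how many tasks have nonzero success probability. A trained model that sharpens $p_i$ on some tasks while shrinking it on others can therefore lead at Pass@8 and still trail at Pass@64.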

Method

A "constellation" of a generator, learned critic, and free exact verifier trains a language model on a FlashFill-style DSL, using 4-bit Qwen3-4B on a 24 GB GPU.
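To make the "free exact verifier" concrete, here is a minimal sketch on a toy string-transformation DSL. Everything in it is hypothetical: the op names, the `verify` and `keep_verified` helpers, and the examples are illustrative stand-ins, not the study's DSL or code. It shows only the verification and filtering step that a STaR-style round would use to pick which generated programs to train on; the generator and critic are out of scope.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy DSL: a program is a list of op names applied left to right.
OPS: Dict[str, Callable[[str], str]] = {
    "upper": str.upper,
    "lower": str.lower,
    "reverse": lambda s: s[::-1],
    "first3": lambda s: s[:3],
}

Program = List[str]
Example = Tuple[str, str]  # (input string, expected output string)


def run(program: Program, text: str) -> str:
    """Execute a program by applying each op in order."""
    for op in program:
        text = OPS[op](text)
    return text


def verify(program: Program, examples: List[Example]) -> bool:
    """Exact check: the program must reproduce every expected output."""
    try:
        return all(run(program, x) == y for x, y in examples)
    except KeyError:
        # An unknown op name makes the program invalid.
        return False


def keep_verified(candidates: List[Program], examples: List[Example]) -> List[Program]:
    """Filter sampled candidates down to those that pass the exact check."""
    return [p for p in candidates if verify(p, examples)]


examples = [("hello", "LEH"), ("world", "ROW")]
candidates = [
    ["upper"],
    ["first3", "reverse", "upper"],
    ["reverse", "first3", "upper"],
    ["upper", "first3", "reverse"],
    ["bogus"],
]
verified = keep_verified(candidates, examples)
print(verified)
```

Because checking a candidate only requires executing it, verification costs nothing beyond compute, which is what makes such a domain convenient for self-training experiments.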

Best for: Research Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.