Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@$K$ Crossover on a Free-Verifier Domain
Summary
The study asks whether language models trained on their own verified outputs acquire new capabilities or merely refine existing ones. The researchers use a teacher-free "constellation" of a generator, a learned critic, and a free exact verifier on a FlashFill-style "trapdoor" DSL, where problem-solution pairs are cheap to synthesize and free to check. The whole setup runs on a single 4-bit Qwen3-4B model on a 24 GB GPU. Critic-guided selection outperformed verifier-filtered best-of-$k$ by +9.1 percentage points across all 6 seeds, with the gains concentrated on tasks where candidates disagreed. Per-round STaR self-training raised the performance ceiling but did not accelerate learning: gains decelerated across $K=4$ independent training trajectories. A measured Pass@$K$ crossover settles the diagnosis. The trained model wins at the operating budget (Pass@8), but the base model overtakes it at a large budget (Pass@64) on every trajectory, which suggests that self-training amplifies existing capability by concentrating probability mass rather than compounding it by expanding reach. (With only $K=4$ trajectories, the result is indicative, not yet a robust across-trajectory CI.)
Key takeaway
For AI scientists designing self-training pipelines: the teacher-free self-training studied here amplified existing model capabilities rather than creating new ones. Prefer critic-guided selection over verifier-filtered best-of-$k$, which it beat by +9.1 pp in this study. When evaluating, run a Pass@$K$ crossover analysis to distinguish genuine capability expansion from probability mass concentration: if the base model overtakes the self-trained model at larger budgets, the training has concentrated mass rather than expanded reach.
Key insights
Self-training amplifies existing language model capabilities by concentrating probability mass, not by expanding reach.
Principles
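- Critic-guided selection outperforms verifier-filtered best-of-$k$.
- Self-training raises the ceiling, but its gains decelerate rather than compound.
- A Pass@$K$ crossover, where the base model overtakes the trained one at large budgets, reveals amplification.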
Method
A "constellation" of a generator, learned critic, and free exact verifier trains a language model on a FlashFill-style DSL, using 4-bit Qwen3-4B on a 24 GB GPU.
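The "cheap to synthesize, free to check" property can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the op set `OPS`, the list-of-op-names program representation, and the helper names `run`, `synthesize_task`, and `verify` are hypothetical, since this summary does not describe the study's actual DSL. Running a random program forward yields a task at almost no cost, and an exact verifier only has to re-run a candidate and compare outputs.

```python
import random

# Hypothetical op set, for illustration only; not the study's DSL.
OPS = {
    "upper": str.upper,
    "lower": str.lower,
    "strip": str.strip,
    "reverse": lambda s: s[::-1],
    "first3": lambda s: s[:3],
}


def run(program, text):
    """Apply each named op to the text, in order."""
    for name in program:
        text = OPS[name](text)
    return text


def synthesize_task(rng, inputs, length=2):
    """Sample a random program and run it forward to get I/O examples."""
    program = [rng.choice(sorted(OPS)) for _ in range(length)]
    examples = [(x, run(program, x)) for x in inputs]
    return program, examples


def verify(candidate, examples):
    """Exact check: the candidate must reproduce every expected output."""
    return all(run(candidate, x) == y for x, y in examples)


rng = random.Random(0)
program, examples = synthesize_task(rng, [" Hello World ", "abcDEF"])

# Stand-ins for generator samples; only verified ones would be kept
# as self-training data for the next round.
candidates = [["upper"], program, ["reverse", "lower"]]
verified = [c for c in candidates if verify(c, examples)]
```

Note that the verifier accepts any program that matches the examples, not only the one that generated them, so `verified` can contain semantically equivalent alternatives.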
In practice
- Use critic-guided selection for self-training.
- Evaluate self-training with Pass@$K$ crossover.
- Consider a single 4-bit Qwen3-4B on a 24 GB GPU as a testbed for this kind of study.
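A Pass@$K$ crossover check might be sketched as follows. The sketch assumes the standard unbiased pass@k estimator from Chen et al. (2021), $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ samples of which $c$ are correct; the summary does not say which estimator the study used. The helper names and the per-task counts are hypothetical and made up for illustration; they are not results from the study.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over tasks, given each task's correct-sample count."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


def crossover_budget(base_counts, trained_counts, n, budgets):
    """Smallest budget where the base model strictly beats the trained one."""
    for k in budgets:
        base = mean_pass_at_k(base_counts, n, k)
        trained = mean_pass_at_k(trained_counts, n, k)
        if base > trained:
            return k
    return None  # no crossover within the tested budgets


# Made-up counts of correct samples out of n = 64 per task (illustration only).
n = 64
base_counts = [4, 4, 4, 4]        # base: low but nonzero success on every task
trained_counts = [40, 40, 40, 0]  # trained: mass concentrated, one task lost
budgets = [1, 8, 64]

k_cross = crossover_budget(base_counts, trained_counts, n, budgets)
```

With these illustrative counts the trained model leads at budgets 1 and 8 and the base model leads at 64, which is the amplification signature described above: higher success where the model could already succeed, but narrower reach.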
Topics
- Teacher-Free Self-Training
- Language Model Training
- Critic-Guided Selection
- Pass@K Evaluation
- Qwen3-4B
- Capability Amplification
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.