Peer-Predictive Self-Training for Language Model Reasoning
Summary
Peer-Predictive Self-Training (PST) is a label-free fine-tuning framework in which language models improve collaboratively without external supervision. Multiple models generate responses sequentially, and the final aggregated answer serves as a more reliable internal training signal. PST uses pointwise mutual information (PMI) to quantify how informative each intermediate response is about the aggregate, and scales self-training updates accordingly: responses aligned with the aggregate receive smaller updates, while misaligned ones receive larger updates. On the mathematical reasoning benchmarks SimulEq, Math500, and MultiArith, PST improved exact-match accuracy by 2.2–4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B. It also reduced the average generator–verifier gap (GV-Gap) by 26–40%, indicating that cross-model interactions can supply an effective self-supervised training signal.
Key takeaway
For research scientists developing self-improving language models, PST offers a label-free fine-tuning approach. Cross-model aggregation and PMI-weighted updates improve reasoning accuracy and narrow the generator–verifier gap without expensive labeled data or an explicit reward model. Consider integrating PST into your training pipelines, especially for tasks where ground truth is scarce; the reported gains held across all three model families tested.
Key insights
Aggregating peer predictions and weighting self-training by mutual information enables unsupervised language model improvement.
Principles
- Reliability emerges from aggregation ("wisdom of crowds").
- Verification-generation asymmetry makes reasoning amenable to self-training.
- Cross-model interactions provide stronger supervision than single-model self-training.
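The first principle can be illustrated with a textbook calculation. The sketch below is not PST's aggregation procedure; it computes, under the idealized assumption of independent voters who are each correct with the same probability, how often a simple majority vote is correct. The function name `majority_correct_prob` is hypothetical.

```python
from math import comb


def majority_correct_prob(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes an odd number of voters, each correct independently with
    probability p_correct. This is an idealization, not a property of
    real language models, whose errors are usually correlated.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, round(majority_correct_prob(n, 0.6), 5))
```

With voters that are individually right 60% of the time, the majority is right about 64.8% of the time for three voters and about 68.3% for five. The effect reverses when individual accuracy is below 50%, which is one reason the quality of the aggregate matters.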
Method
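Multiple models generate responses in sequence, and the final aggregated response serves as the reference. The pointwise mutual information (PMI) between each intermediate response and the final response then scales that response's cross-entropy loss update: the more an intermediate response already agrees with the aggregate, the smaller its update.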
In practice
- Apply PST to fine-tune LLMs on reasoning tasks without labeled data.
- Utilize heterogeneous model populations for enhanced verification signals.
- Implement PMI-based scaling to modulate self-training update strength.
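The sequential generation step can be sketched as below. Everything here is illustrative: a "model" is any callable from prompt to response, the prompt layout is invented, and treating the last model's output as the aggregated answer is an assumption about how aggregation is realized. The toy models stand in for real LLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps a prompt string to a response string.
Model = Callable[[str], str]

EARLIER_HEADER = "Earlier responses:\n"


def sequential_round(question: str, models: List[Model]) -> Tuple[List[str], str]:
    """Run models in order; each sees the question and all earlier responses.

    Returns (intermediate_responses, final_response), where the final
    response is the output of the last model in the sequence.
    """
    responses: List[str] = []
    for model in models:
        prompt = question
        if responses:
            prompt += "\n" + EARLIER_HEADER + "\n".join(responses)
        responses.append(model(prompt))
    return responses[:-1], responses[-1]


def toy_model_three(prompt: str) -> str:
    return "x = 3"


def toy_model_two(prompt: str) -> str:
    return "x = 2"


def toy_aggregator(prompt: str) -> str:
    # Returns the most frequent earlier answer; a real model would reason instead.
    earlier = prompt.split(EARLIER_HEADER)[-1].splitlines()
    return max(set(earlier), key=earlier.count)


if __name__ == "__main__":
    models = [toy_model_three, toy_model_three, toy_model_two, toy_aggregator]
    intermediates, final = sequential_round("Solve 2x + 1 = 7.", models)
    print(intermediates, final)
```

Using models from different families in the `models` list corresponds to the heterogeneous population suggested above; the intermediate responses and the final response are then the inputs to the PMI-based weighting.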
Topics
- Peer-Predictive Self-Training
- Language Model Reasoning
- Self-Supervised Learning
- Pointwise Mutual Information
- Mathematical Reasoning Benchmarks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.