Peer-Predictive Self-Training for Language Model Reasoning

2026-01-28 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

Peer-Predictive Self-Training (PST) is a label-free fine-tuning framework in which several language models improve collaboratively without external supervision. The models generate responses sequentially, and the final aggregated answer serves as a more reliable internal training signal than any single response. PST uses pointwise mutual information (PMI) to quantify how informative each intermediate response is about the aggregate, and scales self-training updates accordingly: responses already aligned with the aggregate receive smaller updates, while less informative or misaligned ones receive larger updates. On the mathematical reasoning benchmarks SimulEq, Math500, and MultiArith, PST improved exact-match accuracy by 2.2–4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B. It also reduced the average generator–verifier gap (GV-Gap) by 26–40%, suggesting that cross-model interactions can supply an effective self-supervised training signal.

Key takeaway

For research scientists developing self-improving language models, PST offers a label-free fine-tuning approach. Cross-model aggregation and PMI-weighted updates improved reasoning accuracy and narrowed the generator–verifier gap without labeled data or an explicit reward model. Consider PST for tasks where ground truth is scarce; the reported gains held across three small models from different families (Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B).

Key insights

Aggregating peer predictions and weighting self-training by mutual information enables unsupervised language model improvement.

Principles

Reliability emerges from aggregation ("wisdom of crowds").
Verification-generation asymmetry makes reasoning amenable to self-training.
Cross-model interactions provide stronger supervision than single-model self-training.

Method

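In PST, multiple models generate responses sequentially, and the final aggregated response serves as the reference. The pointwise mutual information (PMI) between each intermediate response and the final response scales that response's cross-entropy update: the more an intermediate response already tells you about the aggregate, the smaller its update.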
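The summary does not give the exact formulas, so the following is only a minimal sketch of PMI-scaled updates under stated assumptions. It uses the standard definition PMI(r; a) = log p(a | r) - log p(a), here conditioned on the question. The mapping from PMI to an update weight (a sigmoid of negative PMI), the `temperature` parameter, the choice of the aggregate as the training target, and all function names are illustrative assumptions, not the paper's method.

```python
import math
from typing import Sequence


def pmi(logp_final_given_response: float, logp_final: float) -> float:
    """Standard pointwise mutual information between an intermediate response r
    and the final aggregate a: log p(a | r, q) - log p(a | q)."""
    return logp_final_given_response - logp_final


def update_weight(pmi_value: float, temperature: float = 1.0) -> float:
    """ASSUMED mapping (not from the paper): sigmoid(-PMI / T).

    High PMI (response already informative about the aggregate) gives a weight
    near 0; low or negative PMI gives a weight near 1. Written in a numerically
    stable form so large sequence-level PMI values do not overflow.
    """
    z = pmi_value / temperature
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def weighted_cross_entropy(target_token_logprobs: Sequence[float], weight: float) -> float:
    """Mean negative log-likelihood of the target tokens (assumed here to be
    the aggregate response), scaled by the PMI-derived weight."""
    nll = -sum(target_token_logprobs) / len(target_token_logprobs)
    return weight * nll
```

For example, with `pmi(-1.0, -3.0)` the intermediate response raises the aggregate's log-probability by 2 nats, so `update_weight` returns a value below 0.5 and the cross-entropy term is scaled down. Any monotonically decreasing function of PMI would express the same idea; the sigmoid is used here only because it keeps the weight bounded in (0, 1).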

In practice

Apply PST to fine-tune LLMs on reasoning tasks without labeled data.
Utilize heterogeneous model populations for enhanced verification signals.
Implement PMI-based scaling to modulate self-training update strength.
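The sequential generation step can be sketched as below. This is a hedged illustration, not the authors' pipeline: the `Model` callable signature, the `run_round` helper, and the assumption that each model sees all earlier responses are hypothetical, and the toy models stand in for real LLM calls.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: a model maps (question, earlier responses) to a response.
Model = Callable[[str, List[str]], str]


def run_round(question: str, models: List[Model]) -> Tuple[List[str], str]:
    """Query the models in order, passing each one the responses produced so far.

    Returns (intermediate_responses, final_response); the last model's output
    is treated as the aggregate used as the training reference.
    """
    if len(models) < 2:
        raise ValueError("need at least one peer and one final aggregating model")
    responses: List[str] = []
    for model in models:
        # Pass a copy so a model cannot mutate the shared history.
        responses.append(model(question, list(responses)))
    return responses[:-1], responses[-1]


# Toy stand-ins for a heterogeneous population (assumed behavior, for illustration).
def peer_a(question: str, previous: List[str]) -> str:
    return "x = 3"


def peer_b(question: str, previous: List[str]) -> str:
    return "x = 4"


def aggregator(question: str, previous: List[str]) -> str:
    # Picks the most common earlier answer; a real final model would generate text.
    return Counter(previous).most_common(1)[0][0]


intermediate, final = run_round("Solve 2x + 1 = 7", [peer_a, peer_b, peer_a, aggregator])
```

Each entry in `intermediate` would then be scored against `final` to decide how strongly the corresponding model is updated.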

Topics

Peer-Predictive Self-Training
Language Model Reasoning
Self-Supervised Learning
Pointwise Mutual Information
Mathematical Reasoning Benchmarks

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.