Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier
Summary
A new semi-supervised framework scales Large Language Model (LLM) reasoning from minimal supervision by turning reasoning verification into a data creation mechanism. The approach trains a lightweight reasoning-correctness classifier on only a few labeled samples, then uses it to judge the validity of intermediate reasoning traces generated by an LLM. An entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that the method reaches accuracy comparable to systems using 10-15x more labeled data. Ablation analyses confirm that both the classifier and the entropy filter are critical to scalable, noise-resistant pseudo-labeling. The framework offers a practical path to constructing large-scale reasoning resources and a step toward autonomous reasoning systems that need minimal human input.
Key takeaway
Machine Learning Engineers developing LLMs with scarce labeled data should consider a semi-supervised framework built around a lightweight reasoning verifier. By turning verification into a data creation mechanism, the approach reaches reasoning accuracy comparable to systems trained on 10-15x more labeled data. This significantly reduces the need for expensive answer-level supervision, speeds up the development of robust reasoning capabilities, and paves the way for more autonomous systems.
Key insights
A semi-supervised framework scales LLM reasoning by using a lightweight verifier to generate high-confidence pseudo-labels from minimal human input.
Principles
- Reasoning verification can create data.
- Entropy filtering enhances pseudo-labeling.
- Lightweight verifiers reduce supervision cost.
Method
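Train a lightweight classifier on a few labeled samples to judge whether LLM-generated reasoning traces are correct. Keep only the traces whose verifier predictions pass an entropy-based confidence threshold, and discard the rest as unreliable. Fine-tune the LLM on the retained traces, using them as pseudo-labels.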
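The filtering step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trace` class, the `select_pseudo_labels` function, the use of binary Shannon entropy over the verifier's correctness probability, the rule of keeping only traces predicted correct, and the `max_entropy=0.5` cutoff are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical container for one LLM-generated reasoning trace."""
    question: str
    reasoning: str
    p_correct: float  # verifier's estimated probability that the trace is correct


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully certain prediction carries no entropy
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_pseudo_labels(traces, max_entropy=0.5):
    """Keep traces the verifier predicts correct with low predictive entropy.

    Assumption: only traces predicted correct (p_correct > 0.5) are kept,
    and max_entropy is an illustrative cutoff, not a value from the source.
    """
    return [
        t for t in traces
        if t.p_correct > 0.5 and binary_entropy(t.p_correct) <= max_entropy
    ]


if __name__ == "__main__":
    candidates = [
        Trace("q1", "confident, predicted correct", 0.97),
        Trace("q2", "uncertain prediction", 0.60),
        Trace("q3", "confident, predicted incorrect", 0.02),
    ]
    for t in select_pseudo_labels(candidates):
        print(t.question, round(binary_entropy(t.p_correct), 3))
```

In this sketch only the first trace survives: the second has entropy close to 1 bit, and the third is confidently predicted incorrect. The surviving traces would then form the fine-tuning set. A stricter cutoff gives fewer but cleaner pseudo-labels.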
In practice
- Construct large-scale reasoning resources.
- Develop autonomous reasoning systems.
- Reduce data labeling costs for LLM training.
Topics
- Large Language Models
- Semi-Supervised Learning
- Reasoning Verification
- Pseudo-Labeling
- Data Efficiency
- Orca-Math
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.