Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
Summary
Chunk-Level Guided Generation offers a training-free alternative to PRM-guided search for improving mathematical reasoning in smaller language models. It uses an off-the-shelf large language model as a process scorer: a small model samples k fixed-length candidate chunks, and the larger model scores them by likelihood to steer generation and stop errors from propagating. The framework includes two selection rules, Likelihood-Guided Selection (LGS) and Contrastive-Guided Selection (CGS); CGS favors chunks that the large model rates more highly than the small model does. Because all candidates have the same length, the comparison avoids systematic length bias. On GSM8K, MATH, Minerva Math, AMC23, and AIME24, CGS (for example, Qwen2.5-1.5B guided by Qwen2.5-32B) outperforms majority voting by up to 28 percentage points. It matches or exceeds Qwen2.5-Math-PRM-72B-guided search under similar guidance budgets, without requiring reward-model training. Qwen2.5-7B guided by Qwen2.5-72B achieved 81.8% on MATH and 63.6% on Minerva Math at k=16, while also yielding shorter reasoning traces.
Key takeaway
Machine learning engineers building mathematical reasoning systems can use Chunk-Level Guided Generation to improve small-model accuracy without training a reward model. Evaluate Contrastive-Guided Selection (CGS) with off-the-shelf LLMs you already have: on benchmarks such as GSM8K and MATH it matches or exceeds PRM-guided search, produces shorter reasoning traces, and removes the need for costly step-level label collection.
Key insights
Off-the-shelf LLMs can guide smaller models in mathematical reasoning by scoring fixed-length chunks, avoiding training and length bias.
Principles
- Fixed-length chunks mitigate LLM length bias.
- Chunks the large model prefers more strongly than the small model does make better selections.
- Stronger LLMs can guide weaker ones without training.
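The length-bias principle can be illustrated with a small arithmetic sketch. The token log-probabilities below are made-up values, not figures from the article; the point is only that a summed log-probability drops as a candidate grows longer, even when per-token quality is identical, so comparing candidates of equal length removes that effect.

```python
def summed_logprob(token_logprobs):
    """Total log-probability of a candidate: the sum over its tokens."""
    return sum(token_logprobs)

# Hypothetical candidates with the same per-token log-probability (-0.5)
# but different lengths.
short_chunk = [-0.5] * 10
long_chunk = [-0.5] * 20

short_score = summed_logprob(short_chunk)  # -5.0
long_score = summed_logprob(long_chunk)    # -10.0

# The shorter candidate wins purely because it is shorter.
length_biased_winner = "short" if short_score > long_score else "long"

# With a fixed chunk length, any score gap reflects per-token quality only.
chunk_a = [-0.5] * 10
chunk_b = [-0.25] * 10
fixed_length_winner = "a" if summed_logprob(chunk_a) > summed_logprob(chunk_b) else "b"
```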
Method
A small model samples k fixed-length candidate chunks. A larger, off-the-shelf LLM then scores each chunk, either by its length-normalized log-probability (LGS) or by that score minus the small model's log-probability for the same chunk (CGS), and the highest-scoring chunk becomes the continuation.
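The two selection rules can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names (`lgs_score`, `cgs_score`, `pick_best`) are hypothetical, the log-probabilities are invented, and normalizing the small model's term by chunk length is an assumption made here for symmetry.

```python
from typing import List, Sequence


def lgs_score(large_logprobs: Sequence[float]) -> float:
    """LGS: length-normalized log-probability under the large model."""
    return sum(large_logprobs) / len(large_logprobs)


def cgs_score(large_logprobs: Sequence[float],
              small_logprobs: Sequence[float]) -> float:
    """CGS: large-model score minus small-model score.

    Assumption: the small-model term is normalized the same way as LGS.
    """
    small = sum(small_logprobs) / len(small_logprobs)
    return lgs_score(large_logprobs) - small


def pick_best(scores: List[float]) -> int:
    """Index of the highest-scoring candidate chunk."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Invented per-token log-probabilities for k=3 chunks of 4 tokens each.
large = [[-0.5] * 4, [-1.0] * 4, [-0.25] * 4]
small = [[-0.25] * 4, [-2.0] * 4, [-0.25] * 4]

lgs_pick = pick_best([lgs_score(lp) for lp in large])
cgs_pick = pick_best([cgs_score(l, s) for l, s in zip(large, small)])
```

In this toy data the two rules disagree: LGS picks the chunk the large model finds most likely (index 2), while CGS picks the chunk the large model likes far more than the small model does (index 1).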
In practice
- Use CGS for improved mathematical reasoning.
- For MATH, pair Qwen2.5-7B as the generator with Qwen2.5-72B as the scorer (81.8% at k=16).
- Consider fixed-length chunking for LLM scoring.
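To try the recommendations above, the overall loop might look like the sketch below. Everything here is an assumption for illustration: `sample_chunk` and `score_large` are hypothetical callables standing in for real model calls, the toy stubs return invented log-probabilities, and the stopping rule is a simplification.

```python
import itertools
from typing import Callable, List, Tuple

# A candidate is (chunk tokens, small-model log-probs for those tokens).
Candidate = Tuple[List[str], List[float]]


def guided_generate(prefix: List[str],
                    sample_chunk: Callable[[List[str]], Candidate],
                    score_large: Callable[[List[str], List[str]], List[float]],
                    k: int = 4,
                    max_chunks: int = 3,
                    stop_token: str = "<eos>") -> List[str]:
    """Grow a solution chunk by chunk, keeping the best of k candidates."""
    tokens = list(prefix)
    for _ in range(max_chunks):
        candidates = [sample_chunk(tokens) for _ in range(k)]

        def cgs(candidate: Candidate) -> float:
            chunk, small_lp = candidate
            large_lp = score_large(tokens, chunk)
            n = len(chunk)  # chunks are fixed-length and assumed non-empty
            return sum(large_lp) / n - sum(small_lp) / n

        best_chunk, _ = max(candidates, key=cgs)
        tokens.extend(best_chunk)
        if stop_token in best_chunk:
            break
    return tokens


# Toy stand-ins for the two models (hypothetical, deterministic).
_pool = itertools.cycle([
    (["bad", "bad"], [-0.5, -0.5]),
    (["good", "good"], [-1.0, -1.0]),
])


def toy_sample_chunk(prefix: List[str]) -> Candidate:
    return next(_pool)


def toy_score_large(prefix: List[str], chunk: List[str]) -> List[float]:
    return [-0.25 if tok == "good" else -2.0 for tok in chunk]


result = guided_generate(["Q:"], toy_sample_chunk, toy_score_large,
                         k=2, max_chunks=2)
```

With real models, `sample_chunk` would decode a fixed number of tokens from the small model and `score_large` would run one forward pass of the large model over the prefix plus the chunk.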
Topics
- Large Language Models
- Mathematical Reasoning
- Guided Generation
- Training-Free Methods
- Contrastive-Guided Selection
- Qwen2.5
- Llama 3.1
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.