Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

2026-06-01 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Chunk-Level Guided Generation is proposed as a training-free alternative to Process Reward Model (PRM)-guided search for mathematical reasoning. Instead of a trained reward model, it uses an off-the-shelf large language model as the process scorer. At each step, a smaller model samples k fixed-length candidate chunks, and the larger model scores them by their likelihood without generating any text itself. The highest-scoring chunk is committed, which steers generation before errors can propagate. The framework has two variants: Likelihood-Guided Selection (LGS), which ranks chunks by the large model's log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability so that it favors chunks the large model prefers more strongly than the small model does. The research highlights that fixed-length chunks are essential, because variable-length chunks introduce a systematic length bias into likelihood scoring. On GSM8K and MATH, CGS outperforms majority voting by up to 28 percentage points and matches or exceeds PRM-guided search without any reward-model training, both for Qwen2.5-1.5B guided by Qwen2.5-32B and for Llama-3.2-1B guided by Llama-3.1-70B. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math.
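Written out, the two selection rules reduce to the scores below. This is a sketch that assumes a chunk's score is its summed token log-probability. The notation ($x$ for the committed prefix, $c_i$ for the $i$-th of $k$ candidates, $n$ for the fixed chunk length, $p_{\text{large}}$ and $p_{\text{small}}$ for the two models) is ours, and the paper may apply weighting or normalization that this summary does not report.

```latex
\begin{aligned}
s_{\text{LGS}}(c_i) &= \log p_{\text{large}}(c_i \mid x)
  = \sum_{t=1}^{n} \log p_{\text{large}}\!\left(c_{i,t} \mid x, c_{i,<t}\right) \\
s_{\text{CGS}}(c_i) &= \log p_{\text{large}}(c_i \mid x) - \log p_{\text{small}}(c_i \mid x) \\
c^{\star} &= \arg\max_{i \in \{1, \dots, k\}} s(c_i)
\end{aligned}
```

The contrastive term is large when the big model assigns a chunk more probability than the small model that proposed it, which is the "divergent preference" CGS is designed to reward.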

Key takeaway

For machine learning engineers building mathematical reasoning systems, Chunk-Level Guided Generation is a training-free alternative to PRM-guided search. Scoring fixed-length reasoning chunks with an off-the-shelf large language model can substantially improve a small model's accuracy on benchmarks such as GSM8K and MATH. Contrastive-Guided Selection (CGS), which exploits the points where the two models' preferences diverge, is the variant to try first: it can match or exceed PRM-guided search without the cost of training a reward model.

Key insights

Scoring fixed-length chunks with an off-the-shelf LLM can guide a smaller model without any training, correcting its course before errors propagate.

Principles

Fixed-length chunks mitigate systematic length bias in likelihood scoring.
Contrastive scoring (CGS) favors chunks the large model rates more highly than the small model does.
Early intervention via chunk selection prevents error propagation.
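The length-bias principle is easy to see with toy numbers. Summing per-token log-probabilities means every extra token adds another negative term, so a short candidate can outscore a longer one that is better token for token. The sketch below uses made-up per-token probabilities (not figures from the paper) and assumes chunk length is counted in tokens.

```python
import math
from typing import List, Sequence


def summed_logprob(token_logprobs: Sequence[float]) -> float:
    """Chunk log-likelihood: the sum of its per-token log-probabilities."""
    return sum(token_logprobs)


def pick(candidates: List[Sequence[float]]) -> int:
    """Index of the candidate with the highest summed log-likelihood."""
    return max(range(len(candidates)), key=lambda i: summed_logprob(candidates[i]))


# Illustrative per-token log-probs from a hypothetical scorer.
short_weak = [math.log(0.5)] * 4    # 4 tokens, each only 50% likely
long_strong = [math.log(0.8)] * 16  # 16 tokens, each 80% likely

# Variable lengths: the short, lower-quality chunk wins (-2.77 vs -3.57).
print(pick([short_weak, long_strong]))  # 0

# Fixed length (16 tokens each): per-token quality decides the ranking again.
long_weak = [math.log(0.5)] * 16
print(pick([long_weak, long_strong]))  # 1
```

Holding the chunk length fixed removes the number of terms in the sum as a confounder, so the comparison reflects how plausible each token is rather than how many tokens there are.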

Method

A small model samples k fixed-length candidate chunks. A larger model scores these candidates using likelihoods, without generating text. The highest-scoring chunk is committed, and the next step samples from the extended prefix.
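The three steps can be sketched as a decoding loop. This is a minimal illustration, not the authors' implementation. The callables `sample_chunk`, `large_logprob` and `small_logprob`, and the defaults for `k`, `chunk_len` and `max_chunks`, are hypothetical placeholders for real model calls. The toy models at the bottom exist only so that the loop runs.

```python
import math
import random
from typing import Callable, List

Tokens = List[int]  # a sequence of token ids


def guided_generate(
    prompt: Tokens,
    sample_chunk: Callable[[Tokens, int, random.Random], Tokens],
    large_logprob: Callable[[Tokens, Tokens], float],
    small_logprob: Callable[[Tokens, Tokens], float],
    k: int = 4,
    chunk_len: int = 8,
    max_chunks: int = 16,
    mode: str = "cgs",
    seed: int = 0,
) -> Tokens:
    """Chunk-level guided decoding: sample k chunks, score them, commit the best."""
    if mode not in ("lgs", "cgs"):
        raise ValueError("mode must be 'lgs' or 'cgs'")
    rng = random.Random(seed)
    seq = list(prompt)
    for _ in range(max_chunks):
        # 1. The small model proposes k candidates of the same fixed length.
        candidates = [sample_chunk(seq, chunk_len, rng) for _ in range(k)]

        # 2. Score each candidate against the current prefix (no generation).
        def score(chunk: Tokens) -> float:
            s = large_logprob(seq, chunk)
            if mode == "cgs":
                # Contrastive: reward what the large model likes more than the small one.
                s -= small_logprob(seq, chunk)
            return s

        # 3. Commit the best chunk; the next step continues from the extended prefix.
        seq.extend(max(candidates, key=score))
    return seq


# Toy stand-ins over a 10-token vocabulary (purely illustrative).
def toy_sample(prefix: Tokens, n: int, rng: random.Random) -> Tokens:
    return [rng.randrange(10) for _ in range(n)]


def toy_large(prefix: Tokens, chunk: Tokens) -> float:
    # "Large model" that prefers even token ids (probabilities sum to 1 over 10 ids).
    return sum(math.log(0.15 if t % 2 == 0 else 0.05) for t in chunk)


def toy_small(prefix: Tokens, chunk: Tokens) -> float:
    # "Small model" that is uniform over the vocabulary.
    return len(chunk) * math.log(0.1)


out = guided_generate([1, 2, 3], toy_sample, toy_large, toy_small)
```

Because the toy small model is uniform, LGS and CGS rank candidates identically here. The two modes only diverge when the small model's own likelihood varies across candidates. A real system would also stop at an end-of-sequence token rather than after a fixed number of chunks.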

In practice

Implement Chunk-Level Guided Generation for mathematical reasoning tasks.
Prioritize Contrastive-Guided Selection (CGS) over LGS for performance gains.
Use an off-the-shelf large language model as a training-free process scorer.
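Scoring "without generating text" amounts to teacher forcing. The prefix plus the candidate chunk are passed through the scorer once, and the log-probabilities of the chunk tokens are read off. The dependency-free helper below sketches that bookkeeping. It assumes you already have one row of next-token logits per position, as in the `logits` output of a causal language model's forward pass, where the row at position t predicts token t + 1. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def chunk_logprob(
    logits: Sequence[Sequence[float]], tokens: Sequence[int], prefix_len: int
) -> float:
    """Sum of log p(chunk token | everything before it) from a single forward pass.

    tokens = prefix + chunk; logits[t] is the next-token distribution after
    tokens[: t + 1], so tokens[t] is scored by logits[t - 1].
    """
    if prefix_len < 1:
        raise ValueError("need at least one prefix token to condition on")
    return sum(
        log_softmax(logits[t - 1])[tokens[t]] for t in range(prefix_len, len(tokens))
    )
```

For the contrastive variant, the same function would be applied to the small model's logits over the same tokens, and the result subtracted from the large model's score.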

Topics

Large Language Models
Mathematical Reasoning
Process Scoring
Chunk-Level Guided Generation
Reward Models
Contrastive-Guided Selection
GSM8K

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.