Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

2026-06-01 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Chunk-Level Guided Generation is a training-free alternative to PRM-guided search for improving mathematical reasoning in smaller language models. It uses an off-the-shelf large language model as a process scorer: at each step a small model samples k fixed-length candidate chunks, and the larger model scores them via likelihoods, which steers generation toward the best continuation and limits error propagation. The framework has two variants, Likelihood-Guided Selection (LGS) and Contrastive-Guided Selection (CGS); CGS favors chunks on which the large model's preference diverges from the small model's. Using fixed-length chunks avoids a systematic length bias in the likelihood scores. On GSM8K, MATH, Minerva Math, AMC23, and AIME24, CGS with Qwen2.5-1.5B guided by Qwen2.5-32B outperforms majority voting by up to 28 percentage points. The approach also matches or exceeds search guided by Qwen2.5-Math-PRM-72B under similar guidance budgets, without any reward-model training. Qwen2.5-7B guided by Qwen2.5-72B reached 81.8% on MATH and 63.6% on Minerva Math at k=16, while also producing shorter reasoning traces.
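The chunk-by-chunk selection loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: `guided_generate`, `sample_chunks`, and `score_chunk` are hypothetical names, tokens are plain strings, and the toy sampler and scorer merely stand in for the small generator and the large scorer.

```python
from typing import Callable, List

# Hypothetical interfaces (not from the paper):
#   sample_chunks(prefix, k, chunk_len) -> k candidate chunks from the small model
#   score_chunk(prefix, chunk)          -> scalar score from the large scorer model
SampleFn = Callable[[List[str], int, int], List[List[str]]]
ScoreFn = Callable[[List[str], List[str]], float]


def guided_generate(
    prompt: List[str],
    sample_chunks: SampleFn,
    score_chunk: ScoreFn,
    k: int = 16,
    chunk_len: int = 32,
    max_chunks: int = 64,
    eos: str = "<eos>",
) -> List[str]:
    """Extend the prompt one fixed-length chunk at a time, keeping the top-scored candidate."""
    prefix = list(prompt)
    for _ in range(max_chunks):
        candidates = sample_chunks(prefix, k, chunk_len)
        # Every candidate is scored in the same context; the best one is committed.
        best = max(candidates, key=lambda chunk: score_chunk(prefix, chunk))
        prefix.extend(best)
        if eos in best:  # stop once the selected chunk ends the solution
            break
    return prefix


# Toy stand-ins: one candidate is "wrong", the other is "right" and finishes the answer.
def toy_sampler(prefix: List[str], k: int, chunk_len: int) -> List[List[str]]:
    wrong = ["bad"] * chunk_len
    right = ["good"] * (chunk_len - 1) + ["<eos>"]
    return [wrong, right][:k]


def toy_scorer(prefix: List[str], chunk: List[str]) -> float:
    return float(chunk.count("good"))


result = guided_generate(["Q"], toy_sampler, toy_scorer, k=2, chunk_len=3)
print(result)  # ['Q', 'good', 'good', '<eos>']
```

Committing only the top-scored chunk at each step is what lets the scorer cut off a faulty line of reasoning early. In a real system the k candidates would likely be scored in one batched forward pass of the large model rather than one call per chunk.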

Key takeaway

For machine learning engineers building mathematical reasoning systems, especially those who want to improve small models without training a reward model, Chunk-Level Guided Generation is a strong option. Evaluate Contrastive-Guided Selection (CGS) with off-the-shelf LLMs you already have. The reported gains on benchmarks such as GSM8K and MATH (up to 28 percentage points over majority voting) are comparable to or better than PRM-guided search, with shorter reasoning traces and no need to collect costly step-level labels.

Key insights

Off-the-shelf LLMs can guide smaller models in mathematical reasoning by scoring fixed-length chunks, with no training required and no length bias in the scores.

Principles

Fixed-length chunks mitigate LLM length bias.
Favoring chunks where the large model's preference diverges from the small model's can improve selection.
Stronger LLMs can guide weaker ones without training.
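The first principle can be illustrated with made-up numbers. A summed log-probability adds one negative term per token, so it tends to penalize longer spans even when each of their tokens is more likely; with fixed-length chunks every candidate has the same number of terms, so summing and averaging rank candidates identically. The helper names and values below are illustrative only, and this shows one common mechanism of length bias rather than the paper's own analysis.

```python
from typing import Sequence


def total_logprob(token_logprobs: Sequence[float]) -> float:
    """Sum of per-token log-probabilities (one negative term per token)."""
    return sum(token_logprobs)


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized (per-token mean) log-probability."""
    return sum(token_logprobs) / len(token_logprobs)


# Variable-length steps: the longer step is more likely per token,
# yet its summed score is lower simply because it has more tokens.
short_step = [-0.5] * 2   # total -1.0, mean -0.5
long_step = [-0.3] * 6    # total about -1.8, mean about -0.3
print(total_logprob(short_step) > total_logprob(long_step))  # True: the sum prefers the short step
print(mean_logprob(short_step) < mean_logprob(long_step))    # True: per token, the long step is better

# Fixed-length chunks: both candidates have 4 tokens, so total = 4 * mean
# and the two scores order the candidates the same way.
chunk_a = [-0.5] * 4
chunk_b = [-0.3] * 4
same_order = (total_logprob(chunk_a) > total_logprob(chunk_b)) == (
    mean_logprob(chunk_a) > mean_logprob(chunk_b)
)
print(same_order)  # True
```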

Method

A small model samples k fixed-length candidate chunks. A larger off-the-shelf LLM scores each one, either by its length-normalized log-probability (LGS) or by its log-probability minus the small model's log-probability for the same chunk (CGS), and the highest-scoring chunk is selected as the continuation.
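The two selection rules can be sketched over per-token log-probabilities already computed for each candidate chunk. This is a sketch under stated assumptions: the function names are hypothetical, CGS is written as a per-token mean of the large-minus-small difference (with fixed-length chunks a plain sum ranks candidates the same way), and the numbers are invented to show how the two rules can disagree.

```python
from typing import List, Sequence, Tuple


def lgs_score(large_lp: Sequence[float]) -> float:
    """Likelihood-Guided Selection: length-normalized log-probability under the large model."""
    return sum(large_lp) / len(large_lp)


def cgs_score(large_lp: Sequence[float], small_lp: Sequence[float]) -> float:
    """Contrastive-Guided Selection: large-model minus small-model log-probability.
    The per-token mean is an assumed normalization."""
    if len(large_lp) != len(small_lp):
        raise ValueError("both models must score the same tokens")
    return sum(lg - sm for lg, sm in zip(large_lp, small_lp)) / len(large_lp)


# Each candidate: (large-model token log-probs, small-model token log-probs).
candidates: List[Tuple[List[float], List[float]]] = [
    ([-0.2, -0.2], [-0.1, -0.1]),  # A: both models find it easy
    ([-0.4, -0.4], [-1.2, -1.2]),  # B: the large model rates it far above the small model
]

lgs_pick = max(range(len(candidates)), key=lambda i: lgs_score(candidates[i][0]))
cgs_pick = max(range(len(candidates)), key=lambda i: cgs_score(*candidates[i]))
print(lgs_pick, cgs_pick)  # 0 1
```

LGS keeps chunk A because the large model rates it highest in absolute terms, while CGS keeps chunk B, where the large model's preference diverges most from the small model's.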

In practice

Use CGS for improved mathematical reasoning.
Pair Qwen2.5-7B (generator) with Qwen2.5-72B (scorer) for MATH.
Consider fixed-length chunking for LLM scoring.

Topics

Large Language Models
Mathematical Reasoning
Guided Generation
Training-Free Methods
Contrastive-Guided Selection
Qwen2.5
Llama 3.1

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.