Teaching Reasoning Models When to Stop: Data, Rewards, and Self-Aware Decoding

Source: The Salt - Curated AI · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Advanced, short

Summary

Two recent papers study how to make chain-of-thought (CoT) reasoning in large language models (LLMs) more efficient.

"The Art of Efficient Reasoning" investigates using reinforcement learning (RL) to shorten CoT reasoning without sacrificing accuracy. It identifies two training phases, "length adaptation" and "reasoning refinement," and finds that training on "easy" math problems with a length-aware reward is more effective than training on difficult ones and avoids policy collapse. Applied to Qwen3 models from 0.6B to 30B, this approach roughly halves response length on AIME25 while preserving or improving performance.

"Does Your Reasoning Model Implicitly Know When to Stop Thinking?" introduces TSearch and SAGE, techniques that use a model's internal confidence (prefix log-likelihood) to identify optimal stopping points in reasoning traces. SAGE, a step-wise beam search, improves pass@1 accuracy and substantially reduces token usage across various math benchmarks and large reasoning models (LRMs). It can also be integrated into RL training as SAGE-RL.

Key takeaway

For AI engineers optimizing LLM inference cost and latency, these findings suggest a dual approach: during RL training, use easier datasets and ample rollouts to achieve length compression; at inference, deploy decoding strategies like SAGE that use the model's internal confidence to terminate reasoning early. Focus on stable length adaptation during training and on prefix log-likelihood during inference to avoid unnecessary token generation, improving both efficiency and accuracy on reasoning tasks.
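The summary does not specify the form of the length-aware reward, so the following is only a minimal sketch of one plausible shape: correct answers earn full reward up to a token budget and lose a capped fraction beyond it, while incorrect answers earn nothing. The function name, `target_tokens`, and `penalty_weight` are hypothetical and not taken from the paper.

```python
def length_aware_reward(
    is_correct: bool,
    num_tokens: int,
    target_tokens: int = 2048,    # hypothetical token budget, not from the paper
    penalty_weight: float = 0.5,  # hypothetical cap on the length penalty
) -> float:
    """Illustrative length-aware reward (an assumption, not the paper's formula).

    Incorrect responses get 0 so that brevity is never rewarded over
    correctness. Correct responses get 1 minus a penalty that grows
    linearly with the relative overshoot past the budget, capped at
    `penalty_weight`.
    """
    if not is_correct:
        return 0.0
    overshoot = max(0, num_tokens - target_tokens) / target_tokens
    return 1.0 - penalty_weight * min(1.0, overshoot)


# A correct answer within budget keeps the full reward; one at twice
# the budget keeps half of it under these illustrative settings.
within_budget = length_aware_reward(True, 1024)
double_budget = length_aware_reward(True, 4096)
```

Capping the penalty below 1 keeps every correct response strictly ahead of every incorrect one, which is one simple way to avoid trading accuracy for length.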

Key insights

Efficient CoT reasoning requires balancing length reduction with accuracy, often by leveraging model confidence.

Method

SAGE uses step-wise beam search guided by average prefix log-likelihood to find shorter, higher-accuracy reasoning paths, and can be integrated into RL training.
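A step-wise beam search ranked by average prefix log-likelihood can be sketched as below. This is a toy illustration under stated assumptions, not the SAGE implementation: the `Beam` class, the `propose` callback (standing in for sampling candidate next steps from a model and returning their token log-probabilities), and the toy proposer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Beam:
    steps: Tuple[str, ...] = ()
    logprob_sum: float = 0.0  # sum of token log-probs over the whole prefix
    num_tokens: int = 0
    finished: bool = False

    @property
    def avg_logprob(self) -> float:
        # Average prefix log-likelihood: length-normalized confidence.
        return self.logprob_sum / self.num_tokens if self.num_tokens else 0.0


# Hypothetical interface: given the steps so far, return candidate next steps
# as (text, sum of token log-probs, token count, is_final_answer).
Proposer = Callable[[Tuple[str, ...]], List[Tuple[str, float, int, bool]]]


def stepwise_beam_search(propose: Proposer, beam_width: int = 2, max_steps: int = 8) -> Beam:
    beams: List[Beam] = [Beam()]
    finished: List[Beam] = []
    for _ in range(max_steps):
        candidates = [
            Beam(b.steps + (text,), b.logprob_sum + lp, b.num_tokens + n, final)
            for b in beams
            for text, lp, n, final in propose(b.steps)
        ]
        if not candidates:
            break
        # Keep the prefixes the model is most confident in, per token.
        candidates.sort(key=lambda b: b.avg_logprob, reverse=True)
        top = candidates[:beam_width]
        finished.extend(b for b in top if b.finished)
        beams = [b for b in top if not b.finished]
        if not beams:
            break
    return max(finished or beams, key=lambda b: b.avg_logprob)


def toy_propose(steps: Tuple[str, ...]) -> List[Tuple[str, float, int, bool]]:
    # Toy stand-in for a model: answering right after computing is confident;
    # restating or re-checking is lower-likelihood and costs more tokens.
    if not steps:
        return [("compute 12*3=36", -2.0, 10, False),
                ("restate the problem at length", -9.0, 10, False)]
    if steps[-1].startswith("compute"):
        return [("answer: 36", -1.0, 5, True),
                ("wait, let me double-check", -6.0, 10, False)]
    return [("compute 12*3=36", -4.0, 10, False)]


best = stepwise_beam_search(toy_propose)
```

In this toy setup the search returns the two-step trace that answers immediately after the computation, because its average prefix log-likelihood is higher than that of any trace that keeps re-checking. With a real model, `propose` would sample several candidate steps and score them with the model's own token log-probabilities.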

Best for: AI Engineer, Machine Learning Engineer, NLP Engineer, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.