Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

2026-03-30 · Source: cs.LG updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Researchers from HKUST, ZJU, and Huawei propose "Entropy Centroids" as an intrinsic reward signal for test-time scaling of large language models (LLMs). Test-time scaling, used in systems like Grok Heavy and Gemini Deep Think, samples multiple responses and selects the best one. Existing selection methods often rely on external reward models or on noisy token-level signals. The new approach identifies "High Entropy Phases" (HEPs): variable-length segments of clustered high-entropy tokens, which represent model uncertainty. The Entropy Centroid, analogous to the center of mass in physics, is the weighted average position of these HEPs along the generation trajectory. A lower centroid indicates early exploration followed by confident generation, which correlates with higher response quality. In experiments across models from 14B to 480B parameters on mathematics, code generation, logical reasoning, and agentic tasks, the Lowest Centroid method consistently outperforms baselines, with an average 5.3% improvement over Pass@1.

Key takeaway

For AI engineers optimizing LLM inference, the Lowest Centroid method is a way to improve response quality in test-time scaling, with a reported average gain of 5.3% over Pass@1. It selects the trajectory in which model uncertainty concentrates earliest, and the gains are reported across tasks including code generation and logical reasoning, without the overhead or noise of an external reward model. The reported experiments cover LLMs from 14B to 480B parameters.

Key insights

Temporal patterns of LLM uncertainty, captured by Entropy Centroids, predict response quality without external rewards.

Principles

High-entropy tokens cluster into stable High Entropy Phases (HEPs).
Early uncertainty (low centroid) correlates with higher response quality.
Percentile-based thresholds for HEP detection adapt across models and datasets.
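
To make the HEP principles above concrete, here is a minimal Python sketch of percentile-based phase detection. It is an illustration under stated assumptions, not the authors' implementation: the summary does not give the percentile, the clustering rule, or any parameter values, so `pct`, `max_gap`, and both function names are hypothetical. The input is a list of per-token entropies for one response.

```python
from typing import List, Tuple


def percentile_threshold(values: List[float], pct: float) -> float:
    """Return the pct-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def find_high_entropy_phases(
    entropies: List[float],
    pct: float = 80.0,  # hypothetical percentile cutoff, not from the source
    max_gap: int = 2,   # hypothetical: low-entropy tokens tolerated inside a phase
) -> List[Tuple[int, int]]:
    """Group above-threshold tokens into (start, end) spans, end inclusive."""
    threshold = percentile_threshold(entropies, pct)
    phases: List[Tuple[int, int]] = []
    for i, h in enumerate(entropies):
        if h <= threshold:
            continue  # not a high-entropy token
        if phases and i - phases[-1][1] - 1 <= max_gap:
            phases[-1] = (phases[-1][0], i)  # extend the current phase
        else:
            phases.append((i, i))  # open a new phase
    return phases
```

Because the threshold is a percentile of the response's own entropies rather than a fixed value, the same settings can be reused when the absolute entropy scale differs between models or datasets, which is the adaptivity the third principle describes.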

Method

The method defines High Entropy Phases (HEPs) as variable-length segments of clustered high-entropy tokens. It then computes an "Entropy Centroid" for each generated response, representing the weighted average position of these HEPs along the trajectory.
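
The centroid computation described above can be sketched as a center-of-mass calculation. The weighting scheme here is an assumption: each phase is weighted by its summed token entropy and located at its midpoint, normalized to the range 0 to 1 by response length. The summary only says "weighted average position", so the paper's exact weights may differ. The function name and the `(start, end)` phase format are hypothetical.

```python
from typing import List, Optional, Tuple


def entropy_centroid(
    entropies: List[float],
    phases: List[Tuple[int, int]],
) -> Optional[float]:
    """Entropy-weighted mean position of the phases, in [0, 1].

    phases holds (start, end) token indices, end inclusive.
    Returns None when there is no phase mass to average over
    (an assumption; the source does not cover this case).
    """
    n = len(entropies)
    weighted_sum = 0.0
    total_mass = 0.0
    for start, end in phases:
        mass = sum(entropies[start:end + 1])  # assumed weight: summed entropy
        midpoint = (start + end) / 2.0
        position = midpoint / (n - 1) if n > 1 else 0.0
        weighted_sum += mass * position
        total_mass += mass
    if total_mass == 0.0:
        return None
    return weighted_sum / total_mass
```

Under this sketch, a response whose only phase sits at the start scores near 0 and one whose phase sits at the end scores near 1, matching the reading that a lower centroid means earlier uncertainty.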

In practice

Apply Lowest Centroid for best-of-N selection in LLM inference.
Use intrinsic signals to reduce reliance on external reward models.
Integrate Entropy Centroid as an auxiliary reward in LLM training.
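
As a rough illustration of the first item above, best-of-N selection with Lowest Centroid reduces to picking the candidate with the smallest centroid score. The sketch below assumes each sampled response already has a centroid, or `None` if no high-entropy phase was found. The function name, the input format, and the fallback for unscored candidates are assumptions, not details from the source.

```python
from typing import Optional, Sequence, Tuple


def select_lowest_centroid(
    candidates: Sequence[Tuple[str, Optional[float]]],
) -> str:
    """Return the response text with the lowest entropy centroid.

    candidates holds (response_text, centroid) pairs. Candidates with
    no centroid are skipped; if none has a centroid, the first response
    is returned (an assumed fallback).
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    scored = [(c, text) for text, c in candidates if c is not None]
    if not scored:
        return candidates[0][0]
    # min() returns the first minimum, so ties go to the earliest sample.
    return min(scored, key=lambda pair: pair[0])[1]
```

For example, `select_lowest_centroid([("draft A", 0.7), ("draft B", 0.2)])` returns `"draft B"`, the sample whose uncertainty is concentrated earlier.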

Topics

Entropy Centroid
High Entropy Phase
Test-Time Scaling
LLM Response Selection
Intrinsic Rewards

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.