Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
Summary
Researchers from HKUST, ZJU, and Huawei propose "Entropy Centroids" as an intrinsic reward signal for test-time scaling of large language models (LLMs). Test-time scaling, used in systems such as Grok Heavy and Gemini Deep Think, draws multiple samples and selects the best response among them. Existing selection methods often rely on external reward models or on noisy token-level signals. The new approach instead identifies "High Entropy Phases" (HEPs): variable-length segments of clustered high-entropy tokens that mark where the model is uncertain. The Entropy Centroid, analogous to the center of mass in physics, is the weighted average position of these HEPs along the generation trajectory. A lower centroid indicates early exploration followed by confident generation, and it correlates with higher response quality. In experiments across models from 14B to 480B parameters on mathematics, code generation, logical reasoning, and agentic tasks, the Lowest Centroid method consistently outperforms baselines, with an average 5.3% improvement over Pass@1.
Key takeaway
For AI engineers optimizing LLM inference, the Lowest Centroid method is a way to improve response quality in test-time scaling without the overhead or noise of an external reward model. It selects the trajectory in which model uncertainty concentrates earliest, and the reported gains (an average 5.3% improvement over Pass@1) hold across diverse tasks such as code generation and logical reasoning, for models ranging from 14B to 480B parameters.
Key insights
Temporal patterns of LLM uncertainty, captured by Entropy Centroids, predict response quality without external rewards.
Principles
- High-entropy tokens cluster into stable High Entropy Phases (HEPs).
- Early uncertainty (low centroid) correlates with higher response quality.
- Percentile-based thresholds for HEP detection adapt across models and datasets.
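The principles above can be sketched in code. The summary does not give the paper's exact detection rule, so everything below is an illustrative assumption: the helper names (`token_entropy`, `percentile`, `find_heps`), the 80th-percentile default, and the `max_gap` rule for merging nearby high-entropy tokens into one phase are hypothetical choices, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) of values, using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def find_heps(
    entropies: Sequence[float], q: float = 80.0, max_gap: int = 1
) -> List[Tuple[int, int]]:
    """Group high-entropy tokens into (start, end) spans, both inclusive.

    Assumption: a token is "high entropy" if its entropy is strictly above
    the q-th percentile of this response's own entropies, and two such
    tokens belong to the same phase if at most `max_gap` low-entropy
    tokens separate them.
    """
    threshold = percentile(entropies, q)
    spans: List[Tuple[int, int]] = []
    for i, h in enumerate(entropies):
        if h <= threshold:
            continue
        if spans and i - spans[-1][1] <= max_gap + 1:
            spans[-1] = (spans[-1][0], i)  # extend the current phase
        else:
            spans.append((i, i))  # open a new phase
    return spans
```

Because the threshold is a percentile of each response's own entropy values rather than a fixed number, the same setting can be reused when the absolute entropy scale differs between models or datasets, which is the adaptivity the third principle refers to.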
Method
The method defines High Entropy Phases (HEPs) as variable-length segments of clustered high-entropy tokens. It then computes an "Entropy Centroid" for each generated response, representing the weighted average position of these HEPs along the trajectory.
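As a rough illustration of the centroid computation, the sketch below follows the center-of-mass analogy. The weighting scheme is an assumption, since the summary does not specify it: each HEP is weighted by the sum of its token entropies, its position is taken to be the midpoint of its span, and positions are normalized to [0, 1] by response length. The function name `entropy_centroid` and the choice to return `None` when no HEP is present are likewise hypothetical.

```python
from typing import Optional, Sequence, Tuple


def entropy_centroid(
    entropies: Sequence[float], heps: Sequence[Tuple[int, int]]
) -> Optional[float]:
    """Weighted average position of HEPs along one response.

    entropies: per-token entropy values for the response.
    heps: (start, end) token-index spans, both inclusive.
    Returns a value in [0, 1] (0 = start of response, 1 = end), or None
    if there is no entropy mass to average over.
    """
    n = len(entropies)
    total_mass = 0.0
    weighted_position = 0.0
    for start, end in heps:
        mass = sum(entropies[start : end + 1])  # assumed weight of this HEP
        midpoint = (start + end) / 2.0  # assumed position of this HEP
        position = midpoint / (n - 1) if n > 1 else 0.0
        total_mass += mass
        weighted_position += mass * position
    if total_mass == 0.0:
        return None
    return weighted_position / total_mass
```

Under these assumptions, a response whose only HEP covers its first two tokens scores close to 0, while one whose only HEP covers its last two tokens scores close to 1, so lower values correspond to earlier uncertainty.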
In practice
- Apply Lowest Centroid for best-of-N selection in LLM inference.
- Use intrinsic signals to reduce reliance on external reward models.
- Integrate Entropy Centroid as an auxiliary reward in LLM training.
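For the best-of-N use case, the selection step might look like the minimal sketch below. It assumes each candidate response already has a centroid score (or `None` if no HEP was found); the function name `select_lowest_centroid`, the tie-breaking by sample order, and the fallback to the first sample when no candidate has a score are hypothetical choices rather than details from the paper.

```python
from typing import Optional, Sequence, Tuple


def select_lowest_centroid(
    candidates: Sequence[Tuple[str, Optional[float]]]
) -> str:
    """Pick the response whose entropy centroid is lowest.

    candidates: (response_text, centroid) pairs, one per sampled response.
    Candidates with centroid None are skipped; ties go to the earlier sample.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [
        (centroid, index)
        for index, (_, centroid) in enumerate(candidates)
        if centroid is not None
    ]
    if not scored:
        return candidates[0][0]  # assumed fallback: first sample
    _, best_index = min(scored)
    return candidates[best_index][0]


# Usage: three sampled answers with precomputed centroid scores.
samples = [("answer A", 0.62), ("answer B", 0.18), ("answer C", 0.41)]
chosen = select_lowest_centroid(samples)
```

Since the score comes from token entropies that are already available during generation, this selection needs no extra model call, which is what makes the signal intrinsic.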
Topics
- Entropy Centroid
- High Entropy Phase
- Test-Time Scaling
- LLM Response Selection
- Intrinsic Rewards
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.