Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Summary
Adaptive Entropy Regularization (AER) is a framework for improving the reasoning capabilities of Large Language Models (LLMs) by countering policy entropy collapse during Reinforcement Learning with Verifiable Rewards (RLVR) training. Where traditional entropy regularization relies on a fixed coefficient, AER adjusts exploration intensity dynamically. It has three components: difficulty-aware coefficient allocation, which assigns sample-level entropy coefficients based on task difficulty; initial-anchored target entropy, which sets the target entropy relative to the initial policy entropy; and dynamic global coefficient adjustment, which continuously updates a global scaling factor to keep policy entropy near the target. In experiments on mathematical reasoning benchmarks, including AIME24, AIME25, AMC23, and MATH500, with Qwen3-4B-Base and Qwen3-8B-Base, AER consistently outperforms the baselines: for the 4B model it improves average pass@1 by +7.2% over vanilla GRPO and +1.0% over Clip-Cov, with similar gains for the 8B model, and it also improves exploration capability as measured by pass@k.
Key takeaway
For AI engineers and research scientists applying RLVR to LLM reasoning tasks, AER can improve both model performance and exploration diversity. Consider adopting its dynamic coefficient adjustment and initial-anchored target entropy to stabilize training on complex benchmarks, especially where a fixed entropy regularization coefficient proves unstable or ineffective.
Key insights
Adaptive entropy regularization dynamically balances exploration and exploitation in LLM reinforcement learning, preventing policy entropy collapse.
Principles
- Task difficulty dictates optimal exploration intensity.
- Policy entropy should be maintained within a moderate range.
- Initial entropy varies, requiring adaptive target setting.
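To make "policy entropy" concrete, here is a minimal sketch of how it could be measured from next-token logits. The function names are hypothetical and the code uses only the standard library; a real training loop would compute this on batched tensors inside the RL framework.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(per_token_logits):
    """Average entropy over token positions (hypothetical helper)."""
    return sum(token_entropy(l) for l in per_token_logits) / len(per_token_logits)

# A uniform distribution has maximal entropy; a sharply peaked one is near zero,
# which is the regime described as entropy collapse.
uniform = token_entropy([0.0, 0.0, 0.0, 0.0])
peaked = token_entropy([50.0, 0.0, 0.0, 0.0])
```

Tracking the mean of this quantity at the start of training and at every step gives the two numbers that an initial-anchored target needs.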
Method
AER dynamically adjusts entropy coefficients through difficulty-aware allocation, an initial-anchored target entropy, and a dynamic global scaling factor to maintain balanced exploration.
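The initial-anchored target and the global scaling factor can be sketched as a simple feedback controller. This is an illustrative sketch, not the paper's update rule: the class name, the fixed-step update, the clipping bounds, and all default values (`target_ratio`, `coef`, `step`) are assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class EntropyCoefController:
    """Hypothetical controller that keeps policy entropy near a target."""
    initial_entropy: float       # entropy of the policy before RL training
    target_ratio: float = 0.4    # assumed: target as a fraction of initial entropy
    coef: float = 0.005          # assumed starting global coefficient
    step: float = 0.001          # assumed fixed adjustment step
    coef_min: float = 0.0
    coef_max: float = 0.05

    @property
    def target_entropy(self) -> float:
        # Anchor the target to the initial policy entropy instead of a constant.
        return self.target_ratio * self.initial_entropy

    def update(self, current_entropy: float) -> float:
        # Raise the coefficient when entropy falls below the target,
        # lower it when entropy rises above, then clip to the allowed range.
        if current_entropy < self.target_entropy:
            self.coef += self.step
        elif current_entropy > self.target_entropy:
            self.coef -= self.step
        self.coef = min(self.coef_max, max(self.coef_min, self.coef))
        return self.coef
```

Called once per training step with the measured policy entropy, `update` returns the global coefficient to apply to the entropy bonus for that step.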
In practice
- Use AER to prevent policy entropy collapse in RLVR.
- Calibrate target entropy based on initial policy entropy.
- Allocate more exploration to difficult reasoning tasks.
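One way to allocate more exploration to harder tasks is to scale a global coefficient by a per-prompt difficulty estimate. The sketch below assumes difficulty is one minus the fraction of correct rollouts in a prompt's sampled group, and that the coefficient scales linearly with it; both choices, and the function name, are assumptions for illustration rather than the paper's stated formula.

```python
def difficulty_aware_coefs(group_rewards, global_coef):
    """Return one entropy coefficient per prompt.

    group_rewards: list of lists of 0/1 verifiable rewards, one inner list
                   per prompt (its group of sampled rollouts).
    global_coef:   global scaling factor shared by all prompts.
    """
    coefs = []
    for rewards in group_rewards:
        accuracy = sum(rewards) / len(rewards)
        difficulty = 1.0 - accuracy           # assumed difficulty proxy
        coefs.append(global_coef * difficulty)
    return coefs

# Three prompts: always solved, rarely solved, never solved.
coefs = difficulty_aware_coefs(
    [[1, 1, 1, 1], [0, 0, 1, 0], [0, 0, 0, 0]],
    global_coef=0.01,
)
```

Under these assumptions a prompt the model always solves receives no entropy bonus, while an unsolved prompt receives the full global coefficient.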
Topics
- Reinforcement Learning with Verifiable Rewards
- Policy Entropy Collapse
- Adaptive Entropy Regularization
- Large Language Models
- Mathematical Reasoning
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.