ICRL: Learning to Internalize Self-Critique with Reinforcement Learning
Summary
ICRL (learning to Internalize self-Critique with Reinforcement Learning) targets a common failure in critique-guided training of large language model (LLM) agents: the agent improves only while a critique is in its context and does not internalize the feedback. ICRL jointly trains a solver and a critic from a shared backbone and rewards the critic when its feedback is actionable, that is, when it improves the solver's subsequent performance. A distribution-calibration re-weighting ratio accounts for the distribution shift between critique-conditioned and critique-free behavior, so that the solver learns to improve without external critique. A role-wise group advantage estimation stabilizes the joint optimization. Evaluated with Qwen3-4B and Qwen3-8B backbones on agentic and mathematical reasoning tasks, ICRL improved consistently, with average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. The learned 8B critic also performed comparably to 32B critics while using significantly fewer tokens.
Key takeaway
For research scientists developing self-improving LLM agents, ICRL offers a way to overcome critique dependency. Consider its distribution-calibration re-weighting and role-wise advantage estimation so that your models internalize feedback and still perform well without explicit critique at inference time.
Key insights
Jointly training a solver and critic with reinforcement learning enables LLMs to internalize self-critique and improve unassisted.
Principles
- Critique utility should be a direct learning signal.
- Address distribution shift in critique-guided training.
- Normalize solver and critic rewards separately.
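The first principle can be illustrated with a small sketch. The summary only says the critic is rewarded for feedback that improves the solver's subsequent performance. The exact reward shape below (mean score after the critique minus the score before it) and the name critique_utility are assumptions made for illustration, not the paper's definition.

```python
from statistics import mean
from typing import Sequence


def critique_utility(score_before: float, scores_after: Sequence[float]) -> float:
    """Hypothetical critic reward: how much the solver improves after reading the critique.

    score_before: task score of the solver's first attempt (no critique).
    scores_after: task scores of solver retries conditioned on the critique.
    """
    if not scores_after:
        raise ValueError("need at least one critique-conditioned retry")
    # Positive when the critique helped, negative when it hurt.
    return mean(scores_after) - score_before


# A critique that turns a failed attempt into three successes out of four retries.
helpful = critique_utility(0.0, [1.0, 1.0, 0.0, 1.0])
# A critique that misleads a solver that was already correct.
harmful = critique_utility(1.0, [0.0, 0.0])
```

Under this assumed definition, a critique that sounds plausible but does not change the solver's outcome earns a reward of zero, which is the sense in which utility acts as a direct learning signal.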
Method
ICRL jointly trains a solver and critic using a shared backbone, employing a distribution-calibration re-weighting ratio for critique internalization and role-wise group advantage estimation for stable joint optimization.
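A minimal sketch of what role-wise group advantage estimation might look like, assuming a GRPO-style normalization (reward minus group mean, divided by group standard deviation) applied separately per role. The tuple layout, the function name role_wise_advantages, and the eps constant are illustrative assumptions. The summary does not specify these details.

```python
from collections import defaultdict
from statistics import mean, pstdev


def role_wise_advantages(samples, eps=1e-6):
    """Normalize rewards within each (role, group) bucket.

    samples: list of (role, group_id, reward) tuples, e.g. role in {"solver", "critic"}.
    Returns one advantage per sample, in input order.
    """
    buckets = defaultdict(list)
    for role, group_id, reward in samples:
        buckets[(role, group_id)].append(reward)

    # Statistics are computed per role, so solver and critic rewards,
    # which may live on different scales, never share a mean or std.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}

    advantages = []
    for role, group_id, reward in samples:
        mu, sigma = stats[(role, group_id)]
        advantages.append((reward - mu) / (sigma + eps))
    return advantages


rollouts = [
    ("solver", "q1", 1.0),
    ("solver", "q1", 0.0),
    ("critic", "q1", 0.8),
    ("critic", "q1", 0.2),
]
adv = role_wise_advantages(rollouts)
```

Pooling all four rewards into one group would let the critic's reward scale shift the solver's baseline. Keeping the buckets separate avoids that coupling.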
In practice
- Consider ICRL for agentic and mathematical reasoning tasks, the two settings it was evaluated on.
- Implement token-level re-weighting for critique transfer.
- Apply role-wise advantage estimation for multi-role RL.
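The token-level re-weighting item can be sketched as follows. The summary does not give the formula for the distribution-calibration ratio, so this example assumes an importance-sampling style weight: the probability of each token under the critique-free context divided by its probability under the critique-conditioned context, capped at a maximum. The function names, the cap max_weight, and the REINFORCE-style surrogate loss are all hypothetical.

```python
import math
from typing import List, Sequence


def calibration_weights(
    logp_free: Sequence[float],
    logp_cond: Sequence[float],
    max_weight: float = 2.0,
) -> List[float]:
    """Assumed per-token ratio p(token | prompt) / p(token | prompt, critique), capped."""
    if len(logp_free) != len(logp_cond):
        raise ValueError("log-prob sequences must align token by token")
    return [min(math.exp(f - c), max_weight) for f, c in zip(logp_free, logp_cond)]


def reweighted_loss(
    logp_free: Sequence[float],
    logp_cond: Sequence[float],
    advantage: float,
    max_weight: float = 2.0,
) -> float:
    """Surrogate loss on critique-free log-probs, weighted per token.

    In a real training loop the weights would be treated as constants
    (no gradient flows through them); here we only compute the value.
    """
    weights = calibration_weights(logp_free, logp_cond, max_weight)
    if not weights:
        raise ValueError("empty sequence")
    terms = [-w * advantage * lp for w, lp in zip(weights, logp_free)]
    return sum(terms) / len(terms)


# Token 1: equally likely with and without the critique -> weight 1.
# Token 2: only likely when the critique is present -> small weight.
# Token 3: far more likely without the critique -> capped weight.
free = [math.log(0.5), math.log(0.1), math.log(0.9)]
cond = [math.log(0.5), math.log(0.8), math.log(0.1)]
weights = calibration_weights(free, cond)
```

Under this assumption, tokens the solver could only have produced with the critique in context contribute less to the critique-free update, which is one plausible way to limit the distribution shift the summary describes.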
Topics
- Reinforcement Learning
- Large Language Models
- Self-Critique Internalization
- Solver-Critic Training
- Distribution Calibration
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.