ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

2026-05-18 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ICRL (learning to Internalize self-Critique with Reinforcement Learning) targets a failure mode of large language model (LLM) agents: rather than internalizing critique, they become dependent on critique-conditioned behavior. ICRL jointly trains a solver and a critic from a shared backbone, rewarding the critic for actionable feedback that improves the solver's subsequent performance. A distribution-calibration re-weighting ratio handles the distribution shift between critique-conditioned and critique-free behavior, so that the solver learns to improve without external critique, and role-wise group advantage estimation stabilizes the joint optimization. Evaluated on Qwen3-4B and Qwen3-8B backbones across agentic and mathematical reasoning tasks, ICRL showed consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks and 7.0 points on mathematical reasoning. The learned 8B critic also performed comparably to 32B critics while using significantly fewer tokens.

Key takeaway

For research scientists developing self-improving LLM agents, ICRL offers a way to reduce critique dependency. Consider implementing its distribution-calibration re-weighting and role-wise advantage estimation so that the solver internalizes feedback and performs well without explicit critique at inference time.

Key insights

Jointly training a solver and critic with reinforcement learning enables LLMs to internalize self-critique and improve unassisted.

Principles

Critique utility should be a direct learning signal.
Address distribution shift in critique-guided training.
Normalize solver and critic rewards separately.
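
As an illustration of the first principle, a critic reward can be tied directly to how much a critique helps the solver. The sketch below is an assumption, not the paper's formula: the summary only says the critic is rewarded for feedback that improves the solver's subsequent performance. The function name critic_utility_reward and the "mean score after minus mean score before" definition are hypothetical.

```python
from statistics import mean


def critic_utility_reward(scores_before, scores_after):
    """Hypothetical utility reward for a single critique.

    scores_before: task scores of solver attempts made without the critique.
    scores_after:  task scores of solver attempts conditioned on the critique.
    Returns the change in mean score, so a critique that does not help
    the solver earns zero (or negative) reward.
    """
    if not scores_before or not scores_after:
        raise ValueError("both score lists must be non-empty")
    return mean(scores_after) - mean(scores_before)


# Example: the solver succeeds on 1 of 2 attempts before the critique
# and on 2 of 2 attempts after it.
reward = critic_utility_reward([0.0, 1.0], [1.0, 1.0])
```

Under this definition a fluent but unhelpful critique receives no credit, which is the sense in which utility acts as the learning signal.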

Method

ICRL jointly trains a solver and critic using a shared backbone, employing a distribution-calibration re-weighting ratio for critique internalization and role-wise group advantage estimation for stable joint optimization.
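
The summary does not give the form of the re-weighting ratio. One plausible reading, shown here purely as an assumption, is a per-token importance ratio between the critique-free policy and the critique-conditioned policy that generated the sample, clipped for stability. The function name calibration_weights and the clip bounds are hypothetical.

```python
import math


def calibration_weights(logp_free, logp_cond, clip=(0.2, 5.0)):
    """Per-token ratio pi(y_t | x) / pi(y_t | x, critique), clipped.

    logp_free: log-probabilities of the sampled tokens WITHOUT the critique
               in the prompt (the behavior we want to improve).
    logp_cond: log-probabilities of the same tokens WITH the critique
               in the prompt (the behavior that produced the sample).
    The ratio form and the clip range are assumptions for illustration.
    """
    if len(logp_free) != len(logp_cond):
        raise ValueError("log-prob sequences must have the same length")
    lo, hi = clip
    return [
        min(hi, max(lo, math.exp(f - c)))
        for f, c in zip(logp_free, logp_cond)
    ]


# A token the critique-free policy finds very unlikely is down-weighted
# (clipped to the lower bound); an equally likely token keeps weight 1.
weights = calibration_weights(
    [math.log(0.5), math.log(0.01)],
    [math.log(0.5), math.log(0.9)],
)
```

In a training loop these weights would typically be treated as constants and multiplied into the per-token policy-gradient term, so that tokens only reachable with the critique in context contribute less to the critique-free update.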

In practice

Use ICRL for agentic and mathematical reasoning tasks.
Implement token-level re-weighting for critique transfer.
Apply role-wise advantage estimation for multi-role RL.
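
Role-wise advantage estimation can be sketched as GRPO-style group normalization applied separately per role, so that differently scaled solver and critic rewards do not distort each other's baseline. This is a minimal sketch under that assumption; the function name role_wise_advantages and the (role, reward) sample format are hypothetical.

```python
from statistics import mean, pstdev


def role_wise_advantages(samples, eps=1e-6):
    """Normalize rewards within each role's own group.

    samples: list of (role, reward) pairs from one prompt group,
             e.g. role is "solver" or "critic".
    Returns advantages in the same order as the input, each computed as
    (reward - role mean) / (role std + eps).
    """
    by_role = {}
    for role, reward in samples:
        by_role.setdefault(role, []).append(reward)
    stats = {role: (mean(rs), pstdev(rs)) for role, rs in by_role.items()}
    return [
        (reward - stats[role][0]) / (stats[role][1] + eps)
        for role, reward in samples
    ]


# Solver rewards lie in [0, 1] while critic rewards use a larger scale;
# after per-role normalization both roles get comparable advantages.
advantages = role_wise_advantages(
    [("solver", 0.0), ("solver", 1.0), ("critic", 10.0), ("critic", 20.0)]
)
```

Pooling all four rewards into one group would instead push both solver samples far below the shared mean, which is the instability that separate normalization is meant to avoid.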

Topics

Reinforcement Learning
Large Language Models
Self-Critique Internalization
Solver-Critic Training
Distribution Calibration

Code references

brick-pid/ICRL

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.