Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
Summary
Curiosity-Critic is an intrinsic reward framework for training world models in reinforcement learning (RL) that addresses the limitations of prior prediction-error-based curiosity methods. Where earlier approaches reward per-step prediction error, Curiosity-Critic grounds its reward in the improvement of the world model's cumulative prediction error across all visited transitions. This seemingly intractable objective is shown to reduce to a tractable per-step form: the difference between the current prediction error and an asymptotic error baseline for the current state transition. The baseline represents the irreducible noise floor and is estimated online by a co-trained neural critic network. In experiments on a stochastic 2D grid world, Curiosity-Critic converges faster and reaches higher final world model accuracy than the Curiosity V1, Curiosity V2, Visitation Count, and Random exploration baselines, and it separates learnable from unlearnable transitions.
Key takeaway
For Machine Learning Engineers developing model-based RL systems, Curiosity-Critic offers an approach to intrinsic motivation that holds up under environment noise. By learning and subtracting an asymptotic error baseline, your agents can distinguish reducible (learnable) prediction error from irreducible (stochastic) prediction error, which keeps them from fixating on noisy transitions. The method promises faster world model convergence and improved accuracy, making exploration policies more effective in complex, stochastic settings.
Key insights
Curiosity-Critic improves world model training by rewarding cumulative prediction error reduction, separating learnable from unlearnable states.
Principles
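- Rewarding improvement in cumulative prediction error gives a better curiosity signal than rewarding raw per-step error.
- Distinguish epistemic (reducible) from aleatoric (irreducible) prediction error.
- The asymptotic error baseline defines the irreducible noise floor for each transition.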
Method
A neural critic co-trained with the world model estimates the asymptotic error baseline, which is subtracted from the current prediction error to form a robust intrinsic reward signal.
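The per-step reward described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the prediction error is taken to be an L2 norm, and the baseline is passed in as a plain number standing in for the output of the co-trained critic, whose training procedure is not covered here.

```python
import numpy as np

def prediction_error(pred_next_state, true_next_state):
    """L2 norm of the world model's one-step prediction residual."""
    diff = np.asarray(pred_next_state, dtype=float) - np.asarray(true_next_state, dtype=float)
    return float(np.linalg.norm(diff))

def intrinsic_reward(pred_next_state, true_next_state, baseline_error):
    """Current prediction error minus the estimated asymptotic (irreducible) error.

    baseline_error is an assumed stand-in for the critic's output on this transition.
    """
    return prediction_error(pred_next_state, true_next_state) - float(baseline_error)

# Same current error (5.0) on two transitions with different noise floors.
# Hypothetical values chosen only to illustrate the subtraction.
r_learnable = intrinsic_reward([0.0, 0.0], [3.0, 4.0], baseline_error=0.5)
r_noisy = intrinsic_reward([0.0, 0.0], [3.0, 4.0], baseline_error=5.0)

print(r_learnable, r_noisy)
```

With the same current error, the transition whose baseline is low yields a large reward (the error is mostly reducible), while the transition whose baseline matches the current error yields no reward, so the agent has no incentive to keep revisiting it.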
In practice
- Co-train a small MLP critic for noise floor estimation.
- Use L2 norm for prediction error in state-space models.
- Apply reward normalization to stabilize training.
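The summary does not say which normalization scheme is used, so the following is only one plausible sketch of the reward normalization item: dividing each intrinsic reward by a running standard deviation maintained with Welford's online algorithm. The class name and the `eps` parameter are hypothetical.

```python
import math

class RunningRewardNormalizer:
    """Tracks running mean/variance of rewards (Welford) and rescales by the std."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, reward):
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are at least two samples.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)  # population standard deviation

    def normalize(self, reward):
        # Scale only; the mean is not subtracted, so the reward's sign is preserved.
        return reward / (self.std + self.eps)

normalizer = RunningRewardNormalizer()
for r in [1.0, 2.0, 3.0, 4.0]:
    normalizer.update(r)

scaled = normalizer.normalize(2.0)
```

Scaling without centering is a deliberate choice in this sketch: because the reward is already a difference against a baseline, its sign carries meaning (error above or below the noise floor), and subtracting a running mean would shift that zero point.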
Topics
- Curiosity-Critic
- World Models
- Cumulative Prediction Error
- Asymptotic Error Baseline
- Neural Critic
Best for: AI Scientist, Research Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.