Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

Curiosity-Critic is an intrinsic reward framework for training world models in reinforcement learning (RL) that addresses the limitations of prior prediction-error-based curiosity methods. Where those methods reward per-step prediction error, Curiosity-Critic grounds its reward in the improvement of the world model's cumulative prediction error across all visited transitions. This seemingly intractable objective is shown to reduce to a tractable per-step form: the difference between the current prediction error and an asymptotic error baseline for the current state transition. The baseline, which represents the irreducible noise floor, is estimated online by a co-trained neural critic network. In experiments on a stochastic 2D grid world, Curiosity-Critic converges faster and reaches higher final world model accuracy than the Curiosity V1, Curiosity V2, Visitation Count, and Random exploration baselines, and it distinguishes learnable from unlearnable transitions.
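
In symbols, the per-step form described in the summary can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper: \tau_t denotes the transition visited at step t, e_t(\tau_t) the world model's current prediction error on it, and \hat{e}_\infty(\tau_t) the critic's estimate of the asymptotic (irreducible) error for that transition.

```latex
r_t \;=\; e_t(\tau_t) \;-\; \hat{e}_\infty(\tau_t)
```

Under this reading, the reward approaches zero once the model's error on a transition reaches its noise floor, whether that floor is zero (a fully learnable transition) or positive (a stochastic one).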

Key takeaway

For machine learning engineers developing model-based RL systems, Curiosity-Critic offers an approach to intrinsic motivation that is robust to environment noise. By learning and subtracting an asymptotic error baseline, your agents can distinguish reducible (learnable) from irreducible (stochastic) prediction error, which keeps them from fixating on noisy transitions. In the reported experiments on a stochastic 2D grid world, this yields faster world model convergence and higher final accuracy than the compared exploration baselines.

Key insights

Curiosity-Critic improves world model training by rewarding reduction in cumulative prediction error, which separates learnable from unlearnable transitions.

Principles

Method

A neural critic co-trained with the world model estimates the asymptotic error baseline, which is subtracted from the current prediction error to form a robust intrinsic reward signal.
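
A minimal sketch of that subtraction, with made-up numbers, may make the effect concrete. The function name `intrinsic_reward`, the error values, and the baseline values are all hypothetical; in particular, the baseline is simply supplied as an array here, standing in for whatever the co-trained critic would output, and nothing below reflects how the paper trains the critic.

```python
import numpy as np


def intrinsic_reward(prediction_error, baseline_estimate):
    """Current prediction error minus the estimated asymptotic error."""
    error = np.asarray(prediction_error, dtype=float)
    baseline = np.asarray(baseline_estimate, dtype=float)
    return error - baseline


# Hypothetical prediction errors for two transitions at three training stages.
# Row 0: a learnable transition whose error decays to zero.
# Row 1: a noisy transition whose error settles at a floor of 0.5.
errors = np.array([
    [1.0, 0.5, 0.0],
    [1.0, 0.75, 0.5],
])

# Assumed critic outputs: one asymptotic-error estimate per transition.
# Shape (2, 1) so that it broadcasts across the three training stages.
baseline = np.array([[0.0], [0.5]])

rewards = intrinsic_reward(errors, baseline)
print(rewards)
```

In the last column the noisy transition still has a raw error of 0.5, so a reward based on per-step error alone would keep paying the agent to revisit it. With the baseline subtracted, both transitions end at a reward of zero, which is the behavior the method description attributes to the critic.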

Best for: AI Scientist, Research Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.