CATPO: Critique-Augmented Tree Policy Optimization
Summary
CATPO (Critique-Augmented Tree Policy Optimization) is a method for improving the reasoning of large language models (LLMs) trained with reinforcement learning with verifiable rewards (RLVR). It targets the computational waste in existing tree-based methods such as TreeRPO. CATPO introduces a tree informativeness score, F(T), which combines leaf-outcome diversity with policy-reward decorrelation to identify valuable training trees without extra compute. For "dead-wrong" trees, it applies critique-guided healing: it generates natural-language critiques and grafts refined continuations onto the tree to recover training signal. An informativeness-weighted loss then scales each tree's gradient contribution. In experiments with Qwen2.5-Math-1.5B trained on the MATH dataset, CATPO reaches 37.5% macro accuracy across AIME24, MATH-500, OlympiadBench, and MinervaMath, improving over TreeRPO by 1.9% and over GRPO by 4.8%.
Key takeaway
For machine learning engineers training LLMs for reasoning, especially in mathematical domains, CATPO is a way to extract more training signal from the same tree rollouts. Consider adding its tree informativeness scoring and critique-guided healing to a tree-based RL pipeline so that compute is concentrated on the most useful training trees. The reported macro-accuracy gains are 1.9% over TreeRPO and 4.8% over GRPO.
Key insights
CATPO optimizes LLM reasoning by focusing RL training on informative trees and healing failures with critiques.
Principles
- Tree informativeness can be quantified by leaf-outcome diversity and policy-reward decorrelation.
- Critique-guided healing recovers training signal from failed branches.
- Weighting gradient contributions by tree informativeness improves efficiency.
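The summary does not give the formula for F(T), so the following is only a minimal sketch of how the first principle could be turned into a number. Everything specific here is an assumption made for illustration: diversity is measured as the binary entropy of leaf correctness, decorrelation as 1 - |Pearson r| between leaf log-probabilities and leaf rewards, and the two are mixed with a hypothetical weight `alpha`.

```python
import math
from statistics import mean, pstdev


def leaf_diversity(rewards):
    """Binary entropy (bits) of the fraction of correct leaves.

    Assumption: rewards are 0/1. Returns 0 when all leaves agree.
    """
    p = mean(rewards)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def decorrelation(logprobs, rewards):
    """1 - |Pearson r| between leaf log-probabilities and leaf rewards.

    Assumption: when either series is constant, the correlation is
    undefined and the tree is treated as carrying no signal (0.0).
    """
    sx, sy = pstdev(logprobs), pstdev(rewards)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(logprobs), mean(rewards)
    cov = mean([(x - mx) * (y - my) for x, y in zip(logprobs, rewards)])
    return 1.0 - abs(cov / (sx * sy))


def informativeness(logprobs, rewards, alpha=0.5):
    """Hypothetical F(T): convex mix of diversity and decorrelation."""
    return alpha * leaf_diversity(rewards) + (1 - alpha) * decorrelation(logprobs, rewards)
```

Under these assumptions a tree whose leaves are all wrong scores zero, which is the case the critique-guided healing step is meant to handle. Both inputs are already available after a rollout, which is consistent with the claim that scoring needs no extra compute.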
Method
Score trees with F(T) for informativeness, apply critique-guided healing to dead-wrong trees, and use an informativeness-weighted loss for gradient updates.
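As a rough illustration of the last step, the sketch below combines per-tree losses using their informativeness scores as weights. The normalization by the sum of scores is an assumption, not something the summary specifies, and plain floats stand in for the tensors a real training loop would use.

```python
def informativeness_weighted_loss(tree_losses, scores, eps=1e-8):
    """Weighted average of per-tree losses.

    Assumptions (not from the source): weights are the F(T) scores
    normalized to sum to 1, and a batch with no informative trees
    contributes zero loss.
    """
    if len(tree_losses) != len(scores):
        raise ValueError("need one score per tree loss")
    total = sum(scores)
    if total <= eps:
        return 0.0
    return sum(s * l for s, l in zip(scores, tree_losses)) / total
```

With this weighting, a tree scored three times higher than another pulls the batch loss three times as hard, so uninformative trees contribute little to the gradient.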
In practice
- Implement F(T) to identify valuable training trees.
- Generate natural-language critiques for shallow failure points.
- Graft refined continuations onto failed branches.
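The healing steps listed above can be sketched on a toy tree structure. This is an illustration under stated assumptions, not the authors' implementation: `Node`, `heal`, `critique_fn`, and `refine_fn` are hypothetical names, the two callables stand in for LLM calls, and the graft point is assumed to be the last sound step before the shallow failure point.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    text: str                        # one reasoning step
    correct: Optional[bool] = None   # verifier outcome, set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[Node]:
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def is_dead_wrong(root: Node) -> bool:
    """A tree is 'dead-wrong' when every leaf failed verification."""
    return all(leaf.correct is False for leaf in leaves(root))


def heal(
    root: Node,
    graft_path: List[int],
    critique_fn: Callable[[List[str]], str],
    refine_fn: Callable[[List[str], str], Node],
) -> str:
    """Attach a critique-guided continuation under the node at graft_path.

    graft_path holds child indices from the root to the graft point.
    """
    node, prefix = root, [root.text]
    for i in graft_path:
        node = node.children[i]
        prefix.append(node.text)
    critique = critique_fn(prefix)            # natural-language critique
    node.children.append(refine_fn(prefix, critique))  # graft new branch
    return critique
```

A usage example with stub callables in place of model calls:

```python
root = Node("problem", children=[Node("step A", correct=False),
                                 Node("step B", correct=False)])
if is_dead_wrong(root):
    heal(root, [],
         critique_fn=lambda prefix: "sign error in the first step",
         refine_fn=lambda prefix, critique: Node("step C", correct=True))
```

After the graft the tree has at least one correct leaf, so it is no longer dead-wrong and can contribute a learning signal.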
Topics
- Large Language Models
- Reinforcement Learning
- Policy Optimization
- Tree-based Methods
- Critique-guided Learning
- Mathematical Reasoning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.