CATPO: Critique-Augmented Tree Policy Optimization

2026-06-06 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

CATPO (Critique-Augmented Tree Policy Optimization) is a method for strengthening the reasoning of large language models (LLMs) by improving reinforcement learning with verifiable rewards (RLVR). It targets the computational waste in existing tree-based methods such as TreeRPO. CATPO introduces a tree informativeness score, F(T), which combines leaf-outcome diversity with policy-reward decorrelation to identify valuable training trees without extra compute. For "dead-wrong" trees, it applies critique-guided healing: it generates natural-language critiques and grafts refined continuations onto the tree to recover training signal. An informativeness-weighted loss then scales each tree's gradient contribution. In experiments with Qwen2.5-Math-1.5B trained on the MATH dataset, CATPO reaches 37.5% macro accuracy across AIME24, MATH-500, OlympiadBench, and MinervaMath, an improvement of 1.9% over TreeRPO and 4.8% over GRPO.

Key takeaway

For machine learning engineers building LLM reasoning capabilities, especially in mathematical domains, CATPO offers a way to improve both accuracy and training efficiency. Consider integrating its tree informativeness scoring and critique-guided healing to strengthen your model's learning signal and concentrate compute on the most informative training trees. In the reported experiments, this yielded gains of 1.9% over TreeRPO and 4.8% over GRPO in macro accuracy.

Key insights

CATPO optimizes LLM reasoning by focusing RL training on informative trees and healing failures with critiques.

Principles

Tree informativeness can be quantified by leaf-outcome diversity and policy-reward decorrelation.
Critique-guided healing recovers training signal from failed branches.
Weighting gradient contributions by tree informativeness improves efficiency.
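
The first principle can be made concrete with a small sketch. The summary does not give the formula for F(T), so everything below is an assumption chosen for illustration: diversity is taken as the binary entropy of the fraction of correct leaves, decorrelation as one minus the absolute Pearson correlation between leaf log-probabilities and rewards, and the two are multiplied. The function names are hypothetical.

```python
import math
from typing import Sequence


def leaf_diversity(rewards: Sequence[float]) -> float:
    """Binary entropy (bits) of the fraction of correct leaves; 0 when all leaves agree."""
    p = sum(1 for r in rewards if r > 0.5) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def policy_reward_decorrelation(logprobs: Sequence[float], rewards: Sequence[float]) -> float:
    """1 - |Pearson r| between leaf log-probabilities and leaf rewards."""
    n = len(rewards)
    mean_lp = sum(logprobs) / n
    mean_r = sum(rewards) / n
    cov = sum((lp - mean_lp) * (r - mean_r) for lp, r in zip(logprobs, rewards))
    var_lp = sum((lp - mean_lp) ** 2 for lp in logprobs)
    var_r = sum((r - mean_r) ** 2 for r in rewards)
    if var_lp == 0.0 or var_r == 0.0:
        # Correlation is undefined; treat it as zero correlation (assumption).
        return 1.0
    return 1.0 - abs(cov / math.sqrt(var_lp * var_r))


def tree_informativeness(logprobs: Sequence[float], rewards: Sequence[float]) -> float:
    """Hypothetical F(T): product of leaf diversity and policy-reward decorrelation."""
    if not rewards:
        return 0.0
    return leaf_diversity(rewards) * policy_reward_decorrelation(logprobs, rewards)
```

Under these assumptions, a tree whose leaves are all wrong (or all right) scores 0, as does a tree where the policy's log-probabilities already track the rewards perfectly. Both inputs are already produced by sampling and verification, which is consistent with the claim that scoring needs no extra compute.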

Method

Score trees with F(T) for informativeness, apply critique-guided healing to dead-wrong trees, and use an informativeness-weighted loss for gradient updates.
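
The last step, the informativeness-weighted loss, might look like the sketch below. The summary does not say how scores become weights, so normalizing the scores to sum to 1 and falling back to uniform weights when every score is zero are assumptions, as are the function names.

```python
from typing import List, Sequence


def informativeness_weights(scores: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize non-negative F(T) scores so that the weights sum to 1."""
    total = sum(scores)
    if total <= eps:
        # No tree carries signal: fall back to uniform weights (assumption).
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def weighted_batch_loss(tree_losses: Sequence[float], scores: Sequence[float]) -> float:
    """Combine per-tree losses, scaling each one by its tree's normalized score."""
    weights = informativeness_weights(scores)
    return sum(w * loss for w, loss in zip(weights, tree_losses))
```

For example, scores of 1.0, 0.0, and 3.0 give weights of 0.25, 0.0, and 0.75, so the second tree contributes nothing to the update. In an actual training loop the weights would presumably be treated as constants, so that gradients flow only through the per-tree losses.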

In practice

Implement F(T) to identify valuable training trees.
Generate natural-language critiques for shallow failure points.
Graft refined continuations to failed branches.
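
The critique and grafting steps above can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: the `Node` structure is hypothetical, the depth-1 branches of a dead-wrong tree are taken as the "shallow failure points", and `critique_fn`, `refine_fn`, and `verify_fn` stand in for calls to a critic model, the policy, and the answer verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    text: str                        # prompt (root) or reasoning continuation
    reward: Optional[float] = None   # verifier reward, set on leaves
    children: List["Node"] = field(default_factory=list)
    healed: bool = False             # True for grafted continuations


def leaves(node: Node) -> List[Node]:
    """Collect all leaf nodes below (and including) `node`."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def is_dead_wrong(root: Node) -> bool:
    """A tree is dead-wrong when no leaf received a positive reward."""
    return all((leaf.reward or 0.0) == 0.0 for leaf in leaves(root))


def heal_dead_wrong_tree(
    root: Node,
    critique_fn: Callable[[str, str], str],
    refine_fn: Callable[[str, str, str], str],
    verify_fn: Callable[[str, str], float],
) -> int:
    """Graft one refined continuation per shallow branch; return how many were added."""
    if not is_dead_wrong(root):
        return 0
    grafts = []
    for branch in root.children:
        critique = critique_fn(root.text, branch.text)        # what went wrong
        refined = refine_fn(root.text, branch.text, critique)  # retry using the critique
        grafts.append(Node(text=refined, reward=verify_fn(root.text, refined), healed=True))
    root.children.extend(grafts)
    return len(grafts)
```

Marking grafted nodes with `healed` keeps them distinguishable from on-policy samples, which may matter when computing advantages or importance weights. If any refined continuation is verified correct, the tree is no longer dead-wrong and can contribute a learning signal.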

Topics

Large Language Models
Reinforcement Learning
Policy Optimization
Tree-based Methods
Critique-guided Learning
Mathematical Reasoning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.