CATPO: Critique-Augmented Tree Policy Optimization

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

CATPO (Critique-Augmented Tree Policy Optimization) is a method for improving the reasoning of large language models (LLMs) trained with reinforcement learning with verifiable rewards (RLVR). It targets the computational waste in existing tree-based methods such as TreeRPO. CATPO introduces a tree informativeness score, F(T), which combines leaf-outcome diversity with policy-reward decorrelation to identify valuable training trees without extra compute. For "dead-wrong" trees, it applies critique-guided healing: it generates natural-language critiques and grafts refined continuations to recover training signal. An informativeness-weighted loss then scales each tree's gradient contribution. Experiments on Qwen2.5-Math-1.5B with the MATH dataset show CATPO achieves 37.5% macro accuracy across AIME24, MATH-500, OlympiadBench, and MinervaMath, improving over TreeRPO by 1.9% and GRPO by 4.8%.
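To make the idea of a tree informativeness score concrete, here is a minimal Python sketch. The summary does not give the formula for F(T), so everything below is an assumption for illustration: binary leaf rewards, diversity measured as rescaled Bernoulli variance, decorrelation measured as one minus the absolute Pearson correlation between leaf log-probabilities and leaf rewards, and a hypothetical mixing weight `alpha`. It only uses quantities a rollout already produces, which is one plausible reading of "without extra compute".

```python
import math
from typing import Sequence


def leaf_diversity(rewards: Sequence[float]) -> float:
    """Assumed diversity term: Bernoulli variance of 0/1 leaf rewards, rescaled to [0, 1]."""
    p = sum(rewards) / len(rewards)
    return 4.0 * p * (1.0 - p)


def decorrelation(logprobs: Sequence[float], rewards: Sequence[float]) -> float:
    """Assumed decorrelation term: 1 - |Pearson r| between leaf log-probs and rewards."""
    n = len(rewards)
    mean_l = sum(logprobs) / n
    mean_r = sum(rewards) / n
    var_l = sum((l - mean_l) ** 2 for l in logprobs)
    var_r = sum((r - mean_r) ** 2 for r in rewards)
    if var_r == 0.0:
        # All leaves share one outcome: no contrast to learn from (assumed convention).
        return 0.0
    if var_l == 0.0:
        # Policy is indifferent while outcomes differ: treated as fully decorrelated.
        return 1.0
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(logprobs, rewards))
    corr = cov / math.sqrt(var_l * var_r)
    return 1.0 - min(1.0, abs(corr))


def informativeness(
    logprobs: Sequence[float], rewards: Sequence[float], alpha: float = 0.5
) -> float:
    """Hypothetical F(T): convex combination of the two terms above."""
    return alpha * leaf_diversity(rewards) + (1.0 - alpha) * decorrelation(logprobs, rewards)
```

Under these assumptions, a tree whose leaves are all wrong scores 0, while a tree with mixed outcomes that the policy's likelihoods fail to predict scores close to 1.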

Key takeaway

For machine learning engineers working on LLM reasoning, especially in mathematical domains, CATPO offers a way to improve accuracy while wasting less training compute. Consider adopting its tree informativeness scoring and critique-guided healing to strengthen the learning signal and concentrate compute on the most informative trees. The reported macro-accuracy gains are 1.9% over TreeRPO and 4.8% over GRPO.

Key insights

CATPO optimizes LLM reasoning by focusing RL training on informative trees and healing failures with critiques.

Method

Score trees with F(T) for informativeness, apply critique-guided healing to dead-wrong trees, and use an informativeness-weighted loss for gradient updates.
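The three steps can be sketched as a small pipeline. This is a hedged illustration, not the authors' implementation: the `Tree` container, the `heal` callback (standing in for critique generation and continuation grafting), and the normalization of weights by their sum are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tree:
    leaf_rewards: List[float]  # verifiable 0/1 rewards at the leaves
    loss: float                # per-tree policy loss, computed elsewhere
    score: float               # informativeness F(T), computed elsewhere


def is_dead_wrong(tree: Tree) -> bool:
    """A tree is treated as dead-wrong when no leaf earned any reward."""
    return all(r == 0 for r in tree.leaf_rewards)


def heal_dead_trees(trees: List[Tree], heal: Callable[[Tree], Tree]) -> List[Tree]:
    """Route dead-wrong trees through a (hypothetical) critique-guided healer."""
    return [heal(t) if is_dead_wrong(t) else t for t in trees]


def weighted_loss(trees: List[Tree], eps: float = 1e-8) -> float:
    """Informativeness-weighted mean of per-tree losses (normalization assumed)."""
    total = sum(t.score for t in trees)
    if total <= eps:
        return 0.0  # no informative tree in the batch
    return sum(t.score * t.loss for t in trees) / total


def fake_heal(tree: Tree) -> Tree:
    # Placeholder: a real healer would critique the failure, graft a refined
    # continuation, then re-score the tree and recompute its loss.
    return Tree(leaf_rewards=[0.0, 1.0], loss=2.0, score=0.5)


batch = [Tree([1.0, 0.0], loss=1.0, score=1.0), Tree([0.0, 0.0], loss=3.0, score=0.0)]
healed = heal_dead_trees(batch, fake_heal)
batch_loss = weighted_loss(healed)
```

In a real training loop the scores would act as constants (no gradient flows through them), so they only rescale how much each tree's loss contributes to the update.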

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.