Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

2026-06-17 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Trajectory-Augmented Policy Optimization (TAPO) is a self-distillation method for improving reasoning in large language models (LLMs) that goes beyond implicit logit-level alignment. Where traditional methods minimize KL divergence toward a target distribution, TAPO explicitly constructs "micro-reflective corrections." The model generates both correct and incorrect rollouts for a given query, and TAPO uses the contrast between them to build new training trajectories. Each trajectory preserves the model's erroneous reasoning up to the point of failure, then inserts a natural-language diagnosis followed by corrected reasoning derived from a correct reference. This approach maintains the model's on-policy distribution more effectively than KL-based methods. TAPO integrates these trajectories using difficulty-aware candidate selection at the model's capability boundary, together with decoupled advantage estimation to prevent gradient contamination. Experiments on the AIME 2024, AIME 2025, and HMMT 2025 benchmarks show that TAPO consistently improves on GRPO, strengthening both initial reasoning and error correction.

Key takeaway

Machine Learning Engineers developing self-improving LLMs should consider Trajectory-Augmented Policy Optimization (TAPO) as a way to move beyond implicit logit alignment. Explicitly constructing error-specific, natural-language corrective trajectories from a model's own contrasting rollouts can make its reasoning more robust. The method also offers fine-grained diagnostic insight into failure patterns, and it is reported to yield stronger first-pass reasoning and more effective error correction on benchmarks such as AIME and HMMT.

Key insights

TAPO improves LLM self-distillation by explicitly constructing error-specific, natural-language corrective trajectories from contrasting rollouts.

Principles

Self-distillation benefits from explicit error diagnosis.
Contrastive rollouts enable fine-grained corrections.
On-policy distribution is better preserved with prefix-anchored corrections.
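
The prefix-anchoring idea can be sketched as follows. This is a minimal illustration and not the paper's implementation: the `Rollout` type, the `failure_index` argument, the "Wait:" marker, and the assumption that the reference rollout's steps line up index by index with the erroneous one are all hypothetical choices made for the example. The summary does not say how the failure point is located or how the diagnosis is generated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    steps: List[str]  # reasoning steps, in order
    correct: bool     # whether the final answer was verified as correct


def build_micro_reflective_trajectory(
    wrong: Rollout,
    reference: Rollout,
    failure_index: int,
    diagnosis: str,
) -> List[str]:
    """Keep the erroneous prefix, then append a diagnosis and corrected steps.

    failure_index is the index of the first wrong step (hypothetical input).
    """
    if wrong.correct or not reference.correct:
        raise ValueError("need one incorrect rollout and one correct reference")
    if not 0 <= failure_index < len(wrong.steps):
        raise ValueError("failure_index out of range")
    # Preserve the model's own reasoning up to and including the failing step.
    prefix = wrong.steps[: failure_index + 1]
    # Assumption: corrected reasoning resumes at the same step index
    # in the reference rollout.
    corrected = reference.steps[failure_index:]
    return prefix + [f"Wait: {diagnosis}"] + corrected
```

Because the prefix is copied verbatim from the model's own rollout, only the diagnosis and the corrected suffix are new text, which is the intuition behind the on-policy claim above.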

Method

TAPO generates correct/incorrect rollouts, constructs micro-reflective trajectories with natural-language diagnosis and corrected reasoning, then integrates them via difficulty-aware selection and decoupled advantage estimation.
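
One way to read "decoupled advantage estimation" is sketched below. The summary does not give the formula, so this is an assumption-laden illustration: group statistics are computed GRPO-style (reward minus group mean, divided by group standard deviation) over the sampled rollouts only, and the constructed trajectories receive a separate, fixed advantage so they cannot shift the group baseline. The function names, the `augmented_advantage` parameter, and the use of the population standard deviation are hypothetical.

```python
import statistics
from typing import List, Tuple


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


def decoupled_advantages(
    rollout_rewards: List[float],
    num_augmented: int,
    augmented_advantage: float = 1.0,  # hypothetical constant
) -> Tuple[List[float], List[float]]:
    """Return advantages for sampled rollouts and for constructed trajectories.

    The constructed trajectories are excluded from the group statistics,
    so adding them leaves the sampled rollouts' advantages unchanged.
    """
    on_policy = grpo_advantages(rollout_rewards)
    augmented = [augmented_advantage] * num_augmented
    return on_policy, augmented
```

With this separation, adding or removing constructed trajectories changes nothing about the advantages assigned to the sampled rollouts, which is one plausible reading of "preventing gradient contamination."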

In practice

Use contrasting rollouts for error-specific feedback.
Anchor corrections to the model's own erroneous prefixes.
Apply difficulty-aware selection for training trajectories.
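
A simple form of difficulty-aware selection can be sketched as below, assuming "capability boundary" means queries the model solves sometimes but not always. The pass-rate criterion, the `low` and `high` thresholds, and the function names are hypothetical; the summary does not specify the actual selection rule.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of rollouts for one query that were correct."""
    return sum(outcomes) / len(outcomes)


def select_boundary_queries(
    outcomes_by_query: Dict[str, List[bool]],
    low: float = 0.0,   # hypothetical lower threshold (exclusive)
    high: float = 1.0,  # hypothetical upper threshold (exclusive)
) -> List[str]:
    """Keep queries whose pass rate lies strictly between low and high."""
    selected = []
    for query_id, outcomes in outcomes_by_query.items():
        if not outcomes:
            continue  # no rollouts sampled for this query
        if low < pass_rate(outcomes) < high:
            selected.append(query_id)
    return selected
```

With the default thresholds, a selected query always has at least one correct and one incorrect rollout, so a contrasting pair is available for building a corrective trajectory.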

Topics

Large Language Models
Self-Distillation
Reinforcement Learning
Trajectory-Augmented Policy Optimization
Error Correction
On-policy Learning

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.