Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation
Summary
Trajectory-Augmented Policy Optimization (TAPO) is a self-distillation method that aims to improve reasoning in large language models (LLMs) by going beyond implicit logit-level alignment. Where traditional methods minimize KL divergence toward a target distribution, TAPO explicitly constructs "micro-reflective corrections." For a given query, the model generates both correct and incorrect rollouts, and the contrast between them is used to build new training trajectories. Each trajectory preserves the model's erroneous reasoning up to the point of failure, then inserts a natural-language diagnosis followed by corrected reasoning derived from a correct reference. Because the prefix is the model's own output, these trajectories stay closer to the model's on-policy distribution than KL-based alignment does. TAPO integrates the trajectories into training through difficulty-aware candidate selection at the model's capability boundary and through decoupled advantage estimation, which prevents gradient contamination. In experiments on the AIME 2024, AIME 2025, and HMMT 2025 benchmarks, TAPO consistently improves over GRPO, strengthening both initial reasoning and error correction.
Key takeaway
Machine learning engineers developing self-improving LLMs should consider Trajectory-Augmented Policy Optimization (TAPO) as a way to move beyond implicit logit alignment. Explicitly constructing error-specific, natural-language corrective trajectories from a model's own contrasting rollouts gives fine-grained diagnostic insight into failure patterns. The reported result is stronger first-pass reasoning and more effective error correction on benchmarks such as AIME and HMMT.
Key insights
TAPO improves LLM self-distillation by explicitly constructing error-specific, natural-language corrective trajectories from contrasting rollouts.
Principles
- Self-distillation benefits from explicit error diagnosis.
- Contrastive rollouts enable fine-grained corrections.
- Prefix-anchored corrections preserve the on-policy distribution better than KL-based alignment.
Method
TAPO generates correct and incorrect rollouts for each query, constructs micro-reflective trajectories that contain a natural-language diagnosis and corrected reasoning, then integrates them into training via difficulty-aware selection and decoupled advantage estimation.
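The trajectory-construction step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Rollout` type, the `build_micro_reflective_trajectory` function, the `failure_index` argument, and the `Reflection:` marker are all hypothetical names introduced here. The sketch also assumes that the failure point, the diagnosis text, and the corrected steps have already been obtained by comparing the incorrect rollout against a correct reference, and that the failing step itself is kept in the prefix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled solution attempt (hypothetical container for illustration)."""
    steps: List[str]  # reasoning steps, in generation order
    correct: bool     # whether the final answer was verified as correct


def build_micro_reflective_trajectory(
    wrong: Rollout,
    failure_index: int,
    diagnosis: str,
    corrected_steps: List[str],
) -> List[str]:
    """Keep the model's own steps through the failing step, then append a
    natural-language diagnosis and the corrected continuation."""
    if wrong.correct:
        raise ValueError("expected an incorrect rollout")
    if not 0 <= failure_index < len(wrong.steps):
        raise ValueError("failure_index is out of range")

    # Assumption: the erroneous step is part of the preserved prefix.
    prefix = wrong.steps[: failure_index + 1]
    # Assumption: the diagnosis is inserted as a single marked step.
    reflection = f"Reflection: {diagnosis}"
    return prefix + [reflection] + list(corrected_steps)


wrong = Rollout(
    steps=["Let x = 3.", "Then x^2 = 6.", "So the answer is 6."],
    correct=False,
)
trajectory = build_micro_reflective_trajectory(
    wrong,
    failure_index=1,
    diagnosis="squaring 3 gives 9, not 6.",
    corrected_steps=["Then x^2 = 9.", "So the answer is 9."],
)
```

Everything before the reflection step is text the model itself produced, which is what keeps the new trajectory anchored to the model's own distribution.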
In practice
- Use contrasting rollouts for error-specific feedback.
- Anchor corrections to the model's own erroneous prefixes.
- Apply difficulty-aware selection for training trajectories.
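The selection and advantage steps can be sketched as follows. The summary does not give TAPO's exact formulas, so this is only one possible reading under stated assumptions. "Capability boundary" is interpreted as a pass rate that is neither near 0 nor near 1, with hypothetical thresholds `low` and `high`. "Decoupled" is interpreted as computing GRPO-style group statistics (mean and standard deviation of rewards) from the on-policy rollouts only, so that augmented trajectories do not shift the baseline. The function names are illustrative.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def is_boundary_query(
    rewards: List[float], low: float = 0.2, high: float = 0.8
) -> bool:
    """Return True when the query yields a mix of correct and incorrect
    rollouts. The thresholds are hypothetical, not values from the source."""
    if not rewards:
        return False
    pass_rate = sum(1 for r in rewards if r > 0) / len(rewards)
    return low <= pass_rate <= high


def decoupled_advantages(
    rollout_rewards: List[float],
    augmented_rewards: List[float],
    eps: float = 1e-6,
) -> Tuple[List[float], List[float]]:
    """Group-normalize rewards using statistics from on-policy rollouts only.

    Assumption: augmented trajectories are scored against the same baseline
    but never contribute to the mean or standard deviation.
    """
    mu = mean(rollout_rewards)
    sigma = pstdev(rollout_rewards)
    on_policy = [(r - mu) / (sigma + eps) for r in rollout_rewards]
    augmented = [(r - mu) / (sigma + eps) for r in augmented_rewards]
    return on_policy, augmented


rewards = [1.0, 0.0, 0.0, 1.0]  # two correct and two incorrect rollouts
if is_boundary_query(rewards):
    on_policy_adv, augmented_adv = decoupled_advantages(rewards, [1.0])
```

With this reading, adding or removing augmented trajectories leaves the on-policy advantages unchanged, which is one way to keep the two gradient signals from contaminating each other.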
Topics
- Large Language Models
- Self-Distillation
- Reinforcement Learning
- Trajectory-Augmented Policy Optimization
- Error Correction
- On-policy Learning
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.