OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

2026-06-25 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

OPID (On-Policy Skill Distillation) is a framework for agentic reinforcement learning that targets two problems: sparse rewards in outcome-based RL and the limitations of existing skill-conditioned self-distillation methods. It extracts hierarchical skill supervision directly from completed on-policy trajectories, organizing it into episode-level skills that capture global workflows and step-level skills that capture critical local decisions. A critical-first routing mechanism selects which skill to apply; the selected skill is injected into the interaction history, and responses are re-scored under that skill-augmented context. The re-scoring yields a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. In experiments on ALFWorld, WebShop, and search-based QA tasks, OPID improves agent performance, sample efficiency, and robustness over outcome-only RL and other skill-distillation baselines.

Key takeaway

For machine learning engineers building language agents, OPID offers a way to improve performance and sample efficiency through dense, distribution-matched supervision. Consider its hierarchical on-policy skill distillation when sparse rewards limit training or when robustness in multi-turn interaction matters, especially for tasks with complex decision-making such as those in ALFWorld or WebShop.

Key insights

OPID leverages hierarchical on-policy skill distillation to provide dense, distribution-matched supervision for agentic reinforcement learning, improving performance and efficiency.

Principles

Hierarchical skills offer granular guidance for agents.
On-policy skill extraction ensures context-policy alignment.
Combine token-level and outcome advantages for optimization.

Method

OPID extracts hierarchical skills (episode-level and step-level) from on-policy trajectories, selects among them with critical-first routing, and injects the selected skill into the interaction history to re-score the response. The resulting token-level log-probability shift is combined with the outcome advantage for policy optimization.
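
The log-probability shift and its combination with the outcome advantage could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the additive form, and the weight `beta` are assumptions made here, and this summary does not state any clipping, normalization, or weighting that the paper may use.

```python
from typing import List


def self_distillation_advantage(
    logp_with_skill: List[float],
    logp_without_skill: List[float],
) -> List[float]:
    """Per-token log-probability shift from adding the skill to the context.

    Both inputs score the SAME sampled response tokens under the same policy;
    only the context differs (with vs. without the injected skill).
    """
    if len(logp_with_skill) != len(logp_without_skill):
        raise ValueError("both scorings must cover the same response tokens")
    return [w - wo for w, wo in zip(logp_with_skill, logp_without_skill)]


def combined_advantage(
    outcome_advantage: float,
    token_shift: List[float],
    beta: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> List[float]:
    """Broadcast the scalar outcome advantage over tokens and add the shift."""
    return [outcome_advantage + beta * s for s in token_shift]


# Toy numbers for a three-token response.
shift = self_distillation_advantage([-1.0, -0.5, -2.0], [-1.5, -0.5, -1.0])
advantages = combined_advantage(outcome_advantage=1.0, token_shift=shift)
```

In this sketch, a token that becomes more likely once the skill is visible gets a positive shift and therefore extra credit on top of the trajectory-level outcome signal, which is one way to read the "dense" supervision the summary describes.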

In practice

Implement hierarchical skill extraction for complex tasks.
Integrate on-policy distillation to avoid skill-context mismatch.
Apply critical-first routing for decision guidance.
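
One possible reading of critical-first routing and skill injection is sketched below. The summary says only that routing guides skill selection, so the rule used here (prefer a step-level skill when the current step has one, otherwise fall back to the episode-level skill) is an assumption, as are the `SkillBank` structure, the function names, and the hint format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SkillBank:
    """Hypothetical container for skills extracted from one trajectory."""
    episode_skill: str  # global workflow guidance
    step_skills: Dict[int, str] = field(default_factory=dict)  # critical steps only


def route_skill(bank: SkillBank, step: int) -> str:
    """Assumed critical-first rule: a step-level skill wins if one exists."""
    return bank.step_skills.get(step, bank.episode_skill)


def inject_skill(history: str, skill: str) -> str:
    """Append the routed skill to the interaction history before re-scoring."""
    return f"{history}\n[Skill hint] {skill}"


bank = SkillBank(
    episode_skill="Locate the target object, then place it at the goal.",
    step_skills={2: "Open the drawer before searching inside it."},
)
context = inject_skill("Observation: you face a closed drawer.", route_skill(bank, 2))
```

The augmented `context` would then be used to re-score the response the policy already produced, rather than to sample a new one.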

Topics

Reinforcement Learning
Skill Distillation
Language Agents
On-Policy Learning
Hierarchical Skills
Sample Efficiency

Code references

jinyangwu/OPID

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.