OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

2026-06-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

OPD-Evolver is a slow-fast co-evolution framework that trains holistic agent evolvers through on-policy self-distillation. It targets a limitation of existing memory agents: they store experience but struggle to select useful information, act on it, write reusable knowledge, and maintain a growing repository. In the fast loop, the agent reads, uses, writes, and maintains experience in a four-level memory hierarchy, which enables rapid test-time evolution. In parallel, a slow loop uses outcome-calibrated memory attribution and privileged hindsight to distill these four abilities into a deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5% and training-based methods such as Skill0 by approximately 5.8%. Because it internalizes high-value experience and memory management, OPD-Evolver-9B is competitive with much larger models such as Qwen3.5-397B-A17B and Step-3.5-Flash, which suggests that agent evolvers can move beyond simple memory augmentation.

Key takeaway

AI scientists and machine learning engineers building self-evolving agents should consider a slow-fast co-evolution framework such as OPD-Evolver. The approach combines on-policy self-distillation with a multi-level memory hierarchy to improve an agent's ability to select, use, write, and maintain knowledge from experience. In the reported benchmarks, it outperforms memory systems by up to 11.5% and lets a 9B model compete with much larger ones.

Key insights

OPD-Evolver uses slow-fast co-evolution and on-policy self-distillation to create agents that holistically manage and learn from experience.

Principles

Holistic competence requires selecting useful experience, acting on it, writing reusable knowledge, and maintaining the repository.
Slow-fast co-evolution distills memory management into the deployable policy.
Outcome-calibrated attribution enhances learning from hindsight.
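
A minimal sketch of what outcome-calibrated attribution could look like, assuming each episode records which memory items were retrieved and whether the task succeeded. The summary does not describe the paper's actual attribution rule, so the scoring used here (success rate of episodes that retrieved an item, minus the overall success rate) and the name attribute_memory are illustrative assumptions only.

```python
from collections import defaultdict


def attribute_memory(episodes):
    """Score memory items by the outcomes of episodes that retrieved them.

    episodes: list of (retrieved_ids, success) pairs.
    Returns {memory_id: score}, where score is the success rate of episodes
    that retrieved the item minus the overall success rate.
    NOTE: this rule is a hypothetical stand-in, not the paper's method.
    """
    if not episodes:
        return {}
    baseline = sum(1 for _, ok in episodes if ok) / len(episodes)
    uses = defaultdict(int)
    wins = defaultdict(int)
    for retrieved_ids, ok in episodes:
        # Count each item once per episode, even if retrieved repeatedly.
        for mem_id in set(retrieved_ids):
            uses[mem_id] += 1
            wins[mem_id] += int(ok)
    return {m: wins[m] / uses[m] - baseline for m in uses}


episodes = [(["a", "b"], True), (["a"], True), (["b"], False), ([], False)]
scores = attribute_memory(episodes)
```

Items with a positive score were retrieved mostly in successful episodes; under this assumption, they would be the candidates worth distilling into the policy.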

Method

OPD-Evolver employs a fast loop for test-time evolution via a four-level memory hierarchy and a slow loop for distilling memory attribution and hindsight into the policy through on-policy self-distillation.
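
The slow loop can be pictured with a small sketch, under stated assumptions: the student is the policy without hindsight, the teacher is the same policy conditioned on privileged hindsight, and the loss is a per-token reverse KL evaluated on positions sampled from the student (which is what makes it on-policy). The summary does not state the exact objective, so the choice of reverse KL and the function names are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    lp = log_softmax(student_logits)
    lq = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def self_distillation_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over one student-sampled trajectory.

    student_steps / teacher_steps: lists of per-position logit vectors.
    The teacher logits are assumed to come from the same model given
    privileged hindsight (an assumption, not confirmed by the source).
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must cover the same positions")
    if not student_steps:
        return 0.0
    total = sum(reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps))
    return total / len(student_steps)


same = self_distillation_loss([[0.0, 0.0]], [[0.0, 0.0]])
diff = self_distillation_loss([[2.0, 0.0]], [[0.0, 2.0]])
```

The loss is zero when the student already matches the hindsight-conditioned teacher and grows as the two distributions diverge. In a real training setup the logits would come from model forward passes rather than hand-written lists.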

In practice

Implement a four-level memory hierarchy for agent experience.
Use on-policy self-distillation to refine agent policies.
Apply outcome-calibrated memory attribution for learning.
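
As a starting point for the first item above, here is a hedged sketch of a leveled memory that supports reading, writing, and maintenance. The summary does not name the four levels or describe retrieval, so the levels are simply numbered 0 to 3, retrieval uses naive keyword overlap, and the class and method names (LeveledMemory, read, write, maintain) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    text: str
    score: float = 0.0  # running usefulness estimate for this entry


@dataclass
class LeveledMemory:
    num_levels: int = 4    # four levels, left unnamed (not given in the summary)
    capacity: int = 100    # maximum entries kept per level after maintenance
    levels: list = field(default_factory=list)

    def __post_init__(self):
        if not self.levels:
            self.levels = [[] for _ in range(self.num_levels)]

    def write(self, level, text, score=0.0):
        """Store a new piece of experience at the given level."""
        self.levels[level].append(Entry(text, score))

    def read(self, query, k=3):
        """Return up to k (level, entry) pairs ranked by keyword overlap, then score."""
        words = set(query.lower().split())
        scored = []
        for level, entries in enumerate(self.levels):
            for e in entries:
                overlap = len(words & set(e.text.lower().split()))
                if overlap:
                    scored.append((overlap, e.score, level, e))
        scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
        return [(level, e) for _, _, level, e in scored[:k]]

    def maintain(self, min_score=-1.0):
        """Drop low-scoring entries and cap each level at its capacity."""
        for i, entries in enumerate(self.levels):
            kept = [e for e in entries if e.score >= min_score]
            kept.sort(key=lambda e: e.score, reverse=True)
            self.levels[i] = kept[: self.capacity]


mem = LeveledMemory(capacity=1)
mem.write(0, "open the drawer first", 0.5)
mem.write(0, "drawer is locked", -2.0)
mem.write(3, "check inventory before acting", 0.2)
hits = mem.read("open drawer")
mem.maintain()
```

The use step would consist of placing the retrieved entries in the agent's context before it acts. Entry scores could be updated from an attribution signal, so that maintenance gradually prunes items that do not help.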

Topics

Agent Evolution
On-Policy Distillation
Memory Hierarchies
Large Language Models
Holistic Competence
Self-Distillation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.