SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History
Summary
SkillHone is a harness for continually evolving the skills of language-model agents. It addresses a limitation of existing methods, which discard the decision history behind each skill revision. SkillHone pairs skill revisions with evaluation-side evidence and records structured histories of diagnoses, revisions, evidence, and outcomes, so refinement can continue across sessions. Role-separated subagents test candidate skills on practice probes and propose revisions informed by past decisions, so prior rationale does not have to be rediscovered. Evaluated on deep-research benchmarks in a raw open-web setting with Qwen3.6-35B-A3B as its backbone, SkillHone outperforms a deep-research agent backed by commercial retrieval services. It achieved a 15.8-point improvement on GAIA and a 3.2-point improvement on WebWalkerQA-EN, and it also surpasses previous skill-evolution techniques.
Key takeaway
For AI engineers building continually evolving agents, SkillHone offers a framework for avoiding lost decision history. Consider adding persistent decision logging and structured feedback so that agents can learn from past revisions and evaluations. This keeps agents from rediscovering rationale they have already worked out, and the reported results suggest it improves performance on complex, dynamic tasks such as open-web research.
Key insights
SkillHone enables continuous agent skill evolution by persistently recording and utilizing decision history and evaluation feedback.
Principles
- Preserve decision history for agent learning.
- Integrate evaluation feedback directly into revisions.
- Use role-separated subagents for refinement.
Method
SkillHone records structured histories of diagnoses, revisions, evidence, and outcomes. Role-separated subagents run candidate skills on practice probes and propose revisions informed by prior decisions, which lets refinement carry over from one session to the next.
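To make the idea of a structured, persistent history concrete, here is a minimal sketch of what one record and its storage could look like. It is not SkillHone's implementation: the `DecisionRecord` class, its field names, the helper functions, and the JSON Lines file format are all assumptions chosen for illustration, based only on the four elements named above (diagnosis, revision, evidence, outcome).

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class DecisionRecord:
    """One entry in a skill's decision history (hypothetical field names)."""
    skill: str      # name of the skill that was revised
    diagnosis: str  # what went wrong on the practice probes
    revision: str   # what was changed in the skill
    evidence: str   # evaluation-side evidence for or against the change
    outcome: str    # e.g. "accepted" or "rejected"


def append_record(path: Path, record: DecisionRecord) -> None:
    """Append one record as a JSON line so the history survives across sessions."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def load_history(path: Path, skill: str) -> list[DecisionRecord]:
    """Load all prior records for one skill; returns [] if no history exists yet."""
    if not path.exists():
        return []
    records = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = DecisionRecord(**json.loads(line))
            if record.skill == skill:
                records.append(record)
    return records
```

An append-only log is used here because it keeps rejected revisions alongside accepted ones, so a later session can see what was already tried and why it was dropped.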
In practice
- Implement persistent decision logging for agents.
- Design subagents for iterative skill refinement.
- Integrate evaluation feedback into agent training loops.
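The steps above can be sketched as a single refinement step with separated roles. This is an illustration under stated assumptions, not the method from the original article: `evaluate`, `evolve_step`, the `solve` and `revise` callables (stand-ins for model-backed subagents), exact-match scoring, and the rule of keeping a revision only when the probe score improves are all hypothetical choices.

```python
from typing import Callable

Probe = tuple[str, str]  # (question, expected answer)


def evaluate(skill: str, probes: list[Probe],
             solve: Callable[[str, str], str]) -> tuple[float, list[str]]:
    """Tester role: run the skill on each probe; return accuracy and failed questions."""
    if not probes:
        raise ValueError("at least one practice probe is required")
    failures = [q for q, expected in probes if solve(skill, q) != expected]
    score = 1.0 - len(failures) / len(probes)
    return score, failures


def evolve_step(skill: str, probes: list[Probe],
                solve: Callable[[str, str], str],
                revise: Callable[[str, list[str], list[dict]], str],
                history: list[dict]) -> str:
    """Test the skill, ask the reviser for a candidate, and log the decision.

    Assumption: a candidate is kept only if it scores strictly higher.
    """
    base_score, failures = evaluate(skill, probes, solve)
    if not failures:
        return skill  # nothing to diagnose, so no revision is proposed
    # Reviser role: sees the failures and the full prior decision history.
    candidate = revise(skill, failures, history)
    new_score, _ = evaluate(candidate, probes, solve)
    accepted = new_score > base_score
    # Record the decision whether or not the revision was accepted.
    history.append({
        "diagnosis": failures,
        "revision": candidate,
        "evidence": {"before": base_score, "after": new_score},
        "outcome": "accepted" if accepted else "rejected",
    })
    return candidate if accepted else skill
```

Passing `history` into `revise` is the point of the design: the reviser can avoid repeating a revision that an earlier session already rejected.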
Topics
- Agent Skill Evolution
- Language Model Agents
- Decision History
- Continual Learning
- Open-Web Research
- GAIA Benchmark
- WebWalkerQA-EN
Best for: Research Scientist, AI Scientist, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.