SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
Summary
SIRI (Self-Internalizing Reinforcement learning with Intrinsic skills) is a three-phase framework that lets long-horizon LLM agents discover, validate, and internalize reusable skills without external skill generators or inference-time skill banks. This reduces the engineering complexity, context-length overhead, and deployment latency of current skill-based methods. SIRI first warms up the policy with GiGPO so that it acquires basic interaction ability and yields successful skill-free trajectories. It then performs self-skill mining: the policy summarizes compact skills from its own rollouts and validates each one by comparing paired skill-augmented and skill-free rollouts. Finally, SIRI distills only the beneficial skill-guided action tokens into the plain policy, selecting them by trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. With Qwen2.5-7B-Instruct, SIRI raised the GiGPO baseline's score from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming the other baselines it was compared against. Its self-mining strategy performed comparably to distillation from large closed-source models.
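To put the reported scores in perspective, the short calculation below derives the absolute and relative gains from the four numbers quoted above. The helper name `gain` is illustrative only, and the scores are treated as plain numbers without assuming how they were measured.

```python
def gain(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain) of `improved` over `baseline`."""
    absolute = improved - baseline
    relative = absolute / baseline
    return absolute, relative


# Scores quoted in the summary, as (GiGPO, SIRI) pairs.
reported = {
    "ALFWorld": (0.908, 0.930),
    "WebShop": (0.728, 0.813),
}

gains = {name: gain(base, new) for name, (base, new) in reported.items()}

for name, (absolute, relative) in gains.items():
    print(f"{name}: +{absolute:.3f} absolute, +{relative:.1%} relative")
```

The gain is about 0.022 (roughly 2.4% relative) on ALFWorld and 0.085 (roughly 11.7% relative) on WebShop, so most of the reported benefit shows up on WebShop, where the baseline had more headroom.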
Key takeaway
For machine learning engineers building long-horizon LLM agents, SIRI offers a way to internalize skills into the policy itself, removing the need for external skill generators or inference-time skill retrieval. Consider its three-phase approach if you want to reduce engineering overhead and improve agent performance on tasks such as ALFWorld and WebShop. Because the agent mines and validates its own skills during training, deployment needs only the original prompt.
Key insights
SIRI enables LLM agents to autonomously discover and internalize skills, reducing external dependencies and improving long-horizon task performance.
Principles
- Self-generated skills reduce external reliance.
- Validate skills through comparative rollouts.
- Distill only beneficial skill actions.
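The third principle can be illustrated with a small token-selection sketch. The summary says only that trajectory-level utility and action-level advantage decide which skill-guided action tokens are distilled; the zero thresholds, the AND combination, and the names `Step` and `select_tokens_for_distillation` below are assumptions made for illustration, not the paper's formulation.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action_tokens: list[int]  # token ids of one skill-guided action
    advantage: float          # action-level advantage, assumed precomputed


def select_tokens_for_distillation(steps, trajectory_utility):
    """Return one 0/1 mask per step; 1 marks a token to distill.

    Assumption: an action is kept only if the whole trajectory benefited
    from the skill (utility > 0) and the action itself has positive advantage.
    """
    masks = []
    for step in steps:
        keep = trajectory_utility > 0 and step.advantage > 0
        masks.append([1 if keep else 0] * len(step.action_tokens))
    return masks


steps = [Step([11, 12, 13], 0.4), Step([21, 22], -0.1), Step([31], 0.2)]

# Skill helped this trajectory: only positive-advantage actions are kept.
masks = select_tokens_for_distillation(steps, trajectory_utility=0.25)

# Skill hurt this trajectory: nothing is distilled from it.
no_masks = select_tokens_for_distillation(steps, trajectory_utility=-0.5)
```

In a training loop, masks like these would typically multiply a per-token distillation loss so that unhelpful actions contribute no gradient.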
Method
SIRI's three phases are: policy warm-up with GiGPO, self-skill mining and validation from successful rollouts, and distillation of beneficial skill-guided actions into the plain policy.
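The three phases can be sketched as a simple orchestration function. Everything here is a hypothetical skeleton: the function `run_three_phase_pipeline`, its callable parameters, and the toy stand-ins are not from the paper, and the "keep skills with positive utility" rule is an assumption about how validation feeds distillation.

```python
def run_three_phase_pipeline(policy, warm_up, mine_skills, validate_skill, distill):
    """Chain warm-up, self-skill mining with validation, and distillation."""
    # Phase 1: warm up the policy and collect successful skill-free trajectories.
    policy, successes = warm_up(policy)

    # Phase 2: mine candidate skills from the policy's own rollouts, then
    # keep only those whose measured utility is positive (assumed rule).
    validated = []
    for skill in mine_skills(policy, successes):
        utility = validate_skill(policy, skill)
        if utility > 0:
            validated.append((skill, utility))

    # Phase 3: distill beneficial skill-guided behaviour into the plain policy.
    return distill(policy, validated)


# Toy stand-ins so the skeleton runs end to end; a real system would plug in
# GiGPO training, LLM-based skill summarization, paired rollouts, and so on.
def toy_warm_up(policy):
    return {**policy, "warmed_up": True}, ["traj-a", "traj-b"]

def toy_mine(policy, successes):
    return [f"skill-from-{t}" for t in successes]

def toy_validate(policy, skill):
    return 0.1 if skill.endswith("a") else -0.05

def toy_distill(policy, validated):
    return {**policy, "internalized": [skill for skill, _ in validated]}


final_policy = run_three_phase_pipeline(
    {"name": "plain"}, toy_warm_up, toy_mine, toy_validate, toy_distill
)
print(final_policy)
```

Passing the phases in as callables keeps the skeleton independent of any particular RL library or environment.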
In practice
- Apply SIRI for complex, multi-step agent tasks.
- Start from Qwen2.5-7B-Instruct, the base model used in the reported experiments.
- Evaluate skill benefits via A/B rollouts.
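The A/B evaluation in the last bullet can be sketched as a paired comparison on the same tasks. This is a minimal illustration: `paired_skill_utility`, `toy_rollout`, the difficulty numbers, and the example skill text are all hypothetical, and defining utility as a difference in success rates is an assumption rather than the paper's stated metric.

```python
def paired_skill_utility(rollout, tasks, skill):
    """Success rate with the skill minus success rate without it.

    `rollout(task, skill)` must return True on success; passing skill=None
    means the plain, skill-free prompt.
    """
    with_skill = sum(rollout(task, skill) for task in tasks)
    without_skill = sum(rollout(task, None) for task in tasks)
    return with_skill / len(tasks) - without_skill / len(tasks)


def toy_rollout(task_difficulty, skill):
    """Deterministic stand-in for an agent episode.

    The agent solves tasks below a difficulty budget, and having a skill
    in the prompt raises that budget.
    """
    budget = 0.5 + (0.3 if skill is not None else 0.0)
    return task_difficulty < budget


tasks = [0.2, 0.4, 0.6, 0.7, 0.9]  # hypothetical task difficulties
utility = paired_skill_utility(toy_rollout, tasks, "check the drawer first")
print(f"estimated skill utility: {utility:+.2f}")
```

Running both arms on the same task set keeps the comparison paired, so differences in task difficulty do not masquerade as skill benefit. With real, stochastic agents, each arm would need several rollouts per task to get a stable estimate.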
Topics
- LLM Agents
- Reinforcement Learning
- Intrinsic Skills
- Self-Internalizing RL
- Qwen2.5-7B-Instruct
- ALFWorld
- WebShop
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.