Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning
Summary
The EAPO (Efficient Agentic Policy Optimization) framework addresses tool abuse in agentic reinforcement learning, where models frequently overuse external tools for tasks solvable by internal reasoning. Current mitigation strategies, such as uniform penalties or hard limits, often reduce tool frequency but also hinder beneficial tool-assisted exploration. EAPO learns selective tool use by integrating tool-free trajectories into rollout groups, employing difficulty-aware reward shaping to penalize redundant tool calls primarily on simpler queries, and utilizing confidence-aware token reweighting to enhance policy learning. Benchmarked across nine mathematical and knowledge-intensive reasoning tasks, EAPO consistently improved the accuracy-efficiency trade-off for Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared to GRPO, EAPO boosted average performance by 10.45%, 7.27%, and 9.69% respectively, while simultaneously cutting average tool calls by 18.33%, 18.33%, and 24.59%. This demonstrates that agents can effectively learn when to abstain from tool use without degrading tool-integrated reasoning capabilities.
Key takeaway
For machine learning engineers training agents with reinforcement learning: if your agents call external tools on queries they could answer through internal reasoning, consider applying EAPO's principles. Adding tool-free trajectories to rollout groups and penalizing redundant tool calls mainly on easier queries improved the accuracy-efficiency trade-off over GRPO on Qwen2.5 and Llama3.1 models, raising average performance while cutting average tool calls by roughly 18% to 25%. Agents learned selective tool use without losing tool-integrated reasoning capability.
Key insights
EAPO enables agentic reinforcement learning models to selectively use tools, avoiding overuse without sacrificing performance.
Principles
- Selective tool use mitigates tool abuse in agentic reinforcement learning.
- Difficulty-aware reward shaping penalizes redundant tool calls primarily on easier queries.
- Confidence-aware token reweighting enhances policy learning.
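The first principle depends on tool-free trajectories sharing a rollout group with tool-using ones, so that both are scored against each other. The sketch below is a minimal illustration, not EAPO's implementation: `Trajectory`, `build_group`, and `group_relative_advantages` are hypothetical names, and the advantage is the standard GRPO-style within-group standardization of rewards.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Trajectory:
    reward: float    # scalar outcome reward, e.g. 1.0 for a correct final answer
    tool_calls: int  # number of external tool invocations in the rollout


def build_group(tool_rollouts, tool_free_rollouts):
    """Merge tool-enabled and tool-free rollouts for one query into one group.

    Assumption: both kinds of rollout are simply concatenated; the mixing
    ratio is not given in the summary above.
    """
    return list(tool_rollouts) + list(tool_free_rollouts)


def group_relative_advantages(group, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its group."""
    rewards = [t.reward for t in group]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One query: three tool-using rollouts and one tool-free rollout.
group = build_group(
    [Trajectory(1.0, 2), Trajectory(0.0, 3), Trajectory(1.0, 1)],
    [Trajectory(1.0, 0)],
)
advantages = group_relative_advantages(group)
```

Because the tool-free rollout is compared with the same group baseline, a correct tool-free answer receives a positive advantage whenever some tool-using rollouts in the group fail.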
Method
EAPO combines three components: it adds tool-free trajectories to each rollout group, applies difficulty-aware reward shaping that penalizes redundant tool calls primarily on easier queries, and uses confidence-aware token reweighting to enhance policy learning.
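One way to picture difficulty-aware reward shaping is sketched below. Everything specific here is an assumption for illustration: the summary does not give EAPO's formula, so this sketch uses the group pass rate as a proxy for how easy a query is, and the names `shaped_rewards`, `penalty_scale`, and `max_calls` are hypothetical.

```python
def shaped_rewards(correct, tool_calls, penalty_scale=0.2, max_calls=4):
    """Return one shaped reward per rollout in a single query's group.

    correct:    list of bools, whether each rollout's final answer was right
    tool_calls: list of ints, tool invocations per rollout
    Assumption: easiness is estimated as the fraction of correct rollouts.
    """
    pass_rate = sum(correct) / len(correct)  # near 1.0 means an easy query
    rewards = []
    for ok, n in zip(correct, tool_calls):
        base = 1.0 if ok else 0.0
        penalty = 0.0
        if ok:
            # Penalty grows with tool calls and with easiness, so hard
            # queries (low pass rate) are barely penalized for tool use.
            penalty = penalty_scale * pass_rate * min(n, max_calls) / max_calls
        rewards.append(base - penalty)
    return rewards


easy = shaped_rewards([True, True, True, True], [0, 1, 2, 4])
hard = shaped_rewards([True, False, False, False], [4, 0, 2, 4])
```

On the easy query, the rollout with four tool calls is rewarded less than the tool-free one. On the hard query, the same four calls cost only a quarter as much, which leaves room for tool-assisted exploration where it is needed.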
In practice
- Apply difficulty-aware reward shaping to reduce tool calls for simpler tasks.
- Incorporate tool-free trajectories into rollout groups so agents learn when internal reasoning suffices.
- Utilize confidence-aware token reweighting to refine tool-use policies.
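The token-reweighting step can be sketched as follows. This is one plausible scheme, not the paper's: the summary does not say how confidence is measured or whether low-confidence or high-confidence tokens are emphasized. The sketch assumes confidence is the policy's probability of the sampled token and gives more weight to low-confidence tokens; `confidence_weights`, `weighted_token_loss`, and `sharpness` are hypothetical names.

```python
import math


def confidence_weights(token_logprobs, sharpness=1.0):
    """Per-token weights averaging to 1, larger for low-probability tokens.

    Assumption: weight is proportional to (1 - p) ** sharpness, where p is
    the probability the policy assigned to the sampled token.
    """
    raw = [(1.0 - math.exp(lp)) ** sharpness for lp in token_logprobs]
    total = sum(raw)
    if total == 0.0:
        # Every token was sampled with probability 1: use uniform weights.
        return [1.0] * len(raw)
    n = len(raw)
    return [n * r / total for r in raw]


def weighted_token_loss(token_logprobs, advantage):
    """Policy-gradient surrogate for one trajectory with reweighted tokens."""
    weights = confidence_weights(token_logprobs)
    terms = [w * advantage * lp for w, lp in zip(weights, token_logprobs)]
    return -sum(terms) / len(terms)


logprobs = [math.log(0.9), math.log(0.5), math.log(0.1)]
weights = confidence_weights(logprobs)
loss = weighted_token_loss(logprobs, advantage=1.0)
```

Normalizing the weights to average 1 keeps the overall loss scale comparable to the unweighted objective, so the reweighting redistributes gradient across tokens rather than changing the effective learning rate.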
Topics
- Agentic Reinforcement Learning
- Tool Abuse Mitigation
- Policy Optimization
- Reward Shaping
- Large Language Models
- Accuracy-Efficiency Trade-off
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.