Not All Skills Help: Measuring and Repairing Agent Knowledge
Summary
A new framework called ASSAY addresses the challenge of managing natural-language skills in LLM agents, which often accumulate skills that can hinder performance on specific tasks despite appearing beneficial in aggregate. Current systems rely solely on LLM judgment for skill curation, conflating skill generation with empirical validation. ASSAY separates these roles by measuring per-skill causal contributions through randomized masking on a small development set. It then restructures the skill library offline and suppresses skills predicted to have negative effects for each test task. This approach consistently improves performance across seven base models from four providers and two benchmarks (AppWorld and tau-bench). On AppWorld's hardest split, DeepSeek-V3 achieved 69.3% task-goal completion, a 47.4% relative improvement and a new state of the art. GPT-4.1 improved by 8.7% relative on tau-bench retail, surpassing o4-mini, o1, and GPT-4.5 without weight modification. The primary gain stems from per-task masking, indicating that matching skills to tasks at inference time is crucial.
Key takeaway
Machine learning engineers optimizing LLM agent performance should move beyond aggregate skill curation. Implement empirical per-skill causal attribution, as in the ASSAY framework, to identify and suppress skills that hurt specific tasks. This approach can deliver significant gains without costly weight modifications, such as DeepSeek-V3's 47.4% relative improvement on AppWorld's hardest split. Focus on matching skills to tasks at inference time, since per-task masking accounts for most of the reported gain.
Key insights
LLM agent performance improves when skills are curated empirically for each task rather than by aggregate LLM judgment alone.
Principles
- Separate skill generation from empirical curation.
- Measure per-skill causal contributions via randomized masking.
- A skill often helps on some tasks and hurts on others.
Method
ASSAY computes per-skill causal attribution on a dev set, restructures the library, and suppresses skills with negative predicted effects for each test task at inference time.
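The attribution step can be illustrated with a minimal sketch. It is not ASSAY's actual implementation: the summary does not specify the estimator, so the difference-in-means estimate below, the function name `estimate_skill_effects`, the `run_agent` callback, and the `trials_per_task` and `keep_prob` parameters are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Sequence, Set


def estimate_skill_effects(
    skills: Sequence[str],
    dev_tasks: Sequence[str],
    run_agent: Callable[[str, Set[str]], float],  # hypothetical: returns a task score
    trials_per_task: int = 8,
    keep_prob: float = 0.5,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate each skill's effect as mean score with it minus mean score without it.

    Assumption: a simple difference-in-means over random masks, not
    necessarily the estimator used by ASSAY.
    """
    rng = random.Random(seed)
    scores_on = defaultdict(list)   # scores from runs where the skill was active
    scores_off = defaultdict(list)  # scores from runs where the skill was masked

    for task in dev_tasks:
        for _ in range(trials_per_task):
            # Randomized masking: each skill is kept independently with keep_prob.
            active = {s for s in skills if rng.random() < keep_prob}
            score = run_agent(task, active)
            for s in skills:
                (scores_on if s in active else scores_off)[s].append(score)

    effects = {}
    for s in skills:
        # Skip skills that were never observed in both conditions.
        if scores_on[s] and scores_off[s]:
            mean_on = sum(scores_on[s]) / len(scores_on[s])
            mean_off = sum(scores_off[s]) / len(scores_off[s])
            effects[s] = mean_on - mean_off
    return effects
```

Because every skill is masked independently of the others, a skill's estimated effect is not systematically confounded by which other skills happened to be active, which is the point of randomizing rather than ablating one skill at a time.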
In practice
- Use randomized masking for skill evaluation.
- Implement per-task skill suppression.
- Apply to existing LLM agents without weight updates.
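Per-task suppression at inference time can be sketched as follows. The summary does not say how ASSAY predicts a skill's effect on an unseen task, so this example swaps in a deliberately simple stand-in: it borrows the measured effects of the most similar dev task, with similarity measured by word overlap (Jaccard). The names `select_skills`, `jaccard`, and `dev_effects`, and the example task and skill strings, are hypothetical.

```python
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two task descriptions."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def select_skills(
    test_task: str,
    skills: List[str],
    dev_effects: Dict[str, Dict[str, float]],  # dev task -> {skill: measured effect}
    threshold: float = 0.0,
) -> List[str]:
    """Keep only skills whose predicted effect on this task is not negative.

    Assumption: the prediction is copied from the nearest dev task, a
    stand-in for whatever predictor the real framework uses.
    """
    if not dev_effects:
        return list(skills)  # nothing measured yet, so suppress nothing
    nearest = max(dev_effects, key=lambda dev_task: jaccard(test_task, dev_task))
    predicted = dev_effects[nearest]
    # Skills with no measurement default to 0.0 and are kept.
    return [s for s in skills if predicted.get(s, 0.0) >= threshold]


# Usage with made-up measurements:
dev_effects = {
    "book a flight to paris": {"refund_policy": -0.2, "seat_lookup": 0.3},
    "return a damaged item": {"refund_policy": 0.4, "seat_lookup": -0.1},
}
kept = select_skills(
    "book a flight to rome",
    ["refund_policy", "seat_lookup", "greeting"],
    dev_effects,
)
print(kept)  # ['seat_lookup', 'greeting']
```

The agent's weights are never touched: only the set of skills placed in the prompt changes from task to task.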
Topics
- LLM Agents
- Skill Curation
- Causal Attribution
- Randomized Masking
- AppWorld Benchmark
- tau-bench Benchmark
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.