Not All Skills Help: Measuring and Repairing Agent Knowledge

2026-06-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

A new framework called ASSAY addresses the challenge of managing natural-language skills in LLM agents, which often accumulate skills that can hinder performance on specific tasks despite appearing beneficial in aggregate. Current systems rely solely on LLM judgment for skill curation, conflating skill generation with empirical validation. ASSAY separates these roles by measuring per-skill causal contributions through randomized masking on a small development set. It then restructures the skill library offline and suppresses skills predicted to have negative effects for each test task. This approach consistently improves performance across seven base models from four providers and two benchmarks (AppWorld and tau-bench). On AppWorld's hardest split, DeepSeek-V3 achieved 69.3% task-goal completion, a 47.4% relative improvement and a new state of the art. GPT-4.1 improved by 8.7% relative on tau-bench retail, surpassing o4-mini, o1, and GPT-4.5 without weight modification. The primary gain stems from per-task masking, indicating that matching skills to tasks at inference time is crucial.

Key takeaway

Machine learning engineers optimizing LLM agent performance should move beyond aggregate skill curation. Use empirical per-skill causal attribution, as in the ASSAY framework, to identify and suppress skills that hurt specific tasks. In the reported results this produced gains such as DeepSeek-V3's 47.4% relative improvement on AppWorld's hardest split, with no weight modification. The main lever is matching skills to tasks at inference time.

Key insights

LLM agent performance improves when skills are curated empirically per task rather than by aggregate LLM judgment.

Principles

Separate skill generation from empirical curation.
Measure per-skill causal contributions via randomized masking.
Skills often help on some tasks and hurt on others.
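
To make the randomized-masking principle concrete, here is a minimal sketch, not ASSAY's actual implementation. It assumes a hypothetical run_agent(task, active_skills) callable that returns a score, masks each skill independently with probability 0.5, and estimates each skill's effect as a simple difference in mean score with and without the skill. The function name, parameters, skill names, and toy agent are all illustrative assumptions.

```python
import random
from statistics import mean


def estimate_skill_effects(skills, dev_tasks, run_agent,
                           n_trials=200, keep_prob=0.5, seed=0):
    """Estimate each skill's average effect on the score via randomized masking.

    run_agent(task, active_skills) -> float is a hypothetical stand-in for
    running the agent on one dev task with only `active_skills` visible.
    """
    rng = random.Random(seed)
    with_skill = {s: [] for s in skills}
    without_skill = {s: [] for s in skills}

    for _ in range(n_trials):
        task = rng.choice(dev_tasks)
        # Keep or mask every skill independently, so inclusion is randomized
        # and not correlated with the task or with other skills.
        active = {s for s in skills if rng.random() < keep_prob}
        score = run_agent(task, active)
        for s in skills:
            (with_skill if s in active else without_skill)[s].append(score)

    effects = {}
    for s in skills:
        if with_skill[s] and without_skill[s]:
            # Difference in means: average score with the skill minus without.
            effects[s] = mean(with_skill[s]) - mean(without_skill[s])
        else:
            effects[s] = 0.0  # no contrast observed for this skill
    return effects


# Toy agent (assumption): one skill adds 0.3 to the score, one subtracts 0.3.
def toy_agent(task, active):
    return 0.5 + 0.3 * ("helpful" in active) - 0.3 * ("harmful" in active)


effects = estimate_skill_effects(
    ["helpful", "harmful", "neutral"], ["t1", "t2", "t3"], toy_agent
)
```

Because masking is randomized, the difference in means can be read as an estimate of the skill's causal contribution on the dev set rather than a correlation with which tasks happened to use it.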

Method

ASSAY computes per-skill causal attribution on a dev set, restructures the library, and suppresses skills with negative predicted effects for each test task at inference time.
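
The per-task suppression step can be sketched as follows. The summary does not say how ASSAY predicts a skill's effect on an unseen task, so this sketch swaps in a deliberately simple stand-in: it assumes each task carries a hypothetical category label and reuses the dev-set effect estimated for that category. The record format, function names, skill names, and categories are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_effects_by_category(records, library):
    """Estimate per-(category, skill) effects from randomized-masking runs.

    records: list of (category, active_skills, score) tuples, one per dev run.
    The category label is an assumed task feature, not part of the source.
    """
    scores = defaultdict(lambda: {True: [], False: []})
    for category, active, score in records:
        for skill in library:
            scores[(category, skill)][skill in active].append(score)

    effects = {}
    for key, groups in scores.items():
        # Only report an effect when both arms (with / without) were observed.
        if groups[True] and groups[False]:
            effects[key] = mean(groups[True]) - mean(groups[False])
    return effects


def select_skills(library, category, effects):
    """Keep all skills except those with a negative predicted effect."""
    return [s for s in library if effects.get((category, s), 0.0) >= 0.0]


# Hypothetical dev-set runs: the skill helps on one category, hurts on another.
library = ["retry_login", "paginate"]
records = [
    ("music", {"retry_login"}, 1.0),
    ("music", set(), 0.0),
    ("payments", {"retry_login"}, 0.0),
    ("payments", set(), 1.0),
]
effects = fit_effects_by_category(records, library)
music_skills = select_skills(library, "music", effects)
payment_skills = select_skills(library, "payments", effects)
```

Skills with no measured contrast default to an effect of 0.0 and are kept, which matches the summary's description of suppressing only skills predicted to hurt. A real predictor would likely use richer task features than a single category label.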

In practice

Use randomized masking for skill evaluation.
Implement per-task skill suppression.
Apply to existing LLM agents without weight updates.

Topics

LLM Agents
Skill Curation
Causal Attribution
Randomized Masking
AppWorld Benchmark
tau-bench Benchmark

Code references

aiming-lab/assay

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.