Is it agentic enough? Benchmarking open models on your own tooling
Summary
This article introduces a benchmarking tool and methodology for evaluating how effectively AI agents work with a software library, using "transformers" as the case study. The approach measures not only task completion but also the "effort" an agent expends: time, token usage, errors, and the path taken. The benchmark uses the "pi" coding agent and Hugging Face Jobs to run in parallel across models, library revisions, and tasks. It compares three interaction variants: a bare library install, a clone of the full source, and a "Skill" (curated documentation and examples). For "transformers", a commit that added a new CLI and Skill reduced task completion time for large models by promoting CLI adoption, but it increased token usage in the "clone" variant because agents read the new code. The same change hurt smaller models, which struggled with the added context and sometimes failed tasks they had previously completed. The tool, "agent-eval", is open source.
Key takeaway
Library maintainers who optimize tools for AI agents should evaluate changes across a spectrum of model sizes. A CLI or Skill that improves efficiency for large models can introduce ambiguity for smaller ones or break them outright, as happened with "transformers". Before merging new API affordances, use an agentic benchmarking tool such as "agent-eval" to measure agent effort and behavior, not just final accuracy, so that regressions for less capable agents surface early.
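The cross-model check described above can be sketched in a few lines of Python. This is a minimal illustration, not part of "agent-eval": the function name `find_regressions`, the dictionary layout, and the model names are all assumptions, and the numbers are placeholders.

```python
def find_regressions(before, after, tolerance=0.0):
    """Return models whose match rate dropped by more than `tolerance`.

    `before` and `after` map a model name to its match rate (0.0 to 1.0)
    on the same task set, measured at two library revisions.
    Hypothetical helper; not an agent-eval API.
    """
    regressions = {}
    for model, old_rate in before.items():
        new_rate = after.get(model)
        if new_rate is None:
            continue  # model was not re-run at the new revision
        if old_rate - new_rate > tolerance:
            regressions[model] = (old_rate, new_rate)
    return regressions


# Placeholder numbers for illustration only; not results from the benchmark.
before = {"large-model": 0.9, "small-model": 0.7}
after = {"large-model": 0.9, "small-model": 0.5}
print(find_regressions(before, after))
```

Running the comparison per model, rather than on an average over all models, is what lets a drop in a smaller model show up even when the larger ones are unchanged or improved.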
Key insights
Agentic software optimization requires benchmarking not just task success, but also agent effort and interaction patterns across diverse models.
Principles
- If it isn't tested, then it doesn't work.
- If it isn't documented, then it doesn't exist.
- Agent-facing APIs need evaluation across model sizes.
Method
The "agent-eval" harness runs each task under the "bare", "clone", and "skill" variants across models and library revisions, using Hugging Face Jobs for execution. For each run it records match %, time, tokens, errors, and "marker adoption" (for example, whether the agent used the CLI).
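To make the metrics concrete, here is a minimal Python sketch of how per-run records could be aggregated by model and variant. The `RunResult` fields and the `summarize` function are hypothetical names chosen for illustration; they do not reflect the actual "agent-eval" schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    """One agent run; field names are assumptions, not agent-eval's schema."""
    model: str
    variant: str        # "bare", "clone", or "skill"
    matched: bool       # did the output match the expected answer?
    seconds: float
    tokens: int
    errors: int
    used_marker: bool   # e.g. the agent invoked the CLI at least once


def summarize(runs):
    """Group runs by (model, variant) and compute the per-group metrics."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.model, run.variant)].append(run)

    summary = {}
    for key, group in groups.items():
        n = len(group)
        summary[key] = {
            "match_pct": 100.0 * sum(r.matched for r in group) / n,
            "mean_seconds": mean(r.seconds for r in group),
            "mean_tokens": mean(r.tokens for r in group),
            "total_errors": sum(r.errors for r in group),
            "marker_adoption_pct": 100.0 * sum(r.used_marker for r in group) / n,
        }
    return summary
```

Keeping effort metrics (time, tokens, errors) next to the match percentage in one summary makes it visible when a change leaves accuracy flat but raises the cost of getting there.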
In practice
- Use "agent-eval" to benchmark your library's agentic performance.
- Define custom markers to track specific agent behaviors.
- Generate and validate Skills against weaker models upfront.
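One way to picture a custom marker is as a named pattern matched against the shell commands an agent ran. The Python sketch below assumes that representation; the marker names, the regular expressions, and the helper functions are illustrative assumptions, not how "agent-eval" defines markers.

```python
import re

# Hypothetical marker definitions: name -> regex matched against each shell
# command the agent ran. Both patterns are illustrative assumptions.
MARKERS = {
    # A command that starts with the library's CLI entry point.
    "cli": re.compile(r"^\s*transformers\s+\S+"),
    # A command that reads or searches a Python source file.
    "read_source": re.compile(r"^\s*(cat|grep|rg|sed)\b.*\.py\b"),
}


def markers_hit(commands):
    """Return the set of marker names matched by at least one command."""
    return {
        name
        for name, pattern in MARKERS.items()
        if any(pattern.search(cmd) for cmd in commands)
    }


def adoption_rate(runs, marker):
    """Fraction of runs (each a list of commands) in which `marker` appears."""
    if not runs:
        return 0.0
    return sum(marker in markers_hit(cmds) for cmds in runs) / len(runs)
```

Tracking adoption per model and variant shows whether a new affordance such as a CLI is actually being picked up, which is separate from whether the task succeeded.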
Topics
- Agentic Benchmarking
- LLM Tooling
- Hugging Face Transformers
- API Design
- Model Evaluation
- Coding Agents
Best for: AI Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.