Is it agentic enough? Benchmarking open models on your own tooling

2026-06-18 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, long

Summary

A new benchmarking tool and methodology evaluates how effectively AI agents interact with software libraries, specifically "transformers". It measures not just whether a task is completed but also the "effort" an agent expends: time, token usage, errors, and the path taken. The benchmark uses the "pi" coding agent and Hugging Face Jobs to run in parallel across models, library revisions, and tasks. It compares three interaction variants: a bare library install, a clone of the full source, and a "Skill" (curated documentation and examples). For "transformers", a commit that added a new CLI and Skill reduced task completion time for large models by promoting CLI adoption, but it increased token usage in the "clone" variant because agents read the new code. The same change hurt smaller models, which struggled with the added context and sometimes failed tasks they had previously completed. The open-source tool, "agent-eval", is available for use.

Key takeaway

Library maintainers optimizing tools for AI agents should evaluate changes across a spectrum of model sizes. A CLI or Skill that improves efficiency for large models can introduce ambiguity for smaller ones or break them outright, as seen with "transformers". Before merging new API affordances, use agentic benchmarking tools such as "agent-eval" to measure agent effort and behavior, not just final accuracy, so that regressions for less capable agents are caught early.
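One way to act on this takeaway is a per-model regression check between two library revisions. The following is a minimal sketch, not the "agent-eval" API: the Result record, the find_regressions helper, the model and revision names, and all numbers are hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One aggregated benchmark row (hypothetical schema)."""
    model: str
    revision: str
    match_pct: float  # share of tasks whose output matched, 0-100
    tokens: int       # total tokens spent


def find_regressions(results, before, after, tolerance=0.0):
    """Return models whose match % dropped from `before` to `after`."""
    by_key = {(r.model, r.revision): r for r in results}
    regressed = []
    for model in sorted({r.model for r in results}):
        old = by_key.get((model, before))
        new = by_key.get((model, after))
        if old is None or new is None:
            continue  # model was not run on both revisions
        if new.match_pct < old.match_pct - tolerance:
            regressed.append(model)
    return regressed


# Invented numbers: the large model improves, the small one regresses.
results = [
    Result("large-model", "rev-a", 90.0, 120_000),
    Result("large-model", "rev-b", 95.0, 100_000),
    Result("small-model", "rev-a", 60.0, 80_000),
    Result("small-model", "rev-b", 40.0, 150_000),
]

print(find_regressions(results, "rev-a", "rev-b"))  # ['small-model']
```

Checking each model separately, rather than averaging over all of them, keeps a gain on large models from hiding a loss on small ones.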

Key insights

Agentic software optimization requires benchmarking not just task success, but also agent effort and interaction patterns across diverse models.

Principles

If it isn't tested, then it doesn't work.
If it isn't documented, then it doesn't exist.
Agent-facing APIs need evaluation across model sizes.

Method

The "agent-eval" harness runs tasks under "bare", "clone", and "skill" variants, measuring match %, time, tokens, errors, and "marker adoption" (e.g., CLI use) across models and library revisions using Hugging Face Jobs.
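The metrics named above can be aggregated per model and variant roughly as follows. This is an illustrative sketch that assumes each run is a plain dictionary with the fields shown; the field names, the summarize helper, and the sample values are hypothetical and do not reflect the actual "agent-eval" output format.

```python
from collections import defaultdict
from statistics import mean


def summarize(runs):
    """Group runs by (model, variant) and compute effort metrics."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run["model"], run["variant"])].append(run)

    summary = {}
    for key, group in groups.items():
        n = len(group)
        summary[key] = {
            # Booleans sum as 0/1, giving the count of successes.
            "match_pct": 100.0 * sum(r["matched"] for r in group) / n,
            "mean_seconds": mean(r["seconds"] for r in group),
            "mean_tokens": mean(r["tokens"] for r in group),
            "errors": sum(r["errors"] for r in group),
            "marker_adoption_pct": 100.0 * sum(r["used_marker"] for r in group) / n,
        }
    return summary


# Invented sample runs for one model under two variants.
runs = [
    {"model": "small-model", "variant": "skill", "matched": True,
     "seconds": 40.0, "tokens": 1000, "errors": 1, "used_marker": True},
    {"model": "small-model", "variant": "skill", "matched": False,
     "seconds": 60.0, "tokens": 3000, "errors": 2, "used_marker": False},
    {"model": "small-model", "variant": "bare", "matched": True,
     "seconds": 30.0, "tokens": 500, "errors": 0, "used_marker": False},
]

for key, metrics in summarize(runs).items():
    print(key, metrics)
```

Reporting time, tokens, and errors alongside match % makes visible the case described in the summary, where a change speeds up task completion while raising token usage.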

In practice

Use "agent-eval" to benchmark your library's agentic performance.
Define custom markers to track specific agent behaviors.
Generate and validate Skills against weaker models upfront.
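A custom marker can be thought of as a named pattern checked against the commands an agent ran. The sketch below assumes the agent's shell commands are available as a list of strings and uses a hypothetical library called "mylib"; the MARKERS table and the detect_markers helper are illustrative and are not how "agent-eval" defines markers.

```python
import re

# Hypothetical markers for a library named "mylib".
MARKERS = {
    # "mylib <subcommand>" at the start of a command or after ; & |
    "cli_use": re.compile(r"(^|[;&|]\s*)mylib\s+\w+"),
    # Python API use, e.g. inside `python -c "..."`
    "python_api": re.compile(r"\bimport mylib\b|\bfrom mylib\b"),
}


def detect_markers(commands):
    """Return, for each marker, whether any command matched it."""
    found = {name: False for name in MARKERS}
    for command in commands:
        for name, pattern in MARKERS.items():
            if pattern.search(command):
                found[name] = True
    return found


commands = [
    "pip install mylib",
    "cd /tmp && mylib classify --text hello",
]
print(detect_markers(commands))
```

Averaging the per-run results over many runs yields an adoption rate, which shows whether a new affordance such as a CLI is actually being picked up.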

Topics

Agentic Benchmarking
LLM Tooling
Hugging Face Transformers
API Design
Model Evaluation
Coding Agents

Best for: AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.