AgentFloor: How Far Up the Tool Use Ladder Can Small Open-Weight Models Go?

2026-04-27 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

AgentFloor introduces a deterministic 30-task benchmark, organized into a six-tier capability ladder, to evaluate how well small open-weight models handle agentic tool use compared with large frontier models such as GPT-5. The benchmark spans instruction following, single- and multi-step tool use, conditional branching, multi-source synthesis, and long-horizon planning under persistent constraints. Evaluating 16 open-weight models (0.27B to 32B parameters) and GPT-5 across 16,542 runs, the study found that small and mid-sized open-weight models are sufficient for routine, short-horizon, structured tool use, often matching GPT-5's aggregate performance while being significantly cheaper (up to 15x) and faster (2.5x). The main gap remains in long-horizon planning, where frontier models hold an advantage, though neither frontier nor open-weight models achieve high reliability there. Interventions to close this gap were model-specific rather than universal, and the results suggest that parameter count is not the sole predictor of agentic capability.

Key takeaway

AI architects designing agentic systems should route tasks by complexity. Deploy smaller, cost-effective open-weight models (e.g., sub-5B models such as nemotron-3-nano:4b or ministral-3:3b) for instruction following, single-tool use, and sequential chaining, where their performance is comparable or superior to GPT-5's at significantly lower cost and latency. Reserve larger frontier models for complex, long-horizon planning tasks (Tier E), keeping in mind that even frontier models are not yet reliable in this domain, and expect to apply model-specific tuning rather than universal fixes.

Key insights

Small open-weight models can handle most routine agentic tool use; frontier models are best reserved for complex long-horizon planning.

Principles

The cost-per-passed-task Pareto frontier is dominated by open-weight models.
Agentic capability does not scale monotonically with parameter count.
Interventions for performance gaps are often model-specific.
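
To make the cost-per-passed-task principle concrete, the following is a minimal Python sketch of how such a metric and its Pareto frontier could be computed. It assumes the metric is total spend divided by the number of passed runs, which is one plausible reading and not a definition confirmed by this summary. The class and function names are hypothetical, and the figures are made-up placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Aggregate outcome for one model (hypothetical structure)."""
    name: str
    total_cost_usd: float  # total spend across all runs
    runs: int              # number of runs attempted
    passed: int            # number of runs that passed

    @property
    def pass_rate(self) -> float:
        return self.passed / self.runs

    @property
    def cost_per_passed_task(self) -> float:
        # Assumed definition: spend divided by passes; infinite if nothing passed.
        return float("inf") if self.passed == 0 else self.total_cost_usd / self.passed


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Keep models that no other model beats on both cost and pass rate."""
    frontier = []
    for r in results:
        dominated = any(
            o.cost_per_passed_task <= r.cost_per_passed_task
            and o.pass_rate >= r.pass_rate
            and (
                o.cost_per_passed_task < r.cost_per_passed_task
                or o.pass_rate > r.pass_rate
            )
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_per_passed_task)


# Placeholder numbers for illustration only; NOT taken from the paper.
results = [
    ModelResult("small-open", total_cost_usd=1.0, runs=100, passed=80),
    ModelResult("mid-open", total_cost_usd=3.0, runs=100, passed=85),
    ModelResult("frontier-api", total_cost_usd=15.0, runs=100, passed=84),
]
frontier_names = [r.name for r in pareto_frontier(results)]
```

With these placeholder inputs, the frontier-api entry is dominated because mid-open is both cheaper per pass and passes more often, which is the shape of result the principle describes.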

Method

AgentFloor uses a six-tier, 30-task deterministic benchmark with eight abstract tools and an in-memory database to isolate cognitive demands and evaluate native tool-calling control.
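
As a rough illustration of what a deterministic tool-use check over an in-memory database can look like, here is a minimal Python sketch. It is not the AgentFloor harness: this summary does not name the eight abstract tools, so the `lookup` and `store` tools, the call format, and the pass criterion (final database state equals an expected state) are all assumptions made for the example.

```python
from typing import Any


class InMemoryDB:
    """Tiny key-value store standing in for the benchmark's database."""

    def __init__(self, records: dict[str, Any]):
        self._records = dict(records)

    def get(self, key: str) -> Any:
        return self._records.get(key)

    def put(self, key: str, value: Any) -> None:
        self._records[key] = value

    def snapshot(self) -> dict[str, Any]:
        return dict(self._records)


def make_tools(db: InMemoryDB) -> dict:
    """Build hypothetical abstract tools bound to one database instance."""

    def lookup(key: str) -> Any:
        return db.get(key)

    def store(key: str, value: Any) -> None:
        db.put(key, value)

    return {"lookup": lookup, "store": store}


def run_task(tool_calls: list[dict], initial: dict, expected: dict) -> bool:
    """Replay a model's tool calls and compare the final state exactly.

    No randomness or external services are involved, so the same calls
    always produce the same verdict.
    """
    db = InMemoryDB(initial)
    tools = make_tools(db)
    for call in tool_calls:
        fn = tools.get(call["name"])
        if fn is None:
            return False  # the model called a tool that does not exist
        try:
            fn(**call["arguments"])
        except TypeError:
            return False  # wrong or missing arguments
    return db.snapshot() == expected


# Hypothetical trace: read an order, then update its status.
calls = [
    {"name": "lookup", "arguments": {"key": "order-1"}},
    {"name": "store", "arguments": {"key": "order-1", "value": "shipped"}},
]
passed = run_task(calls, {"order-1": "pending"}, {"order-1": "shipped"})
```

Scoring on final state rather than on model text is one way to keep grading deterministic; whether the benchmark grades this way is not stated here.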

In practice

Route simple agent tasks to sub-5B open-weight models.
Reserve frontier APIs for long-horizon planning tasks.
Consider model-specific interventions for performance improvements.
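
The routing advice above can be sketched as a small lookup table in Python. The task categories follow the benchmark description, and the small and frontier model names come from the takeaway. The assignments for conditional branching and multi-source synthesis are assumptions, since this summary does not say where those should go, and `mid-sized-open-weight` is a hypothetical placeholder rather than a real model name.

```python
from enum import Enum


class TaskKind(Enum):
    INSTRUCTION_FOLLOWING = "instruction_following"
    SINGLE_TOOL = "single_tool"
    SEQUENTIAL_CHAIN = "sequential_chain"
    CONDITIONAL_BRANCHING = "conditional_branching"
    MULTI_SOURCE_SYNTHESIS = "multi_source_synthesis"
    LONG_HORIZON_PLANNING = "long_horizon_planning"


SMALL_MODEL = "nemotron-3-nano:4b"   # sub-5B example named in the takeaway
MID_MODEL = "mid-sized-open-weight"  # hypothetical placeholder
FRONTIER_MODEL = "gpt-5"

ROUTES = {
    # Routine, short-horizon work goes to the small model.
    TaskKind.INSTRUCTION_FOLLOWING: SMALL_MODEL,
    TaskKind.SINGLE_TOOL: SMALL_MODEL,
    TaskKind.SEQUENTIAL_CHAIN: SMALL_MODEL,
    # Assumption: the middle of the ladder goes to a mid-sized model.
    TaskKind.CONDITIONAL_BRANCHING: MID_MODEL,
    TaskKind.MULTI_SOURCE_SYNTHESIS: MID_MODEL,
    # Long-horizon planning is reserved for the frontier API.
    TaskKind.LONG_HORIZON_PLANNING: FRONTIER_MODEL,
}


def route(kind: TaskKind) -> str:
    """Return the model name to use for a given task category."""
    return ROUTES[kind]
```

A static table like this is easy to audit and to override per model, which fits the finding that interventions tend to be model-specific; a production router would likely also need a fallback path for failed runs.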

Topics

AgentFloor Benchmark
Open-Weight Models
LLM Tool Use
Long-Horizon Planning
Cost-Performance Analysis

Code references

berkeley-function-calling-leaderboard/bfcl

Best for: CTO, AI Architect, Machine Learning Engineer, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.