AgentFloor: How Far Up the Tool Use Ladder Can Small Open-Weight Models Go?

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

AgentFloor introduces a deterministic 30-task benchmark, organized into a six-tier capability ladder, to evaluate how well small open-weight models handle agentic tool use compared with large frontier models such as GPT-5. The benchmark spans instruction following, single-step and multi-step tool use, conditional branching, multi-source synthesis, and long-horizon planning under persistent constraints. Across 16,542 runs covering 16 open-weight models (0.27B to 32B parameters) and GPT-5, the study found that small and mid-sized open-weight models are sufficient for routine, short-horizon, structured tool use, often matching GPT-5's aggregate performance while being up to 15x cheaper and 2.5x faster. The main gap is in long-horizon planning, where frontier models hold an advantage, although neither frontier nor open-weight models achieve high reliability there. Interventions aimed at closing this gap were model-specific rather than universal, which suggests that parameter count is not the sole predictor of agentic capability.

Key takeaway

AI architects designing agentic systems should route tasks by complexity. Deploy smaller, cost-effective open-weight models (for example, sub-5B models such as nemotron-3-nano:4b or ministral-3:3b) for instruction following, single-tool use, and sequential chaining, where they perform comparably to or better than GPT-5 at significantly lower cost and latency. Reserve larger frontier models for complex long-horizon planning tasks (Tier E), keeping in mind that even frontier models are not yet reliable in this domain, and expect to tune per model rather than rely on universal fixes.
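The routing advice above can be sketched as a small lookup table. This is a minimal illustration, not something taken from the paper: the `TaskKind` categories mirror the task types named in the summary, `MID_MODEL` is a hypothetical placeholder, and sending conditional branching and multi-source synthesis to a mid-sized model is an assumption, since the takeaway only states where small and frontier models belong.

```python
from enum import Enum


class TaskKind(Enum):
    """Task categories named in the summary (labels are illustrative)."""
    INSTRUCTION_FOLLOWING = "instruction_following"
    SINGLE_TOOL = "single_tool"
    SEQUENTIAL_CHAINING = "sequential_chaining"
    CONDITIONAL_BRANCHING = "conditional_branching"
    MULTI_SOURCE_SYNTHESIS = "multi_source_synthesis"
    LONG_HORIZON_PLANNING = "long_horizon_planning"


SMALL_MODEL = "nemotron-3-nano:4b"          # sub-5B model named in the takeaway
MID_MODEL = "mid-size-open-weight-model"    # hypothetical placeholder name
FRONTIER_MODEL = "gpt-5"                    # frontier model named in the summary

ROUTES = {
    # Stated in the takeaway: small models suffice for these.
    TaskKind.INSTRUCTION_FOLLOWING: SMALL_MODEL,
    TaskKind.SINGLE_TOOL: SMALL_MODEL,
    TaskKind.SEQUENTIAL_CHAINING: SMALL_MODEL,
    # Assumption: the takeaway does not assign these two explicitly.
    TaskKind.CONDITIONAL_BRANCHING: MID_MODEL,
    TaskKind.MULTI_SOURCE_SYNTHESIS: MID_MODEL,
    # Stated in the takeaway: reserve frontier models for this.
    TaskKind.LONG_HORIZON_PLANNING: FRONTIER_MODEL,
}


def route(kind: TaskKind) -> str:
    """Return the model identifier to use for a given task category."""
    return ROUTES[kind]
```

Keeping the mapping in a plain dictionary makes it easy to override per deployment, which fits the finding that what helps is model-specific rather than universal.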

Key insights

Small open-weight models can handle most routine agentic tool use, so frontier models can be reserved for complex long-horizon planning.

Method

AgentFloor uses a six-tier, 30-task deterministic benchmark with eight abstract tools and an in-memory database to isolate cognitive demands and evaluate native tool-calling control.
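To make the idea of deterministic grading against an in-memory database concrete, here is a minimal sketch. It is not AgentFloor's harness: the two tools (`set_record`, `get_record`), the `Task` fields, and the exact-state grading rule are all hypothetical stand-ins for the eight abstract tools and the scoring the paper actually uses.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class InMemoryDB:
    """Toy stand-in for a benchmark's in-memory database."""
    records: dict[str, Any] = field(default_factory=dict)


def set_record(db: InMemoryDB, key: str, value: Any) -> str:
    db.records[key] = value
    return "ok"


def get_record(db: InMemoryDB, key: str) -> Any:
    return db.records.get(key)


# Hypothetical tool registry; the real benchmark defines eight abstract tools.
TOOLS: dict[str, Callable[..., Any]] = {
    "set_record": set_record,
    "get_record": get_record,
}


@dataclass
class Task:
    prompt: str
    initial: dict[str, Any]
    expected: dict[str, Any]  # database state a correct run must end in


def run_calls(task: Task, calls: list[dict[str, Any]]) -> InMemoryDB:
    """Replay a model's tool calls against a fresh copy of the initial state."""
    db = InMemoryDB(dict(task.initial))
    for call in calls:
        TOOLS[call["name"]](db, **call["arguments"])
    return db


def grade(task: Task, calls: list[dict[str, Any]]) -> bool:
    """Pass only if every call is valid and the final state matches exactly."""
    try:
        db = run_calls(task, calls)
    except (KeyError, TypeError):  # unknown tool or malformed arguments
        return False
    return db.records == task.expected
```

Because the database starts from a fixed state and the check is an exact comparison, the same sequence of tool calls always receives the same score, which is the property a deterministic benchmark relies on.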

Best for: CTO, AI Architect, Machine Learning Engineer, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.