Switchcraft: AI Model Router for Agentic Tool Calling

1990-05-15 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

Microsoft Research introduces Switchcraft, a model router built for agentic tool calling, where large language model (LLM) inference costs are high. Unlike existing routers designed for chat completion, Switchcraft operates inline and selects the lowest-cost model predicted to answer correctly. The authors evaluated it with a unified framework spanning five function-calling benchmarks and trained a 66M-parameter DistilBERT-based classifier as the router. Switchcraft reached 82.9% accuracy, slightly above the best individual model (GPT-5.3-chat at 82.3%), while cutting inference cost by 84%, a saving of more than $3,600 per million queries. The research also found that larger, more expensive models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can cost more in total because of verbose, token-intensive reasoning.

Key takeaway

For AI architects and CTOs deploying agentic AI systems, Switchcraft offers a way to contain rising inference costs without sacrificing accuracy. The reported savings exceed $3,600 per million queries, which makes agentic deployments cheaper to scale and challenges the assumption that larger or newer models are always better at tool calling. Evaluate LLM choices on profiled per-query cost and task-specific accuracy, not on list prices or model size.

Key insights

Switchcraft cuts agentic tool-calling costs by routing each query to the cheapest LLM predicted to answer it correctly.

Principles

Costlier LLMs do not always yield better tool-calling accuracy.
Open-weight models currently lag proprietary models in tool calling.
Chat-tuned routers are insufficient for agentic tool-calling workloads.

Method

Switchcraft fine-tunes a DistilBERT classifier on agentic benchmarks, using an AST-based correctness checker, to predict which candidate models will handle a query correctly. It then selects the cheapest predicted-correct LLM based on profiled cost.
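
The selection step can be illustrated with a minimal Python sketch. It assumes the classifier has already produced a per-model probability of a correct tool call. The `Candidate` type, the `route` function, the 0.5 default threshold, the fallback rule, and all model names and costs are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_query: float  # profiled average USD per query (made-up numbers)


def route(
    p_correct: Mapping[str, float],
    candidates: Sequence[Candidate],
    threshold: float = 0.5,
) -> Candidate:
    """Pick the cheapest candidate predicted correct at the given threshold.

    Assumption: if no candidate clears the threshold, fall back to the one
    with the highest predicted probability. The paper may handle this
    case differently.
    """
    eligible = [c for c in candidates if p_correct.get(c.name, 0.0) >= threshold]
    if eligible:
        return min(eligible, key=lambda c: c.cost_per_query)
    return max(candidates, key=lambda c: p_correct.get(c.name, 0.0))


# Hypothetical candidate pool and classifier outputs for one query.
CANDIDATES = [
    Candidate("small", 0.0004),
    Candidate("medium", 0.0020),
    Candidate("large", 0.0100),
]
P_CORRECT = {"small": 0.30, "medium": 0.80, "large": 0.90}

chosen = route(P_CORRECT, CANDIDATES)
print(chosen.name)
```

Raising `threshold` trades cost for accuracy: with these made-up numbers, a threshold of 0.85 excludes the medium model and routes the query to the large one.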

In practice

Use profiled per-query cost, not just per-token price, for LLM selection.
Implement intelligent token packing to preserve critical context for routing.
Consider adaptive thresholding for accuracy-cost trade-offs in routing.
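
The first point can be made concrete with a short Python sketch of profiled per-query cost. All prices and token counts below are invented for illustration, and the `Usage`, `query_cost`, and `profiled_cost` names are hypothetical. The sketch only shows how a model with a lower per-token price can still cost more per query when it emits far more tokens.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int  # includes any reasoning tokens billed as output


def query_cost(u: Usage, in_price: float, out_price: float) -> float:
    """USD cost of one query; prices are USD per million tokens."""
    return (u.input_tokens * in_price + u.output_tokens * out_price) / 1_000_000


def profiled_cost(usages: Iterable[Usage], in_price: float, out_price: float) -> float:
    """Average USD cost per query over a profiled sample of real requests."""
    return mean(query_cost(u, in_price, out_price) for u in usages)


# Made-up profile: the "pricey" model answers tersely, while the "cheap"
# model produces long reasoning traces before each tool call.
pricey_terse = profiled_cost([Usage(1000, 50), Usage(1200, 60)], in_price=2.0, out_price=8.0)
cheap_verbose = profiled_cost([Usage(1000, 4000), Usage(1200, 5000)], in_price=0.5, out_price=2.0)

print(f"pricey but terse:  ${pricey_terse:.4f} per query")
print(f"cheap but verbose: ${cheap_verbose:.4f} per query")
```

In this invented profile the model with the lower list price ends up several times more expensive per query, which is why the cost fed to a router should come from measured token usage instead of the price sheet.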

Topics

Agentic AI Systems
Model Routing
Tool Calling
Switchcraft Router
DistilBERT Classifier

Code references

vllm-project/semantic-router

Best for: AI Architect, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.