Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery
Summary
A study on scaling enterprise agent routing for production LLM assistants reveals significant accuracy degradation as tool catalogs grow. Researchers evaluated three frontier models on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, finding that routing F1 on under-specified requests dropped 16-23 percentage points across models when scaling from 10 to 110 agents. An oracle analysis attributed this degradation to both a retrieval gap and a confusion gap, with the oracle ceiling dropping 10 percentage points. Implementing embedding-based shortlisting successfully recovered +10-11 percentage points in F1 score at full scale across all models and providers. A subsequent production annotation study, involving 1,435 human-labeled utterances, further confirmed a +10-17 percentage point recovery on real traffic, albeit with 10-15 percentage points lower absolute performance.
Key takeaway
If you are an AI engineer scaling an LLM-powered enterprise assistant, expect routing accuracy to degrade as the agent and tool catalog grows. Embedding-based shortlisting recovers a substantial share of the lost F1: the study reports gains of 10-17 percentage points on real traffic. Decompose routing failures into retrieval and confusion gaps so that optimization effort goes to the stage that is actually failing.
Key insights
Routing accuracy for LLM agents degrades as the catalog scales, through both a retrieval gap and a confusion gap, and embedding-based shortlisting recovers much of the lost performance.
Principles
- Routing accuracy degrades with scale.
- Degradation stems from a retrieval gap and a confusion gap.
- Embedding shortlisting boosts routing F1.
Method
Embedding-based shortlisting is applied to recover routing F1. This involves using embeddings to pre-filter or rank potential tools before the LLM makes a final routing decision, mitigating retrieval and confusion gaps.
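A minimal sketch of the shortlisting step described above, under stated assumptions. The summary does not say which embedding model, similarity measure, or shortlist size the study used, so this example substitutes a toy bag-of-words vector and cosine similarity purely for illustration. The catalog entries and the names `embed`, `cosine`, and `shortlist` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: token -> count (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shortlist(query, catalog, k=3):
    """Return the k agent names whose descriptions are most similar to the query.

    Ties are broken alphabetically so the result is deterministic.
    """
    q = embed(query)
    scored = [(cosine(q, embed(desc)), name) for name, desc in catalog.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical four-agent catalog; a production catalog would be far larger.
catalog = {
    "calendar_agent": "schedule meetings and manage calendar events",
    "email_agent": "send read and search email messages",
    "expense_agent": "submit and approve expense reports",
    "hr_agent": "vacation requests and leave balance",
}

# Only the shortlisted agents would be shown to the LLM for the final decision.
candidates = shortlist("schedule a meeting on my calendar", catalog, k=2)
print(candidates)
```

In a real system the `embed` function would call an embedding model, and the catalog vectors would be computed once and cached rather than re-embedded on every request.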
In practice
- Implement embedding shortlisting for large agent catalogs.
- Decompose routing errors into retrieval failures and confusion failures.
- Evaluate F1 on under-specified routing requests.
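The error decomposition and F1 evaluation listed above could be sketched as follows. The summary does not give the study's exact definitions, so this example assumes that a retrieval error means the correct agent never reached the shortlist, and a confusion error means the correct agent was shortlisted but the router chose another. The F1 here is micro-averaged over sets of agents, which is also an assumption. The records and the names `classify_error` and `routing_f1` are hypothetical.

```python
from collections import Counter


def classify_error(gold, shortlist, predicted):
    """Label one routing decision (definitions are assumptions, see lead-in)."""
    if predicted == gold:
        return "correct"
    if gold not in shortlist:
        return "retrieval"  # the right agent was never offered to the router
    return "confusion"      # the right agent was offered but not chosen


def routing_f1(gold_sets, predicted_sets):
    """Micro-averaged F1 over per-request sets of agents."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, predicted_sets):
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical evaluation records, one per routed request.
records = [
    {"gold": "calendar_agent", "shortlist": ["calendar_agent", "email_agent"],
     "predicted": "calendar_agent"},
    {"gold": "expense_agent", "shortlist": ["hr_agent", "email_agent"],
     "predicted": "hr_agent"},
    {"gold": "hr_agent", "shortlist": ["hr_agent", "expense_agent"],
     "predicted": "expense_agent"},
    {"gold": "email_agent", "shortlist": ["email_agent", "calendar_agent"],
     "predicted": "email_agent"},
]

counts = Counter(
    classify_error(r["gold"], r["shortlist"], r["predicted"]) for r in records
)
f1 = routing_f1(
    [{r["gold"]} for r in records],
    [{r["predicted"]} for r in records],
)
print(dict(counts), f1)
```

Splitting the error counts this way indicates where to spend effort: a high retrieval share points at the shortlisting stage, while a high confusion share points at the agent descriptions or the final routing prompt.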
Topics
- LLM Agent Routing
- Enterprise AI Assistants
- Embedding Shortlisting
- Routing Accuracy
- Tool Catalogs
- Performance Degradation
Best for: AI Architect, NLP Engineer, AI Scientist, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.